What is a hashing?
In machine learning, a mechanism for bucketing categorical data, particularly when the number of categories is large, but the number of categories actually appearing in the dataset is comparatively small.
hashing explained in plain English
In machine learning, a mechanism for bucketing categorical data, particularly when the number of categories is large, but the number of categories actually appearing in the dataset is comparatively small. For example, Earth is home to about 73,000 tree species. You could represent each of the 73,000 tree species in 73,000 separate categorical buckets. Alternatively, if only 200 of those tree species actually appear in a dataset, you could use hashing to divide tree species into perhaps 500 buckets. A single bucket could contain multiple tree species. For example, hashing could place baobab and red maple—two genetically dissimilar species—into the same bucket. Regardless, hashing is still a good way to map large categorical sets into the selected number of buckets. Hashing turns a categorical feature having a large number of possible values into a much smaller number of values by grouping values in a deterministic way. See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.
Example
Practitioners refer to hashing when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- A/B testing
A statistical way of comparing two (or more) techniques—the A and the B.
- ablation
A technique for evaluating the importance of a feature or component by temporarily removing it from a model.
- accuracy
The number of correct classification predictions divided by the total number of predictions.
- activation function
A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.
- active learning
A training approach in which the algorithm chooses some of the data it learns from.
- adaptation
Synonym for tuning or fine-tuning.
- agglomerative clustering
See hierarchical clustering.
- anomaly detection
The process of identifying outliers.
- area under the PR curve
See PR AUC (Area under the PR Curve).
- area under the ROC curve
See AUC (Area under the ROC curve).