What is a class-imbalanced dataset?
A dataset for a classification problem in which the total number of labels of each class differs significantly.
Class-imbalanced dataset explained in plain English
A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

- 1,000,000 negative labels
- 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is class-balanced because the ratio of negative labels to positive labels is relatively close to 1:

- 517 negative labels
- 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is class-imbalanced because one label has far more examples than the other two:

- 1,000,000 labels with class "green"
- 200 labels with class "purple"
- 350 labels with class "orange"

Training on class-imbalanced datasets can present special challenges. See Imbalanced datasets in Machine Learning Crash Course for details.

See also entropy, majority class, and minority class.
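The ratios quoted in the examples above can be checked with a few lines of Python. This is a minimal sketch, not a standard API: the helper names `class_counts` and `imbalance_ratio` are hypothetical, and "most common class count divided by least common class count" is just one simple way to summarize imbalance. The glossary does not define a cutoff for "differs significantly", so the sketch only reports the ratio and does not classify a dataset as balanced or imbalanced.

```python
from collections import Counter


def class_counts(labels):
    """Count how many labels belong to each class (hypothetical helper)."""
    return Counter(labels)


def imbalance_ratio(counts):
    """Return the largest class count divided by the smallest class count.

    `counts` maps each class name to its number of labels. A value close
    to 1 suggests a class-balanced dataset; a very large value suggests
    a class-imbalanced one. No threshold is applied here.
    """
    return max(counts.values()) / min(counts.values())


# Label counts taken from the three examples in the text.
binary_imbalanced = {"negative": 1_000_000, "positive": 10}
binary_balanced = {"negative": 517, "positive": 483}
multi_class = {"green": 1_000_000, "purple": 200, "orange": 350}

for name, counts in [
    ("binary, imbalanced", binary_imbalanced),
    ("binary, balanced", binary_balanced),
    ("multi-class", multi_class),
]:
    print(f"{name}: {imbalance_ratio(counts):,.2f} to 1")
```

When the raw labels are available as a list rather than as precomputed counts, `imbalance_ratio(class_counts(labels))` gives the same figure.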
Example
Practitioners refer to class-imbalanced datasets when building, training, or evaluating machine learning systems. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- A/B testing
A statistical way of comparing two (or more) techniques—the A and the B.
- ablation
A technique for evaluating the importance of a feature or component by temporarily removing it from a model.
- accuracy
The number of correct classification predictions divided by the total number of predictions.
- activation function
A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.
- active learning
A training approach in which the algorithm chooses some of the data it learns from.
- adaptation
Synonym for tuning or fine-tuning.
- agglomerative clustering
See hierarchical clustering.
- anomaly detection
The process of identifying outliers.
- area under the PR curve
See PR AUC (Area under the PR Curve).
- area under the ROC curve
See AUC (Area under the ROC curve).