AIExplainer
Machine Learning Intermediate

class-imbalanced dataset

A dataset for a classification problem in which the total number of labels of each class differs significantly.

For example, consider a binary classification dataset whose two labels are divided as follows:

- 1,000,000 negative labels
- 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is class-balanced because the ratio of negative labels to positive labels is relatively close to 1:

- 517 negative labels
- 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is class-imbalanced because one label has far more examples than the other two:

- 1,000,000 labels with class "green"
- 200 labels with class "purple"
- 350 labels with class "orange"

Training on class-imbalanced datasets can present special challenges. See Imbalanced datasets in Machine Learning Crash Course for details.

See also entropy, majority class, and minority class.
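The ratios quoted in the examples above can be checked with a few lines of Python. This is a minimal sketch, not a standard metric: the helper name `imbalance_ratio` is hypothetical, and comparing the largest class to the smallest is only one reasonable way to summarize imbalance. It also assumes every class has at least one label.

```python
from collections import Counter


def imbalance_ratio(class_counts):
    """Ratio of the largest class count to the smallest class count.

    Hypothetical helper for illustration; assumes every count is > 0.
    """
    if len(class_counts) < 2:
        raise ValueError("need at least two classes")
    return max(class_counts.values()) / min(class_counts.values())


# Label counts copied from the examples in this entry.
binary_imbalanced = {"negative": 1_000_000, "positive": 10}
binary_balanced = {"negative": 517, "positive": 483}
multi_class = {"green": 1_000_000, "purple": 200, "orange": 350}

# Counter builds the same class -> count mapping from a raw list of labels.
from_labels = Counter(["negative", "negative", "negative", "positive"])

for name, counts in [
    ("binary, imbalanced", binary_imbalanced),
    ("binary, balanced", binary_balanced),
    ("multi-class", multi_class),
    ("from raw labels", from_labels),
]:
    print(f"{name}: {imbalance_ratio(counts):,.2f} to 1")
```

For the first dataset the helper returns 100,000, matching the ratio stated in the text; for the balanced dataset it returns a value only slightly above 1. Where to draw the line between "balanced" and "imbalanced" is a judgment call that this sketch deliberately leaves out.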

Practitioners use the term class-imbalanced dataset when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.