Machine Learning Mathematics Intermediate

gini impurity

A metric similar to entropy.

Plain English Explanation

Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. No universally accepted equivalent term exists for the metric derived from gini impurity; however, this unnamed metric is just as important as information gain. Gini impurity is also called gini index, or simply gini.

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution, if that piece of data were labeled at random according to the distribution of labels in the set. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

I = 1 - (p^2 + q^2) = 1 - (p^2 + (1 - p)^2)

where:

- I is the gini impurity.
- p is the fraction of "1" examples.
- q is the fraction of "0" examples. Note that q = 1 - p.

For example, consider the following dataset:

- 100 labels (0.25 of the dataset) contain the value "1"
- 300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

- p = 0.25
- q = 0.75
- I = 1 - (0.25^2 + 0.75^2) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced set of labels (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced set of labels would have a gini impurity close to 0.0.
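The binary formula above can be sketched in a few lines of Python. This is a minimal illustration, not code from any particular library; the function name gini_impurity and the list-of-labels input format are assumptions made for the example.

```python
def gini_impurity(labels):
    """Return the gini impurity of a non-empty sequence of binary labels (0 or 1).

    Hypothetical helper for illustration; assumes every label is 0 or 1.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    # p is the fraction of "1" examples, q the fraction of "0" examples.
    p = sum(1 for y in labels if y == 1) / len(labels)
    q = 1 - p
    return 1 - (p**2 + q**2)


# The dataset from the example: 100 "1" labels and 300 "0" labels.
example = [1] * 100 + [0] * 300
print(gini_impurity(example))                 # 0.375

# A perfectly balanced set and a pure set, for comparison.
print(gini_impurity([0] * 200 + [1] * 200))   # 0.5
print(gini_impurity([0] * 400))               # 0.0
```

The function works on the label fractions only, so the result depends on the proportion of each class, not on the size of the dataset.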

How is it used?

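Practitioners use gini impurity mainly when training classification decision trees: a splitter scores each candidate condition by how much it lowers the impurity of the resulting subsets compared with the original set. The term also appears in research papers, product documentation, and technical discussions of tree-based models.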
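As a rough sketch of how a splitter might derive a score from gini impurity, the Python below computes the weighted impurity decrease of a candidate split: the impurity of the parent set minus the size-weighted average impurity of the two child sets. This is one common definition, assumed here for illustration; the names gini_impurity and gini_decrease are hypothetical, and real decision tree libraries may differ in detail.

```python
def gini_impurity(labels):
    """Gini impurity of a non-empty sequence of binary labels (0 or 1)."""
    p = sum(labels) / len(labels)  # fraction of "1" examples
    return 1 - (p**2 + (1 - p)**2)


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children.

    Assumes left and right together contain exactly the parent's labels
    and that both children are non-empty.
    """
    if len(left) + len(right) != len(parent):
        raise ValueError("children must partition the parent")
    if not left or not right:
        raise ValueError("both children must be non-empty")
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Parent: 100 "1" labels and 300 "0" labels (gini impurity 0.375).
parent = [1] * 100 + [0] * 300
# A candidate condition sends all the "1"s and 100 "0"s left, the rest right.
left = [1] * 100 + [0] * 100   # balanced, impurity 0.5
right = [0] * 200              # pure, impurity 0.0
print(gini_decrease(parent, left, right))  # 0.125
```

Under this definition, a larger decrease means the condition separates the classes better, so a splitter would prefer the candidate with the largest value.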