Machine Learning Mathematics Intermediate

one-hot encoding

Representing categorical data as a vector in which: - One element is set to 1.

Plain English Explanation

Representing categorical data as a vector in which: - One element is set to 1. - All other elements are set to 0. One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named`Scandinavia` has five possible values: - "Denmark" - "Sweden" - "Norway" - "Finland" - "Iceland" One-hot encoding could represent each of the five values as follows:

0 0 | 1 0 | 0 0 | 0 0 | 0 1 | Thanks to one-hot encoding, a model can learn different connections based on each of the five countries. Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation: - "Denmark" is 0 - "Sweden" is 1 - "Norway" is 2 - "Finland" is 3 - "Iceland" is 4 With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions. See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

How is it used?

Practitioners refer to one-hot encoding when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.