AIExplainer

What is a sparse representation?

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named `species` identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single `1` (to represent the particular tree species in that example) and 35 `0`s (to represent the 35 tree species not in that example). So, if `maple` is at position 24 (counting from 0), the one-hot representation of `maple` might look something like the following: a 36-element vector that holds `0` in positions 0 through 23, `1` in position 24, and `0` in positions 25 through 35.

Alternatively, sparse representation would simply identify the position of the particular species. If `maple` is at position 24, then the sparse representation of `maple` would simply be `24`.

Notice that the sparse representation is much more compact than the one-hot representation.
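The comparison above can be sketched in a few lines of Python. This is only an illustration: the size of 36 and position 24 come from the tree-species example, positions are assumed to be counted from 0, and the helper names `one_hot` and `to_sparse` are hypothetical rather than part of any library.

```python
NUM_SPECIES = 36      # size of the categorical feature in the example
MAPLE_POSITION = 24   # position assumed for maple, counting from 0


def one_hot(position, size):
    """Build a one-hot list: 1 at `position`, 0 everywhere else."""
    vector = [0] * size
    vector[position] = 1
    return vector


def to_sparse(vector):
    """Return only the positions of the nonzero elements of `vector`."""
    return [i for i, value in enumerate(vector) if value != 0]


maple_one_hot = one_hot(MAPLE_POSITION, NUM_SPECIES)
maple_sparse = to_sparse(maple_one_hot)

print(len(maple_one_hot))  # 36
print(maple_sparse)        # [24]
```

The one-hot list stores 36 numbers to convey what the sparse form conveys with one.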

Suppose each example in your model must represent the words (but not the order of those words) in an English sentence. English consists of about 170,000 words, so English is a categorical feature with about 170,000 elements. Most English sentences use only a tiny fraction of those 170,000 words, so the set of words in a single example is almost certainly going to be sparse data. Consider the following sentence:

My dog is a great dog

You could use a variant of a one-hot vector to represent the words in this sentence. In this variant, multiple cells in the vector can contain a nonzero value, and a cell can contain an integer other than one. Although the words "my", "is", "a", and "great" appear only once in the sentence, the word "dog" appears twice. Using this variant of one-hot vectors to represent the words in this sentence yields a 170,000-element vector in which four cells hold `1`, one cell holds `2`, and every other cell holds `0`.

A sparse representation of the same sentence would simply list the positions of those five nonzero cells along with their values.
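As a rough sketch of the word-count case, the following Python builds a sparse representation and expands it back into the full vector. It assumes a sentence such as "my dog is a great dog", which matches the word counts described above. The vocabulary positions are invented for illustration (a real vocabulary would assign its own), and `sparse_counts` and `to_dense` are hypothetical helper names.

```python
from collections import Counter

VOCAB_SIZE = 170_000  # approximate vocabulary size from the text

# Hypothetical positions for the five distinct words; these numbers are
# made up for illustration and do not come from a real vocabulary.
word_to_position = {
    "a": 0,
    "dog": 45_770,
    "great": 58_906,
    "is": 91_520,
    "my": 100_000,
}


def sparse_counts(sentence, positions):
    """Map the position of each distinct word to how often it occurs."""
    counts = Counter(sentence.lower().split())
    return {positions[word]: n for word, n in counts.items()}


def to_dense(sparse, size):
    """Expand a {position: count} mapping into a full-length list."""
    vector = [0] * size
    for position, count in sparse.items():
        vector[position] = count
    return vector


sentence = "my dog is a great dog"
sparse = sparse_counts(sentence, word_to_position)
dense = to_dense(sparse, VOCAB_SIZE)

print(sorted(sparse.items()))
print(len(dense), sum(1 for v in dense if v != 0))
```

The mapping holds five entries, while the equivalent dense list holds 170,000 cells, all but five of them zero.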

The term "sparse representation" confuses a lot of people because a sparse representation is not itself a sparse vector. Rather, a sparse representation is actually a dense representation of a sparse vector. The synonym "index representation" is a little clearer than "sparse representation."

See Working with categorical data in Machine Learning Crash Course for more information.