What is a bag of words?
A representation of the words in a phrase or passage, irrespective of order.
Bag of words explained in plain English
A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:
- the dog jumps
- jumps the dog
- dog jumps the
Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase "the dog jumps" is mapped into a feature vector with non-zero values at the three indexes corresponding to the words "the", "dog", and "jumps". The non-zero value can be any of the following:
- A 1 to indicate the presence of a word.
- A count of the number of times a word appears in the bag. For example, if the phrase were "the maroon dog is a dog with maroon fur", then both "maroon" and "dog" would be represented as 2, while the other words would be represented as 1.
- Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
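The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `bag_of_words`, the `mode` parameter, the whitespace tokenization, and the alphabetical vocabulary ordering are all assumptions made for the example. The sparse vector is stored as a dictionary from vocabulary index to non-zero value, and the "log" mode uses log(1 + count), which is one possible choice of logarithmic weighting.

```python
import math
from collections import Counter

# Hypothetical vocabulary for this example; each word gets one index.
vocabulary = sorted(["a", "dog", "fur", "is", "jumps", "maroon", "the", "with"])
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(text, word_to_index, mode="count"):
    """Return a sparse vector {index: value} that ignores word order.

    mode="binary": 1 if the word is present.
    mode="count":  number of times the word appears.
    mode="log":    log(1 + count), an assumed variant of log weighting.
    """
    # Naive tokenization (assumption): lowercase and split on whitespace.
    counts = Counter(text.lower().split())
    sparse = {}
    for word, n in counts.items():
        if word not in word_to_index:
            continue  # words outside the vocabulary are dropped
        if mode == "binary":
            value = 1
        elif mode == "log":
            value = math.log(1 + n)
        else:
            value = n
        sparse[word_to_index[word]] = value
    return sparse


# The three orderings yield the same representation.
a = bag_of_words("the dog jumps", word_to_index)
b = bag_of_words("jumps the dog", word_to_index)
c = bag_of_words("dog jumps the", word_to_index)
print(a == b == c)

# Count mode: "maroon" and "dog" map to 2, every other word to 1.
counts = bag_of_words("the maroon dog is a dog with maroon fur", word_to_index)
print({vocabulary[i]: v for i, v in counts.items()})
```

Only the indexes of words that actually occur are stored, which is what makes the vector sparse: with a realistic vocabulary of tens of thousands of words, a short phrase still produces just a handful of entries.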
Example
Practitioners refer to bag of words when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- encoder
In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.
- language model
A model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens.
- word embedding
Representing each word in a word set within an embedding vector; that is, representing each word as a vector of floating-point values between 0.0 and 1.0.
- automatic evaluation
Using software to judge the quality of a model's output.
- BERT
A model architecture for text representation.
- bidirectional language model
A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.
- bigram
An N-gram in which N=2.
- BLEU
A metric between 0.0 and 1.0 for evaluating machine translations.
- BLEURT
A metric for evaluating machine translations from one language to another, particularly to and from English.
- Character N-gram F-score
A metric to evaluate machine translation models.