Everything starts with a simple question: how can a machine understand text? When you train a model to distinguish spam from normal messages or to determine the sentiment of reviews, something interesting happens under the hood. The model needs to somehow convert letters and words into numbers, because neural networks only work with numbers.



The first naive approach is to just assign a number to each unique word. Good = 6, bad = 26, awesome = 27. It seems logical, but here’s the problem: the numbers 26 and 27 are close to each other, so the model might think that bad and awesome are similar. In reality, however, awesome and good are semantically closer. That’s the catch.

We tried One-Hot Encoding — giving each word a vector as long as the entire vocabulary, where a single element is 1, marking the specific word, and the rest are zeros. The spurious-ordering problem disappeared, but a new one appeared: if the vocabulary has 20,000 words, each vector is 20,000-dimensional. That consumes a huge amount of memory, and the model still doesn't grasp the semantic relationships between words.
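A minimal one-hot sketch with a toy three-word vocabulary (a real one would have thousands of entries):

```python
def one_hot(word, vocab):
    """Vector of len(vocab) zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

vocab = {"good": 0, "bad": 1, "awesome": 2}
print(one_hot("awesome", vocab))  # [0, 0, 1]
```

With a 20,000-word vocabulary, every one of these vectors carries 19,999 zeros — that's the memory problem in miniature.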

Then came Bag of Words and N-grams — counting how many times each word (or short word sequence) appears in the text. N-grams add a bit of local context, but again — large sparse vectors, and the model doesn't understand deeper connections between words. For example, in the sentence "The librarian loves books," the words librarian and books are not adjacent, so N-grams won't catch that they are related.
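Counting words and n-grams takes only a few lines (`bag_of_ngrams` is a hypothetical helper name for this sketch):

```python
from collections import Counter

def bag_of_ngrams(text, n=1):
    """Count n-gram occurrences in a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

print(bag_of_ngrams("The librarian loves books"))        # unigram counts
print(bag_of_ngrams("The librarian loves books", n=2))   # bigram counts
```

Note that the bigram "librarian books" never appears — the two words are separated by "loves", so this representation can't see their relationship.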

This is where proper encoding through embeddings comes to the rescue. The idea is that similar words should be close to each other in the vector space. Imagine a two-dimensional plane: one axis is the size of the animal, the other is how dangerous it is. Tiger and lion will sit close together (large and dangerous), while hamster lands far away (small and safe). That's what an embedding is — a dense vector that captures the meaning of a word in an n-dimensional space.
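Here is the animal plane as toy 2-D vectors — the coordinates are hand-picked for illustration, not learned:

```python
import math

# Toy 2-D embeddings: axis 0 = animal size, axis 1 = danger level
# (made-up values purely for illustration).
animals = {"tiger": (0.9, 0.95), "lion": (0.85, 0.9), "hamster": (0.1, 0.05)}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

print(distance(animals["tiger"], animals["lion"]))     # small: similar animals
print(distance(animals["tiger"], animals["hamster"]))  # large: very different
```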

The coolest part: with such vectors, you can do math. Take the vector for "son," subtract "man," and add "woman" — you get a vector close to "daughter." Or: Madrid + Germany - Spain = Berlin. It works because the model captures relationships between concepts.
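You can replay the son/man/woman arithmetic with hand-crafted toy vectors, where dimension 0 stands for gender and dimension 1 for generation (real embeddings learn such directions implicitly, across hundreds of dimensions):

```python
# Hand-crafted toy embeddings: dim 0 ≈ gender, dim 1 ≈ generation.
vecs = {
    "man":      [ 1.0, 1.0],
    "woman":    [-1.0, 1.0],
    "son":      [ 1.0, 0.0],
    "daughter": [-1.0, 0.0],
}

# son - man + woman, element by element
result = [s - m + w for s, m, w in zip(vecs["son"], vecs["man"], vecs["woman"])]
print(result)                      # [-1.0, 0.0]
print(result == vecs["daughter"])  # True
```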

How are these embeddings trained? Google proposed Word2Vec with two approaches. In CBOW, you take context words and predict the central word. Skip-Gram does the opposite — from the central word, predict its neighbors. Both techniques work well for training word embeddings.
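Skip-Gram's training data is just (center word, context word) pairs sliced out of the text. A rough sketch of how such pairs are generated (the helper name is made up):

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) training pairs as used by Skip-Gram."""
    pairs = []
    for i, center in enumerate(tokens):
        # every token within `window` positions of the center is context
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "librarian", "loves", "books"], window=1))
```

With `window=2` or more, librarian and books from the earlier example do end up as a training pair — which is exactly how Word2Vec picks up that they are related even when they aren't adjacent.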

In modern models like GPT or BERT, it's a bit different. The embedding layer isn't pre-trained separately but trained together with the model itself. First, the text is split into tokens, then an embedding layer maps each token to a vector. The weights of this layer are trainable parameters that learn to represent words in the desired space. In a model like GPT, these embeddings then pass through the decoder blocks and reach the output layer, which predicts probabilities for the next token (BERT uses encoder blocks instead).
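An embedding layer really is just a trainable lookup table. A toy sketch with made-up sizes (real models use vocabularies of tens of thousands of tokens and hundreds of dimensions):

```python
import random

# The embedding layer as a lookup table: vocab_size rows, embed_dim columns.
# Tiny made-up sizes; the values start random and are updated during training.
vocab_size, embed_dim = 10, 4
random.seed(0)
embedding = [[random.gauss(0, 0.02) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

token_ids = [3, 1, 7]                 # hypothetical output of a tokenizer
vectors = [embedding[t] for t in token_ids]  # the "forward pass" is a lookup
print(len(vectors), len(vectors[0]))  # 3 4
```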

One small detail — positional encoding. Transformers process all tokens in parallel, unlike RNNs, so the model needs some way to know the order of words. The fix is to add a positional vector to the embedding vector. The result is a combo: the meaning of the word plus information about its position in the text.
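Here is a sketch of the sinusoidal variant of positional encoding from the original Transformer paper, on toy 4-dimensional vectors:

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal positional vector: sin on even dims, cos on odd dims."""
    return [math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

embedding = [0.5, -0.2, 0.1, 0.7]        # toy token embedding
pos_vec = positional_encoding(2, 4)      # vector for position 2
combined = [e + p for e, p in zip(embedding, pos_vec)]  # meaning + position
```

Each position gets a unique pattern of sines and cosines, so the same word at position 2 and position 17 enters the model with slightly different vectors.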

After positional encoding, the embedding goes into the attention mechanism — the core of all large language models. The embedding captures the semantics of individual words, but the context is understood through attention. That’s why the word "key" in different contexts will have different contextual representations.
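A bare-bones scaled dot-product attention in plain Python — no batching, no masking, single head, just the core formula softmax(Q·Kᵀ/√d)·V:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    output = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # softmax: attention distribution
        # output = weighted mix of value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(len(values[0]))])
    return output
```

Each output row is a weighted mix of all value vectors, with weights decided by how well the query matches each key — which is why the same word "key" ends up with different contextual representations in different sentences.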

Thus, by combining simple ideas — tokenization, word counting, proper encoding via embeddings — you gradually arrive at transformers and ChatGPT. Embeddings are everywhere now: in recommendation systems, in image similarity search, and at the core of all modern LLMs. If you want to truly understand NLP, you need to understand how CBOW, Skip-Gram, and the transformer architecture work. It's the foundation from which everything begins.