Everything starts with a simple question: how can a machine understand text? When you train a model to distinguish spam from normal messages or to determine the sentiment of reviews, something interesting happens under the hood. The model has to convert letters and words into numbers, because neural networks work only with numbers.
The first naive approach is to simply assign a number to each unique word: good = 6, bad = 26, awesome = 27. It seems logical, but here's the problem: the numbers 26 and 27 are close to each other, so the model may conclude that bad and awesome are similar, even though awesome is semantically much closer to good. That's the catch.
The next attempt was one-hot encoding: give each word a vector as long as the entire vocabulary, with a single 1 marking that specific word and zeros everywhere else. The false ordering disappeared, but a new problem emerged: with a 20,000-word vocabulary, every vector has 20,000 dimensions. That consumes a huge amount of memory, and the model still can't see any semantic relationships between words.
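Here is a minimal one-hot sketch (toy three-word vocabulary). Note the second problem it demonstrates: the dot product of any two different one-hot vectors is exactly zero, so every pair of words looks equally unrelated.

```python
def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

vocab = {"good": 0, "bad": 1, "awesome": 2}
print(one_hot("bad", vocab))  # [0, 1, 0]

# Any two distinct words are orthogonal: no notion of similarity at all.
dot = sum(a * b for a, b in zip(one_hot("good", vocab), one_hot("awesome", vocab)))
print(dot)  # 0
```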
Then came Bag of Words and N-grams: counting how many times each word (or short sequence of words) appears in the text. This adds some context, but again we get large sparse vectors, and the model still misses deeper connections between words. For example, in the sentence "The librarian loves books," the words librarian and books are not adjacent, so bigrams won't capture that they are related.
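A small stdlib-only sketch of both ideas on the example sentence; it shows exactly why the librarian/books link is invisible to bigrams:

```python
from collections import Counter

def bag_of_words(text):
    """Count how many times each word occurs."""
    return Counter(text.lower().split())

def bigrams(text):
    """All adjacent word pairs (2-grams)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

text = "The librarian loves books"
print(bag_of_words(text))
print(bigrams(text))
# [('the', 'librarian'), ('librarian', 'loves'), ('loves', 'books')]
# "librarian" and "books" never share a bigram, so their relation is lost.
```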
This is where proper encoding through embeddings comes to the rescue. The idea: similar words should sit close to each other in a vector space. Imagine a two-dimensional plane with the size of an animal on one axis and its danger on the other. Tiger and lion land close together (large and dangerous), while hamster sits far away (small and safe). That's what an embedding is: a dense vector that captures the meaning of a word in an n-dimensional space.
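The animal example can be sketched directly; the 2-D coordinates below are invented for illustration (real embeddings are learned and have hundreds of dimensions), but the geometry is the point:

```python
def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical 2-D embeddings on made-up axes: (size, danger).
emb = {
    "tiger":   (0.90, 0.95),
    "lion":    (0.85, 0.90),
    "hamster": (0.05, 0.02),
}

print(dist(emb["tiger"], emb["lion"]))     # ~0.07: close in space
print(dist(emb["tiger"], emb["hamster"]))  # ~1.26: far apart
```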
The coolest part: with such vectors, you can do math. Take the vector for "son," subtract "man," and add "woman" — you get a vector close to "daughter." Or: Madrid + Germany - Spain = Berlin. It works because the model captures relationships between concepts.
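The son/man/woman analogy can be reproduced with toy vectors. The two axes and all coordinates here are fabricated (a maleness axis and a youth axis) purely to show the mechanics; real models learn such directions from data:

```python
# Hypothetical toy embeddings on made-up axes: (maleness, youth).
emb = {
    "man":      [1.0, 0.0],
    "woman":    [0.0, 0.0],
    "son":      [1.0, 1.0],
    "daughter": [0.0, 1.0],
}

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# son - man + woman
result = [s - m + w for s, m, w in zip(emb["son"], emb["man"], emb["woman"])]

# The nearest word to the resulting vector:
nearest = min(emb, key=lambda word: dist(emb[word], result))
print(nearest)  # daughter
```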
How are these embeddings trained? Researchers at Google proposed Word2Vec with two approaches. In CBOW, you take the context words and predict the central word. Skip-Gram does the opposite: from the central word, predict its neighbors. Both techniques work well for training word embeddings.
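The difference between the two is easiest to see in the training pairs each one generates. A minimal sketch (window size and sentence chosen arbitrarily), without the actual gradient training:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: (center word -> each neighbor) prediction pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: (context words -> center word) prediction pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

tokens = "the librarian loves books".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

In a real Word2Vec implementation these pairs feed a shallow network whose hidden weights become the embeddings.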
In modern models like GPT or BERT, it works a bit differently. The embedding layer isn't pre-trained separately; it's trained together with the model itself. The text is first split into tokens, and then an embedding layer, essentially a trainable lookup table, maps each token to a vector. The weights of this layer are ordinary trainable parameters that learn to place tokens in a useful space. These embeddings then pass through a stack of transformer blocks (decoder blocks in GPT, encoder blocks in BERT) and, in GPT-style models, reach the output layer that predicts probabilities for the next token.
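Stripped of frameworks, the embedding layer really is just a matrix with one row per token ID. A numpy sketch with toy sizes (real models use vocabularies of tens of thousands of tokens and hundreds of dimensions); the token IDs are hypothetical tokenizer output:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4  # toy sizes for illustration

# The embedding layer: a trainable matrix, one row per token ID.
W_embed = rng.normal(size=(vocab_size, d_model))

token_ids = [3, 1, 7]     # hypothetical IDs produced by a tokenizer
x = W_embed[token_ids]    # lookup: each ID selects its row
print(x.shape)            # (3, 4): one d_model-sized vector per token
```

During training, gradients flow back into exactly the rows that were looked up, which is how the representations improve together with the rest of the network.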
One small but important detail is positional encoding. Unlike RNNs, transformers process all tokens in parallel, so there needs to be a way to tell the model the order of the words. A positional vector is added to each embedding vector. The result is a combination of the word's meaning plus information about its position in the text.
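One classic choice (the sinusoidal scheme from the original Transformer paper; learned positional embeddings are another common option) can be sketched like this:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=5, d_model=8)
print(pe.shape)  # (5, 8)
# The model's input is then: token_embeddings + pe
```

Each position gets a unique, smoothly varying pattern, so the same word at different positions produces a different input vector.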
After positional encoding, the embeddings go into the attention mechanism, the core of all large language models. The embedding captures the semantics of an individual word, but context is understood through attention. That's why the word "key" gets a different contextual representation in different sentences.
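The core computation is scaled dot-product attention. A single-head numpy sketch with random toy weights (in a real model the projection matrices Wq, Wk, Wv are learned, and there are many heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over each row:
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of all value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 token embeddings (positions already added)

# Hypothetical projection matrices; learned parameters in a real model.
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8): every token now carries context from all tokens
```

Because each output row mixes information from every token, the representation of "key" ends up depending on the words around it.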
Thus, by combining simple ideas (tokenization, word counting, proper encoding via embeddings) you gradually arrive at transformers and ChatGPT. Embeddings are everywhere now: in recommendation systems, in image similarity search, and at the core of every modern LLM. If you want to truly understand NLP, you need to understand how CBOW, Skip-Gram, and the transformer architecture work. That's the foundation from which everything begins.