Embeddings

Definition

Embeddings are a machine learning (ML) technique that represents data as numerical vectors. Word embeddings, the best-known kind, represent words this way. Large language models (LLMs) and search engines use embeddings to understand the relationships between words, images, audio and other forms of data. Embeddings map high-dimensional data into a lower-dimensional vector space, which helps machines capture the meaning and context of content.

With embeddings, LLMs and search engines don’t have to rely on traditional keyword matching. Embeddings act as a numerical map that helps machines make sense of language. They are dense vector representations of text, images, audio and objects. Items with similar meanings have similar vectors and are clustered together in this multidimensional space. For example, the words “man” and “woman” would be located close together in this vector space because they are semantically related.

Why are embeddings important?

Embeddings help machines identify the results that best fit the context of a search. They prioritise semantic relevance over exact keyword matches, so results reflect what is relevant to the user. They also allow LLMs and search engines to handle synonyms, paraphrasing, natural language and other nuances.

Embeddings are important because, as LLMs, natural language processing (NLP) and search engines evolve, they allow machines to understand data that would otherwise be too complex for them, including text, images, audio and other high-dimensional data. Embeddings can increase discoverability. They help search engines and LLMs understand the intent, meaning and relationships of the content being searched, which helps prioritise contextually relevant results. For SEO, it helps to cover terms and topics on your website that are relevant or related to each other. This is particularly useful for users who search with conversational words or sentences. Embeddings also support enterprise knowledge retrieval and generative AI integration, primarily through RAG (Retrieval-Augmented Generation).

How embeddings work

Embeddings place words in a mathematical space where distance reflects relatedness. Words that are similar to each other correspond to points that are close together, and words that are different correspond to points that are far apart. This makes it possible to measure how closely two pieces of text are related. Word embeddings capture word similarity and other properties of language. This gives search engines and LLMs a deeper understanding of user intent and content meaning, including context, synonyms and relationships between words. Because embeddings are numerical, machine learning systems can measure semantic similarity between pieces of data. Semantic search is the process of retrieving information by comparing these vectors once the query and the content have been encoded, which lets it return the most relevant results more effectively.
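As a minimal sketch of "close together" versus "far apart", the snippet below compares toy vectors with cosine similarity. The three-dimensional vectors are hand-picked for illustration and are not produced by any real embedding model; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, chosen by hand for this example only.
toy_vectors = {
    "man": [0.9, 0.8, 0.1],
    "woman": [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

sim_related = cosine_similarity(toy_vectors["man"], toy_vectors["woman"])
sim_unrelated = cosine_similarity(toy_vectors["man"], toy_vectors["banana"])
print(f"man vs woman:  {sim_related:.3f}")
print(f"man vs banana: {sim_unrelated:.3f}")
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means they have little in common.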

Vector embeddings

Vector embeddings are an integral concept in machine learning and natural language processing, and they are heavily used in today’s large language models. A vector embedding is a sequence of numbers that serves as a numerical representation of a word, sentence, image, audio clip or object, so that NLP and ML systems can work with the meaning and context of inputs. In other words, vector embeddings turn data into numbers so that machines can better understand meaning and relationships. Each embedding is a point in a multidimensional space, and items that are most similar in context are clustered together.

How do search engines use embeddings?

Search engines convert both queries and content into embeddings. The query embedding is then compared with the content embeddings in order to deliver the most relevant result. Semantic search uses text embeddings to turn words and sentences into vectors. When a user searches for information, semantic search finds the candidate vectors that are most similar to the vector for the query. Semantic search engines may use several embeddings. Dot product and cosine similarity are two common ways to measure the similarity between texts. The dot product multiplies the corresponding components of two vectors and sums the results, so it is larger when the vectors are aligned. Cosine similarity measures the angle between two vectors and ignores their length. K-nearest neighbour (kNN) search retrieves the k stored vectors that are closest to the query vector; in classification, kNN instead assigns a data point the most common label among its neighbours. Multilingual search also relies on semantic search: a multilingual embedding model supports many languages, so questions and answers can be matched across languages.
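The retrieval step can be sketched as a small nearest-neighbour search. This is an illustrative sketch only: the vectors are hand-written stand-ins for what an embedding model would produce, and the document ids and the `top_k` helper are hypothetical names.

```python
import math

def dot(a, b):
    """Multiply corresponding components and sum the results."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by both vector lengths, so only the angle matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical vectors; a real system would get them from an embedding model.
doc_vecs = {
    "budget-airlines-ibiza": [0.9, 0.1, 0.2],
    "ibiza-nightlife": [0.6, 0.7, 0.1],
    "garden-furniture": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.1]  # stands in for "cheap flights to Ibiza"

results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

This version scans every document, which is fine for a handful of pages. Large search systems usually rely on approximate nearest-neighbour indexes instead of an exhaustive scan.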

How SEO professionals use embeddings

Embeddings can be hugely beneficial for SEO professionals. By understanding how embeddings work, SEO professionals can apply the concepts behind them to improve websites and content.

Internal linking

Internal linking connects pages on a website that cover related content, so that users can move between pages on the topic they are searching for. Embeddings can help SEO professionals find the most similar pages within a website in order to link them together and make the most of the existing content. Strong internal links can help Google understand what your website is about and can strengthen the topical authority of your website.
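One way to find link candidates is to compare every page with every other page and keep the closest match. The sketch below assumes page embeddings already exist; the URLs, vectors and the `suggest_links` helper are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def suggest_links(page_vecs, per_page=1):
    """For each page, list the most similar other pages as link candidates."""
    suggestions = {}
    for url, vec in page_vecs.items():
        others = [
            (cosine(vec, other_vec), other_url)
            for other_url, other_vec in page_vecs.items()
            if other_url != url
        ]
        others.sort(reverse=True)
        suggestions[url] = [other_url for _, other_url in others[:per_page]]
    return suggestions

# Hypothetical page embeddings for illustration.
page_vecs = {
    "/seo-glossary/embeddings": [0.9, 0.2, 0.1],
    "/seo-glossary/semantic-search": [0.8, 0.3, 0.1],
    "/blog/office-party": [0.1, 0.1, 0.9],
}

link_suggestions = suggest_links(page_vecs)
for url, targets in link_suggestions.items():
    print(url, "->", targets)
```

Every page receives a suggestion, even one with no close match, so in practice a minimum similarity score would probably be applied before adding a link.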

Topic clustering

Topic clustering is the process of deciding what content belongs together. Embeddings can reveal the semantic relationships between pages on a website and group similar pages into clusters. If you have a large number of articles on your website, embeddings can help organise them semantically, which in turn may help your pages rank higher in Google’s search results. This is because well-grouped pages make it easier for Google to understand what each page is about and to relate it to similar pages, which is good for visibility and optimisation. Topic clustering can also identify semantically related topics that are missing from your website. This shows you where your content is strong and where it is weak, which helps you build a stronger website with better content.
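A very simple way to group pages is a greedy threshold rule: a page joins the first cluster whose seed page is similar enough, otherwise it starts a new cluster. This is a sketch under assumptions, not a standard clustering algorithm such as k-means; the URLs, vectors and the 0.8 threshold are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster_pages(page_vecs, threshold=0.8):
    """Greedy grouping: the first url in each cluster acts as its seed."""
    clusters = []
    for url, vec in page_vecs.items():
        for cluster in clusters:
            seed_vec = page_vecs[cluster[0]]
            if cosine(vec, seed_vec) >= threshold:
                cluster.append(url)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([url])
    return clusters

# Hypothetical two-dimensional page embeddings.
page_vecs = {
    "/running-shoes-guide": [0.9, 0.1],
    "/trail-shoes-review": [0.8, 0.2],
    "/marathon-nutrition": [0.1, 0.9],
    "/protein-recipes": [0.2, 0.8],
}

clusters = cluster_pages(page_vecs)
for cluster in clusters:
    print(cluster)
```

The result depends on the order of the pages and on the threshold, so it is best treated as a starting point for manual review.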

Content gap analysis

For SEO, it is important to optimise your website in every respect. Embeddings make content gap analysis efficient. Content gap analysis compares what your target users want to know with the content you currently have. It shows what you already cover and where the gaps in your content are. This helps you spot missed keyword opportunities, competitor advantages, strong and weak content, and what you could add to enhance your website.
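A gap check can be sketched by comparing each target query with every existing page and flagging queries whose best match is weak. The query texts, vectors, the `find_gaps` helper and the 0.7 threshold are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_gaps(query_vecs, page_vecs, threshold=0.7):
    """Return the queries that no existing page covers well."""
    gaps = []
    for query, q_vec in query_vecs.items():
        best = max(cosine(q_vec, p_vec) for p_vec in page_vecs.values())
        if best < threshold:
            gaps.append(query)
    return gaps

# Hypothetical embeddings of existing pages and of target queries.
page_vecs = {
    "/what-are-embeddings": [0.9, 0.1, 0.1],
    "/semantic-search": [0.7, 0.6, 0.1],
}
query_vecs = {
    "how do embeddings work": [0.8, 0.2, 0.1],
    "vector database pricing": [0.1, 0.2, 0.9],
}

gaps = find_gaps(query_vecs, page_vecs)
print(gaps)
```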

Search intent matching

Embeddings can help SEO professionals understand whether a web page is relevant to the intent behind a user’s query. This shows whether your website covers the right keywords and semantically related topics, which can help guide content and SEO strategies.

Keyword cannibalization detection

Keyword cannibalization detection is important for SEO because it finds internal competition problems. If several pages on your website have very similar content, this can hurt how the website performs. When multiple pages target similar keywords, Google may struggle to decide which page to rank, which can result in all of them ranking lower. Comparing page embeddings helps reveal these overlaps so they can be fixed and each page on your website can be optimised.
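Near-duplicate pages can be flagged by scoring every pair of pages and keeping the pairs above a high similarity threshold. The URLs, vectors, the `find_cannibalization` helper and the 0.95 cut-off are hypothetical; a suitable threshold depends on the embedding model used.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_cannibalization(page_vecs, threshold=0.95):
    """Return (url_a, url_b, score) for page pairs that look like near duplicates."""
    flagged = []
    for (url_a, vec_a), (url_b, vec_b) in combinations(page_vecs.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            flagged.append((url_a, url_b, round(score, 3)))
    return flagged

# Hypothetical page embeddings.
page_vecs = {
    "/best-running-shoes": [0.9, 0.3, 0.1],
    "/top-running-shoes": [0.88, 0.32, 0.12],
    "/running-shoe-care": [0.4, 0.8, 0.2],
}

flagged = find_cannibalization(page_vecs)
for url_a, url_b, score in flagged:
    print(url_a, "overlaps with", url_b, "score", score)
```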

Types of word embeddings

There are many different types of word embeddings used in NLP and ML. They capture the semantic relationships between words, and related techniques do the same for objects, audio and images. Different types of embeddings suit different applications. Traditional embeddings assign a single vector to each word, without taking context into account. Contextual embeddings give the same word different vectors based on its context. Both types are useful.

Traditional/static word embeddings

Traditional/static word embeddings assign one vector to each word. The context of the sentence is not taken into account, only the word itself. Each word is represented by a fixed numerical vector with multiple dimensions. Examples of traditional/static word embeddings are Word2Vec and GloVe.
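The fixed nature of static embeddings can be shown with a plain lookup table. The table below is hypothetical, with two-dimensional vectors invented for the example; real Word2Vec or GloVe vectors are learned from text and are much longer.

```python
# Hypothetical lookup table: one fixed vector per word.
static_table = {
    "i": [0.1, 0.3],
    "ate": [0.5, 0.2],
    "an": [0.0, 0.1],
    "apple": [0.4, 0.7],
    "my": [0.2, 0.2],
    "phone": [0.8, 0.1],
    "is": [0.1, 0.0],
    "made": [0.6, 0.3],
    "by": [0.0, 0.2],
}

def embed(sentence, table):
    """Map each known word to its fixed vector; unknown words are skipped."""
    return {word: table[word] for word in sentence.lower().split() if word in table}

fruit = embed("I ate an apple", static_table)
brand = embed("My phone is made by Apple", static_table)

# The lookup ignores the rest of the sentence, so both uses get the same vector.
same_vector = fruit["apple"] == brand["apple"]
print(same_vector)
```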

Word2Vec

Word2Vec, which was created by researchers at Google and released in 2013, revolutionised NLP. Word2Vec makes sense of words by focusing on their local context within a sentence. It can also capture analogies. It works in two ways: Continuous Bag of Words (CBOW) and Skip-Gram.

CBOW works by learning to predict a word from the surrounding words in a sentence. For example, given the sentence “the sun today is so ______”, CBOW would predict “hot” by looking at the words “the”, “sun”, “today”, “is” and “so”.

Skip-Gram works the other way round: given one word, it predicts the words likely to surround it. For example, given the word “king”, Skip-Gram would predict words that often appear near it, such as “queen”. Because “king” and “queen” appear in similar local contexts, they end up close to each other in the vector space.
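The difference between the two approaches is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples and does not train a model; the function names and the window size of 2 are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding words are the input, the middle word is the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-Gram: the middle word is the input, each nearby word is a target."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

tokens = "the sun today is so hot".split()

cbow_examples = cbow_pairs(tokens)
skipgram_examples = skipgram_pairs(tokens)

print(cbow_examples[-1])      # context words used to predict "hot"
print(skipgram_examples[:3])  # words predicted from the first centre word
```

With a window of 2, only the two nearest words on each side count as context, so “hot” is predicted from “is” and “so” rather than from the whole sentence.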

GloVe

GloVe was created by a team at Stanford and released in 2014 as an alternative to Word2Vec. The creators of GloVe wanted a system that would capture the global context of a word, as opposed to only the local context used by Word2Vec. GloVe was designed so that word vectors reflect how words are used across an entire body of text. To do this, it uses a global co-occurrence matrix. This is a table in which each cell records how often two words appear together in a text. For example, given a collection of articles, GloVe would count how often every word appears near every other word and use that information to create word vectors that reflect the wider context. The method relies heavily on matrix factorisation, which works from how often words co-occur across the whole text, whereas Word2Vec looks only at neighbouring words one window at a time. Words that appear in similar contexts will therefore have similar vector representations. For example, “astronaut” and “space” often appear in the same texts, so their vectors will be close together. Every word is turned into a dense vector that reflects both local contexts and global co-occurrence patterns, which helps capture the broader relationships between words.
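The co-occurrence table can be sketched with a simple counter. This is a simplified illustration: it counts plain co-occurrences within a window, leaves out any weighting by distance, and does not perform the vector-fitting step. The sentences and the window size are made up for the example.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` words of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + 1 + window)):
                # Sort the pair so ("a", "b") and ("b", "a") share one cell.
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts

sentences = [
    "the astronaut floated in space",
    "the astronaut returned from space",
    "space is cold",
]

counts = cooccurrence_counts(sentences, window=4)
print(counts[("astronaut", "space")])
print(counts[("cold", "floated")])
```

Pairs that never appear together simply have a count of zero, which is why a real co-occurrence matrix is mostly empty.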

Contextual word embeddings

Contextual word embeddings create a separate vector for each occurrence of a word, depending on its context. With traditional/static word embeddings, a word keeps the same fixed vector regardless of context. Contextual word embeddings change the vector for each word based on the surrounding words in a text. This helps NLP systems understand semantic relationships between words and detect nuances. Two well-known contextual word embedding models are BERT (Bidirectional Encoder Representations from Transformers) and ELMo (Embeddings from Language Models).

BERT

BERT was first introduced by researchers at Google in 2018. It reads an entire sentence at once, looking at the words both before and after each word, which allows it to understand the sentence’s meaning and context; this is called bidirectional context. Its input is built from layered embeddings: token, position and segment embeddings are combined to represent each input. Tokenization breaks text down into smaller units that a machine can process. These tokens are then mapped to dense numerical vectors. Position embeddings record where each token sits in the sequence so the model can understand word order. Segment embeddings tell the model which sentence each token belongs to, so it can tell sentences apart. BERT is a pretrained model and can be adapted to specific tasks, such as answering customer queries on a website based on previous customer questions and answers.
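The way token, position and segment information is combined can be sketched by adding three small vectors per token. This is a toy illustration, not BERT itself: the vectors are random rather than learned, the dimension is 4 rather than the 768 used by BERT-base, and whole words stand in for the subword tokens BERT actually uses.

```python
import random

rng = random.Random(0)
DIM = 4  # toy size chosen for readability

def random_vector():
    return [rng.uniform(-1, 1) for _ in range(DIM)]

# Two sentences packed into one input, with special marker tokens.
tokens = ["[CLS]", "i", "ate", "an", "apple", "[SEP]", "it", "was", "sweet", "[SEP]"]
segments = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # 0 = first sentence, 1 = second

# Hypothetical embedding tables (random here, learned in a real model).
token_table = {tok: random_vector() for tok in dict.fromkeys(tokens)}
position_table = [random_vector() for _ in tokens]
segment_table = [random_vector(), random_vector()]

# Each input vector is the element-wise sum of the three embeddings.
input_embeddings = [
    [t + p + s for t, p, s in zip(token_table[tok], position_table[pos], segment_table[seg])]
    for pos, (tok, seg) in enumerate(zip(tokens, segments))
]

print(len(input_embeddings), "vectors of size", len(input_embeddings[0]))
```

The two “[SEP]” tokens share the same token vector, but their input vectors differ because their positions and segments differ.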

ELMo

ELMo was released in 2018. ELMo uses a bidirectional language model (biLM) to learn the context of sentences and capture the semantic meaning of words and of sentences as a whole. It reads the words surrounding each word in a text, so it captures not only the semantic relationship between words but also the context in which they are used. Depending on the sentence, the model can therefore tell in what sense a word is being used. ELMo uses Long Short-Term Memory (LSTM) networks to form a vector representation for each input word. A good example is the word “apple”. In the sentence “I ate an apple”, ELMo would produce a vector for “apple” that is close to food, health and nutrition. In a sentence like “my phone is made by Apple”, ELMo would produce a vector that is close to the company, business and technology. The word is the same in both sentences, but ELMo recognises that the context is different.

Examples of embeddings

Text embeddings

Text embeddings work by turning words, sentences, documents and websites into numerical vectors that capture their meaning. This allows machines to understand semantic relationships between texts, which helps search engines identify semantic similarity and provide the most accurate and relevant search results.

For example: when searching “cheap flights to Ibiza”, the search engine will also return “budget airlines Ibiza”, “Ibiza nightlife” and “Ibiza must-see sights”. This is because the embeddings allow the search engine to match intent and not just keywords.

Image embeddings

Image embeddings translate visual content into high-dimensional numerical vectors. These vectors are organised based on essential semantic features such as shapes, colours and textures. This allows search engines to return similar images based on visual similarity rather than file names or labels. This is used for reverse image search and recommendations.

For example: if a user uploads a picture of a blue jacket to Google, the search results would show similar blue jackets, or a branded one such as an Adidas blue jacket. This is because the search engine matches visual similarity.

Audio embeddings

Audio embeddings turn sound signals into vectors that represent characteristics such as tone, pitch and frequency. This helps search engines identify and compare audio content. Audio embeddings are used in voice assistants, music, voice search and recognition, and many other audio applications.

For example: if you have three WAV files containing different audio, a search engine cannot tell them apart visually or by their file names. Instead, it differentiates between them by turning features of the sound waves into vectors. The search engine can then return clips that sound similar or belong to a similar genre.

Video embeddings

Video embeddings convert moving visual content into vectors by gathering information from multiple frames, audio and motion patterns in the video.

For example: if you input a video of people playing a soccer match, other videos about playing soccer will come up. Related videos such as “penalty videos” or “reviewing soccer boots” will also show in the results. This is because video embeddings capture not only objects but actions and sequences too.

Multimodal embeddings

Multimodal embeddings combine different forms of data, such as text, video, audio and images, into a shared vector space. This shared vector space positions vectors based on semantic relationships, which allows search engines to connect meaning across different formats. For example: Gemini Embedding 2 is a multimodal embedding model created by Google. It allows text to be matched with images, videos and other data types.

Embeddings and SEO

Today, with advanced and evolving ML, LLM and NLP technology, search engines have become much better at understanding the meaning of a web page. Search engines no longer rely on keyword matching alone. Instead, search engines like Google produce results by connecting related topics and content. Before embeddings, search engines returned results that matched the search terms exactly, with no “suggested topics” related to the theme of your search. If you looked up the colour blue, you would only get content about the colour itself. Now you could also get images of the sea and the sky, products that are blue, or related colours such as green. This is because search engines look for similarity and meaning within content: blue and green have similar vectors because they are similar colours. In the past, web pages were ranked differently, and tactics like keyword stuffing and backlinks carried much of the weight. Now search engines look for meaning and connections within your content, so ranking is not as simple as stuffing keywords or creating a backlink.

Optimising for embeddings helps to set your web page apart from competitors. Adding relevant topics and context to your page helps the search engine understand it and connect it with similar content and products. This can increase visibility and naturally push your page up the search results. Adding meaning to your content shows the search engine why it is relevant to the searcher.