How does a search engine like Google work? – Part II: Relevance calculation

Ramón Saquete

Written by Ramón Saquete

As promised, here is the second part of how a search engine or Information Retrieval System (IRS) works. In the first part we learned how to build the inverse index, which tells us which documents have each word associated with them and what importance or relevance the words have in each of them. In this second part we will see how this index is used to return the documents, sorted by relevance, given a user query.

To perform this task, we will first pre-process the query as it was done in the indexing, that is, if in the indexing we eliminated stop words and calculated the root of the words, we will do the same in the query. Then we will calculate the weight of each term or keyword in the query with a formula that can be similar to the following:

calculation of search engine query term weights
“q” is the query, “t” is the term and “fqt” is the frequency of occurrence of the term in the query.

In this way, each word in the query has a numerical value associated with it. These numerical values are used to represent the query as a vector within an n-dimensional space where each dimension is a different word, let’s see it with an example:

Suppose the user enters the following query: “Human Level”. This query consists of two words and according to the above formula these would be their weights:

F(q,Human) =0.5+0.5=1
F(q,Level)=0.5+1=1.5

Since they are just two different words, we can represent them in a two-dimensional graph where, for example, Human would be the X-axis and Level the Y-axis:
graph of the terms of the consultation

This is when the inverse index comes into play, which tells us in which documents these two words appear and also what weight they have in these documents. With this information we can now represent each document within the vector space above, next to the query. Let’s assume that we have the following values in the index:

Formato del índice:
Término --> Documento, peso | Documento, peso | ...

Human   --> 1,2 | 2,1 | 3,4
Level   --> 2,1 | 3,3 | 4,2

With this information we know that we have the following vectors for each document:

Dimensión X (Human), Dimensión Y (Level)
Documento 1 =(2   ,  0         )
Documento 2 =(1   ,  1         )
Documento 3 =(4          ,  3         )
Documento 4 =(0          ,  2         )

Now we can represent them graphically together with the query vector:

how search engines work

The vectors of the documents in which the word Level is more important will be closer to the Y-axis, the vectors of the documents in which the word Human is more relevant will be closer to the X-axis and on the diagonal the two words will be equally relevant.

To understand a formula well it is always easier to visualize what it does and this graphical representation, as is logical, is not done internally by the IRS, it is simply so that you can now understand the geometric meaning of the relevance calculation.

We have moved from the textual representation from which we started to a representation with vectors in a Euclidean space on which we can apply formulas from the field of linear algebra. What we are going to do next is to apply a formula to measure the distances between the query vector and all the document vectors. These distances will tell us how similar each document is to the query and, therefore, will be our measure of relevance to finally show the user the ordered documents.

There are many ways to calculate this distance, but the most popular in IRS is the cosine distance:

cosine distance
We calculate the similarity between query “q” and a document, thedi. In summations, m is the number of terms or dimensions.

You may have noticed that in the cosine distance formula, apparently no cosine appears, this is because the similarity measure returned by this formula is actually the cosine of the angle formed by the two vectors. See image:

relevance calculation

I will explain this in more detail in the next two paragraphs, for those who are interested in the mathematical explanation:

In the numerator we are adding the weight of each term in the query multiplied by the weight of that term in the document. The more astute will have already realized that what we are actually doing here is calculating the scalar product or dot product between the query vector and the document. Knowing that the scalar product can also be calculated as follows:

scalar product
As many of you already know, the horizontal bars represent the norm of the vector, which is calculated as the square root of the sum of its squared components.
angulo-entre-vecotres
If we do not take the cosine to the other side as arccosine, we have the formula that is usually used to calculate the angle between two vectors.

Finally, we have a relevance value between 0 and 1 for each document that allows us to order them:

Sim(q,d2)=0.9976 < Most relevant document

Sim(q,d3)=0.9778

Sim(q,d4)=0.7546

Sim(q,d1)=0.6562 < Less relevant document

Although the truth is that the relationship between angle distance and actual relevance is actually a bit vague, so this formula is never used as is, but is modified and reinforced with quite a bit of additional information. In this sense, we know many of the on-page and off-page factors used by Google such as PageRank, geolocation and language of the query, your search history, social media activity, etc.

To provide richer results, you can also expand the user’ s query with synonyms, conjugations, plurals or related words or, if not expanded, you can also include the results of variations of the query.

For all of the above, not only mathematical and statistical formulas are used, but also countless artificial intelligence techniques such as fuzzy logic, neural networks, genetic algorithms, Bayesian classifiers, support vector machines, graph algorithms, etc.

On the other hand, search engines are increasingly incorporating more intelligent Question Answering Systems. These systems, unlike IRSs, return concrete answers to questions entered by users, a functionality that is enhanced by the growth of the semantic Web. We can see an example by entering the following question in Google:

How tall is the Eiffel Tower?

In short, search engines are artificial intelligences capable of continuously learning, evolving and readjusting themselves, so that not even the engineers who program and maintain them can fully understand how they work. Because of this and because SERPs now adjust to each user, honest SEO consultants, like those at Human Level, will never ensure that you can get the first position for a certain word, but they can make your pages increase positions and get more traffic.

  •  | 
  • Published on
Ramón Saquete
Ramón Saquete
Web developer and technical SEO consultant at Human Level. Graduated in Computer Engineering and Technical Engineering in Computer Systems. He is also a Technician in Computer Applications Development and later obtained the Pedagogical Aptitude Certification. Expert in WPO and indexability.

Tags

What do you think? Leave a comment

Just in case, your email will not be shown ;)

Related Posts

en