Written by Ramón Saquete

As promised, here is the second part of how a search engine or **Information Retrieval System (IRS)** works. In the first part we learned how to build the **inverse index**, which tells us which documents have each word associated with them and what importance or relevance the words have in each of them. In this second part we will see how this index is used to return the documents, **sorted by relevance**, given a user query.

To perform this task, we will first **pre-process the query** in the same way as during indexing; that is, if indexing removed stop words and reduced words to their roots, we do the same to the query. Then **we calculate the weight of each term or keyword** in the query with a formula similar to the following:

In this way, each word in the query has a numerical value associated with it. These numerical values are used to **represent the query as a vector within an n-dimensional space where each dimension is a different word**, let’s see it with an example:

Suppose the user enters the following query: “Human Level”. This query consists of two words and according to the above formula these would be their weights:

F(q, Human) = 0.5 + 0.5 = 1

F(q, Level) = 0.5 + 1 = 1.5

Since there are just two different words, we can represent them in a two-dimensional graph where, for example, Human would be the X-axis and Level the Y-axis:

This is when the **inverse index** comes into play, which tells us **in which documents these two words appear and also what weight they have in these documents**. With this information we can now **represent each document within the vector space** above, next to the query. Let’s assume that we have the following values in the index:

Index format: Term --> Document, weight | Document, weight | ...

Human --> 1,2 | 2,1 | 3,4
Level --> 2,1 | 3,3 | 4,2

With this information we know that we have the following vectors for each document:

Dimension X (Human), Dimension Y (Level)

Document 1 = (2, 0)
Document 2 = (1, 1)
Document 3 = (4, 3)
Document 4 = (0, 2)
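As a minimal sketch (the dictionary layout and names here are hypothetical, not a real IRS implementation), the inverted index above can be modeled as a mapping from each term to its postings list of (document, weight) pairs, from which the document vectors are assembled:

```python
# Hypothetical sketch: the inverted index above as a Python dict,
# mapping each term to its postings list of (document id, weight).
inverted_index = {
    "human": [(1, 2), (2, 1), (3, 4)],
    "level": [(2, 1), (3, 3), (4, 2)],
}

def document_vectors(index, terms):
    """Build one vector per document, with one dimension per query term.
    Documents that lack a term get weight 0 in that dimension."""
    docs = {}
    for dim, term in enumerate(terms):
        for doc_id, weight in index.get(term, []):
            docs.setdefault(doc_id, [0.0] * len(terms))[dim] = float(weight)
    return docs

vectors = document_vectors(inverted_index, ["human", "level"])
# Document 1 -> [2, 0], Document 2 -> [1, 1],
# Document 3 -> [4, 3], Document 4 -> [0, 2]
```

Note that only the terms present in the query need to be looked up, which is exactly why the inverted index makes retrieval fast: the rest of the vocabulary is never touched.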

Now we can represent them graphically together with the query vector:

The vectors of the documents in which the word Level is more important lie closer to the Y-axis; the vectors of the documents in which the word Human is more relevant lie closer to the X-axis; and on the diagonal, the two words are equally relevant.

To understand a formula well, it always helps to visualize what it does. This **graphical representation** is, of course, not something the IRS performs internally; it is shown here simply **so that you can understand the geometric meaning of the relevance calculation**.

We have moved **from the textual representation** from which we started **to a representation with vectors** in a Euclidean space **on which we can apply formulas** from the field of linear algebra. What we are going to do next is to apply a formula to **measure the distances between the query vector and all the document vectors**. These distances will tell us **how similar each document is to the query** and, therefore, will be our measure of relevance to finally show the user the ordered documents.

There are many ways to calculate this distance, but the most popular in IRS is the **cosine distance**:
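For reference, the standard form of this similarity measure, written for a query $q$ and a document $d$ whose terms have weights $w$, is:

```latex
\mathrm{Sim}(q, d) = \cos\theta =
\frac{\sum_{i=1}^{n} w_{q,i}\, w_{d,i}}
     {\sqrt{\sum_{i=1}^{n} w_{q,i}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{d,i}^{2}}}
```

The numerator multiplies, term by term, the weight of each word in the query by its weight in the document; the denominator divides by the lengths of the two vectors.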

You may have noticed that, despite its name, no cosine appears in the cosine distance formula. That is because the similarity measure it returns is actually **the cosine of the angle formed by the two vectors**. See image:

I will explain this in more detail in the next two paragraphs, for those who are interested in the mathematical explanation:

In the numerator we are adding the weight of each term in the query multiplied by the weight of that term in the document. The more astute will have already realized that what we are actually doing here **is calculating the scalar product or dot product** between the query vector and the document. Knowing that the scalar product can also be calculated as follows:
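This alternative expression for the dot product is the geometric identity that links the formula to the angle between the vectors: rearranging it shows that dividing the dot product by the product of the two vector lengths (the denominator of the similarity formula) leaves exactly the cosine of the angle.

```latex
\vec{q} \cdot \vec{d} = \lVert \vec{q} \rVert\, \lVert \vec{d} \rVert \cos\theta
\quad\Longrightarrow\quad
\cos\theta = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert\, \lVert \vec{d} \rVert}
```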

Finally, we have a relevance value between 0 and 1 for each document that allows us to order them:

Sim(q, d2) = 0.9976 ← most relevant document

Sim(q, d3) = 0.9778

Sim(q, d4) = 0.7546

Sim(q, d1) = 0.6562 ← least relevant document
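The whole ranking step can be sketched in a few lines. This is a simplified illustration, not the article's exact computation: the precise similarity values depend on the term-weighting formula used, but with the example vectors above and the query weights (1, 1.5), the resulting order of documents is the same one shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Query "Human Level" with the weights computed earlier: (1, 1.5).
query = (1.0, 1.5)
documents = {1: (2.0, 0.0), 2: (1.0, 1.0), 3: (4.0, 3.0), 4: (0.0, 2.0)}

# Sort document ids by similarity to the query, most relevant first.
ranking = sorted(documents,
                 key=lambda d: cosine_similarity(query, documents[d]),
                 reverse=True)
# -> [2, 3, 4, 1]: document 2 first, document 1 last, matching the order above.
```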

The truth, though, is that **the relationship between this angular distance and actual relevance is rather loose**, so **this formula is never used as is**: it is **modified and reinforced with a great deal of additional information**. In this sense, **we know many of the on-page and off-page factors used by Google**, such as PageRank, the geolocation and language of the query, your search history, social media activity, etc.

To provide richer results, the search engine can also **expand the user’s query** with synonyms, conjugations, plurals or related words or, if the query is not expanded, it can **include the results of variations of the query**.

All of the above relies **not only on mathematical and statistical formulas**, but also on **countless artificial intelligence techniques** such as fuzzy logic, neural networks, genetic algorithms, Bayesian classifiers, support vector machines, graph algorithms, etc.

On the other hand, search engines are increasingly incorporating more intelligent **Question Answering Systems**. These systems, unlike IRSs, **return concrete answers to questions entered by users**, a functionality that is enhanced by the growth of the semantic Web. We can see an example by entering the following question in Google:

*How tall is the Eiffel Tower?*

In short, search engines are **artificial intelligences capable of continuously learning, evolving and readjusting** themselves, to the point that not even the engineers who program and maintain them fully understand how they work. Because of this, and because SERPs now adjust to each user, **honest SEO consultants**, like those at Human Level, **will never guarantee the first position for a given keyword**, but they can help **your pages climb positions and attract more traffic**.