Implementation of search engine ranking in Python (Part 3)

In the previous section we learned how to query the index we built, and now we can retrieve links to documents that contain the terms we asked for. But there is a problem: the result is just a list of documents that might contain what we need. It is not sorted by how important the information in each document is to us. This is the problem we discuss in this part.

Ranking of query results


The final step in building a search engine is creating a system for ranking documents by their relevance to the query. This is the most difficult part, because it has no direct technical solution: it requires creativity and your own judgment. Here we implement TF-IDF ranking (TF — term frequency, IDF — inverse document frequency), which is one of the simplest ways to sort our documents. There will be no engine code in this part, but you can examine the final version of the engine on GitHub. We will only cover the theory of TF-IDF; its implementation is quite simple, and most of the work is done during index building.

So, term frequency is the first part of our ranking system? Well, it is exactly what comes to mind when you hear the phrase: the number of times each word occurs in a particular document. Term frequency as a metric does not consider the query: it assumes that a document is just an unordered set of tokens, and that it can be accurately represented by simply counting how many times each token (word) occurs. This is not a perfectly accurate assumption, but it is widely used in the field of document classification. Technically, it is better known as the "bag of words" model.
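As a minimal sketch of this counting step (the function name here is my own, not from the engine's code), a bag of words is just a token histogram:

```python
from collections import Counter

def term_frequencies(document: str) -> Counter:
    """Bag of words: count how many times each token occurs in a document.
    Tokenization here is a naive lowercase split, purely for illustration."""
    tokens = document.lower().split()
    return Counter(tokens)

tf = term_frequencies("the way they eat cake the way")
# Word order is discarded; only the counts survive.
```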

One simple view of the bag-of-words model is the vectorization of documents. That is, we decompose a document into a vector of length N, where N is the total number of unique terms in the document, and each entry is the number of times a particular term occurs in the document. For multiple documents, a more convenient way to define N is as the number of unique words across all documents in the search space. This lets us represent each document as a vector, with all the vectors having equal dimension.
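A sketch of that vectorization, assuming the same naive tokenizer as before (helper names are hypothetical):

```python
from collections import Counter

def build_vocabulary(documents):
    """N unique terms across the whole document set, in a stable sorted order."""
    return sorted({tok for doc in documents for tok in doc.lower().split()})

def vectorize(document, vocab):
    """Represent a document as a term-count vector of length N = len(vocab).
    Terms absent from the document get a count of 0."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocab]

docs = ["the way they eat cake", "eat more cake"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Every document vector has the same dimension, which is what makes them comparable later.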

But wait! Our model currently has a big drawback. Consider these two documents: "The way they eat cake" and "The way they eat cake. The way they eat cake." They are the same, except that the second is just the first duplicated, yet their vector representations are very different: [1, 1, 1, 1] compared to [2, 2, 2, 2]. To solve this problem, we convert each vector into a unit vector by dividing it by its magnitude (computed by taking the square root of the sum of the squares of its entries), thereby "normalizing" it. This turns both of our previous vectors into [½, ½, ½, ½], making them identical, just as we intend.
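The normalization step can be sketched in a few lines; this reproduces the [½, ½, ½, ½] result from the example above:

```python
import math

def normalize(vector):
    """Scale a vector to unit length by dividing each entry by the magnitude
    (the square root of the sum of squared entries)."""
    magnitude = math.sqrt(sum(v * v for v in vector))
    return [v / magnitude for v in vector]

# A document and its duplicate collapse to the same unit vector.
normalize([1, 1, 1, 1])  # [0.5, 0.5, 0.5, 0.5]
normalize([2, 2, 2, 2])  # [0.5, 0.5, 0.5, 0.5]
```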

However, this is still not enough. Term frequency has a fatal flaw (as befits any Greek tragedy). The flaw is that the "bag" treats every term as equally important when representing documents. But they are not equal: the word "and" tells us much less about a document than words like "Shakespeare" or "chlamydia". Yet the word "and" occurs much more often than the word "chlamydia" (at least in the documents I read), which creates a false semblance of similarity between documents (since almost all of them contain the word "and").

To avoid this, we need to add something to our ranking system: inverse document frequency. Define Dt as the number of indexed documents in which the term t occurs. Then, if we have K documents, the inverse document frequency (Lt) of that term is the ratio of K to Dt: the total number of documents divided by the number of documents that contain the term.
There is one final caveat: this adjusts too much. If we have 10 documents (K = 10), one word occurs in all 10 of them, and another word occurs in only 1, then the second word comes out 10 times more important than the first. Such linear behavior is too radical and would artificially reduce similarity by over-correcting. To fix this, we simply apply the natural logarithm, which flattens the function and makes its correction smoother. Now our function looks like this: given a set of K documents, for some term t, Lt = ln(K/Dt), where Dt is the document frequency of term t and ln is the natural logarithm.
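A sketch of Lt = ln(K/Dt), with a hypothetical toy corpus (the function name is mine, not the engine's):

```python
import math

def inverse_document_frequency(term, documents):
    """L_t = ln(K / D_t): K documents in total, D_t of them contain the term.
    Assumes the term occurs in at least one document (D_t > 0)."""
    K = len(documents)
    D_t = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(K / D_t)

# A word found in every document carries zero weight; a rare word carries a lot.
docs = ["the cake", "the pie", "the tart", "chlamydia notes"]
```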

An implementation note: as you can see, neither of these quantities depends on the query, and both can (and should) be calculated for each token (word, term) and document in advance!

Now, to combine the term frequency (within a document) and the inverse document frequency into a single metric, we can simply multiply them. That is, the weight of each token (word, term) in our document set is defined as Tt×Lt: the frequency of the term in the document multiplied by its inverse document frequency.

Now, if we have a set of K documents, and all these documents together contain N unique terms, then each document is represented as a vector of length N, where the value of each entry (corresponding to a term) equals the frequency with which the term occurs in the document multiplied by the term's inverse document frequency over the document set. A vector will have a value of 0 for every term that does not occur in the document (remember, our vectors represent all the unique terms in the entire document set). The inverse document frequency itself, though, is never 0, because it is a corpus-level metric.
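Putting the two quantities together, a length-N TF-IDF vector might be built like this (a sketch with a hypothetical two-document corpus; at index time the vocabulary and IDF table would be precomputed):

```python
import math
from collections import Counter

def tf_idf_vector(document, vocab, idf):
    """Length-N vector: each entry = term count in the document × IDF of the term.
    Terms absent from the document naturally get 0."""
    counts = Counter(document.lower().split())
    return [counts[term] * idf[term] for term in vocab]

docs = ["eat cake", "eat pie"]
vocab = sorted({t for d in docs for t in d.split()})   # ['cake', 'eat', 'pie']
# IDF per term: "eat" occurs in both documents, so ln(2/2) = 0.
idf = {"cake": math.log(2), "eat": 0.0, "pie": math.log(2)}
```

Note how the weight of "eat" is zeroed out everywhere: a term present in every document tells us nothing.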

Now we do the same for the query: we represent it as a vector in the same N-dimensional space as the documents and calculate TF-IDF for each term in the query. Obviously, the query vector will be much sparser than the document vectors.

A small digression aside, let us try to compute a similarity score between the query and each document in its result set, and rank the documents by it. There are many different approaches to this, but I will use what is called cosine similarity (also known as the Ochiai coefficient), which essentially takes the dot product of the query vector and a document vector from the result set and divides it by the product of the magnitudes of the two vectors, which yields the cosine of the angle between them (I may be explaining the formula poorly in words, so if you are interested, read the Wikipedia article or this StackOverflow question for clarification). This metric is especially useful because it does not take the magnitudes of the two vectors into account when evaluating similarity (unlike, say, Euclidean distance), which matters a lot when one vector is very sparse (the query) and the other is much less sparse (the document).
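Cosine similarity as described is a few lines of arithmetic (the zero-magnitude guard is my own defensive addition):

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their
    magnitudes; equals the cosine of the angle between them."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    if mag_a == 0 or mag_b == 0:
        return 0.0          # a zero vector is similar to nothing
    return dot / (mag_a * mag_b)

# Identical direction -> 1.0; orthogonal (no shared terms) -> 0.0.
```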

So, to rank the results, here is what we will do:

  1. First, calculate the TF and IDF values for each term in advance, and build a vector of length N for each document, using the TF×IDF product as the value of each entry.
  2. Then, compute the query and get the result set of relevant documents (using the previously described methods).
  3. After that, compute the query vector, which also has length N and uses the TF×IDF product as each of its entries.
  4. Then, calculate the similarity between the query and each document in the result set (using cosine similarity), obtaining a score for each document.

And boom, we're done.
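The steps above can be sketched end to end in one small function. This is a minimal illustration with a hypothetical toy corpus, not the engine's actual code: a real engine would precompute the vocabulary, IDF table, and document vectors at index time and only score the documents returned by the index query.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query, documents):
    """Score every document against the query by TF-IDF cosine similarity
    and return (score, document) pairs, best first."""
    vocab = sorted({t for doc in documents for t in tokenize(doc)})
    K = len(documents)
    # Precompute IDF per term: ln(K / D_t).
    idf = {}
    for term in vocab:
        d_t = sum(1 for doc in documents if term in tokenize(doc))
        idf[term] = math.log(K / d_t)

    def vector(text):
        # Query terms outside the vocabulary are simply ignored.
        counts = Counter(tokenize(text))
        return [counts[t] * idf[t] for t in vocab]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        mags = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / mags if mags else 0.0

    q = vector(query)
    scored = [(cosine(q, vector(doc)), doc) for doc in documents]
    return sorted(scored, reverse=True)

docs = ["cake recipes and baking",
        "python search engine code",
        "baking a chocolate cake"]
ranked = rank("cake baking", docs)
# Documents sharing query terms score above the unrelated one.
```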

Conclusion


Creating a search engine that can scale to the size of Google is incredibly difficult. But making a simple one for your personal use (or even as a proof of concept) is not so hard. In fact, the methods for building a search engine's index, ranking its results, and querying its documents are quite clear, and building such an engine is an exercise well worth doing.

The bag-of-words model we used here shows up everywhere. Another great project besides a search engine is a spam filter, a document classifier, an article recommender, or some other similar module. It is a very cool concept, and it is worth pondering how you would use it to implement any of the above (or something even cooler).

Anyway, this is the end of the article. If you have any feedback or questions, you can leave a comment on the original article or reach the author by email/Facebook/whatever newfangled social network your kids are using these days.
Article based on information from habrahabr.ru
