Trigeminy index or "search the mistakes"

once on duty, there is a need to add to the search on the site to all known features, the service is "Maybe you meant..." or "Search for mistakes". Began to think how to implement it. Third-party services and APIs to use is not wanted, because the time to someone else's servers and back, and in General not very good. Just by the way was the pg_trgm module, which looks similar to the query word based on trigrammes index.


the

Implementation


To begin with – how it is used.
To search similar words, you need to create a list of valid options. Create a table with a text field, which later will hang trigeminy index.

CREATE TABLE "public"."tbl_words" (
"word" VARCHAR(255) NOT NULL
) WITHOUT OIDS;


* This source code was highlighted with Source Code Highlighter.


Fill in the table in various ways, for ourselves, we decided this question:
— Dictionary of Zaliznyak (~90 thousand words), dictionary of Russian literature (~160 thousand words), or any other or all together. Dictionaries are easily found in the network, typically a line by line list of words to parse this in the database is not working.
— We have a book shop, about 200 thousand products in the database, everyone has a title, author, brief annotation, and so on. As people are likely to search using a word from these data, putting it all together, divide by whitespace characters, unique words and fill in the table.

Then add the index.

CREATE INDEX "tbl_words_trgm_idx" ON "public"."tbl_words"
USING gist ("word" "public"."gist_trgm_ops");


* This source code was highlighted with Source Code Highlighter.


Like that scheme already can be used.

select word, similarity(word, 'Pushkin') from tbl_words

* This source code was highlighted with Source Code Highlighter.


A similar query will give us similar words and rate, which indicates how the word is similar to that proposed. Sorting by rating "similarity" in reverse order, get the most similar words.

select word from tbl_words where word % 'Pushkin' order by similarity(word, 'Pushkin') desc word limit 5

* This source code was highlighted with Source Code Highlighter.


Next, you will determine when the person made a mistake and he needs to offer the right option. The most obvious option – if the customer on his request found nothing, or, for example, found few results, check similar options. If you search for similar options issued some number of results that show a hint to the user.
At this stage, to achieve the correct result, you can tweak the values to give/not to give, how much betrayed the search for similar words and other properties. Should set the threshold value of "similarity rating" that word does not even need to be considered as similar.

This method, however, there are several big disadvantages:
1) the Dictionary consists of individual words, whereas search queries are typically polysyllabic. Search for similar phrases have to be completed separately by words, then combine them. That is, for example, with the search phrase "ragnaia the poetry of Pushkin", which will not give results, you have to look for similar words, respectively, for "ragnaia", "poetry", "Pushkin." If you take 2 similar words, the number of variations is equal to 8. This creates a good load on the database, which is usually not pleasant.

Total:
Pros – easy implementation, a large number of options hints.
Cons – loading the database with large queries, periodically incorrect hints.


the

Option 2.


If you are the statistics of search queries, you can use another option.
Creates a table of unique search queries, field queries on the same principle trigeminy is added to the index. And compliance is sought, not alone by words, but by the complete search phrase in the saved queries.

Total:
Pros – phrase tips compiled by users less likely to "stupid" tips.
Cons – the completeness of the database depends on your statistics queries.

After repeated testing, we decided to abandon the first method in favor of the second, it was too unpredictable were the results) the Results of the second method can be viewed SDAs.


How it works.


To the end everything became clear, tell you how it works. It's pretty simple.
Each word is divided into three-letter combinations of trigrams.
For example:



When searching for similar phrases that are searched the same trigrams. The more equal the trigrams, the more the phrase is similar to the original.


the

Similar products.


Trigeminy index has proved useful in one case. On the product page we have a variety of blocks of sentences: "this author", "good publisher", and so on. One of them – "Similar products".
On many sites found similar function, but from experience, it's usually items from the same category as the main product, or a list of manually assigned. Saw sales, in which derived goods found through full-text search, for example, two or one word from the name of the main product. But it also gives, often not very predictable results. For example to the book "application Architecture in C++" was given books on architecture and building.

Trigeminy index set on product names, gave a good "relevant" results) Example here or on every product page on the website.


If someone is interested, next time I will lay out the benchmarks and source codes.


UPD: Here's another example of search — moving letters.

Article based on information from habrahabr.ru

Комментарии

Популярные сообщения из этого блога

Monitoring PostgreSQL + php-fpm + nginx + disk using Zabbix

Templates ESKD and GOST 7.32 for Lyx 1.6.x

Customize your Google