Designing search engine algorithms: a path to success in website design and optimization

Introduction

The easiest way to work out methods for promoting and developing a site under a specific search engine (PS) is to build a search engine of your own.
I am not talking about implementing complex algorithms; an abstracted solution is enough. You can simply imagine a simplified model of an algorithm and work with it. It is important to try to capture all the associated parameters: for example, the estimated implementation time, the load on the server, and the running time of the algorithms. Measuring these parameters yields much more information, which you can then use to your advantage.
Most novice webmasters and SEOs start from "I want": I want the search engine to give great weight to every link, I want non-unique content to be indexed well, and so on. In reality things are different, and many consider that wrong. At the same time, they do not stop to think what the Internet would look like if all their wishes came true, especially at scale. There is only one way to fight: go over to the side of the enemy, in our case the side of the search engine. By developing methods and algorithms that counter cheating, spam and low-quality sites, you can not only find the right methodology for developing your resources but also exploit the weak spots, the places where it is genuinely hard for search engines to find an algorithm that solves the problem.

Start with yourself

I do not claim that this algorithm is unique. I am only following the approach described above: looking for the patterns behind various algorithms, not in order to develop them in full, but to use their basic ideas.
One interesting idea came to me while developing an algorithm that checks new texts for uniqueness against existing ones. The basic algorithm is as follows: there is a main base of texts and a buffer base into which new texts are placed. Once in the buffer, each text is checked for uniqueness against the main base; if it is completely unique it is moved to the main base, and if not, it is deleted. The catch is that a literal comparison of full texts was not good enough: the algorithm had to search not for a full duplicate but for fragments of the new text inside the texts of the main base. I store the fragments in an array attached to the text being checked (the new one), as pairs of word-number intervals that correspond to each other, one interval in each of the two texts.
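The buffer check and the fragment array can be sketched roughly as follows. This is only an illustration: the article does not say how fragments are matched, so Python's standard difflib.SequenceMatcher stands in for that step, and the names matching_fragments, check_buffer and the min_words cutoff are hypothetical.

```python
from difflib import SequenceMatcher


def matching_fragments(new_text, base_text, min_words=3):
    """Return pairs of word intervals shared by the two texts.

    Each item is ((new_start, new_end), (base_start, base_end)) with
    exclusive end indices. min_words is an assumed cutoff, not a value
    taken from the article.
    """
    new_words = new_text.lower().split()
    base_words = base_text.lower().split()
    matcher = SequenceMatcher(None, new_words, base_words, autojunk=False)
    fragments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            fragments.append(((block.a, block.a + block.size),
                              (block.b, block.b + block.size)))
    return fragments


def check_buffer(buffer, main_base, min_words=3):
    """Promote unique buffer texts to the main base, reject the rest.

    Returns {rejected_text_id: {main_base_id: [fragment pairs]}}.
    """
    rejected = {}
    for text_id, text in buffer.items():
        found = {}
        for base_id, base_text in main_base.items():
            frags = matching_fragments(text, base_text, min_words)
            if frags:
                found[base_id] = frags
        if found:
            rejected[text_id] = found    # non-unique: not added to the base
        else:
            main_base[text_id] = text    # fully unique: moved to the base
    return rejected


main_base = {1: "wash your hair with warm water twice a week"}
buffer = {
    2: "experts say wash your hair with warm water every day",
    3: "completely different sentence about cars and engines",
}
rejected = check_buffer(buffer, main_base)
```

Here text 2 is rejected because six consecutive words also occur in text 1, while text 3 shares nothing and is moved into the main base.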

And here I noticed an interesting phenomenon that I would like to share. After a new text is checked, the list of main-base text IDs and the matching intervals for those IDs are stored only in the text that was checked last. You might think that the same intervals could also be written into the texts of the main base, but then the question arises: why do it? The main base has already been formed and meets all requirements. Only the new text is being tested, and modifying the finished base would only create additional confusion.

Thus, we have the following picture:

Suppose the search engine has 10 new documents, and the uniqueness check shows that document 2 has one duplicate and document 8 has four duplicates in the existing main base. Next, 10 more documents are added, and the fourth of them is a duplicate of document 8 from the first ten. As a result, the fourth document of the second ten gets an array of five duplicates, while document 8 of the first ten still has only four, even though five duplicates of that text now exist.

This gives us the first metric: the number of duplicates of the text at the time of indexing, stored as a constant.
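A minimal sketch of such a frozen counter is shown below. The duplicate test used here (two texts share at least one run of three words) is an assumption made for the example, and the names Index, shingles and is_duplicate are hypothetical.

```python
def shingles(text, size=3):
    """Set of all runs of `size` consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def is_duplicate(a, b, size=3):
    """Assumed criterion: the texts share at least one word run."""
    return bool(shingles(a, size) & shingles(b, size))


class Index:
    """Main base that fixes each document's duplicate count on arrival."""

    def __init__(self):
        self.docs = []        # (doc_id, text) in indexing order
        self.dup_count = {}   # doc_id -> duplicates found at indexing time

    def add(self, doc_id, text):
        count = sum(1 for _, other in self.docs if is_duplicate(text, other))
        self.dup_count[doc_id] = count   # constant: never updated later
        self.docs.append((doc_id, text))
        return count


index = Index()
index.add("a", "rinse hair with cool water after shampoo")
index.add("b", "always rinse hair with cool water")
index.add("c", "rinse hair with cool water every morning")
```

Document "a" keeps a count of 0 even after two copies of it arrive, while "b" and "c" are recorded with 1 and 2 duplicates respectively.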

But this is only the technical side of a text. Besides that, a text has a theme: a short tag indicating what the text is about, so let's call it a thematic tag. Assume the search engine knows about 10 documents with the thematic tag "hair care". In essence these are 10 versions of the same document, yet on the technical side the overlap between them is always less than 100%. Therefore the search engine, just as it counts technical duplicates for the first metric, also counts the thematic duplicates of a new document in the existing main base.
From this we obtain the second metric: the number of thematic duplicates of the document, counted by tag.
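The second metric can be sketched in the same way, as a count taken when the document arrives. The article does not say how a thematic tag is assigned, so the tag is simply passed in here; ThematicIndex is a hypothetical name.

```python
from collections import Counter


class ThematicIndex:
    """Records how many documents with the same tag already existed."""

    def __init__(self):
        self._seen = Counter()     # tag -> documents indexed so far
        self.thematic_dups = {}    # doc_id -> count at indexing time

    def add(self, doc_id, tag):
        self.thematic_dups[doc_id] = self._seen[tag]
        self._seen[tag] += 1
        return self.thematic_dups[doc_id]


themes = ThematicIndex()
themes.add("doc1", "hair care")
themes.add("doc2", "hair care")
themes.add("doc3", "car repair")
themes.add("doc4", "hair care")
```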

Now let's try to rank documents using the metrics described above. These two metrics are of course not the only ones, but they are the main ones and, at first sight, sufficient. However, a third metric suggests itself, as will become clear at the end of the example.

Suppose the search engine must rank 50 documents with the tag "hair care". To do this, we will build the ranking algorithm on the two metrics above.

We take the first metric as the primary one, since technical uniqueness of the text is preferable, and perform a multi-key sort of the 50 available documents: first by the first metric, then by the second. Now let's try to draw conclusions from the resulting order:

After the sort, the first positions are occupied not only by technically unique documents but also by documents that have a large number of duplicates yet carry the oldest date, because their duplicate count was fixed at indexing time, before the copies appeared.
The last positions are occupied not only by non-unique documents but also by unique documents that appeared recently, pushed there by the sort on the second metric.
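The two-key sort itself is short. In this sketch the Doc fields and the sample values are invented for illustration; only the ordering rule (first metric, then second) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    technical_dups: int   # first metric, fixed at indexing time
    thematic_dups: int    # second metric, fixed at indexing time


def rank(docs):
    """Sort ascending by the first metric, then by the second."""
    return sorted(docs, key=lambda d: (d.technical_dups, d.thematic_dups))


docs = [
    Doc("copy", technical_dups=3, thematic_dups=10),
    Doc("new_unique", technical_dups=0, thematic_dups=40),
    Doc("old_original", technical_dups=0, thematic_dups=0),
]
ranked_ids = [d.doc_id for d in rank(docs)]
```

Ties on the first metric are broken by the second, so of the two technically unique documents the one indexed when fewer same-tag documents existed comes first.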

Given these conclusions, it is very difficult for a brand-new website to reach first place, which pushes many toward a favorite method of increasing visitor numbers: churning out large volumes of similar, same-type documents and websites. This in turn suggests the next metric, which limits the number of documents to a reasonable level.

The third metric: the number of documents required to meet visitor demand, calculated from search queries.

Documents in excess of the number given by the third metric are removed from the end of the sorted list above. This, at the very least, explains why fresh sites drop out so quickly in very popular and competitive niches.
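As a rough sketch, the cutoff is a truncation of the ranked list. How the required number is derived from search queries is not described in the article, so the formula in required_documents (and its queries_per_document parameter) is purely an assumption for the example.

```python
import math


def required_documents(monthly_queries, queries_per_document=100):
    """Assumed demand model: one document per N queries, at least one."""
    return max(1, math.ceil(monthly_queries / queries_per_document))


def apply_cutoff(ranked_docs, limit):
    """Keep the top `limit` documents; the tail of the list is dropped."""
    return ranked_docs[:limit]


ranked = [f"doc{i}" for i in range(1, 51)]   # 50 documents, already sorted
limit = required_documents(monthly_queries=2500)
kept = apply_cutoff(ranked, limit)
```

With these assumed numbers, 25 of the 50 documents survive, and everything ranked below position 25 is removed regardless of its own quality.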

Results

Of course, these metrics are basic but not the only ones, as I mentioned in the article. With a thorough analysis of the algorithm, another dozen secondary metrics could be added for greater accuracy, but that, as they say, is beyond the scope of this article.

Article based on information from habrahabr.ru
