Have you ever wondered how Google ranks its search results?
Let's do a little experiment.
Let's enter the name AI WEIWEI into Google; he is a major Chinese artist,
musician, and civic activist, but also a dissident.
We quickly discover that the Chinese version of Google indexes 90,000 fewer pages
than the Czech or international versions.
And what is more, it ranks the search results in a completely different way.
Therefore, if you are interested in how Google ranks search results
and why you see what you see in your search engine, watch the following video.
(TOMÁŠ ČERNÝ SEARCH ENGINE TRENDS)
Before we can answer the question of why and how search engines work,
we need to take a short excursion into history.
When TIM BERNERS-LEE came up in 1989 with his concept of the web as a network composed of documents
and the hyperlinks between them, the entire network was in principle very simple.
It contained only a few text documents and the links between them. (THE WEB IN THE EARLY 1990S)
No pictures, no graphs.
The web knew nothing of the sort back then.
Very quickly, however, it became apparent that the volume of web pages was growing rapidly and that the whole concept of the web,
as a set of hypertext links between documents, would have much greater ambitions.
The original web did not anticipate that documents could be searched
or browsed in any more advanced way.
Everyone had to remember the exact address of a document and enter it into the browser,
and only then was the appropriate web page displayed.
The first project that tried to do something about this situation was ARCHIE,
which was launched in 1990.
It allowed browsing and indexing FTP archives, searching through individual files and items,
and navigating between them.
Each file could also be assigned specific labels or tags,
which in turn facilitated the search.
The key limitation turned out to be the absence of automatic processing.
AUTOMATIC PROCESSING means that each new entry does not have to be
entered into the system manually by a human being; instead, it is handled by an algorithm,
usually called a robot or agent.
Such an agent is able to identify each new object, assign it the proper labels,
and classify it within the overall structure of the site so that it is
easily searchable.
This feature did not appear until the WORLD WIDE WEB WANDERER in 1993.
Commercial systems emerged in 1994; the first to appear on the market was ALTAVISTA.
It was followed by LYCOS and especially Yahoo Search.
In YAHOO SEARCH, we can see an interesting shift.
It is not just a simple search engine that can search only the tags or labels
entered by a user or robot; it also contains a subject catalogue.
It offers sites grouped into logical categories,
so that a typical user can easily find exactly the page
that is dedicated to the topic they are interested in.
Examples of catalogues still widely used today are databases of artisans.
If we are looking for a plumber in Brno, we can take two approaches.
We either use a classical full-text search engine and try to find results
by entering keywords, or reach for a subject catalogue,
where, based on a subject key, we find one easily.
A great advantage of catalogues is primarily their better validity and their verified performance and quality.
But back to the search engines as we know them today.
In 1996, Google appeared, and even though two years later it still strongly resembled an "under construction" version,
the number of indexed pages grew neither linearly nor quadratically, but exponentially.
And this situation lasts to this day. (GROWTH OF WEBSITES INDEXED BY GOOGLE'S SEARCH AGENT, IN BILLIONS)
What we have talked about until now was history.
Search engines originally worked on the principle of objective search results,
meaning that each site was assigned a page rank that unambiguously determined
its order on the results page.
PAGE RANK is basically a number resulting from fairly complicated calculations. The basic variables
from which it is determined include the number of links leading to the website,
the number of links leading from it,
site traffic, and many other parameters, such as the number of headings
and the occurrence of keywords.
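The link-counting core of this idea can be illustrated with a toy calculation. The following Python sketch is only a minimal illustration: the three-page graph and the damping factor 0.85 are assumptions made for the example, not Google's actual data or parameters.

```python
# Toy PageRank over a tiny link graph; the graph and the damping
# factor are illustrative assumptions, not Google's real parameters.
damping = 0.85
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}   # start with equal ranks

for _ in range(50):  # power iteration until the ranks stabilize
    new_rank = {}
    for p in pages:
        # rank flowing into p from every page q that links to it
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(max(rank, key=rank.get))  # → C (it gathers links from both A and B)
```

A page linked to by many well-ranked pages ends up high; in this toy graph, C receives links from both A and B, so it comes out on top.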
This procedure, however, is no longer enough for today's search engines, which try to offer results
tailored to each particular user to the maximum extent possible.
Based on our previous information behaviour,
on which links we clicked
and what we actually enjoyed, they try to guess what we would like to get in the future,
and they modify the search results accordingly.
This analysis allows them not only to offer better and more convincing results
to each particular user, but also better-targeted advertisements.
This process is called PERSONALIZATION, and in principle there are three basic ways
to do it.
The simplest is to study individual user clicks, move the most frequently
clicked links up in the results,
and, conversely, identify the ones the user is not interested in
and either move them down or replace them with entirely different ones.
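This simplest approach can be sketched in a few lines of Python. The click store and the ranking rule below are assumptions made for the example; real engines combine many more signals.

```python
# A minimal sketch of click-based re-ranking: results the user clicked
# most often move up; unclicked results keep the engine's original order.
from collections import Counter

clicks = Counter()  # per-user click history (illustrative)

def record_click(url):
    clicks[url] += 1

def personalize(results):
    # stable sort: more clicks first; ties keep the original order
    return sorted(results, key=lambda url: -clicks[url])

results = ["latex-rubber.example", "latex-typesetting.example", "latex-paint.example"]
record_click("latex-typesetting.example")
record_click("latex-typesetting.example")
record_click("latex-paint.example")

print(personalize(results))
# → ['latex-typesetting.example', 'latex-paint.example', 'latex-rubber.example']
```

Because Python's sort is stable, results the user has never clicked simply retain the engine's original ordering at the bottom.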
After a deeper analysis, the software agents working inside search engines are able to decide
whether you are an IT specialist who, entering the word LaTeX, is interested in the language for typesetting documents,
or whether you are seeking information on the white liquid that arises in the laticifers of some plants
and from which tires or, for example, condoms are produced.
Google entered the second phase of the personalization of search results in 2004.
Since then it has been interested not only in your search history;
it also connects a range of other data to it,
such as information from your Gmail mailbox,
documents from GDrive,
or data from the Google+ social network.
With all this information it has a perfect user profile,
one that can offer superior search results as well as extremely well-targeted advertising.
If you want to see what can be found about you on the Internet,
you can try an interesting Viennese service called DATADEALER.
You enter information about what profiles you have set up and what you disclose in them;
you briefly describe your digital footprint, and based on this information DataDealer
displays a sort of profile showing what one can find about you on the Internet
and what would-be attackers could actually use against you.
This situation is funnily illustrated by this clip. (DATAMINING)
The third option for working with personalization of search results is the computer processing of emotions.
Using intonation, heartbeat, skin temperature, galvanic skin resistance, and a number of other parameters,
it is relatively easy to build an emotional profile
of a user and, based on it, to personalize the search results and modify them according to
how the user is feeling.
Specific results can be displayed when the user happens to be angry,
sad, or annoyed.
Examples of these technologies are APPLE SIRI and GOOGLE NOW.
These voice assistants allow searching for information by voice, and the detection of emotions itself
is extremely important for ensuring that these machines understand what one is actually talking about.
If one gives commands to such a machine using irony,
hyperbole, or another common linguistic device, the machine must be able to
identify and recognize these emotions and provide the correct search results.
Computer processing of emotions is also highly important for the speech synthesis
that is absolutely crucial to the functioning of these assistants.
However, the future of search goes further.
While current search engines essentially handle, in a very clever way,
data whose meaning they do not understand, the future should bring search engines
able to work with information: data to which they can assign due importance and value,
and from which they can derive new relations and knowledge.
The concept built on the idea that the web should work with information, not mere data,
is called the semantic web.
Semantic web is, however, a bit like a Yeti.
Everyone talks about it, but no one has ever seen it.
To teach a machine to understand data, a markup language called RDF is used,
which is, nevertheless, relatively complicated. (RESOURCE DESCRIPTION FRAMEWORK)
To show how it actually works,
let's use the trivial example of the sentence "The author of the book The Grandmother (in Czech: Babička) is Božena Němcová."
In this case, THE GRANDMOTHER BOOK is the SUBJECT, AUTHORSHIP is the PREDICATE, and BOŽENA NĚMCOVÁ is the OBJECT.
However, if we reformulate the sentence with the same meaning,
"Božena Němcová is the author of The Grandmother book," BOŽENA NĚMCOVÁ becomes the SUBJECT and THE GRANDMOTHER BOOK the OBJECT.
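The subject-predicate-object structure can be modelled with plain Python tuples. In this sketch the "ex:" identifiers and the inverse-predicate table are hypothetical names made up for the example; real RDF uses full URIs and serializations such as Turtle or RDF/XML.

```python
# Modelling RDF-style triples with plain tuples; the "ex:" names and
# the inverse-predicate table are hypothetical, made up for this example.
triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

# "The author of Babička is Božena Němcová."
add("ex:Babicka", "ex:hasAuthor", "ex:BozenaNemcova")

# "Božena Němcová is the author of Babička." swaps subject and object,
# so it uses an inverse predicate; normalizing maps it back to the
# canonical form before storing:
def normalize(s, p, o):
    inverses = {"ex:isAuthorOf": "ex:hasAuthor"}
    return (o, inverses[p], s) if p in inverses else (s, p, o)

add(*normalize("ex:BozenaNemcova", "ex:isAuthorOf", "ex:Babicka"))

print(len(triples))  # → 1: both formulations state the same fact
```

Because both sentences are reduced to one canonical triple, a machine can recognize that they express the same fact regardless of how the sentence was phrased.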
In fact, it is much more complicated.
Automated processing of natural language and derivation of knowledge must be based
on somewhat different principles, although the motivation with which we approached
the concept of the semantic web remains.
Currently, the only functional solutions are closed proprietary systems,
of which a beautiful example is WOLFRAM ALPHA.
Are you wondering what the weather was like in Prague on 17 November 1989?
That is, on the day of the revolution?
The temperature was -1 °C, and between 2 and 5 PM it was even windy.
Another approach is taken, for example, by Google, which operates its GOOGLE KNOWLEDGE GRAPH.
It tries to derive or distil from the search results the basic information
the user may want to find, so that they do not have to leave the search results page at all.
Examples are the Google Knowledge Graph entries for Václav Klaus or for the domestic cat.
If you are interested in the answer to the question from the video's introduction, that is, how Google ranks search results,
the answer is a single word: PERSONALIZATION.
It is not just the analysis of previous information behaviour,
but also an attempt to obtain the greatest possible amount of additional data:
data on what users like doing, their hobbies, and what social and cultural background they come from.
All of this is then combined with data mining and other advanced methods
to provide users with the most personalized, and therefore most tailored, search results.
In the same breath, however, we must add that the point is also to be able
to sell advertising as profitably as possible.
Search is one of the most important human activities.
If we live in the information society, we can say that searching for information,
processing it, and handling it is one of the most important activities
a person can do.
The ability to find the right information is necessary for economic, social, and cultural integration
into the information society.
If 20 % of the EU population has never accessed the Internet, never searched for anything,
and is unable to work with information, this fact presents a serious social and economic problem.
Information is a source of wealth, power and prosperity.
If we look at the ranking of the richest and most successful companies of our time, we find
that those that in some way handle information are represented among them almost exclusively.
A number of non-democratic countries, such as China, Iran, or Cuba, are closing off their Internets and turning them
into strange islands on which you cannot find anything.
In democratic states, we depend instead on the goodwill of search engines and their administrators;
moreover, the lack of copyright protection systems suited to the modern industry
often results in agreements such as PIPA, SOPA, and ACTA, which
in fact greatly restrict and complicate Internet search.
Although search may at first seem a relatively trivial and uninteresting issue,
it has a crucial influence on what information reaches us and what we think about,
as well as, indirectly, on the way we perceive the world around us.
It is therefore definitely worth paying it adequate attention
and thinking about how it works.
If you want to know more, you can look at my article
"Future Searching" Between Privacy, Technology and Legislation.
THE VIDEO WAS CREATED WITHIN THE PROJECT: CENTER OF INFORMATION EDUCATION: DEVELOPMENT OF INFORMATION LITERACY AT MU, PROJECT REG. NO. CZ.1.07/2.2.00/28.0241.