How does a Search Engine work?

10:33 AM, Posted by Mini, No Comment

When a user accesses a search engine, they are presented with a graphical form in which they specify what they are looking for. When they tell the search engine to start the search (by pressing “Enter” or clicking a specific button), the search engine invokes a program that queries its database (a collection of all the web pages it has access to).

The results are returned to the user as a list of possible URLs. Often these are ranked by priority or success rate, with higher values meaning the page is more likely to contain the information you requested (what this really means is that the page contains more occurrences of the keywords you searched for than other documents do).
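The ranking idea can be illustrated with a minimal sketch. It assumes the "database" is just a small in-memory dictionary of URL to page text, and that the score is a plain count of keyword occurrences; the function names `score` and `rank` and the example pages are made up for illustration, and real search engines use far more elaborate ranking.

```python
import re
from collections import Counter


def score(document: str, keywords: list[str]) -> int:
    """Count how many times the keywords occur in the document."""
    words = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    return sum(words[k.lower()] for k in keywords)


def rank(pages: dict[str, str], query: str) -> list[tuple[str, int]]:
    """Return (url, score) pairs with a non-zero score, highest score first."""
    keywords = query.split()
    scored = [(url, score(text, keywords)) for url, text in pages.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical pages standing in for the search engine's database.
pages = {
    "http://example.com/a": "Cats and more cats. Cats sleep a lot.",
    "http://example.com/b": "Dogs chase cats.",
    "http://example.com/c": "A page about gardening.",
}

results = rank(pages, "cats")
for url, hits in results:
    print(hits, url)
```

The page that mentions the keyword most often comes first, and pages that never mention it are left out of the results.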

How does a search engine know where the information is?

There are a number of ways a search engine can know where information is to be found. Firstly, a search engine can list information by keywords or page titles. These keywords or titles (subject categories) can either be submitted by users who provide information on the internet, or be extracted by accessing web pages and reading the page title and keywords from the header of each page. This keyword extraction relies on the appropriate HTML code being present in the header of the web page (it is called a meta-tag). The advantage is that it is quicker to index information and less traffic is involved (only the headers are requested from websites; the entire web document is not read).
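As a rough sketch of this first method, the snippet below pulls the page title and the keywords meta-tag out of an HTML document using Python's standard `html.parser` module. The class name `HeadExtractor`, the helper `extract_head`, and the sample page are assumptions made for illustration, not how any particular search engine does it.

```python
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collect the <title> text and the keywords meta-tag of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Keywords are conventionally a comma-separated list.
            content = attrs.get("content") or ""
            self.keywords = [k.strip().lower() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head(html: str):
    """Return (title, keywords) found in the given HTML text."""
    parser = HeadExtractor()
    parser.feed(html)
    return parser.title.strip(), parser.keywords


# Hypothetical page used only to exercise the extractor.
sample = """<html><head>
<title>Growing Tomatoes</title>
<meta name="keywords" content="tomatoes, gardening, Vegetables">
</head><body>Ignored by this method.</body></html>"""

title, keywords = extract_head(sample)
print(title, keywords)
```

Note that the body text never reaches the index here, which is why this approach is cheap but depends entirely on the author having supplied a sensible title and meta-tag.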

The second method relies on reading every page the search engine knows about (usually pages are submitted for inclusion by web authors). This technique uses programs called spiders or web robots, which request every page, extract all the words from its content, and store these words in a large database.
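The storing step can be sketched as a simple inverted index, that is, a mapping from each word to the set of pages that contain it. This sketch assumes the spider has already fetched the pages and reduced them to plain text; the names `build_index` and `search` and the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical page texts, as if already downloaded by a spider.
pages = {
    "http://example.com/a": "Tomatoes need sun and water.",
    "http://example.com/b": "Water your lawn in the morning sun.",
    "http://example.com/c": "Tomatoes are technically a fruit.",
}

index = build_index(pages)
print(search(index, "tomatoes sun"))
```

Because every word of every page is stored, a query can match text that the author never thought to list as a keyword, at the cost of a much larger database.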

Not all search engines are the same. Some use keyword extraction via meta-tags, while others index keywords from page content. Content indexing is generally the better method because you are more likely to find specific information. However, it has a number of problems. One is the sheer size of the resulting database and the number of pages involved (which means a lot of traffic, and it might take two weeks to fully search all those pages). As the size of the WWW continues to grow, this becomes more and more difficult.

Keeping the database up to date is a serious problem. It is common to find that pages returned by a search engine have in fact since been moved or deleted.
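One way a search engine might spot such stale entries is to re-request each indexed URL and look at the HTTP status code that comes back. The sketch below assumes a caller-supplied `fetch_status` function that returns the status code for a URL; that function, the names `classify` and `stale_entries`, and the example statuses are all hypothetical, and a lookup table stands in for real network requests.

```python
def classify(status: int) -> str:
    """Interpret an HTTP status code for index maintenance purposes."""
    if status in (301, 308):   # permanent redirects: the page has moved
        return "moved"
    if status in (404, 410):   # Not Found / Gone: the page was deleted
        return "deleted"
    return "ok"


def stale_entries(urls, fetch_status) -> dict[str, str]:
    """Return {url: verdict} for every indexed URL that is no longer valid."""
    report = {}
    for url in urls:
        verdict = classify(fetch_status(url))
        if verdict != "ok":
            report[url] = verdict
    return report


# Hypothetical status codes, standing in for real HTTP responses.
fake_statuses = {
    "http://example.com/a": 200,
    "http://example.com/b": 301,
    "http://example.com/c": 404,
}

report = stale_entries(fake_statuses, fake_statuses.get)
print(report)
```

Running such a check over the whole database means one more request per page, which is part of why keeping a large index fresh is expensive.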
