The good news about the most visible part of the Internet, the content pages of the World Wide Web, is that there are millions of available pages, waiting to show you information on an amazing variety of topics. The bad news about this content is that more than 50% of it is not even indexed by search engines.
When you need to find information about a particular subject, how do you know which pages to read? If you're like most people, you type the URL of one of the major search engines into your browser and start from there. Search engines are special sites on the net that are designed to help people find the pages stored on other websites. They rely on a brief list of critical operations to provide relevant web results when searchers use their system to find information. There are some basic differences in the ways various search engines work, but they all perform four basic tasks: crawling the web, indexing documents, processing queries, and ranking results.
A web crawler, also known as a spider or robot, is an automated program which browses the web in a methodical, automated manner. This process is called web crawling or spidering. Search engines run these automated programs, which use the hyperlink structure of the web to "crawl" the pages and documents that make up the Internet. Estimates are that search engines have crawled about 50% of the existing web documents.
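To make the idea of following the hyperlink structure concrete, here is a minimal sketch of a breadth-first crawler. It is only an illustration: the TOY_WEB dictionary, its URLs, and the names LinkExtractor and crawl are all made up for this example, and the "web" is held in memory so that no network access is needed. A real crawler would fetch pages over HTTP and respect politeness rules.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML). A real crawler would
# download each page over HTTP instead of reading from a dictionary.
TOY_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">Home</a> '
                            '<a href="http://example.com/missing">Gone</a>',
    "http://example.com/orphan": "No page links here.",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, web):
    """Visit pages breadth-first, starting at seed and following links."""
    seen = {seed}            # URLs already queued, so none is visited twice
    queue = deque([seed])
    crawled = []
    while queue:
        url = queue.popleft()
        html = web.get(url)
        if html is None:     # dead link: nothing to read, move on
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled


pages = crawl("http://example.com/", TOY_WEB)
print(pages)
```

In this toy setup the page at /orphan is never reached, because no other page links to it. That is one simple way a document can exist on the web without ever being crawled.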
After a page has been crawled, its content can be "indexed": saved in a database of documents that makes up a search engine's "index". This index has to be tightly managed, so that requests which must search and sort billions of documents can be completed in fractions of a second.
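One common way to organize such an index is an inverted index, which maps each term to the documents that contain it, so a lookup does not have to scan every document. The sketch below assumes that structure for illustration only; the text above does not say how any particular engine stores its index, and the names tokenize, build_index, and DOCS are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, keyed by an arbitrary document ID.
DOCS = {
    1: "Search engines crawl the web",
    2: "An index maps terms to documents",
    3: "Engines rank documents by relevance",
}

index = build_index(DOCS)
print(sorted(index["engines"]))
print(sorted(index["documents"]))
```

Looking up a term is then a single dictionary access, which hints at why a well-managed index can answer requests quickly even when the collection is very large.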
When a request for information comes into the search engine, it retrieves from its index all the documents that match the query. A document counts as a match if the terms or phrase are found on the page in the manner specified by the user.
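As a rough illustration of matching "in the manner specified by the user", the sketch below supports two hypothetical modes: every query term must appear somewhere on the page, or the query must appear as an exact phrase. The function names, the phrase flag, and the sample documents are assumptions made for this example, and for simplicity it scans the documents directly rather than consulting an index.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_terms(query, text):
    """True if every query term appears somewhere in the text."""
    terms = set(tokenize(query))
    return bool(terms) and terms <= set(tokenize(text))


def matches_phrase(phrase, text):
    """True if the phrase's words appear consecutively in the text."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def search(query, documents, phrase=False):
    """Return the IDs of documents that match the query in the chosen mode."""
    match = matches_phrase if phrase else matches_all_terms
    return [doc_id for doc_id, text in documents.items() if match(query, text)]


# Hypothetical sample documents.
DOCS = {
    "a": "Web crawlers follow links across the web",
    "b": "Follow the crawlers: links on the web matter",
    "c": "Ranking orders the results",
}

any_order = search("crawlers links", DOCS)
exact = search("follow links", DOCS, phrase=True)
print(any_order, exact)
```

Documents "a" and "b" both contain the two terms, but only "a" contains "follow links" as a consecutive phrase, so the same collection yields different matches depending on how the user phrases the request.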
Once the search engine has determined which results match the query, the engine's algorithm runs calculations on each of them to determine which is most relevant. The search engine's ranking system then lists the results in order from most relevant to least, so that the ones the engine considers best get the most prominent placement and users can choose which to select.
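The ordering step can be sketched with a deliberately simple relevance score: count how often the query terms occur in each document and sort by that count. This is an assumption chosen for illustration, not a description of any real engine's algorithm, which would weigh many more signals. The names score, rank, and DOCS are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, text):
    """Toy relevance score: total occurrences of the query terms in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in set(tokenize(query)))


def rank(query, documents):
    """Return IDs of matching documents, most relevant first."""
    scored = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    # Keep only documents that contain at least one query term.
    matches = [(s, doc_id) for s, doc_id in scored if s > 0]
    # Highest score first; ties are broken by document ID for a stable order.
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in matches]


# Hypothetical sample documents.
DOCS = {
    "a": "search engines rank search results",
    "b": "a search for the best results",
    "c": "gardening tips for spring",
}

ordered = rank("search results", DOCS)
print(ordered)
```

Document "a" mentions the query terms three times and "b" twice, so "a" is listed first, while "c" contains neither term and is left out entirely.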
Although the list of a search engine's operations is short, systems like Google, Yahoo!, AskJeeves and MSN are among the most complex, processing-intensive computers in the world, managing millions of calculations each second and handling requests for information from an enormous group of users.