A search engine is a web-based tool that enables users to locate information on the World Wide Web. Popular examples of search engines are Google, Yahoo!, and MSN Search. Search engines utilize automated software applications (referred to as robots, bots, or spiders) that travel along the Web, following links from page to page, site to site. The information gathered by the spiders is used to create a searchable index of the Web.
To simplify, think of a search engine as having two components. First, a spider (or web crawler) trawls the web for content, which is added to the search engine's index. Then, when a user queries the search engine, relevant results are returned based on the search engine's algorithm. Early search engines relied largely on on-page content, but as websites learned to game the system, algorithms became much more complex, and the results returned can now depend on hundreds of variables. There used to be many search engines with significant market share. Currently, Google and Microsoft's Bing control the vast majority of the market. (While Yahoo generates many queries, its back-end search technology is outsourced to Microsoft.)
How a Search Engine Works
Because large search engines contain millions and sometimes billions of pages, many search engines do not simply search the pages but also order the results by importance. This importance is commonly determined by various algorithms. As illustrated in the image below, the source of all search engine data is a spider or crawler, which automatically visits pages and indexes their contents.
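As a rough illustration of how a spider follows links from page to page, here is a minimal sketch in Python. It assumes a hypothetical in-memory "web" (the `TOY_WEB` dictionary and its example URLs are invented for illustration) rather than real HTTP requests, and it visits pages breadth-first; real crawlers use far more elaborate scheduling.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP instead.
TOY_WEB = {
    "a.example/": ("welcome to site a", ["a.example/about", "b.example/"]),
    "a.example/about": ("about site a", ["a.example/"]),
    "b.example/": ("site b home page", ["a.example/about", "c.example/missing"]),
}

def crawl(seed, web, max_pages=100):
    """Visit pages breadth-first from seed, following links, and return
    the text of each page reached, keyed by URL."""
    seen = {seed}            # URLs already queued, so no page is visited twice
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url not in web:   # dead link: nothing to fetch
            continue
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

pages = crawl("a.example/", TOY_WEB)
print(sorted(pages))
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other, as the two pages on `a.example` do here.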
Once a page has been crawled, the data contained in the page is processed and indexed. Often, this can involve the steps below.
- Strip out stop words.
- Record the remaining words on the page and the frequency at which they occur.
- Record links to other pages.
- Record information about any images, audio, and embedded media on the page.
The data collected above is used to rank the page and is the primary means by which a search engine determines whether a page should be shown and in what order. Finally, once the data is processed, it is broken up into one or more files, moved to different computers, or loaded into memory where it can be accessed when a search is performed.
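The processing steps listed above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not how any particular engine does it: the stop-word list is a tiny made-up sample, and the names `PageParser` and `process_page` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A tiny illustrative stop-word list; real engines use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "on"}

class PageParser(HTMLParser):
    """Collects text, link targets, and image sources from HTML."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        self.text_parts.append(data)

def process_page(html):
    """Strip stop words, count the remaining words, and record links and images."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {"word_counts": counts, "links": parser.links, "images": parser.images}

record = process_page(
    '<p>The history of the search engine.</p>'
    '<a href="/crawler">Search crawler</a><img src="/spider.png">'
)
print(record["word_counts"].most_common(1))  # [('search', 2)]
```

Audio and other embedded media could be recorded the same way by handling additional tags in `handle_starttag`.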
Common Search Engines
In addition to Web search engines, other common types of search engines include the following:
- Local (or offline) Search Engine: Designed for searching offline, on a PC, a CD-ROM, or a LAN.
- Metasearch Engine: A search engine that queries other search engines and then combines the results received from all of them.
- Blog Search Engine: A search engine for the blogosphere. Blog search engines only index and provide search results from blogs.
Types of Search Engines
A search engine type is determined by how the information contained in its catalog or database is collected. There are three main types of search engine tools:
- Search directories or indexes
  - The sites in the catalog or database of a search directory or index are compiled by humans, not by an automated software program. The sites are submitted, then assigned to the appropriate category.
  - Some search directories or indexes do not consider content when adding pages to their catalog. Others collect, rate, or rank materials. Some search directories include annotations that evaluate, review, or otherwise describe the content.
  - Yahoo and LookSmart are examples of search directories or indexes.
- Hybrid search engines
  - Hybrid search engines present both crawler-based results and human-powered listings. Usually, a hybrid search engine will favor one type of listing over the other.
  - MSN Search is more likely to present human-powered listings over its crawler-based results, especially for more obscure queries.
- Meta search engines
  - A meta search engine is a tool that helps to locate information available via the WWW.
  - It provides a single interface that enables users to search many different search engines, indexes, and databases.
  - Therefore, meta search engines are capable of searching several search engine databases at once.
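To make the meta search idea concrete, here is a minimal Python sketch that queries several engines through one interface and merges their results. The two engine functions are hypothetical stand-ins that return fixed lists, and the merge rule (summing 1/rank across engines) is just one simple assumption; actual meta search engines may combine results quite differently.

```python
from collections import defaultdict

# Hypothetical stand-ins for real engines: each takes a query and returns
# an ordered list of URLs, best first. A real meta search engine would
# call remote services here.
def engine_a(query):
    return ["x.example", "y.example", "z.example"]

def engine_b(query):
    return ["y.example", "w.example"]

def metasearch(query, engines):
    """Query every engine and merge the results, scoring each URL by the
    sum of 1/rank over the engines that returned it (an assumed rule)."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

merged = metasearch("search engines", [engine_a, engine_b])
print(merged)  # ['y.example', 'x.example', 'w.example', 'z.example']
```

Here `y.example` comes out on top because both engines returned it, which is the kind of agreement a merge step can reward.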
Search Engine Technology
A search engine maintains the following processes in near real-time:
- Indexing: Indexing means associating words and other definable tokens found on web pages with their domain names and HTML-based fields. The associations are made in a public database, made available for web search queries. A query from a user can be a single word. The index helps find information relating to the query as quickly as possible. Some of the techniques for indexing and caching are trade secrets, whereas web crawling is a straightforward process of visiting all sites on a systematic basis.
- Searching: Typically, a query that a user enters into a search engine consists of a few keywords. The index already has the names of the sites containing the keywords, and these are instantly obtained from the index. The real processing load is in generating the web pages that make up the search results list: every page in the entire list must be weighted according to information in the indexes. Then each top search result item requires the lookup, reconstruction, and markup of the snippets showing the context of the keywords matched. These are only part of the processing that each page of search results requires, and further pages (after the first) require more of this post-processing.
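The two processes above can be sketched together in Python: an inverted index built ahead of time, then a query that looks up keywords, weights the matching pages, and produces a snippet for each result. Everything here is an assumption for illustration, including the sample `PAGES`, the function names, and the weighting (a plain count of keyword occurrences); production ranking is far more involved.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "a.example": "Web crawlers follow links to discover new pages.",
    "b.example": "A search index maps words to pages. The index makes search fast.",
    "c.example": "Cooking pasta requires boiling water.",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Inverted index: word -> {url: number of occurrences}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index

def snippet(text, keywords, width=30):
    """Return a stretch of text around the first matched keyword.
    Uses a simple substring match, which is enough for a sketch."""
    lowered = text.lower()
    positions = [p for p in (lowered.find(k) for k in keywords) if p >= 0]
    if not positions:
        return text[:width]
    start = max(min(positions) - width // 2, 0)
    return text[start:start + width]

def search(query, index, pages):
    """Look up each keyword, weight pages by total keyword occurrences,
    and return (url, score, snippet) tuples, best first."""
    keywords = tokenize(query)
    scores = defaultdict(int)
    for word in keywords:
        for url, count in index.get(word, {}).items():
            scores[url] += count
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, scores[url], snippet(pages[url], keywords)) for url in ranked]

index = build_index(PAGES)
results = search("search index", index, PAGES)
for url, score, text in results:
    print(url, score, text)
```

Note how the keyword lookup itself is only a dictionary access, while the weighting and snippet steps do the per-result work, which mirrors the point above about where the processing load lies.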