A search engine is an information retrieval system
designed to help find information stored on a
computer system. Search engines help to minimize
the time required to find information and the
amount of information which must be consulted,
akin to other techniques for managing information
overload.

The most public, visible form of a search engine is
a Web search engine, which searches for information
on the World Wide Web. Search engine technology also
underpins Wikipedia, an online, searchable encyclopedia
that has changed how people seek information.
The popular web browser Firefox has an add-on
that installs the Wikipedia search engine directly
into its Search Bar.

How search engines work

Search engines provide an interface to a group of items
that enables users to specify criteria about an
item of interest and have the engine find the
matching items. The criteria are referred to as
a search query. In the case of text search engines,
the search query is typically expressed as a set
of words that identify the desired concept that
one or more documents may contain. There are several
styles of search query syntax that vary in strictness.
It can also switch names within the search engines
from previous sites. Whereas some text search
engines require users to enter two or three words
separated by white space, other search engines
may enable users to specify entire documents,
pictures, sounds, and various forms of natural
language. Some search engines apply improvements
to search queries to increase the likelihood of
providing a quality set of items through a process
known as query expansion.
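
As a rough illustration of the query handling described above, the following Python sketch splits a query into words separated by white space and then expands it with synonyms. The `SYNONYMS` table and the function names are hypothetical; a real engine would draw its expansions from a thesaurus or from usage data rather than a hand-written dictionary.

```python
# Hypothetical synonym table, invented for this sketch only.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}


def parse_query(query):
    """Split a query into lowercase terms separated by white space."""
    return query.lower().split()


def expand_query(terms):
    """Return each original term followed by its known synonyms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query(parse_query("Cheap car rental")))
```

Terms without an entry in the table, such as "rental" here, pass through unchanged, so expansion can only widen the set of matching items.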

The list of items that meet the criteria specified
by the query is typically sorted, or ranked, in
some regard so as to place the most relevant items
first. Ranking items by relevance (from highest
to lowest) reduces the time required to find the
desired information. Probabilistic search engines
rank items based on measures of similarity (between
each item and the query, typically on a scale
of 1 to 0, 1 being most similar) and sometimes
popularity or authority (see Bibliometrics) or
use relevance feedback. Boolean search engines
typically only return items which match exactly
without regard to order, although the term Boolean
search engine may simply refer to the use of Boolean-style
syntax (the use of operators AND, OR, NOT, and
XOR) in a probabilistic context.
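
The two ranking styles can be contrasted in a short Python sketch. The text does not name a particular similarity measure, so cosine similarity over raw term counts is used here purely as an assumption; it happens to yield scores between 0 and 1, with 1 being most similar. The Boolean function implements only an implicit AND over the query terms, and all function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on white space."""
    return text.lower().split()


def cosine_similarity(query, document):
    """Score a document against a query on a scale of 0 to 1 using term counts."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(document))
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return (score, document) pairs with a non-zero score, most similar first."""
    scored = [(cosine_similarity(query, doc), doc) for doc in documents]
    return sorted(
        (pair for pair in scored if pair[0] > 0),
        key=lambda pair: pair[0],
        reverse=True,
    )


def boolean_and(query, documents):
    """Return, in their original order, the documents containing every query term."""
    terms = set(tokenize(query))
    return [doc for doc in documents if terms <= set(tokenize(doc))]
```

The ranked version orders partial matches by score, whereas the Boolean version either includes a document or leaves it out and imposes no order of its own.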

To provide a set of matching items that are sorted
according to some criteria quickly, a search engine
will typically collect metadata about the group
of items under consideration beforehand through
a process referred to as indexing. The index typically
requires a smaller amount of computer storage,
which is why some search engines only store the
indexed information and not the full content of
each item, and instead provide a method of navigating
to the items in the search engine result page.
Alternatively, the search engine may store a copy
of each item in a cache so that users can see
the state of the item at the time it was indexed
or for archive purposes or to make repetitive
processes work more efficiently and quickly.
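
A common way to realise such an index is an inverted index, which maps each term to the items that contain it. The Python sketch below assumes items are plain strings keyed by a hypothetical identifier (for example a URL or file path); it stores only the identifiers, not the full content, matching the storage-saving variant described above.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical documents, invented for this sketch.
    documents = {
        "a.html": "Archie indexed file names",
        "b.html": "Veronica indexed Gopher menu titles",
    }
    index = build_index(documents)
    print(sorted(lookup(index, "indexed gopher")))
```

Because the index is built beforehand, answering a query requires only a few set intersections rather than a scan of every item.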

Other types of search engines do not store an index.
Crawler-type, or spider-type, search engines (also known as
real-time search engines) may collect and assess
items at the time of the search query, dynamically
considering additional items based on the contents
of a starting item (known as a seed, or seed URL
in the case of an Internet crawler). Meta search
engines store neither an index nor a cache and
instead simply reuse the index or results of one
or more other search engines to provide an aggregated,
final set of results.
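
Both index-free approaches can be sketched in a few lines of Python. The `PAGES` dictionary is a hypothetical in-memory stand-in for fetching real pages, and the merging rule (summing reciprocal ranks) is only one simple choice among many, not a method named in the text.

```python
from collections import deque

# Hypothetical "web": page id -> (page text, outgoing links).
PAGES = {
    "seed": ("search engine overview", ["a", "b"]),
    "a": ("gopher menu titles", ["b"]),
    "b": ("web search engine history", ["seed", "c"]),
    "c": ("search results cache", []),
}


def crawl_search(seed, term, max_pages=10):
    """Visit pages breadth-first from the seed and return those containing the term."""
    visited, queue, matches = set(), deque([seed]), []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited or page not in PAGES:
            continue
        visited.add(page)
        text, links = PAGES[page]
        if term in text.split():
            matches.append(page)
        queue.extend(links)  # dynamically consider the pages this one links to
    return matches


def merge_results(result_lists):
    """Aggregate ranked lists from several engines by summing reciprocal ranks."""
    scores = {}
    for results in result_lists:
        for position, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / position
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

The `max_pages` limit matters in the crawler case: without an index, every query pays the cost of visiting pages, so the traversal has to be bounded.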

History of popular Web search engines

The very first tool used for searching on the Internet
was Archie. The name stands for "archive"
without the "v". It was created in
1990 by Alan Emtage, a student at McGill University
in Montreal. The program downloaded the directory
listings of all the files located on public anonymous
FTP (File Transfer Protocol) sites, creating a
searchable database of file names; however, Archie
did not index the contents of these files.

The rise of Gopher (created in 1991 by Mark McCahill
at the University of Minnesota) led to two new
search programs, Veronica and Jughead. Like Archie,
they searched the file names and titles stored
in Gopher index systems. Veronica (Very Easy Rodent-Oriented
Net-wide Index to Computerized Archives) provided
a keyword search of most Gopher menu titles in
the entire Gopher listings. Jughead (Jonzy's Universal
Gopher Hierarchy Excavation And Display) was a
tool for obtaining menu information from specific
Gopher servers. While the name of the search engine
"Archie" was not a reference to the
Archie comic book series, "Veronica"
and "Jughead" are characters in the
series, thus referencing their predecessor.

Timeline

Note: "Launch" refers only to web availability of original
crawl-based web search engine results.

Year  Engine           Event
1993  Aliweb           Launch
1994  WebCrawler       Launch
      JumpStation      Launch
      Infoseek         Launch
      Lycos            Launch
1995  AltaVista        Launch (part of DEC)
      Excite           Launch
1996  Dogpile          Launch
      Inktomi          Founded
      HotBot           Founded
      Ask Jeeves       Founded
1997  Northern Light   Launch
1998  Google           Launch
1999  AlltheWeb        Launch
      Naver            Launch
      Teoma            Founded
      Vivisimo         Founded
2000  Baidu            Founded
2003  Info.com         Launch
2004  Yahoo! Search    Final launch
      A9.com           Launch
2005  MSN Search       Final launch
      Ask.com          Launch
      AskMeNow         Launch
      Lexxe.com        Founded
2006  wikiseek         Founded
      Quaero           Founded
      Ask.com          Launch
      Live Search      Launch
      ChaCha           Beta Launch
      Quintura         Beta Launch
      Guruji.com       Beta Launch
2007  wikiseek         Launched
      AskWiki          Launched

The first Web search engine was Wandex, a now-defunct
index collected by the World Wide Web Wanderer,
a web crawler developed by Matthew Gray at MIT
in 1993. Another very early search engine, Aliweb,
also appeared in 1993, and still runs today. JumpStation
(released in early 1994) used a crawler to find
web pages for searching, but search was limited
to the title of web pages only. One of the first
"full text" crawler-based search engines
was WebCrawler, which came out in 1994. Unlike
its predecessors, it let users search for any
word in any webpage, which became the standard
for all major search engines since. It was also
the first one to be widely known by the public.
Also in 1994 Lycos (which started at Carnegie
Mellon University) was launched, and became a
major commercial endeavor.

Soon after, many search engines appeared and vied for
popularity. These included Excite, Infoseek, Inktomi,
Northern Light, and AltaVista. In some ways, they
competed with popular directories such as Yahoo!.
Later, the directories integrated or added on
search engine technology for greater functionality.

Search engines were also known as some of the brightest
stars in the Internet investing frenzy that occurred
in the late 1990s. Several companies entered the
market spectacularly, receiving record gains during
their initial public offerings. Some, such as Northern
Light, have taken down their public search engine and now
market enterprise-only editions.

Google

Around 2001, the Google search engine rose to prominence.
Its success was based in part on the concept of
link popularity and PageRank. The number of other
websites and webpages that link to a given page
is taken into consideration with PageRank, on
the premise that good or desirable pages are linked
to more than others. The PageRank of linking pages
and the number of links on these pages contribute
to the PageRank of the linked page. This makes
it possible for Google to order its results by
how many websites link to each found page. Google's
minimalist user interface is very popular with
users, and has since spawned a number of imitators.
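
The idea that a page's rank depends on the rank of the pages linking to it, divided among the links on each of those pages, can be sketched as a simple iterative computation. The Python below is a minimal illustration and not Google's implementation: the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are all assumptions of this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page web: "a" and "b" link to "c", and "c" links back to "a".
    for page, score in sorted(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}).items()):
        print(page, round(score, 3))
```

In this small example, "c" ends up with the highest rank because two pages link to it, and "a" ranks above "b" because its single inbound link comes from the highly ranked "c".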

Google and most other web search engines utilize not only PageRank
but more than 150 criteria to determine relevancy.
The algorithm "remembers" where it has
been and indexes the number of cross-links and
relates these into groupings. PageRank is based
on citation analysis that was developed in the
1950s by Eugene Garfield at the University of
Pennsylvania. Google's founders cite Garfield's
work in their original paper. In this way virtual
communities of webpages are found. Teoma's search
technology uses a communities approach in its
ranking algorithm. NEC Research Institute has
worked on similar technology. Web link analysis
was first developed by Jon Kleinberg and his team
while working on the CLEVER project at IBM's Almaden
Research Center. Google is currently the most
popular Web search engine.

Yahoo! Search

The two founders of Yahoo!, David Filo and Jerry Yang,
Ph.D. candidates in Electrical Engineering at
Stanford University, started their guide in a
campus trailer in February 1994 as a way to keep
track of their personal interests on the Internet.
Before long they were spending more time on their
home-brewed lists of favourite links than on their
doctoral dissertations. Eventually, Jerry and
David's lists became too long and unwieldy, and
they broke them out into categories. When the
categories became too full, they developed subcategories
... and the core concept behind Yahoo! was born.
In 2002, Yahoo! acquired Inktomi and in 2003,
Yahoo! acquired Overture, which owned AlltheWeb
and AltaVista. Despite owning its own search engine,
Yahoo! initially kept using Google to provide
its users with search results on its main website
Yahoo.com. However, in 2004, Yahoo! launched its
own search engine based on the combined technologies
of its acquisitions and providing a service that
gave pre-eminence to the Web search engine over
the directory.

Microsoft

The most recent major search engine is MSN Search
(evolved into Live Search), owned by Microsoft,
which previously relied on others for its search
engine listings. In 2004, it debuted a beta version
of its own results, powered by its own web crawler
(called msnbot). In early 2005, it started showing
its own results live, and ceased using results
from Inktomi, now owned by Yahoo!. In 2006, Microsoft
migrated to a new search platform, Live Search,
retiring the "MSN Search" name in the
process.

Baidu

Baidu was launched in 2000 and is the leading Chinese
search engine, providing an index of over 740
million web pages, 80 million images, and 10 million
multimedia files. Its interface is very similar
to Google's. (Credit: Wikipedia.)