There are two distinct methods used for searching the Web: indexes (directories) and search engines.
Indexes/Directories
Indexes or directories typically require a staff of researchers to collect and catalogue links “by hand” (e.g. Yahoo uses a staff of cataloguers who basically surf all day and collate their results).
To conduct a search, the client enters a search term or phrase and is then returned a list of categories that fit the search term(s) given. The client chooses a category and is shown its subcategories; this process continues until the client reaches a level where the actual links are displayed. There are four general types of indexes.
By Subject
Annotated list organised by a human indexer according to subject criteria. The list is organised in tree format with a drill-down depth greater than two levels (e.g. Yahoo).
Geographical
Usually regional, with a weaker association of links.
Specialized
Restricted to one or a few disciplines, with only one main level of classification (with some exceptions).
Directories of Directories
Websites providing catalogues of indexes, organised and sometimes ranked (e.g. Yahoo or the Virtual Library, www.vlib.org).
Search Engines
In contrast, search engines collect their results in an entirely different manner.
With “pure” search engines (ones that don’t use a hybrid approach combining multiple link-gathering techniques), a database is generated by a software robot which indexes the full text of Web sites by following links from page to page. The robot typically runs on a dedicated server site with one or more, usually powerful, computers running specially designed search software (able to support advanced and complex search strategies with secret weighting algorithms). An example is Alta Vista (www.altavista.com, www.av.com).
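The way a robot builds its database by following links can be sketched roughly as below. This is only an illustration under stated assumptions: the `WEB` dictionary is a made-up, in-memory stand-in for real pages, and the function name `crawl` is hypothetical. A real robot fetches pages over the network and is far more elaborate.

```python
from collections import deque

# A made-up, in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://a.example/": ("welcome to site a", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("site b talks about vrml", ["http://a.example/"]),
    "http://c.example/": ("site c talks about search", []),
}

def crawl(start_url, web):
    """Visit pages breadth-first and build a word -> set of URLs index."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = web.get(url, ("", []))
        # Index the full text of the page, word by word.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Follow links to pages that have not been visited yet.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("http://a.example/", WEB)
```

Looking up `index["vrml"]` then returns the set of pages containing that word, which is the basic lookup a full-text engine performs before ranking.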
Relevancy Methods
At their core, the major search engines use the location/frequency method of determining relevancy. Searching for a specific phrase will return pages primarily ranked by where and how often those words appear in each document.
To be more specific, a page titled "Bill Clinton's Life" is likely to be considered more relevant than others where the title tag doesn't mention the US president's name. That's an example of how the location of a term can be important. Similarly, a page that repeatedly mentions Bill Clinton probably will get more of a boost than a page with only one reference.
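The location/frequency idea can be illustrated with a small Python sketch. The function name `location_frequency_score` and the weight given to a title match are assumptions chosen for illustration; they are not any engine's actual formula.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def location_frequency_score(query, title, body, title_weight=5):
    """Score a page by where and how often the query words appear.

    title_weight is an arbitrary illustrative value: a hit in the
    title counts for more than a hit in the body.
    """
    query_words = set(tokenize(query))
    title_hits = sum(1 for w in tokenize(title) if w in query_words)
    body_hits = sum(1 for w in tokenize(body) if w in query_words)
    return title_weight * title_hits + body_hits

pages = [
    ("US Presidents", "The list includes Bill Clinton."),
    ("Bill Clinton's Life", "Bill Clinton was born in Arkansas. Clinton served two terms."),
]
ranked = sorted(
    pages,
    key=lambda page: location_frequency_score("Bill Clinton", *page),
    reverse=True,
)
```

With these sample pages, the one whose title names the president and mentions him repeatedly sorts ahead of the page with a single reference.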
Multisearchers
There are several sites that offer the possibility of searching several engines at the same time. These are sometimes sophisticated front ends which pass queries to other search engines or, in the case of the newer metasearch engines, use a more sophisticated method of evaluating the relevancy of the links returned from the other search engines and reorganizing the list presented to the user in terms of the accuracy of the results (e.g. Google).
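One simple way a front end could reorganize the lists returned by several engines is sketched below. The scoring rule (each URL earns 1/rank from every engine that returned it) is an assumption picked for illustration, not the method of any particular metasearch site, and `merge_results` and the example URLs are hypothetical.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one list.

    Each URL earns 1/rank from every engine that returned it, so URLs
    that rank highly in several engines rise to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))

engine_a = ["a.example/1", "b.example/2", "c.example/3"]
engine_b = ["b.example/2", "d.example/4", "a.example/1"]
merged = merge_results([engine_a, engine_b])
```

Here `b.example/2` ends up first because both engines rank it highly, while links returned by only one engine fall toward the end.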
General Problems
Most crawler-based (robot) search engines have problems with dynamically generated pages or pages with frames.
Since our project uses Alta Vista as the search engine that generates the results used by the VRML world, I have decided to focus on how search engines operate and, more specifically, on Alta Vista.
Alta Vista
AltaVista is one of the three largest search engines. It has two search modes: Simple Search and Advanced Search. I have chosen to give the details since it is the current basis for our project.
Databases:
Simple Search: results can come from any of the following databases: Suggested Relevant Searches (a database of more specific phrases); RealNames Internet Keywords; Ask Jeeves question and single answer database; Open Directory subject directory; and the main, very large, AltaVista database of millions of indexed Web pages.
AltaVista Web Page Index
Matches from the AltaVista web page index make up the bulk of the search results page. These matches are numbered.
Advanced Search: only uses the main AltaVista database of millions of indexed Web pages.
Strengths:
* Very large database
* Powerful search features
* International coverage, interfaces, and foreign language handling
Weaknesses:
* Inconsistent results
* Can display only ten hits at a time
* No sorting other than by relevance rank
* Inability to provide the exact number of hits
AltaVista allows two types of searches: simple and advanced. Simple Search yields at most 210 displayed records; Advanced Search can display up to 1010 records. There is also an option to display only the number of hits for a search.
Meta Tags & AltaVista
AltaVista considers words in both the description and keywords meta tags to be additional words on the page, just as if they appeared on the page in ordinary text.
AltaVista will index the description and keywords up to a limit of 1,024 characters. It doesn't specify whether that's 1,024 characters per tag, or whether coding outside the content quotes is counted. Assume the worst and you'll be safe.
Up to 150 characters of text from the description meta tag are displayed.
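A page author could check a page against these limits with a short script such as the following sketch. It takes the conservative reading that both tags share a single 1,024-character budget, which is an assumption, since the limit's exact scope is not documented. The names `MetaTagParser` and `check_meta_tags` are hypothetical.

```python
from html.parser import HTMLParser

INDEX_LIMIT = 1024    # indexing limit quoted above
DISPLAY_LIMIT = 150   # description display limit quoted above

class MetaTagParser(HTMLParser):
    """Collect the content of the description and keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("description", "keywords"):
            self.meta[name] = attrs.get("content") or ""

def check_meta_tags(html):
    """Report the combined tag length and the description text that would show."""
    parser = MetaTagParser()
    parser.feed(html)
    description = parser.meta.get("description", "")
    keywords = parser.meta.get("keywords", "")
    combined = len(description) + len(keywords)
    return {
        "combined_length": combined,
        # Conservative assumption: both tags share one 1,024-character budget.
        "within_index_limit": combined <= INDEX_LIMIT,
        "displayed_description": description[:DISPLAY_LIMIT],
    }

page = (
    '<html><head>'
    '<meta name="description" content="A guide to search engines.">'
    '<meta name="keywords" content="search, engines, altavista">'
    '</head></html>'
)
report = check_meta_tags(page)
```

The script counts only the text inside the content quotes; if the surrounding coding is also measured, the real margin is smaller than `combined_length` suggests.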
The default Boolean operation
Simple Search: For multiple terms entered with no special markings or operators, the Simple Search will run a phrase search if the terms are recognized as a phrase. If it does not recognize a phrase, it processes multiple terms with an automatic OR operation.
Advanced: If no operators are used in the Boolean expression box, AltaVista interprets the search as a phrase search even if it does not match a phrase in their database of phrases. If multiple terms are entered in the ranking keywords box only, they are processed as an OR operation.
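The default behaviour described above can be summarized as a small decision procedure. The sketch below is a simplification under stated assumptions: `known_phrases` stands in for AltaVista's phrase database, only the word operators are detected, and the function names are hypothetical.

```python
OPERATORS = {"and", "or", "not", "near"}

def interpret_simple_query(query, known_phrases):
    """Simple Search: phrase search if the terms form a known phrase, else OR."""
    terms = query.lower().split()
    phrase = " ".join(terms)
    if len(terms) > 1 and phrase in known_phrases:
        return ("PHRASE", [phrase])
    return ("OR", terms)

def interpret_advanced_query(boolean_box="", ranking_box=""):
    """Advanced Search: operator-free Boolean box means phrase; ranking box alone means OR."""
    boolean_terms = boolean_box.lower().split()
    if boolean_terms:
        if OPERATORS.isdisjoint(boolean_terms):
            # No operators: treated as a phrase even if it is not a known phrase.
            return ("PHRASE", [" ".join(boolean_terms)])
        return ("BOOLEAN", boolean_terms)
    # Only the ranking keywords box was filled in.
    return ("OR", ranking_box.lower().split())
```

For example, `interpret_simple_query("New York", {"new york"})` yields a phrase search, while the same call with `"red bicycle"` falls back to an OR of the two terms.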
Relevancy
AltaVista's help files list these factors as the main ones used to rank pages:
1. How many times query words appear.
2. Whether the query words are in the title or meta tags.
3. The proximity of query words in the document.
Past guidelines have provided a few more details:
1. All search terms appear, rather than just some.
2. Search terms appear in the title, meta tags, and the first few lines of the page.
3. Search terms appear near each other in the document.
4. Search terms appear frequently in the document.
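Two of the factors listed above, whether all search terms appear and how near they are to each other, can be measured as in the following sketch. How AltaVista actually combines such measurements into a rank is not published, so the code only computes the raw signals; the function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def all_terms_present(query, text):
    """True if every query term occurs somewhere in the text."""
    return set(tokenize(query)) <= set(tokenize(text))

def smallest_window(query, text):
    """Length in words of the shortest span containing every query term.

    Returns None if some term is missing. A smaller window means the
    terms appear nearer to each other.
    """
    terms = set(tokenize(query))
    last_seen = {}   # term -> most recent position
    best = None
    for position, word in enumerate(tokenize(text)):
        if word in terms:
            last_seen[word] = position
            if len(last_seen) == len(terms):
                span = position - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best

text = "a search for the best engine and search engine tips"
window = smallest_window("search engine", text)
```

In the sample text the two terms first occur five words apart, but the later adjacent occurrence gives a window of two words, which is the value returned.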
Content
Default Operation: In both modes, single terms are searched exactly as entered; however, in any term containing a punctuation mark or symbol, the punctuation mark or symbol is removed and replaced with a space, and the resulting string is searched as a phrase. Thus, a search on cd-rom is equivalent to searching "cd rom".
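The punctuation rule can be mimicked with a regular expression, as in this minimal sketch. The function name `normalize_term` is hypothetical, and treating the underscore as a symbol is an assumption.

```python
import re

def normalize_term(term):
    """Replace punctuation and symbols with spaces; quote multi-word results."""
    cleaned = re.sub(r"[^\w\s]|_", " ", term)
    words = cleaned.split()
    if len(words) > 1:
        # The term broke into several words, so it is searched as a phrase.
        return '"' + " ".join(words) + '"'
    return " ".join(words)
```

Calling `normalize_term("cd-rom")` returns the quoted phrase `"cd rom"`, while a term without punctuation such as `altavista` is returned unchanged.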
Accents only matter if they are entered by the user. Interse will match Interse and Intersé, while Intersé will only match Intersé.
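The accent rule amounts to an asymmetric comparison, which the following sketch reproduces using Unicode normalization. It is an illustration of the described behaviour, not AltaVista's implementation, and the function names are hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, e.g. turning an accented e into a plain e."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def term_matches(query_term, page_word):
    """Accents matter only when the query itself contains one."""
    query_term = unicodedata.normalize("NFC", query_term)
    page_word = unicodedata.normalize("NFC", page_word)
    if strip_accents(query_term) != query_term:
        # The user typed an accent, so require an exact match.
        return query_term == page_word
    # Unaccented query: match the page word with its accents removed.
    return query_term == strip_accents(page_word)
```

With this function, `term_matches("Interse", "Intersé")` is true but `term_matches("Intersé", "Interse")` is false, matching the example above.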
Comments are not indexed. Also, only the first 100K of text on a page is indexed. After that, only links are indexed, up to a maximum of 4MB. Pages heavy with text in a small font size may not get listed. AltaVista will display about 150 characters for descriptions. If no meta description tag exists, then AltaVista will use the first text it finds on the page, not including ALT text.
Stop Words
AltaVista will ignore certain words in a query if they appear too many times in the entire index. That includes words like "web" and "internet". In Simple Search, common words are ignored; a count of the occurrences of each word is given, along with a note as to which words were ignored. Advanced Search has no stop words, although the AltaVista operators "and", "or", "and not", and "near" need to be entered with phrase marking (in quotation marks) to be searched as ordinary words.
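Frequency-based stop words of this kind can be sketched as a threshold on index counts. The counts and the threshold below are invented purely for illustration, and `filter_stop_words` is a hypothetical name; AltaVista's real cut-off is not documented here.

```python
def filter_stop_words(query_terms, index_counts, threshold):
    """Split query terms into kept and ignored, reporting each term's index count."""
    kept, ignored = [], []
    for term in query_terms:
        count = index_counts.get(term.lower(), 0)
        if count > threshold:
            ignored.append((term, count))   # too common to be useful
        else:
            kept.append((term, count))
    return kept, ignored

# Made-up occurrence counts, for illustration only.
index_counts = {"web": 90_000_000, "internet": 75_000_000, "vrml": 120_000}
kept, ignored = filter_stop_words(["vrml", "web"], index_counts, threshold=50_000_000)
```

Returning the counts alongside the terms mirrors the Simple Search display, which shows how often each word occurs and notes which ones were ignored.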
Language Detection
AltaVista automatically categorizes web pages by language. Its spider tries to determine the language of a web page at the time it is spidered. The technology is dictionary-based: AltaVista looks at a page to see if the bulk of the words match those of a particular language.
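A dictionary-based detector can be sketched in a few lines. The word lists below are tiny illustrative samples, not real dictionaries, and the rule used (pick the language with the most matching words) is a simplification of the "bulk of the words" test; `detect_language` is a hypothetical name.

```python
import re

# Tiny illustrative word lists; a real system would use full dictionaries.
DICTIONARIES = {
    "english": {"the", "and", "of", "is", "to", "in"},
    "french": {"le", "la", "et", "de", "est", "les"},
    "german": {"der", "die", "und", "das", "ist", "nicht"},
}

def detect_language(text):
    """Return the language whose dictionary matches the most words, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return None
    scores = {
        language: sum(1 for word in words if word in vocabulary)
        for language, vocabulary in DICTIONARIES.items()
    }
    best = max(scores, key=scores.get)
    # No dictionary matched any word, so make no guess.
    return best if scores[best] > 0 else None
```

For instance, `detect_language("Le chat est dans le jardin et la maison")` returns `"french"`, while a page with no recognizable words returns `None`.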
Mirror Sites
AltaVista maintains a network of mirror sites outside the United States. These are meant to be identical copies of the main AltaVista index, but in fact all but one of the mirrors list fewer pages than the main service.
This is important to search engine users outside the US. Mirrors may provide a slightly faster response, due to their closer physical proximity, but they are not as comprehensive as the main service. As a consequence, they can display substantially different results from the main index.
AltaVista's Canadian mirror is the exception. Recently launched, the mirror includes 12 million more Canadian web pages than the main service contains. Canadians, and those looking for information about Canada, should certainly compare AltaVista Canada to the main service, to see if the extra pages produce better results.
Sorting
Simple Search: If the query matches a question in the Ask Jeeves database, the matched questions and answers are given at the top (or at the bottom, depending on the query). Then, if the query matches one or more records in the RealNames Internet Keyword database, those results are given next. Then, the actual Web database results are ranked by AltaVista's relevance ranking formula. Relevance is determined by location of the terms (such as in the title), proximity of multiple search terms to each other, and the frequency of the terms. Open Directory categories may be listed at the bottom under an AltaVista Recommends: heading.
Advanced Search: The results are NOT sorted unless one or more terms are included in the ranking keywords box. There is no apparent order to the list when the ranking keywords box is empty, although the order may remain consistent on subsequent searches.
Both: As of Oct. 1999, all results from both the Simple and Advanced Searches are sorted by site. This clustering is based on the exact host name. In other words, name.com and www.name.com would be two separate clusters. AltaVista has no option for unclustering.
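Clustering on the exact host name can be reproduced with the standard URL parser, as in the sketch below. The function name `cluster_by_host` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def cluster_by_host(urls):
    """Group URLs by exact host name, keeping the order of first appearance."""
    clusters = {}
    for url in urls:
        host = urlparse(url).hostname or ""
        clusters.setdefault(host, []).append(url)
    return clusters

clusters = cluster_by_host([
    "http://name.com/a",
    "http://www.name.com/b",
    "http://name.com/c",
])
```

Because the comparison is on the full host name, the two `name.com` pages share a cluster while the `www.name.com` page forms its own.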
Display: AltaVista displays the title, URL, file size, date, language, and a two-line extract for each hit.