DS 420 Project Write-up

Joe Gange

 

There are two distinct methods used for searching the web:

Indexes/Directories

Indexes or directories typically require a staff of researchers to collect and catalogue links “by hand” (e.g. Yahoo uses a staff of catalogers who basically surf all day and collate their results).
To conduct a search, the client enters a search term or phrase and is then returned a list of categories that fit the search term(s) given.  This process continues, one category level at a time, until the client reaches a level where the actual links are displayed.  There are four general types of indexes.

By Subject
  Annotated list organised by a human indexer according to subject criteria. The list is organised in tree format with a drill-down depth greater than two, e.g. Yahoo
Geographical
  Usually regional, with a weaker association between links
Specialized
  Restricted to one or a few disciplines, with only one main level of classification (with some exceptions)
Directories of Directories
  Websites providing catalogues of indices, organised and sometimes ranked,
  e.g. Yahoo or Virtual Library (www.vlib.org)
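
The drill-down search described above can be pictured with a small sketch. This is only an illustration under assumptions: the DIRECTORY tree, its category names, the example.org links, and both function names are hypothetical and do not describe how Yahoo or any real directory is implemented.

```python
# Hypothetical directory tree: nested dicts are categories, lists are links.
DIRECTORY = {
    "Science": {
        "Astronomy": ["http://example.org/telescopes"],
        "Biology": {"Genetics": ["http://example.org/dna"]},
    },
    "Recreation": {"Astronomy Clubs": ["http://example.org/clubs"]},
}


def matching_categories(tree, term, path=()):
    """Return the path of every category whose name contains the term."""
    hits = []
    for name, child in tree.items():
        here = path + (name,)
        if term.lower() in name.lower():
            hits.append(here)
        if isinstance(child, dict):
            # Keep looking in the sub-categories below this one.
            hits.extend(matching_categories(child, term, here))
    return hits


def drill_down(tree, path):
    """Follow a category path; return sub-category names or the links."""
    node = tree
    for name in path:
        node = node[name]
    return sorted(node) if isinstance(node, dict) else node
```

A client would first call matching_categories with the search term, pick one of the returned paths, and then call drill_down repeatedly until a list of links comes back instead of more category names.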

Search Engines

In contrast, search engines collect their results in an entirely different manner.
With “pure” search engines (ones that don’t use a hybrid approach combining multiple link-gathering techniques), a database is generated by a software robot which indexes the full text of Web sites by following links from page to page.  The robot typically runs on a dedicated server site with one or more, usually powerful, computers running specially designed search software (able to support advanced and complex search strategies with secret weighting algorithms), e.g. Alta Vista (www.altavista.com, www.av.com).
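
To make the robot idea concrete, here is a minimal sketch of a crawler that follows links from page to page and builds a full-text index. It is an assumption-laden toy: the WEB dictionary stands in for real HTTP fetches, and the names PageParser and crawl are hypothetical. It is not Alta Vista's actual robot, whose algorithms are not public.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the visible words and the href targets of one page."""

    def __init__(self):
        super().__init__()
        self.words, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(fetch, start_url):
    """Breadth-first crawl; returns an inverted index: word -> set of URLs."""
    index, seen, queue = {}, {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical in-memory "web" used instead of real network requests.
WEB = {
    "a.html": '<h1>Home</h1><a href="b.html">next</a>',
    "b.html": '<p>full text search</p><a href="a.html">back</a>',
}
index = crawl(WEB.get, "a.html")
```

The seen set is what keeps the robot from looping forever when two pages link to each other, as a.html and b.html do here.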

Relevancy Methods

At their core, the major search engines use the location/frequency method of determining relevancy. Searching for a specific phrase will return pages primarily ranked by where and how often those words appear in each document.

To be more specific, a page titled "Bill Clinton's Life" is likely to be considered more relevant than others where the title tag doesn't mention the US president's name. That's an example of how the location of a term can be important.  Similarly, a page that repeatedly mentions Bill Clinton probably will get more of a boost than a page with only one reference.
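
The Bill Clinton example can be sketched as a tiny scoring function. The formula and the title weight of 5.0 are assumptions chosen only for illustration; real engines keep their weighting secret, and the function names here are hypothetical.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def location_frequency_score(query, title, body, title_weight=5.0):
    """Location: bonus for each query term in the title.
    Frequency: one point for each occurrence of a query term in the body."""
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for term in tokenize(query):
        if term in title_words:
            score += title_weight
        score += body_words.count(term)
    return score


biography = location_frequency_score(
    "Bill Clinton",
    "Bill Clinton's Life",
    "Bill Clinton was born in Arkansas. Clinton served two terms.",
)
passing_mention = location_frequency_score(
    "Bill Clinton",
    "Presidents of the US",
    "Bill Clinton was the 42nd president.",
)
```

With these inputs the biography page scores higher than the page with a single passing mention, which is the behaviour the location/frequency method is meant to produce.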

Multisearchers

There are several sites that offer the possibility of searching several engines at the same time.  These are sometimes sophisticated front ends which pass queries to other search engines or, in the case of the newer metasearch engines, use a more sophisticated method of evaluating the relevancy of the links returned by the other search engines and reorganizing the list presented to the user in order of the accuracy of the results (e.g. Google).
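
One simple way a multisearcher could reorganize the lists it gets back is sketched below. The reciprocal-rank scoring used here is an assumption picked for illustration, not the documented method of any particular metasearch engine, and merge_results is a hypothetical name.

```python
def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one list.

    Each URL earns 1/(position) from every engine that returned it, so
    links ranked highly by several engines rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results):
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


merged = merge_results([
    ["a.com", "b.com", "c.com"],  # results from a first engine
    ["b.com", "a.com"],           # results from a second engine
    ["c.com", "b.com"],           # results from a third engine
])
```

Here b.com comes first because all three engines returned it near the top, even though only one engine ranked it first.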

General Problems

Most crawler-based (robot) search engines have problems with dynamically generated pages or pages with frames.

Since our project utilizes Alta Vista as the search engine that generates the results used by the VRML world, I have decided to focus on how search engines operate and, more specifically, on Alta Vista.

Alta Vista

AltaVista is one of the three largest search engines. It has two search modes: Simple Search and Advanced Search.  I have chosen to give the details since it is the current basis for our project.

Databases:

Simple Search:  results can come from any of the following databases: Suggested Relevant Searches (a database of more specific phrases);   RealNames Internet Keywords; Ask Jeeves question and single answer database; Open Directory subject directory; and the main, very large, AltaVista database of millions of indexed Web pages.

AltaVista Web Page Index

Matches from the AltaVista web page index make up the bulk of the search results page. These matches are numbered.

Advanced Search: only uses the main AltaVista database of millions of indexed Web pages.

Strengths:
    * Very large database
    * Powerful search features
    * International coverage, interfaces, and foreign language handling

Weaknesses:
    * Inconsistent results
    * Can display only ten hits at a time
    * No sorting other than by relevance rank
    * Inability to provide the exact number of hits

Simple Search yields at most 210 displayed records, while Advanced Search can display up to 1010 records.  There is also an option to display only the number of hits for a search.

Meta Tags & AltaVista

AltaVista considers words in both the description and keywords meta tags to be
additional words on the page, just as if they appeared on the page in ordinary text.

AltaVista will index the description and keywords up to a limit of 1,024
characters. It doesn't specify whether that's 1,024 characters per tag or whether
markup outside the content quotes is counted. Assume the worst and you'll be safe.

Up to 150 characters of text from the description meta tag are displayed.
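
The limits above can be illustrated with a short sketch that reads the two meta tags from a page. It assumes the 1,024-character limit is applied to each tag's content separately, which, as noted, AltaVista does not actually specify. MetaTagParser and read_meta are hypothetical names.

```python
from html.parser import HTMLParser

INDEX_LIMIT = 1024   # characters indexed (assumed here to be per tag)
DISPLAY_LIMIT = 150  # characters of the description shown in results


class MetaTagParser(HTMLParser):
    """Collects the content of the description and keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("description", "keywords"):
            self.meta[name] = attrs.get("content") or ""


def read_meta(html):
    """Return (text that would be indexed per tag, description shown)."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    indexed = {name: text[:INDEX_LIMIT] for name, text in parser.meta.items()}
    shown = parser.meta.get("description", "")[:DISPLAY_LIMIT]
    return indexed, shown
```

Anything past the limits is simply cut off in this sketch, which is why the most important words belong at the start of each tag.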

The default Boolean operation

Simple Search: For multiple terms entered with no special markings or operators, the Simple Search will run a phrase search if the terms are recognized as a phrase. If it does not recognize a phrase, it processes multiple terms with an automatic OR operation.

Advanced: If no operators are used in the Boolean expression box, AltaVista interprets the search as a phrase search even if it does not match a phrase in its database of phrases. If multiple terms are entered in the ranking keywords box only, they are processed as an OR operation.
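
The Simple Search default can be sketched as follows. The phrase database shown is a made-up two-entry set and the function names are hypothetical; the sketch only mirrors the rule stated above (phrase search if the terms are a recognized phrase, otherwise an automatic OR).

```python
KNOWN_PHRASES = {"new york", "search engine"}  # hypothetical phrase database


def interpret_simple_query(query, known_phrases=KNOWN_PHRASES):
    """Return ("PHRASE", [phrase]) or ("OR", terms) for an operator-free query."""
    terms = query.lower().split()
    phrase = " ".join(terms)
    if len(terms) > 1 and phrase in known_phrases:
        return ("PHRASE", [phrase])
    return ("OR", terms)


def matches(document, interpretation):
    """Check one document against an interpreted query."""
    mode, items = interpretation
    words = document.lower().split()
    if mode == "PHRASE":
        n = len(items[0].split())
        # The phrase words must appear next to each other, in order.
        return any(" ".join(words[i:i + n]) == items[0]
                   for i in range(len(words) - n + 1))
    # OR: a single matching term is enough.
    return any(term in words for term in items)
```

For Advanced Search with no operators, the same matches function would always be called in "PHRASE" mode, since the phrase database is not consulted there.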

Relevancy

AltaVista's help files list these factors as the main ones used to rank pages:

  1. How many times query words appear.
  2. Whether the query words are in the title or meta tags.
  3. The proximity of query words in the document.

Past guidelines have provided a few more details:

  1. All search terms appear, rather than just some.
  2. Search terms appear in the title, meta tags, and the first few lines of the page.
  3. Search terms appear near each other in the document.
  4. Search terms appear frequently in the document.
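
The factors listed above can be computed for a single page as in the sketch below. How AltaVista actually combines them into one rank is not published, so this sketch only extracts the raw features. The function names are hypothetical, and proximity is measured only between the first two query terms to keep the example short.

```python
def min_distance(words, term_a, term_b):
    """Smallest gap in word positions between two terms, or None if absent."""
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


def rank_features(query_terms, title, body):
    """Extract the ranking factors for one page (terms must be lower-case)."""
    title_words = title.lower().split()
    body_words = body.lower().split()
    all_words = title_words + body_words
    proximity = None
    if len(query_terms) >= 2:
        proximity = min_distance(body_words, query_terms[0], query_terms[1])
    return {
        "all_terms_present": all(t in all_words for t in query_terms),
        "terms_in_title": sum(t in title_words for t in query_terms),
        "frequency": sum(body_words.count(t) for t in query_terms),
        "proximity": proximity,
    }


features = rank_features(
    ["vrml", "world"],
    "VRML worlds",
    "a vrml world is a 3d world built with vrml",
)
```

A smaller proximity value means the terms sit closer together, so a ranking formula built on these features would reward low proximity and high frequency.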

Content

Default Operation: In both search modes, single terms are searched exactly as entered; however, in any term containing a punctuation mark or symbol, the punctuation mark or symbol is removed and replaced with a space, and the resulting string is searched as a phrase. Thus, a search on cd-rom is equivalent to searching "cd rom".
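
The punctuation rule can be sketched in a few lines. The function name normalize_term is hypothetical, and treating the underscore as a word character is a side effect of Python's \w class rather than something AltaVista documents.

```python
import re


def normalize_term(term):
    """Replace punctuation and symbols with spaces.

    If the result has more than one word, it is returned in quotation
    marks to show that it is searched as a phrase.
    """
    words = re.sub(r"[^\w\s]", " ", term).split()
    text = " ".join(words)
    return f'"{text}"' if len(words) > 1 else text
```

So normalize_term("cd-rom") gives "cd rom" in quotation marks, while a plain term such as cdrom is passed through unchanged.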

Accents only matter if they are entered by the user. Interse will match Interse and Intersé, while Intersé will only match Intersé.
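
The accent rule is one-directional, which the following sketch tries to capture. It uses Unicode normalization from the Python standard library to strip accents. This is an assumed technique for illustration, not a description of AltaVista's internals, and the function names are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, e.g. turning e-acute into plain e."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_match(query, word):
    """Unaccented queries match any accent variant of the word;
    accented queries match only the identically accented word."""
    if query == strip_accents(query):
        return strip_accents(word) == query
    return unicodedata.normalize("NFC", word) == unicodedata.normalize("NFC", query)
```

With this rule, accent_match("Interse", "Intersé") is true but accent_match("Intersé", "Interse") is false, matching the behaviour described above.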

Comments are not indexed.  Also, only the first 100K of text on a page is indexed. After that, only links are indexed, up to a maximum of 4MB. Pages heavy with text in a small font size may not get listed.  AltaVista will display about 150 characters for descriptions. If no meta description tag exists, then AltaVista will use the first text it finds on the page, not including ALT text.

Stop Words

AltaVista will ignore certain words in a query if they appear too many times in the entire index. That includes words like "web" and "internet".  In Simple Search, common words are ignored: a count of the occurrences of each word is given, along with a note as to which words were ignored. In Advanced Search there are no stop words, although the AltaVista operators "and", "or", "and not", and "near" need to be entered with the phrase marking (quotation marks) to be searched as ordinary words.
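
A frequency-based stop list can be sketched as below. The cutoff (a word is dropped if it appears in more than half of the documents) is an assumption for illustration; AltaVista does not publish its actual threshold, and the function names are hypothetical.

```python
from collections import Counter


def find_stop_words(documents, max_fraction=0.5):
    """Return words that appear in more than max_fraction of the documents."""
    doc_count = Counter()
    for text in documents:
        # Count each word once per document, however often it repeats.
        doc_count.update(set(text.lower().split()))
    limit = max_fraction * len(documents)
    return {word for word, n in doc_count.items() if n > limit}


def filter_query(query, stop_words):
    """Split a query into the terms that are searched and those ignored."""
    terms = query.lower().split()
    kept = [t for t in terms if t not in stop_words]
    ignored = [t for t in terms if t in stop_words]
    return kept, ignored


docs = ["web search tips", "web design", "the web and vrml", "cooking recipes"]
stop_words = find_stop_words(docs)
kept, ignored = filter_query("web vrml", stop_words)
```

The ignored list is what a Simple Search style interface would report back to the user as the words left out of the query.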

Language Detection

AltaVista automatically categorizes web pages by language. Its spider tries to determine the language of a web page at the time it is spidered.  The technology is dictionary-based: AltaVista looks at a page to see if the bulk of the words match those of a particular language.
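
A dictionary-based detector can be sketched in a few lines. The word lists below are tiny, hypothetical stand-ins for real dictionaries, and picking the language with the most matching words is an assumed simplification of "the bulk of the words".

```python
# Tiny hypothetical word lists; a real dictionary would be far larger.
DICTIONARIES = {
    "english": {"the", "and", "is", "of", "to", "page"},
    "french": {"le", "la", "et", "est", "de", "page"},
    "german": {"der", "die", "und", "ist", "von", "seite"},
}


def detect_language(text, dictionaries=DICTIONARIES):
    """Return the language whose dictionary matches the most words, or None."""
    words = text.lower().split()
    counts = {lang: sum(w in vocab for w in words)
              for lang, vocab in dictionaries.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Words shared between languages, such as "page" here, count for each of them, so a page is only assigned to a language when its other words tip the balance.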

Mirror Sites

AltaVista maintains a network of mirror sites outside the United States. These
are meant to be identical copies of the main AltaVista index, but in fact, all but
one of the mirrors list fewer pages than the main service.

This is important to search engine users outside the US. Mirrors may provide a
slightly faster response, due to their closer physical proximity, but they are not
as comprehensive as the main service. Related to this, they will display
substantially different results than the main index.

AltaVista's Canadian mirror is the exception. Recently launched, the mirror
includes 12 million more Canadian web pages than the main service contains.
Canadians, and those looking for information about Canada, should certainly
compare AltaVista Canada to the main service, to see if the extra pages
produce better results.

Sorting

Simple Search: If the query matches a question in the Ask Jeeves database, the matched questions and answers are given at the top (or at the bottom, depending on the query). Then, if the query matches one or more records in the RealNames Internet Keyword database, those results are given next. Then, the actual Web database results are ranked by AltaVista's relevance ranking formula. Relevance is determined by location of the terms (such as in the title), proximity of multiple search terms to each other, and the frequency of the terms. Open Directory categories may be listed at the bottom under an AltaVista Recommends: heading.

Advanced Search: The results are NOT sorted unless one or more terms are included in the ranking keywords box. There is no apparent order to the list when the ranking keywords box is empty, although the order may remain consistent on subsequent searches.

Both: As of Oct. 1999, all results from both the Simple and Advanced Searches are grouped by site. This clustering is based on the exact host name. In other words, name.com and www.name.com would be two separate clusters. AltaVista has no option for unclustering.
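
Clustering by exact host name can be sketched with the standard library's URL parser. The function name cluster_by_host and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def cluster_by_host(urls):
    """Group result URLs by exact host name, keeping first-seen order."""
    clusters = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        clusters.setdefault(host, []).append(url)
    return clusters


clusters = cluster_by_host([
    "http://www.name.com/a",
    "http://name.com/b",
    "http://www.name.com/c",
])
```

Because the comparison is on the exact host, www.name.com and name.com end up in two separate clusters here, just as described above.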

Display: AltaVista displays the title, URL, file size, date, language, and a two-line extract for each hit.