Gathering Search Results from AltaVista

Deric Bertrand - November 1999

There are many search engines available on the Web today. A few examples include Yahoo, Lycos, AltaVista and WebCrawler.

A vast number of these engines utilize CGI scripting to marshal the request/response between the Web Browser and the backend Search Engine. A basic understanding of the CGI Model is all that's required to dissect the interface of these engines. Once the interface is understood, one can easily manipulate the backend CGI Script to perform the desired query. The CGI Resource Index is an excellent source of information on CGI.

Query String Examples (Search=VRML)
Yahoo - /bin/search?p=vrml
Lycos - /cgi-bin/pursuit?cat=dir&loc=searchhp&query=vrml
WebCrawler - /cgi-bin/WebQuery?searchText=vrml
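
Assembling such a query string is plain string work. A minimal sketch in Java, using java.net.URLEncoder to escape the search term (the QueryStringDemo class and buildQuery helper are hypothetical names, not part of any engine's interface):

```java
import java.net.URLEncoder;

// Builds a query URL for a CGI-based search engine. The script paths
// come from the examples above; buildQuery is a hypothetical helper.
public class QueryStringDemo {
    static String buildQuery(String scriptPath, String paramName, String searchTerm) {
        // URL-encode the search term so spaces and slashes survive transport
        return scriptPath + "?" + paramName + "=" + URLEncoder.encode(searchTerm);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("/cgi-bin/WebQuery", "searchText", "vrml"));
    }
}
```

The same helper reproduces each of the examples above by swapping the script path and parameter name.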


The AltaVista Interface
Because the AltaVista Search Engine was the VRML Project's engine of choice, I'll focus specifically on its interface. However, it should be noted that, while we chose to implement the AltaVista Search Engine, the MetaSearch Architecture provides support for multiple Search Engines.

The AltaVista engine expects name/attribute pairs to be sent to the script in the form of a Query String. This string, appended to the URL after a question mark, provides a way of getting name/attribute pairs from the Web page to the backend Search Engine. An AltaVista Search on VRML with a Date Range between 21/05/99 and 15/11/99 will produce the following URL/Query String:
AltaVista - /cgi-bin/query?stype=&pg=aq&kl=en&q=vrml&r=&kl=en&d0=21%2F05%2F99&d1=15%2F11%2F99&search.x=13&search.y=11

The HTML FORM and INPUT tags can be interrogated to decipher the meaning of the name/attribute pairs. These pairs are defined as follows:

  • pg=Query Type (aq: Advanced Query)
  • kl=Language Type (en: English)
  • q=Search Argument
  • r=Sort By
  • d0=Start Date
  • d1=End Date
  • search.x=X coordinate (click on Search button)
  • search.y=Y coordinate (click on Search button)
    Note: Search coordinates are not required by the engine.
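
Given these pairs, the advanced-query URL can be assembled directly. A hedged sketch covering a subset of the parameters above (AltaVistaQuery and build are illustrative names):

```java
import java.net.URLEncoder;

// Assembles an AltaVista advanced-query URL from the name/attribute
// pairs listed above. Class and method names are illustrative.
public class AltaVistaQuery {
    static String build(String term, String startDate, String endDate) {
        // pg=aq selects the Advanced Query page, kl=en selects English;
        // URL-encoding turns the slashes in the dates into %2F
        return "/cgi-bin/query?pg=aq&kl=en"
             + "&q="  + URLEncoder.encode(term)
             + "&d0=" + URLEncoder.encode(startDate)
             + "&d1=" + URLEncoder.encode(endDate);
    }

    public static void main(String[] args) {
        System.out.println(build("vrml", "21/05/99", "15/11/99"));
    }
}
```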


Submitting the Query
Two main classes are responsible for submitting the Query (request) and retrieving the Search Results (response): the SearchResultsCollector and the SearchEngineFacade. The Collector runs in a separate Thread and is responsible for passing the query request to a dynamically dispatched SearchEngine. Running the Collector in a separate thread enables us to spawn multiple Collectors and run them concurrently. Each Collector, then, has a specific Search Engine tied to it.
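
The collector-per-engine idea can be sketched as follows. The class names mirror the ones above, but the minimal signatures (a search method returning a Vector of result strings) are assumptions:

```java
import java.util.Vector;

// SearchEngine and SearchResultsCollector mirror the article's class
// names; these minimal signatures are assumptions.
interface SearchEngine {
    Vector search(String query);   // one engine-specific query
}

class SearchResultsCollector implements Runnable {
    private final SearchEngine engine;
    private final String query;
    final Vector results = new Vector();

    SearchResultsCollector(SearchEngine engine, String query) {
        this.engine = engine;
        this.query = query;
    }

    // Each collector is tied to exactly one engine; running several
    // collectors on separate Threads queries the engines concurrently.
    public void run() {
        results.addAll(engine.search(query));
    }
}

public class CollectorDemo {
    public static void main(String[] args) throws InterruptedException {
        // A stand-in engine, so the sketch runs without a network
        SearchEngine dummy = new SearchEngine() {
            public Vector search(String q) {
                Vector v = new Vector();
                v.add("result for " + q);
                return v;
            }
        };
        SearchResultsCollector collector = new SearchResultsCollector(dummy, "vrml");
        Thread t = new Thread(collector);
        t.start();
        t.join();                  // wait for the collector to finish
        System.out.println(collector.results);
    }
}
```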

Design Pattern
In designing this architecture, we implemented the Strategy Design Pattern. This pattern provides a model solution for implementing multiple interchangeable algorithms. In our implementation, we created an abstract search engine class. This abstract class can support multiple concrete subclasses, each representing a single, unique Search Engine.
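
A minimal sketch of the pattern as described: an abstract SearchEngine with one concrete subclass per engine. The buildQueryString method is an illustrative stand-in for the real interface:

```java
// Strategy sketch: callers work against the abstract class only, so
// engines are interchangeable at run time. The method shown here is
// an assumption, not the project's actual interface.
abstract class SearchEngine {
    // Each concrete engine knows how to build its own query string
    abstract String buildQueryString(String term);
}

class AltaVistaEngine extends SearchEngine {
    String buildQueryString(String term) {
        return "/cgi-bin/query?pg=aq&kl=en&q=" + term;
    }
}

class WebCrawlerEngine extends SearchEngine {
    String buildQueryString(String term) {
        return "/cgi-bin/WebQuery?searchText=" + term;
    }
}

public class StrategyDemo {
    public static void main(String[] args) {
        // Swapping in a different concrete engine changes the query
        // format without touching the calling code
        SearchEngine engine = new AltaVistaEngine();
        System.out.println(engine.buildQueryString("vrml"));
    }
}
```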

Discovery
Java provides some nice features for managing HTTP connections. Specifically, we utilized the URL and HttpURLConnection classes to connect to the desired Search Engine. For additional information see the Java Tutorial (Networking).
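
A hedged sketch of fetching a results page with these two classes (the PageFetcher class is hypothetical; production code would also set timeouts and handle redirects):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetches an engine's response page over HTTP using the URL and
// HttpURLConnection classes. Illustrative, not the project's code.
public class PageFetcher {
    static String fetch(String urlString) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(urlString).openConnection();
        if (conn.getResponseCode() != HttpURLConnection.HTTP_OK)
            throw new IOException("Bad response: " + conn.getResponseCode());

        // Read the whole response body into one string
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        StringBuffer page = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null)
            page.append(line).append('\n');
        in.close();
        conn.disconnect();
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(args[0]));
    }
}
```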


Parsing Search Results
The response from the Search Engine takes the form of an HTML Page. In this form, the information is useless. The parser's job is to identify and extract the desired information from the HTML Page and store it in a structure that the VRML Page understands.

There are various methods/tools for parsing HTML pages. The JavaCC HTML Parser builds a formatted tree, based on a specified grammar. The Metamata Parser is another such parser utility. While we could have incorporated these or other parser utilities, we chose to write a simplified, AltaVista-specific Search Engine parser.
Note: The AltaVista Search Engine separates Search Results with Table Tags.
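
In the same simplified spirit, a parser can scan for the table tags that delimit each result. This sketch extracts the text between <td> pairs with plain string searching (the class name and the sample markup are illustrative, not the real AltaVista page):

```java
import java.util.Vector;

// A simplified, AltaVista-style extractor: pulls the text between
// each pair of <td>...</td> tags using plain string scanning.
public class TableCellParser {
    static Vector extractCells(String html) {
        Vector cells = new Vector();
        int pos = 0;
        while (true) {
            int open = html.indexOf("<td>", pos);
            if (open == -1) break;                 // no more cells
            int close = html.indexOf("</td>", open);
            if (close == -1) break;                // unterminated cell
            cells.add(html.substring(open + 4, close));
            pos = close + 5;                       // skip past "</td>"
        }
        return cells;
    }

    public static void main(String[] args) {
        // Illustrative sample markup, not actual AltaVista output
        String page = "<table><tr><td>VRML Repository</td><td>vrml.org</td></tr></table>";
        System.out.println(extractCells(page));
    }
}
```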


Communicating with the ASP
The AdditionalResultsFacade is a SearchEngineFacade that's intended for internal use. It has the responsibility of obtaining serialized QueryResults from the ASP and converting them into deserialized QueryResult instances that can be used by the VRML Page.
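
One way to sketch that round trip, assuming Java object serialization on both ends (the QueryResult fields and both helper methods are assumptions, not the project's actual API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical result record; the real QueryResult's fields are unknown.
class QueryResult implements Serializable {
    String title;
    String url;
    QueryResult(String title, String url) { this.title = title; this.url = url; }
}

public class ResultCodec {
    // What the remote side would do to produce the byte stream
    static byte[] serialize(QueryResult r) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(r);
        out.close();
        return bytes.toByteArray();
    }

    // In the facade, this stream would come from the HTTP connection
    static QueryResult deserialize(InputStream in)
            throws IOException, ClassNotFoundException {
        return (QueryResult) new ObjectInputStream(in).readObject();
    }
}
```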

HTTP Headers
Because we have a producer/consumer problem, in which the Collector produces results and the AdditionalResultsFacade consumes them, we need to maintain a concept of the remote producer's state. This state is passed to the AdditionalResultsFacade class in the form of an HTTP header. Again, we found that Java provided us with HttpURLConnection methods to interrogate HTTP Header Keys and to obtain HTTP Header values.

Sample Code
  // Scan the response headers for the collector-state key
  for (int i = 1; theConnection.getHeaderFieldKey(i) != null; i++)
     if (theConnection.getHeaderFieldKey(i).equals("collector-state"))
        m_RemoteCollectorState = Integer.parseInt(theConnection.getHeaderField(i));
    

Maintaining State
One benefit of ASP is that it can maintain Session State. However, in order to maintain state, the ASP requires a token from the client. In our case, the client is the AdditionalResultsFacade. Again, we found that Java provided us with methods that allow us to set the HTTP Cookie.

Sample Code
  theConnection = (HttpURLConnection) theURL.openConnection();
  // The session cookie must be set before the connection is made
  theConnection.setRequestProperty("Cookie", m_SessionID);
  if (theConnection.getResponseCode() != HttpURLConnection.HTTP_OK)
     throw new IOException("Bad response from URLconnect");