CS 351 Lisp and Symbolic Computation
Homework 3: Heuristic Search on the Web


Due: March 30, 200 (3:15PM)

Existing Robosurfer

A Common Lisp program for using heuristic search to search the web is in the file /u/mooney/cs351-code/robosurfer.lisp. Other related files are in the same directory. The file robosurfer-trace shows a trace of loading and running the program under Allegro. Code and traces print out a little nicer using "enscript -f Courier8" (less ugly wrap-around).

The Lisp function get-url in the program uses the Allegro shell function to call the Unix program lynx (a simple web browser for dumb terminals) in order to download a page from the web given its URL and copy it to a local file. Therefore, the program needs to be run on department workstations, and you must have /lusr/bin/ in your path or otherwise have access to lynx.

The program assumes the search is conducted through a state space where a state is either a web page or a hyperlink extracted from a web page. The function web-successors generates successors for such a state. The successors of a web page are all the hyperlinks it contains. The sole successor of a hyperlink is the web page to which it points. Having each hyperlink represented as a separate state with one successor allows the system to choose which links to pursue rather than having to immediately download all referenced pages, as it would in a state space where a web page has as immediate successors all the pages to which it points. The utilities also maintain search paths by keeping parent-page pointers for hyperlinks and parent-hyperlink pointers for web pages as they are generated as successor states. The code also prevents the generation of a state that causes a loop (a page which points back to one of its ancestors in the search path), and it prints a trace while running that shows the generation of web successor states.
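The two kinds of state and the loop check described above can be sketched as follows. This is only an illustration: the structure names, slot names, and URLs are hypothetical and are not taken from robosurfer.lisp, which may represent states quite differently.

```lisp
;; Hypothetical representation; the real robosurfer.lisp may differ.
(defstruct page-state
  url                ; URL of the downloaded page
  parent-hyperlink)  ; link-state followed to reach this page, or NIL at the start

(defstruct link-state
  url                ; URL the link points to
  anchor-text        ; text appearing inside the link
  parent-page)       ; page-state the link was extracted from

(defun ancestor-page-urls (page)
  "Collect the URLs of PAGE and of every page above it on the search path."
  (loop for p = page
          then (let ((link (page-state-parent-hyperlink p)))
                 (and link (link-state-parent-page link)))
        while p
        collect (page-state-url p)))

(defun causes-loop-p (link)
  "True if LINK points back to a page already on its own search path."
  (member (link-state-url link)
          (ancestor-page-urls (link-state-parent-page link))
          :test #'string=))

;; A two-page path: home -> (link) -> faculty, with made-up URLs.
(defparameter *home* (make-page-state :url "http://example.edu/"))
(defparameter *to-faculty*
  (make-link-state :url "http://example.edu/faculty"
                   :anchor-text "Faculty" :parent-page *home*))
(defparameter *faculty*
  (make-page-state :url "http://example.edu/faculty"
                   :parent-hyperlink *to-faculty*))
```

A link on the faculty page that points back to the home page would be rejected by causes-loop-p, while a link to a page not yet on the path would be kept.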

Starting at a particular web page, the program's goal is to find a page that contains all members of a list of desired search strings ("want-strings"). In order to guide the search, you can also provide a list of "help strings," which are other related strings that can help guide the search. For example, starting from the UT CS department home page and using the want strings "351," "Lisp," and "teaching assistant" and the help string "course info," Robosurfer successfully found our course home page.

The interface is via the following Lisp function:

(web-search start-url want-strings help-strings beam-width)

which does a beam search with the given width. The file robosurfer-trace shows runs of some sample problems.
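For readers unfamiliar with beam search, here is a minimal sketch of one common formulation: a best-first search whose frontier is cut back to the best beam-width states after every expansion. It is not the code in robosurfer.lisp, whose details may differ; the function names and the toy integer state space are made up for illustration.

```lisp
;; Hypothetical sketch of beam search; not the actual web-search code.
(defun beam-search-sketch (start goal-p successors score beam-width)
  "Return the first goal state found, or NIL if the frontier runs empty.
SCORE maps a state to a number; lower is better."
  (let ((frontier (list start)))
    (loop
      (when (null frontier) (return nil))
      (let ((state (pop frontier)))
        (when (funcall goal-p state) (return state))
        ;; Merge the new successors into the frontier, best first,
        ;; then keep only the BEAM-WIDTH most promising states.
        (let ((sorted (sort (append (funcall successors state) frontier)
                            #'< :key score)))
          (setf frontier
                (subseq sorted 0 (min beam-width (length sorted)))))))))

;; Toy state space: the successors of n are 2n and 2n+1, cut off above 20.
(defun toy-successors (n)
  (if (> n 20) nil (list (* 2 n) (1+ (* 2 n)))))

;; Toy heuristic: distance from the goal number 13.
(defun toy-score (n)
  (abs (- 13 n)))

(beam-search-sketch 1 (lambda (n) (= n 13)) #'toy-successors #'toy-score 3) ; => 13
(beam-search-sketch 1 (lambda (n) (= n 13)) #'toy-successors #'toy-score 1) ; => NIL
```

With a width of 3 the toy search reaches 13, but with a width of 1 it discards the state it needed and fails. A narrow beam saves effort at the risk of throwing away the path to the goal.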

Like Alta Vista, if a search string is capitalized, case (upper/lower) matters; otherwise it does not. The function char= is case-sensitive equality for characters and char-equal is case-insensitive. If an individual search string contains multiple words, it still matches whether they are separated by a space (#\Space) or a line break (#\Newline) in the text.
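The matching rules above can be sketched as follows. The function names are hypothetical and not from robosurfer.lisp. The sketch also assumes that "capitalized" means the search string contains at least one upper-case character; the program itself may apply a different test.

```lisp
;; Hypothetical helpers illustrating the matching rules; not the actual code.
(defun case-sensitive-p (search-string)
  "Assumption: a search string counts as capitalized if it contains
any upper-case character."
  (some #'upper-case-p search-string))

(defun chars-match-p (pattern-char text-char char-test)
  "A space in the search string matches a space or a newline in the text.
Any other character is compared with CHAR-TEST (char= or char-equal)."
  (if (char= pattern-char #\Space)
      (or (char= text-char #\Space) (char= text-char #\Newline))
      (funcall char-test pattern-char text-char)))

(defun count-matches (search-string text)
  "Count the positions in TEXT at which SEARCH-STRING matches."
  (let ((char-test (if (case-sensitive-p search-string) #'char= #'char-equal))
        (n (length search-string)))
    (if (zerop n)
        0
        (loop for start from 0 to (- (length text) n)
              count (loop for i from 0 below n
                          always (chars-match-p (char search-string i)
                                                (char text (+ start i))
                                                char-test))))))

(count-matches "lisp" "Lisp and LISP")   ; => 2 (lower-case string ignores case)
(count-matches "Lisp" "lisp and LISP")   ; => 0 (capitalized string is exact)
(count-matches "teaching assistant"
               (format nil "Teaching~%Assistant"))  ; => 1 (newline matches space)
```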

The existing search heuristic considers four factors in order of importance:

  1. (WC) The number of the want-strings that are found
  2. (HC) The number of the help-strings that are found
  3. (WT) The total number of times a want-string is found
  4. (HT) The total number of times a help-string is found

A page is scored as

S = 1000*WC + 100*HC + 10*WT + HT
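As a worked example of this formula, the sketch below computes S from per-string occurrence counts. The function name and its argument convention (one count per want-string and one per help-string) are assumptions made for illustration, not the interface used in robosurfer.lisp.

```lisp
;; Hypothetical helper: WANT-COUNTS and HELP-COUNTS are lists holding the
;; number of occurrences found for each want-string and each help-string.
(defun match-score (want-counts help-counts)
  "Compute S = 1000*WC + 100*HC + 10*WT + HT."
  (let ((wc (count-if #'plusp want-counts))   ; want-strings found at least once
        (wt (reduce #'+ want-counts))         ; total want-string occurrences
        (hc (count-if #'plusp help-counts))   ; help-strings found at least once
        (ht (reduce #'+ help-counts)))        ; total help-string occurrences
    (+ (* 1000 wc) (* 100 hc) (* 10 wt) ht)))

;; Three want-strings found 3, 0, and 2 times; one help-string found 3 times:
;; WC = 2, WT = 5, HC = 1, HT = 3, so S = 2000 + 100 + 50 + 3.
(match-score '(3 0 2) '(3))   ; => 2153
```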

A link is scored partly based on the text appearing directly in the link and partly based on the surrounding page. If L is the S score for the text in the link and P is the S score for the overall page, then a link is scored as

L/2 + P/2

getting half its score from its own text and half from its surrounding page. Note that in order to preserve the notion that a lower heuristic score is better, the negative of these "match" scores is actually used as the final heuristic evaluation.
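The link score and the sign flip can be illustrated with a short sketch. The function names are hypothetical. The actual program might also round or truncate the halves, whereas this sketch keeps Common Lisp's exact rational arithmetic.

```lisp
;; Hypothetical names; a sketch of the scoring arithmetic described above.
(defun link-match-score (link-text-score page-score)
  "Half of the link's own S score plus half of its page's S score.
Common Lisp's / returns exact rationals, so odd scores are not truncated."
  (+ (/ link-text-score 2) (/ page-score 2)))

(defun heuristic-value (match-score)
  "Negate a match score so that better matches get lower heuristic values."
  (- match-score))

;; A link whose own text scores 1000, on a page that scores 2153:
(link-match-score 1000 2153)                     ; => 3153/2
(heuristic-value (link-match-score 1000 2153))   ; => -3153/2
```

Because the values are negated, sorting candidate states in ascending order of heuristic value places the strongest matches first.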

You might first want to experiment with the existing system to get a feel for how it works. Always be prepared to kill Robosurfer execution using "C-c C-c" if it seems to start surfing aimlessly out of control.

Problem with the Existing Heuristic

Of course, the existing heuristic is lacking in many cases and does not adequately direct the search. Consider the first problem in the trace file: trying to find my home page without using the keyword "Professor" (command USER(4)). Even though the system first finds the page listing faculty home pages, it goes back to look at the undergrad and grad home-page lists before returning to actually follow the link to my home page. This happens because the heuristic uses only the local information on the current page. Although the page listing faculty home pages has a link with my name, that page does not use the phrase "home page" or "Computer Science". Meanwhile, the department home page, which does include the phrase "Computer Science", has links to the grad and undergrad lists that use the phrase "home page". The system followed links labelled "Computer Science" and "home page" to reach the CS faculty home-page listing page, but it gets no "credit" for these keywords since they do not actually appear on the page itself. In short, although the heuristic takes the surrounding page into account when evaluating a link, when evaluating a page or a link it ignores the information encountered on the path followed to get there, which also implicitly contains information about the content of the page.

One of the later variants of this problem in the trace, in which "Computer Sciences" is used as a help string rather than a want string (USER(8)), gets lost in the business college, partially for this reason. The run for the "machine translation" problem (USER(15)) also wastes some time exploring irrelevant pages for the same reason.

Changing the Heuristic

Your assignment is to improve the heuristic to avoid this general problem. Specifically, you should alter the heuristic to also use information in the anchor text of the links followed to get to the current page or link. For example, for the CS faculty home-page listing page, starting from the University home page, this would include "Colleges/Academic Units", "Computer Sciences" and "Faculty Home Pages". For simplicity, do not use arbitrary text from the pages on the path leading to the current page or link; use only the anchor text of the links on the path.

Of course, a hack that works for the specific problems in the trace and doesn't solve the general problem of using path information to rank pages and links is unacceptable. Anything exploiting the specific layout of particular university pages is inadequate.

You will need to redefine the function score-state (and/or some of the underlying functions). The information on the path can be obtained by following the parent-page and parent-hyperlink pointers.

A trace for my version of such an improved heuristic is in robosurfer-improved-trace. Of course, your solution does not have to produce the same scores and same trace on all of these problems, but it should at least find the optimal solution to all of the "Mooney home page" problems with no unnecessary search. Note that my version actually does worse on one version of the problem of finding the CS351 TA, the one where it is not told that this is an undergraduate course (USER(115)), since for some reason it now gets stuck looking at graduate course pages. The path for the "machine translation" problem (USER(119)) is now a bit more directed, but the search ends up finding a longer path since the original search got a bit lucky and found a shorter path somewhat accidentally. If you can do better, great!

In my solution, I include in the heuristic how many want and help strings are found in the link path. I also include a preference for finding "new" want and help strings on the current page or link, where a "new" string is one that has not been encountered yet along the path (as determined by the link texts in the path). The idea behind this "newness" bonus is to encourage "progress" by weighting strings that have not yet been found along the path a bit more than ones that have already been found. These are just suggestions based on my solution; you do not have to follow them directly.

You should submit the final commented code you write electronically using turnin by class time on the due date.

At the beginning of your submitted code, include a page or so of general comments describing the basic approach of your new heuristic and explaining why it helps solve the problem discussed above. Summarize your experience with testing the new heuristic, both successes and failures. Discuss the remaining limitations of the heuristic (and/or the overall system) and present any additional ideas you have for resolving them. (Note: you may find it useful to write long comments by placing the text between a "#|" and a "|#", which marks a multi-line block as a comment without having to begin each line with a ";".)
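For instance, the top of a submitted file might be laid out as in the sketch below. The headings inside the comment and the function after it are placeholders, not a required format.

```lisp
#|
General comments (placeholder outline)

Approach:     how the new heuristic uses the anchor text on the link path.
Testing:      which problems improved and which did not.
Limitations:  remaining weaknesses and ideas for addressing them.

Everything between the opening and closing markers is ignored by the
Lisp reader, so no line needs a leading semicolon.
|#

;; Ordinary code resumes after the closing marker.
(defun example-after-comment ()
  "Placeholder definition showing that the reader skipped the block above."
  :loaded)
```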

Good luck and have fun robosurfing!