A Java program for using heuristics to search the web (also called
directed spidering) is on the department networked file system in the file
/u/mooney/cs343-code/ir/webutils/BeamSearchSpider.java
. This
directory and its parent directory contain a variety of support files. The
version BeamSearchSiteSpider
is a heuristic spider that restricts
itself to a particular web host. This code is an extension of a Java
web-spidering library that was developed for another course on Information Retrieval and Web Search .
First copy the complete /u/mooney/cs343-code/ir
subdirectory to
your own directory and follow the directions
on using the Java course code. You do not need to understand all of the
code to do the assignment, just enough to properly update the search heuristic,
LinkHeuristic
.
See a trace of running the program
. Code and traces print out a little nicer using the Unix command
"enscript -f Courier8
" (less ugly wrap-around).
The program conducts a search through a space of hyperlinks. Each hyperlink is
represented by a ScoredAnchoredLink
object. An
AnchoredLink
is a representation of a link to another URL together
with the text of the link (the anchor text). A ScoredAnchoredLink
also contains heuristic scores for the link itself and for the page to which
the link points. The successor of a link is all of the links on the page to
which the link points. Searching a space of links allows the system to choose
which links to pursue rather than having to immediately download all referenced
pages, as in a state-space where a web-page has as successors all the pages to
which the page points. The program maintains search paths by maintaining
backLink
pointers from links to the links that generated them as
successor states. The code also prevents revisiting pages that have already
been explored (by storing all visited URLs in a HashSet
) and
prints a trace while it is running showing when states (links) are expanded.
Starting at a particular web page, the program's goal is to find a page that
contains all the members of a set of desired search strings called "want
strings" (specified with the -w
command line option). In order to
guide the search, you can also provide a list of "help strings," (specified
with the -h
command line option) which are other related strings
that can help guide the search. For example, starting from the UTCS department
home page, the trace presents a search with the
list of want-strings: "Mooney", "343", "heuristic search" and "robosurfer"
using the help-strings: "people" and "faculty", it successfully found this
assignment page.
The robosurfer-trace
file
shows running some sample problems. See the
JavaDoc for the main
method of
BeamSearchSiteSpider
to interpret the command-line arguments.
If a character in a search string is capitalized, case matters (i.e. it must match an upper case character), otherwise case does not matter (i.e. a lower-case character matches either upper or lower case in the text). If an individual search string contains multiple words, it still matches whether they are separated by a space or any other whitespace character (e.g. newline) in the text.
The existing search heuristic considers four factors in order of importance:
You might first want to experiment with the existing system to get a feel for
how it works. You should usually use BeamSearchSiteSpider
instead of just BeamSearchSpider
. Always be prepared to kill
Robosurfer execution by hitting
"control-c" if it seems to start surfing aimlessly out of control. If you do use
BeamSearchSpider
, the
-safe
command line flag makes the spider obey the Robot Exclusion
Policy and prevents it from surfing where robots are forbidden by convention.
The -slow
flag is also
useful; it slows down the program by pausing before every expansion in order to
allow you to easily read the trace as it crawls. The -b
flag
allows you to specify the beam width for the search, which defaults to 100.
The -c
flag allows you to specify the maximum number of pages
downloaded before the search is forced to terminate, the default is 10,000.
Of course the existing heuristic is lacking in many cases and does not
adequately direct the search. Consider the problems in robosurfer-fail-trace
, which
presents 3 problems that are not solved as effectively as they could be.
Starting from my home page trying to find the want strings "course" and "situation calculus" (a topic we will discuss in the course later and appears on the course syllabus), the system gets quite lost because it does not directly go to my course web pages since the word "course" appears in the heading above the course links but not in the link text itself. It eventually finds its way to the goal but only after exploring over 120 pages.
Trying to find the phrase "A* search" using the help strings "textbook" and
"contents" finds a solution fairly quickly but takes longer than it should since
"Textbook" appears in the section heading immediately before the link to the
textbook but not in it. It follows each link on the course page sequentially
until it fortunately stumbles on the textbook page which is fairly near the top
of the page. NOTE: This example requires using BeamSearchSpider
-safe
instead of BeamSearchSiteSpider
to allow searching
outside the cs.utexas.edu site.
Finally,trying to find the discussion about the robosurfer program from the course homepage using "robosurfer" and "program" as want strings and no help strings gets completely lost because the link to this assignment contains neither of these terms. Although the phrase "Programming Project" appears in the section heading above the link to this assignment it does not appear in the link itself so does not help. It fails to solve the problem even after exhausting the limit of 250 pages.
Ideally, the heuristic should use the relevant section headings preceding a link to guide it more or less directly to the correct page in all of these cases.
Your assignment is to improve the heuristic to avoid this general problem. Of course, a hack that works for these specific cases and doesn't solve the general problem of using information in the current section headings to rank links is unacceptable. Anything exploiting the specific layout of the course homepage or other specific pages is inadequate.
A trace for my version of an improved heuristic is in robosurfer-soln-trace
.
You should create a specialization of the class BeamSearchSpider
called HeaderBeamSearchSpider
, and a specialization of the class
BeamSearchSiteSpider
called
HeaderBeamSearchSiteSpider
, in both cases, the
constructLinkHeuristic()
method should be redefined to produce an
instance of a new heuristic HeaderLinkHeuristic
that is a
specialization of the LinkHeuristic
class.
Hint: In my solution I looked for the closest preceding section headings
relevant to the link being scored. If a want-string or help-string appears in
the title of these section headings, then the score calculated by the existing
heuristic described above is boosted. Note that in HTML there are section
headings of different "depths", from the highest heading H1 to the lowest H6.
My heuristic looks for section headings at each level that cover the current
link and awards bonus points for current section headings at each level that
contain any of the want strings or help strings. More points are awarded for
deeper (more specific) headings, e.g. matching the current H3 heading counts
more than matching the current H2 or H1 heading. The
ScoredAnchoredLink
class used to represent links in Robosurfer
already extracts and retains the start and end character positions of links in
the HTML file that will help you to easily implement such solutions. This is
only a hint of how I did it, other approaches are certainly possible in order
to get the same general effect of having section headings influence link
scoring. Your trace for the sample problems in
robosurfer-soln-trace
should be at least as direct and
efficient as mine.
You should submit the final commented code you write electronically using
turnin
by 15 minutes before the start
of class on the due date. You should submit a directory with an appropriate
set of named files. Include a file robosurfer-soln-trace
containing a trace of your system working on the three sample problems.
You should also include a one or two page report in a plain ASCII file
report.txt
or a PDF file report.pdf
. Describe the
details of your new heuristic and explain why it solves the problem discussed
above. Try some other problems of your own. Summarize your experience with
testing the new heuristic, both successes and failures. Discuss the remaining
limitations of the heuristic (and/or the overall system) and present any
additional ideas you have for resolving them.
Follow the submission instructions for submitting your project.
Good luck and have fun robosurfing!