Project 2 for CS 371R: Information Retrieval and Web Search
Evaluating the Performance of Pseudo Relevance Feedback

Due: Oct. 7, 2024


Evaluating Retrieval

As discussed in class, a basic system for evaluating vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/eval/. See the Javadoc for this system. Use the main method of ExperimentRated to index a set of documents, process a set of queries, evaluate the results against the known relevant documents, and finally generate a recall-precision curve and an NDCG plot.

You can use the documents in the Cystic-Fibrosis (CF) corpus (/u/mooney/ir-code/corpora/cf/) as a set of test documents. This corpus contains 1,239 "documents" (actually just the titles and abstracts of medical articles). A set of 100 queries, together with the documents judged relevant to each of them, is in /u/mooney/ir-code/queries/cf/queries.

ExperimentRated can be used to produce recall-precision curves and NDCG results for this document/query corpus. The NDCG results are based on continuous relevance ratings. Our CF data actually comes with ratings on a 3-level scale (0: not relevant, 1: marginally relevant, 2: very relevant) from 4 judges. In order to produce a single relevance rating, I averaged the scores of the 4 judges and scaled the result to produce a real-valued rating between 0 and 1. The rated query file in /u/mooney/ir-code/queries/cf/queries-rated contains the results: for each query, each relevant document is followed by a 0-1 relevance rating. Here is a trace of running such an experiment.

The program also generates as output a ".gplot" file and a ".ndcg.gplot" file that gnuplot can use to generate a recall-precision graph (plot) such as this and an NDCG graph (plot) such as this. To create a PDF plot file, execute the following command:

gnuplot filename.gplot | ps2pdf - filename.pdf

The gnuplot command writes a PostScript (*.ps) file to standard output, which is piped directly into the ps2pdf command (note the "-"), producing the PDF file filename.pdf.

A set of sample results files that I generated for the CF data are in /u/mooney/ir-code/results/cf/.
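Such results are produced by an evaluation run of the form shown below. This is only an illustrative invocation (the output prefix is a placeholder you should replace with your own results path); the argument order matches the example invocation given under "Your Task" below:

java ir.eval.ExperimentRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated <your-output-prefix>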

You can also edit the ".gplot" files yourself to create graphs combining the results of multiple runs of ExperimentRated (such as with this ".gplot" file and resulting pdf plot file) in order to compare different methods.

Relevance Feedback and Pseudo Relevance Feedback

Code for performing relevance feedback is included in the VSR system. See the ir.vsr.Feedback class and its Javadoc documentation. It is invoked by using the "-feedback" flag when ir.vsr.InvertedIndex is run. After viewing a retrieved document, the user is asked to rate it as either relevant or irrelevant to the query. Then, when the "r" (redo) command is used, this feedback is used to revise the query vector (using the Ide_Regular method), and the revised query is used to produce a new set of retrievals.
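For reference, the Ide_Regular update (in its standard textbook form) revises the query roughly as follows, where ALPHA, BETA, and GAMMA are the feedback parameters discussed under "Your Task" below:

    new query = ALPHA * (original query vector)
              + BETA  * (sum of the vectors of documents rated relevant)
              - GAMMA * (sum of the vectors of documents rated irrelevant)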

One problem with relevance feedback is that it comes at a cost to the user, who must spend time rating the initial retrieval results. A possible solution to this problem is pseudo relevance feedback: simply assume that the top m retrieved documents are relevant, and use them to reformulate the query.
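To make the idea concrete, here is a hedged sketch of what a pseudo-feedback step could look like in Java. It assumes the Feedback API roughly matches its Javadoc (a constructor taking the query vector, the retrieval array, and the index, plus addGood() and newQuery() methods), and that it sits in a class that imports ir.vsr.* and ir.utilities.HashMapVector. The helper method and variable names are illustrative only, not the exact code you must write:

    // Illustrative helper: revise a query by assuming the top m retrievals are relevant.
    // Assumes the Feedback constructor and the addGood()/newQuery() methods described
    // in the Javadoc; verify the exact signatures in the provided code.
    HashMapVector pseudoFeedbackQuery(HashMapVector queryVector, Retrieval[] retrievals,
                                      InvertedIndex index, int m) {
        Feedback fdback = new Feedback(queryVector, retrievals, index);
        // Treat each of the top m retrieved documents as if the user had rated it relevant.
        for (int i = 0; i < m && i < retrievals.length; i++)
            fdback.addGood(retrievals[i].docRef);
        // Produce the revised query vector (Ide_Regular combination of the
        // original query and the feedback document vectors).
        return fdback.newQuery();
    }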

An important question that can be addressed experimentally is: what is the effect of pseudo relevance feedback on retrieval results?

Your Task

Your assignment is to modify the ir.vsr.Feedback class to support pseudo relevance feedback. The current interactive feedback capability should be preserved; add the pseudo capability to ir.vsr.Feedback along with a way to distinguish between the interactive and pseudo feedback modes.

A new flag "-pseudofeedback" should be implemented in the ir.vsr.InvertedIndex and ir.eval.ExperimentRated classes; it is followed by the integer value of m, e.g.
java ir.vsr.InvertedIndex -html -pseudofeedback 8 /u/mooney/ir-code/corpora/yahoo-science/
You will also need to add the necessary parameters and make corresponding changes to the constructors and the feedback-handling code in these classes.

A flag "-feedbackparams" should also be implemented in these classes; it is followed by three floating-point values for the ALPHA, BETA, and GAMMA feedback parameters (the default is 1.0 for each), e.g.
java ir.eval.ExperimentRated -pseudofeedback 5 -feedbackparams 1.0 0.5 1.0 /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated /u/mooney/ir-results/5beta05
You will need to change the classes ir.vsr.Feedback, ir.vsr.InvertedIndex, ir.eval.ExperimentRated, and ir.eval.Experiment.
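As a hint about the scope of the change, the new flags can be handled in the same argument-processing loops that already handle flags like "-html" and "-feedback" in the main methods of these classes. The fragment below is only a sketch to be adapted into that existing loop; the actual loop structure and variable names in the provided code will differ:

    // Sketch of flag handling inside the existing argument loop (illustrative only).
    else if (args[i].equals("-pseudofeedback")) {
        // m: number of top retrieved documents to assume relevant
        pseudoFeedback = Integer.parseInt(args[++i]);
    }
    else if (args[i].equals("-feedbackparams")) {
        // ALPHA, BETA, GAMMA feedback parameters (each defaults to 1.0)
        alpha = Double.parseDouble(args[++i]);
        beta = Double.parseDouble(args[++i]);
        gamma = Double.parseDouble(args[++i]);
    }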

You will then use this code to produce recall-precision and NDCG curves that evaluate the effect of different amounts of pseudo relevance feedback on retrieval performance on the CF corpus.

Try the following amounts of pseudo relevance feedback (values of m): {0, 1, 2, 5, 10, 15, 30}, using BETA=0.1 and the default values for the other feedback parameters. This should generate 7 different recall-precision plots. You should manually combine these into one gnuplot file and a final performance graph that compares recall-precision performance for all of these amounts of feedback in one graph. Put this graph in a file called "[PREFIX]_amount_results_rp.pdf". You should do the same for the NDCG results, which should go in a file called "[PREFIX]_amount_results_ndcg.pdf".
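For example, the m=5 run in this set could be invoked along these lines (the output prefix is a placeholder; replace it with your own results path):

java ir.eval.ExperimentRated -pseudofeedback 5 -feedbackparams 1.0 0.1 1.0 /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated <your-output-prefix>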

Using the top 2 retrieved documents for relevance feedback (m=2), try the following values for BETA: {0.1, 0.5, 1.0}. ALPHA and GAMMA should remain 1.0 (explain why varying these parameters is of no interest to us). This should generate 3 different recall-precision plots. You should manually combine these plots into one gnuplot file, along with the recall-precision curve for the original system without pseudo relevance feedback (m=0). Put this graph in a file called "[PREFIX]_beta_results_rp.pdf". You should do the same for the NDCG results, which should go in a file called "[PREFIX]_beta_results_ndcg.pdf".

Discussion

Insightful discussion of your algorithm and results is very important for this assignment. Following are some of the issues that should be addressed in your writeup:

Submission

In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Generate the zip file in a way that maintains the directory structure required.

Along with that, follow these specific instructions for Project 2. The following files should be submitted separately on Canvas:

  1. code.zip: A zip file of the ir directory (with any subfolders such as eval, vsr, etc.) that contains all the classes you have changed for this project. Submit both .java and .class files.
  2. soln_trace.txt: The trace file of a program run (ExperimentRated with pseudo feedback and m=ALPHA=BETA=GAMMA=1, OR all of the configurations we have asked you to run).
  3. report.pdf: A short document (1-2 pages), in PDF format, clearly describing in well-written English prose the approach taken to the assignment, the general algorithm employed, and an insightful discussion of the experimental results obtained.
  4. amount_results_rp.ps (final PostScript gnuplot output) OR amount_results_rp.pdf (final PDF plot): the graph with the 7 recall-precision curves.
  5. amount_results_ndcg.ps (final PostScript gnuplot output) OR amount_results_ndcg.pdf (final PDF plot): the graph with the 7 NDCG curves.
  6. beta_results_rp.ps (final PostScript gnuplot output) OR beta_results_rp.pdf (final PDF plot): the graph with the 4 recall-precision curves.
  7. beta_results_ndcg.ps (final PostScript gnuplot output) OR beta_results_ndcg.pdf (final PDF plot): the graph with the 4 NDCG curves.

Ensure that you can copy these files directly into a fresh copy of the ir project, and see your changes take effect on a CS Linux box. If it won't compile or your changes don't show up, you probably need to include something else. It might seem like a hassle to create so many directories, but it makes things much easier on the grader.

Please make sure that your code compiles and runs on the UTCS lab machines.

Grading Criteria