Project 4
CS 371R: Information Retrieval and Web Search
Evaluating Embeddings From Deep Language Models


Due: 11:59pm, Dec. 2, 2024

This project explores using embeddings from an LLM to support standard document retrieval. You will use document and query embeddings from a recent LLM specialized for scientific literature, stored as precomputed dense vectors. You will then use Java code that augments the existing course IR code to support retrieval over these precomputed dense vectors and experimentally evaluate the deep embeddings. Finally, you will implement and test a "hybrid" approach that combines this dense-retrieval method with the existing VSR system and experimentally evaluate whether it improves the results compared to purely sparse or purely dense retrieval alone.

Generated Deep Embeddings

Overview

We will evaluate deep dense embeddings generated from a pre-trained transformer language model on the Cystic Fibrosis (CF) dataset introduced in Project 2. Specifically, we will be using the SPECTER2 transformer model, an LLM trained on scientific paper abstracts. SPECTER2 is first trained on over 6M triplets of scientific paper citations and is then trained with additional task-specific adapter modules. You can read more about the basic SPECTER approach in this paper. When paired with adapters, the pre-trained LM can generate task-specific embeddings for scientific tasks. Given the combined title and abstract of a scientific paper, or a short textual query, the model can generate effective embeddings for downstream applications. Read through the model card on HuggingFace to understand the model better.

In this project, along with the SPECTER2 Base Model, we will use the Proximity Adapter to embed documents and the Adhoc Query Adapter to embed queries. We will compare two variants of SPECTER2 embeddings: one using just the base model and one with the adapter modules attached to the base model. Each of the two cases is described below.

We have precomputed embeddings for the CF documents and queries stored as space-separated 768-dim vectors in the following files:

Deep Retrieval

The course code in /u/mooney/ir-code/ir/ has been augmented with the class ir.vsr.DeepRetriever, a simple document retriever that uses precomputed dense document embeddings. It takes a directory of precomputed document embeddings with the same file names as the original corpus, where each file contains a simple space-separated list of real values representing the document embedding. Another new class used by the deep retriever is DeepDocumentReference, which stores a pointer to a file along with its dense vector and the precomputed vector length (L2 norm) for that vector. The 'retrieve' method returns a list of ranked retrievals for a query represented as a DeepDocumentReference, using either Euclidean distance (the default) or cosine similarity (if the '-cosine' flag is used) to compare the dense vectors. No indexing is used to improve efficiency: a query is compared to every document in the corpus, and all of the scored results are included in the array of Retrievals. Ideally, some form of approximate nearest-neighbor search, such as locality-sensitive hashing, would be used; however, for the limited number of small documents and queries in the CF corpus, a brute-force approach is tractable.
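The core scoring loop of such a brute-force dense retriever can be sketched as follows. This is only a minimal illustration of the idea, not the actual DeepRetriever code: the class and method names are hypothetical, vector loading is omitted, and documents are represented simply as file names mapped to double arrays.

    import java.util.*;

    // Minimal sketch of brute-force dense retrieval over precomputed embeddings.
    // Assumes each document embedding is already loaded as a double[] keyed by file name.
    public class DenseScorer {

        // Cosine similarity between two dense vectors of equal length.
        static double cosine(double[] q, double[] d) {
            double dot = 0, qNorm = 0, dNorm = 0;
            for (int i = 0; i < q.length; i++) {
                dot += q[i] * d[i];
                qNorm += q[i] * q[i];
                dNorm += d[i] * d[i];
            }
            return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
        }

        // Negative Euclidean distance, so that larger scores are always better.
        static double negEuclidean(double[] q, double[] d) {
            double sum = 0;
            for (int i = 0; i < q.length; i++) {
                double diff = q[i] - d[i];
                sum += diff * diff;
            }
            return -Math.sqrt(sum);
        }

        // Score the query against every document and return file names ranked best-first.
        static List<String> retrieve(double[] query, Map<String, double[]> docEmbeddings,
                                     boolean useCosine) {
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, double[]> e : docEmbeddings.entrySet()) {
                double s = useCosine ? cosine(query, e.getValue())
                                     : negEuclidean(query, e.getValue());
                scores.put(e.getKey(), s);
            }
            List<String> ranked = new ArrayList<>(scores.keySet());
            ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
            return ranked;
        }
    }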

'Deep' versions of the Experiment and ExperimentRated classes used in Project 2 are also provided as ir.eval.DeepExperiment and ir.eval.DeepExperimentRated; they produce precision-recall and NDCG plots evaluating the DeepRetriever. These use the normal 'queries' and 'queries-rated' files used by the normal experiment code but also take a directory of query embeddings, with one embedding file (a list of real values) per query. The query embedding directory should have file names Q1,...,Qn giving the embeddings of the queries in the order they appear in the original 'queries' file. The files will be lexicographically sorted by name, and this order must correspond to the order in the queryFile, so file numbers should have leading '0's as needed to sort properly, i.e. Q001, Q002, ... Q099, Q100. A sample DeepExperimentRated trace is here.
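To see why the zero-padding matters, note what a plain lexicographic sort does to unpadded names. The snippet below is a small stand-alone illustration, not part of the provided code:

    import java.util.Arrays;

    public class SortDemo {
        public static void main(String[] args) {
            // Unpadded names sort in the wrong order lexicographically...
            String[] unpadded = {"Q1", "Q2", "Q10", "Q11", "Q100"};
            Arrays.sort(unpadded);
            System.out.println(Arrays.toString(unpadded));
            // prints [Q1, Q10, Q100, Q11, Q2]

            // ...while zero-padded names sort in the intended query order.
            String[] padded = {"Q001", "Q002", "Q010", "Q011", "Q100"};
            Arrays.sort(padded);
            System.out.println(Arrays.toString(padded));
            // prints [Q001, Q002, Q010, Q011, Q100]
        }
    }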

Hybrid Retrieval

Disappointingly, my initial results (shown below) were slightly worse than the baseline VSR system, except for slightly improved precision at high recall values. Therefore, I hypothesized that combining the normal VSR approach with the deep-learning approach in a "hybrid" method might work best. A simple hybrid approach ranks retrieved documents by a weighted linear combination of the dense-vector deep cosine similarity and the normal VSR cosine similarity, i.e. λ D + (1 - λ) S, where D is the deep dense cosine similarity and S is the sparse TF/IDF BOW cosine similarity.

Implement and evaluate such a simple hybrid approach by writing the following classes: ir.vsr.HybridRetriever, ir.eval.HybridExperiment, and ir.eval.HybridExperimentRated. The HybridRetriever should combine a DeepRetriever with a normal InvertedIndex to produce a simple weighted linear combination of their results (a sketch of this combination appears after the argument description below). The evaluation code can be fairly easily produced by properly combining code from the deep and original versions of Experiment and ExperimentRated. The main methods for the Hybrid Experiment classes should take the following args:

Command args:  [DIR] [EMBEDDIR] [QUERIES] [QUERYDIR] [LAMBDA] [OUTFILE] where:
DIR is the name of the directory whose files should be indexed.
EMBEDDIR is the name of the directory whose files have embeddings of the documents in DIR
QUERIES is a file of queries paired with relevant docs (see queryFile).
QUERYDIR is the name of the directory where the query embeddings are stored in files Q1...Qn
LAMBDA is the weight [0,1] to put on the deep cos similarity with (1-LAMBDA) on the VSR cosine sim
OUTFILE is the name of the file in which to put the output. The plot data for the recall-precision curve is
       stored in this file, and a gnuplot file for the graph has the same name with a ".gplot" extension
For your experiments with HybridRetriever, use the SPECTER2 Base embeddings and cosine similarity for the DeepRetriever (so it uses the same general similarity metric as InvertedIndex, constrained to be between 0 and 1).
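The score combination itself is simple. The following is a minimal sketch of the idea, not the required HybridRetriever interface; the class and method names are hypothetical, and it assumes both component retrievers return cosine similarities in [0, 1] keyed by document name:

    import java.util.*;

    // Minimal sketch of hybrid scoring: lambda * deepScore + (1 - lambda) * vsrScore.
    // Assumes deepScores and vsrScores map document names to cosine similarities in [0, 1];
    // documents missing from one map are treated as scoring 0 for that component.
    public class HybridScorer {

        static List<String> hybridRank(Map<String, Double> deepScores,
                                       Map<String, Double> vsrScores,
                                       double lambda) {
            Set<String> docs = new HashSet<>(deepScores.keySet());
            docs.addAll(vsrScores.keySet());

            Map<String, Double> combined = new HashMap<>();
            for (String doc : docs) {
                double d = deepScores.getOrDefault(doc, 0.0);
                double s = vsrScores.getOrDefault(doc, 0.0);
                combined.put(doc, lambda * d + (1 - lambda) * s);
            }

            List<String> ranked = new ArrayList<>(docs);
            ranked.sort((a, b) -> Double.compare(combined.get(b), combined.get(a)));
            return ranked;
        }
    }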

Evaluation

Once you have the embeddings, you can use the ir.eval.DeepExperimentRated class to evaluate them. The commands to be run are as follows: In addition, try the '-cosine' version of DeepRetriever on the Base and Adapter models, and for the Hybrid model try alternative values of the λ hyperparameter: 0.3, 0.5, 0.7, 0.8, 0.9. Note that the hybrid approach uses cosine similarity by default, so do not specify it as an argument.
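For example, an invocation of the hybrid rated experiment following the argument order given above would look something like the line below; the bracketed paths are placeholders for wherever your copies of the CF documents, embeddings, and rated queries live:

    java ir.eval.HybridExperimentRated [cf-doc-dir] [specter2-base-embed-dir] [queries-rated-file] [query-embed-dir] 0.5 hybrid-0.5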

Results

You can use these gplot files to generate the plots for your report: all-deep.gplot, all-deep.ndcg.gplot, all.gplot, all.ndcg.gplot. You can check the sample plots for the base model and adapter model for PR and NDCG here. For your final results, generate one set of basic deep-retrieval PR and NDCG results including VSR, SPECTER2-Base, SPECTER2-Adapt, SPECTER2-Base(cosine), and SPECTER2-Adapt(cosine). Generate another set of Hybrid PR and NDCG results including VSR, SPECTER2-Base(cosine), and Hybrid using SPECTER2-Base(cosine) combined with VSR with alternative λ values (0.3, 0.5, 0.7, 0.8, 0.9).

Report

Your report should summarize and analyze the results of your experiments. Present the results in well-organized graphs (you can shrink the graphs and insert them into your report). Include at least the 4 graphs (2 PR curves and 2 NDCG graphs) for the combinations of results specified above. Also answer at least the following questions (answer them explicitly, i.e. in your report put "Q1. How...?" and then give the answer underneath, so that we do not have to search for your answers):
  1. How does using the SPECTER2 embeddings compare to the VSR baseline in terms of retrieval accuracy (both PR and NDCG)?
  2. Is there a difference with and without using the adapters?
  3. Why do you think classic VSR might still be outperforming these modern deep-learning methods (designed specifically for scientific documents) on this particular scientific corpus?
  4. How does using Euclidean distance vs. cosine similarity affect the results for the two deep models?
  5. Does the hybrid solution improve over both methods? Why or why not?
  6. What seems to be the best value of the hyperparameter λ?
The report does not have to stay within the usual 2-page limit, since the graphs take up a lot of space.

Submission

You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 4:

Please make sure that your code compiles and runs on the UTCS lab machines.

Grading Criteria