Project 4
CS 371R: Information Retrieval and Web Search
Evaluating
Embeddings From Deep Language Models
Due: 11:59pm, Dec. 2, 2024
This project will explore using embeddings from an LLM to support
standard document retrieval. You will use document and query
embeddings from a recent LLM specialized for scientific literature,
stored as precomputed dense vectors. You will then use Java code that
augments the existing course IR code to support retrieval using these
precomputed dense vectors, and experimentally evaluate the deep
embeddings. Finally, you will implement and test a "hybrid" approach
that combines this dense-retrieval approach with the existing VSR
system, and experimentally evaluate whether it improves the results
compared to purely sparse or dense retrieval alone.
Generated Deep Embeddings
Overview
We will evaluate deep dense embeddings generated from a
pre-trained transformer language model on the Cystic Fibrosis (CF)
dataset introduced in Project 2. Specifically, we will be using
the SPECTER2
transformer model, an LLM trained on scientific paper
abstracts. SPECTER2 is first trained on over 6M triplets of
scientific paper citations, after which it is trained with additional
task-specific adapter modules. You can read more about the basic SPECTER
approach in this paper.
The pre-trained LM is capable of
generating task-specific embeddings for scientific tasks when paired
with adapters. Given the combined title and abstract of a
scientific paper, or a short textual query, the model can be used to
generate effective embeddings for use in downstream applications.
Read through
the model
card on HuggingFace to understand the model better.
In this project, along with the SPECTER2 Base Model, we will use the Proximity Adapter to embed documents and the Adhoc Query Adapter to embed queries.
We will compare two variants of SPECTER2 embeddings: one with just the base model, and one with adapter modules attached to the base model.
Each of the two cases is described below.
- SPECTER2 Base Model - This is the SPECTER2 base model trained on a range of scientific document tasks from a dataset called SciRepEval.
- SPECTER2 with Adapter Modules - This is the SPECTER2 model with task-specific adapter modules attached to the base model that are specialized for encoding documents and queries for search retrieval.
We have precomputed embeddings for the CF documents and queries stored as space-separated 768-dim vectors in the following files:
- Base Model Embeddings:
- Docs:
/u/mooney/ir-code/embeddings/specter2_base/docs/RN-00001...RN-01239
- Queries:
/u/mooney/ir-code/embeddings/specter2_base/queries/Q001...Q100
- Adapter Model Embeddings:
- Docs:
/u/mooney/ir-code/embeddings/specter2_adapter/docs/RN-00001...RN-01239
- Queries:
/u/mooney/ir-code/embeddings/specter2_adapter/queries/Q001...Q100
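Each embedding file is just whitespace-separated real numbers, so it can be read with a few lines of Java. Below is a minimal loading sketch; EmbeddingLoader is a hypothetical helper, not part of the course code (in the course code, DeepDocumentReference plays this role):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class EmbeddingLoader {
        /** Loads one space-separated embedding file into a double array. */
        public static double[] load(String fileName) throws IOException {
            String contents = new String(Files.readAllBytes(Paths.get(fileName)));
            String[] tokens = contents.trim().split("\\s+");
            double[] vector = new double[tokens.length];  // 768 dimensions for SPECTER2
            for (int i = 0; i < tokens.length; i++)
                vector[i] = Double.parseDouble(tokens[i]);
            return vector;
        }
    }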
Deep Retrieval
The course code in /u/mooney/ir-code/ir/ has been augmented with a
class ir.vsr.DeepRetriever, a simple document retriever that uses
precomputed dense document embeddings. It takes a directory of
precomputed document embeddings that have the same file names as the
original corpus, where each file contains a simple space-separated list
of real values representing the document embedding. Another new class
used by the deep retriever is DeepDocumentReference, which stores a
pointer to a file along with its dense vector and the precomputed
vector length (L2 norm). The 'retrieve' method returns
a list of ranked retrievals for a query represented as a
DeepDocumentReference, using either Euclidean distance (the default) or cosine
similarity (if the '-cosine' flag is used) to compare the dense vectors.
No indexing is used to improve efficiency: a query is compared to
every document in the corpus, and all of the scored results are
included in the array of Retrievals. Ideally, some form of
approximate nearest-neighbor search, such as locality-sensitive
hashing, would be used; however, for the limited number of small documents
and queries in the CF corpus, a brute-force approach is tractable.
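Conceptually, the brute-force scoring amounts to something like the following sketch. This is a simplification, not the actual DeepRetriever code; in particular, the real code may rank by ascending distance rather than negating it:

    // Sketch: brute-force scoring of dense vectors (not the actual DeepRetriever code).
    public class DenseScorer {
        /** Scores document vector d against query vector q; higher is better. */
        public static double score(double[] q, double[] d, boolean cosine) {
            double dot = 0, qNorm = 0, dNorm = 0, dist2 = 0;
            for (int i = 0; i < q.length; i++) {
                dot += q[i] * d[i];
                qNorm += q[i] * q[i];
                dNorm += d[i] * d[i];
                double diff = q[i] - d[i];
                dist2 += diff * diff;
            }
            if (cosine)
                return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm)); // in [-1,1], higher is better
            else
                return -Math.sqrt(dist2); // negated Euclidean distance, so higher is better
        }
    }

Ranking then just sorts every document in the corpus by descending score.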
'Deep' versions of the Experiment and ExperimentRated classes used in Project 2
are also provided in ir.eval.DeepExperiment and ir.eval.DeepExperimentRated;
they produce precision-recall and NDCG plots evaluating the
DeepRetriever. These use the normal 'queries' and 'queries-rated'
files used by the normal experiment code, but also take a directory of
query embeddings containing one embedding file (a list of
real values) per query. The query embedding directory should have file names
Q1,...,Qn giving the embeddings of the queries in the order they appear in
the original 'queries' file. The files will be lexicographically
sorted by name, and this order should correspond to the order in the
queryFile, so file names should have leading '0's as needed to sort
properly, i.e. Q001, Q002, ..., Q099, Q100.
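For example, if you ever generate such files yourself, zero-padded names that sort correctly can be produced with String.format (illustrative fragment only):

    // Illustrative fragment: generating zero-padded, lexicographically sortable names.
    for (int i = 1; i <= 100; i++) {
        String name = String.format("Q%03d", i);  // Q001, Q002, ..., Q100
        // ... write the i-th query's embedding to a file with this name ...
    }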
A sample DeepExperimentRated trace is here.
Hybrid Retrieval
Disappointingly, my initial results (shown below) were slightly worse
than the baseline VSR system, except for slightly improved
precision at high recall values. Therefore, I hypothesized that
combining the normal VSR approach with the deep-learning approach in a
"hybrid" method might work best. A simple hybrid approach
ranks retrieved documents by a weighted linear combination of
the dense-vector deep cosine similarity and the normal VSR cosine
similarity, i.e. λD + (1 - λ)S, where D is the deep
dense cosine similarity and S is the sparse TF/IDF BOW cosine similarity.
Implement and evaluate such a simple hybrid approach by writing the
following classes: ir.vsr.HybridRetriever, ir.eval.HybridExperiment,
and ir.eval.HybridExperimentRated.
The HybridRetriever should combine a DeepRetriever with a normal InvertedIndex to
produce a simple weighted linear combination of their results. The evaluation code can be fairly
easily written by properly combining code from the deep and original
versions of Experiment and ExperimentRated. The main methods for the Hybrid Experiment classes should take the
following args:
Command args: [DIR] [EMBEDDIR] [QUERIES] [QUERYDIR] [LAMBDA] [OUTFILE] where:
DIR is the name of the directory whose files should be indexed.
EMBEDDIR is the name of the directory whose files contain embeddings of the documents in DIR.
QUERIES is a file of queries paired with relevant docs (see queryFile).
QUERYDIR is the name of the directory where the query embeddings are stored in files Q1...Qn.
LAMBDA is the weight in [0,1] to put on the deep cosine similarity, with (1 - LAMBDA) on the VSR cosine similarity.
OUTFILE is the name of the file in which to put the output. The plot data for the recall-precision curve is
stored in this file, and a gnuplot file for the graph has the same name with a ".gplot" extension.
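One plausible way to structure the score combination inside HybridRetriever is sketched below. This is only an illustration under the assumption that each retriever's results have been collected into a map from document name to cosine score in [0,1]; deepScores, vsrScores, and HybridScoring are hypothetical names, not part of the course code:

    import java.util.*;

    public class HybridScoring {
        /** Sketch: weighted linear combination of dense and sparse cosine scores. */
        public static Map<String, Double> combine(Map<String, Double> deepScores,
                                                  Map<String, Double> vsrScores,
                                                  double lambda) {
            Set<String> allDocs = new HashSet<>(deepScores.keySet());
            allDocs.addAll(vsrScores.keySet());
            Map<String, Double> hybrid = new HashMap<>();
            for (String doc : allDocs) {
                double d = deepScores.getOrDefault(doc, 0.0);  // deep cosine similarity
                double s = vsrScores.getOrDefault(doc, 0.0);   // sparse TF/IDF cosine similarity
                hybrid.put(doc, lambda * d + (1 - lambda) * s);
            }
            return hybrid;  // rank documents by descending hybrid score
        }
    }

Note that a document missing from one retriever's results is treated here as having score 0 for that component; how you handle such documents is a design decision worth explaining in your report.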
For your experiments with HybridRetriever, use the SPECTER2 Base embeddings and cosine similarity for the DeepRetriever (so it uses the same general similarity metric as InvertedIndex, constrained to be between 0 and 1).
Evaluation
Using the precomputed embeddings, you can run the ir.eval.DeepExperimentRated
class (and your hybrid versions) to evaluate them. The commands to be run are as follows:
- VSR Model:
java ir.eval.ExperimentRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated results/vsr
- Base Model:
java ir.eval.DeepExperimentRated /u/mooney/ir-code/embeddings/specter2_base/docs /u/mooney/ir-code/queries/cf/queries-rated /u/mooney/ir-code/embeddings/specter2_base/queries results/specter2_base
- Adapter Model:
java ir.eval.DeepExperimentRated /u/mooney/ir-code/embeddings/specter2_adapter/docs /u/mooney/ir-code/queries/cf/queries-rated /u/mooney/ir-code/embeddings/specter2_adapter/queries results/specter2_adapter
- Hybrid Model:
java ir.eval.HybridExperimentRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/embeddings/specter2_base/docs /u/mooney/ir-code/queries/cf/queries-rated /u/mooney/ir-code/embeddings/specter2_base/queries 0.5 results/hybrid05
In addition, try the '-cosine' version of DeepRetriever on the Base and
Adapter models, and for the Hybrid model try alternative values of the λ hyperparameter: 0.3, 0.5, 0.7, 0.8, and 0.9. Note that the hybrid should use cosine by default, so don't specify it as an argument. A sample '-cosine' run is sketched below.
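For example, a '-cosine' run on the base model might look like the following (assuming the flag comes before the positional arguments, as with other flags in the course code; check the class's usage message to confirm):
java ir.eval.DeepExperimentRated -cosine /u/mooney/ir-code/embeddings/specter2_base/docs /u/mooney/ir-code/queries/cf/queries-rated /u/mooney/ir-code/embeddings/specter2_base/queries results/specter2_base_cos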
Results
You can use these gplot files to generate the plots for your report: all-deep.gplot, all-deep.ndcg.gplot, all.gplot, all.ndcg.gplot
You can check the sample plots for the base model and adapter model for PR and NDCG here.
For your final results, generate one set of basic deep retrieval PR and NDCG results including VSR, SPECTER2-Base, SPECTER2-Adapt, SPECTER2-Base(cosine), and SPECTER2-Adapt(cosine). Generate another set of Hybrid PR and NDCG results including VSR, SPECTER2-Base(cosine), and Hybrid using SPECTER2-Base(cosine) combined with VSR using alternative λ values (0.3, 0.5, 0.7, 0.8, 0.9).
Report
Your report should summarize and analyze
the results of your experiments. Present the results in well-organized
graphs (you can shrink the graphs and insert them into your
report). Include at least the 4 graphs (2 PR curves and 2 NDCG graphs)
for the combinations of results specified above. Also answer at least
the following questions. (You should explicitly answer these
questions, i.e. in your report put "Q1. How..?" and then give the
answer underneath, so that we do not have to search for your
answers.)
- How does using the SPECTER2 embeddings compare to the VSR baseline in terms of retrieval accuracy (both PR and NDCG)?
- Is there a difference with and without using the adapters?
- Why do you think classic VSR might still be out-performing these modern deep-learning methods (designed specifically for scientific documents) on this particular scientific corpus?
- How does using Euclidean distance vs. cosine similarity affect the results for the two deep models?
- Does the hybrid solution improve over both methods? Why or why not?
- What seems to be the best value of the hyperparameter λ?
The report does not have to stay within the usual 2-page limit, since the graphs take up a lot of space.
Submission
You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on
submitting projects on the course homepage. Along with that, follow these specific instructions for Project 4:
- Create at least the following new classes described above:
- ir.vsr.HybridRetriever
- ir.eval.HybridExperiment
- ir.eval.HybridExperimentRated
- For this assignment, you need to submit the following files:
  - code/ - A folder containing all your code, including added and modified *.java and *.class files. Please do not modify the original Java files; instead, extend each class and override the appropriate methods.
  - report.pdf - A PDF report of your experiments as described above, with the plots referenced in the instructions.
  - results/ - A folder containing the data files used to generate your plots, with the full contents listed below.
Make sure that these files match the output of the code that you submit.
The code folder should have at least the following contents:
Name
---------------------------------------
ir/vsr/HybridRetriever.java
ir/eval/HybridExperiment.java
ir/eval/HybridExperimentRated.java
The results folder should have these contents:
Name
---------------------------------------
vsr vsr.ndcg
specter2_base specter2_base.ndcg
specter2_adapter specter2_adapter.ndcg
specter2_base_cos specter2_base_cos.ndcg
specter2_adapter_cos specter2_adapter_cos.ndcg
hybrid03 hybrid03.ndcg
hybrid05 hybrid05.ndcg
hybrid07 hybrid07.ndcg
hybrid08 hybrid08.ndcg
hybrid09 hybrid09.ndcg
Please make sure that your code compiles and runs on the UTCS lab machines.
Grading Criteria
- 50%: Working code that correctly implements the hybrid method and its experimental evaluation.
- 15%: Efficient code with good programming style, necessary comments, intuitive variable/function names, and appropriate indentation.
- 35%: Quality of report: clear presentation of results, good analysis & discussion, and your answers to the questions above.