Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
The following datasets have been kindly provided for evaluating duplicate
detection, record linkage, and identity uncertainty systems. Several of
these are not yet available for downloading; please contact the authors.
If you can contribute other labeled datasets for this problem, please send them over;
fellow researchers would greatly appreciate it.
- UIS Database Generator - generates a
list of randomly perturbed names and US mailing addresses. Written by
Mauricio Hernández.
- Cora - a segmented citation dataset
based on the Cora research paper search engine. Provided by William Cohen.
- SecondString sets - a collection of 14 single-field
datasets provided with the SecondString package by
William Cohen.
- Restaurant - a collection of 864 restaurant records from the
Fodor's and Zagat's restaurant guides, containing 112 duplicates.
Includes both segmented and unsegmented versions. Provided by Sheila Tejada.
- Citeseer - four citation datasets
from the Citeseer scientific
literature digital library. Provided by Steve Lawrence.
- DBLP - citation datasets
based on the DBLP
computer science bibliography. Provided by Patrick Reuther.
Last modified: August 25, 2003