The RecordLinkage Package: Detecting Errors in Data

Record linkage deals with detecting homonyms and, above all, synonyms in data. The package RecordLinkage provides means to perform and evaluate different record linkage methods. A stochastic framework is implemented which calculates weights through an EM algorithm. The thresholds required by this model can be determined with tools from extreme value theory. Furthermore, machine learning methods are utilized, including decision trees (rpart), bootstrap aggregating (bagging), AdaBoost (ada), neural nets (nnet) and support vector machines (svm). The generation of record pairs and comparison patterns from single data items is provided as well. Comparison patterns can be chosen to be binary or based on string metrics. In order to reduce computation time and memory usage, blocking can be used. Future development will concentrate on additional and refined methods, performance improvements and input/output facilities needed for real-world application.
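
To make the terms "blocking" and "comparison pattern" concrete, the following is a minimal, language-neutral sketch in Python. It does not use the RecordLinkage package or reproduce its API: the toy records, the field names, and the helper functions `candidate_pairs`, `binary_pattern` and `fuzzy_pattern` are all hypothetical, and the standard-library `difflib.SequenceMatcher` ratio stands in for whatever string metric one would actually choose.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; names and fields are illustrative only.
records = [
    {"id": 1, "first": "CARSTEN", "last": "MEIER", "birth_year": 1949},
    {"id": 2, "first": "CARSTEN", "last": "MEYER", "birth_year": 1949},
    {"id": 3, "first": "GERD", "last": "BAUER", "birth_year": 1968},
    {"id": 4, "first": "GERD", "last": "BAUER", "birth_year": 1968},
    {"id": 5, "first": "ROBERT", "last": "HARTMANN", "birth_year": 1930},
]
FIELDS = ("first", "last", "birth_year")


def candidate_pairs(recs, block_field):
    """Blocking: only pair up records that agree on block_field."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[rec[block_field]].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)


def binary_pattern(a, b):
    """Binary comparison pattern: 1 where a field agrees exactly, else 0."""
    return tuple(int(a[f] == b[f]) for f in FIELDS)


def fuzzy_pattern(a, b):
    """String-metric pattern: similarity in [0, 1] per field.

    SequenceMatcher.ratio() is a stand-in metric chosen for this sketch.
    """
    return tuple(
        SequenceMatcher(None, str(a[f]), str(b[f])).ratio() for f in FIELDS
    )


# Blocking on birth_year leaves 2 candidate pairs instead of all 10.
pairs = list(candidate_pairs(records, "birth_year"))
patterns = {(a["id"], b["id"]): binary_pattern(a, b) for a, b in pairs}
```

In this sketch the pair (1, 2) agrees on first name and birth year but not on last name, so its binary pattern is `(1, 0, 1)`, while a string-metric pattern would assign the near-identical last names a value strictly between 0 and 1. The resulting patterns are the kind of input a weight-based or machine-learning classifier would then label as link or non-link.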

Murat Sariyar, Andreas Borg


Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from ...".


For attribution, please cite this work as

Sariyar & Borg, "The RecordLinkage Package: Detecting Errors in Data", The R Journal, 2010

BibTeX citation

  author = {Sariyar, Murat and Borg, Andreas},
  title = {The RecordLinkage Package: Detecting Errors in Data},
  journal = {The R Journal},
  year = {2010},
  note = {},
  doi = {10.32614/RJ-2010-017},
  volume = {2},
  issue = {2},
  issn = {2073-4859},
  pages = {61-67}