reclin2: a Toolkit for Record Linkage and Deduplication

The goal of record linkage and deduplication is to detect which records belong to the same object in data sets where the identifiers of the objects contain errors and missing values. The main design considerations of reclin2 are modularity/flexibility, speed, and the ability to handle large data sets. The first of these makes it easy for users to extend the package with custom process steps. This flexibility is obtained by using simple data structures and by following common interfaces in R as closely as possible. For large problems it is possible to distribute the work over multiple worker nodes. A benchmark comparison with other record linkage packages for R shows that, for this specific benchmark, the fastLink package performs best. However, that package supports only one specific type of record linkage model. The performance of reclin2 is not far behind that of fastLink, while allowing for much greater flexibility.

D. Jan van der Laan (Statistics Netherlands (CBS))
2022-10-11

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2022-038.zip

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Laan, "reclin2: a Toolkit for Record Linkage and Deduplication", The R Journal, 2022

BibTeX citation

@article{RJ-2022-038,
  author = {Laan, D. Jan van der},
  title = {reclin2: a Toolkit for Record Linkage and Deduplication},
  journal = {The R Journal},
  year = {2022},
  note = {https://doi.org/10.32614/RJ-2022-038},
  doi = {10.32614/RJ-2022-038},
  volume = {14},
  issue = {2},
  issn = {2073-4859},
  pages = {320-328}
}