The stringdist Package for Approximate String Matching

Comparing text strings in terms of distance functions is a common and fundamental task in many statistical text-processing applications. Thus far, string distance functionality has been somewhat scattered around R and its extension packages, leaving users with inconsistent interfaces and encoding handling. The stringdist package was designed to offer a low-level interface to several popular string distance algorithms, which have been re-implemented in C for this purpose. The package offers distances based on counting q-grams, edit-based distances, and some lesser-known heuristic distance functions. Based on this functionality, the package also offers inexact matching equivalents of R’s native exact matching functions match and %in%.
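
As a rough illustration of the three distance families and the approximate matching functions mentioned above, the following minimal sketch assumes the stringdist package is installed from CRAN. The example strings, the chosen methods, and the maxDist value are illustrative assumptions and are not taken from the article.

```r
library(stringdist)

# Edit-based distance: Levenshtein distance between two words.
d_lv <- stringdist("kitten", "sitting", method = "lv")

# q-gram based distance: compares counts of substrings of length q.
d_qg <- stringdist("abc", "abd", method = "qgram", q = 1)

# Heuristic distance: Jaro-Winkler, scaled to lie between 0 and 1.
d_jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate equivalents of match() and %in%:
# allow a distance of at most 1 between "helo" and a table entry.
lookup <- c("world", "hello")
idx    <- amatch("helo", lookup, maxDist = 1)  # index of the closest match
found  <- ain("helo", lookup, maxDist = 1)     # logical, like %in%

print(c(lv = d_lv, qgram = d_qg, jw = d_jw))
print(idx)
print(found)
```

Here amatch returns the position of the closest table entry within maxDist (or NA if none qualifies), mirroring match, while ain returns a logical value, mirroring %in%.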

Mark P.J. van der Loo

CRAN packages used

kernlab, RecordLinkage, MiscPsycho, cba, MKmisc, deducorrect, vwr, stringdist, textcat, TraMineR

CRAN Task Views implied by cited packages

OfficialStatistics, Cluster, NaturalLanguageProcessing, Graphics, MachineLearning, Multivariate, Optimization, Survival


Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".


For attribution, please cite this work as

Loo, "The R Journal: The stringdist Package for Approximate String Matching", The R Journal, 2014

BibTeX citation

  author = {Loo, Mark P.J. van der},
  title = {The R Journal: The stringdist Package for Approximate String Matching},
  journal = {The R Journal},
  year = {2014},
  note = {},
  doi = {10.32614/RJ-2014-011},
  volume = {6},
  issue = {1},
  issn = {2073-4859},
  pages = {111-122}