Comparing text strings in terms of distance functions is a common and fundamental task in many statistical text-processing applications. Thus far, string distance functionality has been somewhat scattered around R and its extension packages, leaving users with inconsistent interfaces and encoding handling. The stringdist package was designed to offer a low-level interface to several popular string distance algorithms, which have been re-implemented in C for this purpose. The package offers distances based on counting q-grams, edit-based distances, and some lesser-known heuristic distance functions. Based on this functionality, the package also offers inexact matching equivalents of R’s native exact matching functions match and %in%.
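To make the distance families named above concrete, the following is a minimal Python sketch of an edit-based distance (Levenshtein), a q-gram distance, and inexact analogues of exact matching. It is not the package's C implementation or its R interface: the function names levenshtein, qgram_distance, approx_match and approx_in, and the max_dist parameter, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit-based distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """q-gram distance: sum of absolute differences between the
    counts of each length-q substring in a and in b."""
    ca = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    cb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def approx_match(x, table, max_dist=1, dist=levenshtein):
    """Inexact analogue of an exact lookup: index of the closest entry
    of table within max_dist of x, or None if there is none.
    (Hypothetical helper; indices are 0-based, unlike R.)"""
    best, best_d = None, max_dist + 1
    for idx, candidate in enumerate(table):
        d = dist(x, candidate)
        if d < best_d:
            best, best_d = idx, d
    return best


def approx_in(x, table, max_dist=1, dist=levenshtein):
    """Inexact analogue of a membership test."""
    return approx_match(x, table, max_dist, dist) is not None
```

With these definitions, approx_match("leia", ["uhura", "leela"], max_dist=2) finds the second entry, because "leia" and "leela" are two edits apart, while the same lookup with max_dist=1 finds nothing. Passing dist=qgram_distance swaps in the q-gram distance without changing the matching logic.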
kernlab, RecordLinkage, MiscPsycho, cba, MKmisc, deducorrect, vwr, stringdist, textcat, TraMineR
OfficialStatistics, Cluster, NaturalLanguageProcessing, Graphics, MachineLearning, Multivariate, Optimization, Survival
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Loo, "The stringdist Package for Approximate String Matching", The R Journal, 2014
BibTeX citation
@article{RJ-2014-011,
  author = {Loo, Mark P.J. van der},
  title = {The stringdist Package for Approximate String Matching},
  journal = {The R Journal},
  year = {2014},
  note = {https://doi.org/10.32614/RJ-2014-011},
  doi = {10.32614/RJ-2014-011},
  volume = {6},
  issue = {1},
  issn = {2073-4859},
  pages = {111-122}
}