SimilaR: R Code Clone and Plagiarism Detection

Third-party software for assuring source code quality is becoming increasingly popular. Tools that evaluate the coverage of unit tests, perform static code analysis, or inspect run-time memory use are crucial in the software development life cycle. More sophisticated methods allow for performing meta-analyses of large software repositories, e.g., to discover abstract topics they relate to or common design patterns applied by their developers. Such methods may be useful in gaining a better understanding of component interdependencies, avoiding cloned code, and detecting plagiarism in programming classes. A meaningful measure of similarity of computer programs often forms the basis of such tools. While there are a few noteworthy instruments for similarity assessment, none of them turns out to be particularly suitable for analysing R code chunks. Existing solutions rely on rather simple techniques and heuristics and fail to provide a user with the kind of sensitivity and specificity required for working with R scripts. In order to fill this gap, we propose a new algorithm based on a Program Dependence Graph, implemented in the SimilaR package. It can serve as a tool not only for improving R code quality but also for detecting plagiarism, even when it has been masked by applying some obfuscation techniques or inserting dead code. We demonstrate its accuracy and efficiency in a real-world case study.

Maciej Bartoszuk, Marek Gagolewski
2020-09-10

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2020-017.zip

CRAN packages used

magrittr, SimilaR, nortest, DescTools

CRAN Task Views implied by cited packages

MissingData, WebTechnologies

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bartoszuk & Gagolewski, "SimilaR: R Code Clone and Plagiarism Detection", The R Journal, 2020

BibTeX citation

@article{RJ-2020-017,
  author = {Bartoszuk, Maciej and Gagolewski, Marek},
  title = {SimilaR: R Code Clone and Plagiarism Detection},
  journal = {The R Journal},
  year = {2020},
  note = {https://doi.org/10.32614/RJ-2020-017},
  doi = {10.32614/RJ-2020-017},
  volume = {12},
  issue = {1},
  issn = {2073-4859},
  pages = {367-385}
}