The R Journal: article published in 2020, volume 12:1

SimilaR: R Code Clone and Plagiarism Detection
Maciej Bartoszuk and Marek Gagolewski, The R Journal (2020) 12:1, pages 367-385.

Abstract Third-party software for assuring source code quality is becoming increasingly popular. Tools that evaluate the coverage of unit tests, perform static code analysis, or inspect run-time memory use are crucial in the software development life cycle. More sophisticated methods allow for performing meta-analyses of large software repositories, e.g., to discover abstract topics they relate to or common design patterns applied by their developers. They may be useful in gaining a better understanding of the component interdependencies, avoiding cloned code as well as detecting plagiarism in programming classes. A meaningful measure of similarity of computer programs often forms the basis of such tools. While there are a few noteworthy instruments for similarity assessment, none of them turns out particularly suitable for analysing R code chunks. Existing solutions rely on rather simple techniques and heuristics and fail to provide a user with the kind of sensitivity and specificity required for working with R scripts. In order to fill this gap, we propose a new algorithm based on a Program Dependence Graph, implemented in the SimilaR package. It can serve as a tool not only for improving R code quality but also for detecting plagiarism, even when it has been masked by applying some obfuscation techniques or imputing dead code. We demonstrate its accuracy and efficiency in a real-world case study.
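To illustrate why a structural comparison can see through simple obfuscation such as renaming variables, here is a minimal Python sketch. It is not the Program Dependence Graph algorithm proposed in the article, nor the SimilaR API: it is a much simpler stand-in that rewrites every identifier in a syntax tree to a positional placeholder and then compares the resulting trees. The helper names (`signature`, `same_structure`) and the sample snippets are hypothetical and exist only for this illustration.

```python
import ast


class _Normalise(ast.NodeTransformer):
    """Rename every identifier to a placeholder based on first appearance."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # The first distinct name becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        return ast.copy_location(
            ast.Name(id=self._alias(node.id), ctx=node.ctx), node
        )

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = "f"  # the function's own name is irrelevant here
        self.generic_visit(node)
        return node


def signature(source):
    """Return a textual dump of the identifier-normalised syntax tree."""
    tree = _Normalise().visit(ast.parse(source))
    return ast.dump(tree)


def same_structure(a, b):
    """True if two snippets differ at most in the names they use."""
    return signature(a) == signature(b)


# Hypothetical sample snippets (assumptions for illustration only).
original = (
    "def mean(values):\n"
    "    total = 0\n"
    "    for x in values:\n"
    "        total = total + x\n"
    "    return total / len(values)\n"
)
renamed = (
    "def avg(data):\n"
    "    acc = 0\n"
    "    for item in data:\n"
    "        acc = acc + item\n"
    "    return acc / len(data)\n"
)
different = (
    "def mean(values):\n"
    "    return sum(values) / len(values)\n"
)

print(same_structure(original, renamed))
print(same_structure(original, different))
```

A comparison of this kind is still defeated by reordered statements or inserted dead code, which is the sort of masking the abstract says a dependence-graph approach is meant to withstand.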

Received: 2020-04-01; online 2020-09-10
CRAN packages: magrittr, SimilaR, nortest, DescTools
CRAN Task Views implied by cited CRAN packages: MissingData, WebTechnologies

This article is licensed under a Creative Commons Attribution 4.0 International license.

  author = {Maciej Bartoszuk and Marek Gagolewski},
  title = {{SimilaR: R Code Clone and Plagiarism Detection}},
  year = {2020},
  journal = {{The R Journal}},
  doi = {10.32614/RJ-2020-017},
  url = {},
  pages = {367--385},
  volume = {12},
  number = {1}