The R Journal: article published in 2017, volume 9:2

A Tidy Data Model for Natural Language Processing using cleanNLP
Taylor Arnold, The R Journal (2017) 9:2, pages 248-267.

Abstract: Recent advances in natural language processing have produced libraries that extract low-level features from a collection of raw texts. These features, known as annotations, are usually stored internally in hierarchical, tree-based data structures. This paper proposes a data model to represent annotations as a collection of normalized relational data tables optimized for exploratory data analysis and predictive modeling. The R package cleanNLP, which calls one of two state-of-the-art NLP libraries (CoreNLP or spaCy), is presented as an implementation of this data model. It takes raw text as input and returns a list of normalized tables. Specific annotations provided include tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, dependency parsing, coreference resolution, and word embeddings. The package currently supports input text in English, German, French, and Spanish.

Received: 2017-03-27; online 2017-06-28; supplementary material (3.4 KiB)
CRAN packages: dplyr, ggplot2, magrittr, broom, janitor, tidyr, cleanNLP, tidytext, StanfordCoreNLP, coreNLP, XML, spacyr, NLP, lda, lsa, topicmodels, sqliter, rJava, sotu, glmnet
CRAN Task Views cited directly: NaturalLanguageProcessing
CRAN Task Views implied by cited CRAN packages: NaturalLanguageProcessing, WebTechnologies, Graphics, HighPerformanceComputing, MachineLearning, Phylogenetics, Survival

CC BY 4.0
This article and supplementary materials are licensed under a Creative Commons Attribution 4.0 International license.

  author = {Taylor Arnold},
  title = {{A Tidy Data Model for Natural Language Processing using cleanNLP}},
  year = {2017},
  journal = {{The R Journal}},
  doi = {10.32614/RJ-2017-035},
  url = {},
  pages = {248--267},
  volume = {9},
  number = {2}