Recent advances in natural language processing have produced libraries that extract low-level features from a collection of raw texts. These features, known as annotations, are usually stored internally in hierarchical, tree-based data structures. This paper proposes a data model to represent annotations as a collection of normalized relational data tables optimized for exploratory data analysis and predictive modeling. The R package cleanNLP, which calls one of two state-of-the-art NLP libraries (CoreNLP or spaCy), is presented as an implementation of this data model. It takes raw text as an input and returns a list of normalized tables. Specific annotations provided include tokenization, part of speech tagging, named entity recognition, sentiment analysis, dependency parsing, coreference resolution, and word embeddings. The package currently supports input text in English, German, French, and Spanish.
There has been an ongoing trend towards converting raw data into a collection of normalized tables prior to conducting further analyses. This paradigm, recently popularized by Hadley Wickham under the term “tidy data” (Wickham 2014), draws on concepts from database and visualization theory to provide a welcome theoretical basis for data analysis. There are also many pragmatic benefits to putting data into a set of normalized tables prior to beginning an exploratory analysis or building inferential models. When working with normalized data, most modeling, data manipulation, and visualization tasks can be described using a small collection of functions. This makes code more readable and less error-prone, and allows for better code reuse. As many of these simple functions reduce to basic database operations, this style of coding can simplify the task of integrating statistical models into a production codebase. Also, normalized tables can be stored unambiguously as delimited plain text flat files, allowing for interoperability between programming languages and users.
As both a cause and result of the popularity of this approach, a number of software packages have been developed to help construct and manipulate collections of normalized data tables. In R, well-known examples include dplyr (Wickham and Francois 2016), ggplot2 (Wickham 2009), magrittr (Bache and Wickham 2014), broom (Robinson 2017), janitor (Firke 2016), and tidyr (Wickham 2017). On the Python side, much of this functionality is included within the pandas (McKinney et al. 2010) and sklearn (Pedregosa et al. 2011) modules.
While cleaning messy data is often a time-consuming task, deciding on a specific normalized schema for representing a set of inputs is in most cases relatively straightforward. Outside of potentially removing outliers, missing data, and bad inputs, the process of tidying data is generally a lossless procedure. At a high level, data tidying is often simply a reorganization of the raw inputs. However, if we are working with unstructured data such as collections of text, images, or sound, converting into a normalized tabular format is significantly more involved. The process of tidying in these cases becomes synonymous with featurization, whereby structured outputs are algorithmically extracted from a raw input. For example, from an audio music file we might extract features such as the overall length, beats per minute, and quantiles of the music’s loudness.
The featurization of raw text, known in natural language processing as text annotation, includes tasks such as tokenization (splitting text into words), part of speech tagging, and named entity recognition. Recent advancements in neural networks and heavy investment from both industry and academia have produced fast and highly accurate annotation libraries such as Stanford’s CoreNLP (Manning et al. 2014), spaCy (Honnibal and Johnson 2015), Apache’s OpenNLP (Baldridge 2005), and Google’s SyntaxNet (Petrov 2016). All of these, however, internally represent annotations using collections of complex, hierarchical, object-oriented classes. While these structures are ideal for annotation, they are not optimal for exploratory and predictive modeling.
In this paper, we present a method for uniting the cutting-edge advancements in natural language processing with the popular normalized data paradigm. Specifically, we give a data schema representing the output of an NLP annotation pipeline as a collection of normalized tables. Alongside this specification, we present the R package cleanNLP, which implements this specification over three distinct back ends. The package contains:
custom Java code, called by rJava (Urbanek 2016), that annotates raw text using the CoreNLP library;
a custom Python script, called by reticulate (Allaire et al. 2017), that annotates raw text using the spaCy library;
a simple, system dependency free, annotation engine using the package tokenizers (Mullen 2016).
The package cleanNLP also includes tools for converting from the normalized data model into (sparse) data matrices appropriate for exploratory and predictive modeling. Together, these contributions simplify the process of doing exploratory data analysis over a corpus of text.
There are several existing R packages that have some similar or
complementary features to those in cleanNLP. The R package
tidytext (Silge and Robinson 2016)
also offers the ability to convert raw text into a data frame. It is
quite similar to the functionality of cleanNLP when using the
tokenizers back end, with the addition of basic sentiment analysis and
part of speech tagging for English through the use of word lists. With
all annotations occurring at the token level, results are given as a
single table rather than a normalized schema between many tables as in
cleanNLP, which simplifies its application for new users. As such,
tidytext works well for applications that do not need more advanced
annotators such as named entities, dependencies, and coreferences. Given
the overlap in general approaches, it should be relatively
straightforward for users to transition from tidytext to cleanNLP
when they find the need for these annotation tasks. There are two
existing R packages that also call functions in the CoreNLP library. The
package
StanfordCoreNLP
(Hornik 2016c), available only through the datacube website at
Vienna University, integrates into the NLP framework. A similar,
standalone approach is offered by
coreNLP (Arnold and Tilton 2016).
Both of these packages run the annotation pipeline over a corpus of
text, call the Java class edu.stanford.nlp.pipeline.XMLOutputter, and
then parse the output using the
XML package. This approach
is not ideal because parsing the output XML file is computationally
expensive. It is also error prone because there is no published
format specifying the output of the XML. There is also the package
spacyr (Benoit and Matsuo 2017), which
was published after cleanNLP, that offers another way of calling the
spaCy library from R. Internally, spacyr works similarly to the spaCy
back end in cleanNLP by calling the Python library and extracting
information into R data types. However, spacyr returns results as a
single denormalized data frame and (perhaps in part as a result of
having no easy way of storing them in the one-table output) does not
support the word embeddings feature of the spaCy library.
The package has been designed to integrate into workflows that utilize
the many other packages for text processing available in R, such as
those found in the CRAN Taskview
NaturalLanguageProcessing.
For example, users may use the framework provided by tm (Feinerer et al. 2008) to
manage external corpora or the classes within
NLP (Hornik 2016a) to run
alternative parsers that can be converted into a tidy framework by way
of the from_CoNLL function. The Apache OpenNLP annotation pipeline,
available via openNLP (Hornik 2016b), for instance, provides several
languages not yet supported by spaCy or the CoreNLP pipeline. Packages
that focus on the analysis and modeling of text data can usually be used
directly with the output from
cleanNLP; these include
lda (Chang 2015),
lsa (Wild 2015), and
topicmodels
(Grün and Hornik 2011). Similarly, general-purpose database back ends such as
sqliter (Freitas 2014) can
be used to store the tidy data tables; predictive modeling functions may
be used to do predictive analytics over generated term-frequency
matrices.
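As a rough illustration of how a normalized token table can feed such term-frequency matrices, the following base R sketch cross-tabulates document ids against lemmas. The toy table and the names `token` and `term_freq` are made up for illustration; this is not the conversion routine shipped with cleanNLP, and it produces a dense rather than a sparse matrix.

```r
# Toy token table: one row per token, keyed by document id.
# The values are invented for illustration only.
token <- data.frame(
  id    = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  lemma = c("cave", "be", "cold", "cave", "be", "damp", "cave"),
  stringsAsFactors = FALSE
)

# Cross-tabulate documents against lemmas: rows are documents,
# columns are terms, and cells hold the number of occurrences.
term_freq <- table(doc = token$id, term = token$lemma)

# Drop the table class to obtain a plain integer matrix that
# modeling functions expecting a matrix can consume.
tf_matrix <- unclass(term_freq)
print(tf_matrix)
```

For a realistic corpus a sparse representation would usually be preferable, since most terms do not occur in most documents.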
In the following section we illustrate the usage of the R package across all three back ends. Next, we give a detailed description and justification of our data model. Along the way, we give a high-level introduction to the ideas behind the underlying NLP annotators. We finish by illustrating a longer example of using the package to study a corpus of historical speeches made by Presidents of the United States.
Before describing the data model for text annotations, it is useful to understand the basic workflow provided by the R package cleanNLP. We start by writing the opening lines of Douglas Adams’ Life, the Universe and Everything to a temporary file.
> txt <- c("The regular early morning yell of horror was the sound of Arthur",
+          "Dent waking up and suddenly remembering where he was. It wasn't",
+          "just that the cave was cold, it wasn't just that it was damp and",
+          "smelly. It was the fact that the cave was in the middle of",
+          "Islington and there wasn't a bus due for two million years.")
> writeLines(txt, tf <- tempfile())
The package cleanNLP can be installed directly from CRAN, with
binaries available for all major operating systems. In order to annotate
raw text, an NLP back end must first be initialized. Once this is done,
annotation is performed by calling the function run_annotators with a vector of
path(s) to the input documents. We start with an example using the
tokenizers back end.
> library(cleanNLP)
> init_tokenizers()
> anno <- run_annotators(tf)
The result of the annotation is a named list of six data frames and one matrix. We can see the elements of the object by printing out their names.
> names(anno)
[1] "coreference" "dependency"  "document"    "entity"      "sentence"
[6] "token"       "vector"
The individual tables can be referenced with the generic R accessor
functions (such as ‘[[’); however, the preferred method is to call the
relevant cleanNLP functions of the form get_TABLENAME(). For
example, the token table for this example can be accessed with the
get_token function.
> get_token(anno)
# A tibble: 61 x 8
      id   sid   tid    word lemma  upos   pos   cid
   <int> <int> <int>   <chr> <chr> <chr> <chr> <int>
1      1     0     1     The  <NA>  <NA>  <NA>    NA
2      1     0     2 regular  <NA>  <NA>  <NA>    NA
3      1     0     3   early  <NA>  <NA>  <NA>    NA
4      1     0     4 morning  <NA>  <NA>  <NA>    NA
5      1     0     5    yell  <NA>  <NA>  <NA>    NA
6      1     0     6      of  <NA>  <NA>  <NA>    NA
7      1     0     7  horror  <NA>  <NA>  <NA>    NA
8      1     0     8     was  <NA>  <NA>  <NA>    NA
9      1     0     9     the  <NA>  <NA>  <NA>    NA
10     1     0    10   sound  <NA>  <NA>  <NA>    NA
# ... with 51 more rows
The get functions are preferable because they provide useful options
for modifying the output before returning it. Notice that the annotation
process here has split out each word in the input into its own row.
There are also several columns of ids and columns filled with missing
values. The specific schema of the tables will be the focus of
discussion in the following section.
The tokenizers back end requires no external dependencies; however, it
does not support any of the advanced annotation tasks that illustrate
the utility of the cleanNLP package. This explains why most of the
columns in the example are missing. It is included primarily for testing
and demonstration purposes in cases where the other back ends cannot be
installed. The spaCy back end uses the Python library of the same name
to extract text annotations. Users must install
Python and the library externally (detailed instructions are provided in
the package documentation). Once installed, the only modification
required by the R code is to adjust which init_ function is being
called.
> init_spaCy()
> anno <- run_annotators(tf)
> get_token(anno)
# A tibble: 68 x 8
      id   sid   tid    word   lemma  upos   pos   cid
   <int> <int> <int>   <chr>   <chr> <chr> <chr> <int>
1      1     1     1     The     the   DET    DT     0
2      1     1     2 regular regular   ADJ    JJ     4
3      1     1     3   early   early   ADJ    JJ    12
4      1     1     4 morning morning  NOUN    NN    18
5      1     1     5    yell    yell  NOUN    NN    26
6      1     1     6      of      of   ADP    IN    31
7      1     1     7  horror  horror  NOUN    NN    34
8      1     1     8     was      be  VERB   VBD    41
9      1     1     9     the     the   DET    DT    45
10     1     1    10   sound   sound  NOUN    NN    49
# ... with 58 more rows
The output is in the exact same format, but now all of the token columns are filled in with useful information such as the lemmatized form of each word and part of speech codes. Similar details are also filled into the other fields.
The third and final back end currently available uses the Java library
CoreNLP. Users must install Java version 1.8 or higher and link it to R
using the rJava package. The
CoreNLP models, which are over 1 GB, can then be either manually
downloaded or fetched using the helper function download_coreNLP().
Once installed, this back end works just like the others.
> init_coreNLP()
> anno <- run_annotators(tf)
> get_token(anno)
# A tibble: 68 x 8
      id   sid   tid    word   lemma  upos   pos   cid
   <int> <int> <int>   <chr>   <chr> <chr> <chr> <int>
1      1     1     1     The     the   DET    DT     0
2      1     1     2 regular regular   ADJ    JJ     4
3      1     1     3   early   early   ADJ    JJ    12
4      1     1     4 morning morning  NOUN    NN    18
5      1     1     5    yell    yell  VERB    VB    26
6      1     1     6      of      of   ADP    IN    31
7      1     1     7  horror  horror  NOUN    NN    34
8      1     1     8     was      be  VERB   VBD    41
9      1     1     9     the     the   DET    DT    45
10     1     1    10   sound   sound  NOUN    NN    49
# ... with 58 more rows
The token output here is similar to, but not exactly the same as, that produced by the spaCy annotation engine. The only difference in the first ten rows is whether the word yell is categorized as a noun (spaCy) or a verb (CoreNLP). While yell can be either part of speech, in context the spaCy interpretation is correct.
As seen in the code snippets here, the philosophy behind the design of the cleanNLP package is to make it as easy as possible to turn raw text into data frames. All of the functions introduced here have optional parameters that change the way the back ends are run or how the annotations are returned. These include which annotators to run and which language model to use. Complete documentation is available within the R help pages.
| table name | record | primary key | foreign keys | 
|---|---|---|---|
| document | document | id | \(\cdot\) | 
| token | word / punctuation | id, sid, tid | cid | 
| dependency | token pairs | id, sid, tid, tid_target | \(\cdot\) | 
| entity | set of tokens | id, sid, tid, tid_end | \(\cdot\) | 
| coreference | mentions | id, rid, mid | sid, tid, tid_end, tid_head | 
| sentence | sentence | id, sid | \(\cdot\) | 
| vector | word embedding | id, sid, tid | \(\cdot\) | 
An annotation object is simply a named list with each item containing a data frame. These frames should be thought of as tables living inside of a single database, with keys linking each table to one another. All tables are in the second normal form of Codd (1990). For the most part they also satisfy the third normal form, or, equivalently, the formal tidy data model of Wickham (2014). The limited departures from this more stringent requirement are justified below wherever they exist. In every case the cause is a transitive dependency that would require a complex range join to reconstruct.
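To make the "tables inside a single database" view concrete, the following base R sketch builds a toy annotation-like list and checks the two key properties described above: uniqueness of the composite primary key and validity of the link back to the document table. The object `anno_toy` and its contents are hypothetical and are not produced by cleanNLP.

```r
# A toy annotation object: a named list of data frames that share keys.
# All values are invented for illustration.
anno_toy <- list(
  document = data.frame(
    id  = 1:2,
    uri = c("a.txt", "b.txt"),
    stringsAsFactors = FALSE
  ),
  token = data.frame(
    id   = c(1L, 1L, 2L, 2L, 2L),
    sid  = c(1L, 1L, 1L, 1L, 1L),
    tid  = c(1L, 2L, 1L, 2L, 3L),
    word = c("Hello", "world", "Good", "morning", "all"),
    stringsAsFactors = FALSE
  )
)

# Primary key check: (id, sid, tid) must identify each token uniquely.
key_cols <- c("id", "sid", "tid")
key_is_unique <- !any(duplicated(anno_toy$token[, key_cols]))

# Referential check: every token must point at an existing document.
link_is_valid <- all(anno_toy$token$id %in% anno_toy$document$id)

print(c(key_is_unique = key_is_unique, link_is_valid = link_is_valid))
```

Checks of this kind are cheap to run after combining annotation objects from several sources.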
Several standards have previously been proposed for representing textual
annotations. These include the Linguistic Annotation Framework
(Ide and Romary 2001), NLP Interchange Format (Hellmann et al. 2012), and CoNLL-X
(Buchholz and Marsi 2006). The function from_CoNLL is included as a helper
function in cleanNLP to convert from CoNLL formats into the cleanNLP
data model. All of these, however, are concerned with representing
annotations for interoperability between systems. Our goal is instead to
create a data model well-suited to direct analysis, which
requires a new approach.
In this section each table is presented and justifications for its
existence and form are given. Individual tables may be pulled out with
access functions of the form get_*. Example tables are pulled from the
dataset obama, which is included with the cleanNLP package. This
gives the annotation object obtained from the text of the annual
speeches Barack Obama made to Congress. These annual addresses, known as
The State of the Union, are mandated by the US Constitution and have
been given by every president since George Washington.
| get_document() | |
|---|---|
| id | integer. Id of the source document. | 
| time | date time. The time at which the parser was run on the text. | 
| version | character. Version of the NLP library used to parse the text. | 
| language | character. Language of the text, in ISO 639-1 format. | 
| uri | character. Description of the raw text location. | 
The documents table contains one row per document in the annotation object. What exactly constitutes a document is up to the user. It might include something as granular as a paragraph or as coarse as an entire novel. For many applications, particularly stylometry, it may be useful to simultaneously work with several hierarchical levels: sections, chapters, and an entire body of work. The solution in these cases is to define a document as the smallest unit of measurement, denoting the higher-level structures as metadata. For example, when working with a corpus of texts where each book is broken into chapters, we would make each document an individual chapter. A metadata field would be assigned to each chapter indicating which book it is a part of.
The primary key for the document table is a document id, stored as an integer index. By design, there should be no extrinsic meaning placed on this key. Other tables use it to map to one another and to the document table, but any metadata about the document is contained only in the document table rather than being forced into the document key. In other words, the temptation to use keys such as “Obama2016” is avoided because, while these look nice, trying to make use of them to extract document-level metadata is error prone and ultimately more verbose than making use of a join with the document table.
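The following base R sketch illustrates the point about keeping metadata in the document table: token counts are computed per integer document id and the year is recovered by a join rather than parsed out of a key. The `year` column and all values are hypothetical; a real annotation object would supply such a field through user-provided metadata.

```r
# Toy document table with a hypothetical 'year' metadata field.
document <- data.frame(
  id   = 1:3,
  year = c(2009L, 2010L, 2011L),
  uri  = c("2009.txt", "2010.txt", "2011.txt"),
  stringsAsFactors = FALSE
)

# Toy token table; only the document id matters for this example.
token <- data.frame(
  id   = c(1L, 1L, 2L, 3L, 3L, 3L),
  word = c("Madam", "Speaker", "Tonight", "We", "the", "people"),
  stringsAsFactors = FALSE
)

# Count tokens per document id.
counts <- aggregate(word ~ id, data = token, FUN = length)
names(counts)[2] <- "n_tokens"

# Join on the integer key to attach the document-level metadata.
by_year <- merge(document[, c("id", "year")], counts, by = "id")
print(by_year)
```

Because the key carries no meaning of its own, renaming files or adding further metadata columns does not affect the join.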
The minimal fields required by the document table are given in
Table 2. These are all filled in automatically by the
annotation function. Any number of additional corpus-specific metadata fields,
such as the aforementioned section and chapter designations, may be
attached as well by passing them to the meta parameter of
run_annotators. The document table for the example corpus is:
> get_document(obama)
# A tibble: 8 x 5
     id                time version language      uri
  <int>              <dttm>   <chr>    <chr>    <chr>
1     1 2017-05-21 09:27:55   1.8.2       en 2009.txt
2     2 2017-05-21 09:28:00   1.8.2       en 2010.txt
3     3 2017-05-21 09:28:05   1.8.2       en 2011.txt
4     4 2017-05-21 09:28:10   1.8.2       en 2012.txt
5     5 2017-05-21 09:28:14   1.8.2       en 2013.txt
6     6 2017-05-21 09:28:18   1.8.2       en 2014.txt
7     7 2017-05-21 09:28:22   1.8.2       en 2015.txt
8     8 2017-05-21 09:28:26   1.8.2       en 2016.txt
It may seem that common fields such as year and author should be added to the formal specification, but the perceived advantage is minimal. Users would still need to add the content of these fields manually at some point, as such metadata cannot be unambiguously extracted from the raw text.
| get_token() | |
|---|---|
| id | integer. Id of the source document. | 
| sid | integer. Sentence id, starting from 0. | 
| tid | integer. Token id, with the root of the sentence starting at 0. | 
| word | character. Raw word in the input text. | 
| lemma | character. Lemmatized form of the token. | 
| upos | character. Universal part of speech code. | 
| pos | character. Language-specific part of speech code; uses the Penn Treebank codes. | 
| cid | integer. Character offset at the start of the word in the original document. | 
The token table contains one row for each unique token, usually a word
or punctuation mark, in any document in the corpus. Any annotator that
produces an output for each token has its results displayed here. These
include the lemmatizer and the part of speech tagger
(Toutanova and Manning 2000). Table 3 shows the required
columns contained in the token table. Given the annotators selected
during the pipeline initialization, some of these columns may contain
only missing data. A composite key exists by taking together the
document id, sentence id, and token id. There is also a foreign key,
cid, giving the character offset back into the original source
document. An example of the table looks like this:
> get_token(obama, include_root = TRUE)
# A tibble: 65,758 x 8
      id   sid   tid      word     lemma  upos   pos   cid
   <int> <int> <int>     <chr>     <chr> <chr> <chr> <int>
1      1     1     0      ROOT      ROOT  <NA>  <NA>    NA
2      1     1     1     Madam     madam PROPN   NNP     0
3      1     1     2   Speaker   speaker PROPN   NNP     6
4      1     1     3         ,         , PUNCT     ,    13
5      1     1     4       Mr.       mr. PROPN   NNP    15
6      1     1     5      Vice      vice PROPN   NNP    19
7      1     1     6 President president PROPN   NNP    24
8      1     1     7         ,         , PUNCT     ,    33
9      1     1     8   Members   members PROPN  NNPS    35
10     1     1     9        of        of   ADP    IN    43
# ... with 65,748 more rows
A phantom token “ROOT” is included at the start of each sentence (it
always has tid equal to 0) if the option include_root is set to TRUE
(it is FALSE by default). This is useful so that joins from the
dependency table, which contains references to the sentence root, into
the token table have no missing values.
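The effect of the phantom root token on joins can be seen in a small base R sketch. The tables below are toy data invented for illustration, and `merge` with `all.x = TRUE` stands in for a left join; the column names follow the schema described in this section.

```r
# Toy token table for a single three-token sentence.
token <- data.frame(
  id = 1L, sid = 1L, tid = 1:3,
  word = c("Dogs", "bark", "."),
  stringsAsFactors = FALSE
)

# Toy dependency table; the root relation has source token id 0.
dependency <- data.frame(
  id = 1L, sid = 1L,
  tid        = c(2L, 0L, 2L),
  tid_target = c(1L, 2L, 3L),
  relation   = c("nsubj", "ROOT", "punct"),
  stringsAsFactors = FALSE
)

keys <- c("id", "sid", "tid")

# Left join without a root token: the root relation finds no source word.
no_root <- merge(dependency, token, by = keys, all.x = TRUE)

# Add the phantom ROOT token with tid 0 and join again.
root_row <- data.frame(id = 1L, sid = 1L, tid = 0L, word = "ROOT",
                       stringsAsFactors = FALSE)
with_root <- merge(dependency, rbind(root_row, token), by = keys, all.x = TRUE)

print(no_root)
print(with_root)
```

In the first result the source word of the root relation is missing, whereas in the second every dependency row has a matching token.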
The field upos contains the universal part of speech code, a
language-agnostic classification, for the token. It could be argued that
in order to maintain database normalization one should simply look up
the universal part of speech code by finding the language code in the
document table and joining a table mapping the Penn Treebank codes to
the universal codes. This has not been done for several reasons. First,
universal parts of speech are very useful for exploratory data analysis
as they contain tags much more familiar to non-specialists such as
“NOUN” (noun) and “CONJ” (conjunction). Asking users to apply a three-table
join just to access them seems overly cumbersome. Second, it is
possible for users to use other parsers or annotation engines. These may
not include granular part of speech codes and it would be difficult to
figure out how to represent these if there were not a dedicated
universal part of speech field.
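For comparison, the three-table join that a strictly normalized design would require can be sketched in base R as follows. The mapping table `pos_map` is hypothetical and contains only a handful of Penn Treebank codes, with universal codes chosen to match the example output shown earlier; it is not a complete or authoritative mapping.

```r
# Toy document table carrying the language code.
document <- data.frame(id = 1L, language = "en", stringsAsFactors = FALSE)

# Toy token table with language-specific codes only (no upos column).
token <- data.frame(
  id = 1L, sid = 1L, tid = 1:4,
  word = c("The", "cave", "was", "cold"),
  pos  = c("DT", "NN", "VBD", "JJ"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from (language, pos) to a universal code.
pos_map <- data.frame(
  language = "en",
  pos  = c("DT", "NN", "VBD", "JJ", "IN"),
  upos = c("DET", "NOUN", "VERB", "ADJ", "ADP"),
  stringsAsFactors = FALSE
)

# Step 1: token -> document, to find the language of each token.
# Step 2: result -> pos_map, to look up the universal code.
joined <- merge(merge(token, document, by = "id"),
                pos_map, by = c("language", "pos"))
joined <- joined[order(joined$id, joined$sid, joined$tid), ]

print(joined[, c("tid", "word", "pos", "upos")])
```

Storing upos directly in the token table avoids repeating these two joins in every exploratory analysis.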
| get_dependency() | |
|---|---|
| id | integer. Id of the source document. | 
| sid | integer. Sentence id of the source token. | 
| tid | integer. Id of the source token. | 
| sid_target | integer. Sentence id of the target token. | 
| tid_target | integer. Id of the target token. | 
| relation | character. Language-agnostic universal dependency type. | 
| relation_full | character. Language-specific universal dependency type. | 
| word | character. The source word in the raw text. | 
| lemma | character. Lemmatized form of the source word. | 
| word_target | character. The target word in the raw text. | 
| lemma_target | character. Lemmatized form of the target word. | 
Dependencies give the grammatical relationship between pairs of tokens within a sentence (Rafferty and Manning 2008; Green et al. 2011). As they are at the level of token pairs, they must be represented as a new table. All included fields are described in Table 4. Only one dependency should exist for any pair of tokens; the document id, sentence id, and source and target token ids together serve as a composite key. As dependencies exist only within a sentence, the sentence id does not need to be defined separately for the source and target. Dependencies take significantly longer to calculate than the lemmatization and part of speech tagging tasks.
The get_dependency function has an option, get_token (set to FALSE by default),
to automatically join the dependency table to the source and target words and lemmas
from the token table. This is a common task and involves non-trivial
calls to the left_join function, making it worthwhile to include as an
option. For example, the following code replicates the behavior of
get_dependency when set to return words and lemmas:
library(dplyr)  # provides %>%, select, and left_join
# Attach the source word and lemma, then the target word and lemma.
dep <- get_dependency(obama) %>%
  left_join(select(get_token(obama, include_root = TRUE),
                   id, sid, tid, word, lemma),
            by = c("id", "sid", "tid")) %>%
  left_join(select(get_token(obama, include_root = TRUE),
                   id, sid, tid_target = tid,
                   word_target = word, lemma_target = lemma),
            by = c("id", "sid", "tid_target"))
The output, equivalently using a call to get_dependency, is given by:
> get_dependency(obama, get_token = TRUE)
# A tibble: 62,781 x 10
      id   sid   tid tid_target relation relation_full      word     lemma
   <int> <int> <int>      <int>    <chr>         <chr>     <chr>     <chr>
1      1     1     2          1 compound          <NA>   Speaker   speaker
2      1     1     0          2     ROOT          <NA>      ROOT      ROOT
3      1     1     2          3    punct          <NA>   Speaker   speaker
4      1     1     6          4 compound          <NA> President president
5      1     1     6          5 compound          <NA> President president
6      1     1     2          6    appos          <NA>   Speaker   speaker
7      1     1     6          7    punct          <NA> President president
8      1     1     6          8    appos          <NA> President president
9      1     1     8          9     prep          <NA>   Members   members
10     1     1     9         10     pobj          <NA>        of        of
   word_target lemma_target
         <chr>        <chr>
1        Madam        madam
2      Speaker      speaker
3            ,            ,
4          Mr.          mr.
5         Vice         vice
6    President    president
7            ,            ,
8      Members      members
9           of           of
10    Congress     congress
# ... with 62,771 more rows
The word “ROOT” shows up in the second row, which would have been NA
had sentence roots not been explicitly included in the token table.
Our parser produces universal dependencies (De Marneffe et al. 2014), which have a language-agnostic set of relationship types with language-specific subsets pertaining to specific grammatical relationships within a particular language. For the same reasons that both the part of speech codes and universal part of speech codes are included, both of these relationship types have been added to the dependency table.
| get_entity() | |
|---|---|
| id | integer. Id of the source document. | 
| sid | integer. Sentence id of the entity mention. | 
| tid | integer. Token id at the start of the entity mention. | 
| tid_end | integer. Token id at the end of the entity mention. | 
| entity_type | character. Type of entity. | 
| entity | character. Raw words of the named entity in the text. | 
| entity_normalized | character. Normalized version of the entity. | 
Named entity recognition is the task of finding entities that can be defined by proper names, categorizing them, and standardizing their formats (Finkel et al. 2005). The XML output of the Stanford CoreNLP pipeline places named entity information directly into their version of the token table. Doing this repeats information over every token in an entity and gives no canonical way of extracting the entirety of a single entity mention. We instead have a separate entity table, as is demanded by the normalized database structure, and record each entity mention in its own row. The full set of fields is given in Table 5, with the combination of document id, sentence id, and token id serving as a composite key.
An example of the named entity table is given by:
> get_entity(obama)
# A tibble: 3,035 x 6
      id   sid   tid tid_end entity_type              entity
   <int> <int> <int>   <int>       <chr>               <chr>
1      1     1     1       2      PERSON       Madam Speaker
2      1     1     8      10         ORG Members of Congress
3      1     1    12      14         ORG      the First Lady
4      1     1    16      18         GPE   the United States
5      1     1    30      30        TIME             tonight
6      1     1    43      44       EVENT            Chamber,
7      1     2     6       6        NORP           Americans
8      1     4    24      25        DATE           every day
9      1     8    23      23        TIME             tonight
10     1     8    27      27        NORP            American
# ... with 3,025 more rows
The categories available in the field entity_type are dependent on the
specific back end used. When using the CoreNLP back end, the entities
‘MONEY’, ‘ORDINAL’, ‘PERCENT’, ‘DATE’, and ‘TIME’ also have a normalized
form. Entities for the spaCy back end offer more granular distinctions,
with a full list contained in the help page for the function
get_entity. As with the coreference table, a complete representation
of the entity is included as an explicit character field, because
reconstructing it after the fact from the token table is difficult.
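The kind of range join needed to rebuild an entity string from the token table can be sketched in base R as follows. The tables are toy data invented for illustration, and joining words with single spaces is a simplifying assumption that will not always reproduce the original spacing of the text.

```r
# Toy token table for one sentence.
token <- data.frame(
  id = 1L, sid = 1L, tid = 1:5,
  word = c("Madam", "Speaker", ",", "Mr.", "President"),
  stringsAsFactors = FALSE
)

# Toy entity table giving only the start and end token ids of each mention.
entity <- data.frame(
  id = 1L, sid = 1L,
  tid     = c(1L, 4L),
  tid_end = c(2L, 5L),
  entity_type = c("PERSON", "PERSON"),
  stringsAsFactors = FALSE
)

# Range join: for each entity, collect the tokens in the same document and
# sentence whose token id lies between tid and tid_end (inclusive).
entity$entity <- vapply(seq_len(nrow(entity)), function(i) {
  in_span <- token$id  == entity$id[i]  &
             token$sid == entity$sid[i] &
             token$tid >= entity$tid[i] &
             token$tid <= entity$tid_end[i]
  span <- token[in_span, ]
  paste(span$word[order(span$tid)], collapse = " ")
}, character(1))

print(entity)
```

Since the match condition is an interval rather than an equality, it cannot be expressed as a plain key-based join, which is the motivation for storing the string explicitly.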
| get_coreference() | |
|---|---|
| id | integer. Id of the source document. | 
| rid | integer. Relation ID. | 
| mid | integer. Mention ID; unique to each coreference within a document. | 
| mention | character. The mention as raw words from the text. | 
| mention_type | character. One of "LIST", "NOMINAL", "PRONOMINAL", or "PROPER". | 
| number | character. One of "PLURAL", "SINGULAR", or "UNKNOWN". | 
| gender | character. One of "FEMALE", "MALE", "NEUTRAL", or "UNKNOWN". | 
| animacy | character. One of "ANIMATE", "INANIMATE", or "UNKNOWN". | 
| sid | integer. Sentence id of the coreference. | 
| tid | integer. Token id at the start of the coreference. | 
| tid_end | integer. Token id at the end of the coreference. | 
| tid_head | integer. Token id of the head of the coreference. | 
Coreferences link sets of tokens that refer to the same underlying person, object, or idea (Raghunathan et al. 2010; Lee et al. 2011, 2013; Recasens et al. 2013). One common example is the linking of a noun in one sentence to a pronoun in the next sentence. The coreference table describes these relationships but is not strictly a table of coreferences. Instead, each row represents a single mention of an expression and gives a reference id indicating all of the other mentions that it also coreferences. Table 6 gives the entire schema of the coreference table. The document, reference, and mention ids serve as a composite key for the table. Links back into the token table for the start, end and head of the mention are given as well; these are pushed to the right of the table as they should be considered foreign keys within this table.
An example helps to explain exactly what the coreference table represents:
> get_coreference(obama)
# A tibble: 6,982 x 12
      id   rid   mid                      mention mention_type   number  gender
   <int> <int> <int>                        <chr>        <chr>    <chr>   <chr>
1      1  2049     7            the United States       PROPER SINGULAR NEUTRAL
2      1  2049    77 the United States of America       PROPER SINGULAR NEUTRAL
3      1  2049   102                      America       PROPER SINGULAR NEUTRAL
4      1  2049   315                      America       PROPER SINGULAR NEUTRAL
5      1  2049   742                   America 's       PROPER SINGULAR NEUTRAL
6      1  2049   782                      America       PROPER SINGULAR NEUTRAL
7      1  2049   939                      America       PROPER SINGULAR NEUTRAL
8      1  2049   991                      America       PROPER SINGULAR NEUTRAL
9      1  2049  1003                      America       PROPER SINGULAR NEUTRAL
10     1  2049  1045                      America       PROPER SINGULAR NEUTRAL
     animacy   sid   tid tid_end tid_head
       <chr> <dbl> <int>   <int>    <int>
1  INANIMATE     1    16      18       18
2  INANIMATE     8    41      45       43
3  INANIMATE    12     6       6        6
4  INANIMATE    40    12      12       12
5  INANIMATE   103     8       9        8
6  INANIMATE   109     8       8        8
7  INANIMATE   132     5       5        5
8  INANIMATE   138    27      27       27
9  INANIMATE   140    41      41       41
10 INANIMATE   147     4       4        4
# ... with 6,972 more rows
Here, these are all mentions of the same underlying entity: The United
States of America. There is a special relationship between the reference
id rid and the mention id mid. The coreference annotator selects a
specific mention for each reference that gets treated as the canonical
mention for the entire class. The mention id for this mention becomes
the reference id for the class. This relationship provides a way of
identifying the canonical mention within a reference class and a way of
treating the coreference table as pairs of mentions rather than
individual mentions joined by a given key.
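The relationship between rid and mid can be made concrete with a small sketch. The table below is a hand-made stand-in with the same key columns as the output of get_coreference; its values are invented for illustration, and base R is used only so that the snippet runs without cleanNLP or an annotated corpus.

```r
# Toy coreference table with the same key columns as get_coreference();
# the values are invented for illustration and are not cleanNLP output.
coref <- data.frame(
  id      = c(1L, 1L, 1L, 1L, 1L),
  rid     = c(7L, 7L, 7L, 20L, 20L),
  mid     = c(7L, 77L, 102L, 20L, 31L),
  mention = c("the United States", "the United States of America",
              "America", "Congress", "it"),
  stringsAsFactors = FALSE
)

# The canonical mention of each reference class is the row in which the
# mention id equals the reference id.
canonical <- coref[coref$mid == coref$rid, c("id", "rid", "mention")]
names(canonical)[3] <- "canonical_mention"

# Joining the canonical rows back onto the table yields pairs of
# (mention, canonical mention); self-pairs are dropped.
pairs <- merge(coref, canonical, by = c("id", "rid"))
pairs <- pairs[pairs$mid != pairs$rid, ]
print(pairs[, c("id", "mid", "mention", "canonical_mention")])
```

The same two steps, a filter on mid == rid followed by a join on the document and reference ids, should carry over to a real coreference table.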
The text of the mention itself is included within the table. Because a mention may span several tokens, it would otherwise be very difficult to extract this information from the token table. It is also possible, though not supported in the current CoreNLP pipeline, that a mention could consist of a set of non-contiguous tokens, making this field impossible to otherwise reconstruct.
| get_sentence() | |
|---|---|
| id | integer. Id of the source document. | 
| sid | integer. Sentence id. | 
| sentiment | integer. Predicted sentiment; 0 (very negative) to 4 (very positive). | 
The sentiment tagger provided by the CoreNLP pipeline predicts whether a sentence is very negative (0), negative (1), neutral (2), positive (3), or very positive (4) (Socher et al. 2013). There is no native sentiment model currently supported by spaCy. The sentiment output is placed in a separate table because, unlike the output of the other annotators, it is returned exclusively at the sentence level. The schema, described in Table 7, has the document and sentence ids serving as a composite key, with the only other field being an integer sentiment code. An example of the output can be seen in:
> get_sentence(obama)
# A tibble: 2,988 x 3
      id   sid sentiment
   <int> <dbl>     <int>
1      1     1         1
2      1     2         3
3      1     3         1
4      1     4         1
5      1     5         2
6      1     6         1
7      1     7         3
8      1     8         1
9      1     9         1
10     1    10         1
# ... with 2,978 more rows
The underlying sentiment model is a neural network. While at the moment few annotators exist at the sentence level, there is currently active research in modeling features that would eventually fit well into this table such as indicators of mood (Gaikwad and Joshi 2016), levels of sarcasm (Schifanella et al. 2016) or a characterization of the sentence’s “style” (Kabbara and Cheung 2016).
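Because the sentence table is keyed by document and sentence id, sentence-level scores can be rolled up to the document level with a single grouping step. The following is a minimal sketch on an invented table that mimics the schema of get_sentence. The sentiment code is ordinal, so treating its mean as a summary is an assumption made here for illustration only.

```r
# Toy sentence table with the schema of get_sentence(); values are invented.
sentence <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L),
  sid       = c(1L, 2L, 3L, 1L, 2L),
  sentiment = c(1L, 3L, 1L, 2L, 4L)
)

# Mean sentiment code per document (a rough summary of an ordinal code).
doc_mean <- tapply(sentence$sentiment, sentence$id, mean)

# Share of sentences per document tagged positive (3) or very positive (4).
doc_pos <- tapply(sentence$sentiment >= 3, sentence$id, mean)

doc_summary <- data.frame(
  id             = as.integer(names(doc_mean)),
  mean_sentiment = as.numeric(doc_mean),
  share_positive = as.numeric(doc_pos)
)
print(doc_summary)
```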
Our final table in the data model stores the relatively new concept of a
word vector. Also known as word embeddings, these vectors are
deterministic maps from the set of all available words into a
high-dimensional, real valued vector space. Words with similar meanings
or themes will tend to be clustered together in this high-dimensional
space. For example, we would expect apple and pear to be very close to
one another, with vegetables such as carrots, broccoli, and asparagus
only slightly farther away. The embeddings can often be used as input
features when building models on top of textual data. For a more
detailed description of these embeddings, see the papers on either of
the most well-known examples: GloVe (Pennington et al. 2014) and word2vec
(Mikolov et al. 2013). Only the spaCy back end to cleanNLP
currently supports word vectors; these are turned off by default because
they take a large amount of space to store. The embedding
model uses the fastText embeddings (Bojanowski et al. 2016), a
modification of the word2vec skip-gram model that adds subword information, which map words into a
300-dimensional space. To compute the embeddings, set the vector_flag
parameter of init_spaCy to TRUE prior to running the annotation.
Word vectors are stored in a separate table from the tokens table out of
convenience rather than as a necessity of preserving the data model’s
normalized schema. Due to its size and the fact that the individual
components of the word embedding have no intrinsic meaning, this table
is stored as a matrix. We can see that there is exactly one row in the
word embeddings for every non-ROOT token in the token table (note that
the word embeddings for the obama dataset are not included with the
package as they are too large to be uploaded to CRAN).
> dim(get_token(obama))
[1] 62781     8
> dim(get_vector(obama))
[1] 62781   303
The first three columns hold the keys id, sid, and tid,
respectively. If no embedding is computed, the function get_vector
returns an empty matrix.
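To illustrate how such a matrix might be used, the sketch below drops the three key columns and computes cosine similarities between word vectors. The matrix is a hand-made stand-in laid out like the output of get_vector, with an invented 4-dimensional embedding in place of the 300-dimensional one. The word labels are likewise assumptions added for readability.

```r
# Toy matrix laid out like get_vector(): three key columns (id, sid, tid)
# followed by the embedding. All values are invented for illustration.
vec <- rbind(
  c(1, 1, 1,  0.9, 0.1, 0.0, 0.2),   # "apple"
  c(1, 1, 2,  0.8, 0.2, 0.1, 0.3),   # "pear"
  c(1, 1, 3,  0.0, 0.9, 0.8, 0.1)    # "treaty"
)
words <- c("apple", "pear", "treaty")

# Drop the key columns to keep only the embedding.
emb <- vec[, -(1:3), drop = FALSE]
rownames(emb) <- words

# Cosine similarity: scale each row to unit length, then take all
# pairwise inner products.
unit <- emb / sqrt(rowSums(emb^2))
cos_sim <- unit %*% t(unit)
print(round(cos_sim, 3))
```

In this toy example the two fruit vectors are far more similar to one another than either is to the third word, which is the behaviour one would hope to see in a real embedding.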
The President of the United States is constitutionally obligated to
provide a report known as the State of the Union. The report
summarizes the current challenges facing the country and the president’s
upcoming legislative agenda. While historically the State of the Union
was often a written document, in recent decades it has always taken the
form of an oral address to a joint session of the United States
Congress. In this final section the utility of the package is
illustrated by showing how it can be used to study a corpus consisting
of every such address made by a United States president through 2016
(Peters 2016). It highlights some of the major benefits of the tidy data model
as it applies to the study of textual data, though by no means attempts
to give an exhaustive coverage of all the available tables and
approaches. The examples make heavy use of the table verbs provided by
dplyr, the piping notation of magrittr and ggplot2 graphics. These
are used because they best illustrate the advantages of the tidy data
model that has been built in cleanNLP for representing corpus
annotations. Relevant functions are prepended with cleanNLP:: in the
following analysis in order to be clear which functions are supplied by
the cleanNLP package.
The full text of all the State of the Union addresses through 2016 is provided by the R package sotu (Arnold 2017), available on CRAN. The package also contains meta-data concerning each speech that we will add to the document table while annotating the corpus. The code to run this annotation is given by:
> library(sotu)
> library(cleanNLP)
>
> data(sotu_text)
> data(sotu_meta)
> init_spaCy()
> sotu <- cleanNLP::run_annotators(sotu_text, as_strings = TRUE,
+                                  meta = sotu_meta)
The annotation object, which we will use throughout the following
analysis, is stored in the object sotu.
Simple summary statistics are easily computed from the token table. To see the distribution of sentence length, the token table is grouped by the document and sentence id and the number of rows within each group is computed. The percentiles of these counts give a quick summary of the distribution.
> library(ggplot2)
> library(dplyr)
> library(magrittr)
> cleanNLP::get_token(sotu) %>%
+   count(id, sid) %$%
+   quantile(n, seq(0,1,0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   1   11   16   19   23   27   31   37   44   58  681
The median sentence has 27 tokens, whereas at least one has over 600 (this is due to a bulleted list in one of the written addresses being treated as a single sentence).
To see the most frequently used nouns in the dataset, the token table is filtered on the universal part of speech field, grouped by lemma, and the number of rows in each group is once again calculated. Sorting the output and selecting the top 42 nouns yields a high-level summary of the topics of interest within this corpus.
> cleanNLP::get_token(sotu) %>%
+   filter(upos == "NOUN") %>%
+   count(lemma) %>%
+   top_n(n = 42, n) %>%
+   arrange(desc(n)) %>%
+   use_series(lemma)
 [1] "year"          "country"       "people"        "government"
 [5] "law"           "time"          "nation"        "who"
 [9] "power"         "interest"      "world"         "war"
[13] "citizen"       "service"       "duty"          "part"
[17] "system"        "peace"         "right"         "man"
[21] "program"       "policy"        "work"          "act"
[25] "state"         "condition"     "subject"       "legislation"
[29] "force"         "effort"        "treaty"        "purpose"
[33] "what"          "land"          "business"      "action"
[37] "measure"       "tax"           "way"           "question"
[41] "relation"      "consideration"
The result is generally as would be expected from a corpus of government speeches, with nouns naming the institutions of government (“government”, “nation”, “state”) and nouns indicating general topics of interest such as “tax”, “law”, and “peace”.
The length in tokens of each address is calculated similarly by grouping and summarizing at the document id level. The results can be joined with the document table to get the year of the speech and then piped into a ggplot2 command to illustrate how the length of the State of the Union has changed over time.
> cleanNLP::get_token(sotu) %>%
+   count(id) %>%
+   left_join(cleanNLP::get_document(sotu)) %>%
+   ggplot(aes(year, n)) +
+     geom_line(color = grey(0.8)) +
+     geom_point(aes(color = sotu_type)) +
+     geom_smooth()
Here, color is used to represent whether the address was given as an oral address or a written document. The output in Figure 1 shows that there are certainly time trends in the address length, with the form of the address (written versus spoken) also having a large effect on document length.
Finding the most used entities from the entity table over the time
period of the corpus yields an alternative way to see the underlying
topics. A slightly modified version of the code snippet used to find the
top nouns in the dataset can be used to find the top entities. The
get_token function is replaced by get_entity and the table is
filtered on entity_type rather than the universal part of speech code.
> cleanNLP::get_entity(sotu) %>%
+   filter(entity_type == "GPE") %>%
+   count(entity) %>%
+   top_n(n = 26, n) %>%
+   arrange(desc(n)) %>%
+   use_series(entity)
 [1] "the United States"        "America"
 [3] "States"                   "Mexico"
 [5] "Great Britain"            "Spain"
 [7] "Washington"               "China"
 [9] "Executive"                "France"
[11] "Cuba"                     "Japan"
[13] "Texas"                    "Russia"
[15] "The United States"        "Germany"
[17] "United States"            "California"
[19] "Nicaragua"                "the Soviet Union"
[21] "Mississippi"              "Iraq"
[23] "Alaska"                   "U.S."
[25] "Philippines"              "Panama"
[27] "the District of Columbia"
The ability to redo analyses from a slightly different perspective is a direct consequence of the tidy data model supplied by cleanNLP. The top locations include some obvious and some less obvious instances. The sovereign nations in the list, such as Great Britain, Mexico, Germany, and Japan, are as expected given the United States’ close ties or periods of war with them. The top states include populous ones (California and Texas) but also smaller states (Mississippi and Alaska), the latter being more surprising.
One of the most straightforward ways of extracting a high-level summary
of the content of a speech is to extract all direct object
dependencies where the target noun is not a very common word. In order
to do this for a particular speech, the dependency table is joined to
the document table, a particular document is selected, and only relationships
of type “dobj” (direct object) are retained. The result is then
joined to the data set word_frequency, which is included with
cleanNLP, and pairs whose target falls below a frequency cutoff (0.001
in the code below) are selected to give the final result. Here is an example of this using
the first address made by George W. Bush in 2001:
> cleanNLP::get_dependency(sotu, get_token = TRUE) %>%
+   left_join(get_document(sotu)) %>%
+   filter(year == 2001, relation == "dobj") %>%
+   select(id = id, start = word, word = lemma_target) %>%
+   left_join(word_frequency) %>%
+   filter(frequency < 0.001) %>%
+   select(id, start, word) %$%
+   sprintf("%s => %s", start, word)
Joining, by = "id"
Joining, by = "word"
 [1] "take => oath"                  "using => statistic"
 [3] "increasing => layoff"          "protects => trillion"
 [5] "makes => welcoming"            "accelerating => cleanup"
 [7] "fight => homelessness"         "helping => neighbor"
 [9] "allowing => taxpayer"          "provide => mentor"
[11] "fight => illiteracy"           "promotes => compassion"
[13] "asked => ashcroft"             "end => profiling"
[15] "pay => trillion"               "throw => dart"
[17] "restores => fairness"          "promoting => internationalism"
[19] "makes => downpayment"          "discard => relic"
[21] "confronting => shortage"       "directed => cheney"
[23] "sound => footing"              "divided => conscience"
[25] "done => servant"
Most of these phrases correspond with the “compassionate conservatism” that George W. Bush ran under in the preceding 2000 election. Applying the same analysis to the 2002 State of the Union (here with a frequency cutoff of 0.0005), which came under the shadow of the September 11th terrorist attacks, shows a drastic shift in focus.
> cleanNLP::get_dependency(sotu, get_token = TRUE) %>%
+   left_join(get_document(sotu)) %>%
+   filter(year == 2002, relation == "dobj") %>%
+   select(id = id, start = word, word = lemma_target) %>%
+   left_join(word_frequency) %>%
+   filter(frequency < 0.0005) %>%
+   select(id, start, word) %$%
+   sprintf("%s => %s", start, word)
Joining, by = "id"
Joining, by = "word"
 [1] "urged => follower"        "called => troop"
 [3] "brought => sorrow"        "owe => micheal"
 [5] "ticking => timebomb"      "have => troop"
 [7] "hold => hostage"          "eliminate => parasite"
 [9] "flaunt => hostility"      "develop => anthrax"
[11] "put => troop"             "increased => vigilance"
[13] "fight => anthrax"         "thank => attendant"
[15] "defeat => recession"      "want => paycheck"
[17] "set => posturing"         "enact => safeguard"
[19] "embracing => ethic"       "owns => aspiration"
[21] "containing => resentment" "erasing => rivalry"
[23] "embrace => tyranny"
Here the topics have almost entirely shifted to counter-terrorism and national security efforts.
The get_tfidf function provided by cleanNLP converts a token table
into a sparse matrix representing the term-frequency inverse document
frequency matrix (or any intermediate part of that calculation). This is
particularly useful when building models from a textual corpus. The
tidy_pca function, also included with the package, takes a matrix and returns a
data frame containing the desired number of principal components.
Dimension reduction involves piping the token table for a corpus into
the get_tfidf function and passing the results to tidy_pca.
> pca <- cleanNLP::get_token(sotu) %>%
+   filter(pos %in% c("NN", "NNS")) %>%
+   cleanNLP::get_tfidf(min_df = 0.05, max_df = 0.95,
+                       type = "tfidf", tf_weight = "dnorm") %$%
+   cleanNLP::tidy_pca(tfidf, get_document(sotu))
In this example only non-proper nouns have been included, in order to minimize the influence of the stylistic attributes of the speeches and to focus on their content. A scatter plot of the speeches using these components is shown in Figure 2. There is a definite temporal pattern to the documents, with the 20th century addresses forming a distinct cluster on the right side of the plot.
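For readers who want to see what this calculation involves, the following is a minimal base R sketch of a term-frequency inverse document frequency matrix followed by principal components. It uses an invented token table and the textbook weighting tf × log(N / df). This is an assumption made for illustration and is not necessarily the weighting that get_tfidf applies with tf_weight = "dnorm"; prcomp likewise stands in for tidy_pca.

```r
# Toy token table: one row per token, with document id and lemma (invented).
tokens <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  lemma = c("war", "peace", "treaty", "tax", "job", "peace",
            "war", "tax", "job", "job"),
  stringsAsFactors = FALSE
)

# Term-frequency matrix: documents in rows, terms in columns.
tf <- unclass(table(tokens$id, tokens$lemma))

# Inverse document frequency: log(N / number of documents with the term).
idf <- log(nrow(tf) / colSums(tf > 0))

# Scale each term column by its idf weight.
tfidf <- sweep(tf, 2, idf, `*`)

# Principal components of the weighted matrix; keep the first two.
pca <- prcomp(tfidf, center = TRUE, scale. = FALSE)
scores <- data.frame(id = rownames(tfidf), pca$x[, 1:2])
print(scores)
```

Terms that occur in every document receive an idf of zero under this weighting, which is one reason for trimming very common and very rare terms, as min_df and max_df do in the call above.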
Topic models are a collection of statistical models for describing
abstract themes within a textual corpus. Each theme is characterized by
a collection of words that commonly co-occur; for example, the words
‘crop’, ‘dairy’, ‘tractor’, and ‘hectare’, might define a farming
theme. One of the most popular topic models is latent Dirichlet
allocation (LDA), a Bayesian model where each topic is described by a
probability distribution over a vocabulary of words. Each document is
then characterized by a probability distribution over the available
topics. For a formal description, see (Blei et al. 2003) and
(Pritchard et al. 2000), the original papers outlining LDA. To fit LDA
on a corpus of text parsed by the cleanNLP package, the output of
get_tfidf can be piped directly to the LDA function in the package
topicmodels. The topic model function requires raw counts, so the type
variable in get_tfidf is set to “tf”.
> library(topicmodels)
> tm <- cleanNLP::get_token(sotu) %>%
+   filter(pos %in% c("NN", "NNS")) %>%
+   cleanNLP::get_tfidf(min_df = 0.05, max_df = 0.95,
+                       type = "tf", tf_weight = "raw") %$%
+   LDA(tf, k = 16, control = list(verbose = 1))
The topics, ordered by approximate time period, are visualized in Figure 3. We describe each topic by giving the five most important words. Most topics exist for a few decades and then largely disappear, though some persist over non-contiguous periods of the presidency. The “program, energy, effort, legislation, policy” topic, for example, appears during the 1950s and crops up again during the energy crisis of the 1970s. The “world, man, freedom, force, life” topic peaks during both World Wars, but is absent during the 1920s and early 1930s.
Finally, the cleanNLP data model is also convenient for building
predictive models. The State of the Union corpus does not lend itself to
an obviously applicable prediction problem. A classifier that
distinguishes speeches made by George W. Bush from those made by Barack Obama will be
constructed here for the purpose of illustration. As a first step, a
term-frequency matrix is extracted using the same technique as was used
with the topic modeling function. However, here the frequency is
computed for each sentence in the corpus rather than the document as a
whole. The ability to do this seamlessly with a single additional
mutate function defining a new id illustrates the flexibility of the
get_tfidf function.
> df <- get_token(sotu) %>%
+   left_join(get_document(sotu)) %>%
+   filter(year > 2000) %>%
+   mutate(new_id = paste(id, sid, sep = "-")) %>%
+   filter(pos %in% c("NN", "NNS"))
Joining, by = "id"
> mat <- get_tfidf(df, min_df = 0, max_df = 1, type = "tf",
+                  tf_weight = "raw", doc_var = "new_id")
It will be necessary to define a response variable y indicating
whether a sentence comes from a speech made by President Obama, as well as a training
flag indicating which speeches were made in odd-numbered years. This is
done via a separate table join and a pair of mutations.
> meta <- data_frame(new_id = mat$id) %>%
+   left_join(df[!duplicated(df$new_id),]) %>%
+   mutate(y = as.numeric(president == "Barack Obama")) %>%
+   mutate(train = year %in% seq(2001,2016, by = 2))
Joining, by = "new_id"
The output may now be used as input to the elastic net function provided by the glmnet package. The family is set to binomial given the binary nature of the response, and training is done on only those speeches occurring in odd-numbered years. Cross-validation is used in order to select the best value of the model’s tuning parameter.
> library(glmnet)
> model <- cv.glmnet(mat$tf[meta$train,], meta$y[meta$train], family = "binomial")
A boxplot of the predicted classes for each address is given in Figure 4. The algorithm does a very good job of separating the speeches. Looking at the odd years versus even years (the training and testing sets, respectively) indicates that the model has not been over-fit.
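The comparison between training and testing years can be sketched without the corpus. The example below simulates sentence-level counts for two hypothetical terms, fits on odd-numbered years, and compares accuracy on the two sets. A plain, unpenalized logistic regression from base R (glm) is swapped in for the elastic net so that the snippet has no package dependencies. The data, the term names, and the 0.5 decision threshold are all assumptions made for illustration.

```r
set.seed(1)

# Simulated sentences: 20 per year, with Obama-era years coded as y = 1.
years  <- rep(2001:2016, each = 20)
y      <- as.numeric(years >= 2009)

# Invented counts for two terms whose rates differ between the classes.
job    <- rpois(length(y), lambda = ifelse(y == 1, 2, 0.5))
terror <- rpois(length(y), lambda = ifelse(y == 1, 0.5, 2))
dat    <- data.frame(year = years, y = y, job = job, terror = terror)

# Training flag: sentences from odd-numbered years.
dat$train <- dat$year %in% seq(2001, 2016, by = 2)

# Fit on the training years only (plain logistic regression, not glmnet).
fit <- glm(y ~ job + terror, family = binomial, data = dat[dat$train, ])

# Predict every sentence and compare accuracy on the two sets.
prob <- predict(fit, newdata = dat, type = "response")
pred <- as.numeric(prob > 0.5)
acc  <- tapply(pred == dat$y, ifelse(dat$train, "train", "test"), mean)
print(acc)
```

Similar accuracy on the two sets is the kind of evidence the paragraph above appeals to when it says the model has not been over-fit.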
One benefit of the penalized linear regression model is that it is possible to interpret the coefficients in a meaningful way. Here are the non-zero elements of the regression vector, coded by whether they have a positive (more Obama) or negative (more Bush) sign:
> beta <- coef(model, s = model[["lambda"]][11])[-1]
> sprintf("%s (%d)", mat$vocab, sign(beta))[beta != 0]
 [1] "job (1)"          "business (1)"     "citizen (-1)"
 [4] "terrorist (-1)"   "government (-1)"  "freedom (-1)"
 [7] "home (1)"         "college (1)"      "weapon (-1)"
[10] "deficit (1)"      "company (1)"      "peace (-1)"
[13] "enemy (-1)"       "terror (-1)"      "income (-1)"
[16] "drug (-1)"        "kid (1)"          "regime (-1)"
[19] "class (1)"        "crisis (1)"       "industry (1)"
[22] "need (-1)"        "fact (1)"         "relief (-1)"
[25] "bank (1)"         "liberty (-1)"     "society (-1)"
[28] "duty (-1)"        "folk (1)"         "account (-1)"
[31] "compassion (-1)"  "environment (-1)" "inspector (-1)"
These generally seem as expected given the main policy topics of focus under each administration. During most of the Bush presidency, as mentioned before, the focus was on national security and foreign policy. Obama, on the other hand, inherited the recession of 2008 and was more focused on the overall economic policy.
In this paper a normalized data model for representing text annotations has been presented and motivated. We have also demonstrated how the R package cleanNLP implements this data model using various, configurable back ends. Our focus has been to illustrate why this general approach and specific implementation are both powerful and easy to integrate into existing data pipelines. It is expected that some users will utilize the entirety of the underlying annotation pipelines, internal R structures, and helper functions. Others may use the package as a convenient wrapper around either the CoreNLP or spaCy libraries. In either extreme, or anywhere in between, our approach provides powerful tools for applying exploratory, graphical, and model-based techniques to textual data sources.
The cleanNLP package continues to be actively developed. In particular, we hope to include new sentence-level annotations as they are integrated into the spaCy and CoreNLP libraries. While major releases are available on CRAN, new features are added periodically on the development branch located at: https://github.com/statsmaths/cleanNLP. Bug reports, feature and collaboration requests can all be made using the GitHub issues page.
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2017-035.zip
CRAN packages used: dplyr, ggplot2, magrittr, broom, janitor, tidyr, cleanNLP, tidytext, StanfordCoreNLP, coreNLP, XML, spacyr, NLP, lda, lsa, topicmodels, sqliter, rJava, sotu, glmnet
This author, who is also the maintainer of coreNLP, has witnessed this first-hand by way of the persistent bug reports centering around the formatting of the XML output in strange edge cases or over new versions of the CoreNLP library. The coreNLP package will still be maintained for users looking explicitly to access methods from the Stanford library, whereas cleanNLP is being developed to provide a simpler interface that is consistent across various back ends.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Arnold, "A Tidy Data Model for Natural Language Processing using cleanNLP", The R Journal, 2017
BibTeX citation
@article{RJ-2017-035,
  author = {Arnold, Taylor},
  title = {A Tidy Data Model for Natural Language Processing using cleanNLP},
  journal = {The R Journal},
  year = {2017},
  note = {https://doi.org/10.32614/RJ-2017-035},
  doi = {10.32614/RJ-2017-035},
  volume = {9},
  issue = {2},
  issn = {2073-4859},
  pages = {248-267}
}