A Tidy Data Model for Natural Language Processing using cleanNLP

Taylor Arnold

doi:10.32614/RJ-2017-035

1 Introduction

There has been an ongoing trend towards converting raw data into a collection of normalized tables prior to conducting further analyses. This paradigm, recently popularized by Hadley Wickham under the term “tidy data” (Wickham 2014), draws on concepts from database and visualization theory to provide a welcome theoretical basis for data analysis. There are also many pragmatic benefits to putting data into a set of normalized tables prior to beginning an exploratory analysis or building inferential models. When working with normalized data, most modeling, data manipulation, and visualization tasks can be described using a small collection of functions. This makes code more readable and less error-prone, and allows for better code reuse. As many of these simple functions reduce to basic database operations, this style of coding can simplify the task of integrating statistical models into a production codebase. Also, normalized tables can be stored unambiguously as delimited plain text flat files, allowing for interoperability between programming languages and users.

As both a cause and result of the popularity of this approach, a number of software packages have been developed to help construct and manipulate collections of normalized data tables. In R, well-known examples include dplyr (Wickham and Francois 2016), ggplot2 (Wickham 2009), magrittr (Bache and Wickham 2014), broom (Robinson 2017), janitor (Firke 2016), and tidyr (Wickham 2017). On the Python side, much of this functionality is included within the pandas (McKinney et al. 2010) and sklearn (Pedregosa et al. 2011) modules.

While cleaning messy data is often a time-consuming task, deciding on a specific normalized schema for representing a set of inputs is in most cases relatively straightforward. Outside of potentially removing outliers, missing data, and bad inputs, the process of tidying data is generally a lossless procedure. At a high level, data tidying is often simply a reorganization of the raw inputs. However, if we are working with unstructured data such as collections of text, images, or sound, converting into a normalized tabular format is significantly more involved. The process of tidying in these cases becomes synonymous with featurization, whereby structured outputs are algorithmically extracted from a raw input. For example, from an audio music file we might extract features such as the overall length, beats per minute, and quantiles of the music’s loudness.

The featurization of raw text, known in natural language processing as text annotation, includes tasks such as tokenization (splitting text into words), part of speech tagging, and named entity recognition. Recent advancements in neural networks and heavy investment from both industry and academia have produced fast and highly accurate annotation libraries such as Stanford’s CoreNLP (Manning et al. 2014), spaCy (Honnibal and Johnson 2015), Apache’s OpenNLP (Baldridge 2005), and Google’s SyntaxNet (Petrov 2016). All of these, however, internally represent annotations using collections of complex, hierarchical, object-oriented classes. While these structures are ideal for annotation, they are not optimal for exploratory and predictive modeling.

In this paper, we present a method for uniting the cutting-edge advancements in natural language processing with the popular normalized data paradigm. Specifically, we give a data schema representing the output of an NLP annotation pipeline as a collection of normalized tables. Alongside this specification, we present the R package cleanNLP that implements this specification over three distinct back ends. The package contains:

custom Java code, called by rJava (Urbanek 2016), that annotates raw text using the CoreNLP library;
a custom Python script, called by reticulate (Allaire et al. 2017), that annotates raw text using the spaCy library;
a simple, system dependency free, annotation engine using the package tokenizers (Mullen 2016).

The package cleanNLP also includes tools for converting from the normalized data model into (sparse) data matrices appropriate for exploratory and predictive modeling. Together, these contributions simplify the process of doing exploratory data analysis over a corpus of text.
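
As a rough illustration of what such a conversion involves, the sketch below builds a small document-by-term count matrix from a token table in base R. The toy table and its values are made up for this example, and the code is not the package's own conversion routine. It also produces a dense matrix, whereas a realistic corpus would call for a sparse representation.

```r
# Toy token table following the id / sid / tid / lemma layout (made-up values)
token <- data.frame(
  id    = c(1L, 1L, 1L, 2L, 2L),
  sid   = c(1L, 1L, 1L, 1L, 1L),
  tid   = c(1L, 2L, 3L, 1L, 2L),
  lemma = c("cave", "be", "cold", "cave", "be"),
  stringsAsFactors = FALSE
)

# Cross-tabulate documents against lemmas; unclass() turns the contingency
# table into a plain integer matrix with documents as rows and terms as columns
tf <- unclass(table(doc = token$id, term = token$lemma))
print(tf)
```

Each row of the resulting matrix is a document and each column a lemma, which is the shape expected by most modeling functions.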

There are several existing R packages that have some similar or complementary features to those in cleanNLP. The R package tidytext (Silge and Robinson 2016) also offers the ability to convert raw text into a data frame. Its functionality is quite similar to that of cleanNLP when using the tokenizers back end, with the addition of basic sentiment analysis and part of speech tagging for English through the use of word lists. With all annotations occurring at the token level, results are given as a single table rather than a normalized schema between many tables as in cleanNLP, which simplifies its application for new users. As such, tidytext works well for applications that do not need more advanced annotators such as named entities, dependencies, and coreferences. Given the overlap in general approaches, it should be relatively straightforward for users to transition from tidytext to cleanNLP when they find the need for these annotation tasks.

There are two existing R packages that also call functions in the CoreNLP library. The package StanfordCoreNLP (Hornik 2016c), available only through the datacube website at Vienna University, integrates into the NLP framework. A similar, standalone approach is offered by coreNLP (Arnold and Tilton 2016). Both of these packages run the annotation pipeline over a corpus of text, call the Java class edu.stanford.nlp.pipeline.XMLOutputter, and then parse the output using the XML package. This approach is not ideal as parsing the output XML file is computationally time-consuming. It is also error prone because there is no published format specifying the output of the XML.¹

There is also the package spacyr (Benoit and Matsuo 2017), published after cleanNLP, that offers another way of calling the spaCy library from R. Internally, spacyr works similarly to the spaCy back end in cleanNLP by calling the Python library and extracting information into R data types. However, spacyr returns results as a single denormalized data frame and (perhaps in part as a result of having no easy way of storing them in the one-table output) does not support the word embeddings feature of the spaCy library.

The package has been designed to integrate into workflows that utilize the many other packages for text processing available in R, such as those found in the CRAN Taskview NaturalLanguageProcessing. For example, users may use the framework provided by tm (Feinerer et al. 2008) to manage external corpora or the classes within NLP (Hornik 2016a) to run alternative parsers that can be converted into a tidy framework by way of the from_CoNLL function. The Apache OpenNLP annotation pipeline, available via openNLP (Hornik 2016b), for instance, provides several languages not yet supported by spaCy or the CoreNLP pipeline. Packages that focus on the analysis and modeling of text data can usually be used directly with the output from cleanNLP; these include lda (Chang 2015), lsa (Wild 2015), and topicmodels (Grün and Hornik 2011). Similarly, general-purpose database back-ends such as sqliter (Freitas 2014) can be used to store the tidy data tables; predictive modeling functions may be used to do predictive analytics over generated term-frequency matrices.

In the following section we illustrate the usage of the R package across all three back ends. Next, we give a detailed description and justification of our data model. Along the way, we give a high-level introduction to the ideas behind the underlying NLP annotators. We finish by illustrating a longer example of using the package to study a corpus of historical speeches made by Presidents of the United States.

2 Basic usage of cleanNLP

Before describing the data model for text annotations, it is useful to understand the basic workflow provided by the R package cleanNLP. We start by writing the opening lines of Douglas Adams’ Life, the Universe and Everything to a temporary file.

> txt <- c("The regular early morning yell of horror was the sound of Authur",
+          "Dent waking up and suddenly remembering where he was. It wasn't",
+          "just that the cave was cold, it wasn't just that it was damp and",
+          "smelly. It was the fact that the cave was in the middle of",
+          "Islington and there wasn't a bus due for two million years.")
> writeLines(txt, tf <- tempfile())

The package cleanNLP can be installed directly from CRAN, with binaries available for all major operating systems. In order to annotate raw text, an NLP back end must first be initialized. Once this is done, annotation is performed by calling the function run_annotators with a vector of path(s) to the input documents. We start with an example using the tokenizers back end.

> library(cleanNLP)
> init_tokenizers()
> anno <- run_annotators(tf)

The result of the annotation is a named list of six data frames and one matrix. We can see the elements of the object by printing out their names.

> names(anno)
[1] "coreference" "dependency"  "document"    "entity"      "sentence"
[6] "token"       "vector"

The individual tables can be referenced with the generic R accessor functions (such as ‘[[’); however, the preferred method is to call the relevant cleanNLP functions of the form get_TABLENAME(). For example, the tokens table for this example can be accessed with the get_token function.

> get_token(anno)
# A tibble: 61 x 8
      id   sid   tid    word lemma  upos   pos   cid
   <int> <int> <int>   <chr> <chr> <chr> <chr> <int>
1      1     0     1     The  <NA>  <NA>  <NA>    NA
2      1     0     2 regular  <NA>  <NA>  <NA>    NA
3      1     0     3   early  <NA>  <NA>  <NA>    NA
4      1     0     4 morning  <NA>  <NA>  <NA>    NA
5      1     0     5    yell  <NA>  <NA>  <NA>    NA
6      1     0     6      of  <NA>  <NA>  <NA>    NA
7      1     0     7  horror  <NA>  <NA>  <NA>    NA
8      1     0     8     was  <NA>  <NA>  <NA>    NA
9      1     0     9     the  <NA>  <NA>  <NA>    NA
10     1     0    10   sound  <NA>  <NA>  <NA>    NA
# ... with 51 more rows

The get functions are preferable because they provide useful options for modifying the output before returning it. Notice that the annotation process here has split out each word in the input into its own row. There are also several columns of ids and columns filled with missing values. The specific schema of the tables will be the focus of discussion in the following section.

The tokenizers back end requires no external dependencies; however, it does not support any of the advanced annotation tasks that illustrate the utility of the cleanNLP package. This explains why most of the columns in the example are missing. It is included primarily for testing and demonstration purposes in cases where the other back ends cannot be installed. The spaCy back end uses the Python library of the same name to extract text annotations. Users must install Python and the library externally (detailed instructions are provided in the package documentation). Once these are installed, the only modification required in the R code is to change which init_ function is called.

> init_spaCy()
> anno <- run_annotators(tf)
> get_token(anno)
# A tibble: 68 x 8
      id   sid   tid    word   lemma  upos   pos   cid
   <int> <int> <int>   <chr>   <chr> <chr> <chr> <int>
1      1     1     1     The     the   DET    DT     0
2      1     1     2 regular regular   ADJ    JJ     4
3      1     1     3   early   early   ADJ    JJ    12
4      1     1     4 morning morning  NOUN    NN    18
5      1     1     5    yell    yell  NOUN    NN    26
6      1     1     6      of      of   ADP    IN    31
7      1     1     7  horror  horror  NOUN    NN    34
8      1     1     8     was      be  VERB   VBD    41
9      1     1     9     the     the   DET    DT    45
10     1     1    10   sound   sound  NOUN    NN    49
# ... with 58 more rows

The output is in the exact same format, but now all of the token columns are filled in with useful information such as the lemmatized form of each word and part of speech codes. Similar details are also filled into the other tables.

The third and final back end currently available uses the Java library CoreNLP. Users must install Java version 1.8 or higher and link it to R using the rJava package. The CoreNLP models, which are over 1 GB, can then be either manually downloaded or grabbed using the helper function download_coreNLP(). Once installed, this back end works just as the other back ends do.

> init_coreNLP()
> anno <- run_annotators(tf)
> get_token(anno)
# A tibble: 68 x 8
      id   sid   tid    word   lemma  upos   pos   cid
   <int> <int> <int>   <chr>   <chr> <chr> <chr> <int>
1      1     1     1     The     the   DET    DT     0
2      1     1     2 regular regular   ADJ    JJ     4
3      1     1     3   early   early   ADJ    JJ    12
4      1     1     4 morning morning  NOUN    NN    18
5      1     1     5    yell    yell  VERB    VB    26
6      1     1     6      of      of   ADP    IN    31
7      1     1     7  horror  horror  NOUN    NN    34
8      1     1     8     was      be  VERB   VBD    41
9      1     1     9     the     the   DET    DT    45
10     1     1    10   sound   sound  NOUN    NN    49
# ... with 58 more rows

The token output here is similar to, but not exactly the same as, that produced by the spaCy annotation engine. The only difference in the first ten rows is whether the word yell is categorized as a noun (spaCy) or a verb (coreNLP). While yell can be either part of speech, in context the spaCy interpretation is correct.

As seen in the code snippets here, the philosophy behind the design of the cleanNLP package is to make it as easy as possible to turn raw text into data frames. All of the functions introduced here have optional parameters that change the way the back ends are run or how the annotations are returned. These include which annotators to run and which language model to use. Complete documentation is available within the R help pages.

Table 1: Tables in the data model and their (composite) primary and foreign keys. All keys are given by non-negative integers. Namely, `id` indexes the documents, `sid` the sentences within a document, and `tid` the tokens within a sentence. The `cid` gives character offsets into the raw input text. Keys `rid` and `mid` are specifically constructed by the coreference annotator.
table name	record	primary key	foreign keys
document	document	id	\(\cdot\)
token	word / punctuation	id, sid, tid	cid
dependency	token pairs	id, sid, tid, tid_target	\(\cdot\)
entity	set of tokens	id, sid, tid, tid_end	\(\cdot\)
coreference	mentions	id, rid, mid	sid, tid, tid_end, tid_head
sentence	sentence	id, sid	\(\cdot\)
vector	word embedding	id, sid, tid	\(\cdot\)
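
The composite keys listed in Table 1 can be checked mechanically. The following base R sketch tests whether a set of columns uniquely identifies the rows of a table. The helper name is_key and the toy token table are assumptions made for this illustration and are not part of cleanNLP.

```r
# Toy token table with two documents (made-up values)
token <- data.frame(
  id   = c(1L, 1L, 1L, 1L, 2L),
  sid  = c(1L, 1L, 2L, 2L, 1L),
  tid  = c(1L, 2L, 1L, 2L, 1L),
  word = c("It", "rained", "We", "left", "Hello"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when no two rows share the same values in `cols`
is_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0L
}

key_full    <- is_key(token, c("id", "sid", "tid"))  # document, sentence, token
key_partial <- is_key(token, c("id", "tid"))         # sentence id left out
print(c(full = key_full, partial = key_partial))
```

In this toy table the token ids restart in every sentence, so dropping sid from the key produces duplicates; this is why the key must be composite.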

3 A data model for the NLP pipeline

An annotation object is simply a named list with each item containing a data frame. These frames should be thought of as tables living inside of a single database, with keys linking each table to one another. All tables are in the second normal form of Codd (1990). For the most part they also satisfy the third normal form, or, equivalently, the formal tidy data model of Wickham (2014). The limited departures from this more stringent requirement are justified below wherever they exist. In every case the cause is a transitive dependency that would require a complex range join to reconstruct.

Several standards have previously been proposed for representing textual annotations. These include the Linguistic Annotation Framework (Ide and Romary 2001), NLP Interchange Format (Hellmann et al. 2012), and CoNLL-X (Buchholz and Marsi 2006). The function from_CoNLL is included as a helper function in cleanNLP to convert from CoNLL formats into the cleanNLP data model. All of these, however, are concerned with representing annotations for interoperability between systems. Our goal is instead to create a data model well-suited to direct analysis, which requires a new approach.

In this section each table is presented and justifications for its existence and form are given. Individual tables may be pulled out with access functions of the form get_*. Example tables are pulled from the dataset obama, which is included with the cleanNLP package. This gives the annotation object obtained from the text of the annual speeches Barack Obama made to Congress. These annual addresses, known as The State of the Union, are mandated by the US Constitution and have been given by every president since George Washington.

3.1 Documents

Table 2: Schema for the document table. The id field serves as a primary key, and other metadata fields may be appended that give domain-specific information about each document.
	`get_document()`
`id`	integer. Id of the source document.
`time`	date time. The time at which the parser was run on the text.
`version`	character. Version of the NLP library used to parse the text.
`language`	character. Language of the text, in ISO 639-1 format.
`uri`	character. Description of the raw text location.

The documents table contains one row per document in the annotation object. What exactly constitutes a document is up to the user. It might include something as granular as a paragraph or as coarse as an entire novel. For many applications, particularly stylometry, it may be useful to simultaneously work with several hierarchical levels: sections, chapters, and an entire body of work. The solution in these cases is to define a document as the smallest unit of measurement, denoting the higher-level structures as metadata. For example, when working with a corpus of texts where each book is broken into chapters, we would make each document an individual chapter. A metadata field would be assigned to each chapter indicating which book it is a part of.

The primary key for the document table is a document id, stored as an integer index. By design, there should be no extrinsic meaning placed on this key. Other tables use it to map to one another and to the document table, but any metadata about the document is contained only in the document table rather than being forced into the document key. In other words, the temptation to use keys such as “Obama2016” is avoided because, while these look nice, trying to make use of them to extract document-level metadata is error prone and ultimately more verbose than making use of a join with the document table.
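
The join described above can be sketched in a few lines of base R. The tables below are toy data, and the year column is a hypothetical metadata field added for this example; it is not part of the minimal schema.

```r
# Toy document table; `year` is a hypothetical user-supplied metadata column
document <- data.frame(
  id   = 1:3,
  uri  = c("2009.txt", "2010.txt", "2011.txt"),
  year = c(2009L, 2010L, 2011L),
  stringsAsFactors = FALSE
)

# Toy token table that refers to documents only through the integer id
token <- data.frame(
  id  = c(1L, 1L, 2L, 3L, 3L, 3L),
  sid = c(1L, 1L, 1L, 1L, 1L, 1L),
  tid = c(1L, 2L, 1L, 1L, 2L, 3L)
)

# Attach the metadata with a join on id rather than parsing it out of a key
tok_meta <- merge(token, document[, c("id", "year")], by = "id")

# Any document-level field is now available for grouping or filtering
tokens_per_year <- table(tok_meta$year)
print(tokens_per_year)
```

Because the key carries no meaning of its own, adding or correcting a metadata field only touches the document table.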

The minimal fields required by the document table are given in Table 2. These are all filled in automatically by the annotation function. Any number of additional corpus-specific metadata fields, such as the aforementioned section and chapter designations, may be attached as well by passing them to the meta parameter of run_annotators. The document table for the example corpus is:

> get_document(obama)
# A tibble: 8 x 5
     id                time version language      uri
  <int>              <dttm>   <chr>    <chr>    <chr>
1     1 2017-05-21 09:27:55   1.8.2       en 2009.txt
2     2 2017-05-21 09:28:00   1.8.2       en 2010.txt
3     3 2017-05-21 09:28:05   1.8.2       en 2011.txt
4     4 2017-05-21 09:28:10   1.8.2       en 2012.txt
5     5 2017-05-21 09:28:14   1.8.2       en 2013.txt
6     6 2017-05-21 09:28:18   1.8.2       en 2014.txt
7     7 2017-05-21 09:28:22   1.8.2       en 2015.txt
8     8 2017-05-21 09:28:26   1.8.2       en 2016.txt

It may seem that common fields such as year and author should be added to the formal specification, but the perceived advantage is minimal. Users would still need to fill in these fields manually at some point, as such metadata cannot be unambiguously extracted from the raw text.

3.2 Tokens

Table 3: Schema for the token table. The fields id, sid, and tid serve as a composite key for each token. A row also exists for the root of each sentence.
	`get_token()`
`id`	integer. Id of the source document.
`sid`	integer. Sentence id, starting from 0.
`tid`	integer. Token id, with the root of the sentence starting at 0.
`word`	character. Raw word in the input text.
`lemma`	character. Lemmatized form of the token.
`upos`	character. Universal part of speech code.
`pos`	character. Language-specific part of speech code; uses the Penn Treebank codes.
`cid`	integer. Character offset at the start of the word in the original document.

The token table contains one row for each unique token, usually a word or punctuation mark, in any document in the corpus. Any annotator that produces an output for each token has its results displayed here. These include the lemmatizer and the part of speech tagger (Toutanova and Manning 2000). Table 3 shows the required columns contained in the token table. Depending on the annotators selected during the pipeline initialization, some of these columns may contain only missing data. A composite key exists by taking together the document id, sentence id, and token id. There is also a foreign key, cid, giving the character offset back into the original source document. An example of the table looks like this:

> get_token(obama, include_root = TRUE)
# A tibble: 65,758 x 8
      id   sid   tid      word     lemma  upos   pos   cid
   <int> <int> <int>     <chr>     <chr> <chr> <chr> <int>
1      1     1     0      ROOT      ROOT  <NA>  <NA>    NA
2      1     1     1     Madam     madam PROPN   NNP     0
3      1     1     2   Speaker   speaker PROPN   NNP     6
4      1     1     3         ,         , PUNCT     ,    13
5      1     1     4       Mr.       mr. PROPN   NNP    15
6      1     1     5      Vice      vice PROPN   NNP    19
7      1     1     6 President president PROPN   NNP    24
8      1     1     7         ,         , PUNCT     ,    33
9      1     1     8   Members   members PROPN  NNPS    35
10     1     1     9        of        of   ADP    IN    43
# ... with 65,748 more rows

A phantom token “ROOT” is included at the start of each sentence (it always has tid equal to 0) if the option include_root is set to TRUE (it is FALSE by default). This is useful so that joins from the dependency table, which contains references to the sentence root, into the token table have no missing values.
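
The character offsets stored in cid can be checked against the raw input. The following is a minimal sketch in base R using a hand-built toy table rather than package output. It assumes the offsets are zero-based, as the spaCy output in Section 2 suggests (the first word has cid 0), so one is added before calling substring.

```r
# Raw text and a toy token table; offsets copied from the spaCy example output
raw <- "The regular early morning yell"
token <- data.frame(
  word = c("The", "regular", "early", "morning", "yell"),
  cid  = c(0L, 4L, 12L, 18L, 26L),
  stringsAsFactors = FALSE
)

# Assumption: cid is zero-based, while R strings are indexed from one
start <- token$cid + 1L

# Slice each word back out of the raw text using its offset and length
recovered <- substring(raw, start, start + nchar(token$word) - 1L)
print(recovered)
```

If the assumption holds, the recovered strings match the word column exactly.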

The field upos contains the universal part of speech code, a language-agnostic classification, for the token. It could be argued that in order to maintain database normalization one should simply look up the universal part of speech code by finding the language code in the document table and joining a table mapping the Penn Treebank codes to the universal codes. This has not been done for several reasons. First, universal parts of speech are very useful for exploratory data analysis as they contain tags much more familiar to non-specialists such as “NOUN” (noun) and “CONJ” (conjunction). Asking users to apply a three table join just to access them seems overly cumbersome. Secondly, it is possible for users to use other parsers or annotation engines. These may not include granular part of speech codes and it would be difficult to figure out how to represent these if there were not a dedicated universal part of speech field.
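
For comparison, the three-table join that the schema avoids would look roughly like the base R sketch below. The mapping table pos_map and all values are toy data assembled for this illustration; no such table is claimed to ship with the package.

```r
# Toy document table holding the language of each document
document <- data.frame(id = 1:2, language = c("en", "en"),
                       stringsAsFactors = FALSE)

# Toy token table with only the language-specific part of speech code
token <- data.frame(
  id   = c(1L, 1L, 2L),
  sid  = 1L,
  tid  = c(1L, 2L, 1L),
  word = c("The", "cave", "Islington"),
  pos  = c("DT", "NN", "NNP"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from (language, Penn Treebank code) to universal code
pos_map <- data.frame(
  language = "en",
  pos      = c("DT", "NN", "NNP"),
  upos     = c("DET", "NOUN", "PROPN"),
  stringsAsFactors = FALSE
)

# Join 1: find the language of each token; join 2: look up the universal code
tok <- merge(token, document[, c("id", "language")], by = "id")
tok <- merge(tok, pos_map, by = c("language", "pos"))
tok <- tok[order(tok$id, tok$sid, tok$tid),
           c("id", "sid", "tid", "word", "pos", "upos")]
print(tok)
```

Storing upos directly in the token table trades a small amount of redundancy for not having to repeat these two joins in every analysis.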

3.3 Dependencies

Table 4: Schema for the dependency table. The final four variables are only provided when the option `get_token` is set to `TRUE`. The first five fields together create a composite key for the table.
	`get_dependency()`
`id`	integer. Id of the source document.
`sid`	integer. Sentence id of the source token.
`tid`	integer. Id of the source token.
`sid_target`	integer. Sentence id of the target token.
`tid_target`	integer. Id of the target token.
`relation`	character. Language-agnostic universal dependency type.
`relation_full`	character. Language specific universal dependency type.
`word`	character. The source word in the raw text.
`lemma`	character. Lemmatized form of the source word.
`word_target`	character. The target word in the raw text.
`lemma_target`	character. Lemmatized form of the target word.

Dependencies give the grammatical relationship between pairs of tokens within a sentence (Rafferty and Manning 2008; Green et al. 2011). As they are at the level of token pairs, they must be represented as a new table. All included fields are described in Table 4. Only one dependency should exist for any pair of tokens; the document id, sentence id, and source and target token ids together serve as a composite key. As dependencies exist only within a sentence, the sentence id does not need to be defined separately for the source and target. Dependencies take significantly longer to calculate than the lemmatization and part of speech tagging tasks.

The get_dependency function has an option (set to FALSE by default) to automatically join the dependency table to the target and source words and lemmas from the token table. This is a common task and involves non-trivial calls to the left_join function, making it worthwhile to include as an option. For example, the following code replicates the behavior of get_dependency when set to return words and lemmas:

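library(dplyr)

# Token table including sentence roots, so that root dependencies find a match
tok <- get_token(obama, include_root = TRUE)

dep <- get_dependency(obama) %>%
  # attach the word and lemma of the source token
  left_join(select(tok, id, sid, tid, word, lemma),
            by = c("id", "sid", "tid")) %>%
  # attach the word and lemma of the target token
  left_join(select(tok, id, sid, tid_target = tid,
                   word_target = word, lemma_target = lemma),
            by = c("id", "sid", "tid_target"))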

The output, equivalently using a call to get_dependency, is given by:

> get_dependency(obama, get_token = TRUE)
# A tibble: 62,781 x 10
      id   sid   tid tid_target relation relation_full      word     lemma
   <int> <int> <int>      <int>    <chr>         <chr>     <chr>     <chr>
1      1     1     2          1 compound          <NA>   Speaker   speaker
2      1     1     0          2     ROOT          <NA>      ROOT      ROOT
3      1     1     2          3    punct          <NA>   Speaker   speaker
4      1     1     6          4 compound          <NA> President president
5      1     1     6          5 compound          <NA> President president
6      1     1     2          6    appos          <NA>   Speaker   speaker
7      1     1     6          7    punct          <NA> President president
8      1     1     6          8    appos          <NA> President president
9      1     1     8          9     prep          <NA>   Members   members
10     1     1     9         10     pobj          <NA>        of        of
   word_target lemma_target
         <chr>        <chr>
1        Madam        madam
2      Speaker      speaker
3            ,            ,
4          Mr.          mr.
5         Vice         vice
6    President    president
7            ,            ,
8      Members      members
9           of           of
10    Congress     congress
# ... with 62,771 more rows

The word “ROOT” shows up in the second row, where the source word and lemma would have been NA had sentence roots not been explicitly included in the token table.
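
The effect of the phantom root token on this join can be reproduced on a toy example. The sketch below uses base R's merge as a stand-in for left_join; the helper join_source and the three-row dependency table are assumptions made for illustration only.

```r
# Toy dependency table for one sentence; the root relation has source tid 0
dep <- data.frame(
  id = 1L, sid = 1L,
  tid        = c(2L, 0L, 2L),
  tid_target = c(1L, 2L, 3L),
  relation   = c("compound", "ROOT", "punct"),
  stringsAsFactors = FALSE
)

# Toy token table that includes the phantom ROOT token at tid 0
token <- data.frame(
  id = 1L, sid = 1L, tid = 0:3,
  word = c("ROOT", "Madam", "Speaker", ","),
  stringsAsFactors = FALSE
)

# Hypothetical helper: left join of source tokens onto the dependencies
join_source <- function(dep, token) {
  out <- merge(dep, token, by = c("id", "sid", "tid"), all.x = TRUE)
  out[order(out$tid_target), ]
}

with_root    <- join_source(dep, token)
without_root <- join_source(dep, token[token$tid != 0L, ])

print(with_root$word)
print(without_root$word)
```

Only the version that keeps the root row yields a complete word column; dropping it leaves a missing value for the root dependency.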

Our parser produces universal dependencies (De Marneffe et al. 2014), which have a language-agnostic set of relationship types with language-specific subsets pertaining to specific grammatical relationships within a particular language. For the same reasons that both the part of speech codes and universal part of speech codes are included, both of these relationship types have been added to the dependency table.

3.4 Named entities

Table 5: Schema for the entity table. The first three fields serve as a composite key.
	`get_entity()`
`id`	integer. Id of the source document.
`sid`	integer. Sentence id of the entity mention.
`tid`	integer. Token id at the start of the entity mention.
`tid_end`	integer. Token id at the end of the entity mention.
`entity_type`	character. Type of entity.
`entity`	character. Raw words of the named entity in the text.
`entity_normalized`	character. Normalized version of the entity.

Named entity recognition is the task of finding entities that can be defined by proper names, categorizing them, and standardizing their formats (Finkel et al. 2005). The XML output of the Stanford CoreNLP pipeline places named entity information directly into its version of the token table. Doing this repeats information over every token in an entity and gives no canonical way of extracting the entirety of a single entity mention. We instead have a separate entity table, as is demanded by the normalized database structure, and record each entity mention in its own row. The full set of fields is given in Table 5, with the combination of document id, sentence id, and token id serving as a composite key.

An example of the named entity table is given by:

> get_entity(obama)
# A tibble: 3,035 x 6
      id   sid   tid tid_end entity_type              entity
   <int> <int> <int>   <int>       <chr>               <chr>
1      1     1     1       2      PERSON       Madam Speaker
2      1     1     8      10         ORG Members of Congress
3      1     1    12      14         ORG      the First Lady
4      1     1    16      18         GPE   the United States
5      1     1    30      30        TIME             tonight
6      1     1    43      44       EVENT            Chamber,
7      1     2     6       6        NORP           Americans
8      1     4    24      25        DATE           every day
9      1     8    23      23        TIME             tonight
10     1     8    27      27        NORP            American
# ... with 3,025 more rows

The categories available in the field entity_type are dependent on the specific back end used. When using the coreNLP back end, the entities ‘MONEY’, ‘ORDINAL’, ‘PERCENT’, ‘DATE’, and ‘TIME’ also have a normalized form. Entities for the spaCy back end offer more granular distinctions, with a full list contained in the help page for the function get_entity. As with the coreference table, a complete representation of the entity is included as an explicit character field because it is difficult to reconstruct after the fact from the token table.
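
To make the cost of that reconstruction concrete, the sketch below rebuilds entity strings from a toy token table with a range condition on tid in base R. The tables are hand-made for this example, and pasting tokens with single spaces is an assumption that only approximates the spacing of the original text.

```r
# Toy token table for one sentence
token <- data.frame(
  id = 1L, sid = 1L, tid = 1:10,
  word = c("Madam", "Speaker", ",", "Mr.", "Vice", "President", ",",
           "Members", "of", "Congress"),
  stringsAsFactors = FALSE
)

# Toy entity table without the explicit `entity` string
entity <- data.frame(
  id = 1L, sid = 1L,
  tid         = c(1L, 8L),
  tid_end     = c(2L, 10L),
  entity_type = c("PERSON", "ORG"),
  stringsAsFactors = FALSE
)

# Range join: for each entity, collect tokens with tid in [tid, tid_end]
entity$entity <- vapply(seq_len(nrow(entity)), function(i) {
  e <- entity[i, ]
  in_span <- token$id == e$id & token$sid == e$sid &
    token$tid >= e$tid & token$tid <= e$tid_end
  paste(token$word[in_span], collapse = " ")
}, character(1))

print(entity)
```

Even on this small example the reconstruction needs a per-row range condition rather than a simple equality join, which is the kind of operation the explicit field spares users from writing.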

3.5 Coreference

Table 6: Schema for the coreference table. Each row is best thought of as a coreference mention, rather than the coreference itself.
	`get_coreference()`
`id`	integer. Id of the source document.
`rid`	integer. Relation ID.
`mid`	integer. Mention ID; unique to each coreference within a document.
`mention`	character. The mention as raw words from the text.
`mention_type`	character. One of "LIST", "NOMINAL", "PRONOMINAL", or "PROPER".
`number`	character. One of "PLURAL", "SINGULAR", or "UNKNOWN".
`gender`	character. One of "FEMALE", "MALE", "NEUTRAL", or "UNKNOWN".
`animacy`	character. One of "ANIMATE", "INANIMATE", or "UNKNOWN".
`sid`	integer. Sentence id of the coreference.
`tid`	integer. Token id at the start of the coreference.
`tid_end`	integer. Token id at the end of the coreference.
`tid_head`	integer. Token id of the head of the coreference.

Coreferences link sets of tokens that refer to the same underlying person, object, or idea (Raghunathan et al. 2010; Lee et al. 2011, 2013; Recasens et al. 2013). One common example is the linking of a noun in one sentence to a pronoun in the next sentence. The coreference table describes these relationships but is not strictly a table of coreferences. Instead, each row represents a single mention of an expression and gives a reference id indicating all of the other mentions that it also coreferences. Table 6 gives the entire schema of the coreference table. The document, reference, and mention ids serve as a composite key for the table. Links back into the token table for the start, end and head of the mention are given as well; these are pushed to the right of the table as they should be considered foreign keys within this table.

An example helps to explain exactly what the coreference table represents:

> get_coreference(obama)
# A tibble: 6,982 x 12
      id   rid   mid                      mention mention_type   number  gender
   <int> <int> <int>                        <chr>        <chr>    <chr>   <chr>
1      1  2049     7            the United States       PROPER SINGULAR NEUTRAL
2      1  2049    77 the United States of America       PROPER SINGULAR NEUTRAL
3      1  2049   102                      America       PROPER SINGULAR NEUTRAL
4      1  2049   315                      America       PROPER SINGULAR NEUTRAL
5      1  2049   742                   America 's       PROPER SINGULAR NEUTRAL
6      1  2049   782                      America       PROPER SINGULAR NEUTRAL
7      1  2049   939                      America       PROPER SINGULAR NEUTRAL
8      1  2049   991                      America       PROPER SINGULAR NEUTRAL
9      1  2049  1003                      America       PROPER SINGULAR NEUTRAL
10     1  2049  1045                      America       PROPER SINGULAR NEUTRAL
     animacy   sid   tid tid_end tid_head
       <chr> <dbl> <int>   <int>    <int>
1  INANIMATE     1    16      18       18
2  INANIMATE     8    41      45       43
3  INANIMATE    12     6       6        6
4  INANIMATE    40    12      12       12
5  INANIMATE   103     8       9        8
6  INANIMATE   109     8       8        8
7  INANIMATE   132     5       5        5
8  INANIMATE   138    27      27       27
9  INANIMATE   140    41      41       41
10 INANIMATE   147     4       4        4
# ... with 6,972 more rows

Here, these are all mentions of the same underlying entity: The United States of America. There is a special relationship between the reference id rid and the mention id mid. The coreference annotator selects a specific mention for each reference that gets treated as the canonical mention for the entire class. The mention id for this mention becomes the reference id for the class. This relationship provides a way of identifying the canonical mention within a reference class and a way of treating the coreference table as pairs of mentions rather than individual mentions joined by a given key.
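The relationship between rid and mid can be made concrete with a small base R sketch. The table below is a hand-made stand-in for the output of get_coreference (the rows and the column name canonical_mention are invented for illustration); only the rule that the canonical mention has mid equal to rid comes from the description above.

```r
# Toy coreference table; values are illustrative, not taken from the corpus.
coref <- data.frame(
  id      = c(1L, 1L, 1L, 1L, 1L),
  rid     = c(7L, 7L, 7L, 20L, 20L),
  mid     = c(7L, 77L, 102L, 20L, 31L),
  mention = c("the United States", "the United States of America",
              "America", "the President", "he"),
  stringsAsFactors = FALSE
)

# The canonical mention of each reference class is the row where the
# mention id equals the reference id.
canonical <- coref[coref$mid == coref$rid, c("id", "rid", "mention")]
names(canonical)[3] <- "canonical_mention"

# Joining back on (id, rid) turns the table into pairs of
# (mention, canonical mention).
mention_pairs <- merge(coref, canonical, by = c("id", "rid"))
mention_pairs <- mention_pairs[order(mention_pairs$rid, mention_pairs$mid), ]
print(mention_pairs[, c("rid", "mid", "mention", "canonical_mention")])
```

The same self-join would apply to the real table, since the document and reference ids together identify a reference class.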

The text of the mention itself is included within the table. Because a mention may span several tokens, it would otherwise be very difficult to extract this information from the token table. It is also possible, though not supported in the current CoreNLP pipeline, that a mention could consist of a set of non-contiguous tokens, making this field impossible to otherwise reconstruct.
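To give a sense of the work that the mention field saves, the following base R sketch rebuilds the text of a contiguous mention from a token table using tid and tid_end. Both tables are small invented stand-ins, and the helper rebuild_mention is hypothetical; the approach only works when the mention is a contiguous run of tokens within one sentence.

```r
# Toy token and coreference tables (illustrative values only).
token <- data.frame(
  id = 1L, sid = 1L, tid = 1:6,
  word = c("We", "thank", "the", "United", "States", "."),
  stringsAsFactors = FALSE
)
coref <- data.frame(id = 1L, sid = 1L, tid = 3L, tid_end = 5L)

# Hypothetical helper: collect the words between tid and tid_end
# (inclusive) for the i-th mention and paste them together.
rebuild_mention <- function(i) {
  keep <- token$id == coref$id[i] &
    token$sid == coref$sid[i] &
    token$tid >= coref$tid[i] &
    token$tid <= coref$tid_end[i]
  paste(token$word[keep], collapse = " ")
}

rebuilt <- vapply(seq_len(nrow(coref)), rebuild_mention, character(1))
print(rebuilt)
```

Doing this for every row requires a range lookup against the token table rather than a simple key join, which is the inconvenience the explicit field avoids.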

3.6 Sentence level annotations

Table 7: Schema for the sentence table. The document and sentence ids serve as a composite key.
	`get_sentence()`
`id`	integer. Id of the source document.
`sid`	integer. Sentence id.
`sentiment`	integer. Predicted sentiment; 0 (very negative) to 4 (very positive).

The sentiment tagger provided by the CoreNLP pipeline predicts whether a sentence is very negative (0), negative (1), neutral (2), positive (3), or very positive (4) (Socher et al. 2013). There is no native sentiment model currently supported by spaCy. The sentiment output is placed in a separate table because it returns information exclusively at the sentence level, unlike any of the other annotators. The schema, described in Table 7, has the document and sentence ids serving as a composite key, with the only other field being an integer sentiment code. An example of the output can be seen in:

> get_sentence(obama)
# A tibble: 2,988 x 3
      id   sid sentiment
   <int> <dbl>     <int>
1      1     1         1
2      1     2         3
3      1     3         1
4      1     4         1
5      1     5         2
6      1     6         1
7      1     7         3
8      1     8         1
9      1     9         1
10     1    10         1
# ... with 2,978 more rows

The underlying sentiment model is a neural network. While at the moment few annotators exist at the sentence level, there is currently active research in modeling features that would eventually fit well into this table such as indicators of mood (Gaikwad and Joshi 2016), levels of sarcasm (Schifanella et al. 2016) or a characterization of the sentence’s “style” (Kabbara and Cheung 2016).
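Because the sentence table is keyed by document and sentence id, sentence-level scores roll up to the document level with a single grouped summary. The sketch below uses an invented table with the same three columns as Table 7; the choice of the mean and of "share of negative sentences" as summaries is an assumption made for illustration.

```r
# Toy sentence table with the schema of Table 7 (values are made up).
sentence <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L),
  sid       = c(1L, 2L, 3L, 1L, 2L),
  sentiment = c(1L, 3L, 1L, 2L, 4L)
)

# Average sentiment code per document.
doc_sentiment <- aggregate(sentiment ~ id, data = sentence, FUN = mean)

# Share of sentences per document coded as negative (1) or very negative (0).
share_negative <- tapply(sentence$sentiment <= 1, sentence$id, mean)

print(doc_sentiment)
print(share_negative)
```

Since the codes are ordinal rather than interval-scaled, a proportion such as share_negative may be easier to interpret than the mean.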

3.7 Word vectors

Our final table in the data model stores the relatively new concept of a word vector. Also known as word embeddings, these vectors are deterministic maps from the set of all available words into a high-dimensional, real-valued vector space. Words with similar meanings or themes will tend to be clustered together in this high-dimensional space. For example, we would expect apple and pear to be very close to one another, with vegetables such as carrots, broccoli, and asparagus only slightly farther away. The embeddings can often be used as input features when building models on top of textual data. For a more detailed description of these embeddings, see the papers on either of the most well-known examples: GloVe (Pennington et al. 2014) and word2vec (Mikolov et al. 2013). Only the spaCy back end to cleanNLP currently supports word vectors; these are turned off by default because they take a significant amount of space to store. The embedding model uses the fasttext embeddings (Bojanowski et al. 2016), a modification of the GloVe embeddings, which map words into a 300-dimensional space. To compute the embeddings, set the vector_flag parameter of init_spaCy to TRUE prior to running the annotation.

Word vectors are stored in a separate table from the tokens table out of convenience rather than as a necessity of preserving the data model’s normalized schema. Due to its size and the fact that the individual components of the word embedding have no intrinsic meaning, this table is stored as a matrix. We can see that there is exactly one row in the word embeddings for every non-ROOT token in the token table (note that the word embeddings for the obama dataset are not included with the package as they are too large to be uploaded to CRAN).

> dim(get_token(obama))
[1] 62781     8
> dim(get_vector(obama))
[1] 62781   303

The first three columns hold the keys id, sid, and tid, respectively. If no embedding is computed, the function get_vector returns an empty matrix.
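The layout described above (three key columns followed by the embedding) suggests a simple way to work with the matrix: split off the keys, then compare rows of the remainder. The sketch below uses a tiny invented matrix with four embedding dimensions in place of the 300 produced by the back end, and a hypothetical cosine helper.

```r
# Toy stand-in for the matrix returned by get_vector():
# columns 1-3 are the keys id, sid, tid; the rest are embedding values.
vec <- rbind(
  c(1, 1, 1,  0.9, 0.1, 0.0, 0.2),
  c(1, 1, 2,  0.8, 0.2, 0.1, 0.1),
  c(1, 1, 3, -0.1, 0.9, 0.7, 0.0)
)

keys <- vec[, 1:3, drop = FALSE]
colnames(keys) <- c("id", "sid", "tid")
emb <- vec[, -(1:3), drop = FALSE]

# Cosine similarity between two embedding vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim_12 <- cosine(emb[1, ], emb[2, ])  # tokens 1 and 2 point the same way
sim_13 <- cosine(emb[1, ], emb[3, ])  # tokens 1 and 3 do not
print(c(sim_12 = sim_12, sim_13 = sim_13))
```

The keys matrix can be used to join the embedding rows back to the token table on id, sid, and tid.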

4 Using cleanNLP to study State of the Union addresses

The President of the United States is constitutionally obligated to provide a report known as the State of the Union. The report summarizes the current challenges facing the country and the president’s upcoming legislative agenda. While historically the State of the Union was often a written document, in recent decades it has always taken the form of an oral address to a joint session of the United States Congress. In this final section the utility of the package is illustrated by showing how it can be used to study a corpus consisting of every such address made by a United States president through 2016 (Peters 2016). It highlights some of the major benefits of the tidy data model as it applies to the study of textual data, though by no means attempts to give an exhaustive coverage of all the available tables and approaches. The examples make heavy use of the table verbs provided by dplyr, the piping notation of magrittr and ggplot2 graphics. These are used because they best illustrate the advantages of the tidy data model that has been built in cleanNLP for representing corpus annotations. Relevant functions are prepended with cleanNLP:: in the following analysis in order to be clear which functions are supplied by the cleanNLP package.

4.1 Loading and parsing the data

The full text of all the State of the Union addresses through 2016 is available in the R package sotu (Arnold 2017), which can be installed from CRAN. The package also contains meta-data concerning each speech that we will add to the document table while annotating the corpus. The code to run this annotation is given by:

> library(sotu)
> library(cleanNLP)
>
> data(sotu_text)
> data(sotu_meta)
> init_spaCy()
> sotu <- cleanNLP::run_annotators(sotu_text, as_strings = TRUE,
+                                  meta = sotu_meta)

The annotation object, which is used throughout the following analysis, is stored in the object sotu.

4.2 Exploratory analysis

Simple summary statistics are easily computed off of the token table. To see the distribution of sentence length, the token table is grouped by the document and sentence id and the number of rows within each group is computed. The percentiles of these counts give a quick summary of the distribution.

> library(ggplot2)
> library(dplyr)
> library(magrittr)
> cleanNLP::get_token(sotu) %>%
+   count(id, sid) %$%
+   quantile(n, seq(0,1,0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   1   11   16   19   23   27   31   37   44   58  681

The median sentence has 27 tokens, whereas at least one has over 600 (this is due to a bulleted list in one of the written addresses being treated as a single sentence). To see the most frequently used nouns in the dataset, the token table is filtered on the universal part of speech field, grouped by lemma, and the number of rows in each group is once again calculated. Sorting the output and selecting the top 42 nouns yields a high-level summary of the topics of interest within this corpus.

> cleanNLP::get_token(sotu) %>%
+   filter(upos == "NOUN") %>%
+   count(lemma) %>%
+   top_n(n = 42, n) %>%
+   arrange(desc(n)) %>%
+   use_series(lemma)
 [1] "year"          "country"       "people"        "government"
 [5] "law"           "time"          "nation"        "who"
 [9] "power"         "interest"      "world"         "war"
[13] "citizen"       "service"       "duty"          "part"
[17] "system"        "peace"         "right"         "man"
[21] "program"       "policy"        "work"          "act"
[25] "state"         "condition"     "subject"       "legislation"
[29] "force"         "effort"        "treaty"        "purpose"
[33] "what"          "land"          "business"      "action"
[37] "measure"       "tax"           "way"           "question"
[41] "relation"      "consideration"

The result is generally as would be expected from a corpus of government speeches, with references to proper nouns representing various organizations within the government and non-proper nouns indicating general topics of interest such as “tax”, “law”, and “peace”.
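The two summaries above reduce to grouped counts, so they can be reproduced without any piping. The following base R sketch runs on a small invented token table (the column names id, sid, lemma, and upos follow the token table; the rows are made up) and is meant only to show the shape of the computation.

```r
# Toy token table (illustrative rows only).
token <- data.frame(
  id    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  sid   = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L),
  lemma = c("the", "law", "pass", "tax", "rise", "the", "law", "stand"),
  upos  = c("DET", "NOUN", "VERB", "NOUN", "VERB", "DET", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# Sentence length: number of token rows per (document, sentence) pair.
sent_len <- aggregate(lemma ~ id + sid, data = token, FUN = length)
names(sent_len)[3] <- "n"
print(quantile(sent_len$n, c(0, 0.5, 1)))

# Most frequent noun lemmas: filter on upos, count, and sort.
noun_counts <- sort(table(token$lemma[token$upos == "NOUN"]),
                    decreasing = TRUE)
print(names(noun_counts))
```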

Figure 1: Length of each State of the Union address, in total number of tokens. Color shows whether the address was given as a speech or delivered as a written document.

The length in tokens of each address is calculated similarly by grouping and summarizing at the document id level. The results can be joined with the document table to get the year of the speech and then piped into a ggplot2 command to illustrate how the length of the State of the Union has changed over time.

> cleanNLP::get_token(sotu) %>%
+   count(id) %>%
+   left_join(cleanNLP::get_document(sotu)) %>%
+   ggplot(aes(year, n)) +
+     geom_line(color = grey(0.8)) +
+     geom_point(aes(color = sotu_type)) +
+     geom_smooth()

Here, color is used to represent whether the address was given as an oral address or a written document. The output in Figure 1 shows that there are certainly time trends to the address length, with the form of the address (written versus spoken) also having a large effect on document length.

Finding the most used entities from the entity table over the time period of the corpus yields an alternative way to see the underlying topics. A slightly modified version of the code snippet used to find the top nouns in the dataset can be used to find the top entities. The get_token function is replaced by get_entity and the table is filtered on entity_type rather than the universal part of speech code.

> cleanNLP::get_entity(sotu) %>%
+   filter(entity_type == "GPE") %>%
+   count(entity) %>%
+   top_n(n = 26, n) %>%
+   arrange(desc(n)) %>%
+   use_series(entity)
 [1] "the United States"        "America"
 [3] "States"                   "Mexico"
 [5] "Great Britain"            "Spain"
 [7] "Washington"               "China"
 [9] "Executive"                "France"
[11] "Cuba"                     "Japan"
[13] "Texas"                    "Russia"
[15] "The United States"        "Germany"
[17] "United States"            "California"
[19] "Nicaragua"                "the Soviet Union"
[21] "Mississippi"              "Iraq"
[23] "Alaska"                   "U.S."
[25] "Philippines"              "Panama"
[27] "the District of Columbia"

The ability to redo analyses from a slightly different perspective is a direct consequence of the tidy data model supplied by cleanNLP. The top locations include some obvious and some less obvious instances. Those sovereign nations included such as Great Britain, Mexico, Germany, and Japan seem as expected given either the United States’ close ties or periods of war with them. The top states include the most populous regions (New York, California, and Texas) but also smaller states (Kansas, Oregon, Mississippi), the latter being more surprising.

One of the most straightforward ways of extracting a high-level summary of the content of a speech is to extract all direct object dependencies where the target noun is not a very common word. In order to do this for a particular speech, the dependency table is joined to the document table, a particular document is selected, and only relationships of type “dobj” (direct object) are retained. The result is then joined to the data set word_frequency, which is included with cleanNLP, and pairs whose target has a frequency value below a small cutoff (0.001 in the code below) are selected to give the final result. Here is an example of this using the first address made by George W. Bush in 2001:

> cleanNLP::get_dependency(sotu, get_token = TRUE) %>%
+   left_join(get_document(sotu)) %>%
+   filter(year == 2001, relation == "dobj") %>%
+   select(id = id, start = word, word = lemma_target) %>%
+   left_join(word_frequency) %>%
+   filter(frequency < 0.001) %>%
+   select(id, start, word) %$%
+   sprintf("%s => %s", start, word)
Joining, by = "id"
Joining, by = "word"
 [1] "take => oath"                  "using => statistic"
 [3] "increasing => layoff"          "protects => trillion"
 [5] "makes => welcoming"            "accelerating => cleanup"
 [7] "fight => homelessness"         "helping => neighbor"
 [9] "allowing => taxpayer"          "provide => mentor"
[11] "fight => illiteracy"           "promotes => compassion"
[13] "asked => ashcroft"             "end => profiling"
[15] "pay => trillion"               "throw => dart"
[17] "restores => fairness"          "promoting => internationalism"
[19] "makes => downpayment"          "discard => relic"
[21] "confronting => shortage"       "directed => cheney"
[23] "sound => footing"              "divided => conscience"
[25] "done => servant"

Most of these phrases correspond with the “compassionate conservatism” that George W. Bush ran under in the preceding 2000 election. Applying the same analysis, here with a cutoff of 0.0005, to the 2002 State of the Union, which came under the shadow of the September 11th terrorist attacks, shows a drastic shift in focus.

> cleanNLP::get_dependency(sotu, get_token = TRUE) %>%
+   left_join(get_document(sotu)) %>%
+   filter(year == 2002, relation == "dobj") %>%
+   select(id = id, start = word, word = lemma_target) %>%
+   left_join(word_frequency) %>%
+   filter(frequency < 0.0005) %>%
+   select(id, start, word) %$%
+   sprintf("%s => %s", start, word)
Joining, by = "id"
Joining, by = "word"
 [1] "urged => follower"        "called => troop"
 [3] "brought => sorrow"        "owe => micheal"
 [5] "ticking => timebomb"      "have => troop"
 [7] "hold => hostage"          "eliminate => parasite"
 [9] "flaunt => hostility"      "develop => anthrax"
[11] "put => troop"             "increased => vigilance"
[13] "fight => anthrax"         "thank => attendant"
[15] "defeat => recession"      "want => paycheck"
[17] "set => posturing"         "enact => safeguard"
[19] "embracing => ethic"       "owns => aspiration"
[21] "containing => resentment" "erasing => rivalry"
[23] "embrace => tyranny"

Here the topics have almost entirely shifted to counter-terrorism and national security efforts.
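The rare-target filter used in both analyses is an ordinary join followed by a threshold. A base R sketch of that step is given below; the dependency pairs and the frequency values are invented for illustration, and the two-column frequency table is only a simplified stand-in for word_frequency.

```r
# Toy verb/direct-object pairs (illustrative only).
dep <- data.frame(
  word         = c("take", "fight", "have", "make"),
  lemma_target = c("oath", "illiteracy", "time", "year"),
  stringsAsFactors = FALSE
)

# Hypothetical frequency lookup; the numbers are made up.
freq <- data.frame(
  word      = c("oath", "illiteracy", "time", "year"),
  frequency = c(0.0008, 0.0002, 0.15, 0.12),
  stringsAsFactors = FALSE
)

# Look up the frequency of each target lemma, keeping unmatched pairs as NA.
dep_freq <- merge(dep, freq, by.x = "lemma_target", by.y = "word",
                  all.x = TRUE)

# Keep only pairs whose target falls below the cutoff.
rare <- dep_freq[!is.na(dep_freq$frequency) & dep_freq$frequency < 0.001, ]
out <- sprintf("%s => %s", rare$word, rare$lemma_target)
print(out)
```

Lowering the cutoff, as in the 2002 example, simply tightens the final filter and leaves the join unchanged.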

4.3 Models

The get_tfidf function provided by cleanNLP converts a token table into a sparse matrix representing the term-frequency inverse document frequency matrix (or any intermediate part of that calculation). This is particularly useful when building models from a textual corpus. The tidy_pca function, also included with the package, takes a matrix and returns a data frame containing the desired number of principal components. Dimension reduction involves piping the token table for a corpus into the get_tfidf function and passing the results to tidy_pca.

> pca <- cleanNLP::get_token(sotu) %>%
+   filter(pos %in% c("NN", "NNS")) %>%
+   cleanNLP::get_tfidf(min_df = 0.05, max_df = 0.95,
+                       type = "tfidf", tf_weight = "dnorm") %$%
+   cleanNLP::tidy_pca(tfidf, get_document(sotu))

In this example only non-proper nouns have been included in order to minimize the stylistic attributes of the speeches and focus more on their content. A scatter plot of the speeches using these components is shown in Figure 2. There is a definite temporal pattern to the documents, with the 20th century addresses forming a distinct cluster on the right side of the plot.
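For readers who want to see what the pipeline computes, the following base R sketch builds a small tf-idf matrix by hand and passes it to prcomp. It uses a textbook weighting (term frequency normalized by document length, times the log inverse document frequency), which is an assumption and may differ from the weighting that get_tfidf applies; the token rows are invented.

```r
# Toy token table restricted to noun lemmas (illustrative only).
token <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  lemma = c("war", "peace", "treaty", "tax", "job", "tax",
            "war", "tax", "job", "peace"),
  stringsAsFactors = FALSE
)

# Document-by-term count matrix.
tf <- unclass(table(token$id, token$lemma))

# Inverse document frequency: log(number of documents / documents with term).
df  <- colSums(tf > 0)
idf <- log(nrow(tf) / df)

# Length-normalized term frequency weighted by idf.
tfidf <- sweep(tf / rowSums(tf), 2, idf, `*`)

# Principal components of the weighted matrix; keep the first two.
pca <- prcomp(tfidf, center = TRUE, scale. = FALSE)
scores <- data.frame(id = rownames(tfidf), pca$x[, 1:2])
print(scores)
```

The scores data frame has one row per document and can be joined to document-level meta-data before plotting.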

Topic models are a collection of statistical models for describing abstract themes within a textual corpus. Each theme is characterized by a collection of words that commonly co-occur; for example, the words ‘crop’, ‘dairy’, ‘tractor’, and ‘hectare’, might define a farming theme. One of the most popular topic models is latent Dirichlet allocation (LDA), a Bayesian model where each topic is described by a probability distribution over a vocabulary of words. Each document is then characterized by a probability distribution over the available topics. For a formal description, see Blei et al. (2003) and Pritchard et al. (2000), the original papers outlining LDA. To fit LDA on a corpus of text parsed by the cleanNLP package, the output of get_tfidf can be piped directly to the LDA function in the package topicmodels. The topic model function requires raw counts, so the type variable in get_tfidf is set to “tf”.

> library(topicmodels)
> tm <- cleanNLP::get_token(sotu) %>%
+   filter(pos %in% c("NN", "NNS")) %>%
+   cleanNLP::get_tfidf(min_df = 0.05, max_df = 0.95,
+                       type = "tf", tf_weight = "raw") %$%
+   LDA(tf, k = 16, control = list(verbose = 1))

The topics, ordered by approximate time period, are visualized in Figure 3. We describe each topic by giving the five most important words. Most topics exist for a few decades and then largely disappear, though some persist over non-contiguous periods of the presidency. The “program, energy, effort, legislation, policy” topic, for example, appears during the 1950s and crops up again during the energy crisis of the 1970s. The “world, man, freedom, force, life” topic peaks during both World Wars, but is absent during the 1920s and early 1930s.

Finally, the cleanNLP data model is also convenient for building predictive models. The State of the Union corpus does not lend itself to an obviously applicable prediction problem. A classifier that distinguishes speeches made by George W. Bush and Barack Obama will be constructed here for the purpose of illustration. As a first step, a term-frequency matrix is extracted using the same technique as was used with the topic modeling function. However, here the frequency is computed for each sentence in the corpus rather than the document as a whole. The ability to do this seamlessly with a single additional mutate function defining a new id illustrates the flexibility of the get_tfidf function.

> df <- get_token(sotu) %>%
+   left_join(get_document(sotu)) %>%
+   filter(year > 2000) %>%
+   mutate(new_id = paste(id, sid, sep = "-")) %>%
+   filter(pos %in% c("NN", "NNS"))
Joining, by = "id"
> mat <- get_tfidf(df, min_df = 0, max_df = 1, type = "tf",
+                  tf_weight = "raw", doc_var = "new_id")

It will be necessary to define a response variable y indicating whether this is a speech made by President Obama as well as a training flag indicating which speeches were made in odd-numbered years. This is done via a separate table join and a pair of mutations.

> meta <- data_frame(new_id = mat$id) %>%
+   left_join(df[!duplicated(df$new_id),]) %>%
+   mutate(y = as.numeric(president == "Barack Obama")) %>%
+   mutate(train = year %in% seq(2001,2016, by = 2))
Joining, by = "new_id"
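The two derived columns can be checked on a small example. The base R sketch below uses an invented meta table (the new_id values are placeholders) and expresses the odd-year rule with the modulo operator, which selects the same years as seq(2001, 2016, by = 2) for speeches in that range.

```r
# Toy sentence-level meta table (ids and rows are illustrative only).
meta <- data.frame(
  new_id    = c("229-1", "229-2", "230-1", "237-1", "238-1"),
  year      = c(2001, 2001, 2002, 2009, 2010),
  president = c("George W. Bush", "George W. Bush", "George W. Bush",
                "Barack Obama", "Barack Obama"),
  stringsAsFactors = FALSE
)

# Response: 1 for sentences from Obama's addresses, 0 otherwise.
meta$y <- as.numeric(meta$president == "Barack Obama")

# Training flag: sentences from addresses given in odd-numbered years.
meta$train <- meta$year %% 2 == 1

# Cross-tabulate the split against the response to check class balance.
print(table(train = meta$train, y = meta$y))
```

Splitting by year rather than by sentence keeps all sentences of an address on the same side of the split, so the test set consists of entire speeches the model has not seen.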

The output may now be used as input to the elastic net function provided by the glmnet package. The response is set to the binomial family given the binary nature of the response and training is done on only those speeches occurring in odd-numbered years. Cross-validation is used in order to select the best value of the model’s tuning parameter.

> library(glmnet)
> model <- cv.glmnet(mat$tf[meta$train,], meta$y[meta$train], family = "binomial")

A boxplot of the predicted classes for each address is given in Figure 4. The algorithm does a very good job of separating the speeches. Looking at the odd years versus even years (the training and testing sets, respectively) indicates that the model has not been over-fit.

One benefit of the penalized linear regression model is that it is possible to interpret the coefficients in a meaningful way. Here are the non-zero elements of the regression vector, coded as whether they have a positive (more Obama) or negative (more Bush) sign:

> beta <- coef(model, s = model[["lambda"]][11])[-1]
> sprintf("%s (%d)", mat$vocab, sign(beta))[beta != 0]
 [1] "job (1)"          "business (1)"     "citizen (-1)"
 [4] "terrorist (-1)"   "government (-1)"  "freedom (-1)"
 [7] "home (1)"         "college (1)"      "weapon (-1)"
[10] "deficit (1)"      "company (1)"      "peace (-1)"
[13] "enemy (-1)"       "terror (-1)"      "income (-1)"
[16] "drug (-1)"        "kid (1)"          "regime (-1)"
[19] "class (1)"        "crisis (1)"       "industry (1)"
[22] "need (-1)"        "fact (1)"         "relief (-1)"
[25] "bank (1)"         "liberty (-1)"     "society (-1)"
[28] "duty (-1)"        "folk (1)"         "account (-1)"
[31] "compassion (-1)"  "environment (-1)" "inspector (-1)"

These generally seem as expected given the main policy topics of focus under each administration. During most of the Bush presidency, as mentioned before, the focus was on national security and foreign policy. Obama, on the other hand, inherited the recession of 2008 and was more focused on the overall economic policy.

5 Conclusions

In this paper a normalized data model for representing text annotations has been presented and rationalized. We have also demonstrated how the R package cleanNLP implements this data model using various, configurable back ends. Our focus has been to illustrate why this general approach and specific implementation are both powerful and easy to integrate into existing data pipelines. It is expected that some users will utilize the entirety of the underlying annotation pipelines, internal R structures, and helper functions. Others may use the package as a convenient wrapper around either the CoreNLP or spaCy libraries. At either extreme, or anywhere in between, our approach provides powerful tools for applying exploratory, graphical, and model-based techniques to textual data sources.

The cleanNLP package continues to be actively developed. In particular, we hope to include new sentence-level annotations as they are integrated into the spaCy and CoreNLP libraries. While major releases are available on CRAN, new features are added periodically on the development branch located at: https://github.com/statsmaths/cleanNLP. Bug reports, feature and collaboration requests can all be made using the GitHub issues page.

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2017-035.zip

This author, who is also the maintainer of coreNLP, has witnessed this first-hand by way of the persistent bug reports centering around the formatting of the XML output in strange edge cases or over new versions of the CoreNLP library. The coreNLP package will still be maintained for users looking explicitly to access methods from the Stanford Library, whereas cleanNLP is being developed to provide a simpler interface that is consistent across various back ends.