TableToLongForm automatically converts hierarchical Tables intended for a human reader into a simple LongForm dataframe that is machine readable, making it easier to access and use the data for analysis. It does this by recognising positional cues present in the hierarchical Table (which would normally be interpreted visually by the human brain) to decompose, then reconstruct the data into a LongForm dataframe. The article motivates the benefit of such a conversion with an example Table, followed by a short user manual, which includes a comparison between the simple one argument call to TableToLongForm
, with code for an equivalent manual conversion. The article then explores the types of Tables the package can convert by providing a gallery of all recognised patterns. It finishes with a discussion of available diagnostic methods and future work.
TableToLongForm
is an R package that automatically converts hierarchical Tables1
intended for a human reader into a LongForm R "data.frame"
that is
machine readable.2 While several tools exist that can aid in
manipulation of data, such as OpenRefine (OpenRefine 2013), which can be
used to clean messy data, the speedr R package (Visne et al. 2012), which can
aid in filtering data, and R packages like
reshape2 (Wickham 2007)
and plyr (Wickham 2011), which
enable restructuring of the data to focus on specific aspects of the
data, for these tools to work their magic you first need machine
readable data. However, data released as Tables are not machine
readable. At best, such tools will provide some aid in manually
converting the data to something that is machine readable, a process
that is costly in terms of both time and effort. TableToLongForm is an
automatic tool for converting a family of Tables to a machine readable
form, and once so converted the user is free to use their favourite
tool, R or otherwise, to make full use of the data.
The article motivates the benefit of such a conversion with an example
Table, followed by a short user manual, which includes a comparison
between the simple one argument call to TableToLongForm
, with code for
an equivalent manual conversion. The article then explores the types of
Tables the package can convert by providing a gallery of all recognised
patterns. It finishes with a discussion of available diagnostic methods
and future work.
There is still a prevalence of data releases being made for direct human consumption in formats that are not machine readable, a significant barrier to effective utilisation of the data. One symptom of this is the release of data in tabular form that relies on a hierarchy that can only be understood after identifying patterns and discerning the structure of the Table, a task easy for a human brain but rather difficult for a computer.
An example of such a Table is shown in 1. For such a
Table, the computer will be unable to easily read in the data due to the
difficulty in finding all information related to a piece of data. Take
the number 876
in cell (5, 9) for instance; to collect all the
information linked to that number we must look at cell (5, 1) for the
time period (2007Q4
), cell (4, 9) for the data heading
(Total Labour Force
), cell (3, 2) for the ethnic category
(European Only
) and cell (2, 2) for the gender category (Male
). Note
that, aside from the time period and the data heading, the other
information related to cell (5, 9) were neither in the same row nor the
same column. The human brain can interpret the positional cues to
understand the hierarchy fairly easily; the computer requires a lot more
work.
Preparing such data for use would normally require a non-trivial time investment to restructure the data in a manner that can be machine read and used. If such preparatory work was done manually, such work will have to be repeated multiple times as the data is updated. In some cases the data will be spread across multiple files, which means that much more preparatory work. Even if the work is scripted, small changes in the format can easily throw a wrench into the works and break it. All of this quickly adds up to a significant time cost to make use of data released in Tables.
LongForm is a simple alternative data format that most R users will find
familiar as an R "data.frame"
object, the format that most R functions
require of their data. TableToLongForm automatically converts Tables
to LongForm dataframes,3 which can mean significant savings in time
while enabling much greater utilisation of the data.
2 shows 1 after automatic
conversion using TableToLongForm, which took around a tenth of a
second on the author’s machine. In particular, note that the same data
we examined above, 876
now in cell (2, 11), has all related
information in the same row (except for the column heading, which is in
the same column), making it easy for the computer to understand and
manipulate.
TableToLongForm’s preferred argument is a "matrix"
of mode
"character"
. If a "data.frame"
is supplied instead, it is coerced to
a "matrix"
with a warning. Empty cells should be classed as "NA"
for
correct operation of the algorithms. Currently TableToLongForm does
not distinguish between missing values and empty space, both are treated
as "NA"
values.
As the Labour Force Status data used in Figure 1
classifies missing values as ..
, we must ensure R correctly reads
these, in addition to empty cells, as "NA"
values.4
= as.matrix(read.csv("StatsNZLabourForce.csv",
LabourForce header = FALSE, na.strings = c("", "..")))
If the Table can be recognised by TableToLongForm, a simple call to
TableToLongForm
with just a single argument is all that is needed.
TableToLongForm
has additional optional arguments used primarily for
diagnostic purposes, which are covered in the diagnostics section at the
end of the article.
= TableToLongForm(LabourForce) LabourForce.converted
For comparison the code for manual conversion of the table is provided below. We note after careful observation of the data that:
There are 3 gender categories: Male
, Female
and
Total Both Sexes
, each 80 columns in width.
There are 10 ethnic categories, each a consistent 8 columns in width.
The data are found in rows 5 to 26.
Armed with this knowledge, we can write the above code that, with some
trial and error and cross-checking of results, will successfully convert
the Table to a LongForm. This code is fairly compact and efficiency-wise
beats TableToLongForm, taking a little over a thousandth of a second
to make the conversion (compared to about a hundredth of a second for a
call to TableToLongForm
) on the author’s machine. However, it took a
non-trivial investment of time to code and test the results (it took the
author about 30 minutes), is mostly useless for any other Table, and if
any of the many strong assumptions it makes are violated (e.g. a new row
of data is added), it breaks and requires fixing, which means even more
time consumed. All this work and hassle to just read in the data in a
useful format.
= NULL
LFout = LabourForce[5:26, 1]
chYear for(Gender in 0:2)
for(Ethni in 0:9){
= LabourForce[2, 2 + Gender * 80]
chGender = LabourForce[3, 2 + Ethni * 8]
chEthni = rbind(LFout,
LFout cbind(chGender, chEthni, chYear,
5:26, 2 + Gender * 80 + (Ethni * 8):((Ethni + 1) * 8 - 1)])
LabourForce[
)
}colnames(LFout) = c("Gender", "Ethnicity", "Time.Period", LabourForce[4, 2:9])
For a successful conversion, TableToLongForm must first find the
Table, that is, it must Identify the rows and columns in which the
labels and data values can be found. This task can be surprisingly
difficult, requiring many fringe-case checks and exception handling. The
current core identification algorithm searches for blocks (rectangular
regions) of numbers in the supplied "matrix"
. This region is assumed
to contain the data and from it TableToLongForm infers the locations
of the corresponding labels. The result, after some extra work to handle
fringe-cases and the like, is the IdentResult
, a "list"
which
specifies the rows and columns in which the labels and the data can be
found.
If TableToLongForm fails to correctly Identify the correct rows and
columns, it is possible to manually specify the IdentResult
as an
argument. This is the case for the Table in 3, where one
of the row label columns is a Year column consisting only of numbers.
TableToLongForm’s numeric label detection algorithm is still quite
primitive and fails to correctly identify column 3 as a label, but by
manually specifying the IdentResult
, TableToLongForm can still
successfully convert the Table; the resulting "data.frame"
is shown in
4. Even for cases such as this where the
IdentResult
must be manually specified, the work required for the
conversion with TableToLongForm will be strictly less than for a
manual conversion as we would need the same information, and more, to
convert manually.
TableToLongForm(NEET, IdentResult = list(rows = list(label = 3:4, data = 5:46),
cols = list(label = 2:3, data = 4:24)))
TableToLongForm consists of a number of algorithms that can
collectively process a variety of so-called recognised patterns of
hierarchical structure (also called the parentage of the labels). Any
Table that consists of some combination of the recognised patterns can
be automatically converted with TableToLongForm. It is not strictly
necessary for a user to know what the patterns are, as they can simply
try calling TableToLongForm
on the Table to see if it converts. All
the recognised patterns are listed here for reference.5 For an
example of a real Table that demonstrates a combination of the
recognised patterns, refer to Real Example - NZQA located at the end
of this section.
For each pattern an example table is first shown using toy data, that displays the pattern, followed by a short description of the pattern, and ending with the example table converted with TableToLongForm.
Many of the recognised patterns apply only for row labels. Column labels are recognised by noticing that the transpose of column labels can often be processed as row labels, though there are several fringe cases that must be corrected for.
Above, we have an example of the Empty Below pattern, the most simple
type of parentage. Here the parent and children are in different
columns and we can see which of the children belong to which parent
through the use of empty space below each parent. The Table after
conversion to a LongForm follows.
Above, we have an example of the most basic form of the Empty Right
pattern. In this situation we have children in the same column as their
parent. We can still recognise these as children if the children have
children (Child-Child) in a different column, while the parent does
not (and hence the parent is Empty Right). Note the values pertaining to
the Parent (if any) are discarded. This is because they are assumed to
simply represent the sum of their children’s values. The Table after
conversion to a LongForm follows.
Above, we have an example of both Empty Below and Empty Right. Either
algorithm can handle this situation, but simply due to the ordering of
the algorithms such situations are handled as Empty Right. The Table
after conversion to a LongForm follows.
Above, we have an example of a complex version of the Empty Right
pattern. The “parent-child in the same column” situation has been
extended further and we now have parents (Super-Parent) who have
children (Parent), who each further have children (Child), all in
the same column. Such situations can still be recognised if the
lowest-level children in the column (Child) have children in a
different column (Child-Child), while its direct parents (Parent)
each have children in the same column (Child) but not in a different
column (is Empty Right), and the top-most parents (Super-Parents) also
have no children in a different column (is also Empty Right). The
algorithm cannot currently handle super-super-parents. The Table after
conversion to a LongForm follows.
Above, we have an example of Multi-row Column Labels. Often column
labels are physically split over multiple rows rather than making use of
line breaks in the same cell. In such occurrences, any row not
identified as a parent are collapsed into a single row of labels. The
Table after conversion to a LongForm follows.
Above, we have an example of Mismatched Column Labels. Sometimes the
column labels are in a different column to the data, usually due to a
misguided attempt at visual alignment of labels to the data. As long as
the correct rows and columns were identified for the data and the labels
(see 3 subsection on IdentResult
), and if there are
the same number of data columns as label columns, these mismatched
column labels will be paired with the data columns. The Table after
conversion to a LongForm follows.
Above, we have an example of Misaligned Column Labels. Often column
parents are physically centred over their children (N.B. where a
spreadsheet’s cell-merge feature is used to do the centering, the actual
value is usually stored in the top-left cell and hence causes no
problems). TableToLongForm makes use of pattern recognition to
identify repeating patterns in the labels, or in empty cells surrounding
the labels, to correct for the misalignment. For the Column Parents
row, we find (starting from column 2, the first data column) a pattern
of Empty-NonEmpty-Empty-Empty, with the pattern occurring twice. In the
Col Child row, we also find a pattern of length 4 occurring twice.
This can be used to correctly align the Column Parents to its
children. The Table after conversion to a LongForm follows.
Above, we have a generalised example of Misaligned Column Labels. We now
have Column Super-Parent which is misaligned to both its direct
children, the Column Parents, and to the lowest-level children. The
Table after conversion to a LongForm follows.
Above, we have an example of real data released in a Table6 that
demonstrates a combination of many of the patterns listed above. The
data comes from the New Zealand Qualifications Authority (NZQA 2012)
regarding the entries and results for their Scholarship Exams. The Table
demonstrates Empty Below, Empty Right (including Type 3, as
All Subjects
is a super-parent), Multi-row Column Labels, and
Misaligned Column Labels. The Table is substantially more complex than
the Labour Force Status data used in 1, and will require
considerably more work to convert manually. Worse, each year of the data
is stored in a separate Table, each with slight differences in format.
Thus the manual conversion code would have to either be individually
tweaked for each year (requiring yet more work), or be flexible enough
to handle these differences (requiring substantially more work). Other
data from NZQA faces bigger problems; though the Scholarships data can
all be obtained in a mere 8 Tables (for 8 years from 2004 to 2011), the
Subjects data not only requires a separate Table for each year, but also
for each subject (of which there are 91). Obtaining the greatest
breakdown possible for the Subjects data across all other variables
requires thousands of individual Tables. Without automatic conversion
with TableToLongForm, simply reading in such data for use would
require too much work to be practical. The Table after conversion to a
LongForm follows.
The primary limitation of TableToLongForm is that the function will be
a black box to most users. After running the function on a Table, the
user will either be given back a "data.frame"
with no easy way of
verifying if the result is correct, or be confronted with an error with
little idea of what went wrong. Based on ad hoc tests conducted so far,
TableToLongForm will either succeed, or fail catastrophically in a
manner that is easily recognised as utter failure. However methods for
verifying correct operation (or to understand failures) would be
desirable.
The simplest method currently available is to examine the additional
output returned when TableToLongForm
is called with the optional
argument fulloutput = TRUE
. This will return the ‘final product’ of
TableToLongForm’s algorithms in the form of IdentResult
, rowplist
and colplist
.
was covered above in the 3 section and contains information on where the data and labels are found.
stand for Row/Column Parentage List which are nested "list"
objects that represents all the hierarchical relationships in the
Table, as identified by TableToLongForm. For easier reading they
are assigned the "plist"
class which has a custom print
method.
An example of a colplist
is shown in 5.
Unfortunately this output has two key limitations. First, it is not obvious from this output what went wrong (or if nothing went wrong), requiring some detective work to piece together the evidence. Second, if anything did go wrong, the user still does not know why.
The option with the potential to provide the most information is calling
TableToLongForm
with the optional argument diagnostics = TRUE
, which
will write diagnostic output to a file, printing key variables at each
major stage of the conversion process. This output can thus be used to
track TableToLongForm’s progress as it works to convert the Table,
enabling fairly accurate assessment of where exactly it went wrong. Some
example diagnostic output is shown in 6. Unfortunately,
understanding this output requires familiarity with the workings of the
code and is unlikely to be of much use to anyone other than the author.
This article has introduced TableToLongForm, an R package that can automatically convert hierarchical Tables that would normally rely on the discerning powers of a human brain, to a simple LongForm dataframe that any decent software package can easily manipulate and use. While it can handle a variety of Tables automatically, and an even greater variety with some aid from the human user, it is not without limitations. Ultimately, TableToLongForm still uses algorithms to detect a known set of recognised patterns and any Table that deviates from these patterns will break TableToLongForm.
There is work to be done in refining and generalising existing
algorithms and creating new algorithms so that TableToLongForm can
successfully handle more cases, while also reducing the possibility of a
false positive. These range from adding more robust checks to the
algorithms to verify correct detection, such as sum-checks or pattern
recognition, to more fundamental changes, such as altering the "NA"
classification to distinguish between empty space and missing values. A
recent addition to the package (introduced in version 1.3.0) is to
enable new, custom algorithms to be used in place of the default ones
included with the package. More information on this can be found in the
official webpage for the package.
There is also work to be done in diagnostics output, not only in the
formal diagnostic output, but also in the form of error and warning
messages. Consider for instance the following error message if we call
TableToLongForm
on the Table in 3 without specifying the
correct IdentResult
. From these messages it is not at all obvious that
the problem is an incorrect IdentResult
, which is a problem that is
relatively easy to address if only it can correctly be identified by the
user.
in 1:ncol(datbit) : argument of length 0
Error : Warning message:
In additionrbind(matColLabel[!fullrows, , drop = FALSE], collapsedlabels) :
In length (arg 2) number of columns of result is not a multiple of vector
In terms of formal diagnostic output, various ideas are being tried such
as graphical representations of the information provided by
fulloutput = TRUE
by drawing the original Table with the regions being
highlighted in some way. Such a method would, for example, make it
easier to see a problem with IdentResult
, as it should become apparent
on the drawn Table that the incorrect regions are being highlighted.
This article has been written for TableToLongForm version 1.3.1. The code for reproducing the figures in this article, as well as more detailed documentation on the code itself, can be found at https://www.stat.auckland.ac.nz/~joh024/Research/TableToLongForm/. The development version can be found on github at https://github.com/joh024/TableToLongForm.
TableToLongForm, reshape2, plyr
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Oh, "Automatic Conversion of Tables to LongForm Dataframes", The R Journal, 2014
BibTeX citation
@article{RJ-2014-021, author = {Oh, Jimmy}, title = {Automatic Conversion of Tables to LongForm Dataframes}, journal = {The R Journal}, year = {2014}, note = {https://rjournal.github.io/}, volume = {6}, issue = {2}, issn = {2073-4859}, pages = {16-26} }