GrpString: An R Package for Analysis of Groups of Strings

The R package GrpString was developed as a comprehensive toolkit for quantitatively analyzing and comparing groups of strings. It offers functions for researchers and data analysts to prepare strings from event sequences, extract common patterns from strings, and compare patterns between string vectors. The package also finds transition matrices and complexity of strings, determines clusters in a string vector, and examines the statistical difference between two groups of strings.


Introduction
In domains such as psychology and social science, participants' actions can be listed as sequences of events. These event sequences can be recorded as strings in which a single character represents an event or a state. For example, in an eye-tracking study, eye gazes on regions in the order "Question-Figure-Answer1-Answer2-Answer3-Answer4-Figure-Answer3" can be recorded as the string "QF1234F3" (also called a scanpath in eye-tracking studies). Another example is using the log files of an online learning system to generate event sequences from student actions. Processing and analyzing strings helps researchers explore the features of event sequences.
In R, a string is a sequence of characters. It is created using single or double quotes, and can contain zero or more characters. The R base package offers various functions for string operations, including character count, case conversion, pattern detection, substring extraction and replacement, and string split and concatenation. These methods are also provided by the packages stringr (Wickham, 2010, 2017), stringb (Meissner, 2016), and stringi (Gagolewski and Tartanus, 2017), which are built for the purpose of string manipulation. In addition, other packages, such as gsubfn (Grothendieck, 2014) and uniqtag (Jackman, 2015), focus on particular functionalities of handling strings.
Despite the number of packages for string manipulation, there are few packages designed for the analysis of strings, especially groups of character strings. A common method for string analysis in the existing packages is string comparison using distances between two strings. For example, the function adist() in the R utils package calculates a generalized Levenshtein distance between two strings, and the function stringdist() in package stringdist (van der Loo, 2014, 2016) provides options for computing ten different types of distances.
Broadly speaking, natural language text and DNA sequences are also strings. However, these two types of strings differ from simple character strings, which are continuous and generally short (at most several hundred characters in a string). Compared to simple character strings, DNA sequences are usually much longer, while natural language text may be segmented by spaces or punctuation. There is a relatively large collection of packages for processing and analyzing these two complex types of sequences (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html; https://www.bioconductor.org/). Other packages for manipulating and analyzing event or text sequences include TraMineR (Gabadinho et al., 2011, 2017) and informR (Marcum and Butts, 2015). On the other hand, there is no package built specifically for analyzing groups of typical character strings from a quantitative perspective. Therefore, a comprehensive R package will fill this gap by providing functions to deal with one or multiple groups of simple character strings, including quantitatively describing and statistically examining differences between groups of strings ("string groups" or "string vectors" are also used in this paper).
Package GrpString (Tang and Pienta, 2017) was developed for this purpose, and it emphasizes quantifying the features of a string group as well as the differences between two string groups. First, the package provides functions for users to convert raw event sequences to strings and then to "collapsed" strings. Next, there are functions that extract common patterns shared by the strings in a vector. The functions return the frequency of each pattern, as well as the number and starting position of each pattern in each string. When patterns from two string vectors are compared, featured patterns for each vector are listed to distinguish the two vectors of strings. The package also contains functions for finding the transition matrices of a single string or a group of strings and for calculating the complexity of strings based on transitions. A transition is a two-character sub-sequence within a string or group of strings. A transition matrix in this package is a two-dimensional matrix that lists the numbers of transitions. Cluster analysis is also included, with both hierarchical and k-means methods. Lastly, users can employ a permutation test to statistically compare and visualize the difference between two string vectors. Figure 1 shows the main functions in the current version (0.3.2) with a detailed description for each function.

Functions in GrpString
Converting event sequences to strings
Encoding each event name to its corresponding character is often a necessary step before string analysis. Automatic and simultaneous conversion of event names to string vectors can reduce the users' workload, especially when multiple sequences of event names need to be converted repeatedly. The GrpString package offers three functions to accomplish this task, based on a conversion key created by the user. The first function, EveS(), converts a sequence of event names to a single string. Here, the two input vectors, eve.names and labels, which must have the same length, form a conversion key. An element in labels can be a letter, a digit, or a special character.
> event.df <- data.frame(c("aoi_1", "aoi_2"), c("aoi_1", "aoi_3"), c("aoi_3", "aoi_5"))
> event.name.vec <- c("aoi_1", "aoi_2", "aoi_3", "aoi_4", "aoi_5")
> label.vec <- c("a", "b", "c", "d", "e")
> EveStr(event.df, event.name.vec, label.vec)
[1] "aac" "bce"
The R Journal Vol.10/1, July 2018 ISSN 2073-4859
The third function, EveString(), is a generalized version of EveStr(). It deals with the situation in which the user stores event names in a file (e.g., '.txt' or '.csv') whose rows may have different numbers of elements. It is generally not convenient to read such a file into a data frame. Thus, a '.txt' or '.csv' file of event names is used directly in the function to save the user's effort. The following command converts an array of event names to a vector containing 45 strings with different lengths. The conversion key is stored in data frame eventChar.df, in which the first column contains event names and the second column contains the characters to be converted. The object event1d holds the path of file 'eve1d.txt' (located in the user's R library in this example), which has 45 rows, each with a different number of event names. Note that it is common to use a file name (and its file path if the file is not in the current directory) directly instead of an object (like event1d) in the function, in which case the users should not forget the quotation marks.
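The conversion-key idea behind these functions can be illustrated with a few lines of base R. This is only a minimal sketch and not the package's implementation: the helper events_to_string() is a hypothetical name introduced here, and the lookup through a named vector is an assumption about one reasonable way to apply the key.

```r
# Conversion key: event names and their one-character labels (same length).
event.name.vec <- c("aoi_1", "aoi_2", "aoi_3", "aoi_4", "aoi_5")
label.vec <- c("a", "b", "c", "d", "e")
key <- setNames(label.vec, event.name.vec)

# Hypothetical helper (not part of GrpString): look up each event name
# in the key and concatenate the resulting characters into one string.
events_to_string <- function(events, key) {
  paste(key[as.character(events)], collapse = "")
}

# Each row of the data frame is one event sequence, as in the EveStr() example.
event.df <- data.frame(c("aoi_1", "aoi_2"), c("aoi_1", "aoi_3"), c("aoi_3", "aoi_5"))
strs <- unname(apply(event.df, 1, events_to_string, key = key))
print(strs)

# A single sequence of event names can be converted the same way.
single <- events_to_string(c("aoi_1", "aoi_3", "aoi_2"), key)
print(single)
```

With the data frame from the EveStr() example, the sketch yields the same two strings, "aac" and "bce".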

Detecting patterns
In GrpString, a pattern (also called a common pattern) is defined as a substring with a minimum length of three that occurs at least twice among a group of strings. For a single string of length m, the total number of substrings of length 3 or more is n = (m - 1) × (m - 2) / 2. Note that a string itself is also considered a substring. Using an exhaustive search of substrings in the string vector, this package provides two functions to detect and organize patterns. The simplified version, CommonPatt(), returns one data frame. Because there could be thousands of substrings in a string vector, the function uses a cutoff to display only the more frequent or important patterns. The cutoff is the minimum percentage of occurrence of patterns or substrings and is selected by the user. If the number of strings in a group is num, and the cutoff is selected as c, then only patterns with a minimum number of occurrences f_min = num × c% will be returned by function CommonPatt(). In the following example, num = 6 and c = 30. Thus, f_min = 6 × 30% = 1.8. Patterns with frequency (i.e., number of occurrences) ≥ 2 are displayed in column Freq_grp.
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
> CommonPatt(strs.vec, low = 30)
The exported data frame is sorted by the pattern length and then by the frequency or percentage of the pattern. Note that the result contains two sets of pattern frequencies and percentages; one is of the overall occurrence (Freq_grp and Percent_grp) in a string group and the other excludes duplicated occurrences in the same string (Freq_str and Percent_str). As a consequence, in some cases Percent_grp can be larger than 100%, while Percent_str will not exceed 100%.
The full version of this pair, CommonPattern(), offers more options. In addition to the lowest minimum cutoff as described for function CommonPatt(), the user can select the highest minimum cutoff and the interval between the two cutoffs. Furthermore, the user can choose to use a conversion key to convert patterns back to sequences of event names, which makes it easier to interpret the patterns. This function exports a set of '.txt' files in the current directory instead of a data frame. This is because the number of files and the numbers of rows in some files could be large, which might make it difficult for the user to view the results in R directly. The names of these '.txt' files consist of the name of the input string vector and the percentages resulting from the cutoffs and the interval in the function. Patterns that occur at least twice are exported in a separate '.txt' file with "_f2up" appended to the name of the input string vector. For example, the following command exports three files: 'strs.vec_30up.txt', 'strs.vec_50up.txt', and 'strs.vec_f2up.txt'. Each file contains a table that has the same columns as shown in the above result from function CommonPatt().
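The substring count and the cutoff arithmetic described for CommonPatt() can be checked with a short base-R sketch. It is an illustration under stated assumptions rather than the package's code: the helper all_substrings() is hypothetical, and the two frequency tables only mimic what the text says about Freq_grp (every occurrence counted) and Freq_str (at most one occurrence per string).

```r
# Hypothetical helper: enumerate every substring of length >= min_len.
all_substrings <- function(s, min_len = 3) {
  m <- nchar(s)
  if (m < min_len) return(character(0))
  unlist(lapply(min_len:m, function(len) {
    starts <- seq_len(m - len + 1)
    substring(s, starts, starts + len - 1)
  }))
}

strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")

# Count check for the first string: m = 12 gives (m - 1) * (m - 2) / 2 = 55.
m <- nchar(strs.vec[1])
n_sub <- length(all_substrings(strs.vec[1]))
n_formula <- (m - 1) * (m - 2) / 2

# Overall frequency (all occurrences) and per-string frequency
# (duplicates within the same string counted once).
subs_per_string <- lapply(strs.vec, all_substrings)
freq_grp <- table(unlist(subs_per_string))
freq_str <- table(unlist(lapply(subs_per_string, unique)))

# Cutoff of 30% with 6 strings: f_min = 1.8, so a frequency of 2 or more passes.
f_min <- length(strs.vec) * 30 / 100
common <- names(freq_grp)[freq_grp >= f_min]

abc_grp <- as.integer(freq_grp["ABC"])
abc_str <- as.integer(freq_str["ABC"])
```

In this sketch "ABC" occurs four times in the group but in only three different strings, which is why a group-level percentage can exceed 100% while a string-level percentage cannot.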

Pattern information and featured patterns
The first function in this pair, PatternInfo(), lists some basic information about patterns in each string. The information includes the length of each string and the starting position of each pattern in a string. If a pattern does not appear in a string, "-1" will be returned. If a pattern occurs at least twice in a string, the default position is that of the first occurrence, although there is an option for the users to choose the last occurrence of duplicated patterns in the same string.
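The position convention described above (starting position, or -1 when a pattern is absent) happens to match what base R's regexpr() returns, so it can be sketched without the package. This is an assumption-laden illustration, not the implementation of PatternInfo(); note that gregexpr() reports non-overlapping matches only, so the "last occurrence" here ignores overlapping repeats.

```r
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
patts <- c("ABC", "123")

# First occurrence of each pattern in each string; -1 means "not found".
first_pos <- sapply(patts, function(p) {
  as.integer(regexpr(p, strs.vec, fixed = TRUE))
})

# Last occurrence: the largest start among all non-overlapping matches.
last_pos <- sapply(patts, function(p) {
  sapply(gregexpr(p, strs.vec, fixed = TRUE), function(x) max(as.integer(x)))
})

# String lengths next to the first-occurrence positions.
info <- cbind(length = nchar(strs.vec), first_pos)
print(info)
```

For the first string, "ABC" starts at position 1 when the first occurrence is requested and at position 8 when the last occurrence is requested.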
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
> patts <- c("ABC", "123")
> PatternInfo(patts, strs.vec)
6 -1 2
The second function in this pair, FeaturedPatt(), is developed to distinguish the pattern characteristics of two string groups. It compares two groups of strings and discovers featured patterns in each of the groups. It also lists basic information about the featured patterns in each string. In this function, "featured" means that patterns in a resulting pattern group exist in only one of the two input-pattern groups. Note that "featured patterns" shared by strings in both string groups are allowed. This is because, in practice, it is difficult to find a pattern that is present in all the strings of one group and in none of the strings of the other. As a result, in this function, "featured patterns" are typically obtained from two pattern vectors, each of which contains patterns that are shared by a certain percentage of strings in a group.
The following simple example uses two groups (or vectors) of strings, s_grp1 and s_grp2, which are very similar to those in Figure 1. The two pattern vectors, p1 and p2, are obtained from s_grp1 and s_grp2, respectively, by applying function CommonPatt() with cutoff = 60%. Thus, p1 contains patterns that are shared by at least 60% of the strings in s_grp1 and p2 contains patterns that are shared by at least 60% of the strings in s_grp2. To apply FeaturedPatt(), the first two arguments are p1 and p2, which serve as the "input-pattern groups", and the last two arguments are the original string groups s_grp1 and s_grp2. The function exports five text files: 'uni_p1-p2.txt', 'p1-vs-p2_in_s_grp1.txt', 'p1-vs-p2_in_s_grp2.txt', 'p2-vs-p1_in_s_grp1.txt', and 'p2-vs-p1_in_s_grp2.txt'. Figure 2 shows the content of the first three files. The resulting featured patterns are listed in File 1 ('uni_p1_p2.txt'). It can be seen that "ABC" (featured pattern group 1) does not appear in p2 (input-pattern group 2), but can be found in s_grp2 (20% of string group 2); "xyz" (featured pattern group 2) does not appear in p1 (input-pattern group 1), but can be found in s_grp1 (40% of string group 1). In File 2 ('p1-vs-p2_in_s_grp1.txt') and File 3 ('p1-vs-p2_in_s_grp2.txt'), the four columns are (original) string group, length of string, number of featured patterns in string, and the starting position of the featured pattern in string. If there were n featured patterns instead of one, File 2 and File 3 would each have 3 + n columns; starting from the 4th column, each column lists the starting position of one featured pattern.
> p1 <- c("123", "ABC")
> p2 <- c("123", "xyz")
> FeaturedPatt(p1, p2, s_grp1, s_grp2)
Ideally, a featured pattern is present in 100% of the strings of one string group but in none of the strings of the other group. However, obtaining featured patterns based on cutoffs smaller than 100%, as described above, is still meaningful. If two groups of strings are different, then patterns in featured pattern vector 1 are likely to be present more frequently in string group 1 than in string group 2. For example, although the featured pattern "ABC" exists in both string groups, it can be found in 60% of the strings in string group 1, but in only 20% of the strings in string group 2 (Figure 1 and Figure 2). Note that Figure 1 and Figure 2 only show a simplified example. In reality, each string may contain multiple featured patterns. Therefore, featured patterns, including their numbers and positions in a string (Figure 2), can in turn be used to categorize the string, i.e., to classify or predict to which group the string belongs.
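The notion of a featured pattern can be sketched as a set difference between the two input-pattern vectors, followed by a count of how many strings in each group contain the pattern. The string groups below are hypothetical stand-ins (the actual s_grp1 and s_grp2 of Figure 1 are not reproduced here), chosen only so that the percentages quoted above come out the same; the helper share() is likewise a hypothetical name.

```r
# Hypothetical string groups, constructed for illustration only.
s_grp1 <- c("ABC123", "xyzABC", "123ABCd", "xyz123", "a123b")
s_grp2 <- c("xyz123", "123xyz", "ABCxyz1", "1234", "xyz12a")

# Input-pattern groups, as in the FeaturedPatt() example.
p1 <- c("123", "ABC")
p2 <- c("123", "xyz")

# A featured pattern belongs to exactly one of the two input-pattern groups.
feat1 <- setdiff(p1, p2)
feat2 <- setdiff(p2, p1)

# Hypothetical helper: percentage of strings that contain a pattern.
share <- function(patt, strs) {
  100 * sum(grepl(patt, strs, fixed = TRUE)) / length(strs)
}

abc_in_grp1 <- share(feat1, s_grp1)
abc_in_grp2 <- share(feat1, s_grp2)
xyz_in_grp1 <- share(feat2, s_grp1)
xyz_in_grp2 <- share(feat2, s_grp2)
```

Here "ABC" is featured for group 1 even though one string in group 2 contains it, which mirrors the point that featured patterns need not be exclusive to a single string group.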

Transition matrix and information
In addition to patterns, transitions are also often reported in string analysis. For example, researchers in eye-tracking studies may be interested in the gaze transitions within a specific location or between two different locations. A transition is a substring of length 2. For a single string of length m, the total number of transitions is n = m - 1. By this definition, transitions do not apply to strings with length < 2. The first function in this pair, TransInfo(), returns the numbers of two types of transitions (letter and digit by default) in a group of strings. Letters and digits are used as defaults because they are common components in strings and can represent two different types of events. The function also reports the number of transitions that do not belong to either of the two types. For more flexibility, the users can define any two types of transitions (see the following example).
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
> TransInfo(strs.vec)
  transition_name transition_number
1           type1                24
The second function in this pair, TransMx(), returns grand transition matrices for a group of strings. The first matrix contains the numbers of transitions between two characters in a string vector. Normalized transition numbers are stored in the second matrix. A data frame that contains transitions sorted by frequency is also returned. Moreover, this function provides an option to export a transition matrix for each individual string into the current directory. The following shows the usage and results of TransMx(). The results are stored in a list. Only one of the components in the list, Transition_Matrix, is shown. The matrix of normalized transitions Transition_Normalized_Matrix and the data frame of the sorted transitions Transition_Organized are not shown. As an example, the readers can understand the transition matrix $Transition_Matrix by looking at the transition "12". In vector strs.vec, the transition "12" occurs three times (in the 2nd, 3rd and 6th strings). Thus, the value is 3 in the cell of row 1, column 2 of Transition_Matrix, which is the total number of the transition "From 1 To 2".
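The transition counts discussed above can be reproduced with base R. The sketch below is not the code of TransInfo() or TransMx(); the helper transitions() is a hypothetical name, and treating "letter" as [A-Za-z] and "digit" as [0-9] is an assumption about the default types.

```r
# Hypothetical helper: all two-character transitions of one string.
transitions <- function(s) {
  m <- nchar(s)
  if (m < 2) return(character(0))   # no transitions in strings shorter than 2
  substring(s, 1:(m - 1), 2:m)
}

strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
all_trans <- unlist(lapply(strs.vec, transitions))

# Two transition types (letter-letter and digit-digit) plus everything else.
n_letter <- sum(grepl("^[A-Za-z]{2}$", all_trans))
n_digit  <- sum(grepl("^[0-9]{2}$", all_trans))
n_other  <- length(all_trans) - n_letter - n_digit

# Grand transition matrix: rows are "from" characters, columns are "to".
from <- substr(all_trans, 1, 1)
to   <- substr(all_trans, 2, 2)
chars <- sort(unique(c(from, to)))
trans_mx <- table(From = factor(from, levels = chars),
                  To   = factor(to, levels = chars))

# Normalized version: each count divided by the total number of transitions.
trans_norm <- trans_mx / sum(trans_mx)

n_12 <- trans_mx["1", "2"]
```

Under these assumptions the sketch finds 24 letter transitions, in agreement with the type1 row shown in the TransInfo() output, and three occurrences of the transition "12".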

Transition entropy
Transition entropy measures the diversity of the transitions in a string or a group of strings. It can be used as an estimate of string complexity. Larger entropy reflects more evenly distributed transitions and smaller entropy reflects a more biased distribution of transitions. Entropy in the current version of package GrpString is calculated using the Shannon entropy formula (Shannon, 1948):
$$H = -\sum_{i} \mathrm{freqs}_i \log_2 \mathrm{freqs}_i$$
Here, freqs_i is the ith frequency of a transition in a string or a string vector. These are normalized transition numbers, which can be obtained from the normalized transition matrix exported by function TransMx() in this package. The formula is equivalent to the function entropy.empirical() in the entropy package (Hausser and Strimmer, 2014) when setting unit = "log2". Strings with length < 2 are not counted when calculating entropies because transitions do not apply to those strings.
One function in this pair, TransEntro(), computes the overall transition entropy for a group of strings. The other function, TransEntropy(), returns the transition entropy for each of the strings in a group. Note that the third string ("A") is skipped in the result because the length of this string is 1. This function provides another way to quantitatively and/or statistically compare two groups of strings. For example, with the entropy values obtained from TransEntropy(), the users can conduct a t-test to examine whether the difference in complexity of two string groups is statistically significant.
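A base-2 Shannon entropy over transition frequencies can be sketched in a few lines of base R. The code is an illustration, not the implementation of TransEntro() or TransEntropy(): the helpers and the three example strings are hypothetical, and pooling all transitions of the group for the overall entropy is an assumption.

```r
# Hypothetical helper: all two-character transitions of one string.
transitions <- function(s) {
  m <- nchar(s)
  if (m < 2) return(character(0))
  substring(s, 1:(m - 1), 2:m)
}

# Shannon entropy (base 2) of the relative transition frequencies.
shannon_entropy <- function(trans) {
  freqs <- as.numeric(table(trans)) / length(trans)
  -sum(freqs * log2(freqs))
}

# Hypothetical group; the third string is too short to have a transition.
strs <- c("ABAB", "ABCD", "A")

# Overall entropy: transitions pooled over the whole group (assumption).
overall <- shannon_entropy(unlist(lapply(strs, transitions)))

# Per-string entropy: strings with fewer than 2 characters are skipped.
keep <- nchar(strs) >= 2
per_string <- sapply(strs[keep], function(s) shannon_entropy(transitions(s)))
print(per_string)
```

The string "ABCD" has three distinct, equally frequent transitions and therefore the larger entropy, whereas "ABAB" repeats the transition "AB" and scores lower.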
The p value of the permutation test is: 0.41000
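A permutation test of the kind summarized in the caption of Figure 3 can be sketched with utils::adist(). Several details are assumptions made for this illustration and may differ from StrDif(): distances are normalized by the length of the longer string, the p-value is the one-sided proportion of permuted differences at least as large as the observed one, and the two string groups are hypothetical.

```r
# Normalized Levenshtein distance matrix (assumption: divide by the longer
# of the two string lengths; strings are assumed to be non-empty).
norm_lev <- function(strs) {
  len <- nchar(strs)
  utils::adist(strs) / outer(len, len, pmax)
}

# Difference between mean between-group and mean within-group distances.
diff_stat <- function(dmat, grp) {
  same <- outer(grp, grp, "==")
  upper <- upper.tri(dmat)          # count each pair of strings once
  mean(dmat[upper & !same]) - mean(dmat[upper & same])
}

# Hypothetical string groups.
grp1 <- c("ABCD", "ABCE", "ABDD", "ABCDD")
grp2 <- c("1234", "1235", "1244", "12344")
strs <- c(grp1, grp2)
grp <- rep(1:2, times = c(length(grp1), length(grp2)))

dmat <- norm_lev(strs)
obs <- diff_stat(dmat, grp)

# Random subset of permutations: shuffle the group labels and recompute.
set.seed(1)
perm <- replicate(1000, diff_stat(dmat, sample(grp)))
p_value <- mean(perm >= obs)
```

Because only a random subset of permutations is drawn, the p-value changes slightly between runs unless a seed is set, which is consistent with the note in the Figure 3 caption.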

Cluster analysis
The pair of functions, StrHclust() and StrKclust(), perform string clustering based on Levenshtein distance matrices. Function StrHclust() utilizes hierarchical clustering and exports a hierarchical dendrogram (Figure 4), which may suggest a number of clusters. When the user selects a number of clusters, the function assigns each string to a corresponding cluster.
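Hierarchical clustering on a Levenshtein distance matrix can be sketched with functions from the utils and stats packages. This is not the code of StrHclust(): the example strings are hypothetical, and the default "complete" linkage of hclust() is an assumption, since the linkage method is not stated here.

```r
# Hypothetical strings forming two visibly different groups.
strs <- c("ABCD", "ABCE", "ABDD", "1234", "1235", "1244")

# Pairwise Levenshtein distances, converted to a "dist" object.
lev_dist <- as.dist(utils::adist(strs))

# Hierarchical clustering; plot(hc, labels = strs) would draw the dendrogram.
hc <- stats::hclust(lev_dist)

# Cut the tree into the number of clusters chosen by the user.
clusters <- stats::cutree(hc, k = 2)
result <- data.frame(string = strs, cluster = clusters)
print(result)
```

In this toy example the letter strings and the digit strings end up in separate clusters, because every between-group distance (4) is larger than any within-group distance (at most 2).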

Summary
This article describes the GrpString package for analyzing and comparing groups of strings. Most functions in the package GrpString were initially developed for analyzing groups of scanpaths (i.e., sequences of eye gazes) in eye-tracking studies. Nevertheless, as described above, this R package can be applied to the analysis of any type of character strings, and the strings do not have to be associated with any event or state data. It should be noted that using a single function alone may not be sufficient to draw a conclusion when answering a research question. An example is StrDif(). Various factors, such as string lengths, can affect the p-value of a permutation test for string comparison. Therefore, when the users claim a statistically significant or non-significant difference between two string groups, the results from CommonPattern() and/or TransEntropy() may be used to support the conclusion. One limitation of the current version of GrpString is that it only provides basic and common options in some functions. For instance, only Levenshtein distance is available when distances between strings are computed (StrDif(), StrHclust() and StrKclust()); updated versions should include an argument for different types of distance. Finally, there are many other advanced analytical methods that have not been included in this package, such as determining the centroid or median string in a string vector (de la Higuera and Casacuberta, 2000; Martínez-Hinarejos et al., 2000) or using Markov chains for string modeling and classification (Krejtz et al., 2014). We plan to build more functions into GrpString in the future and welcome feedback from the users to improve this package.

Figure 1: Flow chart of the main functions in the package GrpString. There are alternatives for some functions in the flow chart. For example, EveStr() can be replaced with EveString(), CommonPatt() can be replaced with CommonPattern(), and StrHclust() and StrKclust() are interchangeable. The cutoff 60% is chosen as an example showing how to obtain common patterns and then featured patterns. In practice, the users may select different cutoffs when using CommonPatt() or CommonPattern().

Figure 3: Histogram from function StrDif(): distribution of the differences in the average normalized Levenshtein distance between between-group and within-group strings. The original difference ("Observed difference") d* is marked with a blue line, with the p-value labeled beside it. Because a subset of all the possible permutations is selected randomly, the actual p-value may be slightly different each time the user runs the function.

Figure 4: Dendrogram from function StrHclust().
Function StrKclust() utilizes k-means clustering and assigns strings to their clusters based on the number of clusters the user chooses. In addition, a cluster plot is produced to visualize the clusters.