The R package GrpString was developed as a comprehensive toolkit for quantitatively analyzing and comparing groups of strings. It offers functions for researchers and data analysts to prepare strings from event sequences, extract common patterns from strings, and compare patterns between string vectors. The package also finds transition matrices and complexity of strings, determines clusters in a string vector, and examines the statistical difference between two groups of strings.
In domains such as psychology and social science, participants’ actions
can be listed as sequences of events. These event sequences can be
recorded as strings in which a single character represents an event or a
state. For example, in an eye-tracking study, the eye gazes on some
regions in the order of
"Question-Figure-Answer1-Answer2-Answer3-Answer4-Figure-Answer3" can
be recorded as a string "QF1234F3" (also called a scanpath in
eye-tracking studies). Another example is using the log files of an
online learning system to generate event sequences from student actions.
Processing and analyzing strings helps researchers explore the features
of event sequences. In R, a string is a sequence of characters. It is
generated using single or double quotes, and can contain zero or more
characters. The R base package offers various functions for string
operations including character count, case conversion, pattern
detection, substring extraction and replacement, and string split and
concatenation. These methods are also provided by packages
stringr
(Wickham 2010, 2017),
stringb
(Meissner 2016), and
stringi
(Gagolewski and Tartanus 2017), which are built for the purpose of string
manipulation. In addition, other packages, such as
gsubfn
(Grothendieck 2014) and
uniqtag
(Jackman 2015), focus on particular functionalities of handling
strings. Despite a number of packages for string manipulation, there are
few packages designed for the analysis of strings, especially groups of
character strings. A common method for string analysis in the existing
packages is string comparison by using distances between two strings.
For example, the function adist()
in the R utils package
calculates a generalized Levenshtein distance between two strings, and
the function stringdist()
in package
stringdist
(van der Loo 2014, 2016) provides options of computing ten
different types of distances.
Broadly speaking, the text of natural language and DNA sequences are
also strings. However, these two types of strings differ from simple
character strings, which are continuous and generally short (at most
several hundred characters in a string). Compared to simple character
strings, DNA sequences usually are much longer while natural language
text may be segmented by spaces or punctuation. There has been a
relatively large collection of packages for processing and analyzing
these two complex types of sequences
(https://cran.r-project.org/web/views/NaturalLanguageProcessing.html;
https://www.bioconductor.org/). Other packages for manipulating and
analyzing event or text sequences include
TraMineR
(Gabadinho et al. 2011, 2017) and
informR
(Marcum and Butts 2015). On the other hand, there is no package built specifically
for analyzing groups of typical character strings from the quantitative
perspective. Therefore, a comprehensive R package will fill this gap by
providing functions to deal with one or multiple groups of simple
character strings, including quantitatively describing and statistically
examining differences between groups of strings ("string groups" or
"string vectors" are also used in this paper).
Package GrpString
(Tang and Pienta 2017) is developed with this purpose, and it emphasizes
quantifying the features of a string group as well as the differences
between two string groups. First, the package provides functions for the
users to convert raw event sequences to strings and then to “collapsed”
strings. Next, there are functions that extract common patterns shared
by the strings in a vector. The functions return the frequency of each
pattern, as well as the number and starting position of each pattern in
each string. When patterns from two string vectors are compared,
featured patterns for each vector are listed to distinguish the two
vectors of strings. The package also contains functions for finding the
transition matrices of a single string or a group of strings and for
calculating the complexity of strings based on transitions. A transition
is a two-character sub-sequence within a string or group of strings. A
transition matrix in this package is a two-dimensional matrix that lists
the numbers of transitions. Cluster analysis is also included with both
hierarchical and k-means methods. Lastly, users can employ a permutation
test to statistically compare and visualize the difference between two
string vectors. Figure 1 shows the main functions in
the current version (0.3.2) with a detailed description for each
function.
Encoding each event name to its corresponding character is often a
necessary step before string analysis. Automatic and simultaneous
conversion of event names to string vectors can reduce the users’
workload, especially when multiple sequences of event names need to be
converted repeatedly. The GrpString package offers three functions
to accomplish this task, based on a conversion key created by the user.
The first function, EveS()
, converts a sequence of event names to a
single string. Here, the two input vectors, eve.names
and labels
,
which must have the same length, form a conversion key. An element in
labels
can be a letter, digit, or a special character.
> event.vec <- c("aoi_1", "aoi_2", "aoi_3", "aoi_2", "aoi_1")
> eve.names <- c("aoi_1", "aoi_2", "aoi_3")
> labels <- c("a", "b", "c")
> EveS(event.vec, eve.names, labels)
[1] "abcba"
The second function, EveStr()
, is applicable when event names are
stored in a data frame in which each row has the same number of event
names, and hence each converted string has the same length in the
exported vector. In this example, event.name.vec
and label.vec
form
a conversion key.
> event.df <- data.frame(c("aoi_1", "aoi_2"),
c("aoi_1", "aoi_3"),
c("aoi_3", "aoi_5"))
> event.name.vec <- c("aoi_1", "aoi_2", "aoi_3", "aoi_4", "aoi_5")
> label.vec <- c("a", "b", "c", "d", "e")
> EveStr(event.df, event.name.vec, label.vec)
[1] "aac" "bce"
The third function, EveString()
, is a generalized version of
EveStr()
. It deals with the situation when the user stores event names
in a file (e.g., .txt
or .csv
) in which different rows may have
different numbers of elements. It is generally not convenient to read
such a file into a data frame. Thus, a .txt
or .csv
file for event
names is used directly in the function to save the user’s effort. The
following command converts an array of event names to a vector
containing 45 strings with different lengths. The conversion key is
stored in data frame eventChar.df
, in which the first column contains
event names and the second column contains the characters to be
converted. The object event1d holds the directory of file eve1d.txt
(located in the user’s R library in this example), which has 45 rows,
each with different numbers of event names. Note that it is common to
use a file name (and its file path if the file is not in the current
directory) directly instead of an object (like event1d
) in the
function, and the users should not forget the quote sign.
> data(eventChar.df)
> event1d <- paste(path.package("GrpString"), "/extdata/eve1d.txt", sep = "")
> EveString(event1d, eventChar.df$event, eventChar.df$char)
[1] "D02F0E20DEDC0C30BDC0E45G050A0B5050A06BG0BA5607BA"
[2] "A1ABC21EF0230E03G032C30CBABGBA5G5A7G"
[3] "B1G1GEG10CEF2BAC3DEBA404B5G6F6A"
…
[45] "DCB010A0Q1EF21F0FGF0G0GF0B0BC20D0D0303DF04030CF45050B0CBCB05607B0"
After the event-string conversion, a string may contain consecutive
repeating characters, but sometimes the users are only interested in
"collapsed" strings in which successively duplicated characters are
removed. Function DupRm()
allows users to obtain collapsed strings.
> dup1 <- "000<<<<<DDDFFF333333qqqqqKKKKK33FFF"
> dup3 <- "aaBB111^^~~~555667777!!!###$$$$$$&&&(((((***)))))@@@@@>>>>99"
> dup13 <- c(dup1, dup3)
> DupRm(dup13)
[1] "0<DF3qK3F" "aB1^~5670!#$&(*)@>9"
In GrpString, a pattern (also called a common pattern) is defined
as a substring with a minimum length of three that occurs at least twice
among a group of strings. For a single string of length m, the total
number of substrings of length 3 or more is n = (m - 1) × (m - 2) / 2.
Note that a string itself is also considered as a substring. By using
exhaustive search of substrings in the string vector, this package
provides two functions to detect and organize patterns. The simplified
version, CommonPatt()
, returns one data frame. Because there could be
thousands of substrings in a string vector, the function utilizes a
cutoff to display more frequent or important patterns. The cutoff is the
minimum percentage of the occurrence of patterns or substrings and is
selected by the user. If the number of strings in a group is CommonPatt()
. In the following example, Freq_grp
.
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC",
"3123fe")
> CommonPatt(strs.vec, low = 30)
Pattern Freq_grp Percent_grp Length Freq_str Percent_str
ABCD 3 50.00% 4 2 33.33%
ABC 4 66.67% 3 3 50.00%
123 3 50.00% 3 3 50.00%
BCD 3 50.00% 3 2 33.33%
def 2 33.33% 3 2 33.33%
The exported data frame is sorted by the pattern length and then by the
frequency or percentage of the pattern. Note that the result contains
two sets of pattern frequencies and percentages; one is of the overall
occurrence (Freq_grp
and Percent_grp
) in a string group and the
other excludes duplicated occurrences in the same string (Freq_str
and
Percent_str
). As a consequence, in some cases Percent_grp can be
larger than 100%, while Percent_str will not exceed 100%. The full
version of this pair, CommonPattern()
, offers more options. In
addition to the lowest minimum cutoff as described in function
CommonPatt()
, the user can select the highest minimum cutoff and the
interval between the two cutoffs. Furthermore, the user can choose using
a conversion key to convert patterns back to sequences of event names,
which makes it easier to interpret the patterns. This function exports a
set of .txt
files in the current directory instead of a data
frame.This is because the number of files and the numbers of rows in
some files could be large, which might be difficult for the user to view
the results in R directly. The names of these .txt
files consist of
the name of the input string vector and the percentages resulted from
the cutoffs and the interval in the function. Patterns that occur at
least twice are exported in a separate .txt
file with "_f2up"
appended to the name of the input string vector. For example, the
following command exports three files: strs.vec_30up.txt
,
strs.vec_50up.txt
, and strs.vec_f2up.txt
. Each file contains a table
that has the same columns as shown in the above result from function
CommonPatt()
.
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC",
"3123fe")
> CommonPattern(strs.vec, low = 30, high = 50, interval = 20)
The first function in this pair, PatternInfo()
, lists some basic
information about patterns in each string. The information includes the
length of each string and the starting position of each pattern in a
string. If a pattern does not appear in a string, "-1"
will be
returned. If a pattern occurs at least twice in a string the default
position is for the first occurrence, although there is an option for
the users to choose the last occurrence of duplicated patterns in the
same string.
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC",
"3123fe")
> patts <- c("ABC", "123")
> PatternInfo(patts, strs.vec)
length ABC 123
ABCDdefABCDa 12 1 -1
def123DC 8 -1 4
123aABCD 8 5 1
ACD13 5 -1 -1
AC1ABC 6 4 -1
3123fe 6 -1 2
The second function in this pair, FeaturedPatt()
, is developed to
distinguish the pattern characteristics of two string groups. It
compares two groups of strings and discovers featured patterns in each
of the groups. It also lists basic information about the featured
patterns in each string. In this function, “featured” means patterns in
a resulting pattern group can only exist in one of the two input-pattern
groups. Note that “featured patterns” shared by strings in both string
groups are allowed. This is because, in practice, it is difficult to
find a pattern that is exclusively present in all the strings within
only one group. As a result, in this function, “featured patterns” are
probably obtained from two pattern vectors, each of which contains
patterns that are shared by a certain percentage of strings in a group.
The following simple example uses two groups (or vectors) of strings,
s_grp1
and s_grp2
, which are very similar to those in Figure
1. The two pattern vectors, p1
and p2
, are
obtained from s_grp1
and s_grp2
, respectively by applying function
CommonPatt()
with cutoff = 60%. Thus, p1
contains patterns that are
shared by at least 60% of the strings in s_grp1
and p2
contains
patterns that are shared by at least 60% of the strings in s_grp2
. To
apply FeaturedPatt()
, the first two arguments are p1
and p2
, which
serve as the “input-pattern groups”, and the last two arguments are the
original string groups s_grp1
and s_grp2
. The function exports five
text files: uni_p1-p2.txt
, p1-vs-p2_in_s_grp1.txt
,
p1-vs-p2_in_s_grp2.txt
, p2-vs-p1_in_s_grp1.txt
, and
p2-vs-p1_in_s_grp2.txt
. Figure 2 shows the
content of the first three files. The resulting featured patterns are
listed in File 1 (uni_p1_p2.txt
). It can be seen that “ABC” (featured
pattern group 1) does not appear in p2
(input-pattern group 2), but
can be found in s_grp2
(20% of string group 2); “xyz” (featured
pattern group 2) does not appear in p1
(input-pattern group 1), but
can be found in s_grp1
(40% of string group 1). In File 2
(p1-vs-p2_in_s_grp1.txt
) and File 3 (p1-vs-p2_in_s_grp2.txt
), the
four columns are (original) string group, length of string, number of
featured patterns in string, and the starting position of featured
pattern in string. If p1
had
> s_grp1 <- c("ABC123", "123ABC", "ABCx123", "123xyz", "xyz123")
> s_grp2 <- c("xyz123", "xyzA123", "123xyz", "ABC123", "x123")
> p1 <- c("123", "ABC")
> p2 <- c("123", "xyz")
> FeaturedPatt(p1, p2, s_grp1, s_grp2)
Ideally, a featured pattern presents in 100% of the strings of one
string group but in none of the strings of the other group. However,
obtaining featured patterns based on cutoffs smaller than 100% as
described above still has significance. If two groups of strings are
different, then patterns in featured pattern vector 1 are likely to
present more frequently in string group 1 than in string group 2. For
example, although the featured pattern "ABC"
exists in both string
groups, it can be found in 60% of the strings in String Group 1, but in
only 20% in string group 2 (Figure 1 and Figure
2). Note that Figure 1 and
Figure 2 only show a simplified example. In
reality, each string may contain multiple featured patterns. Therefore,
featured patterns, including their numbers and positions in a string
(Figure 2), in turn can be used to categorize the
string, i.e., to classify or predict to which group the string belongs.
In addition to patterns, transitions are also often reported in string
analysis. For example, researchers in eye-tracking studies may be
interested in the gaze transitions within a specific location or between
two different locations. A transition is a substring with length of 2.
For a single string of length m, the total number of transitions n =
m - 1. According to the definition of transition, transitions are not
applied to strings with length < 2. The first function in this pair,
TransInfo()
, returns the numbers of two types of transitions —
letter and digit by default — in a group of strings. Letters and
digits are used as default because they are common components in strings
and can represent two different types of events. The function also
reports the number of transitions that do not belong to either of the
two types. To be more flexible, the users can define any two types of
transitions (see the following example).
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC",
"3123fe")
> TransInfo(strs.vec)
transition_name transition_number
1 type1 24
2 type2 8
3 mixed 7
The second function in this pair, TransMx()
, returns grand transition
matrices in a group of strings. The first matrix contains the numbers of
transitions between two characters in a string vector. Normalized
transition numbers are stored in the second matrix. A data frame that
contains transitions sorted by frequency is also returned. Moreover,
this function provides an option to export a transition matrix for each
individual string into the current directory. The following shows the
usage and results of TransMx()
. The results are stored in a list. Only
one of the components in the list, Transition_Matrix
, is shown. The
matrix of normalized transitions Transition_Normalized_Matrix
and the
data frame of the sorted transitions Transition_Organized
are not
shown. As an example, the readers can understand transition matrix
$Transition_Matrix by looking at the transition "12". In vector
strs.vec
, the transition "12" occurs three times (in the 2nd, 3rd
and 6th strings). Thus, the value is 3 in the cell of row 1, column 2 of
Transition_Matrix
, which is the total number of the transition "From
1 To 2".
> strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC",
"3123fe")
> TransMx(strs.vec)
$Transition_Matrix
From/To 1 2 3 a A B C d D e f
1 1 0 3 1 0 1 0 0 0 0 0 0
2 2 0 0 3 0 0 0 0 0 0 0 0
3 3 1 0 0 1 0 0 0 0 1 0 1
4 a 0 0 0 0 1 0 0 0 0 0 0
5 A 0 0 0 0 0 4 2 0 0 0 0
6 B 0 0 0 0 0 0 4 0 0 0 0
7 C 1 0 0 0 0 0 0 0 4 0 0
8 d 0 0 0 0 0 0 0 0 0 2 0
9 D 1 0 0 1 0 0 1 1 0 0 0
10 e 0 0 0 0 0 0 0 0 0 0 2
11 f 1 0 0 0 1 0 0 0 0 1 0
Transition entropy measures the diversity of the transitions in a string or a group of strings. It can be used as an estimate of string complexity. Larger entropy reflects more evenly distributed transitions and smaller entropy reflects more biased distribution of transitions. Entropy in the current version of package GrpString is calculated using the Shannon entropy formula (Shannon 1948):
Here, TransMx()
in this package. The formula is equivalent to the function
entropy.empirical()
in the
entropy package
(Hausser and Strimmer 2014) when setting
One function in this pair, TransEntro()
, computes the overall
transition entropy for a group of strings. The other function,
TransEntropy()
, returns the transition entropy for each of the strings
in a group. Note that the third string ("A")
is skipped in the result
because the length of this string is 1. This function provides another
way to quantitatively and/or statistically compare two groups of
strings. For example, with the entropy values obtained from
TransEntropy()
, the users can conduct a
> stra.vec <- c("ABCDdefABCDa", "def123DC", "A", "123aABCD", "ACD13", "AC1ABC",
"3123fe")
> TransEntro(stra.vec)
[1] 4.272331
> TransEntropy(stra.vec)
String Entropy
1 1 2.913977
2 2 2.807355
3 4 2.807355
4 5 2.000000
5 6 2.321928
6 7 2.321928
StrDif()
employs a permutation test to statistically compare the
difference between two groups of strings based on normalized Levenshtein
distances (LDs). StrDif()
is the main function, and HistDif()
is the
auxiliary function that optimizes the histogram generated by the former.
A Levenshtein distance between two strings is the minimum number of
operations to transform one string into the other by inserting,
deleting, or replacing characters. LDs in StrDif()
are calculated
directly using R’s adist()
function in the package utils. A
normalized LD between two strings is the LD value divided by the length
of the longer string. For two groups of strings, there are two types of
LDs. One is within-group LD that is computed by comparing each string
with all the others in the same group; the other is between-group LD
that is computed by comparing each string in one group with all strings
in the other group.
When comparing two groups of strings statistically, the null hypothesis
is that the average normalized between-group LD is equal to the average
normalized within-group LD. Under this null hypothesis, the difference
(
The
In practice, NP can be very large, which will make the running time of
computing all StrDif()
allows the users to choose the
number of permutations. The default value is 1000, which has been proved
to be a reasonable number (Tang et al. 2012).
StrDif()
prints StrDif()
provides options for the users to define these two positions
in order to obtain the graph with optimal appearance. However, since the
users usually do not know the ideal positions in advance, they may have
to re-run the function at least once to adjust the positions. To avoid
rerunning StrDif()
, which can take a long time, HistDif()
allows the
users to adjust the positions by directly using the vector containing
all StrDif()
. This is another reason we
suggest that the users save StrDif()
.
> strs1.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
> strs2.vec <- c("xYZdkfAxDa", "ef1563xy", "BC9Dzy35X", "AkeC1fxz", "65CyAdC",
"Dfy3f69k")
> ld.dif.vec <- StrDif(strs1.vec, strs2.vec, num_perm = 500, p.x = 0.025)
For the initial two groups of strings,
the average normalized between-group Levenshtein Distance is: 0.85056
the average normalized within-group Levenshtein Distance is: 0.84306
the difference in the average normalized Levenshtein Distance between
between-group and within-group is: 0.00751.
The p value of the permutation test is: 0.41000
The pair of functions, StrHclust()
and StrKclust()
, perform string
clustering based on Levenshtein distance matrices. Function
StrHclust()
utilizes hierarchical clustering and exports a
hierarchical dendrogram (Figure 4), which may
suggest a number of clusters. When the user selects a number of
clusters, the function assigns each string to a corresponding cluster.
> strs3.vec <- c("ABCDdefABCDa", "AC3aABCD", "ACD1AB3", "xYZfgAxZY", "gf56xZYx",
"AkfxzYZg")
> StrHclust(strs3.vec)
Function StrKclust()
utilizes k-means clustering and assigns strings
to their clusters based on the number of clusters the user chooses. In
addition, a cluster plot is produced to visualize the clusters (Figure
5). In the plot, each labelled point represents a
string in the input vector. Points are clustered by ellipses, which are
also labelled. Note that by default, the ellipses are not shaded; but
the users can shade the ellipses when setting
> strs3.vec <- c("ABCDdefABCDa", "AC3aABCD", "ACD1AB3", "xYZfgAxZY", "gf56xZYx",
"AkfxzYZg")
> StrKclust(strs3.vec)
Cluster Strings
1 1 ABCDdefABCDa
2 1 AC3aABCD
3 1 ACD1AB3
4 2 xYZfgAxZY
5 2 gf56xZYx
6 2 AkfxzYZg
This article describes the GrpString package for analyzing and
comparing groups of strings. Most functions in the package
GrpString were initially developed for analyzing groups of
scanpaths (i.e., sequences of eye gazes) in eye-tracking studies.
Nevertheless, as described above, this R package can be applied in
analysis of any type of character strings, and the strings do not have
to be associated with any event or state data. It should be noted that
the sole usage of one function may not be sufficient to draw conclusion
when answering a research question. An example is StrDif()
. There
could be various factors, such as string lengths, that can affect the
CommonPattern()
and/or TransEntropy()
may be used to support the conclusion. One
limitation of the current version of GrpString is that it only
provides basic and common options in some functions. For instance, only
Levenshtein distance is available when distances between strings are
computed (StrDif()
, StrHclust()
and StrKclust()
); the updated
versions should include an argument for different types of distance.
Finally, there are many other advanced analytical methods that have not
been included in this package, such as determining the centroid or
median string in a string vector (de la Higuera and Casacuberta 2000; Martínez-Hinarejos et al. 2000) or using
Markov chains for string modeling and classification (Krejtz et al. 2014). We
plan to build more functions into GrpString in the future and
welcome feedback from the users to improve this package.
This work was supported by a Startup Grant from the Office of Research
at the University of Georgia (No. 1026AR168004) and the Journal of
Chemical Education Grants. We thank Dr. Rhonda DeCook in the Department
of Statistics and Actuarial Science at the University of Iowa, who wrote
the first version of function StrDif()
.
stringr, stringb, stringi, gsubfn, uniqtag, stringdist, TraMineR, informR, GrpString, entropy
NaturalLanguageProcessing, OfficialStatistics, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Tang, et al., "GrpString: An R Package for Analysis of Groups of Strings", The R Journal, 2018
BibTeX citation
@article{RJ-2018-002, author = {Tang, Hui and Day, Elizabeth L. and Atkinson, Molly B. and Pienta, Norbert J.}, title = {GrpString: An R Package for Analysis of Groups of Strings}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {359-369} }