Stylometry with R: A Package for Computational Text Analysis

This software paper describes ‘Stylometry with R’ (stylo), a flexible R package for the high-level analysis of writing style in stylometry. Stylometry (computational stylistics) is concerned with the quantitative study of writing style, e.g. authorship verification, an application with considerable potential in forensic contexts as well as in historical research. In this paper we introduce the possibilities of stylo for computational text analysis via a number of dummy case studies from English and French literature. We demonstrate how the package is particularly useful in the exploratory statistical analysis of texts, e.g. with respect to authorial writing style. Because stylo provides an attractive graphical user interface for high-level exploratory analyses, it is especially suited to an audience of novices without programming skills (e.g. from the Digital Humanities). More experienced users can benefit from our implementation of a series of standard pipelines for text processing, as well as a number of similarity metrics.

Maciej Eder (Institute of Polish Language, Polish Academy of Sciences), Jan Rybicki (Institute of English Studies, Jagiellonian University), Mike Kestemont (Department of Literature, University of Antwerp)
2015-12-22

1 Introduction

Authorship is a topic which continues to attract considerable attention from the larger public. This claim is well illustrated by a number of high-profile case studies that have recently made headlines across the popular media, such as the attribution of a pseudonymously published work to acclaimed Harry Potter novelist J. K. Rowling (Juola 2013), or the debate surrounding the publication of Harper Lee’s original version of To Kill a Mockingbird and the dominant role which her editor might have played therein (Gamerman 2015). The authorship of texts clearly matters to readers across the globe (Love 2002), and it therefore does not come as a surprise that computational authorship attribution increasingly attracts scientific attention because of its valuable real-world applications, for instance in forensic topics such as plagiarism detection, unmasking the author of harassment messages, or even determining the provenance of bomb letters in counter-terrorism research. Interestingly, the methods of stylometry are also actively applied in the Humanities, where multiple historic authorship problems in literary studies still await a definitive solution – the notorious Shakespeare-Marlowe controversy is perhaps the best example in this respect.

Authorship attribution plays a prominent role in the nascent field of stylometry, or the computational analysis of writing style (Stamatatos et al. 2000; Van Halteren et al. 2005; Juola 2006; Koppel et al. 2009; Stamatatos 2009). While this field has important historical precursors (Holmes 1994, 1998), recent decades have witnessed a clear increase in the scientific attention paid to this problem. Because of its emergent nature, replicability and benchmarking still pose significant challenges in the field (Stamatatos 2009). Publicly available benchmark data sets are hard to come by, mainly because of copyright and privacy issues, and only a few stable, cross-platform software packages are widely used in the community. Fortunately, a number of recent initiatives lead the way in this respect, such as the recent authorship tracks in the PAN competition (http://pan.webis.de), where e.g. relevant data sets are efficiently interchanged.

In this paper we introduce ‘Stylometry with R’ (stylo), a flexible R package for the high-level stylistic analysis of text collections. This package explicitly seeks to contribute to the field’s recent development towards a more advanced level of replicability and benchmarking. Stylometry is a multidisciplinary research endeavor, attracting contributions from divergent scientific domains: researchers from Computer Science – with a fairly technical background – as well as experts from the Humanities – who might lack the computational skills that would allow them easy access to the state-of-the-art methods in the field (Schreibman et al. 2004). Importantly, this package has the potential to help bridge the methodological gap lurking between these two communities of practice: on the one hand, stylo’s API allows users to set up a complete processing pipeline using traditional R scripting; on the other hand, stylo also offers a rich graphical user interface which allows non-technical, even novice practitioners to interface with state-of-the-art methods without the need for any programming experience.

2 Overview of stylometry

Stylometry deals with the relationship between the writing style in texts and meta-data about those texts (such as date, genre, gender, authorship). Researchers in ‘stylochronometry’, for instance, are interested in inferring the date of composition of texts on the basis of stylistic aspects (Juola 2007; Stamou 2008). Authorship studies are currently the most popular application of stylometry. From the point of view of literary studies, stylometry is typically concerned with a number of recent techniques from computational text analysis that are sometimes termed ‘distant reading’, ‘not reading’ or ‘macroanalysis’ (Jockers 2013). Instead of the traditional practice of ‘close reading’ in literary analysis, stylometry does not set out from a single direct reading; instead, it attempts to explore large text collections using computational techniques (and often visualization). Thus, stylometry tries to expand the scope of inquiry in the humanities by scaling up research resources to large text collections in order to find relationships and patterns of similarity and difference invisible to the eye of the human reader.

Usually, stylometric analyses involve a complex, multi-stage pipeline of (i) preprocessing, (ii) feature extraction, (iii) statistical analysis, and finally, (iv) presentation of results, e.g. via visualization. At present, researchers often have to resort to an ad hoc combination of proprietary, language-dependent tools that cannot easily be ported across different platforms. Such solutions are difficult to maintain and exchange across (groups of) individual researchers, preventing straightforward replication of research results and reuse of existing code. stylo, the package presented here, offers a rich, user-friendly suite of functionality that is ideally suited for fast exploratory analysis of textual corpora as well as for classification tasks such as those needed in authorship attribution. The package offers an implementation of the main methods currently dominant in the field. Its main advantage therefore lies in the integration of typical stylometric procedures (e.g. preprocessing) with statistical functionality provided by other, external libraries. Written in the R language, the source code and binaries for the package are freely available from the Comprehensive R Archive Network, guaranteeing a straightforward installation process across different platforms (both Unix- and Windows-based operating systems). The code is easily adaptable and extensible: the developers therefore continue to welcome user contributions, feedback and feature requests. Our code is open source and GPL-licensed: it is being actively developed on GitHub.1
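
Since the package is hosted on CRAN, the standard R installation procedure applies; for reference:

# install once from CRAN, then load the package in each session
install.packages("stylo")
library(stylo)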

In the rest of this paper, we will first illustrate the functionality of the package for unsupervised multivariate analysis through the high-level function stylo(). Secondly, we will discuss a number of graphical user interfaces which we provide for quick exploration of corpora, in particular by novice users or students in an educational setting, as well as by scholars in the Humanities without programming experience. Next, we move on to the function classify(), implementing a number of supervised classification procedures from the field of Machine Learning. Finally, we concisely discuss the oppose(), rolling.delta() and rolling.classify() functions, which allow users, respectively, to inspect differences in word usage between two subsets of a corpus, and to study the evolution of writing style within a text.

3 Overview of the package

Downloading, installing and loading stylo is straightforward. The package is available on CRAN and in a GitHub repository. The main advantages and innovative features of stylo include:

Feature extraction

Crucial in stylometry is the extraction of quantifiable features related to the writing style of texts (Sebastiani 2002). A wide range of features have been proposed in the literature, varying considerably in complexity (Stamatatos 2009). ‘Stylometry with R’ focuses on features that can be automatically extracted from texts, i.e. without having to resort to language-dependent preprocessing tools. The features that the package allows users to extract are \(n\)-grams at the token and character level (Kjell 1994; Houvardas and Stamatatos 2006). Apart from the fact that this makes the package largely language-independent, such shallow features have been shown to work well for a variety of tasks in stylometry (Daelemans 2013; Kestemont 2014). Moreover, users need not annotate their text materials using domain-specific tools before analyzing them with ‘Stylometry with R’. Apart from this standard usage, however, the package does allow users to load their own annotated corpora, provided that some text pre-processing has been carried out beforehand. An example of such a non-standard procedure will be shown below. Thus, stylo does not aim to supplant existing, more targeted tools and packages from Natural Language Processing (Feinerer et al. 2008), but it can easily accommodate the output of such tools as a part of its processing pipeline.

Metrics

A unique feature of stylo is that it offers reference implementations for a number of established distance metrics from multivariate statistical analysis, which are popular in stylometry, but uncommon outside the field. Burrows’s Delta is the best example here (Burrows 2002); it is an intuitive distance metric which has attracted a good share of attention in the community, also from a theoretical point of view (Hoover 2004b,a; Argamon 2011).
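
To make the metric concrete, the following minimal sketch computes Delta by hand from a table of relative frequencies (rows are texts, columns are features); it is an independent illustration of the published formula, not stylo’s own implementation:

# Burrows's Delta, implemented directly for illustration: z-score each
# feature across the corpus, then average the absolute z-score
# differences between two texts (assumes no zero-variance columns)
delta.dist <- function(freqs) {
    z <- scale(freqs)   # column-wise z-scores
    n <- nrow(z)
    d <- matrix(0, n, n, dimnames = list(rownames(freqs), rownames(freqs)))
    for (i in 1:n)
        for (j in 1:n)
            d[i, j] <- mean(abs(z[i, ] - z[j, ]))
    as.dist(d)
}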

Graphical user interface

The high-level functions of the package provide a number of Graphical User Interfaces (GUIs) which can be used to intuitively set up a number of established experimental workflows with a few clicks (e.g. unsupervised visualization of texts based on word frequencies). These interfaces can be easily invoked from the command line in R and provide an attractive overview of the various experimental parameters available, allowing users to quickly explore the main stylistic structure of corpora. This feature is especially useful in an educational setting, allowing (e.g. undergraduate) students from different fields, typically without any programming experience, to engage in stylometric experimentation. These high-level functions keep the entire analytic procedure, from corpus pre-processing to the final presentation of results, manageable from within a single GUI. More flexibility, however, can be achieved when the workflow is split into individual steps, each controlled by a dedicated lower-level function from the package, as will be showcased below.
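
Minimal invocation is enough to launch such an interface: calling the main high-level function without arguments opens its GUI (gui = TRUE being the default), after which all experimental parameters can be set via the dialog boxes:

# launch the GUI of the main function; parameters are then set interactively
library(stylo)
stylo()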

4 Example workflow

An experiment in stylometry usually involves a workflow whereby, successively, (i) textual data is acquired, (ii) the texts are preprocessed, (iii) stylistic features are extracted, (iv) a statistical analysis is performed, and finally, (v) the results are output (e.g. visualized). We will now illustrate how such a workflow can be performed using the package.

Corpus preparation

One of the most important features of stylo is that it allows textual data to be loaded either from R objects, or directly from corpus files stored in a dedicated folder. Metadata of the input texts are expected to be included in the file names. The file name convention assumes that any string of characters followed by an underscore becomes a class identifier (case sensitive). In final scatterplots and dendrograms, colors of the samples are assigned according to this convention; common file extensions are dropped. For example, to color the samples according to authorial class, files might be named as follows:

ABronte_Agnes.txt   ABronte_Tenant.txt      Austen_Pride.txt
Austen_Sense.txt    Austen_Emma.txt         CBronte_Professor.txt
CBronte_Jane.txt    CBronte_Villette.txt    EBronte_Wuthering.txt
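
Under this convention, the class label is simply the substring preceding the first underscore. A minimal base-R illustration (the file names are hypothetical; this is not a stylo API call):

# extract class identifiers by stripping everything from the first
# underscore onwards (this also drops the file extension)
fnames <- c("ABronte_Agnes.txt", "Austen_Pride.txt", "CBronte_Jane.txt")
gsub("_.*", "", fnames)
# [1] "ABronte" "Austen"  "CBronte"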

All examples below can be reproduced by the user on data sets which can be downloaded from the authors’ project website.2 For the sake of convenience, however, we will use the datasets that come with the package itself:

data(novels)
data(galbraith)
data(lee)

Our first example uses nine prose novels by Jane Austen and the Brontë sisters, provided by the dataset novels.

Preprocessing

stylo offers a rich set of options to load texts in various formats from a file system (preferably encoded in UTF-8 Unicode, although other encodings are also supported, e.g. under Windows). Apart from raw text, stylo allows users to load texts encoded according to the guidelines of the Text Encoding Initiative, which is relatively prominent in the community of text analysis researchers.3 To load all the files saved in a directory (e.g. corpus_files), users can use the following command:

raw.corpus <- load.corpus(files = "all", corpus.dir = "corpus_files", 
         encoding = "UTF-8")

If the texts are annotated in e.g. XML, an additional pre-processing procedure might be needed:

corpus.no.markup <- delete.markup(raw.corpus, markup.type = "xml")

Since the dataset that we will use has no annotation, the markup deletion step can be omitted. We begin by loading the dataset and inspecting its contents:

data(novels)
summary(novels)

To preprocess the data, stylo offers a number of tokenizers that support a representative set of European languages, including English, Latin, German, French, Spanish, Dutch, Polish and Hungarian, as well as basic support for non-Latin alphabets such as Korean, Chinese, Japanese, Hebrew, Arabic, Coptic and Greek. Tokenization refers to the process of dividing a string of input text into countable units, such as word tokens. To tokenize the English texts, e.g. splitting items such as ‘don’t’ into ‘do’ and ‘n’t’ and lowercasing all words, the following command can be used:

tokenized.corpus <- txt.to.words.ext(novels, language = "English.all", 
            preserve.case = FALSE)

The famous first sentence of Jane Austen’s Pride and Prejudice, for instance, looks like this in its tokenized version (the 8th to the 30th element of the corresponding vector):

tokenized.corpus$Austen_Pride[8:30]

 [1] "it"           "is"           "a"            "truth"        "universally" 
 [6] "acknowledged" "that"         "a"            "single"       "man"         
[11] "in"           "possession"   "of"           "a"            "good"        
[16] "fortune"      "must"         "be"           "in"           "want"        
[21] "of"           "a"            "wife" 

To see basic statistics of the tokenized corpus (number of texts/samples, number of tokens in particular texts, etc.), one might type:

summary(tokenized.corpus)

For complex scripts, such as Hebrew, custom splitting rules could easily be applied:

tokenized.corpus.custom.split <- txt.to.words(tokenized.corpus, 
       splitting.rule = "[^A-Za-z\U05C6\U05D0-\U05EA\U05F0-\U05F2]+",
       preserve.case = TRUE)

A next step might involve ‘pronoun deletion’. Personal pronouns are often removed in stylometric studies because they tend to be too strongly correlated with the specific topic or genre of a text (Pennebaker 2011), which is an unwanted artefact in e.g. authorship studies (Hoover 2004b,a). Lists of pronouns are available in stylo for each of the supported languages; they can be accessed, for example, via:

stylo.pronouns(language = "English")

 [1] "he"         "her"        "hers"       "herself"    "him"       
 [6] "himself"    "his"        "i"          "me"         "mine"      
[11] "my"         "myself"     "our"        "ours"       "ourselves" 
[16] "she"        "thee"       "their"      "them"       "themselves"
[21] "they"       "thou"       "thy"        "thyself"    "us"        
[26] "we"         "ye"         "you"        "your"       "yours"     
[31] "yourself" 

Removing pronouns from the analysis (much like stopwords are removed in Information Retrieval) is easy in stylo, using the delete.stop.words() function:

corpus.no.pronouns <- delete.stop.words(tokenized.corpus, 
               stop.words = stylo.pronouns(language = "English"))

The above procedure can also be used to exclude any set of words from the input corpus.
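
For instance, an arbitrary (here hypothetical) list of words can be excluded in exactly the same way:

# remove a custom set of words instead of the built-in pronoun list
corpus.no.extras <- delete.stop.words(tokenized.corpus,
               stop.words = c("yes", "no", "oh", "ah"))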

Features

After these preprocessing steps, users will want to extract quantifiable features from the corpus. In the vast majority of approaches, stylometrists rely on high-frequency items. Such features are typically extracted at the level of (groups of) words or characters, called \(n\)-grams (Kjell 1994). Both word-token and character \(n\)-grams are common textual features in present-day authorship studies. Stylo allows users to specify the size of the \(n\)-grams which they want to use. For character trigrams (\(n=3\)), for instance, an appropriate function of stylo will select partially overlapping series of character groups of length 3 from a string of words (e.g. ‘tri’, ‘rig’, ‘igr’, ‘gra’, ‘ram’, ‘ams’). Whereas token-level features have a longer tradition in the field, character \(n\)-grams have been borrowed fairly recently from the field of language identification in Computer Science (Stamatatos 2009; Eder 2011). Both \(n\)-grams at the level of characters and words have been listed among the most effective stylistic features in survey studies in the field. For \(n=1\), such text representations model texts under the so-called ‘bag-of-words’ assumption that the order and position of items in a text is negligible stylistic information. To convert the single words into character trigrams:

corpus.char.3.grams <- txt.to.features(corpus.no.pronouns, ngram.size = 3, 
       features = "c")

Sampling

Users can study texts in their entirety, but also draw consecutive samples from texts in order to effectively assess the internal stylistic coherence of works. The sampling settings will affect how the relative frequencies are calculated and allow users to normalize text length in the data set. Users can specify a sample size (expressed in the units currently used, e.g. words) to divide texts into consecutive slices. The samples can partially overlap and they can also be extracted randomly. As with all functions, the available options are well-documented:

help(make.samples)

To split the current corpus into non-overlapping samples of 20,000 words each, one might type:

sliced.corpus <- make.samples(tokenized.corpus, sampling = "normal.sampling", 
        sample.size = 20000)
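
If partially overlapping slices are desired, a variant might look as follows; this sketch assumes the sample.overlap parameter listed in help(make.samples), expressed in the same units as sample.size:

# overlapping samples: each 20,000-word slice shares 1,000 words
# with its neighbour
sliced.corpus.overlap <- make.samples(tokenized.corpus,
        sampling = "normal.sampling", sample.size = 20000,
        sample.overlap = 1000)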

Counting frequent features

A crucial step in preparing the dataset is building a frequency table. In stylometry, analyses are typically restricted to a feature space containing the \(n\) most frequent items. It is relatively easy to extract e.g. the 3,000 most frequent features from the corpus using the following function:

frequent.features <- make.frequency.list(sliced.corpus, head = 3000)

After the relevant features have been harvested, users have to extract a vector for each text or sample, containing the relative frequencies of these features, and combine them into a frequency table for the corpus. Using an appropriate function from stylo, these vectors are combined in a feature frequency table which can be fed into a statistical analysis (external tables of frequencies can be loaded as well):

freqs <- make.table.of.frequencies(sliced.corpus, features = frequent.features)
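
The resulting table can be inspected with standard R indexing, e.g. to view the relative frequencies of the first few features in the first few samples:

# a small slice of the frequency table: rows are samples,
# columns are features
freqs[1:5, 1:5]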

Feature selection and sampling settings might interact: a distinctive feature of stylo is that it allows users to specify different ‘culling’ settings. Via culling, users specify the percentage of samples in the corpus in which a feature must be present in order to be included in the analysis. Words that do not occur in at least the specified proportion of the samples in the corpus will be ignored. For an 80% culling rate, for instance:

culled.freqs <- perform.culling(freqs, culling.level = 80)
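
At the extreme, a culling rate of 100% retains only those features that occur in every single sample:

# keep only features attested in all samples of the corpus
fully.culled.freqs <- perform.culling(freqs, culling.level = 100)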

Analysis

Stylo offers a seamless wrapper for a variety of established statistical routines available from R’s core library or contributed by third-party developers; these include t-Distributed Stochastic Neighbor Embedding (Maaten and Hinton 2008), Principal Components Analysis, Hierarchical Clustering and Bootstrap Consensus Trees (a method which will be discussed below). An experiment can be initiated from a pre-existing frequency table using the following command:

stylo(frequencies = culled.freqs, gui = FALSE)

When the input documents are loaded directly from text files, the default features are most frequent words (MFWs), i.e. 1-grams of frequent word forms turned into lowercase. Also, by default, a standard cluster analysis of the 100 most frequent features will be performed. To perform e.g. a Principal Components Analysis (with a correlation matrix) of the 200 most frequent words, and to visualize the samples’ position in the space defined by the first two principal components, users can issue the following commands:

stylo(corpus.dir = "directory_containing_the_files", mfw.min = 200, mfw.max = 200, 
       analysis.type = "PCR", sampling = "normal.sampling", sample.size = 10000, 
       gui = FALSE)

In Fig. 1, we give an example of how Principal Components Analysis (the first two dimensions) can be used to visualize texts in different ways, e.g. with and without feature loadings. Because researchers are often interested in inspecting the loadings of features in the first two components resulting from such an analysis, stylo provides a rich variety of flavours in its PCA visualizations. For an experiment in the domain of authorship studies, for instance, researchers will typically find it useful to plot all texts/samples from the same author in the same color. The coloring of the items in plots can be easily controlled via the titles of the texts analyzed, consistently across the different R methods that are used for visualization – a convenience which is normally rather painful to implement across different packages in R. Apart from exploratory, unsupervised analyses, stylo offers a number of classification routines that will be discussed below.

The examples shown in Fig. 1 were produced using the following calls:

stylo(frequencies = culled.freqs, analysis.type = "PCR", 
        custom.graph.title = "Austen vs. the Bronte sisters",
        pca.visual.flavour = "technical", 
        write.png.file = TRUE, gui = FALSE)
        
stylo(frequencies = culled.freqs, analysis.type = "PCR", 
        custom.graph.title = "Austen vs. the Bronte sisters",
        write.png.file = TRUE, gui = FALSE)
        
stylo(frequencies = culled.freqs, analysis.type = "PCR", 
        custom.graph.title = "Austen vs. the Bronte sisters",
        pca.visual.flavour = "symbols", colors.on.graphs = "black",
        write.png.file = TRUE, gui = FALSE)
        
stylo(frequencies = culled.freqs, analysis.type = "PCR", 
        custom.graph.title = "Austen vs. the Bronte sisters",
        pca.visual.flavour = "loadings", 
        write.png.file = TRUE, gui = FALSE)

Figure 1: Principal Components Analysis (first two components) of the Austen and Brontë corpus, rendered in stylo’s different PCA flavours (technical, default, symbols and loadings).