Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R

In recent years, the cost of DNA sequencing has decreased at a rate that has outpaced improvements in memory capacity. It is now common to collect or have access to many gigabytes of biological sequences. This has created an urgent need for approaches that analyze sequences in subsets without requiring all of the sequences to be loaded into memory at one time. It has also opened opportunities to improve the organization and accessibility of information acquired in sequencing projects. The DECIPHER package offers solutions to these problems by assisting in the curation of large sets of biological sequences stored in compressed format inside a database. This approach has many practical advantages over standard bioinformatics workflows, and enables large analyses that would otherwise be prohibitively time consuming.

Erik S. Wright (University of Wisconsin - Madison)
2016-05-01

1 Introduction

With the advent of next-generation sequencing technologies, the cost of sequencing DNA has plummeted, facilitating a deluge of biological sequences (Hayden 2014). Since the cost of computer storage space has not kept pace with biologists’ ability to generate data, multiple compression methods have been developed for compactly storing nucleotide sequences (Deorowicz and Grabowski 2013). These methods are critical for efficiently transmitting and preserving the outputs of next-generation sequencing machines. However, far less emphasis has been placed on the organization and usability of the massive amounts of biological sequences that are now routine in bioinformatics work. It is still commonplace to decompress large files and load them entirely into memory before analyses are performed. This traditional approach is quickly becoming infeasible as sequence sets swell in size, and alternative methods for storing, organizing, and analyzing sequences are needed.

A typical bioinformatics workflow begins with a set of biological sequences in one or more text files. These sequences are used as input to subsequent analysis steps, each of which generates text files as output (Schloss et al. 2011). Large workflows constructed in this manner can generate a plethora of text files, often resulting in unnecessary redundancy and disorganization. Fortunately, databases offer an organized means for storing related data, and underlie many commonly used bioinformatics software such as BLAST (Altschul et al. 1997). Nevertheless, biologists rarely use databases to curate their own sequences, in large part due to the difficulties associated with creating and accessing a database. Here I describe flexible user-friendly workflows for employing databases to efficiently analyze large sets of sequences via the R programming language.

Although R has traditionally been viewed as statistical software, many add-on packages are available for analyzing biological sequence data. One such package, Biostrings (Pagès et al.), offers a suite of functions for reading, writing, searching, and manipulating DNA, RNA, or amino acid sequences. Sequences are stored in memory according to their corresponding XStringSet class, where “X” denotes the type of sequence (e.g., “AA” for amino acid sequences). For example, a "DNAStringSet" can store the standard DNA bases (“A”, “C”, “G”, or “T”), as well as ambiguity codes (e.g., “N” for any base) and the gap character used in alignment (“-”).
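As a brief illustration of these classes, the sketch below builds a small "DNAStringSet" containing standard bases, an ambiguity code, and gap characters (the sequences are invented for illustration):

> library(Biostrings)
> 
> # a small DNAStringSet with standard bases, an ambiguity code ("N"),
> # and alignment gaps ("-")
> dna <- DNAStringSet(c(seq1 = "ACGTACGT",
+                       seq2 = "ACGTNNGT",
+                       seq3 = "AC--ACGT"))
> width(dna) # lengths of the individual sequences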

The DECIPHER package also makes use of XStringSet classes. However, unlike other R packages for biological sequence analysis, DECIPHER employs databases so that an entire sequence set does not need to simultaneously reside in memory. This enables DECIPHER to extend many analyses to millions of sequences without requiring extreme amounts of memory, and offers a means of handling the even more massive biological datasets of the future. The DECIPHER package also includes many advanced functions for oligonucleotide design (Wright et al. 2014b,a; Wright and Vetsigian 2016), sequence alignment (Wright 2015), and other common bioinformatics tasks. Despite its many applications, the core database functionality underpinning DECIPHER has not been previously described.

New users of DECIPHER are often unaccustomed to the use of a database to manage their own sequences. The purpose of this text is to describe the merits of this approach, and outline how sequence databases are configured by DECIPHER and can be used to improve analysis workflows. Sequences are stored independently within the database in a compressed format, which enables the database to be compact while maintaining fast random access to different sequences. The custom compression algorithm implemented in DECIPHER is compared to standard compression algorithms accessible within R. Finally, example uses of a sequence database are provided to demonstrate the power of this alternative workflow.

2 Merits of databases for storing biological sequences

A database, much like a spreadsheet, is an ideal way to maintain interconnected data. The concept of storing biological sequences in a relational database is not new (Xie et al. 2000), and underpins many popular bioinformatics programs and online web tools. However, end-users of these tools rarely directly employ databases for their own sequences, despite many practical advantages such as:

  1. Organizational improvements:
    Databases can be arranged to minimize redundancy by maintaining an association between different columns. For example, the length of each sequence in the database can be easily added as a separate column.

  2. Random access:
    Databases permit quick access to subsets of the data, without the need to seek through large text files in order to find the desired subset. For example, it is very fast to obtain the longest sequence in the database once a column with the length of each sequence has been added (both steps are sketched in the example following this list).

  3. Concurrent users:
    A database can be queried concurrently by multiple users without interference. For example, one user can obtain the shortest sequence, while another simultaneously requests the longest sequence.

  4. Reliable storage:
    Commands can be used that allow the database to revert changes in case of a mistake. DECIPHER workflows are designed to be non-destructive, so that the original sequence information is always preserved.
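As a minimal sketch of points 1 and 2, the commands below add a hypothetical “bases” column containing each sequence’s length and then retrieve the row of the longest sequence. They assume a connection dbConn to a database that has already been populated as described in the following sections; the column name “bases” is chosen purely for illustration:

> # retrieve the sequences and record their lengths in a new column
> dna <- SearchDB(dbConn)
> Add2DB(data.frame(row.names = names(dna), bases = width(dna)), dbConn)
> 
> # random access: fetch the row of the single longest sequence
> dbGetQuery(dbConn,
+            "select row_names, bases from Seqs order by bases desc limit 1")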

Several standard add-on packages are available to interface between R and popular database management systems (DBMSs), including MySQL, PostgreSQL, and SQLite (Ripley 2001b). DECIPHER uses SQLite for a number of reasons. First, SQLite databases are flat files that can easily be transferred between computers, or even emailed like a standard text file. Second, SQLite requires minimal setup on the part of the end-user, unlike some DBMSs. Third, support for the BLOB data type from R is currently only available from the RSQLite package. The BLOB type is used to store compressed sequences, which are of the raw (binary) type within R. Lastly, SQLite databases can contain an exceedingly large number of rows, up to 2^64, meaning that they are typically limited in size by the available disk space rather than by the DBMS.

The use of SQLite as the sole DBMS results in a few limitations. First, users cannot be given separate access privileges, as all users are considered database administrators. Second, concurrent writes to the same database are not permitted, although concurrent reads are not a problem. These drawbacks cause few practical limitations for a typical group consisting of a few database users performing standard operations with DECIPHER. Furthermore, databases can be organized such that each sequencing project resides in its own table, which minimizes conflicts between users.

3 Anatomy of a DECIPHER database

DECIPHER uses a simple relational database schema involving two tables (Fig. 1): one that is highly “visible” to the user (named “Seqs” by default) containing information about the sequences, and a second “hidden” table (named “_Seqs”) for storing compressed sequences and, if applicable, their corresponding quality scores. The tables are connected by a shared primary key, named “row_names”, that enables fast lookup between the two tables. This split-table design substantially increases access speed over using a single table, because the table being queried does not contain the sequences themselves, which are often large and would otherwise slow access to the other data. Access to separate tables within the same database is provided via the tblName argument of each function.
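For reference, once a database has been populated (see below), the two tables can be inspected with standard RSQLite commands; a minimal sketch, assuming dbConn is a connection to the database, is:

> dbListTables(dbConn) # expected to show the "Seqs" and "_Seqs" tables
> dbListFields(dbConn, "Seqs") # descriptor columns, including "row_names"
> dbListFields(dbConn, "_Seqs") # compressed sequences (and quality scores)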

Figure 1: Database schema used by DECIPHER. Two tables are created when importing sequences with Seqs2DB: one that is highly visible to the user (named “Seqs” by default) and a second (named “_Seqs”) that is largely hidden. The tables are connected by a shared primary key (“row_names”) that enables fast lookup between the two tables. Columns containing the sequence descriptors, compressed sequences, and compressed quality scores (if applicable) are automatically generated. The user must specify an “identifier” during import, which can be changed later through a variety of methods. Additional columns of data may be added to the database using the Add2DB function.

Sequences are imported with the Seqs2DB function, which supports three popular file formats (FASTA, FASTQ, and GenBank) as well as in-memory XStringSet objects. The destination table is automatically populated with appropriate text columns containing information stored in the file. For example, sequences imported from FASTA files store each sequence’s record name in a column named “description”. FASTQ files are imported similarly, but the quality information corresponding to each sequence is also stored, in gzip compressed format. For GenBank files, the “description” column is obtained from the DEFINITION field, and other fields can be imported as desired using the fields argument of Seqs2DB. By default, the ACCESSION and ORGANISM fields are imported as additional columns named “accession” and “rank” in the database.
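For instance, a gzipped FASTA file could be imported into its own table with a call along these lines (the file name and table name here are hypothetical):

> Seqs2DB(seqs = "seqs.fas.gz",
+         type = "FASTA",
+         dbFile = dbConn,
+         identifier = "file1",
+         tblName = "Run1") # each project can reside in a separate table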

All DECIPHER databases require an “identifier” to be specified during import. The identifier column is used extensively by DECIPHER functions, and is the recommended way to delineate groups of sequences. Common ways to identify sequences include: the file they were imported from (e.g., “file1”, “file2”, etc.), the cluster they belong to in a phylogeny (e.g., “cluster1”, “cluster2”, etc.), or the name of the organism from which the sequence originated (e.g., “E. coli”, “Yeast”, etc.). It is also possible to simply provide an empty character string (i.e., “”) as the identifier during import, and then set the identifier later. DECIPHER includes several functions to help with identifying the sequences after they are imported. For example, the IdClusters function can assign phylogenetic cluster numbers. The IdentifyByRank function can parse the “rank” column from an imported GenBank file to identify sequences according to a given taxonomic rank (e.g., “species”).

DECIPHER uses R’s connection interface (Ripley 2001a) to read files incrementally during import. Files compressed with gzip, bzip2, xz, or lzma compression are automatically detected and read appropriately without user intervention. The user can also provide a URL instead of a file path, which enables sequences to be imported from “http” or “ftp” sources without first needing to save the file locally. In the case of URLs, only uncompressed text files or files with gzip compression are supported. Notwithstanding this limitation, reading files directly from online sources, such as NCBI repositories, is often preferable to downloading files locally before importing them into a database, as it prevents unnecessary redundancy. An example of importing a compressed GenBank file directly from an online source is shown below:

> library(DECIPHER)
>
> # specify the input file and database location
> gbk_file <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/RNA/rna.gbk.gz"
> db_file <- "~/Desktop/SeqsDB.sqlite"
> dbConn <- dbConnect(SQLite(), db_file)
> 
> # import the sequences from online
> Seqs2DB(seqs = gbk_file,
+         type = "GenBank",
+         dbFile = dbConn,
+         identifier = "Human_mRNA")
162916 total sequences in table Seqs.
Time difference of 176.19 secs

The imported file contains known and predicted human RNA sequences. At this point it is useful to reset the values in the identifier column so that they can be referenced in downstream analyses. In this particular example, the prefix of the accession number can be parsed into one of four values, indicating whether the sequence is a predicted or hand-curated RNA or messenger RNA (mRNA) sequence. This information is updated in the database using the Add2DB function, which, like many other DECIPHER functions, displays the associated SQL commands when the input argument verbose is TRUE. Add2DB will add or update table columns in accordance with the column names and row names in a "data.frame" input. For example, in this case the values of identifier in the input "data.frame" are added to the rows with corresponding “row_names” in the database. To view the results of this modification, the database table can be displayed in a web browser with the BrowseDB function (Fig. 2).

> # reset the 'identifier' column based on the prefix of the accession number
> x <- dbGetQuery(dbConn, "select accession from Seqs")
> id <- substring(x$accession, 1, 2)
> id <- c(`NM` = "curated mRNA",
+         `NR` = "curated RNA",
+         `XM` = "predicted mRNA",
+         `XR` = "predicted RNA")[id]
> Add2DB(data.frame(identifier = id), dbConn)
Expression:
update or replace Seqs set identifier = :identifier where row_names =
:row_names

Added to table Seqs:  "identifier".
Time difference of 2.06 secs

> BrowseDB(dbConn, limit = 1000)
Figure 2: Screenshot of viewing a database table in a web browser using the BrowseDB function. Metadata information such as accession numbers and taxonomy are automatically imported from GenBank sequence files into separate table columns. The value in the “identifier” column controls sequence groupings in many DECIPHER functions, and can be set during or after import by a variety of methods. For easier viewing, the text in each field is truncated at a maximum of 50 characters by default.

4 The nbit compression format for nucleotides

Many compression algorithms have been proposed for storing nucleotide sequences, some of which are specific to the FASTQ file format (Deorowicz and Grabowski 2013). The most common compression method is gzip, owing to its reasonable compression ratio and high decompression rate. However, since gzip is a generalized method, it may be possible to obtain better compression ratios by using algorithms specific to DNA sequences. In particular, reference-based compression is generally preferable when a reference sequence is available (Jones et al. 2012). In re-sequencing projects a reference genome is always available, whereas it is often unavailable for new sequencing projects. In either case, compression of a file containing many similar sequences allows redundancy to be exploited to a greater extent than independently compressing sequences.

Several considerations were taken into account when designing a compression method for DECIPHER. First, the method must work well across a wide range of sequence numbers and lengths. Second, to allow random access and incremental deposition, each sequence is stored independently in the database, which rules out exploiting redundancy between sequences. Third, compression and decompression rates were prioritized over achieving the maximal possible compression ratio. Fourth, support for all possible characters in the DNA and RNA alphabets was desired, in particular the gap character, which is largely neglected by most DNA-specific compression formats. These considerations led to the development of the nbit compression method for DNA and RNA sequences. Compression with nbit uses a combination of 2-bit encoding for gapless stretches of bases (i.e., A=00, C=01, G=10, T=11) and 3-bit encoding for gappy regions. Although this encoding was inspired by prior work (Wandelt et al. 2014), DECIPHER uses a unique implementation that is customized to the package’s goals.

In principle, the best compression achievable with this encoding is 2 bits per base, which is the theoretical limit (i.e., the Shannon entropy) for four characters drawn at random in equal proportions. However, DNA sequences are non-random in their construction, and this redundancy permits additional compression. To this end, the nbit algorithm encodes runs of a single base or ambiguity code (e.g., “NNN”), which are frequent in some sequences. Furthermore, long DNA sequences often contain exact repeats or reverse complement repeats, which can be stored compactly by referencing their prior occurrence. Finally, the nbit format includes a variable-sized cyclic redundancy check to ensure data integrity.

Figure 3: Comparison between different lossless compression algorithms on random subsequences of Human Chromosome II (HG18). DECIPHER’s custom nbit compression exhibited substantially faster compression rates than the other methods. The nbit algorithm offered a better compression ratio than any of the other methods for sequences less than about 100,000 nucleotides, beyond which xz compression provided more compaction, albeit at a substantially slower compression rate. The average of 100 replicates is shown for all methods, gzip (v1.2.8), bzip2 (v1.0.6), and xz (v5.0.7), computed using a single processor.

The nbit compression and decompression algorithms were implemented in the DECIPHER function Codec, which interconverts between character (uncompressed) and binary (compressed) data. In comparison to the three generalized compression methods available in R, the nbit algorithm exhibits a substantially faster compression rate, as shown in Figure 3. Compression ratios are generally better than those of the other methods, except for sequences longer than about 100,000 nucleotides, where xz compression results in more compaction at the expense of substantially longer compression times. The user may also specify more than one processor, in which case the Codec function will compress or decompress sequences in parallel. Furthermore, the Codec function automatically falls back to gzip compression when the sequences are incompressible with nbit, as in the case of amino acid sequences.
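To illustrate, the short sketch below round-trips a toy sequence through Codec; it assumes, per the description above, that compressing a character vector yields a list of raw vectors and that supplying that list back to Codec restores the original characters:

> dna <- "ACGTACGTNNNN----ACGT" # toy sequence with an ambiguity run and gaps
> comp <- Codec(dna) # compress (nbit with a gzip fallback by default)
> length(comp[[1]]) # number of bytes in the compressed representation
> identical(Codec(comp), dna) # decompression restores the original characters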

5 Example workflow with DECIPHER

One of the major advantages of using a sequence database is the ability to quickly access subsets of the sequences matching certain criteria. The SearchDB function can be used to easily build flexible queries and obtain the sequences meeting those specifications. SearchDB supports several common SQL clauses, including ‘LIMIT’, ‘OFFSET’, ‘ORDER BY’, and ‘WHERE’. If the sequence type is not specified, SearchDB automatically detects whether the sequences returned from the search are DNA, RNA, or amino acid (AA). In addition, it can be used to quickly count the number of sequences matching a query, name the sequences based on the value in a specific table column, remove gaps (“-”) from sequences, or replace characters not present in the specified sequence alphabet. As an example, the command below finds all of the “curated mRNA” sequences with “transcription factor” in their “description” and names the sequences by their accession number. The sequences can then be viewed in a web browser using the BrowseSeqs function, as shown in Figure 4.
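A sketch of such a command is given below; the database connection is assumed from above, and the argument names nameBy and clause are assumptions about SearchDB’s interface used here to express the query described in the text:

> dna <- SearchDB(dbConn,
+                 identifier = "curated mRNA",
+                 nameBy = "accession",
+                 clause = "description like '%transcription factor%'")
> BrowseSeqs(dna) # view the matching sequences in a web browser (Fig. 4)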