Finding sunken treasure in the genome sequencing data flood.


In the era of fast, inexpensive, high-throughput sequencing, we are flooded with vast quantities of data generated each day. In the last few years we have seen the contents of GenBank swell, and the availability of benchtop genome sequencing machines will only raise the floodwaters higher. It is quite natural to want to get above the deluge and assess the situation from higher ground. In genomics, this sort of approach has certainly yielded some important insights, made possible through bioinformatics and computational analyses. Yet there are times when diving into the data, facing the floodwaters head on, can yield sunken treasures.


Diving into the data can yield sunken treasures

Sequence duplications and chromosome rearrangements
Before the introduction of disruptive next-generation sequencing (NGS) in 2006, published bacterial genome sequences were complete, fully assembled, and annotated by a person. During this process, discoveries were made. TIGR were able to identify a 32 kb tandem duplication in Neisseria meningitidis strain MC58, which was only evident when looking at the sequence coverage of the assembled genome. During assembly, identical regions such as this are collapsed into one contig with greater coverage than the rest of the genome, as are the multiple, near-identical copies of the rRNA loci. The Sanger Centre were able to propose a never-before-described inversion-mediated phase-variable capsule expression system in Bacteroides fragilis, identified when attempting to complete the final assembly of the genome. Today, with the vast majority of genome sequence data in an incomplete, permanent draft state, discoveries such as these may not be made. Currently, NGS is dominated by short-read technologies, which are best suited for human-centric sequencing applications. As a result, bacterial genome sequencing projects too frequently use technologies that are available or popular for eukaryotic applications, rather than those best suited to the experimental design for their organism and to sequencing across the repeats within its prokaryotic genome. In addition, other insights into the bacteria are waiting to be discovered: sunken treasures in the genomic data flood.
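
The coverage signal described above can be illustrated with a short sketch. It assumes per-base read depths are already available as a list of numbers (for example, exported from a tool such as samtools depth); the function name, window size, and ratio threshold are hypothetical choices for illustration, not the method used for strain MC58.

```python
from statistics import median


def flag_high_coverage(depths, window=1000, ratio=1.8):
    """Return (start, end, mean_depth) for windows with unusually high depth.

    A window is flagged when its mean depth is at least `ratio` times the
    median of all window means. Collapsed duplications and multi-copy loci
    are expected to show roughly a multiple of the baseline coverage.
    """
    window_means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        window_means.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    if not window_means:
        return []
    baseline = median(mean for _, _, mean in window_means)
    if baseline <= 0:
        return []
    return [w for w in window_means if w[2] >= ratio * baseline]


# Toy data (invented): 30x coverage with a 2 kb region at 60x.
toy_depths = [30] * 5000 + [60] * 2000 + [30] * 5000
for start, end, mean_depth in flag_high_coverage(toy_depths):
    print(f"{start}-{end}: mean depth {mean_depth:.1f}")
```

A flagged window is only a lead to follow up; uneven library preparation or plasmid copy number could produce a similar signal.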

Comparative analyses of published complete genome sequences have provided insight into the features contributing to chromosomal rearrangement in N. gonorrhoeae. From this, a model could be proposed in which ISNgo2-mediated excision and reintegration contributed to the chromosomal structure of N. gonorrhoeae strain NCCP11945, while homologous recombination between repetitive sequences is proposed to have caused rearrangements in strain FA1090. Chromosomal structural changes and rearrangements can only be investigated with complete, correctly assembled genome sequences. The N. gonorrhoeae strain FA1090 assembly was aided by a physical map. Before we conducted our detailed comparative analysis of the N. gonorrhoeae strain NCCP11945 genome sequence against strain FA1090, we used pulsed-field gel electrophoresis to verify the assembly. Yet the vast majority of bacterial genomes in the public databases are incomplete and unassembled.

Current status of sequencing projects
As of 18 March 2015 there are 39,257 bacterial genome sequencing projects in the Genomes OnLine Database (GOLD). Of these, 22,017 are permanent drafts (56.1%) and 13,034 are incomplete (33.2%). Even amongst the complete and published (3,057) and complete (608) genome sequencing projects, only a few will be closed chromosomes rather than sets of contigs. With hard work in the laboratory, a bit of clever bioinformatics, and some time and patience, some of these bacterial genomes could be closed. However, the investigators involved have likely already obtained the information they required from these projects, and closing a genome requires additional funding and staff time in the lab.
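
For readers who want to check the proportions, the arithmetic can be reproduced in a few lines. The counts are those quoted above; the function name is a hypothetical helper introduced for illustration.

```python
def percent_of_total(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


TOTAL_PROJECTS = 39257  # bacterial projects in GOLD as of 18 March 2015

# Project counts by status, as quoted in the text.
status_counts = {
    "permanent draft": 22017,
    "incomplete": 13034,
    "complete and published": 3057,
    "complete": 608,
}

for status, count in status_counts.items():
    print(f"{status}: {count} ({percent_of_total(count, TOTAL_PROJECTS)}%)")
```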

Sequencing technology
One of the contributing factors to the abundance of incomplete genome sequencing projects is the technology itself. A vast amount of sequence can be obtained from the Illumina sequencing machines, but only as very short reads, particularly on the HiSeq, the model most popular for eukaryotic applications. In bacterial genome sequences, which can be riddled with repetitive sequences that confound assembly, this can be counterproductive to understanding the bacterial genome, or transcriptome for that matter. Other technologies have sought to overcome this issue. The Ion Torrent PGM system is particularly well suited to the prokaryotic genome: it not only has 400 bp read lengths, but is also supported by a TrueMate kit system that is specifically designed for closing bacterial chromosomes. In the final stages of assembly of a bacterial genome sequencing project, the contigs are often divided by the ribosomal RNA loci, which are present multiple times on the chromosome and are approximately 6 kb. The Pacific Biosciences sequencing system can sequence through a region of this size and therefore enables assembly of a bacterial genome regardless of homologous regions, even those as large as the rRNA loci. However, the cost per genome and the initial outlay for the machine have kept this technology from being widely used. Likewise, the read-length potential of Oxford Nanopore technology could overcome assembly issues, but this technology is not yet available to everyone and there are concerns about its accuracy that need to be resolved.
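
The relationship between read length and repeat size can be put as a simple rule of thumb: a single read can only place a repeat unambiguously if it covers the whole repeat plus some unique sequence on each side. The sketch below is a simplification under that assumption; the function name, the 50 bp flank, and the 10,000 bp long-read length are hypothetical values for illustration, and real assemblers also draw on paired-end and mate-pair information.

```python
def read_spans_repeat(read_length, repeat_length, min_flank=50):
    """Return True if one read can cover a repeat plus unique flanks.

    min_flank is the amount of unique sequence (in bp) required on each
    side of the repeat to anchor the read; 50 bp is an arbitrary choice.
    """
    return read_length >= repeat_length + 2 * min_flank


RRNA_LOCUS_BP = 6000  # approximate size of an rRNA locus, from the text

# A 400 bp read cannot span an rRNA locus.
print(read_spans_repeat(400, RRNA_LOCUS_BP))

# A hypothetical 10,000 bp long read can.
print(read_spans_repeat(10000, RRNA_LOCUS_BP))
```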

Trawling the sea of data
Even though the vast majority of bacterial genomic data is fragmented and destined not to be assembled into complete, circular genomes (or linear chromosomes, as appropriate), this genomic flood of data is still full of sunken treasure. To exploit this vast resource, we need only form hypotheses and look for evidence. We may find that our hypotheses are incorrect or need to be revised once new perspectives have been gained, but the availability of genome sequence data gives us the freedom to make these discoveries. It is time to dive in and explore not only the complete and well-annotated sequence data, but also the incomplete, draft sequence data, which may contain N’s and may not be annotated. It is important to be cautious in our use of annotations; annotators are only human and can make mistakes or, as is more frequently the case, annotators are computers making best-hit matches without investigating the biology of the organism or the published literature. Also, keep in mind that for some hypotheses, the evidence may only be in the raw sequence data that comes straight off the sequencing machines, not in the assembled contigs and scaffolds that are submitted to the public databases. Evidence of sequence inversions or of changes in simple sequence repeats during phase variation will be lost during the creation of contigs. Most importantly, don’t be intimidated by the volume of data, and remember that we are fortunate as bacteriologists that bacterial genomes are relatively small; it could be worse!
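
A first look at a draft assembly can be as simple as counting its contigs and its N’s. The sketch below assumes the draft is in plain FASTA text and uses only the Python standard library; the function names and the toy sequences are hypothetical, made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else f"unnamed_{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {rec: "".join(parts) for rec, parts in records.items()}


def summarize_draft(records):
    """Report contig count, total length, number of N bases, longest contig."""
    lengths = [len(seq) for seq in records.values()]
    return {
        "contigs": len(records),
        "total_bp": sum(lengths),
        "n_bases": sum(seq.count("N") for seq in records.values()),
        "longest_bp": max(lengths, default=0),
    }


# Toy draft assembly (invented sequences).
toy_fasta = """>contig1 example
ACGTNNNN
ACGT
>contig2
ACGTAC
"""
print(summarize_draft(parse_fasta(toy_fasta)))
```

To use this on a real file, read its contents into a string first and pass that string to parse_fasta.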

Even without complete genomes, there are important insights that can be gained, which is why we continue to sequence genomes and leave them in an incomplete state. A great deal can be learned about the evolutionary relationship between strains by extracting scattered pieces of information from a fragmented genome, such as the rRNA loci sequences, Ribosomal Multilocus Sequence Typing (rMLST) sequences, MLST sequences, and antibiotic resistance marker sequences, and comparing these between isolates. Indeed, genome-wide single nucleotide polymorphism analyses, even on incomplete genome sequences, can help identify whether isolates are related and, therefore, whether disease cases are due to clones, and thus a potential outbreak, or are from distinct bacterial isolates. Sequencing is also enabling us to understand the rate at which bacteria accumulate mutations, through genome sequencing of isolates known to have been transmitted between patients and of isolates from the same patient over time.
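
As a minimal sketch of the comparison described above, the following counts single nucleotide differences between isolates. It assumes the sequences have already been aligned to the same length, and it skips any position where either isolate has an N or a gap, as draft data often will. The function names and the toy sequences are hypothetical; real analyses typically rely on dedicated variant-calling pipelines.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has anything other than A, C, G or T
    (for example N or a gap) are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def pairwise_snp_distances(alignment):
    """Return {(name1, name2): distance} for every pair of isolates."""
    return {
        (first, second): snp_distance(alignment[first], alignment[second])
        for first, second in combinations(sorted(alignment), 2)
    }


# Toy alignment (invented sequences).
toy_alignment = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "ACTTANGTGT",
}
for pair, distance in pairwise_snp_distances(toy_alignment).items():
    print(pair, distance)
```

How few differences indicate a clonal relationship depends on the organism and its mutation rate, so no threshold is suggested here.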

There is already a great wealth of sequence data available and undoubtedly more to come from technologies that, as they develop and improve, may one day generate a completely assembled sequence of the bacterial chromosome.

Topically, the Society for Applied Microbiology Spring Meeting 2016 on whole genome sequencing (19 April 2016) will feature presentations on the latest technologies in WGS that have opened up many new opportunities in health and disease research. The meeting will give all delegates a better understanding of the laboratory process that is revolutionizing microbiology.

Lori A. S. Snyder
Kingston University
