Bacterial genomes are a treasure trove of information, be it for the development of novel antibiotics or the protection of crops against pathogens. A study, led by the group of SIB’s Christian Ahrens at Agroscope, shows that unravelling those genomes in their entirety can be more difficult than commonly believed, due to the presence of highly complex regions in the shape of very long DNA repeats. The study, published in Nucleic Acids Research, not only provides cues as to how to resolve these regions while speeding up the entire process, but also reveals that very long repeat regions – which are present in about 3% of the 9,600 prokaryotic genomes analyzed – may actually contain functionally important features.

The bacterial strain Pseudomonas koreensis P19E3, whose genome was resolved as part of this study (credit: Mitja N.P. Remus-Emsermann)

What is de novo genome assembly?
De novo genome assembly aims at solving the puzzle of joining the multitude of sequence fragments of varying length to produce a full-length genome from scratch, i.e. without the use of a previously established reference genome.

This method is often described as less biased than a ‘mapping’ approach, in which a reference genome is used, and it enables the discovery of novel or strain-specific gene sequences.


‘Dark matter’ genomes: also in prokaryotes
Because of their smaller size, bacterial genomes are generally considered to be less complex than those of animals and plants, for example, and therefore straightforward to sequence and assemble. However, a sizeable fraction of the bacterial de novo genome assemblies generated so far (see box) are more intricate than previously thought.
“We found that ‘dark matter’ genomes, those genomes that are particularly difficult to unravel due to the presence of very long DNA repeats, represented about 3% of the 9,600 genomes we analyzed. An additional 7% harboured hundreds of repeats up to 5 kilobase pairs [kb]; these, however, can be solved when relying on sequencing technology that generates long reads of 10-15 kb, such as Pacific Biosciences,” says Ahrens.

Cutting-edge technology and bioinformatics to the rescue
“To resolve and assemble the most complex regions of the genome of Pseudomonas koreensis strain P19E3 de novo, we had to rely on very long sequencing reads of over 70 kb and beyond, such as those generated with Oxford Nanopore Technologies,” explains Ahrens. “Indeed, such ultra-long reads have the ability to resolve repeat sequence regions of the genome that cannot be bridged by shorter reads,” he continues. These long reads then had to be combined using bioinformatics algorithms that were specifically designed to resolve repeat regions. This allowed the team to avoid time-consuming and labour-intensive experiments as well as manual curation steps otherwise needed to generate complete genome sequences.
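
To illustrate why read length matters here, the short Python sketch below checks whether a read "bridges" a repeat, i.e. covers the whole repeat plus some unique sequence on both sides. It is only a toy model and not the method used in the study: the names (Interval, read_bridges_repeat, unbridged_repeats), the 500 bp flank requirement and all coordinates are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end) in genome coordinates (base pairs)."""
    start: int
    end: int


def read_bridges_repeat(read: Interval, repeat: Interval, min_flank: int = 500) -> bool:
    """True if the read covers the entire repeat plus at least `min_flank`
    bases of flanking sequence on each side (min_flank is an assumed value)."""
    return (read.start <= repeat.start - min_flank
            and read.end >= repeat.end + min_flank)


def unbridged_repeats(reads, repeats, min_flank: int = 500):
    """Return the repeats that are not bridged by any single read."""
    return [
        rep for rep in repeats
        if not any(read_bridges_repeat(read, rep, min_flank) for read in reads)
    ]


# Hypothetical example: one 5 kb repeat and one 70 kb repeat.
repeats = [Interval(100_000, 105_000), Interval(400_000, 470_000)]

# Two 15 kb reads: enough to span the 5 kb repeat, not the 70 kb one.
reads_15kb = [Interval(95_000, 110_000), Interval(395_000, 410_000)]

# Adding a single 90 kb read that spans the long repeat and its flanks.
reads_ultra_long = reads_15kb + [Interval(390_000, 480_000)]

unresolved_short = unbridged_repeats(reads_15kb, repeats)
unresolved_long = unbridged_repeats(reads_ultra_long, repeats)

print("Unbridged with 15 kb reads:", unresolved_short)
print("Unbridged with an added 90 kb read:", unresolved_long)
```

In this toy setting, the 70 kb repeat stays unbridged until a read longer than the repeat itself is available. Real assemblers must additionally handle sequencing errors, coverage and near-identical repeat copies.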

Promising biological functions hidden in the dark
“Once resolved, we showed that these repeat sequences harbour genes that may confer a fitness advantage to Pseudomonas koreensis,” says Remus-Emsermann, co-author on the study from the University of Canterbury (New Zealand). For example, some of the repeat regions found in the particular strain of P. koreensis assembled by the team – a bacterium isolated from the leaves of marjoram plants – carried genes encoding proteins involved in the degradation of aromatic compounds, such as those exuded by aromatic plants. Others encoded enzymes called DNA helicases, which may protect the bacterium from the DNA-damaging UV light to which marjoram leaves are exposed. Together, these data suggest that at least some of the long repeat sequences in prokaryotic genomes are functionally relevant.

A resource for the community
“With the advent of long-read technologies such as Pacific Biosciences and Oxford Nanopore Technologies, we expect the number of complete prokaryotic genomes to rise substantially. This provides an optimal basis for detailed functional genomics and systems biology studies,” comments SIB’s Michael Schmid at Agroscope, first author of the study.
“The analysis of the overall number and maximum length of repeats in over 9,600 prokaryotes has been made available to the scientific community in our study. This provides researchers around the globe with crucial information to choose the most suitable sequencing and assembly strategy for their bacterial organisms of interest,” concludes Ahrens.



Schmid M et al. Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats. Nucleic Acids Research, 2018, 46:8953-8965. doi: 10.1093/nar/gky726

Read more

This paper was recommended on the F1000 channel in a feature by Steven Salzberg.