A fast, accurate ‘sequence search engine’

The MetaGraph tool can search through millions of published DNA, RNA and protein records in a matter of seconds. Developed by SIB scientists at ETH Zurich, the tool overcomes current limitations in analysing vast volumes of biological sequencing data, an advance expected to significantly accelerate life-science research and biomedical innovation. The work, an important milestone in computational genomics, was published in Nature.

Full-text search instead of downloading entire data sets

Over 100 million gigabytes (100 petabytes) of DNA, RNA and protein sequences are stored in public databases around the world – about as much as all the text on the internet. This vast collection of data is a treasure trove for research into disease treatments, ecology, new biotechnologies, and more. However, accessing and analysing data at this scale poses a major challenge. Current methods are often slow, require massive computing power and other resources, and lack scalability for high-throughput searches.

MetaGraph overcomes these limitations. Developed by the SIB Biomedical Informatics group at ETH Zurich, the tool works in the same way as a regular internet search engine: researchers enter the text of a sequence and, within seconds or minutes, get a list of any matching sequences in public sequence databases.

MetaGraph is a kind of Google for biological sequences. What was deemed to be very challenging a few years ago can now easily be done on a modern laptop.

Gunnar Rätsch

Group Leader, SIB Biomedical Informatics group, ETH Zurich

A catalyst for biomedical advances

The Nature article published this month shows that MetaGraph is not only fast, but also accurate and efficient. To demonstrate its practical feasibility, the authors used the tool to index half of all sequence data sets available worldwide, across the tree of life – comprising 18 million unique genome and transcriptome samples and 210 billion amino acid residues from the UniProt archive (UniParc). According to Gunnar Rätsch, the remaining half should follow by the end of the year.

The article also provides practical use cases to illustrate how such petabase-scale search can catalyse biomedical advances, such as tackling antimicrobial resistance. Given that MetaGraph is available as open source, it could also be of interest to pharmaceutical companies that have large amounts of internal research data.

A novel solution for peta-scale sequence analyses

MetaGraph works by indexing the data and representing them in compressed form. This is achieved using complex mathematical graphs that give the data a better structure, much as spreadsheet programs such as Excel organise data into rows and columns.

While the use of indexes to render large amounts of data searchable is standard practice in computer science research, the researchers added two new aspects: complex linking of raw data and metadata, and data compression by a factor of around 300. Similar to a book summary, the compressed data no longer contain every word, but all the main storylines and connections remain intact.
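To make the idea of an index that links raw sequences to metadata more concrete, here is a minimal Python sketch. It assumes a common bioinformatics approach in which each sequence is broken into overlapping fixed-length substrings (k-mers) and each k-mer is mapped to the labels of the samples that contain it. This is a toy illustration only, not MetaGraph's actual implementation or API: the function names, sample labels and sequences are hypothetical, and no compression is applied.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer to the set of sample labels that contain it.

    Consecutive k-mers overlap by k-1 characters, so the keys can be
    read as nodes of a graph; the label sets stand in for metadata.
    """
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


def search(index, query, k, min_fraction=1.0):
    """Return {label: fraction of query k-mers found} for matching samples."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:  # query shorter than k: nothing to look up
        return {}
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    total = len(query_kmers)
    return {
        label: count / total
        for label, count in hits.items()
        if count / total >= min_fraction
    }


# Hypothetical toy data: two short "samples" and their labels.
samples = {"sample_A": "ACGTACGTGA", "sample_B": "TTGACGTGAC"}
index = build_index(samples, k=4)

print(sorted(search(index, "ACGTGA", k=4)))  # ['sample_A', 'sample_B']
print(search(index, "GTACG", k=4))           # {'sample_A': 1.0}
```

A query is answered by looking up its k-mers in the index rather than scanning the raw data, which is what makes a search-engine style of lookup possible. A production system would additionally compress both the k-mer set and the label sets.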

Thanks to these innovations, MetaGraph is comparatively cost-effective: the representation of all public biological sequences would fit on a few computer hard drives, and large queries could be as cheap as 74 cents per megabase. The methodology is also scalable, a key advantage over other DNA search tools currently being researched. Notably, MetaGraph can easily adapt to current rapid advances in representing biological sequences, ensuring its long-term utility.
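As a rough illustration of what the per-megabase figure quoted above implies, the following Python sketch estimates the cost of a query from its length. It assumes the 74-cent rate applies uniformly, which the article gives only as a best case for large queries; the function name and the example query size are hypothetical.

```python
# Best-case rate quoted in the article; real costs may be higher.
RATE_CENTS_PER_MEGABASE = 74


def estimated_query_cost(num_bases, rate=RATE_CENTS_PER_MEGABASE):
    """Return the estimated cost in cents of querying num_bases bases."""
    megabases = num_bases / 1_000_000
    return megabases * rate


# Hypothetical example: a 5-megabase query.
print(estimated_query_cost(5_000_000))  # 370.0
```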