A customized generative AI tool integrated into Expasy, the Swiss bioinformatics resource portal, allows researchers to retrieve and compile information from SIB databases more quickly and easily than ever. ExpasyGPT provides fast, accurate responses, enables new insights through complex database querying, and makes biological and bioinformatics discovery more broadly accessible. Its powerful capabilities come from SIB’s expertise in Large Language Models (LLMs) and knowledge representation.

Bioinformatics resources developed by SIB Groups are used by researchers and clinicians around the world to study life and tackle global challenges, from diagnosing diseases and developing effective drugs to breeding new crops and protecting biodiversity. These resources are also a valuable source of openly available and reliable biological knowledge for teachers, students and the broader public.

See how SIB fosters world-class biodata resources

Enabling complex queries across Swiss life science databases

Expasy is a powerful discovery portal for the 160+ high-quality, open databases and software tools developed by SIB Groups. Covering a large variety of biomolecules and biological processes, these comprehensive resources provide up-to-date life science knowledge and enable researchers to share, analyse and interpret biodata.

Expasy’s keyword search already allows users to conveniently retrieve and view relevant information across all SIB resources – such as available data for a particular organism or gene, or tools for a specific use case – without needing any knowledge of the resource ecosystem. 

ExpasyGPT, released as a beta version, is the next step in accelerating biological and bioinformatics discovery on Expasy. Through natural language questions, users can now:

  • extract insights from SIB databases that are not possible via keyword searches;
  • generate and run complex cross-database queries within seconds;
  • compile large amounts of data from multiple databases without any manual processes;
  • obtain more accurate information on SIB resources than via general LLM tools like ChatGPT.

These functionalities – combined with a user-friendly chat interface – save users significant time. They also make data and resource discovery even easier for users with limited experience in bioinformatics or a specific biological field.

ExpasyGPT was built by diverse teams across the SIB network with expertise in knowledge representation and generative AI. The tool is the latest example of our work to push the boundaries of data science, accelerate life science innovation, and ensure that biological knowledge is widely accessible for the benefit of science and society.

Try ExpasyGPT

The set of six SIB databases is:

All six are SIB Resources, and four are recognized as being of fundamental importance to the international life-science community: three as an ELIXIR Core Data Resource and four as a Global Core Biodata Resource.

Interconnected databases through knowledge representation

SIB scientists had already implemented standard vocabularies, or ‘ontologies’, to describe SIB resources through keywords. Our Semantic Web focus group extended this by building ontologies to describe relationships between different types of biological data. The focus group then structured this information, called metadata, and a set of SIB databases (see box) as knowledge graphs. 

These graphs form an integrated, machine-readable network of linked data that seamlessly interconnects information stored in different databases – such as the relationship between a specific protein, cell type, and disease. This knowledge representation allows more complex information to be retrieved from and across databases than is possible by keyword searches of text, and forms one pillar of ExpasyGPT.
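To make the idea of linked data concrete, here is a minimal sketch in Python, not the SIB implementation. It stores facts as (subject, predicate, object) triples and answers a question by following two links in a row. Every identifier and predicate name in it is invented for illustration; real knowledge graphs use ontology terms and a dedicated triple store.

```python
# Toy knowledge graph: each fact is a (subject, predicate, object) triple.
# All identifiers and predicate names are made up for illustration.
TRIPLES = [
    ("protein:P1", "expressed_in", "cell:hepatocyte"),
    ("protein:P1", "associated_with", "disease:D1"),
    ("protein:P2", "expressed_in", "cell:neuron"),
    ("protein:P2", "associated_with", "disease:D2"),
    ("protein:P3", "expressed_in", "cell:hepatocyte"),
]


def objects(subject, predicate):
    """Return all objects linked to `subject` by `predicate`."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """Return all subjects linked to `obj` by `predicate`."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def diseases_for_cell_type(cell):
    """Follow two links: cell type -> proteins expressed there -> diseases."""
    found = set()
    for protein in subjects("expressed_in", cell):
        found |= objects(protein, "associated_with")
    return found
```

A keyword search for "hepatocyte" would only find records that mention the word. Here, `diseases_for_cell_type("cell:hepatocyte")` reaches the disease through the protein that connects the two, even though no single record names both.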

See more on knowledge representation

Read article on the SIB Semantic Web of data

More than a chatbot through customized generative AI

The tool is also founded on Large Language Models, a recent type of generative AI that can quickly answer natural language questions. The Knowledge Representation Unit of SIB’s Vital-IT Computational Biology group, with the support of our Biodata Resources team, connected an LLM to the knowledge graphs described above plus 1,000 sample database queries. This gives the tool a high level of specialized knowledge, meaning it provides more accurate responses than ChatGPT and other general-purpose LLMs. In addition, ExpasyGPT’s integration with the knowledge graphs means it does not need constant retraining to stay up to date with the large amounts of new data and information that are continually being added to the SIB databases.
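One common way to connect an LLM to a library of sample queries is to look up the examples closest to the user's question and place them in the prompt. The sketch below illustrates that general pattern only; it is not a description of ExpasyGPT's internals. The sample questions, the placeholder query text and the word-overlap similarity are all assumptions made for the example, and a production system would more likely compare text embeddings.

```python
import re

# Hypothetical sample questions paired with placeholder query text.
EXAMPLES = [
    ("Which proteins are associated with a given disease?",
     "SELECT ?protein WHERE { ... }"),
    ("In which tissues is a given gene expressed?",
     "SELECT ?tissue WHERE { ... }"),
    ("Which genes are orthologs of a given gene?",
     "SELECT ?ortholog WHERE { ... }"),
]


def tokens(text):
    """Lower-case word set used for a crude similarity measure."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_examples(question, k=1):
    """Return the k sample (question, query) pairs most similar to `question`."""
    q = tokens(question)
    ranked = sorted(EXAMPLES, key=lambda ex: jaccard(q, tokens(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(question, k=1):
    """Assemble a prompt that shows the model similar solved examples first."""
    parts = ["Write a SPARQL query for the final question."]
    for ex_question, ex_query in closest_examples(question, k):
        parts.append(f"Question: {ex_question}\nQuery: {ex_query}")
    parts.append(f"Question: {question}\nQuery:")
    return "\n\n".join(parts)
```

Because the examples are looked up at question time rather than learned during training, new sample queries can be added to the library without retraining the model.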

ExpasyGPT’s responses take two forms: direct answers to general questions about SIB resources, and the code for so-called SPARQL queries to retrieve and compile complex information from multiple databases. The latter is an especially powerful function. SPARQL queries extract deeper insights than searching databases for keywords and avoid manual compilation of data from different sources, but can be challenging and time-consuming to write, even for experts.
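For readers who have not seen one, the sketch below shows roughly what a cross-database SPARQL query looks like and how it can be sent to a server. It is an illustration under stated assumptions, not output from ExpasyGPT: the two endpoint URLs and the `ex:` vocabulary are placeholders, and the helper function names are hypothetical. The `SERVICE` keyword, which asks one endpoint to fetch part of the answer from another, and the `query` URL parameter are standard parts of SPARQL 1.1.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoints and vocabulary; substitute real ones before use.
PRIMARY_ENDPOINT = "https://example.org/proteins/sparql"
REMOTE_ENDPOINT = "https://example.org/expression/sparql"


def federated_query(disease_label, limit=10):
    """Build a SPARQL query joining data from two endpoints via SERVICE."""
    # Escape backslashes and double quotes so the label is a valid string literal.
    safe = disease_label.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX ex: <https://example.org/vocab/>
SELECT ?protein ?tissue WHERE {{
  ?protein ex:associatedWith ?disease .
  ?disease ex:label "{safe}" .
  SERVICE <{REMOTE_ENDPOINT}> {{
    ?protein ex:expressedIn ?tissue .
  }}
}}
LIMIT {int(limit)}"""


def request_url(endpoint, query):
    """Encode the query as an HTTP GET URL, as the SPARQL 1.1 Protocol allows."""
    return f"{endpoint}?{urlencode({'query': query})}"


def decode_query(url):
    """Recover the SPARQL text from a request URL, e.g. to inspect or edit it."""
    return parse_qs(urlsplit(url).query)["query"][0]
```

Even this small query needs a prefix declaration, two graph patterns, a nested service block and correct string escaping, which gives a sense of why writing such queries by hand takes time.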

Read article on LLM-based SPARQL query generation

Read article on the sample database queries

Explore generative AI projects at SIB

Transparent responses and a user-friendly interface

SIB’s Information Technology team took a user-centred design approach to develop ExpasyGPT’s third pillar, an intuitive chat interface integrated into Expasy. Users receive an explanation of how SPARQL queries were generated and can easily view the underlying sources and reasoning behind each response. The tool will also ask for further information if needed. Furthermore, users can automatically run generated SPARQL queries directly from the chat, as well as edit these further if desired. 

Read article on SPARQL query editor interfaces

See more on ExpasyGPT’s development

Building on the beta version

This first release of ExpasyGPT performs best for the six databases listed above. Further SIB databases that can be queried using SPARQL, such as the Swiss Pathogen Surveillance Platform, will also be integrated soon. Longer-term work includes integrating all databases and software tools catalogued on Expasy, improving results for general biological and bioinformatics queries, and optimizing the user experience. Furthermore, the system is fully open source and can be reused with any knowledge graph of interest.