Integrating comprehensive species data into biodiversity assessments

Important biological information, including so-called ‘cryptic’ diversity and within-species variation that can affect resilience, will be unlocked and integrated into a modelling framework for assessing species richness and distributions. Led by SIB, the SNSF-funded collaboration is developing and evaluating modelling approaches that combine traditional species occurrence data with three additional data types: genomics, traits mined from the literature, and taxonomic changes over time. The aim is to provide a broader evidence base for biodiversity research – and enable more tailored guidance for environmental protection and restoration.

Improving biodiversity models with previously siloed data

Biodiversity assessments inform conservation policies and actions by providing vital information on where species live, how ranges are changing, and which populations are most at risk. Such assessments use models to extrapolate data from field studies to larger habitats and ecosystems. These models typically combine species occurrence with environmental factors – which means they miss as-yet undescribed species as well as key biological characteristics that can shape long-term species resilience, such as genetic diversity and differences between populations. They are also affected by uncertainties arising, for example, from changes to species definitions as new knowledge is generated, and from ‘cryptic’ species that are difficult to distinguish morphologically.

Genomics data, published species knowledge and historical taxonomic records can fill the gaps, but are currently fragmented, not interoperable, or difficult to access. The new project will overcome this challenge. Two SIB Groups and partner Plazi will unlock data from these sources for selected species in Switzerland and Europe – then integrate these data with occurrence and environmental data in a shared modelling framework. The outputs are intended to help policymakers and conservation practitioners target protection efforts more precisely, and to provide researchers with a richer evidence base for further studies.

The work draws on SIB expertise in biodiversity data, AI-driven text mining and statistical modelling, and advances SIB's strategic goals of developing tools to address environmental challenges and supporting national efforts for environmental protection.

“By integrating data and expertise that has traditionally been siloed, our new models will give a detailed picture of how species live, vary and respond to change – and so where to focus conservation efforts.”

Robert Waterhouse, Director, SIB Environmental Bioinformatics group

Analysing genomes, mining the literature, and mapping taxonomic changes

The species groups chosen – birds, bats and fish in Switzerland and butterflies, bumblebees and freshwater amphipods in Europe – represent a spectrum of available knowledge, from vast, long-term records to sparser studies. To generate the new modelling inputs, the project will:

Characterize species’ genetic diversity, population structure and resilience indicators using genomics data from both established DNA sequence repositories and general-purpose repositories. Mobilizing the latter – which can be difficult to find and reuse – will first require identifying and cataloguing relevant datasets. These datasets will also be prepared for deposition in established sequence repositories where possible, to further improve their discoverability and use for modelling and research.
Liberate species trait data, such as life history, habitat preferences, and interactions with other species, from the literature. The information will be extracted using AI-assisted text mining and natural language processing, from existing machine-readable scientific articles and taxonomic records (Biodiversity PMC, developed by an SIB Group; TreatmentBank, developed by Plazi). Field guides and monographs with appropriate access rights will also be digitized, processed into machine-readable formats, and analysed.
Quantify taxonomic uncertainty. When species are split, merged or reclassified, data from different decades may refer to scientific names and concepts that have since changed – or even to more than one species. Such changes will be mapped across the taxonomic record using Plazi's SynoSpecies tool. Where possible, the remaining ambiguity will be represented through taxonomic uncertainty indices.
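
To make the idea of a taxonomic uncertainty index more concrete, here is a minimal sketch in Python. It is not the project's method or the SynoSpecies output format: the mapping `CONCEPT_MAP`, the placeholder species names, and the formula (one minus the reciprocal of the number of candidate current species) are all assumptions chosen purely for illustration.

```python
# Hypothetical mapping from a name as written in an old record to the
# currently accepted species it could refer to. All names are placeholders.
CONCEPT_MAP = {
    "Genus alpha": ["Genus alpha"],               # name unchanged
    "Genus beta": ["Genus beta", "Genus gamma"],  # species later split in two
    "Genus delta": ["Genus alpha"],               # synonym merged into another species
}


def uncertainty_index(recorded_name, concept_map):
    """Return a value in [0, 1] for how ambiguous a recorded name is.

    Assumed formula: 1 - 1/n, where n is the number of distinct current
    species the name may refer to. Names missing from the map score 1.0.
    """
    candidates = set(concept_map.get(recorded_name, []))
    if not candidates:
        return 1.0
    return 1.0 - 1.0 / len(candidates)


if __name__ == "__main__":
    for name in ["Genus alpha", "Genus beta", "Genus delta", "Genus omega"]:
        print(name, uncertainty_index(name, CONCEPT_MAP))
```

In this sketch a record whose name maps to a single current species scores 0, while a record whose name was later split between two species scores 0.5. A real index would probably also weigh evidence such as the record's date and location.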

Models incorporating the new data will be compared with baseline models that use species occurrence and environmental data only, with outputs reviewed by domain experts. This comparison will assess the extent to which each data type improves model performance.
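
As a rough illustration of such a comparison, the Python sketch below scores a baseline model and an enriched model on the same held-out presence/absence records. The article does not say which performance measure will be used. The metric here (area under the ROC curve, AUC), the function name `auc` and the toy numbers are all assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    labels: 1 for an observed presence, 0 for an absence.
    scores: a model's predicted probability of presence for each record.
    """
    presences = [s for y, s in zip(labels, scores) if y == 1]
    absences = [s for y, s in zip(labels, scores) if y == 0]
    if not presences or not absences:
        raise ValueError("need at least one presence and one absence")
    # Count presence/absence pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > a else 0.5 if p == a else 0.0
        for p in presences
        for a in absences
    )
    return wins / (len(presences) * len(absences))


# Toy held-out data, invented for illustration only.
labels = [1, 1, 0, 0, 1, 0]
baseline_scores = [0.6, 0.4, 0.5, 0.3, 0.7, 0.2]  # occurrence + environment only
enriched_scores = [0.8, 0.6, 0.5, 0.3, 0.9, 0.2]  # plus one additional data type

if __name__ == "__main__":
    gain = auc(labels, enriched_scores) - auc(labels, baseline_scores)
    print(f"AUC gain from the additional data type: {gain:.3f}")
```

Repeating the comparison with one data type added at a time would indicate how much each contributes, although expert review of the outputs, as described above, remains a separate check.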

Enabling conservation action and further research

The model outputs are expected to identify biodiversity hotspots, vulnerable populations, and projected responses to environmental change across the selected species groups. Datasets, trait annotations, workflows and code will be made openly available while respecting licensing, data rights and sensitivity constraints – supporting further biodiversity research and the extension of the modelling framework to new species.