What do we do?
Our lab builds the bridge between Big Data Analysis and Biomedical Research. We develop novel Data Mining Algorithms to detect patterns and statistical dependencies in large datasets from the fields of biology and medicine. Our major goals are twofold: 1) to enable the automatic generation of new knowledge from Big Data through Machine Learning, and 2) to gain an understanding of the relationship between Biological Systems and their molecular properties. Such an understanding is of fundamental importance for Personalized Medicine, which tailors medical treatment to the molecular properties of a person.
Recent research exploiting the tremendous progress in sequencing technologies has generated huge datasets of genetic information that enable large-scale analyses, such as genome-wide association studies (GWAS) to explore genotype–phenotype relationships. An effort in this direction, to which the MLCB lab contributed, is the sequencing of the genomes of 1,135 naturally inbred lines of the model plant Arabidopsis thaliana, and the subsequent establishment of a high-quality reference genome panel (The 1001 Genomes Consortium, 2016). Our lab was also at the forefront of establishing AraPheno, a public database that allows users to easily submit, download, and visualize phenotypic data for Arabidopsis thaliana (Seren et al., 2016). Our current work aims at bringing both genetic and phenotypic data together in one advanced online platform for performing genome-wide association studies: easyGWAS.
In biological and healthcare data, researchers face extremely high-dimensional representations of samples, from patients to bacteria. When linking these high-dimensional representations to phenotypes, multiple testing correction is of the utmost importance for practitioners. Due to the large number of dimensions, however, multiple testing correction is computationally challenging and prone to losing all detection power. We present a first approach for finding significant feature combinations that properly corrects for multiple testing and at the same time accounts for categorical covariates such as age or gender of individuals (Papaxanthos et al., 2016).
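To illustrate why naive correction loses detection power in this setting, the following minimal sketch computes the standard Bonferroni threshold when every non-empty combination of binary features is tested. It is not the method of Papaxanthos et al.; the function names and the choice of alpha = 0.05 are assumptions made purely for illustration.

```python
def num_feature_combinations(num_features: int) -> int:
    """Number of non-empty subsets of `num_features` binary features."""
    return 2 ** num_features - 1


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


if __name__ == "__main__":
    alpha = 0.05  # illustrative target family-wise error rate (assumption)
    for d in (10, 20, 50, 100):
        m = num_feature_combinations(d)
        threshold = bonferroni_threshold(alpha, m)
        print(f"{d:>3} features: {m:.3e} combinations, per-test threshold {threshold:.3e}")
```

Because the number of combinations grows exponentially with the number of features, the per-test threshold quickly becomes so small that hardly any association can reach it, which is the loss of detection power described above.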
Main publications 2016
- Papaxanthos L*, Llinares-López F*, Bodenham D, Borgwardt K. Finding significant combinations of features in the presence of categorical covariates. Accepted at NIPS 2016, in press. (*=equal contributions)
- Seren Ü*, Grimm D*, et al., Borgwardt K, Korte A. AraPheno: a public database for Arabidopsis thaliana phenotypes. Nucleic Acids Research, doi: 10.1093/nar/gkw986, 2016. (*=equal contributions)
- The 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 2016;166(2):481-491.