Salma Aouled El Haj
THESIS TITLE: Error Aware Event Mining in Biological Sequences
THESIS SUPERVISORS: Julie THOMPSON (Uni Strasbourg), Mourad Elloumi (Uni Tunis)
PROJECT: The goal of this project is to develop novel sequence analysis approaches that address the specific issues raised by new next-generation sequencing (NGS) data. We will focus on the multiple sequence alignment (MSA) of proteins and, more particularly, on the detection and alignment of the locally conserved features that modulate a protein’s function, for example interaction sites, cell compartment targeting signals, post-translational modifications or cleavage sites. These features, which are often specific to a subset of the sequences in a protein family, are generally ignored by most current methods (Thompson et al., 2011). To address these issues, we propose to develop two complementary machine learning approaches: error-aware data mining and rare event mining. On the one hand, probabilistic classifiers such as the naïve Bayesian classifier provide a convenient framework for taking the uncertainty of the input data into account (Wu & Zhu, 2008). On the other hand, rare event mining has gained interest in recent years (Koh et al., 2009). Mining rare events in the presence of noisy data remains an open problem. The data are large-scale, heterogeneous and very noisy, which makes their analysis more difficult. The project will therefore require the joint application of techniques from two domains of expertise: protein sequence analysis and MSA construction, benchmarking and exploitation in high-throughput biological projects (JD Thompson), and the development of novel data mining and knowledge discovery methods adapted to the specific needs of biological data (M Elloumi). We will develop a novel approach involving the detailed characterization of protein sequences at different scales, organized into the following tasks:
- Characterization of a set of sequences in terms of structured domains, transmembrane regions, coiled coils, disordered regions, etc. A large number of methods have been developed to predict these regions; we will review the state of the art and identify the most suitable methods for our purposes.
- Identification of known structural/functional domains, motifs, etc. A large number of existing biological databases are freely available on the Internet, and these regions are generally well aligned by existing MSA algorithms.
- Identification of badly predicted sequences. The quality of predicted protein sequences is a major problem, with more than 50% of proteins containing mispredicted segments. Unless these segments are detected, identifying conserved motifs is impossible.
- Identification of motifs conserved in the complete set of sequences or in a subset of the sequences. This is an essential task that has not been widely addressed and will require the development and evaluation of novel data mining approaches that are error-aware and are capable of discovering rare events.
- Establishment of a benchmark for validation of the methods developed. Existing databases and scientific literature will be mined for reliable examples of complex sequences that will be used to evaluate the accuracy of our results.
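As an illustration of the motif identification task above, the following minimal sketch counts fixed-length words (k-mers) that are conserved in only a subset of the sequences, while down-weighting sequences that are suspected to be badly predicted. It is not the method to be developed in the thesis: the function names, the per-sequence reliability scores in [0, 1], the word length `k` and the `min_fraction` threshold are all assumptions introduced for this example.

```python
from collections import defaultdict


def weighted_kmer_support(sequences, reliabilities, k=4):
    """Return {kmer: weighted support}; each sequence counts at most once per k-mer.

    `reliabilities` holds one hypothetical confidence score in [0, 1] per
    sequence (1.0 = trusted prediction, 0.0 = probably mispredicted).
    """
    support = defaultdict(float)
    for seq, weight in zip(sequences, reliabilities):
        # Use a set so that repeats inside one sequence are not counted twice.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in kmers:
            support[kmer] += weight
    return dict(support)


def conserved_kmers(sequences, reliabilities, k=4, min_fraction=0.3):
    """Keep k-mers whose reliability-weighted support reaches `min_fraction`.

    A low `min_fraction` keeps motifs found in a small subset of the family
    (rare events); the weights reduce the influence of noisy sequences.
    """
    total = sum(reliabilities)
    if total == 0:
        return {}
    support = weighted_kmer_support(sequences, reliabilities, k)
    return {
        kmer: value / total
        for kmer, value in support.items()
        if value / total >= min_fraction
    }


if __name__ == "__main__":
    # Toy sequences (invented for the example): three of five share "KRKL".
    toy = ["MKRKLAAA", "GGKRKLTT", "PPPPPPPP", "QQKRKLQQ", "WWWWWWWW"]
    print(conserved_kmers(toy, [1.0] * 5, k=4, min_fraction=0.5))
```

Exact word matching is used here only to keep the example short; a realistic approach would have to tolerate substitutions and insertions within the motif.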
Our long-term goal is to incorporate this information into a new MSA method that will divide a large set of complex, heterogeneous sequences into a number of smaller, more homogeneous subsets/segments. Each characterized subset could then be aligned efficiently by the most appropriate MSA algorithm, and a final solution for the complete set of sequences reconstructed from the individual parts.
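The division step described above could, for instance, be approximated by a greedy clustering on shared k-mer content. The sketch below is only one possible illustration: the Jaccard similarity measure, the word length `k`, the `threshold` value and all function names are assumptions, and the alignment of each subset and the reconstruction of the final solution are not shown.

```python
def kmer_set(seq, k=3):
    """Return the set of k-mers (words of length k) found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_partition(sequences, k=3, threshold=0.3):
    """Group sequences into subsets of similar k-mer content.

    Each sequence joins the first cluster whose representative (the first
    member of that cluster) is at least `threshold` similar; otherwise it
    starts a new cluster. Returns a list of lists of sequence indices.
    """
    clusters = []  # list of (representative k-mer set, member indices)
    for idx, seq in enumerate(sequences):
        profile = kmer_set(seq, k)
        for representative, members in clusters:
            if jaccard(profile, representative) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((profile, [idx]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Toy sequences (invented for the example): two clearly distinct groups.
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSGGSGGSG", "GGSGGSGGSA"]
    print(greedy_partition(toy))
```

Each resulting subset could then be passed to whichever MSA program suits it best. The greedy strategy depends on input order and is chosen here for brevity, not for accuracy.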