Combining Machine Learning and External Knowledge for Analyzing Gene Expression Profiles
Gene expression is the cellular process by which information from specific sections of the DNA, i.e. genes, is used to synthesize functional products such as proteins, which catalyze the metabolic processes in our cells. Analyzing gene expression profiles is of particular interest to researchers, as the profiles provide insights into cell processes and gene functions and can thus improve disease diagnosis and treatment.
Nowadays, gene expression profiles covering several thousand genes across several hundred tissue samples can be generated. Analyzing data sets of this size in a meaningful way requires computational tools that apply Machine Learning techniques. At the same time, many publicly available databases contain curated biomedical information, e.g. on protein-disease interactions.
Contact: Cindy Perscheid
Topic Area: Association Rule Mining on Gene Expression Data (Contact: Cindy Perscheid)
Association Rule Mining, or Itemset Mining, is applied to gene expression data to identify correlations between the expression levels of different genes. A derived rule has the form GeneA (up) -> GeneB (up), meaning that if GeneA is upregulated, then GeneB is typically upregulated as well. This information helps researchers to derive unknown gene functions and to better understand regulatory processes in cells for different disease types. The large number of rules resulting from these analyses is typically filtered with standard interestingness measures, e.g. support and confidence. These measures are driven purely by statistical analyses of the data sets. However, the interestingness of a gene or its resulting rule should also take into account its biological relevance, which can only be derived from external sources. Possible topics for a Master's thesis are:
- Application of association rule mining to gene expression data, considering computational feasibility, e.g. high data dimensionality with comparatively low numbers of transactions
- Definition of a subjective interestingness measure for association rules with special focus on their biological relevance, e.g. by incorporating external knowledge
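To make the two standard measures concrete, the following minimal Python sketch computes support and confidence for rules of the form GeneA (up) -> GeneB (up). Everything in it is an illustrative assumption: the toy samples, the item names such as `GeneA_up`, the helper functions, and the thresholds are hypothetical, and the brute-force enumeration of gene pairs would not scale to real data sets with thousands of genes.

```python
from itertools import permutations

# Hypothetical toy data: each tissue sample (transaction) is the set of
# discretized expression items observed in that sample.
samples = [
    {"GeneA_up", "GeneB_up", "GeneC_down"},
    {"GeneA_up", "GeneB_up"},
    {"GeneA_up", "GeneB_up"},
    {"GeneB_up", "GeneC_down"},
    {"GeneC_down"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    return count(set(antecedent) | set(consequent), transactions) / base


def mine_pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Enumerate all single-item rules a -> b and keep those passing both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s = support({a, b}, transactions)
        c = confidence({a}, {b}, transactions)
        if s >= min_support and c >= min_confidence:
            rules.append((a, b, s, c))
    return rules


rules = mine_pair_rules(samples)
for a, b, s, c in rules:
    print(f"{a} -> {b}  support={s:.2f}  confidence={c:.2f}")
```

Both measures depend only on co-occurrence counts in the data set. A rule linking two genes with a known shared pathway and a rule linking two unrelated genes receive the same score if their counts match, which is the gap a biologically informed measure would have to close.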
Topic Area: Biclustering on Gene Expression Data (Contact: Cindy Perscheid)
Currently, clustering and classification are applied to gene expression data to identify specific expression profiles, e.g. for a particular cancer type. Traditional clustering assigns each gene to a single cluster. A gene, however, participates on average in 10 processes of a cell. Traditional clustering therefore cannot appropriately reflect the correlations between genes, as it shows only one specific view on the data. Biclustering makes it possible to identify overlapping clusters and subspaces in gene expression data, which reflects the underlying cell processes much better. The large number of resulting biclusters must be filtered with interestingness measures. These measures are driven purely by statistical analyses of the data sets. However, the interestingness of a bicluster should also take into account its biological relevance, which can only be derived from external sources. Possible topics for a Master's thesis are:
- Visualization of biclustering results
- Definition of a subjective ranking measure for biclusters with special focus on their biological relevance, e.g. corresponding to known cell processes
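As an illustration of a purely statistical ranking, the sketch below scores candidate biclusters with the mean squared residue, a coherence score introduced by Cheng and Church in which lower values indicate a more consistent expression pattern. It is one possible choice, not necessarily the measure used in this research area's projects. The expression values, gene names, and candidate biclusters are hypothetical and hand-picked. No biclustering algorithm is run here.

```python
# Hypothetical toy expression matrix: keys are genes, list positions are samples.
expression = {
    "GeneA": [1.0, 2.0, 3.0, 9.0],
    "GeneB": [2.0, 3.0, 4.0, 1.0],
    "GeneC": [5.0, 6.0, 7.0, 7.0],
    "GeneD": [8.0, 1.0, 6.0, 8.0],
}

# Hand-picked candidate biclusters as (genes, sample indices).
# GeneC appears in both B1 and B2, i.e. the biclusters overlap.
candidates = {
    "B1": (["GeneA", "GeneB", "GeneC"], [0, 1, 2]),
    "B2": (["GeneC", "GeneD"], [2, 3]),
    "B3": (["GeneA", "GeneD"], [0, 3]),
}


def mean_squared_residue(genes, sample_idx, data):
    """Mean squared residue of the submatrix given by genes x samples."""
    sub = [[data[g][s] for s in sample_idx] for g in genes]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


scores = {
    name: mean_squared_residue(genes, idx, expression)
    for name, (genes, idx) in candidates.items()
}

# Rank from most coherent (lowest residue) to least coherent.
ranking = sorted(scores, key=scores.get)
for name in ranking:
    print(f"{name}: mean squared residue = {scores[name]:.3f}")
```

The score is computed from the expression values alone. Whether the genes in a highly ranked bicluster correspond to a known cell process is not visible to it, which is where a ranking measure based on external knowledge would come in.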