1.
Heller, D., Krestel, R., Ohler, U., Vingron, M., Marsico, A.: ssHMM: Extracting Intuitive Sequence-Structure Motifs from High-Throughput RNA-Binding Protein Data. Nucleic Acid Research. 45, 11004–11018 (2017).
RNA-binding proteins (RBPs) play an important role in RNA post-transcriptional regulation and recognize target RNAs via sequence-structure motifs. The extent to which RNA structure influences protein binding in the presence or absence of a sequence motif is still poorly understood. Existing RNA motif finders either take the structure of the RNA only partially into account, or employ models which are not directly interpretable as sequence-structure motifs. We developed ssHMM, an RNA motif finder based on a hidden Markov model (HMM) and Gibbs sampling which fully captures the relationship between RNA sequence and secondary structure preference of a given RBP. Compared to previous methods which output separate logos for sequence and structure, it directly produces a combined sequence-structure motif when trained on a large set of sequences. ssHMM’s model is visualized intuitively as a graph and facilitates biological interpretation. ssHMM can be used to find novel bona fide sequence-structure motifs of uncharacterized RBPs, such as the one presented here for the YY1 protein. ssHMM reaches a high motif recovery rate on synthetic data, it recovers known RBP motifs from CLIP-Seq data, and scales linearly on the input size, being considerably faster than MEMERIS and RNAcontext on large datasets while being on par with GraphProt. It is freely available on Github and as a Docker image.
2.
Park, J., Blume-Kohout, M., Krestel, R., Nalisnick, E., Smyth, P.: Analyzing NIH Funding Patterns over Time with Statistical Text Analysis. Scholarly Big Data: AI Perspectives, Challenges, and Ideas (SBD 2016) Workshop at AAAI 2016. AAAI (2016).
In the past few years various government funding organizations such as the U.S. National Institutes of Health and the U.S. National Science Foundation have provided access to large publicly-available on-line databases documenting the grants that they have funded over the past few decades. These databases provide an excellent opportunity for the application of statistical text analysis techniques to infer useful quantitative information about how funding patterns have changed over time. In this paper we analyze data from the National Cancer Institute (part of National Institutes of Health) and show how text classification techniques provide a useful starting point for analyzing how funding for cancer research has evolved over the past 20 years in the United States.
3.
Grundke, M., Jasper, J., Perchyk, M., Sachse, J.P., Krestel, R., Neves, M.: TextAI: Enhancing TextAE with Intelligent Annotation Support. Proceedings of the 7th International Symposium on Semantic Mining in Biomedicine (SMBM 2016). pp. 80–84. CEUR-WS.org (2016).
We present TextAI, an extension to the annotation tool TextAE, that adds support for named-entity recognition and automated relation extraction based on machine learning techniques. Our learning approach is domain-independent and increases the quality of the detected relations with each added training document. We further aim at accelerating and facilitating the manual curation process for natural language documents by supporting simultaneous annotation by multiple users.