Hasso-Plattner-Institut25 Jahre HPI

In-Memory Database Support for Source Code Querying and Analytics

Software engineers have to deal with a large amount of information about source code, and they need appropriate tools to work with it efficiently. A number of tools have been created to increase developers' productivity. However, these tools are limited in their ability to process and present such a large amount of information. With the advent of in-memory column-oriented databases, the performance of data-intensive applications can be significantly improved, resulting in a completely new user experience with those applications and enabling new use cases.

This thesis investigates the applicability of in-memory column-oriented databases for supporting daily software engineering activities. The major research questions addressed in this thesis are as follows: Does in-memory column-oriented database technology provide the means and the necessary performance advantages for working interactively with large amounts of fine-grained structural information about source code? And, if so, what advantages will applying this technology provide for software engineers?

To investigate these research questions, two scenarios have been selected that particularly suffer from low performance. The first scenario is source code querying. Existing source code information systems contain a large amount of structural data, such as abstract syntax trees, call graphs, and data flow graphs. Existing tools have addressed the resulting performance problems by reducing the amount of data through a coarse-grained representation, by preparing answers to developers' questions in advance, or by reducing the scope of search. Each of these alternatives reduces developers' productivity. The second scenario is source code analytics. To perform reverse engineering tasks, software engineers often have to analyze a number of atomic facts that have been extracted from source code.

Examples of such atomic facts are occurrences of certain patterns in code, software product metrics, and dependencies between software components. Each fact typically has several characteristics, such as the type of the fact, the location in the code where it was found, and other attributes. The analysis of large software systems in particular requires the ability to extract and process a large number of such facts efficiently.
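To make the notion of atomic facts more concrete, the following is a minimal Python sketch of how such facts could be laid out column by column, with one list per attribute. It is not the storage design used in the thesis. The class `FactColumns`, its fields, and the sample fact types and file names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FactColumns:
    """Hypothetical column-wise store: one list per fact attribute."""
    fact_type: list = field(default_factory=list)
    file: list = field(default_factory=list)
    line: list = field(default_factory=list)

    def add(self, fact_type, file, line):
        # A fact is spread over the columns; its row id is its list index.
        self.fact_type.append(fact_type)
        self.file.append(file)
        self.line.append(line)

    def count_by_type(self):
        # An aggregation over one attribute reads only that column.
        return Counter(self.fact_type)

    def locations_of(self, wanted_type):
        # Filter on the type column, then look up the location columns
        # only for the matching row ids.
        return [
            (self.file[i], self.line[i])
            for i, t in enumerate(self.fact_type)
            if t == wanted_type
        ]


# Invented sample facts, for illustration only.
facts = FactColumns()
facts.add("empty_catch_block", "Billing.java", 120)
facts.add("cyclic_dependency", "Orders.java", 8)
facts.add("empty_catch_block", "Orders.java", 77)
```

In this layout, counting facts per type reads a single column, which is the access pattern that column-oriented storage is generally designed to favor.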

The industrial experiments conducted for this thesis showed that in-memory technology provides performance gains that improve developers' productivity and enable scenarios that were previously not possible.

This thesis is driven by today's problems in software engineering. It lies at the intersection of software engineering and database technology. From the viewpoint of software engineering, it seeks a way to support developers in dealing with a large amount of structural data. From the viewpoint of database technology, source code querying and analytics are domains for studying fundamental issues of storing and querying structural data.