Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

BlackSwan (*)

Automated annotation of global statistics

Contact

Johannes Lorey

Organizational Info

source: gapminder.org

Statistical Data

There are numerous sources for statistical data sets: companies, government agencies, international organizations, etc. Statistical data usually

  • is of numerical type,
  • is collected at fixed intervals over a certain time period, and
  • reveals certain short-term and long-term trends.
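
As a minimal sketch of what such a data set might look like, the following Python snippet models a yearly series and derives a short-term signal (year-over-year change) and a long-term signal (least-squares slope). The names `Observation`, `yearly_changes`, and `linear_trend` and all sample values are illustrative assumptions, not part of the seminar material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One numerical measurement taken at a fixed interval (here: one per year)."""
    year: int
    value: float


def yearly_changes(series):
    """Short-term view: difference between consecutive observations."""
    return [
        (curr.year, curr.value - prev.value)
        for prev, curr in zip(series, series[1:])
    ]


def linear_trend(series):
    """Long-term view: slope of the least-squares line through the series."""
    n = len(series)
    mean_x = sum(o.year for o in series) / n
    mean_y = sum(o.value for o in series) / n
    cov = sum((o.year - mean_x) * (o.value - mean_y) for o in series)
    var = sum((o.year - mean_x) ** 2 for o in series)
    return cov / var


# Made-up values for illustration only (not real statistics).
series = [
    Observation(2000, 10.0),
    Observation(2001, 12.0),
    Observation(2002, 14.0),
    Observation(2003, 9.0),
    Observation(2004, 18.0),
]

print(yearly_changes(series))
print(linear_trend(series))
```

In this toy series the drop in 2003 stands out in the short-term view, while the overall slope stays positive.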
source: wikipedia.org

Event Data

Event data can be gathered from various (mostly unstructured) sources, such as Wikipedia, Freebase, and news archives. In the context of this seminar, an event is described by its

  • type,
  • location, and
  • (starting) point in time.
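
The three attributes listed above could be captured in a small record type. The following is only a sketch: the class name `Event` and its field names are assumptions chosen for illustration, not a data format prescribed by the seminar.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """An event characterized by type, location, and starting point in time."""
    event_type: str
    location: str
    start: date


# Example instance: the 2010 Haiti earthquake struck on 12 January 2010.
quake = Event(event_type="earthquake", location="Haiti", start=date(2010, 1, 12))

print(quake)
print(quake.start.year)
```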

Augmenting Statistical Data

Our goal is to detect certain trends in statistical data and automatically relate them to the historical events that triggered these trends, based on specific rules previously learned by our system.
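
One naive way to picture this step is to flag unusually large year-over-year changes and look up events at the same location in the same year. The sketch below does only that. The threshold, the function names `find_breaks` and `annotate`, and all sample data are assumptions for illustration, and the learned rules mentioned above are not modeled.

```python
from datetime import date

# Made-up yearly values per location, for illustration only.
statistics = {
    "Atlantis": {2005: 100.0, 2006: 102.0, 2007: 71.0, 2008: 75.0},
}

# Hypothetical events as (type, location, start date) tuples.
events = [
    ("flood", "Atlantis", date(2007, 3, 1)),
    ("election", "Atlantis", date(2005, 9, 15)),
]


def find_breaks(series, threshold):
    """Return the years whose change from the previous year exceeds the threshold."""
    years = sorted(series)
    return [
        curr
        for prev, curr in zip(years, years[1:])
        if abs(series[curr] - series[prev]) > threshold
    ]


def annotate(statistics, events, threshold):
    """Join trend breaks with events on location and year."""
    annotations = []
    for location, series in statistics.items():
        for year in find_breaks(series, threshold):
            for event_type, event_location, start in events:
                if event_location == location and start.year == year:
                    annotations.append((location, year, event_type))
    return annotations


print(annotate(statistics, events, threshold=10.0))
```

A match in this sketch only signals that a break and an event coincide in time and place; deciding whether the event actually triggered the trend is left to the rule system.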

Proposed Architecture

The following diagram depicts a possible architecture for the system. Each green box represents a specific component that a team of students will be working on.

Team structure and tasks

Tasks per team:

  • Event Extractor
    • gather data sources
    • text extraction
    • information integration
    • choose data format
  • Time Series Analyzer
    • gather data sources
    • information integration
    • regression / time series analysis
    • statistics
  • Rule System
    • association rule mining
    • machine learning
    • time shift
    • evaluation
  • Visualizer
    • find visualization framework
    • implement interactive GUI
    • data export

Important Dates and Material

    18.10.10
    • first meeting
    Slides: file download

    22.10.10, 23:59 CEST
    • deadline for sending participation request mail to Johannes
    • include your preferred focus (if any): extraction or analysis
    • include your preferred amount of "Leistungspunkte": 3 or 6
    Slides: n.a.

    24.10.10, 20:00 CEST
    • participation confirmation mails are sent out
    Slides: n.a.

    25.10.10
    • announcement of team set-up
    • discussion of first steps
      • data model
      • event classes
      • statistical datasets
    Slides: file download

    01.11.10
    • tutorial statistical analysis
    Slides: file download

    08.11.10
    • tutorial information extraction
    Slides: file download

    15.11.10
    • short overview (30 min) by extraction team of
      • proposed datasets
      • proposed extraction algorithms
      • proposed data format
      • proposed frameworks
      • first results/implementations?
    • short overview (30 min) by statistics team of
      • proposed datasets
      • proposed analysis algorithms
      • proposed frameworks
      • first results/implementations?
    Slides: n.a.

    29.11.10
    • informal progress report
    Slides: n.a.

    10.12.10
    • presentation of preliminary results
      • statistics team
      • start time: 15:15 (room A-2.2, as usual)
      • 10 minutes per speaker
    Slides: file download

    13.12.10
    • presentation of preliminary results
      • extraction team
      • start time: 15:15
      • 10 minutes per speaker
    Slides: file download

    03.01.11
    • "naïve" join of statistics and event data on "date" attribute should be possible by now
    • no meeting on this day!
    Slides: n.a.

    10.01.11
    • team set-up for second phase
    Slides: file download

    17.01.11
    • tutorial data mining
    Slides: file download

    24.01.11
    • no meeting on this day!
    Slides: n.a.

    31.01.11
    • short overview of next steps by visualization team + approx. 3 members of rule system team (no slides)
    Slides: n.a.

    07.02.11
    • short overview of next steps by remaining rule system team (no slides)
    Slides: n.a.

    31.03.11
    • hand in source code
    • hand in "BlackSwan advertisement"
    Slides: n.a.

    11.04.11, 17:00
    • final presentations (rule mining)
    Slides: file download

    15.04.11, 15:00
    • final presentations (visualization)
    Slides: file download

    Literature

    Data Mining / Association Rules

    • Agrawal et al.: Mining association rules between sets of items in large databases, SIGMOD 1993
    • Agrawal et al.: Fast Algorithms for Mining Association Rules, VLDB 1994
    • Tan et al.: Selecting the right objective measure for association analysis, Information Systems 2004
    • Knorr et al.: Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB 1998
    • Han et al.: Data Mining: Concepts and Techniques, Morgan Kaufmann 2005 (can be found in our group library)
    • Segaran: Kollektive Intelligenz: analysieren, programmieren und nutzen, O'Reilly 2008 (can be found in our group library)
    • Witten, Frank: Data Mining. Practical Machine Learning Tools and Techniques, Morgan Kaufmann 2005 (can be found in our group library)

    Information extraction

    • Sarawagi: Information Extraction, Now Publishers Inc. 2008

    Statistical analysis

    • Adler: R in a nutshell, O'Reilly 2010 (can be found in our group library)

    Links

    (*) http://en.wikipedia.org/wiki/Black_swan_theory