Ralf Krestel

DBS 19

Measuring and Facilitating Data Repeatability in Web Science

Abstract

Accessible and reusable datasets are a necessity to accomplish repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright protection or privacy regulations, which hinder publication of datasets. To alleviate these problems and reach what we define as “partial data repeatability”, we present a process that consists of multiple components. Researchers need to distribute only a scraper and not the data itself to comply with legal limitations. If a dataset is re-scraped for repeatability after some time, the integrity of different versions can be checked based on fingerprints. Moreover, fingerprints are sufficient to identify what parts of the data have changed and how much. We evaluate an implementation of this process with a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after pausing for one year and show that less than ten percent of the data has actually changed. These experiments demonstrate that providing a scraper and fingerprints enables recreating a dataset and supports the repeatability of web science experiments.
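The fingerprint-based integrity check described in the abstract can be illustrated with a minimal sketch. This is not the implementation from the paper: the choice of SHA-256, the whitespace normalization, the per-comment identifiers, and the function names `fingerprint`, `fingerprint_dataset`, and `compare_versions` are all assumptions made for illustration. The idea shown is that only fingerprints need to be published, and a re-scraped version can then be compared against them without access to the original text.

```python
import hashlib
from typing import Dict


def fingerprint(text: str) -> str:
    """Return a hex digest of the whitespace-normalized text.

    SHA-256 and the normalization step are assumptions for this sketch,
    not the fingerprinting method used in the paper.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fingerprint_dataset(comments: Dict[str, str]) -> Dict[str, str]:
    """Map each (hypothetical) comment id to the fingerprint of its text."""
    return {cid: fingerprint(text) for cid, text in comments.items()}


def compare_versions(old_fp: Dict[str, str], new_fp: Dict[str, str]) -> dict:
    """Compare two fingerprint sets and report which ids differ.

    The changed fraction counts comments from the old version that were
    modified or are missing in the new version.
    """
    old_ids, new_ids = set(old_fp), set(new_fp)
    modified = {cid for cid in old_ids & new_ids if old_fp[cid] != new_fp[cid]}
    removed = old_ids - new_ids
    added = new_ids - old_ids
    changed = len(modified) + len(removed)
    fraction = changed / len(old_fp) if old_fp else 0.0
    return {
        "modified": modified,
        "removed": removed,
        "added": added,
        "changed_fraction": fraction,
    }


if __name__ == "__main__":
    # Toy data standing in for an original scrape and a later re-scrape.
    original = {"c1": "Great article!", "c2": "I disagree.", "c3": "First"}
    rescraped = {"c1": "Great  article!", "c2": "I strongly disagree.", "c4": "New"}

    published = fingerprint_dataset(original)  # distributed instead of the data
    report = compare_versions(published, fingerprint_dataset(rescraped))
    print(report)
```

In this toy run, `c1` differs only in whitespace and therefore counts as unchanged, `c2` is reported as modified, `c3` as removed, and `c4` as added. A real pipeline would need to decide which fields enter the fingerprint and how aggressively to normalize them.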

Full Paper

DBS19.pdf

BibTeX Entry

@Article{krestel-dbs19,
  author  = {Risch, Julian and Krestel, Ralf},
  journal = {Datenbank-Spektrum},
  title   = {Measuring and Facilitating Data Repeatability in Web Science},
  year    = {2019},
  volume  = {19},
  number  = {2},
  issn    = {1610-1995},
  note    = {Springer}
}
