Facing the Storm
Sensitive research under public scrutiny
On Monday, June 4, 2012, Schufa, Germany’s largest and best-known credit-rating agency, and the Hasso Plattner Institute (HPI), a privately funded computer science institute at the University of Potsdam in Germany, jointly announce the inauguration of a research lab, dubbed SchufaLab@HPI. Its goal is to explore both the economic value and the societal risk of web data, including but certainly not limited to data from social networks. The lab is planned for a three-year period, with a possible extension for another two, and its funding covers two full-time researchers, hardware, travel, etc.

One day later, on Tuesday, internal documents about the lab, including the formal cooperation agreement, slides from the planning phase, and a list of project ideas, are leaked to a journalist at NDR, a national public radio and TV network. The contract outlines the broad research goals of the lab and an informal list of project ideas: the result of a brainstorming session covering a wide range of technologies and their potential application areas. They include relatively harmless avenues, such as sentiment analysis for products, but also ethically more debatable ideas, such as matching customer records to profiles in social networks. Many of the ideas are half-baked, and the document thus explicitly includes a disclaimer stating that their execution depends on the availability of the data and on their legality under the very strict privacy laws in Germany. In the afternoon, a member of Schufa’s executive board and I have a long conversation with the journalist to explain what is in fact planned and to emphasize that all research will clearly be performed within legal and ethical boundaries. We also reiterate the explicit goal of exploring the societal risk, the research nature of the lab, and the importance of publicly showing the capabilities of current IT methods to extract knowledge from web data.
Nevertheless, on Thursday NDR publishes a report seeking a scandal: it indirectly suggests a dark and secretive ploy by Schufa and HPI to undermine citizens’ privacy through illegal means. On that day and the following, both Schufa and HPI face a storm that is full-blown by any measure, including wide coverage on national TV and radio, reports and commentaries in every German newspaper, etc. Even two federal ministers, the secretary of justice and the secretary of food, agriculture, and consumer protection, issue statements to the press demanding that the project be halted. In fact, one of them, in subsequent correspondence, maintains that using data from social networks for commercial purposes is “unthinkable”, completely ignoring the fact that it is both legal and common practice in Germany and all over the world; indeed, such commercial use is arguably the main driver of the internet. Finally, the German Ethics Council, which normally advises parliament on bioethics and other medical matters, weighs in against the project. These statements are made without anyone ever contacting us to find out what was in fact planned. Tweets and blogs go wild, and a slew of hate mail against me and the PhD students in my group ensues. Some messages discuss ethical issues of collecting private data, others insult, and still others tell my students that they will never be able to get a job after having been associated with HPI and this project. More creative Twitterers send out tongue-in-cheek tweets mentioning their riches, in the hope of raising their credit rating: “My bank called: The account is full and I should open another one. #twitternfuerdieschufa”
With the help of HPI’s public relations expert, and in coordination with Schufa’s PR department, I experience three crazy and sleepless days of nonstop interviews for TV and radio, conversations with journalists, phone calls, emails, and strategy meetings.
Is such public outrage justified? What ethical and legal responsibilities do we data(base) researchers have when handling data about individuals?
First, searching for and analyzing publicly available data is legal in Germany and likely in most other countries, even for business purposes and even if the data is about individuals. Second, analyzing social network data is common practice in research and industry. Of course the social networks themselves make use of their data; without clustering, classification, and further analytics, their business models are moot. There are companies that expressly specialize in social network analysis, for instance to support recruitment specialists. I have since received numerous requests for cooperation from companies pursuing similar goals, though of course without the intent of publishing their techniques. Many large and small software vendors support access to social network data for BI or CRM tasks. Finally, consider a life insurance representative deciding on the insurance rate for a customer: looking up the leisure activities of the customer will certainly influence that decision. In conclusion, commercial use of web data and social network data is happening, regardless of its legality; it will most likely expand; and it will not go away. A few days into the storm, more prudent newspaper articles appear stating as much and, for instance, give advice to consumers on how to configure their Facebook privacy settings to protect themselves against such analysis.
Much research on social network data analysis has been performed and published, some by research groups of the social network companies themselves, more by independent researchers. They have obtained datasets from the networks directly, have crawled the data, or have used publicly available data, such as the details of 100 million Facebook users published as a torrent, the Enron email dataset published by court order, the infamous AOL search logs, or the Netflix data published in the context of a competition.
As researchers, we are often given more leeway than commercial enterprises, under the assumption that the public benefits from our results and that the data and methods are treated responsibly. When researching methods to collect data that is considered private, even if it was intentionally or unintentionally made publicly available, it is important to protect any data that was gathered. But it is just as important to make the public aware of the ready availability of such data, and of the capabilities of advanced IT to analyze that data and automatically draw further, implicit conclusions. These conclusions, concerning for example gender, profession, age, sexual orientation, health, or education level, need not be based only on the data about the individual, but also on data from his or her social context and on training data solicited from, or gathered about, a broader population. That is, a blogger need not explicitly state that she is vegan; it might be deduced from implicit signals in her texts, from her friends’ statuses, or by matching her blog with her Facebook profile.
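To make this kind of implicit inference concrete, here is a minimal, purely illustrative sketch: a tiny Naive Bayes text classifier, trained on a few invented blog snippets with invented labels, that guesses a "vegan" label from word statistics alone, without the label ever being stated. Real systems would use far richer features and the social context described above; everything here is a hypothetical toy.

```python
# Toy sketch of implicit attribute inference: a Naive Bayes word-count
# classifier with add-one smoothing. All data and labels are invented.
from collections import Counter
import math

# Hypothetical training corpus: short blog snippets with known labels.
TRAIN = [
    ("tried a new tofu and lentil recipe today", "vegan"),
    ("oat milk latte and a chickpea salad for lunch", "vegan"),
    ("grilled a steak with butter and cheese", "not_vegan"),
    ("made my grandmother's chicken soup", "not_vegan"),
]

def train(corpus):
    """Count word occurrences per class."""
    counts = {"vegan": Counter(), "not_vegan": Counter()}
    for text, label in corpus:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Return the class with the higher log-likelihood for the text."""
    vocab = set().union(*counts.values())
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)  # add-one smoothing denominator
        scores[label] = sum(
            math.log((c[w] + 1) / total) for w in text.split()
        )
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict(model, "lentil curry with oat milk"))  # prints "vegan"
```

The point of the toy is not the technique, which is decades old, but how little it needs: a handful of labeled examples and raw word counts already suffice to attach a plausible label to a text whose author never stated it.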
In my opinion, it is important to perform and publish such research publicly and transparently, and not to leave this field to private organizations with commercial interests. Research results can serve to educate the public, can provide tools for individuals to monitor their online presence beyond what a search engine result delivers, and can thus level the playing field. It is not enough to fret about the potential abilities of IT; we must study them and, after careful analysis, draw the right conclusions, such as changes in policy and the establishment of active consumer protection measures.
With the storm still raging on Friday and no signs of it abating, HPI decides to terminate the project. We conclude that it has become impossible to perform serious research at that noise level and with the necessary serenity. What is more, my responsibility for my PhD students demands that I protect them from the undue and at times insulting attacks. While some journalists, politicians, and individuals express satisfaction that the project has been stopped (still ignoring the fact that such data collection and analysis is performed elsewhere anyway, just not in a transparent setting), others are disappointed that we caved to public pressure. A new wave of interviews is necessary to justify the decision, but the storm dies down as fast as it had gathered.
What are the lessons learned? First, it is very difficult to explain (and justify) complex research issues to laymen and journalists. The broad set of research questions around web data was reduced to the catchphrase “Schufa crawls Facebook for credit-rating”.
Second, there is an immense lack of privacy awareness: A surprising number of people are deeply convinced that whatever they write or post on Facebook or other social networks is private, and they have no inkling of privacy settings. A presumably well-educated journalist indignantly drew the analogy of someone entering his home and taking pictures of all his private documents. I replied that a more fitting analogy would be his displaying the documents on the sidewalk in front of his home, and indexing them for easy reference.
Third, there is no arguing against a storm. Spending three entire days in interviews, appearing on TV and radio, and drafting press releases and responses to questions made no apparent dent in the negative coverage. Of course, this lesson does not excuse one from defending one’s research.
Fourth, while I do not regret having initiated the project and stand by its goals, any future project that might involve private data should include more stakeholders, such as sociologists, politicians, ethicists, and data protection officers, to ensure a legal and ethical procedure and to convince the public that such research is useful and important. Also, one should draft agreements, proposals, and project goals as if they were to be published. This measure not only protects against PR catastrophes such as the one I experienced, but also ensures early deliberation on possible ethical problems. While these conclusions seem obvious, I am convinced that such measures are rarely taken in research practice.
As of now, research in Germany with or about data from social networks is taboo: The mere mention of the use of Twitter or Facebook data in the press is met with emotions ranging from skepticism to outright shock and rejection. Thus, research and development of such techniques will be left to other countries and to private corporations with no transparency or willingness to publish their results. Or performed by researchers hoping that journalists do not read WWW, SIGIR, or CIKM proceedings…
Translated from http://twitter.com/JustElex/status/210712374271946752