Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Recommending Tourist Activities - Data Science Challenges and the Needs for Data Pipelines

Maximilian Jenders, GetYourGuide, Berlin

Abstract

According to Harvard Business Review, Maximilian Jenders works in the sexiest job of the 21st century: he is a Data Scientist. In his presentation, Mr. Jenders provides insights into this much-hyped discipline, introduces typical challenges in his field, and emphasizes the importance of Data Pipelines. Throughout the lecture, he uses the example of building a user recommendation system to illustrate his ideas. In addition to technical insights, Jenders shares entertaining anecdotes and advice from his work as a Data Scientist.

Biography

Mr. Jenders graduated with a master's degree in IT-Systems Engineering from HPI and then continued working there as a researcher until 2017. He specialized in Data Mining and Machine Learning and now works for GetYourGuide as a Data Scientist. Jenders describes the art of Data Science as the combination of hacking skills, math and statistics knowledge, and substantive expertise. He argues that only the union of these three provides the profound expertise needed to tackle the Data Science challenges that IT companies such as GetYourGuide face. In his eyes, a good Data Scientist should also have solid knowledge of Data Engineering in order to build the groundwork that many modern data-driven applications require.

A recording of the presentation is available on Tele-Task.

Summary

written by Anjo Seidel, Victor Wolf, and Leon Schiller

GetYourGuide is an online marketplace for tourist activities that launched in 2009 and was named the “3rd most innovative company in travel” in 2019. As one of more than 600 employees, Jenders handled various challenges at GetYourGuide. He mainly worked on the user recommendation system, optimizing the type, place, and time of recommendations on the website. Another challenge he tackled was predicting the value of individual advertisements for the company, which also included the question of how much money should be spent on which kind of advertisement (budget allocation). Furthermore, GetYourGuide's Data Science division deals with challenges like tagging the company's inventory, designing search rankings, and allocating money to different acquisition channels.

Mr. Jenders uses the example of building a recommendation system for GetYourGuide to illustrate the need for Data Pipelines. He walks through the development process and describes underlying principles as well as challenges he faced while developing this system. GetYourGuide's recommendation system aims to propose suitable products to users based on their interests and past transactions. Someone who just booked a walking tour through Berlin could, for example, receive a recommendation to visit an ice bar or a museum in the same city.

Guidelines of building a recommendation system

According to Mr. Jenders, it is crucial to determine specific constraints for the system and the development process before starting to engineer it. The first question one should consider is what the company aims to achieve with the product. In the case of a recommendation system, this could be an increase in the number of sales, customer satisfaction, or conversions, but also the pursuit of a specific business strategy or the solution of a stakeholder problem.

Next, the target audience has to be determined. Here, the question arises where and when recommendations should be generated on the website. Another question is which items to show to the customer with respect to ranking, relevance, diversity, or strategic business initiatives.

Furthermore, it is important to define suitable metrics that measure the success of the system in an accurate way. In our case, one could think of the conversion rate, revenue per visitor, average order value, repeat rate, or net promoter score as metrics for evaluating the system's impact. For example, if the company strives to increase cross-sells (additional sales to an existing customer), a suitable metric to measure the system's effect would be the revenue per visitor, as it increases directly with the number of cross-sells. The conversion rate, on the other hand, would not be a good metric for this use case.
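
The following is a minimal sketch (not from the talk) of how such metrics could be computed from aggregated shop data; all numbers and names are made up for illustration. It shows why revenue per visitor reacts to cross-sells while the conversion rate may not.

```python
# Hypothetical daily aggregates: metric definitions as simple functions.

def conversion_rate(converting_visitors, visitors):
    # share of visitors that placed at least one order; unaffected by
    # additional orders from visitors who already converted
    return converting_visitors / visitors

def revenue_per_visitor(total_revenue, visitors):
    # rises directly with cross-sells, even if the number of
    # converting visitors stays the same
    return total_revenue / visitors

def average_order_value(total_revenue, orders):
    return total_revenue / orders

visitors, converting_visitors, orders, revenue = 10_000, 430, 450, 31_500.0
print(conversion_rate(converting_visitors, visitors))  # 0.043
print(revenue_per_visitor(revenue, visitors))          # 3.15
print(average_order_value(revenue, orders))            # 70.0
```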

In addition to that, one should define a way of testing the system in practice. For this purpose, GetYourGuide uses the concept of A/B testing, where two different versions of the online shop (version A and version B) are shown to different groups of users in order to find out which version better serves the business goals.
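
A minimal sketch of how such an A/B test could be evaluated on conversion counts, assuming the per-variant numbers are already aggregated; the figures are made up and a chi-squared test is just one common choice of significance test.

```python
from scipy.stats import chi2_contingency

# per variant: (conversions, non-conversions)
a = [430, 9_570]   # version A: current shop
b = [495, 9_505]   # version B: shop with the new recommendation widget

chi2, p_value, dof, expected = chi2_contingency([a, b])
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level.")
else:
    print("No significant difference detected; keep collecting data.")
```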

While some of these questions can be answered on a whiteboard, others might need to be validated using analytics. In our case, it might be crucial to determine how many users actually scroll down the page or buy multiple products before building the system. This should be done by analyzing actual customer data from the past.

The importance of Data Pipelines

After answering these questions, one can decide how to achieve these goals. In Data Science, the most commonly used technique today is Machine Learning. But before being able to run a Machine Learning algorithm, one has to find a way of providing fast, clean, and accurate data. This is where the need for Data Pipelines comes into play.

Data collection

Before building a pipeline, one has to collect relevant data in a central place. In this step, one should consider the data's cleanliness and update frequency and also think about future projects in order to collect data that could prove useful for them. Mr. Jenders recommends using a Data Lake for this task because of its flexibility, which makes it easier to store large amounts of heterogeneous data from multiple sources. Later on, when ETL processes (extract, transform, load) become more complex, one can add Data Warehouses to store information in a more structured way.

For its recommendation system, GetYourGuide collects more than 100 million user events per day. They are triggered when a user interacts with the website, e.g. when they book an activity, click on a recommendation, or scroll down the page. Mr. Jenders presented some of these events using Kibana, a platform for searching and visualizing data from Elasticsearch.
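
As a rough illustration of landing such events in a Data Lake, the following PySpark sketch reads raw event exports and writes them as partitioned Parquet files. The paths, field names, and partitioning scheme are assumptions for illustration, not GetYourGuide's actual setup.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-ingestion").getOrCreate()

# raw click/booking/scroll events, e.g. exported from the tracking system
events = spark.read.json("s3://data-lake/raw/user_events/2019-11-12/*.json")

(events
    .withColumn("event_date", F.to_date("timestamp"))
    .write
    .mode("append")
    .partitionBy("event_date", "event_type")   # assumed event schema
    .parquet("s3://data-lake/cleaned/user_events/"))
```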

Building the Machine Learning model

As a next step, one can think about building the actual Machine Learning model. Here, the desired input and output data should be specified, considering which data is available at what time. One should also have a concept for testing, versioning, and deploying the model. Furthermore, it is very important to define specific metrics that can be used for evaluation and model selection, e.g. the accuracy or the false positive rate of the model on a test dataset.

When training the model, one also needs to make sure that it is not “peeking”, i.e. that the model cannot make use of the data it is being tested with or of data that it would not have access to in a production scenario. A peeking model might perform very well during training and testing but fail in production, because there it no longer has access to those parts of the data. Therefore, one needs to train the model with accurate data and keep the training and testing datasets well separated.

This is especially true when training with historical data, because here one needs to ensure that the training data only consists of information that was available at the time of its recording and does not contain enrichments from the future.

For the core of the model, Mr. Jenders recommends starting with a simple algorithm (e.g. linear regression) instead of a complex one like a neural network. In most cases, this already leads to remarkable results while preventing premature optimization. In addition, one shouldn't restrict the model to explicit data only, in order to also leverage data exhaust, i.e. the implicit by-products of user interactions.
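
The following self-contained sketch combines these points: a time-based split that avoids peeking and a simple baseline model, here a logistic regression (the classification counterpart of the linear model mentioned above) predicting whether a user clicks a recommendation. File, column, and feature names are hypothetical; this is not the model discussed in the talk.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

events = pd.read_parquet("training_events.parquet")  # hypothetical extract
cutoff = pd.Timestamp("2019-10-01")

# train only on data recorded before the cutoff, test on later data,
# so the model never sees "future" information during training
train = events[events["timestamp"] < cutoff]
test = events[events["timestamp"] >= cutoff]

# explicit and implicit ("data exhaust") features, all made up
features = ["num_past_bookings", "scroll_depth", "item_price"]
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["clicked_recommendation"])

predictions = model.predict(test[features])
tn, fp, fn, tp = confusion_matrix(test["clicked_recommendation"], predictions).ravel()
print("accuracy:", accuracy_score(test["clicked_recommendation"], predictions))
print("false positive rate:", fp / (fp + tn))
```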

Setting up the pipeline

In order to run the model, a set of preprocessing steps has to be introduced that clean the data, enrich it, and bring it into a suitable format. Therefore, one has to identify the necessary steps and their dependencies on each other. It is also very important to handle faults in an appropriate way, i.e. one should at least be able to detect faults early on. If possible, the pipeline should be fault tolerant, meaning that it can still produce good results when a step fails; otherwise it should rather fail than produce bad results due to an unrecognized error.

Mr. Jenders recommends using Apache Airflow for this task, a platform for structuring steps as a directed acyclic graph (DAG). This makes it easier to keep track of the dependencies and to cope with increasing complexity when adding or modifying steps. Airflow also runs the pipeline and helps ensure fault tolerance, for example by letting the user retry a step a certain number of times if it fails.
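
A minimal sketch of what such a pipeline could look like as an Airflow DAG (assuming Airflow 2.x); the task names and callables are hypothetical placeholders, not GetYourGuide's actual pipeline.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_events():
    ...  # pull raw events from the data lake

def clean_events():
    ...  # remove broken records, enrich, reformat

def train_model():
    ...  # retrain the recommendation model

with DAG(
    dag_id="recommendation_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # retry a failing step a few times before failing the whole run
    default_args={"retries": 3, "retry_delay": timedelta(minutes=15)},
) as dag:
    extract = PythonOperator(task_id="extract_events", python_callable=extract_events)
    clean = PythonOperator(task_id="clean_events", python_callable=clean_events)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    # dependencies form a directed acyclic graph
    extract >> clean >> train
```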

Another useful tool GetYourGuide uses is Databricks, a web-based platform for working with Apache Spark. It provides automated cluster management and allows the user to run notebook-style scripts. This can prove very useful for prototyping and experimenting, but shouldn't be used in production because notebooks lack proper version control.

Challenges of Data Pipelines

Furthermore, Mr. Jenders gives insights into the daily challenges of working with Data Pipelines. One of the most annoying problems is stale data, meaning that the pipeline keeps using data that is no longer up to date (without failing), which might lead to strange results. Another challenge is data drift: unforeseen changes in certain data characteristics that might lead to inputs the pipeline isn't designed for. Good monitoring can mitigate this risk.
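
One possible form of such monitoring, sketched below under assumed file and column names, is to compare the distribution of a feature in today's pipeline input against a reference window and alert when they diverge; a Kolmogorov-Smirnov test is just one simple choice for this.

```python
import pandas as pd
from scipy.stats import ks_2samp

reference = pd.read_parquet("features/2019-10.parquet")["item_price"]
today = pd.read_parquet("features/today.parquet")["item_price"]

stat, p_value = ks_2samp(reference, today)
if p_value < 0.01:
    # distributions differ noticeably: fail loudly instead of silently
    # feeding drifted data into the model
    raise RuntimeError(f"Possible data drift detected (KS statistic {stat:.3f})")
```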

To sum up Mr. Jenders' point of view on Data Science: no matter what problem you are working on, accurate, clean, and fast data is essential. From this arises the high importance of Data Pipelines and the need for proper Data Engineering. A company must first lay a solid groundwork before it can move on to more advanced problems. Therefore, a good Data Scientist should also be experienced in Data Engineering. The challenges a Data Scientist works on also depend heavily on the maturity of their company. Hence, if you are planning to work as a Data Scientist, you should ask yourself what you prefer: building the foundation or working on higher-level optimization problems. You should then choose the company you work for accordingly.