GetYourGuide is an online marketplace for tourist activities that launched in 2009 and was named the “3rd most innovative company in travel” in 2019. As one of more than 600 employees, Mr. Jenders has tackled a variety of challenges at GetYourGuide. He mainly worked on the user recommendation system, optimizing which recommendations are shown, where on the website and at what time. Another challenge he tackled was predicting the value of individual advertisements for the company, which also included the question of how much money should be spent on which kind of advertisement (budget allocation). Beyond that, GetYourGuide’s Data Science division deals with challenges like tagging the company’s inventory, designing search rankings and allocating money to different acquisition channels.
Mr. Jenders uses the example of building a recommendation system for GetYourGuide to illustrate the need for Data Pipelines. He walks through the development process and describes the underlying principles as well as the challenges he faced while developing this system. GetYourGuide’s recommendation system aims to propose suitable products to users based on their interests and past transactions. Someone who just booked a walking tour through Berlin could, for example, receive a recommendation for an ice bar or a museum in the same city.
Guidelines of building a recommendation system
According to Mr. Jenders, it is crucial to determine specific constraints for the system and the development process before starting to engineer it. The first question to consider is what the company aims to achieve with the product. In the case of a recommendation system, this could be an increase in the number of sales, in customer satisfaction or in conversions, but also the pursuit of a specific business strategy or the solution of a stakeholder problem.
Next, the target audience has to be determined. This raises the question of where and when recommendations should be generated on the website. Another question is which items to show to the customer, with respect to ranking, relevance, diversity or strategic business initiatives.
Furthermore, it is important to define suitable metrics that can measure the success of the system accurately. In our case, one could think of the conversion rate, revenue per visitor, average order value, repeat rate or net promoter score as metrics for evaluating the system’s impact. For example, if the company strives to increase cross-sells (additional sales to an existing customer), a suitable way to measure the system’s effect would be revenue per visitor, since it increases directly with the number of cross-sells. The conversion rate, on the other hand, would not be a good metric for this use case.
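To make the difference concrete, the following minimal sketch (with made-up visitors and order values) computes both metrics: the cross-sold add-on bought by visitor_2 raises the revenue per visitor but leaves the conversion rate unchanged.

```python
# Minimal sketch (hypothetical data): why revenue per visitor captures
# cross-sells while the conversion rate does not.

# Each visitor is listed with the values of the orders they placed.
orders_per_visitor = {
    "visitor_1": [49.0],            # single booking
    "visitor_2": [35.0, 20.0],      # booking plus a cross-sold add-on
    "visitor_3": [],                # no purchase
    "visitor_4": [80.0],
}

n_visitors = len(orders_per_visitor)
n_converted = sum(1 for orders in orders_per_visitor.values() if orders)
total_revenue = sum(sum(orders) for orders in orders_per_visitor.values())

conversion_rate = n_converted / n_visitors          # unchanged by cross-sells
revenue_per_visitor = total_revenue / n_visitors    # rises with every cross-sell

print(f"conversion rate:     {conversion_rate:.2f}")
print(f"revenue per visitor: {revenue_per_visitor:.2f}")
```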
In addition, one should define a way of testing the system in practice. For this purpose, GetYourGuide uses A/B testing, where two different versions of the online shop (version A and version B) are shown in order to find out which version serves the business goals better.
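A common way to evaluate such an experiment is to compare the conversion rates of the two variants with a two-proportion z-test. The following sketch uses invented counts and is not GetYourGuide’s actual testing setup:

```python
# Minimal A/B-test sketch (hypothetical numbers): compare the conversion rates
# of variant A (e.g. without recommendations) and variant B (with them).
from math import sqrt, erf

# Invented counts collected during the experiment.
visitors_a, conversions_a = 10_000, 420
visitors_b, conversions_b = 10_000, 475

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)

se = sqrt(p_pooled * (1 - p_pooled) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(f"conversion A: {p_a:.2%}, conversion B: {p_b:.2%}, p-value: {p_value:.4f}")
```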
While some of these questions can be answered on a whiteboard, others might need to be validated using analytics. In our case, it might be crucial to determine how many users actually scroll down the page or buy multiple products, before building the system. This should be done by analyzing actual customer data from the past.
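As a hedged illustration, such a validation could look like the following pandas snippet; the event names and data are purely hypothetical:

```python
# Hypothetical sketch: validating assumptions against past event data before
# building the system. Column names and event types are assumptions.
import pandas as pd

events = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2, 3, 4, 4],
    "event_type": ["scroll_to_recommendations", "booking", "booking",
                   "booking", "scroll_to_recommendations", "page_view",
                   "booking", "scroll_to_recommendations"],
})

n_users = events["user_id"].nunique()
scrolled = events.loc[events["event_type"] == "scroll_to_recommendations",
                      "user_id"].nunique()
bookings_per_user = events[events["event_type"] == "booking"].groupby("user_id").size()
multi_buyers = (bookings_per_user > 1).sum()

print(f"{scrolled / n_users:.0%} of users scroll down to the recommendation widget")
print(f"{multi_buyers / n_users:.0%} of users buy more than one product")
```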
The importance of Data Pipelines
After answering these questions, one can decide on how to achieve these goals. In Data Science, the most commonly used approach today is a Machine Learning algorithm. But before being able to run it, one has to find a way of providing fast, clean and accurate data. This is where the need for Data Pipelines comes into play.
Data collection
Before building a pipeline, one has to collect relevant data in a central place. In this step, one should consider the data’s cleanliness and update frequency and also think about future projects in order to collect data that could prove useful for them. Mr. Jenders recommends using a Data Lake for this task because of its flexibility, which makes it easier to store large amounts of heterogeneous data from multiple sources. Later on, when ETL processes (extract, transform, load) become more complex, one can add Data Warehouses for storing information in a more structured way.
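A minimal ETL sketch, assuming a hypothetical event schema and storage paths, could look like this in PySpark: raw JSON events are read from the Data Lake, cleaned and written to a structured, columnar staging table.

```python
# Hypothetical ETL sketch (paths and field names are assumptions): copy raw
# events from the data lake into a cleaned, columnar table for downstream use.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("event-etl").getOrCreate()

raw = spark.read.json("s3://data-lake/raw/user_events/2019-06-01/")

cleaned = (
    raw
    .dropDuplicates(["event_id"])          # remove duplicate deliveries
    .filter(col("user_id").isNotNull())    # drop events without a user
    .select("event_id", "user_id", "event_type", "timestamp")
)

# Write a structured, query-friendly copy (e.g. into a warehouse staging area).
cleaned.write.mode("overwrite").parquet("s3://warehouse/staging/user_events/")
```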
For its recommendation system, GetYourGuide collects more than 100 million user events per day. They are triggered when a user interacts with the website, e.g. when they book an activity, click on a recommendation or scroll down the page. Mr. Jenders presented some of these events using Kibana, a platform for searching and visualizing data from Elasticsearch.
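To give an idea of what such an event might contain, here is a purely hypothetical example document of the kind one could inspect in Kibana; all field names are assumptions, not GetYourGuide’s actual schema:

```python
# Hypothetical user event as it might be stored in Elasticsearch and browsed
# in Kibana; every field name and value here is an invented example.
example_event = {
    "event_id": "c1f3a9e0-0000-0000-0000-000000000000",
    "event_type": "recommendation_click",
    "user_id": "u-482913",
    "activity_id": "berlin-walking-tour-042",
    "recommended_activity_id": "berlin-icebar-007",
    "page": "activity_detail",
    "timestamp": "2019-06-01T14:23:05Z",
}
```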
Building the Machine Learning model
As a next step, one can think about building the actual Machine Learning model. Here, the desired input and output data should be specified, considering which data is available at what time. One should also have a concept for testing, versioning and deploying the model. Furthermore, it is very important to define specific metrics that can be used for evaluation and model selection. Those could include e.g. the accuracy or the false-positive rate of the model on a test dataset.
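A generic sketch of such an evaluation, not the actual GetYourGuide model, could frame recommendations as a click-prediction problem and compare candidate models on a held-out test set:

```python
# Generic evaluation sketch on synthetic data: "will the user click this
# suggestion?" as a binary classification problem, judged by accuracy and
# false-positive rate on a held-out test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 5))                                   # synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"accuracy:            {accuracy_score(y_test, y_pred):.2f}")
print(f"false-positive rate: {fp / (fp + tn):.2f}")
```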
When training the model, one also needs to make sure that it is not “peeking”, i.e. that the model cannot make use of the data it is being tested with or of data it wouldn’t have access to in a production scenario. A peeking model might perform very well during training and testing but fail in production, where it no longer has access to those parts of the data. Therefore, one needs to train the model with accurate data and keep the training and testing datasets strictly separated.
This is especially true when training with historical data: here one needs to ensure that the training data only consists of information that was available at the time of its recording and does not contain enrichments from the future.
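One common way to enforce this, sketched below with assumed column names, is to split historical data chronologically instead of randomly, so that the model is only evaluated on events that happened after everything it was trained on:

```python
# Hedged sketch: split historical data chronologically so the model never
# "peeks" at events from the future. Column names and dates are assumptions.
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-03-01", "2019-03-15", "2019-04-02",
        "2019-04-20", "2019-05-05", "2019-05-28",
    ]),
    "user_id": [1, 2, 1, 3, 2, 3],
    "clicked": [0, 1, 1, 0, 1, 0],
})

cutoff = pd.Timestamp("2019-05-01")
train = events[events["timestamp"] < cutoff]    # only information known before the cutoff
test = events[events["timestamp"] >= cutoff]    # evaluation on strictly later events

print(f"{len(train)} training rows, {len(test)} test rows")
```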
For the core of the model, Mr. Jenders recommends starting with a simple algorithm (e.g. linear regression) instead of a complex one like a neural network. In most cases, this already leads to good results while preventing premature optimization. In addition, one shouldn’t restrict the model to explicit data alone, in order to also leverage data exhaust, i.e. the implicit behavioral data users generate as a by-product, such as clicks and scrolls.
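As a hedged illustration of such a simple baseline, the following snippet fits an ordinary least-squares model on features that mix explicit data (past bookings, ratings) with implicit data exhaust (clicks, scroll depth, dwell time); all feature names and numbers are invented.

```python
# Hypothetical baseline: a plain linear model over explicit and implicit signals.
import numpy as np

# Per user-activity pair: [past_bookings_in_city, avg_rating_given,
#                          recommendation_clicks, scroll_depth, dwell_seconds]
X = np.array([
    [1, 4.5, 3, 0.9, 120.0],
    [0, 0.0, 1, 0.4,  15.0],
    [2, 4.0, 0, 0.2,  10.0],
    [1, 3.5, 5, 1.0, 240.0],
])
y = np.array([1.0, 0.0, 0.0, 1.0])   # did the user book the recommended item?

# Ordinary least squares as the simplest possible baseline.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
scores = X @ weights                  # higher score = stronger recommendation
print(scores.round(2))
```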
Setting up the pipeline
In order to run the model, a set of preprocessing steps has to be introduced that cleans the data, enriches it and brings it into a suitable format. For this, one has to identify the necessary steps and the dependencies between them. It is also very important to handle faults appropriately, i.e. one should at least be able to detect faults early on. If possible, the pipeline should be fault tolerant, meaning that it can still produce good results when a step fails; otherwise it should rather fail than produce bad results due to an unrecognized error.
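The “rather fail than produce bad results” principle can be as simple as a validation step that raises as soon as its input looks wrong, as in this hypothetical sketch:

```python
# Hypothetical "fail loudly" sketch: a preprocessing step checks its input and
# raises instead of silently passing bad data downstream. Column names assumed.
import pandas as pd

def validate_events(events: pd.DataFrame) -> pd.DataFrame:
    """Fail the pipeline early if the extracted events look wrong."""
    if events.empty:
        raise ValueError("No events extracted - the upstream step may have failed.")
    missing = {"user_id", "event_type", "timestamp"} - set(events.columns)
    if missing:
        raise ValueError(f"Events are missing expected columns: {missing}")
    if events["user_id"].isna().any():
        raise ValueError("Events with missing user_id detected.")
    return events
```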
Mr. Jenders recommends Apache Airflow for this task, a platform for structuring steps as a directed acyclic graph (DAG). This makes it easier to keep track of dependencies and to cope with the increasing complexity of adding or modifying steps. Airflow also runs the pipeline and helps ensure fault tolerance, for example by rerunning a step a certain number of times if it fails.
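A simplified Airflow sketch, not GetYourGuide’s actual pipeline, could wire three steps into a DAG and use retries for basic fault tolerance:

```python
# Simplified Airflow sketch (hypothetical steps): three tasks arranged as a
# DAG, with automatic retries providing basic fault tolerance.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def extract_events(): ...    # placeholder callables for the sketch
def build_features(): ...
def train_model(): ...

default_args = {
    "owner": "data-science",
    "retries": 3,                          # rerun a failed step up to 3 times
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="recommendation_pipeline",
    default_args=default_args,
    start_date=datetime(2019, 6, 1),
    schedule_interval="@daily",
) as dag:
    extract = PythonOperator(task_id="extract_events", python_callable=extract_events)
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    extract >> features >> train          # explicit dependencies form the DAG
```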
Another useful tool at GetYourGuide is Databricks, a web-based platform for working with Apache Spark. It provides automated cluster management and lets the user run notebook-style scripts. This can prove very useful for prototyping and experimenting, but shouldn’t be used in production because notebooks lack proper version control.
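The kind of notebook-style exploration one might run on such a platform could look like the following PySpark snippet; the table and column names are assumptions:

```python
# Hypothetical, notebook-style PySpark exploration of the staged event table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` already exists; this line makes
# the sketch runnable elsewhere too.
spark = SparkSession.builder.appName("recommendation-exploration").getOrCreate()

events = spark.read.parquet("s3://warehouse/staging/user_events/")

# How often does each event type occur per day?
(events
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "event_type")
    .count()
    .orderBy("day", "event_type")
    .show(20))
```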
Challenges of Data Pipelines
Furthermore, Mr. Jenders gives insights into the daily challenges of working with Data Pipelines. One of the most annoying problems is stale data, meaning that the pipeline keeps working with data that is no longer current, without failing, which might lead to strange results. Another challenge is data drift: unforeseen changes in certain data characteristics that might lead to inputs the pipeline isn’t designed for. Good monitoring can mitigate these risks.
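Such monitoring can start very simply, for example with a freshness check against stale data and a rough distribution comparison against drift, as in this hypothetical sketch:

```python
# Hypothetical monitoring sketch: two simple checks against stale data and
# data drift. Thresholds and column names are illustrative assumptions.
from datetime import timedelta
import pandas as pd

def check_freshness(events: pd.DataFrame,
                    max_age: timedelta = timedelta(hours=6)) -> None:
    """Alert if the newest event is older than expected (stale data)."""
    newest = events["timestamp"].max()      # timestamps assumed tz-aware UTC
    if pd.Timestamp.now(tz="UTC") - newest > max_age:
        raise RuntimeError(f"Stale data: newest event is from {newest}")

def check_drift(reference: pd.Series, current: pd.Series,
                tolerance: float = 0.2) -> None:
    """Alert if a feature's mean shifts noticeably between two periods (data drift)."""
    shift = abs(current.mean() - reference.mean()) / (abs(reference.mean()) + 1e-9)
    if shift > tolerance:
        raise RuntimeError(f"Possible data drift: mean shifted by {shift:.0%}")
```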
To sum up Mr. Jenders’ point of view on Data Science: no matter what problem you are working on, accurate, clean and fast data is essential. This makes Data Pipelines and proper Data Engineering highly important. A company must first lay a solid groundwork before it can move on to more advanced problems. Therefore, a good Data Scientist should also be experienced in Data Engineering. The challenges a Data Scientist works on also depend heavily on the maturity of their company. Hence, if you are planning to work as a Data Scientist, you should ask yourself what you like to do more: building the foundation or working on higher-level optimization problems. You should then choose the company to work for accordingly.