Hasso-Plattner-Institut
Prof. Dr. h.c. Hasso Plattner
 

Alexander Kastius

Research Assistant, PhD Candidate

Phone:    +49 (331) 5509-560
Fax:    +49 (331) 5509-579
Email:    alexander.kastius(at)hpi.de
Room:    V-2.05 (Campus II)

Research

Reinforcement Learning for Revenue Management

My research field is Reinforcement Learning for Revenue Management.
I investigate improvements to the stability and data efficiency of reinforcement learning algorithms applied to economic problems,
so that these tools can be used in practice even when no information about customer or competitor behaviour is available and the observable information is limited.

Background

Companies that offer their products on large online marketplaces, which allow fast price update intervals, face problems at a previously unseen scale. In many cases, they offer a product for which there is a multitude of alternative sellers or products with comparable properties. While this is also the case in retail, online trade allows potential customers to easily get a detailed overview of the current state of the market. In addition, competitors can access all publicly available information and use it to gain an advantage. Automation allows increasingly fast observations of the market and reactions to its current state. In this environment, automation becomes a must-have to stay competitive.

To achieve automation, many problems in revenue management, for example dynamic price updates or inventory management, can be formulated as Markov decision processes (MDPs). An MDP describes a process with three major components: states, actions, and rewards. The state describes the current properties of the process, which may change after each state transition. The action set contains the possible choices an algorithm can make to trigger a state transition. In a pricing problem, for example, the action set might consist of the possible price choices; in inventory management, of the possible reorder quantities. Within such an environment, a self-learning system, the agent, observes the current state and chooses an action, which in turn leads to a state transition. After the state transition, the agent receives a reward signal, a scalar that encodes the agent's success. Solving an MDP means finding a mapping from states to actions, called a policy, that maximizes the agent's expected cumulative reward until the end of the process.
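As a toy illustration of such a pricing MDP, the following sketch uses entirely made-up numbers: the state is the competitor's current price, the action is our own price choice, and the reward is the revenue from a probabilistic sale. The demand model and the competitor's undercutting reaction are assumptions for illustration only.

```python
import random

# Hypothetical toy pricing MDP. State: competitor's current price.
# Action: our own price choice. Reward: realized revenue.
PRICES = [5, 6, 7, 8, 9, 10]  # discrete action set (possible price choices)

def step(state, action, rng):
    """One state transition: the sale probability falls as our price
    exceeds the competitor's, then the competitor reacts by undercutting."""
    competitor_price = state
    sale_prob = max(0.1, min(0.9, 0.5 + 0.1 * (competitor_price - action)))
    reward = float(action) if rng.random() < sale_prob else 0.0
    # Assumed competitor reaction: undercut us by one unit, bounded to the grid.
    next_state = max(PRICES[0], min(PRICES[-1], action - 1))
    return next_state, reward

rng = random.Random(0)
state, total = 8, 0.0
for _ in range(100):
    action = rng.choice(PRICES)      # a random policy, for illustration only
    state, reward = step(state, action, rng)
    total += reward
```

An agent solving this MDP would replace the random policy with one that maximizes the expected cumulative reward.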

Reinforcement learning comprises a set of algorithms designed to find optimal policies under this criterion. Its solution methods are not problem-specific: as long as all formal requirements are met, any RL algorithm can in principle be applied to any MDP. Depending on the system under assessment, the formal requirements include that the process is formulated as an MDP and that certain information about the process is observable. The most capable and least restrictive algorithms rely only on observations of past states and rewards. Early methods require storing an estimate for every single state or state-action combination. If this is not possible, because either of the two sets is not discrete or exceeds a size that makes storing those values infeasible, deep RL can be used. Deep RL algorithms rely on parametric representations of the policy or reward estimators and update these parameters based on previous observations. They can handle complex processes, are general enough to be robust against specific process properties, and can encode arbitrary policies. This generality comes at the price of a substantial demand for training observations.
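The tabular case, storing one estimate per state-action pair, can be sketched with Q-learning on a made-up five-state chain environment (all constants and the environment itself are hypothetical, chosen only to keep the example small):

```python
import random

# Tabular Q-learning: one stored value per state-action pair.
# Hypothetical environment: a 5-state chain where moving "right" (action 1)
# eventually reaches a rewarding terminal state, moving "left" (action 0)
# resets the agent to the start.
N_STATES = 5
ACTIONS = (0, 1)
ALPHA, GAMMA = 0.1, 0.9

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    if a == 1:
        s2 = s + 1
        if s2 == N_STATES:
            return 0, 1.0, True    # goal reached: reward, episode ends
        return s2, 0.0, False
    return 0, 0.0, False           # moving left resets to the start

rng = random.Random(1)
for _ in range(300):
    s, done, steps = 0, False, 0
    while not done and steps < 200:
        a = rng.choice(ACTIONS)    # random behaviour policy (off-policy)
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s, steps = s2, steps + 1
```

Deep RL replaces the table `Q` with a parametric function approximator, which is exactly what makes it applicable when the table would be too large or the spaces continuous.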

Current Projects

Usage of RL for Revenue Management

In the past, we applied generalistic RL algorithms to problems in revenue management. For those experiments, we focused on dynamic pricing, in which the agent has to determine the price for a product that maximizes long-term revenue given the current competitor prices. We have shown that the generality of algorithms like Deep Q-Networks (DQN) and Soft Actor-Critic (SAC) allows their application to very broad market setups. We evaluated whether, and under which conditions, those systems find near-optimal policies in duopolies across many market setups with different competitor strategies. While there are edge cases in which the systems struggle to find well-performing policies, for example when the competitor relies on a fixed-price strategy, they usually show the expected performance. We observed that the more complex actor-critic systems tend to outperform their value-estimation-based counterparts, represented by DQN. The analysis also revealed a noticeable disadvantage of these systems: the number of observations required until convergence. SAC needed more than 100,000 observations to reliably yield a near-optimal policy. The following research projects aim at overcoming this limitation. Furthermore, we observed that convergence in a duopoly is hard to achieve when both competitors use self-adjusting policies, a problem that we try to overcome with asymmetric information in multi-agent pricing problems.

Transfer Learning for Revenue Management

Transfer learning formalizes the idea that the convergence speed of an algorithm can be improved by using data sourced from multiple problems. Current RL research assesses a multitude of alternatives, which rely on very different mechanisms to achieve this goal. Training one conventional agent with information sourced from several different MDPs is difficult: the different state and action spaces have to be encoded in a way that a single parametric function can evaluate them and output a meaningful response. Besides formal equality of the state and action spaces, this also requires equality of the state-transition and reward functions, meaning that performing the same action in the same state leads to the same result in all processes. But having equal state and action spaces and equal state-transition and reward probabilities contradicts the idea of having multiple different problems to source data from. Several proposals have been made in the literature to resolve this discrepancy. As an example, we use DISTRAL, an algorithm that uses a multi-policy approach to improve information exchange. In DISTRAL, a separate policy is optimized for each problem used to generate data. Information exchange occurs by regularizing the policies toward a distilled policy, which in turn is obtained by searching for the policy with minimal divergence from all task-specific policies. This algorithm still requires formally equal state and action spaces, but each task-specific policy can take different transition and reward mechanics into account.
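The distillation step can be sketched for a single shared state and a common action set. This is a simplified illustration, not the full DISTRAL objective: the distilled policy here is the average of the task policies (the minimizer of the summed KL divergences KL(pi_task || pi_0) over pi_0), and the regularization is a plain convex-combination step toward it; all policies and the step size are made up.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distilled(policies):
    """Distilled policy: the average of the task policies, which minimizes
    the summed divergences KL(pi_task || pi_0) over the centre pi_0."""
    n = len(policies)
    return [sum(p[a] for p in policies) / n for a in range(len(policies[0]))]

def regularize(policy, center, tau=0.3):
    """Nudge a task policy toward the distilled centre (tau is made up)."""
    mixed = [(1 - tau) * p + tau * c for p, c in zip(policy, center)]
    z = sum(mixed)
    return [m / z for m in mixed]

task_a = [0.7, 0.2, 0.1]   # hypothetical pricing policy learned on product A
task_b = [0.2, 0.6, 0.2]   # hypothetical pricing policy learned on product B
pi0 = distilled([task_a, task_b])
task_a_reg = regularize(task_a, pi0)
```

After the step, the regularized task policy lies strictly closer to the distilled centre, which is the channel through which the tasks exchange information.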

For revenue management, formal equality is often not given either. For different products, the possible price choices and reasonable state descriptions deviate. We propose to dynamically map those different price ranges and market states into a single unified representation and then use DISTRAL to learn unified policies with only the minimally necessary policy deviations. This approach combines several of the proposed mechanisms for multi-task reinforcement learning. Given the specialized nature of the problem under evaluation, we expect further improvements to be possible if the generality of the parametric representation functions is given up to reduce the search space and to allow joint layers shared between the task-specific policies.
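The simplest form of such a mapping is a linear rescaling of each product's price range onto a shared unit interval; the price ranges below are invented for illustration.

```python
# Minimal sketch: map product-specific price ranges onto a unified [0, 1]
# scale so a single shared representation can serve several products.
def to_unified(price, lo, hi):
    """Map a concrete price from [lo, hi] to the shared [0, 1] scale."""
    return (price - lo) / (hi - lo)

def from_unified(u, lo, hi):
    """Map a shared-scale action back into a product's price range."""
    return lo + u * (hi - lo)

# Hypothetical ranges: product A sells between 5 and 10 EUR,
# product B between 100 and 300 EUR.
u = to_unified(7.5, 5.0, 10.0)
price_b = from_unified(u, 100.0, 300.0)
```

In practice the mapping would have to be learned or adapted dynamically, since reasonable price ranges shift with the market state; the linear version only illustrates the unified-representation idea.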

Multi-Agent Systems with Non-Symmetric Knowledge

Past research has shown that it is hard to achieve stable equilibria with multiple reinforcement learning-based agents in the same system, especially if they rely on non-linear function approximation tools. The agents' independent policy updates dramatically increase the difficulty, as each algorithm has to keep up with the others' policy changes, which in turn change its own policy. Several publications have evaluated markets with non-symmetric information and achieved convergence under those conditions. In such a market, one agent has access to all other agents' current policies and can incorporate this information into its learning mechanism by evaluating the expected reward given the competitors' reaction. Previously, such systems were mostly used with RL algorithms that use tabular representations of the value functions. This project aims at evaluating whether such a system achieves convergence when used with non-linear function approximation tools. Furthermore, the structural properties of the process have to be considered. In practice, a competitor lacks the motivation to publish its policy, as this leaves the leader of the game with an advantage. We currently plan to measure whether the faster convergence can lead to improvements for every market participant that outweigh the disadvantage of having to publish one's current policy. If this is not the case, equality can be achieved by allowing each market participant to charge a fee. The correct amount of this fee has in turn to be estimated to achieve equal gains from information exchange, an additional optimization problem that has to be solved in the future.
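The core of the asymmetric-information setting, evaluating one's expected reward given the competitor's known reaction, can be sketched with a toy one-step best response. The leader's reaction rule and the demand model are invented for illustration.

```python
# Toy sketch of asymmetric knowledge in a duopoly: the follower knows the
# leader's published, deterministic reaction policy and chooses the price
# that maximizes its expected one-step revenue given that reaction.
PRICES = [6, 7, 8, 9, 10]

def leader_reaction(follower_price):
    """Hypothetical published leader policy: undercut by one, floor at 6."""
    return max(6, follower_price - 1)

def follower_revenue(follower_price, leader_price):
    """Made-up demand model: the cheaper seller captures most demand."""
    share = 0.8 if follower_price < leader_price else 0.3
    return share * follower_price

def best_response():
    """Evaluate each own price against the leader's known reaction."""
    return max(PRICES, key=lambda p: follower_revenue(p, leader_reaction(p)))

choice = best_response()
```

Because the leader always undercuts in this toy model, the follower's best response is the highest price, which is exactly the kind of reasoning that only becomes possible once the competitor's policy is visible.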

State Space Exploration

In most RL systems, exploration is purely stochastic and action-based. In Q-learning with ε-greedy policies, an action is chosen randomly given the current state. Systems like Soft Actor-Critic encode randomness in their state-dependent policies. We propose to improve this mechanism by allowing state-space-oriented exploration. We aim at finding a mechanism that allows an agent to recognize which areas of the state space it has observed regularly and which have rarely been seen in the past. If the agent knows the state-transition probabilities, this can be achieved by tracking the confidence for each observed state, measured by the number of visits in a discrete state space, and explicitly aiming for those states that have rarely been observed. The formal requirement of knowledge about the state transitions makes this approach suitable only for methods like Approximate Dynamic Programming. For systems without knowledge about transition probabilities, a different method has to be found. For Q-learning, exploration counts can be recorded together with the Q-values themselves, an approach that might make this method suitable for further evaluation.
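Recording counts alongside the Q-values could look like the following count-based sketch, where rarely visited states receive a larger exploration bonus (the bonus formula and the scaling parameter are assumptions, not a method from the text):

```python
import math
from collections import defaultdict

# Hedged sketch of count-based exploration for a discrete state space:
# visit counts are stored alongside the Q-values, and states seen rarely
# receive a larger exploration bonus, drawing the agent toward them.
visits = defaultdict(int)

def record(state):
    visits[state] += 1

def exploration_bonus(state, beta=1.0):
    """Bonus shrinks with the square root of the visit count (beta made up)."""
    return beta / math.sqrt(visits[state] + 1)

# Simulate an uneven visit history with two hypothetical state labels.
for _ in range(99):
    record("frequent_state")
record("rare_state")
```

Adding this bonus to the Q-value during action selection would bias the agent toward rarely observed regions without requiring knowledge of the transition probabilities.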

Outlook

While we are currently evaluating several improvements regarding both multi-task and multi-agent setups, both of which we deem necessary to allow the usage of RL systems in practice, further research in this area is possible and necessary. While all extensions described are currently under evaluation, several recent improvements to reinforcement learning remain unused. These include model-based reinforcement learning, which maintains a learned model of the state transitions, and alternatives to DISTRAL that rely on information exchange methods other than policy exchange. Furthermore, there are RL algorithms that avoid the use of deep models, which presents an interesting alternative, as the current models are not interpretable by their users.

Publications

2021

  • Kastius, A., Schlosser, R.: Dynamic Pricing under Competition using Reinforcement Learning. Journal of Revenue and Pricing Management, 1–22 (2021).