# Artificial Intelligence and Sustainability by Prof. Ralf Herbrich

## Summary written by Anton Persitzky, Julian Hackenberg & Dinh Trung Hieu Le

Ralf Herbrich studied Computer Science at Technical University Berlin. He earned his Diploma degree in 1997, focusing on Computer Graphics and Artificial Intelligence (AI), and his PhD degree in 2000, focusing on Theoretical Statistics. Herbrich has held positions at well-known tech companies: apart from doing research himself, he has led research teams in managing positions at Microsoft, Facebook, Amazon, and Zalando. Before this, he worked as a researcher at Microsoft Research and Darwin College, focusing on machine learning as well as other artificial intelligence topics, including approximate computing, Bayesian inference and decision making, and natural language processing. He has published over 80 peer-reviewed conference and journal papers in these research fields. Since April 2022, Ralf Herbrich has held the chair "Artificial Intelligence and Sustainability" at the Hasso Plattner Institute. There, he has also been collaborating with betteries, a Berlin-based startup working on battery upcycling.

Ralf Herbrich presented his chair's research topics in a lecture as part of the Lecture Series on HPI Research. The following is a summary of his talk.

## Lecture Summary

We often gauge a system's intelligence by comparing its performance to a human's. When we interact with a machine from a distance, and it appears like we are interacting with a human, we refer to it as artificially intelligent.

There are various approaches to achieving artificial intelligence (AI). One fundamental approach is machine learning. The concept is that we want a computer program to learn from past experiences. More formally, we have a task we want to achieve, a performance measure that tells us how well the program performs at this specific task, and some previous experience. The task is often called a prediction, the performance measure is called a loss function, and the experience is called training data. A program learns if it performs better using its prior experience.

Classification problems are a prevalent use case for machine learning. These problems involve a finite set of predefined classes. Given some input, we want to determine which class it belongs to. For instance, we might want to read hand-written postal codes. We could process each digit individually and use a machine-learning algorithm to identify it. In this scenario, the input would be a hand-written digit, and the classes would be the digits from zero to nine.

Another use case is regression. In regression, we are interested in the relationship between variables. Given the values of one or more variables, we want to estimate the value of another variable. Temperature predictions in a weather forecast are an example of such a task: given a history of sensor data, we estimate a temperature in the future.
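The temperature-forecast example can be sketched as a simple least-squares regression. The data and the linear model below are invented purely for illustration; they are not from the lecture:

```python
# Toy regression sketch: fit a line y = w*x + b to past temperature
# readings by ordinary least squares, then predict the next value.
# All numbers are made up for illustration.

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

hours = [0, 1, 2, 3, 4]                  # time of each measurement
temps = [10.0, 11.8, 14.1, 16.0, 18.2]   # sensor readings in °C

w, b = fit_line(hours, temps)
forecast = w * 5 + b                     # estimate for hour 5
```

Here the hypothesis space is the set of all lines, and the fitted slope and intercept are the parameters we store.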

The chair focuses on probabilistic machine learning, a specific form of machine learning. The goal is to build an algorithm that takes a point from the input space and outputs a point from the output space.

Many functions could describe the relationship between the input and output spaces. The hypothesis space is the set of these possible functions. Our objective is to find the function of the hypothesis space that most accurately describes the relationship. To achieve this, we require:

- The hypothesis space.
- Prior beliefs. We may already have some assumptions about how likely each function is. We can make use of these assumptions and call them our prior beliefs.
- Training data. Training data demonstrates how the input space relates to the output space.
- The likelihood function. A likelihood function takes the training data and a function from the hypothesis space as inputs. Then, it determines the likelihood of the function being correct based on the training data. It computes this as the probability of observing our training data, assuming that this function accurately describes the relationship between the input and output spaces.

Statistics provides the tools for deriving the best function from the hypothesis space. Namely, these are:

- Maximum Likelihood (ML) if we want to avoid considering prior beliefs.
- Maximum A Posteriori (MAP) for considering our prior beliefs.

Probabilistic machine learning offers two primary advantages. First, it describes learning as optimization in the hypothesis space. Second, storing the algorithm means only storing the function's parameters (i.e., its coordinates) in the hypothesis space. However, this approach has a limitation. We can only find a single best function from the hypothesis space.
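The difference between ML and MAP can be illustrated on a toy problem: estimating a coin's heads probability. The Beta prior and all numbers in this sketch are assumptions chosen for illustration:

```python
# ML vs. MAP on a toy problem: estimating a coin's heads probability.
# The hypothesis space is the interval [0, 1]; a Beta(a, b) prior
# encodes the belief that the coin is roughly fair. All numbers are
# illustrative.

def ml_estimate(heads, flips):
    """Maximum likelihood: the raw frequency of heads."""
    return heads / flips

def map_estimate(heads, flips, a=5, b=5):
    """MAP with a Beta(a, b) prior: the mode of the posterior."""
    return (heads + a - 1) / (flips + a + b - 2)

heads, flips = 3, 4   # observed: 3 heads in 4 flips
p_ml = ml_estimate(heads, flips)     # 0.75 -- trusts the data alone
p_map = map_estimate(heads, flips)   # 7/12 -- pulled toward 0.5 by the prior
```

With little training data, the prior dominates the MAP estimate; as more flips arrive, both estimates converge.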

Using probabilistic machine learning, we can answer the following key questions:

- Given a prediction, some input, and training data, how likely is the prediction correct?
- Given some input and training data, what is the best prediction?

Deep learning is a field that strongly relates to probabilistic machine learning. In deep learning, the functions of the hypothesis space take the form of an artificial neural network (ANN). An ANN comprises artificial neurons. A neuron takes a vector and outputs a single scalar value. It computes this scalar value by first calculating the weighted sum of the input values. For this, each neuron has a vector of weights associated with it. Then, the neuron puts this sum through a predefined function that maps from one scalar value to another. We call this function an activation function. The sigmoid function is a typical example. The ANN consists of layers of these artificial neurons. Each neuron takes the outputs of the neurons from the previous layer as its input. The input for the first layer is the input of the ANN. The output from the last layer is the output of the ANN. The weights of all the neurons in an ANN are the parameters of the function in the hypothesis space. In other words, deep learning uses a particular type of function for the hypothesis space. However, the same principles from probabilistic machine learning apply.
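A single artificial neuron as described above can be sketched in a few lines; the weights and inputs here are illustrative:

```python
import math

# One artificial neuron: a weighted sum of the inputs followed by a
# sigmoid activation. Weights and inputs are illustrative values.

def sigmoid(z):
    """Typical activation function, squashing any scalar into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

out = neuron([1.0, 2.0], [0.5, -0.25])  # z = 0.5 - 0.5 = 0, sigmoid(0) = 0.5
```

Stacking layers of such neurons, each consuming the previous layer's outputs, yields the ANN; the weight vectors are the parameters stored for the learned function.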

Deep learning makes heavy use of the operations of linear algebra. Today, GPUs accelerate these operations.

## Artificial Intelligence meets Energy

The field of AI has primarily focused on improving accuracy. While AI still makes mistakes, it has surpassed human abilities in many fields. For example, in 2021, AI surpassed human accuracy in answering visual questions and understanding English. In 2016, AI beat humans in Go, one of the most complex board games.

Nowadays, we only see marginal improvements in accuracy, and energy consumption has skyrocketed. The chair has extracted three pivotal observations from Stanford University's 2023 Artificial Intelligence Index Report that support these claims.

## Performance Saturation

Firstly, there's a noticeable trend of Performance Saturation within AI. This means that despite our efforts to make AI more accurate, it's not improving as rapidly as it used to. In most tasks, performance gains are now only in the single digits. In the past, we would see big improvements every year, but now they are much smaller. For instance, last year, AI only got about 4% better on average, compared to a historical average of around 42.4%. So, the pace of improvement in AI is slowing down, and the gains we're making are relatively modest.

## Scaling and Costs of AI Models

Figure 1: Estimated Training Cost of Select Large Language and Multimodal Models [1]

The second observation concerns the scaling and costs of AI models. These models are growing in size, and there's a direct link between the cost of training large language and multimodal models and their size. As shown in the figure above, the AI Index supports the widely held belief that these expansive models are becoming increasingly expensive to train, often costing millions of dollars. Furthermore, because the cost is increasing, we can expect that the energy consumption associated with training these models is also rising substantially.

## Impact on Global CO₂ Emissions

Figure 2: CO2 Equivalent Emissions (Tonnes) by Selected Machine Learning Models and Real Life Examples, 2022 [1]

Lastly, the impact on global CO₂ emissions is an increasingly pressing concern. The energy needed to train AI models contributes significantly to rising CO₂ emissions worldwide. As shown in the figure above, training GPT-3 produced roughly as much CO₂ as flying 500 passengers from New York to San Francisco.

Most AI algorithms run on GPUs. A GPU draws between 300 W and 700 W, about as much power as an oven. As training a model often involves running many GPUs, training consumes a lot of energy. For instance, training ChatGPT models can consume the energy equivalent of running 20,000 ovens continuously for several days. Running AI models takes a lot of energy as well: generating an image with DALL-E uses the energy equivalent of fully charging an iPhone. The human brain, by contrast, only consumes 20 W. Training a ChatGPT model consumes more energy than our brain does in a lifetime. With his research, Prof. Herbrich aims to close this gap in energy consumption. He structures the research in three areas: Systems, Methods, and Theory.

## Systems

In the systems area, Prof. Herbrich explores ideas to run existing algorithms more efficiently. This involves both hardware and systems-level software. Prof. Herbrich talked about two concepts for building more efficient systems.

The first concept is **Low-Precision Arithmetic**. Here, the chair investigates the effectiveness of reducing the bit representation of model parameters. Currently, model parameters are typically represented as floating-point numbers using 32 or 64 bits. This representation comprises one bit for the sign, some bits for the mantissa (i.e., the significant digits of the number), and some bits for the exponent (i.e., the scale of the number). The objective is to decrease the number of bits used to represent these numbers: instead of using 64 bits, they explore the feasibility of using 32 bits or even fewer. However, employing fewer bits reduces the number of values that we can represent. We can compensate for this loss in precision by employing probabilistic techniques. The chair has shown that reducing the number of bits yields substantial improvements in energy efficiency. When they reduced the parameter representation from 64 to 8 bits, they saw an 88 percent and 89 percent reduction in energy consumption for the Cholesky and Conjugate Gradient algorithms, respectively, while maintaining a similar level of accuracy.
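As a rough illustration of low-precision representation, the following sketch quantizes parameters to 8-bit integers with a shared scale factor. This generic quantization scheme is an assumption chosen for illustration, not the chair's actual method:

```python
# Minimal sketch of low-precision storage: keep parameters as 8-bit
# signed integers plus one shared scale factor instead of 64-bit
# floats. The scheme and numbers are illustrative only.

def quantize(values, bits=8):
    """Map floats to signed integers with a shared scale factor."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(qs, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in qs]

weights = [0.82, -0.41, 0.07, -0.93]
q, s = quantize(weights)
restored = dequantize(q, s)

# The round trip loses some precision but stays close:
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each parameter now occupies 8 bits instead of 64, and the rounding error stays below half the scale factor; probabilistic techniques can then account for this remaining uncertainty.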

The second concept revolves around **Low-Voltage Processing**. In today's computer systems, bits are stored by transistors. As long as these transistors are powered, they can keep their state indefinitely. The chair collaborates with MIT to utilize transistors operating at lower supply voltages. At room temperature, these transistors can still maintain their state, but changing their state can fail. This results in uncertainty about the state of the system. For many applications, such behavior is unacceptable; in banking, for instance, you need certainty about whether or not a transaction has gone through. However, Prof. Herbrich argues that these transistors are useful for AI, since AI algorithms rely on probability anyway.

## Methods

In the methods domain, the chair focuses on the AI algorithms themselves. Prof. Herbrich asserts that three operations make up most of these algorithms: addition, multiplication, and taking the maximum. He claims we can improve energy efficiency by reducing the number of operations we have to perform. To reduce the number of operations, he exploits mathematical properties. The following table [2] contains examples of reducing the number of operations using the distributive law.

| Operation 1 | Operation 2 | Example |
|---|---|---|
| + | · | a · b + a · c = a · (b + c) |
| max | · | max(a · b, a · c) = a · max(b, c) |
| max | + | max(a + b, a + c) = a + max(b, c) |
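These identities can be checked numerically. One caveat worth noting: distributing multiplication over the maximum holds only for a non-negative factor, which the sketch below also demonstrates:

```python
# Checking the rewrite rules. Rewriting a*b + a*c as a*(b + c)
# replaces two multiplications and one addition with one
# multiplication and one addition.

a, b, c = 2.0, 3.0, 5.0

assert a * b + a * c == a * (b + c)           # +, ·
assert max(a + b, a + c) == a + max(b, c)     # max, +

# Distributing · over max requires a >= 0; a negative factor
# flips the maximum into a minimum.
assert max(a * b, a * c) == a * max(b, c)     # holds since a >= 0
assert max(-a * b, -a * c) == -a * min(b, c)  # the a < 0 case
```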

## Theory

In the theory area, the chair aims to establish a mathematical relationship between energy consumption and accuracy improvements in machine learning. This relationship is unknown since current theories only focus on the relationship between data requirements and learning accuracy. However, both learning improvements and energy consumption are related to information theory.

Algorithmic coding provides the link between learning improvements and information theory. In order to compress data, we need to predict it well. For instance, after encountering “th” in English text, the next letter is more likely to be 'e' rather than 'h'. Hence, by using fewer bits to represent “the” than to represent “thh”, we can improve compression. On the other hand, if we can compress data well, we must also be able to predict it well.
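This link between prediction and compression can be made concrete with Shannon's ideal code length: a symbol predicted with probability p needs about −log₂(p) bits. The probabilities below are invented for illustration:

```python
import math

# Shannon's link between prediction and compression: a symbol that a
# model predicts with probability p can be coded in about -log2(p)
# bits. The conditional probabilities below are made up.

def code_length(p):
    """Ideal code length in bits for a symbol of probability p."""
    return -math.log2(p)

p_e_after_th = 0.40   # 'e' is likely after "th"
p_h_after_th = 0.001  # 'h' is very unlikely after "th"

bits_e = code_length(p_e_after_th)   # ~1.32 bits
bits_h = code_length(p_h_after_th)   # ~9.97 bits
```

A better predictor assigns higher probability to what actually occurs, and therefore produces a shorter encoding; conversely, a short encoding certifies good prediction.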

Landauer’s principle relates information theory to energy consumption. It states the minimum amount of energy needed to erase one bit of information. This number is actually very small: only about 2.9×10⁻²¹ joules at room temperature. Prof. Herbrich suggests we can use these two relationships to investigate the relationship between energy consumption and accuracy improvements.
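The figure of 2.9×10⁻²¹ joules follows from evaluating Landauer's bound E = kT ln 2 at room temperature, taking T = 300 K:

```python
import math

# Landauer's bound E = k * T * ln(2) at room temperature.
# k is Boltzmann's constant; T = 300 K stands in for room temperature.

k = 1.380649e-23   # Boltzmann constant in J/K (exact since SI 2019)
T = 300.0          # room temperature in kelvin

energy_per_bit = k * T * math.log(2)   # ~2.87e-21 J, i.e. ~2.9e-21 J
```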

## Energy meets Artificial Intelligence

After talking about making AI more energy-efficient, Professor Ralf Herbrich introduced a whole new research area. In this part of the lecture, Herbrich talked about possible solutions to some of the problems of renewable energy.

The amount of solar energy hitting the earth is about 200,000 times as much as the human population currently needs. However, because of the earth's rotation and varying weather conditions, solar energy is not available at all times. Therefore, energy storage solutions are required to use it effectively.

In an overview of possible energy storage solutions, Herbrich mentions pumped hydroelectric energy storage, but also, due to their high energy density, coal and oil. These fossil energy sources come, however, with two major drawbacks: they take very long to regenerate, and consuming them is very harmful to the planet's environment.

Figure 3: Battery Life Cycle [2]

Batteries have improved massively as energy storage devices over the past decades, especially in recent years with the increasing number of electric vehicles.

Due to the nature of the chemical processes of charging and discharging, the overall capacity of a battery decreases with the number of charging cycles. When a battery reaches 70% of its original capacity, it can no longer be used in a car, because the car’s range decreases with the battery capacity. During this first life, the battery barely offsets the CO₂ emissions of its production, if at all. However, the battery can still be used for other applications, thus gaining a second life and improving sustainability. This can be seen in the figure above.

The main factor currently restricting a second life is the unknown state of each battery cell after its first life. This leads to uneven balancing, resulting in a shorter second life span. Prof. Herbrich proposes that battery managers should try to extend the life of batteries. Currently, the functionality of battery managers is limited to the prevention of overheating. Prof. Herbrich proposes that battery management systems could prolong the second life by using AI models that predict the state of each battery cell. To build these models, Prof. Herbrich suggests simulating the life of batteries. Given simulated sensor data like voltages and temperatures, he wants to use probabilistic machine learning to infer the current and the initial state of each battery. He claims that this would lead to a causal model that would help the battery manager make optimal decisions.
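As a toy illustration of inferring a cell's state probabilistically, the following sketch maintains a belief over a few discrete state-of-health hypotheses and updates it with Bayes' rule from noisy readings. The sensor model and all numbers are invented; this is not the chair's actual model:

```python
import math

# Toy probabilistic state inference for a battery cell: keep a belief
# over discrete "state of health" hypotheses and update it with
# Bayes' rule from noisy sensor readings. Everything here is invented
# for illustration.

def gaussian(x, mean, std):
    """Density of a normal distribution, used as the sensor model."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypotheses: the cell retains 100%, 85%, or 70% of its capacity.
states = [1.00, 0.85, 0.70]
belief = [1 / 3, 1 / 3, 1 / 3]        # uniform prior

def update(belief, reading, noise_std=0.05):
    """Bayes' rule: reweight each hypothesis by the sensor likelihood."""
    # Assumed sensor model: reading ≈ state of health, plus noise.
    weights = [b * gaussian(reading, s, noise_std) for b, s in zip(belief, states)]
    total = sum(weights)
    return [w / total for w in weights]

for reading in [0.83, 0.87, 0.84]:    # simulated noisy measurements
    belief = update(belief, reading)

best = states[belief.index(max(belief))]   # most probable hypothesis
```

After a few readings near 0.85, the belief concentrates on the 85% hypothesis; a battery manager could base its balancing decisions on such a posterior rather than on a single point estimate.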

## Summary

We gauge a system's intelligence by comparing its performance to a human's. When we interact with a machine from a distance, and it appears like we are interacting with a human, we refer to it as artificially intelligent. Such artificial intelligence has been achieved in many areas. However, these systems consume orders of magnitude more power than the human brain. One key focus of the chair is building such a system that consumes no more power than the human brain.

A computer program learns if it performs better using prior experiences. This is called machine learning. The chair focuses on a specific form of machine learning called probabilistic machine learning. In particular, they are developing methods and hardware to make probabilistic machine learning more energy efficient.

Another key focus is deriving a theory about the minimal amount of energy that is necessary to improve the accuracy of an artificially intelligent system.

Lastly, the chair develops algorithms and systems that help improve battery life.

## Resources

[1] Artificial Intelligence Index Report 2023 - Stanford University (stanford.edu)

[2] Lecture "Artificial Intelligence and Sustainability" (tele-task.de)