My notes on risks from learned optimisation

A base optimiser (e.g. SGD) is used to create a learned algorithm that completes some task (e.g. classifying pictures as cats or dogs).
Sometimes the learned algorithm is itself an optimiser, i.e. it searches through a space of possible options in order to complete the task better (e.g. an LLM searching through the space of options to give you cooking instructions for the ingredients you happen to have). In this case, the learned algorithm is a mesa-optimiser.
For example, consider humans in evolution… #todo
Worry: by default, the mesa-objective (the objective the mesa-optimiser itself pursues) can be quite different from the base objective
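
To make the base/mesa distinction concrete for myself, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: the 1-D world, the "exit at position 10", the "green door", and the names `base_objective`, `mesa_objective`, `fixed_policy` and `mesa_optimiser`. The base optimiser is also just a pick-the-best-candidate step standing in for something like SGD. The point is only to show a learned algorithm that searches at run time against its own objective, and how that objective can agree with the base objective in training but come apart afterwards.

```python
# Toy sketch only: the world, objectives and policies are hypothetical.

EXIT_POS = 10  # assumed position of the exit that the base objective rewards


def base_objective(state):
    """What the base optimiser scores: how close we end up to the exit."""
    return -abs(state - EXIT_POS)


def mesa_objective(state, door_pos):
    """What the learned algorithm pursues internally: closeness to the green door."""
    return -abs(state - door_pos)


def fixed_policy(start, door_pos, steps=20):
    """A learned algorithm that is NOT an optimiser: a fixed rule, no search."""
    return start + 5


def mesa_optimiser(start, door_pos, steps=20, actions=(-1, 0, 1)):
    """A learned algorithm that IS an optimiser: at every step it searches
    over the available actions and takes the one that scores best on its
    own mesa-objective."""
    state = start
    for _ in range(steps):
        candidates = [state + a for a in actions]
        state = max(candidates, key=lambda s: mesa_objective(s, door_pos))
    return state


def train_score(policy):
    """Base objective on the training distribution, where the green door
    happens to sit exactly at the exit."""
    return base_objective(policy(start=0, door_pos=EXIT_POS))


# Stand-in for the base optimiser: select the candidate with the best training score.
selected = max([fixed_policy, mesa_optimiser], key=train_score)

# In training, door and exit coincide, so the two objectives agree.
train_final = selected(start=0, door_pos=EXIT_POS)

# After training, the door is somewhere else: the selected algorithm still
# competently optimises for the door, which now scores badly on the base objective.
deploy_final = selected(start=0, door_pos=3)

print(selected.__name__)
print(train_final, base_objective(train_final))
print(deploy_final, base_objective(deploy_final))
```

In this toy the selection step prefers the searching policy because it scores better in training, even though the base optimiser never sees or checks the internal objective directly.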

When should we expect the learned algorithm to be a mesa-optimiser?

When the task…
- requires generalisation (more so as the environment gets more and more diverse)
- involves modelling humans (e.g. RLHF)

It also depends on the base optimiser:
- Inductive biases
- Mesa-optimiser needs to be reachable by the base optimiser (i.e. surrounded by other learned algorithms which do well on the task)

Questions:

Are LLMs mesa-optimisers? Should we think of each possible AI as having some level of mesa-optimisation?