Chapter 9. Meta-Learning#

\[ % Latex macros \newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}} \newcommand{\p}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\b}[1]{\boldsymbol{#1}} \newcommand{\c}[1]{\mathcal{#1}} \newcommand{\w}{\boldsymbol{w}} \newcommand{\x}{\boldsymbol{x}} \]

Contents#

  • Tuning meta-parameters

  • Sensory representation learning

  • Models and latent variables

  • Modularity and compositionality

  • Pathway gating

  • Meta-reinforcement learning

We learn to understand and control many things in our lives, and learning one task often makes learning another task easier. This observation has been studied under various keywords, such as

  • Lifelong learning (Thrun 1996)

  • Meta learning (Doya 2002)

  • Transfer learning (Taylor & Stone 2009)

Tuning meta-parameters#

Learning algorithms change the parameters of a model, such as the weights of a neural network, but most algorithms also have higher-level parameters that control how learning proceeds. These are called hyperparameters or meta-parameters, and they control the balance of various trade-offs. Meta-parameters are often tuned by machine learning engineers based on domain knowledge or past experience, but making such tuning automatic or unnecessary is an important research topic.

How that is done in our brain is also an important question in neuroscience.

Learning Rate#

The learning rate parameter deals with the trade-off between quick learning and stable memory. Stochastic gradient descent and other online learning algorithms take the form

\[ w(t+1) = w(t) + \alpha(u(t) - w(t)) \]

where \(u(t)\) is the input for the parameter update.

This can be re-formulated as

\[ w(t+1) = \sum_{s=1}^t \alpha(1-\alpha)^{t-s}u(s)\]

showing that \(w\) is an exponentially weighted average of past inputs \(u(s)\) with the decaying factor \(1-\alpha\). With a large \(\alpha\) close to one, past experiences are quickly forgotten.

To effectively average over about \(N\) past samples, the learning rate has to be set at the scale of \(\alpha\simeq\frac{1}{N}\).

In a stationary environment, a good strategy is to take a uniform average of all past inputs

\[ w(t+1) = \frac{1}{t}\sum_{s=1}^t u(s) \]

This can be realized by a hyperbolically decaying learning rate \(\alpha=\frac{1}{t}\), because

\[ w(t+1) = \frac{1}{t}(\sum_{s=1}^{t-1}u(s) + u(t)) = \frac{1}{t}((t-1)w(t) + u(t)) \]
\[ = w(t) + \frac{1}{t}(u(t) - w(t)) \]
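As a minimal sketch (the input stream and variable names are illustrative assumptions, not from the text), the constant-\(\alpha\) and \(\alpha=\frac{1}{t}\) update rules can be compared on a stationary input stream:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(loc=1.0, scale=0.5, size=1000)   # stationary input stream (assumed)

alpha = 0.1            # constant learning rate ~ averaging over the last ~10 samples
w_const, w_hyper = 0.0, 0.0
for t, u_t in enumerate(u, start=1):
    w_const += alpha * (u_t - w_const)          # exponentially weighted average
    w_hyper += (1.0 / t) * (u_t - w_hyper)      # alpha = 1/t: exact running mean

print(w_const, w_hyper, u.mean())               # w_hyper coincides with the sample mean
```

With \(\alpha=\frac{1}{t}\) the estimate equals the sample mean exactly, while the constant-\(\alpha\) estimate weights recent samples more heavily and can track a non-stationary input.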

Exploration-Exploitation#

In reinforcement learning, the balance between exploration of novel actions and exploitation of actions known to be good is controlled by \(\epsilon\) in \(\epsilon\)-greedy action selection and by the inverse temperature \(\beta\) in softmax or Boltzmann action selection

\[ p(a|s) = \frac{e^{\beta Q(s,a)}}{\sum_{a'\in\c{A}}e^{\beta Q(s,a')}} \]

where \(\c{A}\) is the set of available actions.
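A minimal sketch of Boltzmann action selection (the Q-values and \(\beta\) settings below are illustrative assumptions):

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Boltzmann action selection: p(a|s) proportional to exp(beta * Q(s,a))."""
    z = beta * np.asarray(q_values, dtype=float)
    z -= z.max()                    # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = [1.0, 1.2, 0.8]                  # illustrative action values
print(softmax_policy(q, beta=0.5))   # small beta: nearly uniform (exploration)
print(softmax_policy(q, beta=10.0))  # large beta: nearly greedy (exploitation)
```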

This can be seen as maximization of the negative free energy

\[ -F = E[ Q(s,a) - \frac{1}{\beta}\log p(a|s)] \]

which is a sum of expected action value and the entropy of action selection probability.

As learning proceeds, the inverse temperature is gradually increased, or equivalently the temperature \(\tau=\frac{1}{\beta}\) is reduced, in order to reduce exploration and promote exploitation. This is called annealing.
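One possible implementation is sketched below; the particular schedules (hyperbolic temperature decay, exponentially decaying \(\epsilon\) with a floor) are assumptions for illustration, not prescriptions from the text.

```python
import numpy as np

# Illustrative annealing schedules: exploration is reduced as learning proceeds.
episodes = np.arange(0, 5001, 1000)
tau = 5.0 / (1.0 + 1e-3 * episodes)              # temperature decays hyperbolically
beta = 1.0 / tau                                 # inverse temperature grows
eps = np.maximum(0.01, 0.5 * 0.999 ** episodes)  # epsilon-greedy with a small floor
for e, b, ep in zip(episodes, beta, eps):
    print(f"episode {e:5d}  beta = {b:5.2f}  epsilon = {ep:.3f}")
```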

Another way to promote exploration in the early stages of learning is to give an additional reward for states or state-action pairs that have not been tried before. This is known as a novelty bonus (Kakade & Dayan, 2002). A similar effect can be realized by an optimistic initial setting of the value functions.

A further sophistication is Bayesian reinforcement learning, which tries to learn not just the expected reward but the distribution of the reward \(P(r|s,a)\). Starting from a flat prior, as the reward distribution becomes sharper there is less need for exploration. Knowledge of the reward distribution further allows optimistic, risk-seeking action selection or conservative, risk-avoiding action selection.
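A minimal sketch of this idea for Bernoulli rewards, using Beta posteriors and Thompson sampling (the two-armed bandit, the flat prior, and the sampling rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-armed Bernoulli bandit with Beta posteriors, starting from a flat prior Beta(1, 1).
true_p = np.array([0.3, 0.6])        # unknown reward probabilities (assumed)
a_post = np.ones(2)                  # prior successes + 1
b_post = np.ones(2)                  # prior failures + 1

for t in range(500):
    # Thompson sampling: draw a plausible reward rate from each posterior, act greedily on it.
    samples = rng.beta(a_post, b_post)
    a = int(np.argmax(samples))
    r = float(rng.random() < true_p[a])
    a_post[a] += r
    b_post[a] += 1.0 - r

# As the posteriors sharpen, the sampled estimates concentrate and exploration fades.
# An upper-quantile (optimistic, risk-seeking) or lower-quantile (conservative,
# risk-avoiding) criterion could replace sampling here.
print(a_post / (a_post + b_post))    # posterior mean reward estimates per action
```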

Temporal discount factor#

The temporal discount factor \(\gamma\) defines the temporal scale of maximization of future rewards

\[ E[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ...] \]

A low setting of \(\gamma\) can result in short-sighted or impulsive behaviors that neglect the long-term consequences of an action.

Although a high setting of \(\gamma\) promotes long-term optimal behavior, setting \(\gamma\) very close to one makes prediction more demanding and thus learning takes a long time.

For an average reward \(r_t\sim r_0\), the value function is on the order of

\[ V \simeq \frac{r_0}{1-\gamma} \]

which grows very large as \(\gamma\) is set close to one. Then the temporal difference error

\[ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \]

is dominated by the temporal difference of value functions, responding less to the encountered reward \(r_{t+1}\).

If it is known that approximately \(n\) steps are necessary from the initial state to the goal states, the temporal discount factor should be set at the order of \(\frac{1}{1-\gamma}=n\), i.e., \(\gamma=1-\frac{1}{n}\).
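A quick numerical check of these scale relations (the reward level \(r_0\) and the \(\gamma\) values are arbitrary):

```python
# Quick numerical check of the scale relations above (r0 = 1 for simplicity).
r0 = 1.0
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)    # effective planning horizon n = 1/(1 - gamma)
    v_scale = r0 / (1.0 - gamma)     # order of magnitude of the value function
    print(f"gamma = {gamma}: horizon ~ {horizon:.0f} steps, V ~ {v_scale:.0f}")
```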

Self-tuning of meta-parameters#

Can we use reinforcement learning or evolutionary algorithms to find good meta-parameters for a given range of tasks or environments?

This idea has been tested with reinforcement learning (Schweighofer & Doya 2003) and evolutionary algorithms (Elfwing et al. 2011).

More recently, the idea of automating parameter tuning and model selection has been addressed under the label of AutoML.

Neuromodulators#

Neuromodulators are a subset of neurotransmitters that are not simply excitatory or inhibitory, but have complex, sometimes long-lasting effects depending on the receptors. The most representative neuromodulators are dopamine (DA), serotonin (5HT), noradrenaline (NA; also called norepinephrine, NE), and acetylcholine (ACh).

The neurons that synthesize those neuromodulators are located in specific areas in the brain stem and their axons project broadly to the cerebral cortex and other brain areas.

Based on these features, neuromodulators have been proposed to broadcast certain signals and to regulate global parameters of brain function.

Figure (Fig_Modulators): Neuromodulators. Major neuromodulators and their projections. Red: dopamine (DA) from the ventral tegmental area (VTA) and the substantia nigra pars compacta (SNc). Green: serotonin (5HT) from the dorsal raphe (DR) and median raphe (MR) nuclei. Blue: noradrenaline (NA), or norepinephrine (NE), from the locus coeruleus (LC). Magenta: acetylcholine (ACh) from the septum (S), the nucleus basalis of Meynert (M), and the pedunculopontine tegmental nucleus (PPTN). (from Doya, 2002)

Dopamine neurons in the ventral tegmental area (VTA) and the substantia nigra pars compacta (SNc) have been shown to represent the reward prediction error, the most important global learning signal \(\delta\) in reinforcement learning (Schultz 1998).

Building on this notion, Doya (2002) proposed that the other major neuromodulators also encode and regulate global signals in reinforcement learning: serotonin for the temporal discount factor \(\gamma\), noradrenaline for the inverse temperature \(\beta\), and acetylcholine for the learning rate \(\alpha\).

Serotonin and delayed reward#

Motivated by this hypothesis, there has been a series of studies assessing the role of serotonin in regulating behaviors for delayed rewards.

For example, Miyazaki et al. (2011) showed that serotonergic neurons in the dorsal raphe nucleus increased firing while rats kept waiting for a delayed food pellet or for a water spout to come out.

More recently, it was shown in multiple laboratories that optogenetic activation of dorsal raphe serotonin neurons promotes behaviors for delayed rewards (Miyazaki et al. 2014, 2018; Lottem et al. 2018).

However, the dorsal raphe serotonin neurons have been implicated in other behaviors as well, such as flexible switching of choice in the reversal task (Matias et al. 2017).

Grossman et al. (2021) proposed a meta-learning model in which the learning rate is increased with an increase in the reward prediction error, and suggested that serotonin is involved in meta-learning by monitoring recent reward prediction errors.

Acetylcholine, noradrenaline and learning#

Acetylcholine has been shown to facilitate learning from new sensory inputs (Hasselmo & Sarter 2010).

Noradrenergic neurons in the locus coeruleus have been shown to have phasic and tonic modes of operation. While phasic activation has been suggested to promote selection of the optimal response for exploitation (Usher et al. 1999), tonic activity has been suggested to promote exploration of suboptimal actions (Aston-Jones & Cohen 2005).

Based on a Bayesian framework of learning, Yu and Dayan (2005) proposed a theory of differential roles of acetylcholine and noradrenaline; acetylcholine for expected uncertainty and noradrenaline for unexpected uncertainty.

Sensory representation learning#

When an animal learns a behavior, such as in classical conditioning, a big challenge is to figure out which of the sensory inputs, such as vision, audition, and odor, is relevant for the required response. Once the animal realizes which sensory modality, such as sound, is relevant in the task, subsequent learning, for example linking other sounds to actions or rewards, can be much faster.

Courville and colleagues derived a Bayesian framework for how an animal infers the hidden cause of stimuli and rewards (Courville et al. 2005).

Having an appropriate set of sensory features is critical in pattern recognition and other tasks (Bengio et al. 2012). A common practice in visual object recognition is to reuse the features learned in the hidden layers of a deep neural network trained on a large dataset, and to retrain only the weights in the upper layers for a new but similar task, as sketched below.
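A sketch of this recipe in PyTorch, assuming a torchvision ResNet-18 pretrained on ImageNet as the feature extractor; the model choice, the 10-class head, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Freeze the pretrained feature extractor and retrain only a new classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                  # keep the learned representations fixed

num_classes = 10                                 # the new, similar task (illustrative)
model.fc = nn.Linear(model.fc.in_features, num_classes)   # fresh, trainable top layer

# Only the new head's parameters are given to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

Because the lower layers are frozen, only the new head is updated, so training for the new task is fast and needs relatively little data.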

Models and latent variables#

In reinforcement learning, an agent learns a policy given the environmental dynamics and reward setting. In the model-free approach, an agent learns a policy for each setting of the environment.

In the model-based approach, an agent learns internal models of the state dynamics and the reward function, and uses internal computation to infer the best action. This indirectness imposes a computational burden on real-time execution, but may provide a benefit in adaptation, for example when only the goal or the reward setting changes while the state dynamics stay the same.
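A minimal numpy sketch of this benefit (the random transition model, reward vectors, and sizes are illustrative assumptions): once a transition model has been learned, a change in the reward setting only requires re-planning, not re-learning the dynamics.

```python
import numpy as np

rng = np.random.default_rng(2)

# A learned (here random, for illustration) transition model P[s, a, s'] over 5 states
# and 2 actions, with discount factor gamma.
n_states, n_actions, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def plan(r, n_iter=200):
    """Value iteration on the model: V(s) = max_a [ r(s) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = np.zeros(n_states)
    for _ in range(n_iter):
        V = np.max(r[:, None] + gamma * P @ V, axis=1)
    return V

r_old = np.zeros(n_states); r_old[-1] = 1.0      # original reward at the last state
r_new = np.zeros(n_states); r_new[0] = 1.0       # only the reward location has changed
V_old, V_new = plan(r_old), plan(r_new)          # re-planning reuses the same model P
print(V_old, V_new)
```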

The behavioral benefit and neural mechanisms of model-based reinforcement learning were demonstrated by Glascher et al. (2010), who asked Caltech students to first learn a tree-like state transition, then disclosed the rewards at the leaf nodes, and finally let them find the right action sequence.

Daw et al. (2011) used a similar two-step task and analyzed how subjects respond to a large reward following a rare transition in the first step.

Modularity and compositionality#

In learning a model of the environment, rather than learning a monolithic model that predicts everything, it is more practical to learn multiple models for different situations or aspects of the environment and to select or combine them as required.

Learning and use of modular internal models have been demonstrated, for example, in motor control tasks (Ghahramani & Wolpert 1997). Such notions promoted computational architectures for modular learning and control, such as the MOSAIC architecture (Wolpert et al. 2003).

In reinforcement learning and optimal control theory, how to design controllers that can be efficiently combined is an area of active research (Todorov 2009).

Yang et al. (2018) trained a single recurrent neural network to perform 20 different cognitive tasks and analyzed how the hidden neurons are used in different tasks. They found that the hidden neurons developed into multiple clusters specialized for different cognitive processes, allowing compositional reuse of learned modules.

In cognitive science, compositionality of models and skills is also regarded as a major component in human intelligence (Lake et al. 2017).

Pathway gating#

Most functional MRI studies assume that, when a subject is asked to perform a certain task, the computational modules required for that task are activated, and use this assumption to identify brain areas specialized in particular computations. However, it is unknown how those required modules are activated and connected appropriately. This poses interesting problems at both the computational and the biophysical implementation levels.

Wang and Yang (2018) termed this the pathway gating problem. They considered possible biophysical mechanisms, focusing on the roles of different inhibitory neurons in the cortical circuit.

Fernando et al. (2017) proposed PathNet, which uses evolutionary optimization to find which pathway in a deep neural network should be used for each particular task.

Meta-reinforcement learning#

While reinforcement learning is based on gradual learning of the policy, animals and humans sometimes change their policy quickly based on the recent experience of actions and rewards. A typical example is the “win-stay-lose-switch” strategy.

Ito & Doya (2015) analyzed the choice sequences of rats in a binary choice task and showed that a finite-state model, which assumes that the agent transitions between a number of discrete states based on its action-reward experience and takes a policy depending on the state, fit the animals’ behavior better than reinforcement learning models.

Wang et al. (2018) trained a recurrent neural network to perform reinforcement learning tasks with variable parameters, such as variable reward probabilities for the left and right choices in a bandit task. After sufficient training, they found that the network could adapt to new parameter settings even when its weights were fixed.

This was because the task-relevant features, such as the reward probabilities for different actions, were represented and updated in the hidden state of the recurrent neural network, and the output changed with these latent variables. They called this mechanism “meta-reinforcement learning.”
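A sketch in the spirit of this architecture, assuming PyTorch; the network sizes, the GRU cell, and the way the previous action and reward are fed in are illustrative assumptions rather than the published model.

```python
import torch
import torch.nn as nn

class MetaRLAgent(nn.Module):
    """Recurrent agent whose input includes the previous action and reward, so that
    task variables (e.g. bandit reward probabilities) can be tracked in the hidden
    state even with fixed weights. Sizes and wiring are illustrative assumptions."""

    def __init__(self, n_actions=2, hidden_size=48):
        super().__init__()
        self.n_actions = n_actions
        self.rnn = nn.GRUCell(n_actions + 1, hidden_size)    # input: one-hot action + reward
        self.policy = nn.Linear(hidden_size, n_actions)      # action logits
        self.value = nn.Linear(hidden_size, 1)               # state-value estimate

    def step(self, prev_action, prev_reward, h):
        x = torch.cat([
            nn.functional.one_hot(prev_action, self.n_actions).float(),
            prev_reward.unsqueeze(-1),
        ], dim=-1)
        h = self.rnn(x, h)                                   # update the latent task estimate
        return self.policy(h), self.value(h), h

# One interaction step in a two-armed bandit (illustrative usage).
agent = MetaRLAgent()
h = torch.zeros(1, 48)
prev_a, prev_r = torch.tensor([0]), torch.tensor([0.0])
logits, value, h = agent.step(prev_a, prev_r, h)
action = torch.distributions.Categorical(logits=logits).sample()
```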

In large language models (LLMs), the network produces a variety of outputs depending on the text presented, parts of which provide the context or task domain. Such adaptation of the output to preceding inputs is called “in-context learning.”

References#

Learning rate and exploration#

Serotonin:#

  • Miyazaki K, Miyazaki KW, Doya K (2011) Activation of dorsal raphe serotonin neurons underlies waiting for delayed rewards. The Journal of Neuroscience, 31:469-479. https://doi.org/10.1523/JNEUROSCI.3714-10.2011

  • Miyazaki KW, Miyazaki K, Tanaka KF, Yamanaka A, Takahashi A, Tabuchi S, Doya K (2014) Optogenetic activation of dorsal raphe serotonin neurons enhances patience for future rewards. Current Biology, 24:2033-2040. https://doi.org/10.1016/j.cub.2014.07.041

  • Miyazaki K, Miyazaki KW, Yamanaka A, Tokuda T, Tanaka KF, Doya K (2018) Reward probability and timing uncertainty alter the effect of dorsal raphe serotonin neurons on patience. Nature Communications, 9:2048. https://doi.org/10.1038/s41467-018-04496-y

  • Lottem E, Banerjee D, Vertechi P, Sarra D, Lohuis MO, Mainen ZF (2018). Activation of serotonin neurons promotes active persistence in a probabilistic foraging task. Nat Commun, 9, 1000. https://doi.org/10.1038/s41467-018-03438-y

  • Grossman CD, Bari BA, Cohen JY (2021). Serotonin neurons modulate learning rate through uncertainty. Curr Biol, 10.1016/j.cub.2021.12.006. https://doi.org/10.1016/j.cub.2021.12.006

  • Doya K, Miyazaki KW, Miyazaki K (2021). Serotonergic modulation of cognitive computations. Current Opinion in Behavioral Sciences, 38, 116-123. https://doi.org/10.1016/j.cobeha.2021.02.003

Acetylcholine and noradrenaline (norepinephrine)#

Representation learning#

Model-based strategies#

  • Glascher J, Daw N, Dayan P, O’Doherty JP (2010). States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66, 585-95. https://doi.org/10.1016/j.neuron.2010.04.016

  • Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ (2011). Model-based influences on humans’ choices and striatal prediction errors. Neuron, 69, 1204-15. https://doi.org/10.1016/j.neuron.2011.02.027

Modularity and compositionality#

Pathway gating#

Meta-reinforcement learning#

  • Ito M, Doya K (2015). Parallel representation of value-based and finite state-based strategies in the ventral and dorsal striatum. PLoS Comput Biol, 11, e1004540. https://doi.org/10.1371/journal.pcbi.1004540

  • Wang JX, Kurth-Nelson Z, Kumaran D, Tirumala D, Soyer H, Leibo JZ, Hassabis D, Botvinick M (2018). Prefrontal cortex as a meta-reinforcement learning system. Nat Neurosci, 21, 860-868. https://doi.org/10.1038/s41593-018-0147-8