by Calvin Leather, Yuqing Hu

In response to this article: http://www.jneurosci.org/content/36/39/10016

Recent literature in reinforcement learning has demonstrated that the context in which a decision is made influences subjects' reports and the neural correlates of perceived reward. For example, consider visiting a restaurant where you have previously had many excellent meals. Expecting another excellent meal, you receive a merely satisfactory one, and your subjective experience is negative. Had you received this objectively decent meal elsewhere, without the positive expectations, your experience would have been better. This intuition is captured in adaptive models of value, in which a stimulus's reward (i.e., its Q-value) is expressed relative to the expected reward in a situation; such models have been found to accurately capture activation in value regions (Palminteri et al., 2015). An adaptive model is also beneficial because it allows reinforcement learning agents to learn to avoid punishment: avoiding a contextually expected negative payoff yields a positive relative reward. Avoidance learning had previously been challenging to express within the standard reinforcement learning framework (Kim et al., 2006).
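
To make the avoidance-learning point concrete, here is a minimal sketch (our own illustration, not the authors' implementation) of a Q-learner whose prediction errors are computed against a learned context value; the learning rate, avoidance probability, and function names are arbitrary choices for the example.

```python
import random

ALPHA = 0.1  # learning rate (arbitrary)

def update(q, context_value, objective_reward):
    """One adaptive update: reward is taken relative to the context's expected value."""
    relative_reward = objective_reward - context_value
    return q + ALPHA * (relative_reward - q)

# Loss-avoidance context: outcomes are 0 (avoided) or -10 (punished), expected value -5.
q_avoid = 0.0
for _ in range(200):
    outcome = 0 if random.random() < 0.8 else -10   # mostly successful avoidance
    q_avoid = update(q_avoid, context_value=-5.0, objective_reward=outcome)

print(round(q_avoid, 2))  # converges toward a positive value (~ +3), unlike an absolute model
```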

Alongside these benefits, there has been concern that adaptive models might be confused by certain choice settings. In particular, an agent with an adaptive model of value would have an identical hedonic experience (i.e., identical Q-values in the model) when receiving a reward of +10 units in a setting where it might receive either +10 or 0 units, and when receiving a reward of 0 units in a setting where it might receive either -10 or 0 units (we will refer to this later as the ‘confusing situation’). With this issue in mind, Burke et al. (2016) develop an extension of the adaptive model in which contextual information has only a partial influence on reward. Whereas the fully-adaptive model assigns a subjective reward (Q-value) of +5 units to an objective reward of 0 in the context where the possibilities were 0 and -10, and an absolute model ignoring context would experience a reward of 0, the Burke model would experience a reward of +2.5. It takes the context into account, but only partially, and accordingly they call their model ‘partially-adaptive’. Burke et al. compare this partially-adaptive model with a fully-adaptive model and an absolute model (which ignores context). When subjects were given the same contexts and choices as in the confusing situation outlined above, Burke et al. found that the partially-adaptive model fit neural data in the vmPFC and striatum better than the fully-adaptive or absolute models.
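
The three encodings can be compared with a short worked example. Here we read all three as one weighted formula, subjective value = objective reward minus w times the context's expected value; the weight of 0.5 for the partially-adaptive case is our reading of "partial," not a parameter reported by the authors.

```python
def subjective(reward, context_mean, w):
    """Subjective value under a contextual weight w (0 = absolute, 1 = fully adaptive)."""
    return reward - w * context_mean

gain_mean, loss_mean = 5.0, -5.0   # contexts {+10, 0} and {0, -10}

for w, label in [(0.0, "absolute"), (0.5, "partial"), (1.0, "full")]:
    best_gain = subjective(10, gain_mean, w)   # receiving +10 in the gain context
    best_loss = subjective(0, loss_mean, w)    # receiving 0 in the loss context
    print(f"{label:8s}  +10|gain -> {best_gain:+.1f}   0|loss -> {best_loss:+.1f}")

# absolute: +10 vs 0 (distinct); full: +5 vs +5 (confused); partial: +7.5 vs +2.5 (distinct)
```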

The partially-adaptive model is interesting because it retains the advantages of the fully-adaptive model (reflecting subjective experience and neural data well, and allowing for avoidance learning) while potentially avoiding the confusion outlined above. Here, we seek to investigate the implications and benefits of Burke et al.'s partially-adaptive model more thoroughly. In particular, we will consider the confusing situation's ecological validity and potential resolution, whether it is reasonable that partially-adaptive representations might extend beyond decision (to learning and memory), and the implications of the theory for future work. Before we do this, we would like to briefly present an alternative interpretation of their findings.

The finding that the fMRI signal is best classified by a partially-adaptive model does not necessarily entail that the brain utilizes a partially-adaptive encoding as the value representation over which decisions are made. All neurons within a voxel can influence the fMRI signal, so the signal may reflect a combination of multiple activity patterns present within that voxel. This mixing phenomenon has been used to explain the success of decoding orientation from early visual cortex, where the overall fMRI signal in a voxel reflects the specific distribution of orientation-selective columns within the voxel (Swisher et al., 2010). Similarly, the partially-adaptive model's fit might be explained by the averaged contributions of some cells with a fully-adaptive encoding and other cells with absolute encodings of value (within biological constraints). This concern is supported by the co-occurrence of adaptive and non-adaptive cells in macaque OFC (Kobayashi et al., 2010). Therefore, more work is needed to understand the local circuitry and encoding heterogeneity of regions supporting value-based decision making.
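
A toy simulation of our own (with arbitrary proportions and noise, not a model of the actual data) illustrates the concern: a "voxel" signal built as an equal mix of fully-adaptive and absolute responses is fit best by a partially-adaptive regressor, even though no individual unit is partially adaptive.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

context = rng.integers(0, 2, n)                    # 0 = loss context, 1 = gain context
context_mean = np.where(context == 1, 5.0, -5.0)
reward = np.where(context == 1,
                  rng.choice([0.0, 10.0], n),      # gain context outcomes
                  rng.choice([-10.0, 0.0], n))     # loss context outcomes

absolute = reward
fully_adaptive = reward - context_mean
voxel = 0.5 * absolute + 0.5 * fully_adaptive + rng.normal(0, 1.0, n)  # mixed population

for name, regressor in [("absolute", absolute),
                        ("full", fully_adaptive),
                        ("partial", reward - 0.5 * context_mean)]:
    r = np.corrcoef(voxel, regressor)[0, 1]
    print(f"{name:8s} correlation with voxel signal: {r:.3f}")
# The partial regressor equals the noiseless mixture here, so it correlates best.
```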

Returning to the theory presented by the authors, we would like to consider whether a fully-adaptive encoding of value is truly suboptimal. The type of confusing situation presented above was shown to be problematic for real decision makers by Pompilio and Kacelnik (2010), where starlings became indifferent between two options with different objective values because of the contexts in which those options had appeared during training. However, this type of choice context might not be ecologically valid. If two stimuli are exclusively evaluated within different contexts, as in Pompilio and Kacelnik, it does not matter whether they are confusable, as the decision maker would never need to compare them.

Separate from the confusion problem's ecological validity is the question of its solution. Burke et al. suggest that partially-adaptive encoding avoids confusion and therefore should be preferred to a fully-adaptive encoding. However, this might only be true for the particular payoffs used in the experiment. Consider a decision maker who makes choices in two contexts. One, the loss context, has two outcomes, L0 (worth 0) and Lneg (worth less than 0), while the other, the gain context, has two outcomes, G0 (worth 0) and Gpos (worth more than 0). If L0 - Lneg = Gpos - G0, as in Burke et al., a fully-adaptive agent would be indifferent between G0 and Lneg (and between Gpos and L0). A partially-adaptive agent, however, would not be indifferent, as the value of G0 would be higher than that of Lneg. Now consider what happens if we raise the value of Gpos. By doing this, we can raise the average value of the gain context by any amount. Consider what this does to the experienced value (Q-value) of G0: as we increase the average reward of the context, G0 becomes a poorer option in terms of its Q-value. Since the only reward we are changing is Gpos, the Q-values for the loss context do not change. Therefore, we can decrease the Q-value of G0 until it equals that of Lneg. This is exactly the confusion that we had hoped the partially-adaptive model would avoid. Furthermore, this argument applies to any partially-adaptive model: we cannot defeat the concern by parameterizing the influence of context in the update equations and manipulating that parameter, because any nonzero contextual weight permits a Gpos large enough to produce the confusion.
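
A worked instance of this argument, using the same weighted formula as in the sketch above (w = 0.5 for the partially-adaptive model, Lneg = -10 as in the experiment; the specific numbers are ours), shows the Q-values of G0 and Lneg being driven to equality as Gpos grows.

```python
def subjective(reward, context_mean, w=0.5):
    """Partially-adaptive value with contextual weight w."""
    return reward - w * context_mean

L_neg, L_0 = -10.0, 0.0
loss_mean = (L_neg + L_0) / 2.0
q_lneg = subjective(L_neg, loss_mean)        # -7.5, fixed: the loss context never changes

for g_pos in (10.0, 20.0, 30.0):
    gain_mean = (0.0 + g_pos) / 2.0
    q_g0 = subjective(0.0, gain_mean)        # falls as the gain context improves
    print(f"Gpos = {g_pos:4.0f}:  Q(G0) = {q_g0:+.2f},  Q(Lneg) = {q_lneg:+.2f}")

# At Gpos = 30 the two Q-values coincide (-7.5), recreating the confusion.
# More generally, for any contextual weight w > 0, setting Gpos = -Lneg * (2 - w) / w
# drives Q(G0) down to Q(Lneg), so no partial weighting escapes the problem.
```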

As mentioned earlier, it is possible that some cells encode partially-adaptive value while others have a fully-adaptive or non-adaptive encoding. We should therefore be open to the possibility that, even if partially-adaptive value is used at decision, non-adaptive encodings are used for storage of value information and are transformed at the time of decision into the observed partially-adaptive signals. Why might this be reasonable? An agent who maintains partially-adaptive representations in memory faces several computational issues. One is efficiency: a partially-adaptive representation requires S*C quantities or distributions to be stored (one for each of the S stimuli in each of the C contexts). On the other hand, consider an agent who stores non-adaptive stimulus values together with the average value of each context, and then adjusts stimulus values by the context values at the time of decision. This agent could utilize the same information while storing only S+C quantities. Another problem with storing value information in an adaptive format is the transfer of learning across contexts. If I encounter a stimulus in context A, my experiences should alter my evaluation of that stimulus in context B: getting sick after eating a food should reduce my preference for that food in every context. An agent who stores value adaptively would need to update one quantity for the encountered stimulus in each context, namely C quantities; an agent who stores value non-adaptively updates only a single quantity. So, even if decision utilizes a partially-adaptive encoding, a non-adaptive representation is more efficient for storage. Furthermore, non-adaptive information is present in the state of the world (e.g., the concentration of sucrose in a juice does not adapt to expectations), so this information is available to agents during learning. Accordingly, it must be asked why agents would discard information that might ease learning. While these differences do not necessarily affect the authors' claims about value during decision, they should be considered when investigating the merits of different models of value.
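
The scaling difference can be made explicit with a back-of-the-envelope calculation; the particular values of S and C below are arbitrary, only the S*C versus S+C contrast matters.

```python
# Storage and update costs for the two schemes discussed above (illustrative numbers).
S, C = 1000, 50   # stimuli and contexts

adaptive_storage = S * C          # one context-relative value per stimulus per context
factored_storage = S + C          # absolute stimulus values plus one average per context

adaptive_update = C               # a new experience must touch every context's copy
factored_update = 1               # only the stimulus's single absolute value changes

print(f"storage: {adaptive_storage} vs {factored_storage} quantities")
print(f"update after one experience: {adaptive_update} vs {factored_update} quantities")
```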

In sum, while partial adaptation is an exciting theory that may provide novel motivations for empirical work, more effort is needed to understand when and where it is optimal. If we can overcome these concerns, the new theory opens up potential investigation into the nature of contextual influence: if we allow a range of contextual influence (via a parameter) in the partially-adaptive model, do certain individuals show more contextual influence, and does this heterogeneity correlate with learning performance? Do different environments (e.g., noise in the signals conveying the context) alter the parameter? Do different cells or regions respond with different amounts of contextual influence? As such, the theory opens up new experimental hypotheses that might allow us to better understand how the brain incorporates context into learning and decision making.

References

Burke, C. J., Baddeley, M., Tobler, P. N., & Schultz, W. (2016). Partial adaptation of obtained and observed value signals preserves information about gains and losses. Journal of Neuroscience, 36(39), 10016–10025. doi:10.1523/JNEUROSCI.0487-16.2016

Kim, H., Shimojo, S., & O'Doherty, J. P. (2006). Is avoiding an aversive outcome rewarding? Neural substrates of avoidance learning in the human brain. PLoS Biology, 4(8), 1453–1461. doi:10.1371/journal.pbio.0040233

Kobayashi, S., Pinto de Carvalho, O., & Schultz, W. (2010). Adaptation of reward sensitivity in orbitofrontal neurons. Journal of Neuroscience, 30(2), 534–544. doi:10.1523/JNEUROSCI.4009-09.2010

Palminteri, S., Khamassi, M., Joffily, M., & Coricelli, G. (2015). Contextual modulation of value signals in reward and punishment learning. Nature Communications, 6, 1–14. doi:10.1038/ncomms9096

Pompilio, L., & Kacelnik, A. (2010). Context-dependent utility overrides absolute memory as a determinant of choice. PNAS, 107(1), 508–512. doi:10.1073/pnas.0907250107

Swisher, J. D., Gatenby, J. C., Gore, J. C., Wolfe, B. A., Moon, H., Kim, S., & Tong, F. (2010). Multiscale pattern analysis of orientation-selective activity in the primary visual cortex. Journal of Neuroscience, 30(1), 325–330. doi:10.1523/JNEUROSCI.4811-09.2010