FIGURE 16-1 Agent–environment interaction in reinforcement learning [2]. A: Primary reward signals are supplied to the agent from a “critic” in its environment. B: A refinement in which the environment is divided into an internal and an external environment, with all reward signals coming from the former. The shaded box corresponds to what we would think of as the animal or robot.
In the biological world, occurrent pain clearly serves an adaptive purpose by eliciting escape responses and thus protecting organisms from further danger. From a design perspective, adaptivity may be dispensable for robots that perform repetitive tasks in predefined environments (such as production lines), but it is a sine qua non if they are to work and survive in dynamic, human-populated scenarios, where no expert programmer is at hand. Among the repertoire of artificial adaptive mechanisms, ranging from unsupervised to supervised ones [23], RL algorithms are the most appropriate in such scenarios, where the performance of a behavior can be evaluated but the optimal behavior is unknown.
RL refers to the process of improving performance through trial and error. The simplest RL algorithms rely on the intuitive idea that if an action is followed by an improvement in the robot’s state, then the tendency to produce that action in that situation should be strengthened, whereas it should be weakened if a negative effect such as pain follows. This is the classic mechanism of reward and punishment. More formally, RL algorithms progressively build a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The robot is not told which action to take, as in supervised learning, but instead must discover which actions yield the highest reward by trying them. In more complex cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of RL [21].
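As a minimal sketch of this reward/punishment mechanism (the states, actions, and parameters below are hypothetical and are not drawn from any of the systems cited in this chapter), an agent can simply keep a numeric "tendency" for each action in each situation, strengthening it when reward follows and weakening it when pain (negative reward) follows:

```python
import random
from collections import defaultdict

ACTIONS = ["advance", "retreat"]   # hypothetical action repertoire
ALPHA = 0.1                        # learning rate

# Tendency to produce each action in each situation.
preference = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose(state, epsilon=0.1):
    """Mostly pick the currently preferred action, but keep exploring."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    prefs = preference[state]
    return max(prefs, key=prefs.get)

def learn(state, action, reward):
    """Strengthen the tendency after reward; weaken it after pain (reward < 0)."""
    preference[state][action] += ALPHA * reward
```

Exploration (the occasional random choice) is what makes this trial and error rather than the execution of a preprogrammed rule.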
Fig. 16-1A shows the classic diagram of an RL agent (e.g., a robot controller) interacting with its environment. The agent generates actions in the context of sensed states of this environment, and its actions influence how the environment’s states change over time. The environment contains a “critic” component that, at each time step, provides the agent with a numeric evaluation of its ongoing behavior. The term critic is used instead of “teacher” because, in supervised learning, a teacher provides more informative instruction, such as directly telling the agent what its actions should have been rather than merely scoring them.
In the biological world, the critic’s evaluation corresponds to what behavioral psychologists call primary reward, namely reward that encourages behavior directly related to survival and reproductive success, such as eating, drinking, and escaping. The mapping from states to rewards is called a reward function, an essential component of RL algorithms since it implicitly encodes the problem the agent must learn to solve. Pain takes the form of negative reward.
As mentioned above, the objective of a complex agent is to act at each time step so as to maximize not the immediate reward, but the total reward it expects to receive over the future (called the expected return), which is usually a weighted sum in which later rewards are weighted less than earlier ones. Because the agent’s actions influence the environment’s state over time, maximizing expected return requires the agent to control the evolution of its environment. This is very challenging, since the agent might have to sacrifice short-term reward in order to achieve more reward over the long term. Some pains are bearable if the expected return is high enough.
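To make the notion of expected return concrete (a sketch, not taken from the chapter’s cited work), the weighted sum can be computed with a discount factor between 0 and 1, so that later rewards count for less than earlier ones:

```python
def discounted_return(rewards, gamma=0.9):
    """Weighted sum of future rewards; later rewards are weighted less."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A bearable pain: an early negative reward (pain) followed by a
# sufficiently large later payoff still yields a positive return.
print(discounted_return([-1.0, 0.0, 0.0, 5.0]))  # ≈ 2.645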
The simplest RL agents attempt to achieve this objective by adjusting a policy, which is a rule that associates actions with observed environment states. A policy corresponds to a stimulus–response (S-R) rule of animal learning theory. But RL is not restricted to simple S-R agents: more complex RL agents learn models of their environments that they can use to plan how to act appropriately. This entails the capacity to predict and choose between alternative courses of action.
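A policy of this S-R kind can be illustrated with a standard tabular scheme (a generic one-step Q-learning sketch with hypothetical states and actions, not the method of any particular system cited here): the policy is a lookup from observed states to the currently highest-valued action, and the value estimates are adjusted so that delayed pain is gradually propagated back to the earlier actions that led to it.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1
q = defaultdict(float)   # action values indexed by (state, action)

def policy(state, actions):
    """S-R rule: in each observed state, pick the highest-valued action."""
    return max(actions, key=lambda a: q[(state, a)])

def update(state, action, reward, next_state, next_actions):
    """One-step update: delayed pain (negative reward) lowers the value of
    the action that led to it and, over repeated experience, of earlier
    actions along the same path."""
    best_next = max(q[(next_state, a)] for a in next_actions)
    target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (target - q[(state, action)])
```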
Analogously, RL in robotics is applied in two rather different forms, depending on whether it takes place at the sensorimotor or the cognitive level. Sensorimotor adaptation is implemented by building mappings from stimuli to appropriate movements, whereas cognitive learning entails constructing symbolic representations to guide decision making. In the following two subsections, we describe how associative S-R agents have been used to learn robot sensorimotor mappings and a body schema, respectively; symbolic RL agents that learn to plan are described in “Reinforcement Learning to Plan.” A detailed survey of the challenges and successes in the application of RL to robotics is provided in [8].
Reinforcement Learning of Sensorimotor Mappings
Motion control, in both biological and technological systems, relies strongly on sensorimotor mappings. These mappings vary depending not only on the nature of the sensors and actuators involved, but also on the goal pursued. Hand–eye coordination and stable walking are skills that may be acquired through RL; their underlying sensorimotor mappings involve visual and proprioceptive signals, respectively, and their reward functions rely on somewhat delayed reinforcement in the form of success (an object grasped) or pain (a fall).
Despite their apparent disparity, these mappings share an underlying, highly nonlinear relation between a continuous (often hard to interpret) input domain and a continuous motor domain, a relation that in most cases would be impossible to derive analytically. Thus, it needs to be approximated through associative learning. Neural network architectures have the versatility to encode a large range of nonlinear mappings and, through RL, have proven adequate for the massively parallel task of relating perception patterns to motor commands.
For further details, the reader is referred to a review of neural learning algorithms used to approximate sensorimotor mappings for the control of articulated robots [24]. Similar procedures apply to wheeled robot navigation [13], where a mapping from proximity sensor readings to obstacle-avoidance motions is learned using as the reward function the pain (or punishment) derived from collisions with obstacles in the environment.
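The flavor of such pain-driven sensorimotor learning can be conveyed with a toy sketch (it does not reproduce the architectures of [24] or the navigation system of [13]; the one-dimensional corridor, the proximity readings, the linear “network,” and the update rule are all illustrative assumptions): a stochastic steering policy is adjusted only by the punishment that follows a collision, so that sensory patterns predicting pain come to elicit the opposite motion.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)            # tiny linear "network": [left proximity, right proximity, bias]
ALPHA, STEP = 0.5, 0.15    # learning rate and lateral step size

def sense(x):
    """Proximity readings for walls at x=0 and x=1 (closer wall -> larger reading)."""
    return np.array([1.0 - x, x, 1.0])

def act(features):
    """Stochastic policy: probability of steering right given the readings."""
    p_right = 1.0 / (1.0 + np.exp(-features @ w))
    return (1 if rng.random() < p_right else 0), p_right

x = 0.5                                   # start in the middle of the corridor
for t in range(5000):
    feats = sense(x)
    action, p_right = act(feats)
    x += STEP if action == 1 else -STEP
    if x <= 0.0 or x >= 1.0:              # collision with a wall: pain
        reward = -1.0
        x = 0.5                           # reset after the "accident"
    else:
        reward = 0.0
    # REINFORCE-style update on the immediate reward: punished actions
    # become less likely in the sensory situation that preceded them.
    grad_logp = feats * ((1 - p_right) if action == 1 else -p_right)
    w += ALPHA * reward * grad_logp
```

Because the only nonzero reward is the punishment delivered at collisions, all of the learning in this sketch is driven by “pain,” and the resulting mapping steers the robot away from whichever wall its proximity readings signal as closer.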
Acquisition of a Body Schema
In this volume, de Vignemont (Chapter 3) relates pain to learning the boundaries of the body and, ultimately, to the development of a bodily self. Along the same lines, but without referring to pain, Haselager et al. [5] highlight the importance of sensing one’s own movements for the development of a nonconceptual sense of self. These authors claim that proprioception and kinesthesis are essential in this development, and they make a plea for robots to be equipped with a richer sense of proprioception, so as to advance the understanding of creatures acting in the world with a sense of themselves.
