I am going to present studies about how we can build a 'curious' robot learner for 'interactive goal-babbling' by designing a system that strategically chooses 'what', 'how', 'when', and from 'whom' to learn. In other words, the system chooses the 'content', 'procedure', 'timing' and 'source' of its learning process.
Our long-term goal is to enable 'life-long learning': learning multiple tasks in an open-ended and evolving environment, and choosing 'which tasks' to try to learn.
The main challenge of life-long learning is that the learning agent has only 'limited resources', such as a limited life-time. On the other hand, the environment is open-ended and its sensorimotor space is of very high dimension. Therefore the agent has to sample a huge search space.
Inspired by work in psychological development, our idea is to endow the learning agent with a sampling strategy using both social guidance and autonomous exploration based on artificial curiosity, also called intrinsic motivation. We implemented this idea in an algorithmic architecture that lets the learning agent decide 'what' and 'how' to learn, and 'what', 'when', 'how', and 'whom' to imitate.
Let us first set the background of our work. We wish to enable agents to learn during their life-time so as to adapt constantly to an open-ended and changing environment. One example of successful learning agents is human babies.
During their development, we can observe that they choose to focus on different objects or activities according to a developmental sequence. But in this blooming and buzzing confusion which is their environment, how do babies still manage to learn and improve their skills? Despite this open-ended environment with so many objects and people moving around, what are the 'principles' that make them focus on toys and games in an ordered manner?
Likewise, a child decides whether or not to interact with social partners. How do they decide when and with whom to interact?
We will analyse this behaviour from the learning perspective. How are these choices related to active learning for multi-task and life-long learning? We will answer these questions by analysing how we can sample an open-ended and high-dimensional environment for life-long learning. We are mostly interested in robot control.
Motor control learning can be described as learning a probability distribution p(b|a): the probability of 'b' given 'a'. For instance, a child learning to fish learns, given a position that she wants to reach with her float, what arm movement she needs to perform. In this example, 'b' would be a policy and 'a' a position of the float. For learning such a probability distribution, the learning agent has to sample the spaces B and A.
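To make this formulation concrete, here is a minimal sketch, in Python with hypothetical names, of the kind of inverse model the learner has to build by sampling A and B: it stores observed (policy, outcome) pairs and, when asked for a desired outcome, returns the stored policy whose recorded outcome lies closest. It only illustrates the formulation; it is not the system presented later.

```python
import numpy as np

class NearestNeighbourInverseModel:
    """Toy inverse model: store (policy b, outcome a) samples and, given a
    desired outcome, return the policy whose recorded outcome is closest."""

    def __init__(self):
        self.policies = []   # sampled policies b
        self.outcomes = []   # observed outcomes a

    def add_sample(self, b, a):
        self.policies.append(np.asarray(b, dtype=float))
        self.outcomes.append(np.asarray(a, dtype=float))

    def infer_policy(self, a_goal):
        """Return the stored policy whose outcome is nearest to a_goal."""
        if not self.policies:
            raise ValueError("no samples collected yet")
        distances = [np.linalg.norm(a - a_goal) for a in self.outcomes]
        return self.policies[int(np.argmin(distances))]

# Stand-in for the fishing example: 5-D arm-movement parameters -> 2-D float position.
rng = np.random.default_rng(0)
model = NearestNeighbourInverseModel()
for _ in range(100):
    b = rng.uniform(-1.0, 1.0, size=5)       # a random policy
    a = b[:2] + 0.05 * rng.normal(size=2)    # placeholder environment, not a real simulator
    model.add_sample(b, a)
print(model.infer_policy(np.array([0.3, -0.2])))
```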
But A and B in our real world can be continuous, high-dimensional spaces. Therefore the search space is very large. The mapping between A and B can be stochastic, so repeating the same policy can lead to different outcomes. The mapping can also be redundant: to reach an outcome a_2, you can perform different policies. There can also be problems of inhomogeneity: there are 'un-learnable' subspaces. For instance, if you are learning how to fish in front of a lake, you can only put the float at positions around you, while positions 2 kilometres away are unreachable.
You also have problems when exploring an unbounded environment. Because acquiring data takes time, and you only have a 'limited' life-time, you can only have a limited amount of training data to learn from. Therefore, actively guiding data collection can maximise what can be learnt within a life-time.
For learning complex motor control, two families of methods, which we call "modes", have been developed. The first of them takes as its source of information a 'teacher' or social partner. We call this mode the 'socially guided' exploration mode: "an appropriate robot controller can be derived from observations of a human's own performance thereof".
The interaction with a teacher enables a direct transfer of knowledge from human to robot.
These methods can be categorised into two sub-categories. In the mimicry mode, the learning agent tries to imitate the policy of the teacher. For example, in the left-hand picture, the little girl mimics the posture and the position of the rod of her sister. In the emulation mode, the learning agent tries to produce the same outcome as the teacher. The little girl on the right-hand side tries to put her float next to her sister's, but uses a different policy. For socially guided exploration, several techniques have been developed in robotics. For instance, programming-by-demonstration methods have enabled robots to learn complex motor commands to reach a specific outcome from a few demonstrations.
In socially guided exploration, human input highlights 'interesting localities' to explore. However, these methods are limited by the teaching dataset, which can be 'sparse and suboptimal'. The teacher may give an insufficient number of demonstrations, or he might give bad demonstrations because he is not an expert. These methods also have to address 'correspondence problems' when the body and dynamics of the teacher and learner are different. Thirdly, these methods are mainly developed to reach one single goal, and can hardly be extended to multi-task learning.
The second mode uses 'oneself' as the source of information. We call these methods 'autonomous' exploration modes: the learning agent experiments by itself.
Methods such as reinforcement learning or goal-oriented learning of inverse models have
enabled learning agents to learn complex motor skills.
These methods have the advantage of enabling the agent to explore 'independently' of any human effort. Its learning is also adapted to the agent's own body, which means it does not have to face correspondence problems. Moreover, a few methods have been developed for multi-task learning, but they still face problems when the explorable space is unbounded.
These two main families of methods use two different kinds of sources of information.
We would like to combine the advantages of both approaches into a single architecture
to learn a mapping between spaces A and B. To collect data, you can experiment by yourself.
You can also observe a teacher, whom you can mimic by reproducing the observed policy,
or whom you can emulate by reproducing the observed outcome.
The idea here is to have a single system that can use different modes of exploration and decide which sampling mode should be used. This principle of active learning can be generalised to an active decision about which teacher the agent wants to imitate, whether it wants to copy the demonstrated outcome or self-decide on a goal outcome, and whether it wants to copy the demonstrated policy. Likewise, for autonomous learning, the agent can decide 'which outcome' it wants to target, and 'which policy' to use.
These questions can be answered with active learning. Methods of active learning have enabled exploration to maximise the expected learning progress, and to 'evaluate' this progress 'empirically'. This leads to a meta-exploration problem addressed with bandit algorithms.
In this work, we use a different principle for active learning.
We use psychology theories of 'intrinsic motivation' as inspiration.
Intrinsic motivation is defined as the doing of an activity for its inherent satisfaction rather than for some separable consequence. When intrinsically motivated, a person is moved to act for the fun or challenge entailed rather than because of external products, pressures or rewards.
This theory, developed first in psychology, has been successfully applied to robot learning with active goal-babbling. We would like to use the same principle to build a system that learns multiple tasks, with an active choice of outcomes to produce, in 'high-dimensional, stochastic, and continuous' search spaces.
I would now like to illustrate this idea of devising a data-collection strategy based on social guidance and artificial curiosity on a very simple example, and then later move on to more complicated experimental setups.
Let's say a teacher puts an object on a table, and asks you to be able to recognise the object later on, whatever 'position' and whichever 'orientation' it has. What would you do to learn how to recognise this object?
One answer is by 'manipulation'. You can push the object to different positions. You can lift and drop the object, or you can ask a human to manipulate the object for you. The question here is: which manipulation will bring you more useful information about the object? Then, if you have not only one but 'several' objects to recognise, you should also decide which object to manipulate.
In this video, you can see several objects. The humanoid robot iCub can lift and drop this ball. By manipulating it this way, the ball lands in a different position and orientation. The robot gets a new image of the ball, to learn how to recognise it.
If we phrase this problem mathematically, we are learning a probability distribution p(b|a), where 'b' is an object and 'a' is an image.
With regard to the active decisions we make, we choose which 'manipulation', or 'sampling mode', we should use. We also actively choose 'which object' to manipulate, or in other terms, 'which subspace' to explore.
These choices can be easily summarised in this table. Each row corresponds to an object: a car or a cube. And each column corresponds to a manipulation: pushing, lifting and dropping, or interacting with a human. Choosing a combination of a manipulation and an object means choosing a box in this table. Our idea is to use active learning with intrinsic motivation based on competence progress to make this choice. We will choose the object and manipulation that enable the learner to make the most progress.
We define a competence measure, gamma. For an image 'a', gamma of 'a' is the competence at recognising the right object in image 'a'. We measure gamma of 'a' empirically. We start by sampling stochastically all the possibilities.
We plot for each box the competence gamma with respect to time. For instance, in this case, pushing the car leads to the highest slope. Therefore, we will keep pushing the car, to make more progress. As we keep pushing the car, our competence at recognising the car increases. But its 'slope' starts to decrease because we have learned everything about the car: the competence progress of pushing the car decreases. So, we will switch to another object and manipulation that make more competence progress. In this case, we will ask the human to manipulate the cube. Again, we keep asking the human for help, until the competence is high and we do not make any more progress.
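As a rough sketch of this selection rule (in Python; the class name, the sliding-window slope estimate and the occasional random choice are my own assumptions, not the exact SGIM-ACTS internals): keep, for each (object, manipulation) box, the recent competence measurements, estimate their slope, and pick the box with the steepest slope.

```python
import random
import numpy as np

class ProgressBasedChooser:
    """Choose the (object, manipulation) box with the highest recent competence
    progress, estimated as the slope of a linear fit over the last measurements.
    Illustrative sketch only, not the actual SGIM-ACTS implementation."""

    def __init__(self, objects, manipulations, window=10, epsilon=0.1):
        self.boxes = [(o, m) for o in objects for m in manipulations]
        self.history = {box: [] for box in self.boxes}
        self.window = window      # how many recent measurements to fit
        self.epsilon = epsilon    # fraction of random choices, to keep re-estimating

    def record(self, box, competence):
        self.history[box].append(competence)

    def progress(self, box):
        comps = self.history[box][-self.window:]
        if len(comps) < 2:
            return float("inf")   # unexplored boxes are maximally interesting
        return abs(np.polyfit(range(len(comps)), comps, 1)[0])

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.boxes)
        return max(self.boxes, key=self.progress)

chooser = ProgressBasedChooser(["car", "cube"], ["push", "lift_drop", "ask_human"])
box = chooser.choose()                # e.g. ("car", "push")
chooser.record(box, competence=0.4)   # competence measured after the episode
```

Picking the steepest slope, rather than the highest competence, is what makes the learner abandon a box once there is nothing left to learn there, as in the car example above.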
To implement this idea, we designed an algorithmic architecture called SGIM-ACTS, which stands for Socially Guided Intrinsic Motivation with Active Choice of Teacher and Strategy. It is a hierarchical algorithmic architecture that explores the image space, the sampling modes and the object space. If we interact with a teacher, we ask him to manipulate an object for us. He hands us an object b_g. By manipulating the object, he generates a new image a_r of the object. With our recognition algorithm, we recognise in this image a_r the object b_r. The comparison between the recognised object b_r and the real identity of the object b_g gives a measure of competence at recognising the object b_g. This measure of competence is recorded to compute the competence progress, and to select the next sampling mode and object.
In the same way, if we explore autonomously, we manipulate an object b_g ourselves. By this manipulation, we generate a new image a_r of the object in a different position and orientation. With our recognition algorithm, we recognise in this image a_r the object b_r. Again, the comparison between the recognised object b_r and the real identity of the object b_g gives a measure of competence at recognising b_g, which is recorded to compute the competence progress and to select the next sampling mode and object.
We would like to test this algorithmic architecture to see if SGIM-ACTS can choose sampling modes for online learning. We would also like to test whether SGIM-ACTS is robust to bad teachers.
We first wanted to see how well the robot can learn to recognise and differentiate the different objects. We plotted its recognition level, as the F-measure, with respect to time. Each plot represents how well it can recognise each of the objects.
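As a reminder of how such a per-object F-measure can be computed (a generic sketch, not the evaluation code used in the experiment):

```python
def f_measure(predictions, ground_truth, target):
    """Per-object F-measure: harmonic mean of precision and recall for one
    target label, given parallel lists of predicted and true labels."""
    tp = sum(1 for p, t in zip(predictions, ground_truth) if p == target and t == target)
    fp = sum(1 for p, t in zip(predictions, ground_truth) if p == target and t != target)
    fn = sum(1 for p, t in zip(predictions, ground_truth) if p != target and t == target)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: the recogniser confuses one car with a cube and one cube with a car.
print(f_measure(["car", "cube", "car", "car"], ["car", "car", "car", "cube"], "car"))
```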
As shown in the graph on the left, with SGIM-ACTS the robot iCub progressively learns to recognise all the objects, as their recognition levels increase. In comparison, we plotted on the right the recognition level when the robot uses a random choice of manipulation and object. We can see that it learns to recognise almost all the objects, but not the cube, which is the green plot.
Moreover, we also plotted below the choice of objects that the robot manipulated, with respect to time. We can see that indeed, in the case of random sampling, the objects are chosen in an unordered manner. In contrast, SGIM-ACTS chooses objects in a more ordered manner. The graph on the left shows that the robot can actually detect that the cube is difficult to recognise and concentrates on manipulating it. Therefore, SGIM-ACTS learns better than random sampling, and these experiments show that guiding data collection improves performance.
In a second experiment, we wished to test the effect of the teacher's behaviour on the performance of SGIM-ACTS. We conducted the same experiments, but this time the teacher always shows the objects at the same position and orientation. Therefore, this bad teacher does not bring useful information to the robot. As expected, we can see in the right-hand side graph that the robot performs worse than previously.
Here, with random sampling, the robot actually never learns to recognise the cube. In contrast, in the left-hand side graph, although it struggles in the beginning, SGIM-ACTS manages in the end to recognise the cube. The plot of the chosen objects below shows that, again, SGIM-ACTS could concentrate on the cube. Therefore, SGIM-ACTS can be robust to the quality of social guidance.
We have shown in this illustrative example that active learning based on intrinsic motivation can improve the learning performance. Now we would like to examine what happens when we learn a more complicated probability distribution in a continuous space. This is what we tried to address in the second experimental setup.
In the fishing experiment, a robotic arm can manipulate a fishing rod to place the float on the surface of the water. Here the surface of the water is represented by this white surface. A camera from above records the position where the float has landed, marked by a green square, compared to the goal position marked by a white circle. The robot can explore autonomously by making random movements. A human teacher can also decide to give demonstrations. In this case, the robot imitates the demonstrated movement several times to assess how closely it can reach the demonstrated position with the float.
Regularly, we also evaluate the performance of the robot by measuring how closely it can reach predefined positions. The robot can accurately reach positions close to where it explored autonomously. It can also accurately reach positions close to demonstrations. But in unexplored regions, it performs badly.
More precisely, our robot has 6 degrees of freedom. Its movement is determined by twenty-five parameters, which define the Bezier curves of the target trajectories for the joint angles of the robot. Therefore the policy space is of dimension twenty-five. For each movement, the robot observes the position of the float as the outcome of its action. The outcome space is the surface of the water, so the outcome space is two-dimensional.
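The talk does not detail how the twenty-five parameters are split across the six joints; as a purely illustrative assumption, the sketch below takes one shared duration plus four cubic-Bezier control points per joint (1 + 6 × 4 = 25) and evaluates the target joint-angle trajectories.

```python
import numpy as np

def bezier_joint_trajectories(params, n_steps=50):
    """Map a 25-D policy vector to joint-angle target trajectories.
    Assumed layout (illustration only): params[0] is the movement duration,
    params[1:] are 4 cubic-Bezier control points for each of the 6 joints."""
    assert len(params) == 25
    duration = params[0]
    control = np.asarray(params[1:]).reshape(6, 4)   # 6 joints x 4 control points
    t = np.linspace(0.0, 1.0, n_steps)[:, None]      # normalised time
    basis = np.hstack([(1 - t) ** 3,                 # cubic Bernstein basis
                       3 * t * (1 - t) ** 2,
                       3 * t ** 2 * (1 - t),
                       t ** 3])                      # shape (n_steps, 4)
    trajectories = basis @ control.T                 # shape (n_steps, 6)
    times = np.linspace(0.0, duration, n_steps)
    return times, trajectories

params = np.random.uniform(-1.0, 1.0, size=25)
params[0] = 2.0                                      # duration in seconds
times, q = bezier_joint_trajectories(params)
print(q.shape)                                       # (50, 6): one target angle per joint per step
```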
If we phrase this problem mathematically, we are learning a probability distribution p(b|a), where 'b' is a dynamical movement and 'a' is a position of the float. Given a position, what dynamical movement does the robot need to perform to reach it?
In this experiment, we use two sampling modes: self-exploration and imitation. But here, the learner does not decide which sampling mode it should use; this is pre-programmed. With regard to the active decisions the robot makes, it chooses which 'outcome' and which 'policy' to use when exploring autonomously.
For our robot to make these two choices, we designed a simplified version of our algorithmic architecture. This version is called SGIM-D, for Socially Guided Intrinsic Motivation by Demonstration. Here again, we have two modes. In the socially guided sampling mode, a teacher makes a demonstration. Through a correspondence mapping, the robot translates this demonstration into a demonstrated outcome and a demonstrated policy. The robot can emulate the demonstrated outcome a_d and mimic the policy b_d. When the robot tries to reproduce the policy b_d, it reaches an outcome a_r. The distance between the reached outcome a_r and the demonstrated outcome a_d gives a measure of how well the robot can reach a_d. The progress in this measure gives an estimate of how 'interesting' the outcome a_d is.
On the contrary, if the robot explores autonomously, it decides itself on a goal outcome a_g it wants to reach. The robot also decides which policy b_r to use to reach a_g. In our experiment, this choice of policy uses the Nelder-Mead algorithm for non-linear optimisation, but it could be replaced by other optimisation algorithms. When the robot executes the policy b_r, it actually reaches the outcome a_r. The distance between a_r and a_g gives a measure of competence at reaching a_g, which is used to decide at the next time step which outcome to set as a goal.
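A minimal sketch of this autonomous step, using SciPy's Nelder-Mead implementation; the simulator below is a placeholder standing in for the fishing environment, and the function names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def simulate_fishing(policy):
    """Placeholder environment: maps a 25-D policy to a 2-D float landing
    position, with some stochastic noise (not the actual fishing simulator)."""
    return policy[:2] + 0.01 * rng.normal(size=2)

def reach_goal(a_g, b_init):
    """Autonomous exploration step: search for a policy b_r whose outcome a_r
    is close to the self-chosen goal a_g, using Nelder-Mead optimisation."""
    cost = lambda b: float(np.linalg.norm(simulate_fishing(b) - a_g))
    result = minimize(cost, b_init, method="Nelder-Mead",
                      options={"maxiter": 200, "xatol": 1e-3, "fatol": 1e-3})
    b_r = result.x
    a_r = simulate_fishing(b_r)
    competence = -float(np.linalg.norm(a_r - a_g))   # closer to zero means better
    return b_r, a_r, competence

a_g = np.array([0.4, -0.1])                          # self-chosen goal outcome
b_r, a_r, comp = reach_goal(a_g, b_init=np.zeros(25))
print(a_r, comp)
```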
We are still running the experiment on the physical robot, but as a first step, we have tested our algorithm on a simulator. The simulation environment is stochastic, with a non-uniform stochastic distribution. In the bottom-left graph, we plotted positions on the surface of the water. The red crosses correspond to the positions reached by the float when repeating the same movement b_1 twenty times. As we can see, there are several different positions, therefore the environment is 'stochastic'. The green diamonds correspond to the positions reached by the float when repeating another movement b_2. Their distribution is different from that of the red crosses, therefore the stochastic distribution is non-uniform.
As shown in the previous video, to evaluate the performance of the learner, we defined a set of benchmark points. The grey circle is the position of the robot, in the middle of the surface of the water. The red points are the goal outcomes that we ask the robot to reach, and we measure the distance to all these points.
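This evaluation can be summarised in a short sketch (with placeholder helper names; the learner's query function and the environment below are stand-ins): for each benchmark point, ask the learner for its best policy, execute it, and average the distances between reached and requested positions.

```python
import numpy as np

def mean_reaching_error(best_policy_for, execute, benchmark_points):
    """Mean distance between the requested benchmark outcomes and the outcomes
    actually reached when executing the learner's best policy for each of them."""
    errors = []
    for a_goal in benchmark_points:
        b = best_policy_for(a_goal)          # learner's current best guess
        a_reached = execute(b)               # run it in the environment
        errors.append(np.linalg.norm(a_reached - a_goal))
    return float(np.mean(errors))

# Example with stand-ins: a grid of goal positions on the water surface.
benchmark = [np.array([x, y]) for x in np.linspace(-1, 1, 5)
                              for y in np.linspace(-1, 1, 5)]
execute = lambda b: b[:2]
best_policy_for = lambda a_goal: np.concatenate([a_goal, np.zeros(23)])
print(mean_reaching_error(best_policy_for, execute, benchmark))
```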
We also show the human demonstrations in the bottom-right graph. The demonstration set is sparse and uniformly distributed over the reachable space.
The demonstrations were given kinesthetically. The simulation environment can be seen on the screen on the left, and the physical robot on the right enables the human demonstrator to control the robot in the simulation and retrieve the position of the float.
To assess the performance of our algorithm, we compared it with several other exploration strategies. The baseline is random sampling of the policy space, where the robot chooses random policies to learn. The second strategy is called SAGG-RIAC. It is an algorithm for learning with intrinsic motivation and goal-oriented exploration, and it has proven efficient for learning motor skills in high-dimensional spaces. The third kind of sampling is learning by observation, where the robot sees a demonstration at a regular frequency. The difference with imitation is that with imitation, the robot repeats and experiences by itself the demonstrated policy with small variations. So SGIM-D is a mix between imitation and SAGG-RIAC: it observes demonstrations and imitates 5 times, then switches back to SAGG-RIAC until the teacher makes a new demonstration.
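The imitation step can be pictured with the sketch below: a minimal reading of "repeat the demonstrated policy with small variations", with hypothetical names, not the actual SGIM-D code. The learner perturbs the demonstrated policy parameters with small Gaussian noise, executes each variant, and records how close the reached outcomes come to the demonstrated outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

def imitate(b_demo, a_demo, execute, n_repetitions=5, noise_scale=0.05):
    """Repeat a demonstrated policy b_demo with small random variations and
    measure how closely each attempt reaches the demonstrated outcome a_demo.
    `execute` is the environment: policy parameters -> reached outcome."""
    attempts = []
    for _ in range(n_repetitions):
        b_try = b_demo + noise_scale * rng.normal(size=b_demo.shape)
        a_reached = execute(b_try)
        error = float(np.linalg.norm(a_reached - a_demo))
        attempts.append((b_try, a_reached, error))
    return attempts

# Example with a stand-in environment (first two policy parameters -> outcome).
execute = lambda b: b[:2]
b_demo = rng.uniform(-1.0, 1.0, size=25)
attempts = imitate(b_demo, a_demo=b_demo[:2], execute=execute)
print(min(err for _, _, err in attempts))
```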
In this set of experiments, we wish to test whether SGIM-D can learn to reach all the goal outcomes better than the other algorithms. Secondly, can SGIM-D learn faster than the other algorithms? Thirdly, for which outcomes does SGIM-D improve the robot's performance? Finally, we investigate whether these results are scalable to larger outcome spaces.
First, we conducted experiments and plotted the mean error with respect to time for the different learning algorithms. In the left graph, we can see that the red plot, which is the mean error for SGIM-D, is lower than the others, which means that SGIM-D learns with better precision. Its variance is smaller than that of the green plot, therefore SGIM-D learns more reliably than SAGG-RIAC. Besides, its mean error is lower from the beginning, therefore SGIM-D learns faster than the other algorithms.
In terms of the outcomes that the agent can reach, we plotted on the right the histograms of the outcomes reached by the robot. On the 2-D surface of the water, red areas are positions where the float landed often, and blue areas are positions where the float seldom landed. The grey circle is the centre position of the robot. We can see that with random sampling, the robot mostly reaches areas just in front of it. With SAGG-RIAC, the explored space has increased, and with SGIM-D, the explored space has increased even more. In the right box, I put crosses where the demonstrations are, for comparison. These histograms show that SGIM-D has increased its explored space and explores the reachable space more uniformly, including the isolated subspaces.
What about larger spaces? In the second set of experiments, we used an outcome space which is 100 times larger. On the left, we plotted the histogram of the goal outcomes that the robot has set for itself. Whereas the histogram for SAGG-RIAC is fuzzy, the goals that SGIM-D set for itself correspond better to the reachable space. Therefore, somehow, SGIM-D has learned where the reachable space is. On the right, we plotted the mean error with respect to time. The red plot is lower than the other plots. So, SGIM-D learns with better precision and faster than random sampling or SAGG-RIAC. Therefore, SGIM-D is robust to large spaces.
These plots show that SGIM-D learns faster than random exploration, imitation learning or SAGG-RIAC. It can also produce a wider variety of goal outcomes and is scalable to large outcome spaces.
Thus the questions now are: How is the performance of SGIM-D dependent on demonstrations? And
what is the role of demonstrations?
To examine the sensitivity of SGIM-D to correspondence problems, we considered two teachers. Teacher 1 is a robot that has learned with SAGG-RIAC and now gives demonstrations to our learning agent. Teacher 3 is the human teacher seen in the last video. On the right, we plotted the mean error with respect to time for each of these teachers, for SGIM-D and for learning by observation. You already know the error plots for observation and SGIM-D with teacher 3. The cyan plot represents the mean error for learning by observation with teacher 1. The error is lower than in the case of teacher 3, because both teacher and learner are robots, so there are no correspondence problems.
We would thus expect SGIM-D with teacher 1 to have a lower error than SGIM-D with teacher 3. But actually, the blue plot of SGIM-D with teacher 1 has an error comparable to the red plot of SGIM-D with teacher 3. Therefore, the performance of SGIM-D is less sensitive to small correspondence problems than learning by observation. More precisely, we can even think from this graph that teacher 3 is a better teacher than teacher 1, despite the correspondence problem. This is what we investigated on the next slide.
We examine the demonstrations in detail and plot, for all the demonstrated movements, the values of the joint angle of motor 1 with respect to time. Each line corresponds to a demonstration. On the left are demonstrations of teacher 1. On the right are demonstrations of teacher 3. We can see clearly a visible structure in all the demonstrated movements of teacher 3. They seem to be the same movement, only scaled differently. An ANOVA shows that demonstrations 3 do not come from a random distribution. We can say that, as human demonstrations are structured differently, the robot learner can take more advantage of them.
The next question we investigated is how robust SGIM-D is to the quality of demonstrations, when demonstrations are suboptimal. For this, we considered teachers 4 and 5, which are actually subsets of teacher 3. Demonstrations 4 are the demonstrations of teacher 3 that reach points behind the robot, and demonstrations 5 are the demonstrations of teacher 3 that reach points in front of the robot. Again, we plotted the mean error with respect to time for SGIM-D with each of the teachers. You already know the plots for random exploration and SGIM-D with teachers 1 and 3. The error with teacher 4, in cyan, seems to be lower than the error with teacher 5, in magenta.
If we examine the histograms of the outcomes reached by the float in the graphs at the bottom, we can see that teacher 4 encourages more exploration than teacher 5. Teacher 5 makes the learning agent explore only areas in front of it, while teacher 4 also encourages the exploration of areas behind it. Therefore, SGIM-D is sensitive to the demonstrated outcomes, but it still learns despite the poor quality of demonstrations and their sparsity.
Therefore, demonstrations structure and orient the exploration of the policy and outcome spaces. Autonomous exploration makes the learner robust to the sparsity of demonstrations and their lack of relevance, as well as to limited correspondence problems.
So we can conclude from this set of experiments with the fishing robot that we have devised an architecture for online learning of inverse models in continuous, high-dimensional robotic sensorimotor spaces, which learns multiple goals, generalises over a continuous ensemble of outcomes, and actively chooses online which goals to learn.
We reuse this algorithmic architecture to model and understand child development, and more precisely the development of vocalisation. In this experiment, we had a simulator of a vocal tract that the learning agent can control. The learning agent can control its vocal tract motors to produce various sounds. Therefore, it learns a probability distribution that maps its motor commands to the sounds it can produce.
In this example, we have a system which decides which sampling mode to use, which sound to
produce and which motor command to execute.
This leads to these results. We show that with autonomous babbling only, there is an emergence of developmental stages. First, even though it moves its vocal tract, it makes no phonation: no sound comes out. Then, it makes some kinds of noises that are unarticulated. Only then does it make articulated sounds that are similar to syllables. When we put the same learner in a social environment, with sounds that it can emulate, we show that it shifts from a phase in the beginning where it only explores by itself to phases where it tries to emulate the ambient sounds.
Therefore, we show there is a double evolution, from no articulation to articulated sounds, but also from autonomous exploration to imitation. These correspond to descriptions that have been made of the development of infant vocalisation. From this learning process emerges a structure, and this corresponds qualitatively to the developmental sequence observed in infants.
As a conclusion, we designed a system that discovers its physical and social environment and tries to structure its development. We designed three kinds of algorithmic architectures to explore different combinations of active choices. The first one is SGIM-D, which combines both autonomous exploration and social guidance. Its active choices are the outcomes and policies when exploring autonomously. It has been illustrated on the fishing experiment. We showed that SGIM-D learns with as good or better precision, more reliably and faster than the other sampling modes. It uses demonstrations to bias its search in the policy and outcome spaces. It uses self-exploration to overcome correspondence problems, and to compensate for the sparsity of the demonstration set.
The second algorithmic architecture is SGIM-IM, which I did not present here. It is essentially SGIM-D where the learner can make one more active choice: the choice of the sampling mode. We tested SGIM-IM on the fishing experiment and on an air hockey experiment. Here we have a system for interactive learning, which decides whether or not to request help from a teacher. It self-adjusts the timing of its requests for help depending on the cost of a request for demonstration. It has been tested in both deterministic and stochastic environments.
Finally, we have SGIM-ACTS, which is the full version of the algorithmic architecture. It makes all the active decisions: about the sampling mode, the outcomes and policies, but also about which teacher to ask for help. We presented the results for the experiments with the iCub and the vocalisation. We also tested it in an experiment where the robot can learn different types of tasks with several teachers. We show that the algorithmic architecture is efficient for interactive learning with several teachers to learn several types of tasks. We tested SGIM-ACTS in continuous and discrete spaces, both in simulation and with a physical robot. We also used it to model child development.
We realised a learning system that automatically selects the most appropriate sampling mode for a given outcome, as well as the best teacher. It automatically discovers the easy, reachable and difficult outcomes. SGIM can discover the properties of its physical and social environment, and use them to structure its learning process into a developmental sequence.
We contributed to interactive learning by combining, for the first time, social guidance and intrinsic motivation. We also provide the first implementation of strategic learning: our system actively chooses, with the same principle, the content, the timing, the procedure and the source of its learning process. We also contributed to the field of life-long learning by building a system for multi-task learning with an online choice of tasks. And this active choice based on intrinsic motivation can explain the emergence of developmental sequences.