I am going to present studies about how we can build a 'curious' robot learner for 'interactive goal-babbling' by designing a system that strategically chooses 'what', 'how', 'when', and from 'whom' to learn. In other words, the system chooses the 'content', 'procedure', 'timing' and 'source' of its learning process.
Our long-term goal is to enable 'life-long learning': learning multiple tasks in an open-ended and evolving environment, and choosing 'which tasks' to try to learn.
The main challenge of life-long learning is that the learning agent has only 'limited resources', such as a limited life-time. On the other hand, the environment is open-ended and its sensorimotor space is of very high dimension. Therefore the agent has to sample a huge search space.
Inspired by work in psychological development, our idea is to endow the learning agent with a sampling strategy using both social guidance and autonomous exploration based on artificial curiosity, also called intrinsic motivation. We implemented this idea in an algorithmic architecture that lets the learning agent decide 'what' and 'how' to learn, and 'what', 'when', 'how', and 'whom' to imitate.
Let us first set the background of our work. We wish to enable agents to learn during their life-time so as to adapt constantly to an open-ended and changing environment. One example of successful learning agents is human babies.
During their development, we can observe that they choose to focus on different objects or activities according to a developmental sequence. But in this blooming and buzzing confusion which is their environment, how do babies still manage to learn and improve their skills? Despite this open-ended environment with so many objects and people moving around, what are the 'principles' that make them focus on toys and games in an ordered manner?
Likewise, a child decides whether or not to interact with social partners. How do they decide when and with whom to interact?
We will analyse this behaviour from the learning perspective. How are these choices related to active learning for multi-task and life-long learning? We will answer these questions by analysing how we can sample an open-ended and high-dimensional environment for life-long learning. We are mostly interested in robot control.
Motor control learning can be described as learning a probability distribution p(b|a): the probability of 'b' given 'a'. For instance, a child learning to fish learns, given a position that she wants to reach with her float, what arm movement she needs to perform. In this example, 'b' would be a policy and 'a' a position of the float. For learning such a probability distribution, the learning agent has to sample the spaces B and A.
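To make this formulation concrete, here is a minimal sketch, in Python with hypothetical names, of the kind of inverse model the learner has to build by sampling A and B: it stores observed (policy, outcome) pairs and, when asked for a desired outcome, returns the stored policy whose recorded outcome lies closest. It only illustrates the formulation; it is not the system presented later.

```python
import numpy as np

class NearestNeighbourInverseModel:
    """Toy inverse model: store (policy b, outcome a) samples and, given a
    desired outcome, return the policy whose recorded outcome is closest."""

    def __init__(self):
        self.policies = []   # sampled policies b
        self.outcomes = []   # observed outcomes a

    def add_sample(self, b, a):
        self.policies.append(np.asarray(b, dtype=float))
        self.outcomes.append(np.asarray(a, dtype=float))

    def infer_policy(self, a_goal):
        """Return the stored policy whose outcome is nearest to a_goal."""
        if not self.policies:
            raise ValueError("no samples collected yet")
        distances = [np.linalg.norm(a - a_goal) for a in self.outcomes]
        return self.policies[int(np.argmin(distances))]

# Stand-in for the fishing example: 5-D arm-movement parameters -> 2-D float position.
rng = np.random.default_rng(0)
model = NearestNeighbourInverseModel()
for _ in range(100):
    b = rng.uniform(-1.0, 1.0, size=5)       # a random policy
    a = b[:2] + 0.05 * rng.normal(size=2)    # placeholder environment, not a real simulator
    model.add_sample(b, a)
print(model.infer_policy(np.array([0.3, -0.2])))
```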
But A and B in our real world can be continuous, high-dimensional spaces. Therefore the search space is very large. The mapping between A and B can be stochastic, so repeating the same policy can lead to different outcomes. The mapping can also be redundant: to reach an outcome a_2, you can perform different policies. There can also be problems of inhomogeneity: there are 'un-learnable' subspaces. For instance, if you are learning how to fish in front of a lake, you can only put the float at positions around you, while positions 2 kilometres away are unreachable.
You also have problems when exploring an unbounded environment. Because acquiring data takes time, and you only have a 'limited' life-time, you can only have a limited amount of training data to learn from. Therefore, actively guiding data collection can maximise what can be learnt within a life-time.
For learning complex motor control, two families of methods, which we call "modes", have been developed. The first of them takes as its source of information a 'teacher' or social partner. We call this mode the 'socially guided' exploration mode: "an appropriate robot controller can be derived from observations of a human's own performance thereof".
The interaction with a teacher enables a direct transfer of knowledge from human to robot.
These methods can be categorised into two sub-categories. In the mimicry mode, the learning agent tries to imitate the policy of the teacher. For example, in the left-hand picture, the little girl mimics the posture and the position of the rod of her sister. In the emulation mode, the learning agent tries to produce the same outcome as the teacher. The little girl on the right-hand side tries to put her float next to her sister's, but uses a different policy. For socially guided exploration, several techniques have been developed in robotics. For instance, programming-by-demonstration methods have enabled robots to learn complex motor commands to reach a specific outcome from a few demonstrations.
In socially guided exploration, human input highlights 'interesting localities' to explore. However, these methods are limited by the teaching dataset, which can be 'sparse and suboptimal'. The teacher may give an insufficient number of demonstrations, or he might give bad demonstrations because he is not an expert. These methods also have to address 'correspondence problems' when the body and dynamics of the teacher and learner are different. Thirdly, these methods are mainly developed to reach one single goal, and can hardly be extended to multi-task learning.
The second mode uses 'oneself' as the source of information. We call these methods 'autonomous' exploration modes: the learning agent experiments by itself.
Methods such as reinforcement learning or goal-oriented learning of inverse models have
enabled learning agents to learn complex motor skills.
These methods have the advantage of enabling the agent to explore 'independently' of any human effort. Its learning is also adapted to the agent's own body, which means it does not have to face correspondence problems. Moreover, a few methods have been developed for multi-task learning, but they still face problems when the explorable space is unbounded.
These two main families of methods use two different kinds of sources of information.
We would like to combine the advantages of both approaches into a single architecture
to learn a mapping between spaces A and B. To collect data, you can experiment by yourself.
You can also observe a teacher, whom you can mimic by reproducing the observed policy,
or whom you can emulate by reproducing the observed outcome.
The idea here is to have a single system that can use different modes of exploration and decide which sampling mode should be used. This principle of active learning can be generalised to an active decision about which teacher the agent wants to imitate, whether it wants to copy the demonstrated outcome or self-decide on a goal outcome, and whether it wants to copy the demonstrated policy. Likewise, for autonomous learning, the agent can decide 'which outcome' it wants to target, and 'which policy' to use.
These questions can be answered with active learning. Methods of active learning have enabled exploration to maximise the expected learning progress, and to 'evaluate' this progress 'empirically'. This leads to a meta-exploration problem addressed with bandit algorithms.
In this work, we use a different principle for active learning.
We use psychology theories of 'intrinsic motivation' as inspiration.
Intrinsic motivation is defined as the doing of an activity for its inherent satisfaction rather than for some separable consequence. When intrinsically motivated, a person is moved to act for the fun or challenge entailed rather than because of external products, pressures or rewards.
This theory, developed first in psychology, has been successfully applied to robot learning with active goal-babbling. We would like to use the same principle to build a system that learns multiple tasks, with an active choice of outcomes to produce, in 'high-dimensional, stochastic, and continuous' search spaces.
I would now like to illustrate this idea of devising a data-collection strategy based on social guidance and artificial curiosity on a very simple example, and then later move on to more complicated experimental setups.
Let's say a teacher puts an object on a table, and asks you to be able to recognise the object later on, whatever 'position' and whichever 'orientation' it has. What would you do to learn how to recognise this object?
One answer is by 'manipulation'. You can push the object to different positions. You can lift and drop the object, or you can ask a human to manipulate the object for you. The question here is: which manipulation will bring you more useful information about the object? Then, if you have not only one but 'several' objects to recognise, you should also decide which object to manipulate.
In this video, you can see several objects. The humanoid robot iCub can lift and drop this ball. By manipulating it this way, the ball lands in a different position and orientation. The robot gets a new image of the ball, to learn how to recognise it.
If we phrase this problem mathematically, we are learning a probability distribution p(b|a), where 'b' is an object and 'a' is an image.
With regard to the active decisions we make, we choose which 'manipulation', or 'sampling mode', we should use. We also actively choose 'which object' to manipulate, or in other terms, 'which subspace' to explore.
These choices can be easily summarised in this table. Each row corresponds to an object: a car or a cube. And each column corresponds to a manipulation: pushing, lifting and dropping, or interacting with a human. Choosing a combination of a manipulation and an object means choosing a box in this table. Our idea is to use active learning with intrinsic motivation based on competence progress to make this choice. We will choose the object and manipulation that enable the learner to make the most progress.
We define a competence measure, gamma. For an image 'a', gamma of 'a' is the competence at recognising the right object in image 'a'. We measure gamma of 'a' empirically. We start by sampling stochastically all the possibilities.
We plot for each box the competence gamma with respect to time. For instance, in this case, pushing the car leads to the highest slope. Therefore, we will keep pushing the car, to make more progress. As we keep pushing the car, our competence at recognising the car increases. But its 'slope' starts to decrease because we have learned everything about the car: the competence progress of pushing the car decreases. So, we will switch to another object and manipulation that make more competence progress. In this case, we will ask the human to manipulate the cube. Again, we keep asking the human for help, until the competence is high and we do not make any more progress.
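As a rough sketch of this selection rule (in Python; the class name, the sliding-window slope estimate and the occasional random choice are my own assumptions, not the exact SGIM-ACTS internals): keep, for each (object, manipulation) box, the recent competence measurements, estimate their slope, and pick the box with the steepest slope.

```python
import random
import numpy as np

class ProgressBasedChooser:
    """Choose the (object, manipulation) box with the highest recent competence
    progress, estimated as the slope of a linear fit over the last measurements.
    Illustrative sketch only, not the actual SGIM-ACTS implementation."""

    def __init__(self, objects, manipulations, window=10, epsilon=0.1):
        self.boxes = [(o, m) for o in objects for m in manipulations]
        self.history = {box: [] for box in self.boxes}
        self.window = window      # how many recent measurements to fit
        self.epsilon = epsilon    # fraction of random choices, to keep re-estimating

    def record(self, box, competence):
        self.history[box].append(competence)

    def progress(self, box):
        comps = self.history[box][-self.window:]
        if len(comps) < 2:
            return float("inf")   # unexplored boxes are maximally interesting
        return abs(np.polyfit(range(len(comps)), comps, 1)[0])

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.boxes)
        return max(self.boxes, key=self.progress)

chooser = ProgressBasedChooser(["car", "cube"], ["push", "lift_drop", "ask_human"])
box = chooser.choose()                # e.g. ("car", "push")
chooser.record(box, competence=0.4)   # competence measured after the episode
```

Picking the steepest slope, rather than the highest competence, is what makes the learner abandon a box once there is nothing left to learn there, as in the car example above.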
To implement this idea, we designed an algorithmic architecture called SGIM-ACTS, which stands for Socially Guided Intrinsic Motivation with Active Choice of Teacher and Strategy. It is a hierarchical algorithmic architecture that explores the image space, the sampling modes and the object space. If we interact with a teacher, we ask him to manipulate an object for us. He hands us an object b_g. By manipulating the object, he generates a new image a_r of the object. With our recognition algorithm, we recognise in this image a_r the object b_r. The comparison between the recognised object b_r and the real identity of the object b_g gives a measure of competence at recognising the object b_g. This measure of competence is recorded to compute the competence progress, and to select the next sampling mode and object.
In the same way, if we explore autonomously, we manipulate an object b_g ourselves. By this manipulation, we generate a new image a_r of the object in a different position and orientation. With our recognition algorithm, we recognise in this image a_r the object b_r. Again, the comparison between the recognised object b_r and the real identity of the object b_g gives a measure of competence at recognising b_g, which is recorded to compute the competence progress and to select the next sampling mode and object.
We would like to test this algorithmic architecture to see if SGIM-ACTS can choose sampling modes for online learning. We would also like to test whether SGIM-ACTS is robust to bad teachers.
We first wanted to see how well the robot can learn to recognise and differentiate the different objects. We plotted its recognition level, as the F-measure, with respect to time. Each plot represents how well it can recognise each of the objects.
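As a reminder of how such a per-object F-measure can be computed (a generic sketch, not the evaluation code used in the experiment):

```python
def f_measure(predictions, ground_truth, target):
    """Per-object F-measure: harmonic mean of precision and recall for one
    target label, given parallel lists of predicted and true labels."""
    tp = sum(1 for p, t in zip(predictions, ground_truth) if p == target and t == target)
    fp = sum(1 for p, t in zip(predictions, ground_truth) if p == target and t != target)
    fn = sum(1 for p, t in zip(predictions, ground_truth) if p != target and t == target)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: the recogniser confuses one car with a cube and one cube with a car.
print(f_measure(["car", "cube", "car", "car"], ["car", "car", "car", "cube"], "car"))
```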
As shown in the graph on the left, with SGIM-ACTS the robot iCub progressively learns to recognise all the objects, as their recognition levels increase. In comparison, we plotted on the right the recognition level when the robot uses a random choice of manipulation and object. We can see that it learns to recognise almost all the objects, but not the cube, which is the green plot.
Moreover, we also plotted below the choice of objects that the robot manipulated, with respect to time. We can see that indeed, in the case of random sampling, the objects are chosen in an unordered manner. In contrast, SGIM-ACTS chooses objects in a more ordered manner. The graph on the left shows that the robot can actually detect that the cube is difficult to recognise and concentrates on manipulating it. Therefore, SGIM-ACTS learns better than random sampling, and these experiments show that guiding data collection improves performance.
In a second experiment, we wished to test the effect of the teacher's behaviour on the performance of SGIM-ACTS. We conducted the same experiments, but this time the teacher always shows the objects at the same position and orientation. Therefore, this bad teacher does not bring useful information to the robot. As expected, we can see in the right-hand side graph that the robot performs worse than previously.
Here, with random sampling, the robot actually never learns to recognise the cube. In contrast, in the left-hand side graph, although it struggles in the beginning, SGIM-ACTS manages in the end to recognise the cube. The plot of the chosen objects below shows that, again, SGIM-ACTS could concentrate on the cube. Therefore, SGIM-ACTS can be robust to the quality of social guidance.
We have shown in this illustrative example that active learning based on intrinsic motivation can improve the learning performance. Now we would like to examine what happens when we learn a more complicated probability distribution in a continuous space. This is what we tried to address in the second experimental setup.
In the fishing experiment, a robotic arm can manipulate a fishing rod to place the float on the surface of the water. Here the surface of the water is represented by this white surface. A camera from above records the position where the float has landed, marked by a green square, compared to the goal position marked by a white circle. The robot can explore autonomously by making random movements. A human teacher can also decide to give demonstrations. In this case, the robot imitates the demonstrated movement several times to assess how closely it can reach the demonstrated position with the float.
Regularly, we also evaluate the performance of the robot by measuring how closely it can reach predefined positions. The robot can accurately reach positions close to where it explored autonomously. It can also accurately reach positions close to demonstrations. But in unexplored regions, it performs badly.
More precisely, our robot has 6 degrees of freedom. Its movement is determined by twenty-five parameters, which define the Bezier curves of the target trajectories for the joint angles of the robot. Therefore the policy space is of dimension twenty-five. For each movement, the robot observes the position of the float as the outcome of its action. The outcome space is the surface of the water, so the outcome space is two-dimensional.
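The talk does not detail how the twenty-five parameters are split across the six joints; as a purely illustrative assumption, the sketch below takes one shared duration plus four cubic-Bezier control points per joint (1 + 6 × 4 = 25) and evaluates the target joint-angle trajectories.

```python
import numpy as np

def bezier_joint_trajectories(params, n_steps=50):
    """Map a 25-D policy vector to joint-angle target trajectories.
    Assumed layout (illustration only): params[0] is the movement duration,
    params[1:] are 4 cubic-Bezier control points for each of the 6 joints."""
    assert len(params) == 25
    duration = params[0]
    control = np.asarray(params[1:]).reshape(6, 4)   # 6 joints x 4 control points
    t = np.linspace(0.0, 1.0, n_steps)[:, None]      # normalised time
    basis = np.hstack([(1 - t) ** 3,                 # cubic Bernstein basis
                       3 * t * (1 - t) ** 2,
                       3 * t ** 2 * (1 - t),
                       t ** 3])                      # shape (n_steps, 4)
    trajectories = basis @ control.T                 # shape (n_steps, 6)
    times = np.linspace(0.0, duration, n_steps)
    return times, trajectories

params = np.random.uniform(-1.0, 1.0, size=25)
params[0] = 2.0                                      # duration in seconds
times, q = bezier_joint_trajectories(params)
print(q.shape)                                       # (50, 6): one target angle per joint per step
```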
If we phrase this problem mathematically, we are learning a probability distribution p(b|a), where 'b' is a dynamical movement and 'a' is a position of the float. Given a position, what dynamical movement does the robot need to perform to reach it?
In this experiment, we use two sampling modes: self-exploration and imitation. But here, the learner does not decide which sampling mode it should use; this is pre-programmed. With regard to the active decisions the robot makes, it chooses which 'outcome' and which 'policy' to use when exploring autonomously.
For our robot to make these two choices, we designed a simplified version of our algorithmic architecture. This version is called SGIM-D, for Socially Guided Intrinsic Motivation by Demonstration. Here again, we have two modes. In the socially guided sampling mode, a teacher makes a demonstration. Through a correspondence mapping, the robot translates this demonstration into a demonstrated outcome and a demonstrated policy. The robot can emulate the demonstrated outcome a_d and mimic the policy b_d. When the robot tries to reproduce the policy b_d, it reaches an outcome a_r. The distance between the reached outcome a_r and the demonstrated outcome a_d gives a measure of how well the robot can reach a_d. The progress in this measure gives an estimate of how 'interesting' the outcome a_d is.
On the contrary, if the robot explores autonomously, it decides itself on a goal outcome a_g it wants to reach. The robot also decides which policy b_r to use to reach a_g. In our experiment, this choice of policy uses the Nelder-Mead algorithm for non-linear optimisation, but it could be replaced by other optimisation algorithms. When the robot executes the policy b_r, it actually reaches the outcome a_r. The distance between a_r and a_g gives a measure of competence at reaching a_g, which is used to decide at the next time step which outcome to set as a goal.
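A minimal sketch of this autonomous step, using SciPy's Nelder-Mead implementation; the simulator below is a placeholder standing in for the fishing environment, and the function names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def simulate_fishing(policy):
    """Placeholder environment: maps a 25-D policy to a 2-D float landing
    position, with some stochastic noise (not the actual fishing simulator)."""
    return policy[:2] + 0.01 * rng.normal(size=2)

def reach_goal(a_g, b_init):
    """Autonomous exploration step: search for a policy b_r whose outcome a_r
    is close to the self-chosen goal a_g, using Nelder-Mead optimisation."""
    cost = lambda b: float(np.linalg.norm(simulate_fishing(b) - a_g))
    result = minimize(cost, b_init, method="Nelder-Mead",
                      options={"maxiter": 200, "xatol": 1e-3, "fatol": 1e-3})
    b_r = result.x
    a_r = simulate_fishing(b_r)
    competence = -float(np.linalg.norm(a_r - a_g))   # closer to zero means better
    return b_r, a_r, competence

a_g = np.array([0.4, -0.1])                          # self-chosen goal outcome
b_r, a_r, comp = reach_goal(a_g, b_init=np.zeros(25))
print(a_r, comp)
```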
We are still running the experiment on the physical robot, but as a first step, we have tested our algorithm on a simulator. The simulation environment is stochastic, with a non-uniform stochastic distribution. In the bottom-left graph, we plotted positions on the surface of the water. The red crosses correspond to the positions reached by the float when repeating the same movement b_1 twenty times. As we can see, there are several different positions, therefore the environment is 'stochastic'. The green diamonds correspond to the positions reached by the float when repeating another movement b_2. Their distribution is different from that of the red crosses, therefore the stochastic distribution is non-uniform.
As shown in the previous video, to evaluate the performance of the learner, we defined a set of benchmark points. The grey circle is the position of the robot, in the middle of the surface of the water. The red points are the goal outcomes that we ask the robot to reach, and we measure the distance to all these points.
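This evaluation can be summarised in a short sketch (with placeholder helper names; the learner's query function and the environment below are stand-ins): for each benchmark point, ask the learner for its best policy, execute it, and average the distances between reached and requested positions.

```python
import numpy as np

def mean_reaching_error(best_policy_for, execute, benchmark_points):
    """Mean distance between the requested benchmark outcomes and the outcomes
    actually reached when executing the learner's best policy for each of them."""
    errors = []
    for a_goal in benchmark_points:
        b = best_policy_for(a_goal)          # learner's current best guess
        a_reached = execute(b)               # run it in the environment
        errors.append(np.linalg.norm(a_reached - a_goal))
    return float(np.mean(errors))

# Example with stand-ins: a grid of goal positions on the water surface.
benchmark = [np.array([x, y]) for x in np.linspace(-1, 1, 5)
                              for y in np.linspace(-1, 1, 5)]
execute = lambda b: b[:2]
best_policy_for = lambda a_goal: np.concatenate([a_goal, np.zeros(23)])
print(mean_reaching_error(best_policy_for, execute, benchmark))
```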
We also show the human demonstrations in the bottom-right graph. The demonstration set is sparse and uniformly distributed over the reachable space.
The demonstrations were given kinesthetically. The simulation environment can be seen on the screen on the left, and the physical robot on the right enables the human demonstrator to control the robot in the simulation and retrieve the position of the float.
To assess the performance of our algorithm, we compared it with several other exploration strategies. The baseline is random sampling of the policy space, where the robot chooses random policies to learn. The second strategy is called SAGG-RIAC. It is an algorithm for learning with intrinsic motivation and goal-oriented exploration, and it has proven efficient for learning motor skills in high-dimensional spaces. The third kind of sampling is learning by observation, where the robot sees a demonstration at a regular frequency. The difference with imitation is that with imitation, the robot repeats and experiences by itself the demonstrated policy with small variations. So SGIM-D is a mix between imitation and SAGG-RIAC: it observes demonstrations and imitates 5 times, then switches back to SAGG-RIAC until the teacher makes a new demonstration.
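The imitation step can be pictured with the sketch below: a minimal reading of "repeat the demonstrated policy with small variations", with hypothetical names, not the actual SGIM-D code. The learner perturbs the demonstrated policy parameters with small Gaussian noise, executes each variant, and records how close the reached outcomes come to the demonstrated outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

def imitate(b_demo, a_demo, execute, n_repetitions=5, noise_scale=0.05):
    """Repeat a demonstrated policy b_demo with small random variations and
    measure how closely each attempt reaches the demonstrated outcome a_demo.
    `execute` is the environment: policy parameters -> reached outcome."""
    attempts = []
    for _ in range(n_repetitions):
        b_try = b_demo + noise_scale * rng.normal(size=b_demo.shape)
        a_reached = execute(b_try)
        error = float(np.linalg.norm(a_reached - a_demo))
        attempts.append((b_try, a_reached, error))
    return attempts

# Example with a stand-in environment (first two policy parameters -> outcome).
execute = lambda b: b[:2]
b_demo = rng.uniform(-1.0, 1.0, size=25)
attempts = imitate(b_demo, a_demo=b_demo[:2], execute=execute)
print(min(err for _, _, err in attempts))
```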
In this set of experiments, we wish to test whether SGIM-D can learn to reach all the goal outcomes better than the other algorithms. Secondly, can SGIM-D learn faster than the other algorithms? Thirdly, for which outcomes does SGIM-D improve the robot's performance? Finally, we investigate whether these results are scalable to larger outcome spaces.
First, we conducted experiments and plotted the mean error with respect to time for the different learning algorithms. In the left graph, we can see that the red plot, which is the mean error for SGIM-D, is lower than the others, which means that SGIM-D learns with better precision. Its variance is smaller than that of the green plot, therefore SGIM-D learns more reliably than SAGG-RIAC. Besides, its mean error is lower from the beginning, therefore SGIM-D learns faster than the other algorithms.
In terms of the outcomes that the agent can reach, we plotted on the right the histograms of the outcomes reached by the robot. On the 2-D surface of the water, red areas are positions where the float landed often, and blue areas are positions where the float seldom landed. The grey circle is the centre position of the robot. We can see that with random sampling, the robot mostly reaches areas just in front of it. With SAGG-RIAC, the explored space has increased, and with SGIM-D, the explored space has increased even more. In the right box, I put crosses where the demonstrations are, for comparison. These histograms show that SGIM-D has increased its explored space and explores the reachable space more uniformly, including the isolated subspaces.
What about larger spaces? In the second set of experiments, we used an outcome space which is 100 times larger. On the left, we plotted the histogram of the goal outcomes that the robot has set for itself. Whereas the histogram for SAGG-RIAC is fuzzy, the goals that SGIM-D set for itself correspond better to the reachable space. Therefore, somehow, SGIM-D has learned where the reachable space is. On the right, we plotted the mean error with respect to time. The red plot is lower than the other plots. So, SGIM-D learns with better precision and faster than random sampling or SAGG-RIAC. Therefore, SGIM-D is robust to large spaces.
These plots show that SGIM-D learns faster than random exploration, imitation learning or SAGG-RIAC. It can also produce a wider variety of goal outcomes and is scalable to large outcome spaces.
Thus the questions now are: How is the performance of SGIM-D dependent on demonstrations? And
what is the role of demonstrations?
To examine the sensitivity of SGIM-D to correspondence problems, we considered two teachers. Teacher 1 is a robot that has learned with SAGG-RIAC and now gives demonstrations to our learning agent. Teacher 3 is the human teacher seen in the last video. On the right, we plotted the mean error with respect to time for each of these teachers, for SGIM-D and for learning by observation. You already know the error plots for observation and SGIM-D with teacher 3. The cyan plot represents the mean error for learning by observation with teacher 1. The error is lower than in the case of teacher 3, because both teacher and learner are robots, so there are no correspondence problems.
We would thus expect SGIM-D with teacher 1 to have a lower error than SGIM-D with teacher 3. But actually, the blue plot of SGIM-D with teacher 1 has an error comparable to the red plot of SGIM-D with teacher 3. Therefore, the performance of SGIM-D is less sensitive to small correspondence problems than learning by observation. More precisely, we can even think from this graph that teacher 3 is a better teacher than teacher 1, despite the correspondence problem. This is what we investigated on the next slide.
We examine the demonstrations in detail and plot, for all the demonstrated movements, the values of the joint angle of motor 1 with respect to time. Each line corresponds to a demonstration. On the left are demonstrations of teacher 1. On the right are demonstrations of teacher 3. We can see clearly a visible structure in all the demonstrated movements of teacher 3. They seem to be the same movement, only scaled differently. An ANOVA shows that demonstrations 3 do not come from a random distribution. We can say that, as human demonstrations are structured differently, the robot learner can take more advantage of them.
The next question we investigated is how robust SGIM-D is to the quality of demonstrations, when demonstrations are suboptimal. For this, we considered teachers 4 and 5, which are actually subsets of teacher 3. Demonstrations 4 are the demonstrations of teacher 3 that reach points behind the robot, and demonstrations 5 are the demonstrations of teacher 3 that reach points in front of the robot. Again, we plotted the mean error with respect to time for SGIM-D with each of the teachers. You already know the plots for random exploration and SGIM-D with teachers 1 and 3. The error with teacher 4, in cyan, seems to be lower than the error with teacher 5, in magenta.
If we examine the histograms of the outcomes reached by the float in the graphs at the bottom, we can see that teacher 4 encourages more exploration than teacher 5. Teacher 5 makes the learning agent explore only areas in front of it, while teacher 4 also encourages the exploration of areas behind it. Therefore, SGIM-D is sensitive to the demonstrated outcomes, but it still learns despite the poor quality of demonstrations and their sparsity.
Therefore, demonstrations structure and orient the exploration of the policy and outcome spaces. Autonomous exploration makes the learner robust to the sparsity of demonstrations and their lack of relevance, as well as to limited correspondence problems.
So we can conclude from this set of experiments with the fishing robot that we have devised an architecture for online learning of inverse models in continuous, high-dimensional robotic sensorimotor spaces, which learns multiple goals, generalises over a continuous ensemble of outcomes, and actively chooses online which goals to learn.
We reuse this algorithmic architecture to model and understand child development, and more precisely the development of vocalisation. In this experiment, we had a simulator of a vocal tract that the learning agent can control. The learning agent can control its vocal tract motors to produce various sounds. Therefore, it learns a probability distribution that maps its motor commands to the sounds it can produce.
In this example, we have a system which decides which sampling mode to use, which sound to
produce and which motor command to execute.
This leads to these results. We show that with autonomous babbling only, there is an emergence of developmental stages. First, even though it moves its vocal tract, it makes no phonation: no sound comes out. Then, it makes some kinds of noises that are unarticulated. Only then does it make articulated sounds that are similar to syllables. When we put the same learner in a social environment, with sounds that it can emulate, we show that it shifts from a phase in the beginning where it only explores by itself to phases where it tries to emulate the ambient sounds.
Therefore, we show there is a double evolution, from no articulation to articulated sounds, but also from autonomous exploration to imitation. These correspond to descriptions that have been made of the development of infant vocalisation. From this learning process emerges a structure, and this corresponds qualitatively to the developmental sequence observed in infants.
As a conclusion, we designed a system that discovers its physical and social environment and tries to structure its development. We designed three kinds of algorithmic architectures to explore different combinations of active choices. The first one is SGIM-D, which combines both autonomous exploration and social guidance. Its active choices are the outcomes and policies when exploring autonomously. It has been illustrated on the fishing experiment. We showed that SGIM-D learns with as good or better precision, more reliably and faster than the other sampling modes. It uses demonstrations to bias its search in the policy and outcome spaces. It uses self-exploration to overcome correspondence problems, and to compensate for the sparsity of the demonstration set.
The second algorithmic architecture is SGIM-IM, which I did not present here. It is essentially SGIM-D where the learner can make one more active choice: the choice of the sampling mode. We tested SGIM-IM on the fishing experiment and on an air hockey experiment. Here we have a system for interactive learning, which decides whether or not to request help from a teacher. It self-adjusts the timing of its requests for help depending on the cost of a request for demonstration. It has been tested in both deterministic and stochastic environments.
Finally, we have SGIM-ACTS, which is the full version of the algorithmic architecture. It makes all the active decisions: about the sampling mode, the outcomes and policies, but also about which teacher to ask for help. We presented the results for the experiments with the iCub and the vocalisation. We also tested it in an experiment where the robot can learn different types of tasks with several teachers. We show that the algorithmic architecture is efficient for interactive learning with several teachers to learn several types of tasks. We tested SGIM-ACTS in continuous and discrete spaces, both in simulation and with a physical robot. We also used it to model child development.
We realised a learning system that automatically selects the most appropriate sampling mode for a given outcome, as well as the best teacher. It automatically discovers the easy, reachable and difficult outcomes. SGIM can discover the properties of its physical and social environment, and use them to structure its learning process into a developmental sequence.
We contributed to interactive learning by combining, for the first time, social guidance and intrinsic motivation. We also provide the first implementation of strategic learning: our system actively chooses, with the same principle, the content, the timing, the procedure and the source of its learning process. We also contributed to the field of life-long learning by building a system for multi-task learning with an online choice of tasks. And this active choice based on intrinsic motivation can explain the emergence of developmental sequences.