In many machine learning domains, a widely successful paradigm for learning task-specific models is to first pre-train a general-purpose model on an existing, diverse prior dataset and then adapt it with a small amount of task-specific data. This paradigm is attractive for real-world robot learning because collecting data on a robot is expensive, and fine-tuning an existing model on a small task-specific dataset could substantially improve data efficiency when learning a new task. Pre-training a policy with offline reinforcement learning and then fine-tuning it with online reinforcement learning is a natural way to implement this paradigm in robotics. However, numerous challenges arise when using this recipe in practice.
First, off-the-shelf robot datasets frequently use different objects, fixture placements, camera perspectives, and lighting conditions than the local robot platform. These non-trivial distribution shifts between the pre-training data and the online fine-tuning data make it hard to fine-tune a robot policy effectively. Most previous studies demonstrate the advantages of the pre-train and fine-tune paradigm only in settings where the robot uses the same hardware instance for both the pre-training and fine-tuning stages. Second, training or fine-tuning a policy in the real world often requires significant human supervision, including manually resetting the environment between trials and designing reward functions.
The researchers aim to tackle these two issues and provide a practical framework that allows robot fine-tuning with minimal human effort and time. In the last several years, significant progress has been made in developing efficient and autonomous reinforcement learning algorithms. However, a system that can learn from diverse demonstration datasets with minimal human supervision, without requiring human-engineered reward functions or manual environment resets, has remained elusive. Reset-free reinforcement learning (RL) is one method proposed in prior work to reduce the need for manual environment resets. During training, an agent alternates between executing a task policy and a reset policy, updating both with online experience.
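The alternation between a task policy and a reset policy can be illustrated with a toy loop. This is a minimal sketch, not the researchers' implementation: `ToyPolicy`, `reset_free_training`, the 1-D integer state, and the fixed phase length are all hypothetical stand-ins chosen only to show the control flow.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for a learned policy; it only stores experience."""
    name: str
    buffer: list = field(default_factory=list)

    def act(self, state: int) -> int:
        # Random step left or right on a 1-D line (placeholder for a real policy).
        return random.choice([-1, 1])

    def update(self, transition: tuple) -> None:
        # A real agent would take a gradient step here; this sketch just records data.
        self.buffer.append(transition)


def reset_free_training(num_phases: int, steps_per_phase: int, seed: int = 0):
    """Alternate between a task policy and a reset policy with no manual resets."""
    random.seed(seed)
    task, reset = ToyPolicy("task"), ToyPolicy("reset")
    state = 0  # set once; nobody resets the environment by hand afterwards
    for phase in range(num_phases):
        # Even phases attempt the task; odd phases try to undo it.
        active = task if phase % 2 == 0 else reset
        for _ in range(steps_per_phase):
            action = active.act(state)
            next_state = state + action
            active.update((state, action, next_state))
            state = next_state  # the next phase starts wherever this one ended
    return task, reset


task_policy, reset_policy = reset_free_training(num_phases=6, steps_per_phase=10)
print(len(task_policy.buffer), len(reset_policy.buffer))  # 30 30
```

The point of the sketch is that each phase begins from the state the previous phase left behind, so both policies collect online experience without a human restoring the scene.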
These reset-free efforts, however, do not make use of diverse off-the-shelf robot datasets. Recent advances in offline reinforcement learning algorithms have enabled policies to exploit diverse offline data and improve further through online fine-tuning, but these techniques have not yet been integrated into a system that aims to minimize human supervision during the fine-tuning phase. Other papers suggest that learned reward prediction models can replace human-specified reward functions; however, the researchers found that many of these models can be brittle when used in a real RL fine-tuning setting. In short, while earlier research has provided the individual components needed to build a working system for efficient and human-free robot learning, it remains unclear which components to use and how to assemble them.
Researchers from Stanford University created ROBOFUME, a system that uses diverse offline datasets and online fine-tuning to enable autonomous and efficient real-world robot learning. The system operates in two stages. In the pre-training phase, the researchers assume access to a diverse prior dataset, a small set of sample failure observations in the target task, a few task demonstrations, and reset demonstrations of the target task. From this data, they learn a language-conditioned multitask policy with offline reinforcement learning. To handle the distribution shift between offline and online interactions, they need an algorithm that can both efficiently digest heterogeneous offline data and robustly fine-tune in environments distinct from those seen in the offline dataset.
They find that calibrated offline reinforcement learning techniques ensure that the pre-trained policy can efficiently process diverse offline data and keeps improving during online adaptation. These techniques underestimate the predicted values of the learned policy from offline data while correcting the scale of the learned Q-values. To ensure that the online fine-tuning phase requires as little human input as possible, they also eliminate the need for reward engineering by training a reward predictor.
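The article does not give the training objective, so the following is only a sketch of one way "underestimate, but keep the scale calibrated" can be expressed. It assumes a conservative penalty whose push-down on policy Q-values stops at a per-state reference value; `calibrated_penalty`, `v_reference`, and the numbers are hypothetical.

```python
def calibrated_penalty(q_policy, q_data, v_reference):
    """Conservative penalty with a lower bound on how far Q is pushed down.

    q_policy: Q(s, a) at actions sampled from the learned policy.
    q_data: Q(s, a) at actions stored in the offline dataset.
    v_reference: per-state reference values (assumed here, for example
        returns observed in the offline data).
    Minimizing the penalty lowers Q at policy actions and raises it at dataset
    actions. Because of max(), entries already below the reference contribute
    a constant, so they are not pushed any lower.
    """
    clipped = [max(q, v) for q, v in zip(q_policy, v_reference)]
    still_penalized = [q > v for q, v in zip(q_policy, v_reference)]
    penalty = sum(clipped) / len(clipped) - sum(q_data) / len(q_data)
    return penalty, still_penalized


penalty, still_penalized = calibrated_penalty(
    q_policy=[5.0, 1.0, 0.5],
    q_data=[2.0, 2.0, 2.0],
    v_reference=[3.0, 3.0, 3.0],
)
print(round(penalty, 3))   # 1.667
print(still_penalized)     # [True, False, False]
```

In this toy case only the first policy Q-value exceeds its reference, so it is the only one that would still be driven down; the other two are already conservative enough and are left alone.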
Their approach uses a large vision-language model (VLM) to provide a robust pre-trained representation and then fine-tunes it with a small amount of in-domain data to specialize it for reward classification. Pre-trained VLMs have already been trained on large-scale visual and language data from the internet, which makes the resulting model more robust to changes in lighting and camera placement than the models used in earlier work. During the fine-tuning stage, the robot autonomously adapts the policy in the real world by alternating between attempting the task and restoring the environment to its initial state distribution. Meanwhile, the agent updates the policy using the pre-trained VLM as a surrogate reward.
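To make the idea of a classifier acting as a reward concrete, here is a minimal sketch that turns a success classifier's output into a sparse 0/1 reward. Everything in it is an assumption for illustration: `make_sparse_reward`, `toy_logit`, the 0.5 threshold, and the choice to pass a language instruction to the classifier are hypothetical, and the toy scorer stands in for a fine-tuned VLM rather than calling one.

```python
import math
from typing import Callable, Sequence


def make_sparse_reward(
    success_logit: Callable[[Sequence[float], str], float],
    threshold: float = 0.5,
) -> Callable[[Sequence[float], str], float]:
    """Wrap a success classifier so it can be queried as a reward function."""

    def reward(observation: Sequence[float], instruction: str) -> float:
        # Convert the classifier logit to a probability with a sigmoid.
        prob = 1.0 / (1.0 + math.exp(-success_logit(observation, instruction)))
        # Sparse reward: 1 when the classifier believes the task succeeded.
        return 1.0 if prob >= threshold else 0.0

    return reward


def toy_logit(observation: Sequence[float], instruction: str) -> float:
    # Hypothetical scorer: treats the mean feature value as evidence of success.
    return 4.0 * (sum(observation) / len(observation)) - 2.0


reward_fn = make_sparse_reward(toy_logit)
print(reward_fn([0.9, 0.8], "put the sponge in the bowl"))  # 1.0
print(reward_fn([0.1, 0.2], "put the sponge in the bowl"))  # 0.0
```

In a real system the observation would be a camera image and the classifier a fine-tuned VLM; the wrapper only shows how a learned success prediction can replace a hand-written reward during online updates.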
To evaluate their framework, they pre-train on the Bridge dataset and then test on a range of downstream real-world tasks, such as folding and covering cloths, picking up and placing sponges, covering pot lids, and placing pots in sinks. They find that with as little as three hours of real-world training, their approach offers notable improvements over offline-only techniques. In simulation, they run additional quantitative experiments showing that their approach outperforms imitation learning and offline reinforcement learning approaches that either do not fine-tune online or do not use diverse prior data.
Their primary contributions are twofold. First, they present a fully autonomous system for pre-training on a prior robot dataset and fine-tuning on an unseen downstream task with a minimal number of resets and with learned reward labels. Second, they develop a technique for fine-tuning pre-trained vision-language models and using them to construct a surrogate reward for downstream reinforcement learning.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.