## Understanding the power of Lifelong Learning through the Efficient Lifelong Learning Algorithm (ELLA) and VOYAGER

I encourage you to read Part 1: The Origins of LLML if you haven’t already, where we saw the use of LLML in reinforcement learning. Now that we’ve covered where LLML came from, we can apply it to other areas, specifically supervised multi-task learning, to see some of LLML’s true power.

**Supervised LLML: The Efficient Lifelong Learning Algorithm**

The Efficient Lifelong Learning Algorithm aims to train a model that will excel at multiple tasks at once. ELLA operates in the multi-task supervised learning setting, with multiple tasks T_1..T_n, with features X_1..X_n and y_1…y_n corresponding to each task(the dimensions of which likely vary between tasks). Our goal is to learn functions f_1,.., f_n where f_1: X_1 -> y_1. Essentially, each task has a function that takes as input the task’s corresponding features and outputs its y values.

On a high level, ELLA maintains a shared basis of ‘knowledge’ vectors for all tasks, and as new tasks are encountered, ELLA uses knowledge from the basis refined with the data from the new task. Moreover, in learning this new task, more information is added to the basis, improving learning for all future tasks!

Ruvolo and Eaton used ELLA in three settings: landmine detection, facial expression recognition, and exam score predictions! As a little taste to get you excited about ELLA’s power, it achieved up to a 1,000x more time-efficient algorithm on these datasets, sacrificing next to no performance capabilities!

Now, let’s dive into the technical details of ELLA! The first question that might arise when trying to derive such an algorithm is

*How exactly do we find what information in our knowledge base is relevant to each task?*

ELLA does so by modifying our f functions for each t. Instead of being a function f(x) = y, we now have f(x, θ_t) = y where θ_t is unique to task t, and can be represented by a linear combination of the knowledge base vectors. With this system, we now have all tasks mapped out in the *same* basis dimension, and can measure similarity using simple linear distance!

*Now, how do we derive θ_t for each task?*

This question is the core insight of the ELLA algorithm, so let’s take a detailed look at it. We represent knowledge basis vectors as matrix L. Given weight vectors s_t, we represent each θ_t as Ls_t, the linear combination of basis vectors.

Our goal is to minimize the loss for each task while maximizing the shared information used between tasks. We do so with the objective function e_T we are trying to minimize:

Where ℓ is our chosen loss function.

Essentially, the first clause accounts for our task-specific loss, the second tries to minimize our weight vectors and make them sparse, and our last clause tries to minimize our basis vectors.

**This equation carries two inefficiencies (see if you can figure out what)! Our first is that our equation depends on all previous training data, (specifically the inner sum), which we can imagine is incredibly cumbersome. We alleviate this first inefficiency using a Taylor sum of approximation of the equation. Our second inefficiency is that we need to recompute every s_t to evaluate one instance of L. We eliminate this inefficiency by removing our minimization over z and instead computing s when t is last interacted with. I encourage you to read the original paper for a more detailed explanation!**

Now that we have our objective function, we want to create a method to optimize it!

In training, we’re going to treat each iteration as a unit where we receive a batch of training data from a single task, then compute s_t, and finally update L. At the start of our algorithm, we set T (our number-of-tasks counter), A, b, and L to zeros. Now, for each batch of data, we case based on the data is from a seen or unseen task.

If we encounter data from a new task, we will add 1 to T, and initialize X_t and y_t for this new task, setting them equal to our current batch of X and y..

If we encounter data we’ve already seen, our process gets more complex. We again add our new X and y to add our new X and y to our current memory of X_t and y_t (by running through all data, we will have a complete set of X and y for each task!). We also incrementally update our A and b values negatively (I’ll explain this later, just remember this for now!).

Now we check if we want to end our training loop. We set our (θ_t, D_t) equal to the output of our regular learner for our batch data.

We then check to end the loop (if we have seen all training data). If we haven’t ended, we move on to computing s and updating L.

To compute s, we first compute optimal model \theta_t using only the batched data, which will depend on our specific task and loss function.

We then compute D_t, and either randomly or to one of the θ_ts initialize any all-zero columns of L (which occurs if a certain basis vector is unused). In linear regression,

and in logistic regression

Then, we compute s_t using L by solving an L1-regularized regression problem:

For our final step of updating L, we take

, find where the gradient is 0, then solve for L. By doing so, we increase the sparsity of L! We then output the updated columnwise-vectorization of L as

so as not to sum over all tasks to compute A and b, we construct them incrementally as each task arrives.

Once we’ve iterated through all batch data, we’ve learned all tasks properly and have finished!

The power of ELLA lies in many of its efficiency optimizations, primarily of which is its method of using θ functions to understand exactly what basis knowledge is useful! If you care about a more in-depth understanding of ELLA, I highly encourage you to check out the pseudocode and explanation in the original paper.

Using ELLA as a base, we can imagine creating a generalizable AI, which can learn any task it’s presented with. We again have the property that the more our knowledge basis grows, the more ‘relevant information’ it contains, which will even further increase the speed of learning new tasks! It seems as if ELLA could be the core of one of the super-intelligent artificial learners of the future!

**Voyager**

What happens when we integrate the newest leap in AI, LLMs, with Lifelong ML? We get something that can beat Minecraft (This is the setting of the actual paper)!

Guanzhi Wang, Yuqi Xie, and others saw the new opportunity offered by the power of GPT-4, and decided to combine it with ideas from lifelong learning you’ve learned so far to create Voyager.

When it comes to learning games, typical algorithms are given predefined final goals and checkpoints for which they exist solely to pursue. In open-world games like Minecraft, however, there are many possible goals to pursue and an infinite amount of space to explore. What if our goal is to approximate human-like self-motivation combined with increased time efficiency in traditional Minecraft benchmarks, such as getting a diamond? Specifically, let’s say we want our agent to be able to decide on feasible, interesting tasks, learn and remember skills, and continue to explore and seek new goals in a ‘self-motivated’ way.

Towards these goals, Wang, Xie, and others created Voyager, which they called the first LLM-powered embodied lifelong learning agent!

*How does Voyager work?*

On a large-scale, Voyager uses GPT-4 as its main ‘intelligence function’ and the model itself can be separated into three parts:

**Automatic curriculum:**This decides which goals to pursue, and can be thought of as the model’s “motivator”. Implemented with GPT-4, they instructed it to optimize for difficult yet feasible goals and to “discover as many diverse things as possible” (read the original paper to see their exact prompts). If we pass four rounds of our iterative prompting mechanism loop without the agent’s environment changing, we simply choose a new task!**Skill library:**a collection of executable actions such as craftStoneSword() or getWool() which increase in difficulty as the learner explores. This skill library is represented as a vector database, where keys are embedding vectors of GPT-3.5-generated skill descriptions, and executable skills in code form. GPT-4 generated the code for the skills, optimized for generalizability and refined by feedback from the use of the skill in the agent’s environment!**Iterative prompting mechanism:**This is the element that interacts with the Minecraft environment. It first executes its’ interface of Minecraft to gain information about its current environment, for example, the items in its inventory and the surrounding creatures it can observe. It then prompts GPT-4 and performs the actions specified in the output, also offering feedback about whether the actions specified are impossible. This repeats until the current task (as decided by the automatic curriculum) is completed. At completion, we add the learned skill to the skill library. For example, if our task was create a stone sword, we now put the skill craftStoneSword() into our skill library. Finally, we ask the automatic curriculum for a new goal.

*Now, where does Lifelong Learning fit into all this?*

When we encounter a new task, we query our skill database to find the top 5 most relevant skills to the task at hand (for example, relevant skills for the task getDiamonds() would be craftIronPickaxe() and findCave().

Thus, we’ve used previous tasks to learn our new task more efficiently: the essence of lifelong learning! Through this method, Voyager continuously explores and grows, learning new skills that increase its frontier of possibilities, increasing the scale of ambition of its goals, thus increasing the powers of its newly learned skills, continuously!

Compared with other models like AutoGPT, ReAct, and Reflexion, Voyager discovered 3.3x as many new items as these others, navigated distances 2.3x longer, unlocked wooden level 15.3x faster per prompt iteration, and was the only one to unlock the diamond level of the tech tree! Moreover, after training, when dropped in a completely new environment with no items, Voyager consistently solved prior-unseen tasks, while others could not solve any within 50 prompts.

As a display of the importance of Lifelong Learning, without the skill library, the model’s progress in learning new tasks plateaued after 125 iterations, whereas with the skill library, it kept rising at the same high rate!

Now imagine this agent applied to the real world! Imagine a learner with infinite time and infinite motivation that could keep increasing its possibility frontier, learning faster and faster the more prior knowledge it has! I hope by now I’ve properly illustrated the power of Lifelong Machine Learning and its capability to prompt the next transformation of AI!

If you’re interested further in LLML, I encourage you to read Zhiyuan Chen and Bing Liu’s book which lays out the potential future paths LLML might take!

Thank you for making it all the way here! If you’re interested, check out my website anandmaj.com which has my other writing, projects, and art, and follow me on Twitter @almondgod.

**Original Papers and other Sources:**

Eaton and Ruvolo: Efficient Lifelong Learning Algorithm

Wang, Xie, et al: Voyager

Chen and Liu, Lifelong Machine Learning (Inspired me to write this!): https://www.cs.uic.edu/~liub/lifelong-machine-learning-draft.pdf

Unsupervised LL with Curricula: https://par.nsf.gov/servlets/purl/10310051

Neuro-inspired AI: https://www.cell.com/neuron/pdf/S0896-6273(17)30509-3.pdf

Embodied LL: https://lis.csail.mit.edu/embodied-lifelong-learning-for-decision-making/

LL for sentiment classification: https://arxiv.org/abs/1801.02808

Lifelong Robot Learning: https://www.sciencedirect.com/science/article/abs/pii/092188909500004Y

Knowledge Basis Idea: https://arxiv.org/ftp/arxiv/papers/1206/1206.6417.pdf

Q-Learning: https://link.springer.com/article/10.1007/BF00992698

AGI LLLM LLMs: https://towardsdatascience.com/towards-agi-llms-and-foundational-models-roles-in-the-lifelong-learning-revolution-f8e56c17fa66

DEPS: https://arxiv.org/pdf/2302.01560.pdf

Voyager: https://arxiv.org/pdf/2305.16291.pdf

Meta Reinforcement Learning Survey: https://arxiv.org/abs/2301.08028