In the past few years, and especially since the appearance of ChatGPT just over 12 months ago, generative AI models for creating realistic synthetic text, images, video, and audio have emerged and have been advancing rapidly ever since. What began as humble research quickly developed into systems capable of generating high-quality, human-like outputs across the various mediums mentioned above. Propelled in particular by key innovations in neural networks and massive increases in computational power, more and more companies now offer free and/or paid access to these models, whose abilities improve at a remarkable pace.
Generative AI isn’t all rainbows and puppy dogs, however. While holding great promise to augment human creativity in a wide variety of applications, concerns remain about how to properly evaluate, test, and responsibly deploy these generative systems. There is particular unease related to the spread of misinformation, along with concerns of bias, truthfulness, and social impacts introduced by this technology.
However, the first thing to do with any new technology is to attempt to understand it before we either harness or criticize it. Getting a start at doing so is what we have planned for this article. We intend to lay out some key generative AI terms and do our best to make them understandable at an intuitive level for beginners, in order to provide an elementary foundation and pave the way for more in-depth learning ahead. In that vein, for each key term below you will find links to related material to begin to investigate further as desired.
Now let’s get started.
Natural Language Processing
Natural Language Processing (NLP) is an AI subfield focused on enabling machines to understand, interpret, and generate human language, by programmatically providing these machines with the tools required to do so. NLP bridges the gap between human communication and computer understanding. NLP first employed rule-based methods, followed by "traditional" machine learning approaches, while most cutting-edge NLP today relies on a variety of neural network techniques.
Neural networks are machine learning computational models inspired by (though not replicas of) the human brain, used for learning from data. Neural networks consist of layers (many layers = deep learning) of artificial neurons, each processing and transmitting small individual pieces of data, fitting this data to a function, and iteratively updating the weights associated with the processing neurons in an attempt to "better fit" the data to that function. Neural networks are essential for the learning and decision-making capabilities of today's AI. Without the deep learning revolution started a little over a decade ago, much of what we refer to as AI would not have been possible.
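To make the "updating weights to better fit the data" idea concrete, here is a minimal sketch: a single artificial neuron with one weight and one bias, trained by gradient descent to approximate the line y = 2x + 1. This is purely illustrative; real networks stack many such neurons into layers, but each weight is nudged in essentially this way.

```python
def train_neuron(data, lr=0.01, epochs=500):
    w, b = 0.0, 0.0  # the neuron's parameters, updated repeatedly
    for _ in range(epochs):
        for x, y in data:
            pred = w * x + b     # forward pass: the neuron's guess
            error = pred - y     # how far off is the guess?
            w -= lr * error * x  # nudge the weight to "better fit" the data
            b -= lr * error      # nudge the bias the same way
    return w, b

# Noiseless training data sampled from y = 2x + 1
data = [(x, 2 * x + 1) for x in range(-5, 6)]
w, b = train_neuron(data)  # w converges toward 2, b toward 1
```

After training, the learned weight and bias closely match the true slope and intercept, which is exactly the "fitting data to a function" described above, just at the smallest possible scale.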
Generative AI is a category of artificial intelligence, powered by neural networks, which is focused on the creation of new content. This content can take many forms, from text to images to audio and beyond. This differs from “traditional” types of AI which focus on classifying or analyzing existing data, embodying the capability to “imagine” and produce novel content based on training data.
Content generation is the actual process where trained generative models generate synthetic text, images, video, and audio, doing so with learned patterns from their training data, producing contextually relevant output in response to user input or prompts. These prompts can be in any of these mentioned forms as well. For example, text could be used as a prompt to generate more text, or to generate an image based on the text description, or a piece of audio or video instead. Likewise, an image could be used as a prompt to generate another image, or text, or video, etc. Multi-modal prompting is also possible, in which, for example, text and an image could be used to generate audio.
Large Language Models
Large Language Models (LLMs) are specialized machine learning models tailored to process and "understand" human language. LLMs are trained on vast amounts of text data, which enables them to analyze and replicate complex language structures, nuances, and contexts. Regardless of the exact model and techniques being used, the entire essence of these models is to learn and predict which word, or token (a group of characters), follows the current one, and so on. LLMs are essentially incredibly complex "next word guessers," and improving the next word guess is a very hot research topic at the moment, as you have likely heard.
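The "next word guesser" idea can be illustrated with the simplest possible language model: a bigram model that predicts the most frequent follower of each word in its training text. Real LLMs operate over tokens with billions of parameters rather than a frequency table, but the core objective of predicting what comes next is the same.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    # Count, for each word, which words follow it and how often
    words = text.split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers, word):
    # The "guess": the most common word observed after `word`
    return followers[word].most_common(1)[0][0]

model = train_bigrams("the cat sat on the mat and the cat slept")
```

Here `predict_next(model, "the")` returns "cat", since "cat" follows "the" more often than "mat" in the training text; an LLM does the same kind of prediction, but over learned probabilities rather than raw counts.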
Foundational models are AI systems designed with broad capabilities that can then be adapted for a variety of specific tasks. Foundational models provide a base for building more specialized applications, such as adapting a general language model for specific chatbot, assistant, or other generative functionality. Foundational models are not limited to language models, however; they exist for generation tasks such as image and video as well. Examples of well-known and relied-upon foundational models include GPT, BERT, and Stable Diffusion.
In this context, parameters are numerical values that define a model’s structure, operational behavior, and capacity for learning and predicting. For example, the billions of parameters in OpenAI’s GPT-4 influence its word prediction and dialogue creation abilities. More technically, connections between each neuron in a neural network carry weights (mentioned above), with each of these weights being a single model parameter. The more neurons → the more weights → the more parameters → the more capacity for a (well-trained) network to learn and predict.
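The chain "more neurons → more weights → more parameters" is easy to verify by counting: in a fully connected network, every connection between layers carries one weight and every neuron one bias, and each of those numbers is a parameter. The layer sizes below are arbitrary, chosen only for illustration.

```python
def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between layers
        total += n_out         # one bias per neuron in the next layer
    return total

# A tiny network: 784 inputs -> 128 hidden neurons -> 10 outputs
n_params = count_parameters([784, 128, 10])  # 101,770 parameters
```

Even this toy network has over a hundred thousand parameters; models like GPT-4 scale the same bookkeeping into the billions.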
Word embeddings are a technique in which words or phrases are converted into numerical vectors of a predetermined number of dimensions, in an attempt to capture their meaning and contextual relationships in a multidimensional space of a size much smaller than would be required to one-hot encode each word (or phrase) in a vocabulary. If you were to create a matrix for a 500,000-word vocabulary where each row represented a single word, and every column in that row was set to "0" except for a single column representing the word in question, the matrix would be 500,000 rows x 500,000 columns, and incredibly sparse. This would be a disaster for both storage and performance. By instead filling the columns with learned real-valued numbers, and reducing the number of columns to, say, 300 (dimensions), we have a much more compact storage structure, and inherently increase operation performance. As a side effect, because these dimensional embedding values are learned by a neural network, similar terms end up "closer" to one another in this space than dissimilar terms, providing us with insights into relative word meanings.
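The notion of "closer" is usually measured with cosine similarity between embedding vectors. The toy 3-dimensional vectors below are made up purely for illustration (real embeddings are learned and typically have hundreds of dimensions), but they show the effect: related words score higher than unrelated ones.

```python
import math

# Hypothetical 3-dimensional "embeddings" for illustration only;
# real embeddings are learned by a neural network.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
```

With these made-up vectors, "king" and "queen" come out far more similar to each other than either is to "apple", which is the geometric version of the "insights into relative word meanings" described above.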
Transformer models are AI architectures that process entire sequences simultaneously, which is crucial for grasping language context and long-range associations. They excel at detecting relationships between words and phrases, even when far apart in a sentence. For example, when "she" is established early in a passage as a pronoun referencing a particular individual, transformers are able to "remember" this relationship throughout the rest of the passage.
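The mechanism behind this is attention. Below is a minimal, toy-sized sketch of scaled dot-product attention, the core transformer operation: every position computes a similarity score against every other position, then takes a weighted mix of the whole sequence, which is what lets distant words influence each other.

```python
import math

def softmax(xs):
    # Turn raw scores into weights that sum to 1
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d_k = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key (scaled dot products)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted mix of all value vectors across the whole sequence
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# A query aligned with the first key attends mostly to the first value
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                [[10.0, 0.0], [0.0, 10.0]])
```

In the example, the output blends both value vectors but leans heavily toward the one whose key matched the query, regardless of where in the sequence that key sits.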
Positional encoding refers to a method in transformer models for maintaining the sequential order of words. Because transformers process all words in a sequence simultaneously rather than one at a time, they need this explicit ordering information, which is a crucial component for understanding the context within a sentence and between sentences.
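One common scheme, from the original transformer architecture, is sinusoidal positional encoding: each position gets a unique vector of sines and cosines at different frequencies, which is added to the word's embedding so the model knows where in the sequence each token sits.

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal encoding: even dimensions use sine, odd use cosine,
    # at frequencies that decrease as the dimension index grows.
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Each position receives a distinct vector
pe0 = positional_encoding(0, 4)  # [0.0, 1.0, 0.0, 1.0]
pe1 = positional_encoding(1, 4)
```

Because every position produces a different vector, two identical words at different positions end up with distinguishable representations, restoring the order information that simultaneous processing would otherwise discard.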
Reinforcement Learning From Human Feedback
Reinforcement learning from human feedback (RLHF) refers to a method of training LLMs. Like traditional reinforcement learning (RL), RLHF trains and uses a reward model, though this one comes directly from human feedback. The reward model is then used as a reward function in the training of the LLM via an optimization algorithm. This approach explicitly keeps humans in the loop during model training, with the hope that human feedback can provide essential, and perhaps otherwise unattainable, signal for optimizing LLMs.
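At the heart of reward-model training is a simple pairwise objective: given two responses where a human preferred one over the other, the reward model is pushed to score the preferred response higher. The sketch below shows the standard pairwise (Bradley-Terry style) preference loss; the reward scores here are stand-in numbers, not a real model's outputs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    # -log(sigmoid(r_chosen - r_rejected)): near zero when the model
    # already ranks the human-preferred response higher, and large
    # when it ranks the rejected response higher instead.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Low loss: the preferred response already scores higher
good = preference_loss(2.0, 0.0)
# High loss: the model disagrees with the human preference
bad = preference_loss(0.0, 2.0)
```

Minimizing this loss over many human-labeled comparison pairs is what turns raw preference data into a reward function the LLM can then be optimized against.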
Emergent behavior refers to the unexpected skills displayed by large and complex language models, skills which are not displayed in simpler models. These unexpected skills can include abilities like coding, musical composition, and fiction writing. These skills are not explicitly programmed into the models but emerge from their complex architectures. The question of emergent abilities can go beyond these more common skills, however; for example, is theory of mind an emergent behavior?
Hallucination is the term for when LLMs produce factually incorrect or illogical responses due to constraints in their data and architecture. Despite whatever advanced capabilities a model possesses, these errors can occur both when it encounters queries that have no grounding in its training data, and when its training data consists of incorrect or nonfactual information.
Anthropomorphism is the tendency to attribute human-like qualities to AI systems. It is important to note that, despite their ability to mimic human emotions or speech, and our instinct to think of the models as a "he" or a "she" (or any other pronoun) as opposed to an "it," AI systems do not possess feelings or consciousness.
Bias is a loaded term in AI research, and can refer to a number of different things. In our context, bias refers to the errors in AI outputs caused by skewed training data, leading to inaccurate, offensive, or misleading predictions. Bias arises when algorithms prioritize irrelevant data traits over meaningful patterns, or lack meaningful patterns altogether.
Matthew Mayo (@mattmayo13) holds a Master’s degree in computer science and a graduate diploma in data mining. As Editor-in-Chief of KDnuggets, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.