Large Language Models

This page is intended as a simplified primer on important concepts for developing and customizing Large Language Models (LLMs).

Models

Training a completely new base model is extremely expensive (GPT-4: reportedly around $100M) due to the compute power required to achieve good results, so this is only done by tech companies with enough capital:

  • Meta: Llama (public)
  • Google: Gemma (public), Gemini (API only)
  • OpenAI: GPT (API only)
  • Anthropic: Claude (API only)
  • Mistral AI: Mistral (public)
  • (Many other models and remixes of larger models)

These models are trained on billions of web pages, e.g. wikis, open source projects, etc.

These models often have different variants:

  • Different sizes (e.g. 2B, 70B or 405B parameters). Larger models are more powerful, but require more resources.
  • Fine-tuned variants (e.g. “instruct” models optimized for chatbots)
  • Multimodal models (understand images or speech)

Models have different hardware requirements:

  • Small models (≤ 70B): Can run locally (no internet access) on consumer hardware (gaming graphics cards, M1 MacBooks).
  • Large models: Require expensive specialized hardware, so cloud access is usually cheaper.

Input/Output

LLMs are simple when viewed from the outside:

  • Input: Text
  • Output: Text

For example, when the input is the entire chat history of a chatbot, the LLM can predict how the text would continue (i.e. with further chat messages).

By cleverly choosing the input (Prompt Engineering), you can influence the output, e.g. with examples, instructions for language patterns, more context, etc.
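As an illustration, here is a minimal sketch of how a chat history and prompt-engineering additions are just plain text that the model continues (the conversation content is made up):

const chatHistory = [
    'User: What is the capital of France?',
    'Assistant: Paris.',
    'User: And of Germany?',
].join('\n');

// Prompt engineering: instructions, examples and extra context are simply
// prepended text that steers how the model continues the conversation.
const prompt = [
    'Answer in one short sentence.',   // instruction
    'User: What is 2 + 2?',            // example input
    'Assistant: 4.',                   // example output
    chatHistory,
    'Assistant:',                      // the model predicts the text that follows
].join('\n');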

System Prompts

System prompts are predefined instructions given to an LLM in addition to the user’s input. They set the context and tone for the interaction.

Examples:

  • Role: “You are a helpful assistant.”
  • Behavior: “Answer in English and provide detailed explanations.”
  • Context: “The time is {time} and the user’s name is {name}. Here is a document relevant to the user’s query: {document}”
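
A minimal sketch of how such a system prompt could be assembled and sent together with the user’s input (the chat-message format follows the common chat-completion convention; the concrete values are made up):

const userName = 'Alice';
const documentText = 'Passwords can be reset under Settings → Account → Security.';
const userQuestion = 'How do I reset my password?';

// The placeholders {time}, {name} and {document} are filled in before the request.
const systemPrompt = `You are a helpful assistant.
Answer in English and provide detailed explanations.
The time is ${new Date().toISOString()} and the user's name is ${userName}.
Here is a document relevant to the user's query: ${documentText}`;

// The system prompt is sent alongside the user's message.
const messages = [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userQuestion },
];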

Vector Databases

Data can be stored in vector databases via mathematical “feature vectors”. Such a vector representation of a data element is called an embedding.

These vectors lie in a multidimensional vector space (often plotted as a point cloud). Each dimension describes a semantic meaning (feature) of a data element. This has the advantage that related pieces of data are close to each other in the vector space.

Semantic search: A search query is converted to a feature vector and neighbors of this vector (related data) are retrieved.

Embedding Models: Text is converted to feature vectors with specialized embedding models, e.g. nomic-embed-text.

(Figure sources: Gutierrez-Osuna, R., Introduction to Pattern Analysis; Google Developers, Machine Learning Crash Course.)
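
A minimal sketch of semantic search over such feature vectors, using cosine similarity as the distance measure (the embed parameter stands in for any embedding model such as nomic-embed-text; it is an assumed placeholder, not a concrete API):

// Cosine similarity: how close two feature vectors are in the vector space.
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Semantic search: embed the query and return the most similar stored entries.
async function semanticSearch(
    embed: (text: string) => Promise<number[]>,   // embedding model, e.g. nomic-embed-text
    query: string,
    store: { text: string; vector: number[] }[],  // the "vector database"
    topK = 3,
) {
    const queryVector = await embed(query);
    return store
        .map((entry) => ({ ...entry, score: cosineSimilarity(queryVector, entry.vector) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, topK);
}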

Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is used to enrich a model with knowledge from documents (e.g. PDFs, videos, websites, …).

These documents are split into chunks that are stored in a vector database as embeddings. The embeddings have to be regenerated whenever the documents change, so this approach is not real-time capable.
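
A naive sketch of the chunking step (real systems usually split on sentence or paragraph boundaries and add some overlap between chunks):

// Split a document into fixed-size chunks before embedding them.
function chunkDocument(text: string, chunkSize = 500): string[] {
    const chunks: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
        chunks.push(text.slice(i, i + chunkSize));
    }
    return chunks;
}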

A user request proceeds as follows (a code sketch follows the list):

  • The request is converted into a vector in the embedding’s vector space.
  • Neighbors of this vector are searched in the vector database (semantic search).
  • These vectors are added to the system prompt in their original text form.
  • The LLM generates a response based on the additional context in the prompt.
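
A minimal sketch of this flow, reusing cosineSimilarity from the semantic-search sketch above (embed and callLlm are assumed placeholders for an embedding model and an LLM call, not a specific library):

type Chunk = { text: string; vector: number[] };   // a document chunk and its embedding

async function answerWithRag(
    question: string,
    chunkStore: Chunk[],                                                      // the vector database
    embed: (text: string) => Promise<number[]>,                               // embedding model
    callLlm: (systemPrompt: string, userPrompt: string) => Promise<string>,   // LLM call
): Promise<string> {
    // 1. Convert the request into a vector in the embedding's vector space.
    const queryVector = await embed(question);

    // 2. Semantic search: find the nearest chunks in the vector database.
    const neighbors = chunkStore
        .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, 3);

    // 3. Add the retrieved chunks to the system prompt in their original text form.
    const systemPrompt = `Answer using the following documents:\n${neighbors
        .map((n) => n.text)
        .join('\n---\n')}`;

    // 4. The LLM generates a response based on the additional context in the prompt.
    return callLlm(systemPrompt, question);
}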

Tool Calling

Tool calling allows an LLM to call predefined functions (“tools”) in the executing framework’s code. Through these tools it can reach external APIs (search engines, databases, triggers, etc.), which makes tool calling real-time capable.

LLMs still only handle text input and output:

  • A text description of each tool is added to the LLM’s prompt, so it knows how to call them correctly.
  • During generation, the LLM can emit JSON-formatted text that describes a tool call.
  • The framework intercepts this output and calls the tool.
  • The tool’s result is then added back to the prompt, and the LLM continues generating the next tokens.

This works better if the model was fine-tuned for tool calling.

Example:

import { RunnableToolFunction } from 'openai/lib/RunnableFunction';

// recipeApi is assumed to be defined elsewhere in the application.
const getRecipesByMainIngredientTool: RunnableToolFunction<any> = {
    type: 'function',
    function: {
        description: 'Search recipes by main ingredient. Ingredients should be in English and snake_case.',
        // The implementation the framework calls when the model requests this tool.
        function: recipeApi.getRecipesByMainIngredient,
        // JSON schema describing the arguments the model should produce.
        parameters: {
            type: 'object',
            properties: { ingredient: { type: 'string' } },
        },
        // Converts the model's JSON-formatted output into the function's arguments.
        parse: JSON.parse,
    },
};
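
As a usage sketch, the openai Node.js SDK’s tool runner can take such a definition and handle the intercept-call-continue loop automatically (model name and user message are illustrative; the exact runner API may differ between SDK versions):

import OpenAI from 'openai';

const openai = new OpenAI();

// The runner adds the tool description to the prompt, intercepts the JSON tool
// call emitted by the model, executes the function, feeds the result back and
// lets the model continue generating.
const runner = openai.beta.chat.completions.runTools({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Find me a recipe with chicken as the main ingredient.' }],
    tools: [getRecipesByMainIngredientTool],
});

// Top-level await, e.g. in an ES module.
const answer = await runner.finalContent();
console.log(answer);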

Fine Tuning

Fine-tuning adjusts the general behavior of an existing model through further training, for example to use it as a chatbot (question/answer), as a code generator, to write books, or to use certain vocabulary.

There are two options for this:

  • Full parameter fine-tuning: An existing model is further trained, but with a specialized dataset. This adjusts all parameters over time, creating a new model. Requires a lot of computing power (expensive).
  • Low-rank adaptation (LoRA): Adds additional parameters to a model without affecting the original parameters. Requires less computing power (a conceptual sketch follows below).
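
A conceptual sketch of the LoRA idea (just the arithmetic, not a training framework): instead of retraining a large frozen weight matrix W, a low-rank update B·A with far fewer parameters is learned and added on top.

// W is d × d (frozen); A is r × d and B is d × r with a small rank r,
// so the adapter only adds 2·d·r trainable parameters instead of d·d.
function loraForward(
    W: number[][],   // frozen pretrained weights (d × d)
    A: number[][],   // trainable low-rank adapter (r × d)
    B: number[][],   // trainable low-rank adapter (d × r)
    x: number[],     // input vector (length d)
): number[] {
    const matVec = (M: number[][], v: number[]) =>
        M.map((row) => row.reduce((sum, w, i) => sum + w * v[i], 0));

    const base = matVec(W, x);              // output of the unchanged original model
    const delta = matVec(B, matVec(A, x));  // low-rank correction learned during fine-tuning
    return base.map((value, i) => value + delta[i]);
}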