Single-turn vs Multi-turn conversation
...
Token
- the smallest unit of text that the model recognizes
- can be a whole word, part of a word, a number, or a punctuation mark
- one English word corresponds to roughly 1.3 tokens
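The 1.3 tokens-per-word rule of thumb above gives a quick way to estimate token counts without running a real tokenizer. A minimal sketch (the helper name `estimate_tokens` and the whitespace-splitting heuristic are illustrative, not any particular API):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the note above: ~1.3 tokens per English word.
    # Real tokenizers (BPE etc.) split on subwords, so actual counts differ.
    words = len(text.split())
    return round(words * 1.3)

print(estimate_tokens("Large language models read text as tokens."))  # 7 words -> ~9 tokens
```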
Context Caching
In large language model API usage, a significant portion of user inputs tends to be repetitive. For instance, user prompts often include repeated references, and in multi-turn conversations, previous content is frequently re-entered.
To address this, Context Caching technology caches content that is expected to be reused on a distributed disk array. When duplicate inputs are detected, the repeated parts are retrieved from the cache, bypassing the need for recomputation. This not only reduces service latency but also significantly cuts down on overall usage costs.
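The cache-lookup idea can be sketched as a longest-common-prefix check: tokens matching the cached prefix are "hits" served from the cache, and only the remaining tokens need recomputation. This toy simulator (the function name and list-of-token-ids representation are assumptions for illustration) shows how a multi-turn conversation, which re-sends the whole history each turn, produces a growing run of cache hits:

```python
def split_cached(prompt_tokens: list[int], cached_prefix: list[int]) -> tuple[int, int]:
    """Return (hit, miss) token counts for a prompt against a cached prefix."""
    n = 0
    # Count how many leading tokens match the cached prefix.
    while n < len(prompt_tokens) and n < len(cached_prefix) and prompt_tokens[n] == cached_prefix[n]:
        n += 1
    return n, len(prompt_tokens) - n

cache: list[int] = []
turn1 = [11, 12, 13, 14, 15]          # first message: nothing cached yet
print(split_cached(turn1, cache))      # (0, 5) -- all misses
cache = turn1                          # provider caches the processed prefix
turn2 = turn1 + [16, 17]               # second turn re-sends history + new tokens
print(split_cached(turn2, cache))      # (5, 2) -- history is a cache hit
```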
Billing
LLM API usage is typically priced per 1 million (1M) tokens.
At roughly 1.3 tokens per word, 1M tokens corresponds to 10^6 / 1.3 ≈ 769,231 words, i.e. about 770k words.
Billing is usually based on the total number of input and output tokens by the model.
If Context Caching is implemented, input billing per 1M tokens can further be split into two categories:
- 1M tokens - Cache Hit (1M tokens that were found in cache)
- 1M tokens - Cache Miss (1M tokens that were not found in cache)
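Putting the two rates together, input cost becomes a weighted sum of hit and miss tokens. A minimal sketch, with hypothetical prices (the $0.07/M and $0.27/M figures below are made-up placeholders, not any provider's real pricing):

```python
def input_cost(hit_tokens: int, miss_tokens: int,
               hit_price_per_m: float, miss_price_per_m: float) -> float:
    # Each category is billed at its own per-1M-token rate.
    return (hit_tokens / 1_000_000) * hit_price_per_m + \
           (miss_tokens / 1_000_000) * miss_price_per_m

# Hypothetical rates: $0.07 per 1M tokens on cache hit, $0.27 on miss.
print(input_cost(800_000, 200_000, 0.07, 0.27))  # 0.056 + 0.054 = $0.11
```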
Use Case: Chatbots
Chat History
LLMs don't have any concept of state or memory. Any chat history has to be tracked externally and then passed into the model with each new message. We can use a list of custom objects to track chat history. Since there is a limit on the amount of content that can be processed by the model, we need to prune the chat history so there is enough space left to handle the user's message and the model's responses. Our code needs to delete older messages.
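The external tracking and pruning described above can be sketched with a plain list and a token budget. Everything here is illustrative (the class name, the dict shape, and the per-message `tokens` field are assumptions); the key idea is dropping the oldest messages first so recent context survives:

```python
class ChatHistory:
    """Tracks messages externally and prunes oldest-first to fit a token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.messages: list[dict] = []  # each: {"role", "content", "tokens"}

    def add(self, role: str, content: str, tokens: int) -> None:
        self.messages.append({"role": role, "content": content, "tokens": tokens})
        self._prune()

    def _prune(self) -> None:
        # Delete older messages until the history fits the model's context budget.
        while sum(m["tokens"] for m in self.messages) > self.max_tokens:
            self.messages.pop(0)

history = ChatHistory(max_tokens=10)
history.add("user", "Hi!", 5)
history.add("assistant", "Hello, how can I help?", 5)
history.add("user", "Summarize our chat.", 5)   # budget exceeded: oldest dropped
print(len(history.messages))                     # 2
```

In a real chatbot the budget would be the model's context window minus room reserved for the new user message and the model's response.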
Retrieval-Augmented Generation (RAG)
If the responses from the chatbot are based purely on the underlying foundation model (FM), without any supporting data source, they can potentially include made-up responses (hallucination).
Chatbots that incorporate the retrieval-augmented generation (RAG) pattern ground the model's answers in retrieved documents from a supporting data source, returning more accurate responses.
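The RAG pattern boils down to: retrieve relevant documents, then prepend them to the prompt so the model answers from supplied context instead of inventing facts. A minimal sketch; the keyword-overlap scoring below is a stand-in assumption (production systems use embedding-based vector search), and `build_rag_prompt` is a hypothetical helper name:

```python
def build_rag_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Rank documents by naive keyword overlap and prepend the best to the prompt."""
    q_words = set(question.lower().split())
    # Score each document by how many question words it shares.
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    context = "\n".join(scored[:top_k])
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

docs = ["Paris is the capital of France.", "Bananas are yellow."]
print(build_rag_prompt("What is the capital of France?", docs, top_k=1))
```

The resulting string is what gets sent to the model, so the answer is constrained to the retrieved context rather than the FM's parametric memory alone.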
...