Simple Grain Bin RAG


Introduction

In this article we will explore how to build a Retrieval Augmented Generation (RAG) pipeline that acts as an assistant for building a grain bin. We will use NumPy for the calculations and OpenAI for the embedding and chat models. Keeping to these two core packages lets us focus on the pipeline itself and its details.

Objective

The goal is to create a RAG-based assistant for building a grain bin that outperforms non-augmented generative AI while staying cost effective. We will adopt a naive strategy for improving our interactions with the gpt-4o-mini chat model.

To evaluate our progress toward this goal we have 3 questions that we will use as a test of our system.

  1. “What is the min and max torque used for the bolts?”

    • The answer to this question is provided in a table on page 25.
    • Can we extract and format this exact answer correctly?
  2. “What voids the warranty?”

    • The warranty information is scattered across several parts of the document.
    • How effectively can we retrieve this disconnected information?
  3. “How do you install a 2-ring door?”

    • This question pertains to a specific section in the manual which includes detailed instructions and images.
    • Can we retrieve a comprehensive list of these complex instructions in a coherent manner?

These 3 questions will let us evaluate how well our strategy for interacting with the chat model is performing.

The question about torque values will be used as our litmus test since it has the most straightforward answer and is the simplest to extract from the manual.

Setup

OpenAI started the generative AI revolution and remains a leader in the field. They offer a cost-effective way to achieve great results. To connect to OpenAI you’ll need an API key; instructions for obtaining one are easy to find online.

The first step is to make your OPENAI_API_KEY available to your Python program. You can do this by using the dotenv Python package along with a .env file that contains your key.

Python Setup

load_environment.py
from dotenv import load_dotenv
load_dotenv()

.env File

.env
OPENAI_API_KEY=sk-123...

Then we will create the OpenAI client and use the gpt-4o-mini model.

openai_setup.py
from openai import OpenAI
client = OpenAI()  # Connects to OpenAI and reads OPENAI_API_KEY from the environment
 
def get_chat_response(question):
    response = client.chat.completions.create(  # Create a chat completion
        messages=[
            {
                "role": "user",  # Question asked by user
                "content": question,  # The actual question
            }],
        model="gpt-4o-mini",  # Using GPT-4o-mini model
    )
    return response.choices[0].message.content  # Extract response string

We will use the get_chat_response function from this point forward to interact with the gpt-4o-mini model. That’s all it takes to get started with OpenAI.

The OpenAI package conveniently looks for the OPENAI_API_KEY environment variable and loads it automatically.
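If you want to confirm the key was actually picked up from the .env file, a quick optional check with os.environ works (the file name here is just for illustration):

check_env.py
import os

# Optional: confirm the key made it into the environment before creating the client
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"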

To simplify asking our base questions we put them in an enum called EvalQ which is short for evaluation questions.

evaluation_questions.py
from enum import StrEnum
 
class EvalQ(StrEnum):
    TORQUE = "What is the min and max torque used for the bolts?"
    WARRANTY_VOID = "What voids the warranty?"
    TWO_RING_DOOR = "How do you install a 2 ring door?"

And the final thing we will setup is the data we will be using, the grain bin manual.

The manual (we will refer to it simply as “the manual” from here on) is a PDF file that can be found here. It provides detailed instructions for constructing a CB34 Grain Bin Sidewall, covering everything except the roof.

Here we will load the PDF into a Python object and then split the document in various ways for later use.

load_manual.py
from pypdf import PdfReader
 
manual = 'grainbin_manual.pdf'
reader = PdfReader(manual)
 
manual_pages = [page.extract_text() for page in reader.pages]  # Split into pages
manual_string = "\n\n".join(manual_pages)  # One string
manual_words = manual_string.split(" ")  # Split into words
manual_pages_words = [page.split(" ") for page in manual_pages]  # Words per page
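As a quick, optional sanity check we can glance at the sizes of what we just loaded; the exact numbers depend on the PDF.

inspect_manual.py
# Optional sanity check on what was loaded (exact numbers depend on the PDF)
print(len(manual_pages))      # number of pages extracted
print(len(manual_words))      # rough word count for the whole manual
print(manual_pages[0][:200])  # first 200 characters of the first page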

Key Terminology

Name            | What                                  | Example
encoding        | Text converted to tokens              | "Hi there" → [123, 2, 0, 2]
tokens          | Numeric representation of text chunks | Individual pieces of text
embedding       | Text made into numerical vectors      | High-dimensional vector space
chat model      | Generative AI model                   | gpt-4o-mini
embedding model | Converts text into numerical values   | text-embedding-3-small

Baseline Answer

Before we measure how well the RAG pipeline is performing we need something to measure it against. To do that we will ask the chat model without any context to see what a non-augmented answer looks like.

Prepare Question

To be fair to the chat model, we will tell it with each question that this is for a CB34 Chief Industries Grain Bin Sidewall so it knows what the questions pertain to.

add_context.py
def add_withwhat(question):
    return f"{question} This is for a CB34 Grain Bin Sidewall from Chief Industries."
 
torque_question_withwhat = add_withwhat(EvalQ.TORQUE)
torque_answer_withwhat = get_chat_response(torque_question_withwhat)

Here is what the question looks like now and the answer we get from the chat model.

Question: What is the min and max torque used for the bolts? This is for a CB34 Grain Bin Sidewall from Chief Industries.

Answer: “I don’t have specific information about the torque specifications for the bolts used in the CB34 Grain Bin Sidewall from Chief Industries. For accurate details, I recommend checking the installation manual…” (approximately 80 words)

This answer tells us to look somewhere else, which is not very helpful. On top of that, it takes many words to say so.

Word count is a rough proxy for how long a response takes and how much it costs.
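A rough word count is enough for this comparison; something as simple as splitting on whitespace works:

count_words.py
# Rough word count of a response, used as a crude proxy for length and cost
def word_count(text):
    return len(text.split())

print(word_count(torque_answer_withwhat))  # roughly 80 words for the answer above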

So if the chat model doesn’t know, it should just let us know!

dont_know_template.py
def question_template_idontknow(question):
    return f"""
{question} If you don't know the answer just say I don't know.
"""
 
torque_question_idontknow = question_template_idontknow(torque_question_withwhat)
torque_answer_idontknow = get_chat_response(torque_question_idontknow)

Answer: “I don’t know.”

That is a drastic improvement. In 3 words we now know that the chat model cannot provide the answer we are looking for.

A very simple modification of the question resulted in an answer that is concise and understandable, and the behavior was consistent across all 3 questions.

Prompt Template

A final point on enhancing our prompt is the need for a more generic way to modify our questions. Up until now we’ve relied on individual functions, but this approach is becoming unwieldy.

To streamline this process, let’s create a prompt template that ensures our questions follow best practices when querying the model.

prompt_template.py
def generate_prompt(question, context=""):
    return f"""
You are a Chief Industries bin dealer. Use the following pieces of retrieved context to answer the question.
 
If you don't know the answer, just say that you don't know.
 
If appropriate write out as a table and only answer the exact question given. Do not offer safety advice. No need to restate the question in any way.
 
Context:
{context}
 
Question:
{question}
 
Answer:
 
"""

Key elements of the template:

  1. Tell the chat model who to act as and what to look for in the question
  2. “I don’t know” addition
  3. Only respond with actionable language instead of fine print and repetition
  4. Context section for the grain bin manual
  5. The question being asked
  6. Answer prompt for expected formatting

Having this template means we don’t have to rewrite the boilerplate every time we want a question answered. Let’s see if it makes a difference in responses.

test_prompt.py
torque_prompt_question = generate_prompt(EvalQ.TORQUE)
torque_prompt_answer = get_chat_response(torque_prompt_question)

All three questions now return “I don’t know” in a consistent, concise manner.

This is our baseline answer we are going to work on improving.

Encoding

Tokens represent the way text is broken down for the chat model’s consumption, and context length refers to the number of tokens batched together. These two concepts are crucial in determining how long a query will take and its associated cost.

We are using the gpt-4o-mini chat model, OpenAI’s current budget option. Even though it is on the cheaper side, it is still easy to end up with an explosion of tokens being sent and received. Here are the pricing details for gpt-4o-mini:

  • 15¢ per million input tokens
  • 60¢ per million output tokens

The number of tokens used when asking a question and receiving an answer impacts both cost and computation time. The maximum number of tokens a model can handle at once is known as the context window. A larger context window allows for more data to be processed but also increases computation time and cost.
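To make those prices concrete, here is a small back-of-the-envelope sketch that converts token counts into dollars; the example token counts are illustrative rather than measured from a real request.

estimate_cost.py
# Rough cost estimate for gpt-4o-mini using the prices listed above
INPUT_PRICE = 0.15 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 0.60 / 1_000_000  # dollars per output token

def estimate_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Example: a prompt roughly the size of the full manual (~18,000 tokens, see below)
# and a ~300-token answer comes out to well under a cent
print(f"${estimate_cost(18_000, 300):.4f}")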

See What You Got

Encoding is the process of converting text into tokens. This step is normally hidden behind the model’s API, but doing it yourself shows exactly what the model consumes and is a good step in exploratory analysis.

The tiktoken package lets us count the number of tokens in a piece of text locally. It uses Byte Pair Encoding (BPE), which can go from text to tokens and back to text again, making it easy to examine the encoding process.

encoding.py
import tiktoken
 
model_encoding = tiktoken.encoding_for_model("gpt-4o-mini")
manual_string_encoded = model_encoding.encode(manual_string)

The entire manual contains approximately 18,000 tokens. That is still well under gpt-4o-mini’s 128k context window, but it is already over the limit of some models, such as the embedding model we will look at in the next section.
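That count comes straight from the encoded list, and because BPE is reversible we can decode the tokens back into the original text; a quick check:

count_tokens.py
# Token count for the whole manual, plus a round trip back to text
print(len(manual_string_encoded))                 # roughly 18,000 tokens
round_trip = model_encoding.decode(manual_string_encoded)
print(round_trip == manual_string)                # True: encoding is lossless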

It is important to note that answering our questions doesn’t require the entire document. By constraining the context to only relevant information we can not only reduce noise but also improve cost effectiveness.

How Does It Break Down

Understanding what is being encoded is crucial for effectively managing and optimizing our use of chat models.

By gaining insight into this process we’ll be better equipped to make informed decisions about constraining our context to what is relevant. Below are the first 5 tokens that are encoded from our data:

  • Encoding 1: (space)
  • Encoding 2: CB
  • Encoding 3: 34
  • Encoding 4: -
  • Encoding 5: Installation

The first encoding represents a space. Subsequent encodings involve letters and numbers with a full word not appearing until the 5th encoding.

Most encodings are relatively short, though they do not have to be: the distribution shows most tokens are 1-4 characters long, with the bulk at 2-3 characters.
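One way to see that distribution for yourself is to decode each token on its own and count its characters; a small sketch:

token_lengths.py
from collections import Counter

# Decode each token individually to see how many characters it covers
token_texts = [model_encoding.decode([token]) for token in manual_string_encoded]
length_counts = Counter(len(text) for text in token_texts)
print(length_counts.most_common(5))  # short tokens dominate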

Now we know how text is broken down into tokens. Let’s break up the manual so we can keep our token count more constrained when interacting with our chat model.

Split Up the Manual

Splitting the text will allow us to not have to use the entire dataset every time we make a call to our chat model. Here are 3 simple but effective strategies for splitting up the manual:

  • Split by page: This is done when the file was read
  • Split by word count: This gives a more granular approach and is not restricted by page start and stop
    • A hard cutoff could slice through the middle of an explanation
  • Recursive text splitting: Create a window of overlapping text where we control the length and the overlap
    • Overlapping the windows of text reduces the hard cutoff problem from splitting by word

split_strategies.py
import more_itertools as mi
 
manual_page_split = manual_pages
manual_word_split = [" ".join(chunk) for chunk in mi.chunked(manual_words, 101)]  # Fixed-size word chunks
manual_recursive_split = [" ".join(window) for window in mi.windowed(manual_words, 98, step=60) if None not in window]  # Overlapping windows; drop the None-padded final window

The next section is about embedding our text into numerical values using a model called text-embedding-3-small. As a primer, we will use the tokenizer that corresponds to this embedding model so we have better insight into what we are about to use and why it is necessary to constrain our input tokens.

encode_splits.py
embedding_model_encoding = tiktoken.encoding_for_model("text-embedding-3-small")
manual_page_split_encoded = embedding_model_encoding.encode_batch(manual_page_split)
manual_word_split_encoded = embedding_model_encoding.encode_batch(manual_word_split)
manual_recursive_split_encoded = embedding_model_encoding.encode_batch(manual_recursive_split)
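To see how each strategy constrains the token count, we can check the largest chunk each one produces; a quick sketch (exact numbers depend on the split parameters):

split_token_counts.py
# Largest chunk, in tokens, produced by each splitting strategy
for name, encoded_splits in [
    ("page split", manual_page_split_encoded),
    ("word split", manual_word_split_encoded),
    ("recursive split", manual_recursive_split_encoded),
]:
    print(name, max(len(tokens) for tokens in encoded_splits))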

Embeddings

An embedding model converts text into numbers. In our case it takes care of encoding the data for us, but encoding is still something we need to be aware of for context window limits and billing purposes.

With a numerical representation of our text we can treat each piece as a context-aware vector, where the context comes from the data the embedding model was trained on and how it was trained.

This is where we start getting into the retrieval part of our RAG setup. How do we add the relevant parts of our data as context to the original question?

Let’s connect to OpenAI’s text-embedding-3-small embedding model and see what that looks like.

Measuring Context

What is context? Context is anything relevant to the question that allows the answerer to have the information required to construct an answer.

To measure context, then, we need a way of scoring the similarity between two pieces of text embedded into a context-aware vector space.

A common measure is cosine similarity, which measures the angle between two vectors: a value near 0 means the vectors are unrelated, while a value of 1 means they point in the same direction (negative values indicate opposing directions).

cosine_similarity.py
import numpy as np
 
def cosine_similarity(x, y): 
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

Let’s do a quick hello world with the Cosine Similarity metric and the text-embedding-3-small model.

Hello World

embeddings_hello_world.py
def get_embeddings(text, model_name="text-embedding-3-small"):
    return client.embeddings.create(input=text, model=model_name).data[0].embedding
 
hello_embedding = get_embeddings("Hello world")
hello_again_embedding = get_embeddings("Hello world again")
dinos_embedding = get_embeddings("Dinosaurs are real")

The text-embedding-3-small model returns a vector of 1,536 float values. The vectors are normalized to unit length, and the individual values don’t offer much information on their own.
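With those three vectors we can put the cosine_similarity function to work; the two greetings should score noticeably higher against each other than against the dinosaur sentence (exact values will vary).

hello_world_similarity.py
# Related sentences should score higher than unrelated ones (exact values vary)
print(cosine_similarity(hello_embedding, hello_again_embedding))  # high similarity
print(cosine_similarity(hello_embedding, dinos_embedding))        # noticeably lower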

What Makes the Cut

Using the similarity score, let’s see if we can extract the correct information from each of our splits.

embed_splits.py
manual_page_split_embeddings = [get_embeddings(split) for split in manual_page_split]
manual_word_split_embeddings = [get_embeddings(split) for split in manual_word_split]
manual_recursive_split_embeddings = [get_embeddings(split) for split in manual_recursive_split]

Now we get the embeddings for our questions and compare them to the split-text embeddings from above.

embed_questions.py
torque_embeddings = get_embeddings(EvalQ.TORQUE)
warrantyvoid_embeddings = get_embeddings(EvalQ.WARRANTY_VOID)
tworingdoor_embeddings = get_embeddings(EvalQ.TWO_RING_DOOR)

compare_embeddings.py
def compare_embeddings(compare_with, compare_to):
    results = []
    for embedding in compare_with:
        results.append(cosine_similarity(compare_to, embedding))
    return results
 
torque_pages_similarity_scores = compare_embeddings(manual_page_split_embeddings, torque_embeddings)
torque_word_similarity_scores = compare_embeddings(manual_word_split_embeddings, torque_embeddings)
torque_recursive_similarity_scores = compare_embeddings(manual_recursive_split_embeddings, torque_embeddings)

Each splitting strategy has different characteristics:

  • Page split: Shows a clear separation in similarity scores, with one page (page 25, which contains the torque information) scoring significantly higher (~0.65)
  • Word split: Highest similarity score (~0.68), but the relevant information is cut off mid-chunk
  • Recursive split: More consistent scores, with the relevant information spread across overlapping windows

In other words, the page split gives us a single clearly separated score, and it belongs to the page the torque information is on (page 25).
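Picking that page out programmatically is just an argmax over the similarity scores; a minimal sketch:

top_page.py
# Index of the most similar page (0-based, so it won't match the printed page number exactly)
best_page_idx = int(np.argmax(torque_pages_similarity_scores))
print(best_page_idx, torque_pages_similarity_scores[best_page_idx])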

The text-embedding-3-small model accepts at most 8,191 tokens per input. The entire manual (~18,000 tokens) exceeds this, so splitting is required.

The recursive text split gives a good mix of lowering our token count while also lowering our chances of missing out on context.

RAG Time

Now we have all the elements to create our RAG pipeline:

  1. Load PDF file as a string
  2. Split string text recursively
  3. Embed each split of text using OpenAI’s text-embedding-3-small model
  4. Embed the question being asked with the same model
  5. Retrieve 2 most similar text splits to our question
  6. Input question and the similar splits into our prompt template
  7. Query OpenAI chat model with our augmented question

Our approach has been simplified to the point that each step could benefit from optimization. That will be for another article.

I Know Too Much

The chat model lacks the information needed to provide a useful response. The manual does have fewer tokens than gpt-4o-mini’s context window limit, so let’s look at what happens when we add the entire manual to our prompt template. This is essentially augmenting our question without retrieval.

full_context.py
torque_question_manual_context = generate_prompt(EvalQ.TORQUE, context=manual_string)
 
warrantyvoid_question_manual_context = generate_prompt(EvalQ.WARRANTY_VOID, context=manual_string)
tworingdoor_question_manual_context = generate_prompt(EvalQ.TWO_RING_DOOR, context=manual_string)
 
warrantyvoid_answer_manual_context = get_chat_response(warrantyvoid_question_manual_context)
tworingdoor_answer_manual_context = get_chat_response(tworingdoor_question_manual_context)
torque_answer_manual_context = get_chat_response(torque_question_manual_context)

Whoa! Another vast improvement to the response we receive without much effort. All 3 questions now get detailed, accurate answers from the manual.

Augment with Context

Can we get the same or better results if we reduce the extraneous information given as context?

Computation times will certainly improve and cost will go down, but the quality of the response should not suffer. In fact, it can even improve, since this essentially removes noise from the context.

We will now get the correct splits for all of our questions like we did in the “What Makes the Cut” section.

create_final_answer.py
def create_final_answer(question, question_embedding, 
                       text_split_embeddings=manual_recursive_split_embeddings, 
                       text_splits=manual_recursive_split, k=2):
    similarity_scores = compare_embeddings(text_split_embeddings, question_embedding)
    top_k_idxs = np.argsort(similarity_scores)[-k:]
    context_string = "\n".join([text_splits[i] for i in top_k_idxs])
    final_question = generate_prompt(question, context_string)
    final_answer = get_chat_response(final_question)
    return final_answer
 
torque_final_answer = create_final_answer(EvalQ.TORQUE, torque_embeddings, k=2)

Torque Question Results (k=2)

Answer: The manual specifies min torque of 30 ft-lbs and max torque of 35 ft-lbs for the bolts.

That is a perfect answer! It looks like that was all we needed, and we gained a big performance boost from trimming down our input tokens.

Warranty Question Results

This information is strewn throughout the manual, so we need more context:

  • k=2: Missing context
  • k=12: Better but still incomplete
  • k=22: Good coverage of warranty void conditions
  • k=150: Most comprehensive answer

Since the information is found throughout the document it is not surprising that we need to have a lot more context than we did when we asked about the torque values.
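Those comparisons can be reproduced by simply re-running the retrieval with different values of k; a sketch of that loop using the k values listed above:

vary_k.py
# Re-run retrieval for the warranty question with increasing amounts of context
for k in (2, 12, 22, 150):
    answer = create_final_answer(EvalQ.WARRANTY_VOID, warrantyvoid_embeddings, k=k)
    print(f"k={k}:\n{answer}\n")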

Two Ring Door Question Results

This information spans a few pages, but those pages live together:

  • k=2: Incomplete instructions
  • k=12: Better step-by-step instructions
  • k=50: Comprehensive detailed instructions

We get pretty good results, though with less detail than when we used the entire manual as context. Overall, we can get similar results by retrieving only the chunks of the manual that are most similar to our question.

Next Steps

Now that we’ve built the RAG pipeline by hand there is plenty of room for optimization. Here are just a few areas where improvements can be made:

  • Implement a semantic text splitter for more meaningful text chunking
  • Experiment with different embedding spaces to find the best fit for your data
  • Explore advanced methods for selecting the most similar documents
  • Design a more refined and effective prompt template
  • Investigate vector databases and relevant frameworks for enhanced performance

These suggestions are just the beginning. Each of these topics deserves deeper exploration in future articles.

Conclusion

Building a RAG pipeline is both straightforward and remarkably effective for getting answers out of your own data. Even with the simple walkthrough we’ve covered here, it’s evident how quickly you can achieve impressive results. This approach offers anyone looking to integrate AI into their workflows a powerful way to elevate data-driven insights.

Glossary

Name                           | What                                                      | Aliases
Retrieval Augmented Generation | Update questions asked of a chat model by adding context | RAG
Encoding                       | Convert bits of string into numbers                      | tokenization
Embedding                      | Convert bits of string into numerical vectors            | vectorization
Chat model                     | Models that you can have a conversation with             | gpt-4*, claude, llama
Grain bin                      | Large metal storage container for grain                  | bin