5. Design of an Initial Prototype

Hardware and Software Specifications

Inference was performed on a system powered by the NVIDIA GH200 Grace Hopper Superchip, which combines a 72-core Arm Neoverse V2 Grace CPU and an H100 GPU with 141GB of HBM3e, connected via 900GB/s NVLink-C2C. The system includes 480GB of LPDDR5X CPU memory and runs Rocky Linux 9.4, enabling efficient execution of large language models through unified high-bandwidth memory access.

We implemented the entire pipeline in Python (v. 3.10) using the following libraries and tools (a minimal sketch of how they fit together appears after the list):

  • Gradio (v. 5.16.2): for creating a web interface for users.

  • ChromaDB (v. 0.6.3): for storing documents used in the RAG pipeline.

  • Ollama (v. 0.6.1): for running downloaded LLMs locally.

  • LangChain (v. 0.3.21): for connecting all components.
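For concreteness, the sketch below shows how these components are typically wired together at the user-interface level. The function body is a placeholder for the RAG pipeline described in the remainder of this section, and the labels and title are illustrative rather than the exact implementation.

    import gradio as gr

    def generate_code(user_query: str) -> str:
        # Placeholder: in the full pipeline this retrieves examples from ChromaDB,
        # builds the prompt, and calls the locally served LLM via LangChain and
        # Ollama, as detailed in the remainder of this section.
        return f"```python\n# generated FABRIC code for: {user_query}\n```"

    demo = gr.Interface(
        fn=generate_code,
        inputs=gr.Textbox(label="Describe the FABRIC experiment"),
        outputs=gr.Markdown(label="Generated code"),
        title="FABRIC Code Assistant",
    )

    if __name__ == "__main__":
        demo.launch()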

Creating the Vector DB

Ingesting data into the vector DB involved several steps: deciding on the input data format, choosing an input data filtering approach, determining how to split the data, and embedding it in the vector store. The following subsections address each of these steps.

Input Data Format

As described earlier, the Jupyter Notebook examples published by the FABRIC development team as a guide to FABlib serve as the data source for the vector store used in our RAG model. We discovered early on, however, that ingesting unmodified Notebooks did not perform well. Notebooks, by design, include a fair amount of extraneous data that is not only unnecessary but undesirable in the generated code. Because LLMs are strongly influenced by their input, when this superfluous information was included in the prompt, the output tended to contain useless text such as the cell type (code vs. Markdown), execution count, and cell ID. Similarly, loading the Notebooks as JSON files picked up too much unneeded information and produced code containing similar, yet incorrect, metadata.

Because FABRIC Jupyter Notebooks are essentially Python files, we decided to try converting the Notebooks to Markdown before ingesting them. We also tried converting them to plain Python scripts. In the Markdown format, the "Markdown" cells of the original Notebook become plain Markdown text, while the "code" cells become code blocks. For Notebooks converted to Python files, the "Markdown" cells are included in the Python script as multi-line comments. When the vector store was built with either of these formats, the system produced desirable output in most cases; we ultimately chose the Python script conversion because it tended to generate more concise and cleaner code at the end of the pipeline.
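The conversion itself can be scripted with nbconvert's PythonExporter. The sketch below is a minimal illustration; the directory names are placeholders rather than the actual repository layout.

    from pathlib import Path
    from nbconvert import PythonExporter

    NOTEBOOK_DIR = Path("fabric_examples")     # hypothetical input location
    OUTPUT_DIR = Path("converted_scripts")     # hypothetical output location
    OUTPUT_DIR.mkdir(exist_ok=True)

    exporter = PythonExporter()  # code cells stay Python; Markdown cells become comments

    for nb_path in NOTEBOOK_DIR.rglob("*.ipynb"):
        source, _resources = exporter.from_filename(str(nb_path))
        (OUTPUT_DIR / f"{nb_path.stem}.py").write_text(source)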

Input Data Filtering

Data cleansing is usually key to success in AI/ML projects, and this application is no exception. We initially included all the Notebooks in the repository, converted to Python scripts, as the input data for the vector store. It was immediately clear from the LLM output that the retrieved documents often included obsolete or irrelevant examples, causing the generated code to follow outdated patterns. Consequently, we manually reviewed all directories and removed unnecessary and outdated files from the input data collection.
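One simple way to encode the outcome of that manual review during ingestion is an explicit exclude list; the directory names below are hypothetical placeholders, not the actual pruned paths.

    from pathlib import Path

    # Hypothetical exclude list produced by the manual review.
    EXCLUDED_DIRS = {"archive", "deprecated", "work_in_progress"}

    def keep(path: Path) -> bool:
        """Return True if a converted script should be ingested."""
        return not any(part in EXCLUDED_DIRS for part in path.parts)

    scripts = [p for p in Path("converted_scripts").rglob("*.py") if keep(p)]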

Splitting

It is standard practice to split documents into smaller chunks, since long chunks can make the augmented input string too long for the context window of smaller LLMs, or can include so much extraneous data that the LLM is confused. At the same time, when the retrieved documents are small segments, they naturally tend to lose contextual information. This is especially problematic for code examples, which demonstrate not only the syntactic usage of the API but also a logical sequence of operations that cannot be reordered arbitrarily. After initial experimentation, we decided to keep each converted Python script intact as a single document, without further splitting, and to focus instead on improving the accuracy of retrieving a small number of well-chosen documents for each query.
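In LangChain terms, keeping each script intact simply means wrapping it in a single Document rather than running it through a text splitter. The sketch below assumes the converted scripts from the previous steps.

    from pathlib import Path
    from langchain_core.documents import Document

    # One Document per converted script; no text splitter is applied, so each
    # example keeps its full logical sequence of operations.
    documents = [
        Document(page_content=p.read_text(), metadata={"source": str(p)})
        for p in Path("converted_scripts").rglob("*.py")
    ]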

Embeddings

We experimented with a few popular embedding models. Perhaps because the expected queries are generally simple and the comments included in the stored examples are explicit, we did not observe a significant difference in their ability to find relevant documents. We are currently using the all-mpnet-base-v2 model with good success.
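A minimal sketch of building the vector store with this embedding model, assuming the LangChain integrations for Hugging Face and Chroma (package names vary slightly across LangChain releases) and continuing from the documents created above:

    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_chroma import Chroma

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"
    )

    vector_store = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory="chroma_db",  # hypothetical on-disk location
    )

    retriever = vector_store.as_retriever(search_kwargs={"k": 3})  # k is illustrative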

Selecting a Large Language Model (LLM)

The primary goal of our work has been to find open-source LLMs that can be run locally. Furthermore, we focused on small LLMs (up to 25B parameters) because larger LLMs tend to require an astonishingly large amount of computational power and/or time. The only exception included in our experiment is OpenAI's gpt-4o-mini model, which we consider to be a reputable, proprietary, cloud-based, low-cost model. Table 1 below lists the models that we considered and/or explored as part of our testing.

Table 1: Large Language Models

Model Name | Parameter Count | Size | Known Dates | Free?
codellama:7b [Roziere et al., 2023] | 6.7B | 4GB | 2023 (trained and released) | Yes
codellama:13b [Roziere et al., 2023] | 13.0B | 7GB | 2023 (trained and released) | Yes
codestral [Codestral | Mistral AI — mistral.ai, n.d.] | 22.2B | 12GB | 2022 (trained), 2024 (released) | Yes
mistral-small [Jiang et al., 2023] | 23.6B | 14GB | 2024 (released) | Yes
codegemma:7b [Team et al., 2024] | 8.5B | 5GB | 2024 (released) | Yes
phi4 [Abdin et al., 2024] | 14.7B | 9GB | 2024 (released) | Yes
deepseek-coder-v2 [Guo et al., 2024] | 15.7B | 9GB | 2024 (released) | Yes
gpt-4o-mini [OpenAI and Josh Achiam, 2024] | 8B (estimated) | N/A | 2024 (released) | No

In addition to the LLM selected, prompt engineering (i.e., how the user question is written) also plays a key role in the correctness of the resulting code. With or without RAG, prompt engineering is an enigmatic aspect of LLMs that has a major effect on model efficacy [Sahoo et al., 2024]. Not only the inclusion or exclusion of words and phrases, but also their order can make a difference. We learned during our exploration phase that explicitly naming the most important object class in the FABRIC Python API (FablibManager) was crucial. After multiple rounds of trial and error, we determined that the following system prompt, which wraps the user query as well as the relevant Notebook examples retrieved from the vector store, consistently performed well:

You are an AI Code Assistant. Use the following pieces of context (examples) to generate python code to implement the user's question specifically for the FABRIC testbed. 
Use FablibManager whenever possible. Make sure to include correct import statements.
Generate the answer in Markdown.
If the question is very different from the context provided, simply say you cannot help.
{retrieved documents as a string}
    
    
Question: On FABRIC Testbed, {user query} Use FablibManager as much as possible. Include import statements.
    
Here is how you will implement that (in markdown):
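One way to wire this prompt into the pipeline is with a LangChain prompt template, where {context} receives the retrieved examples and {question} receives the user query. The sketch below abbreviates the system prompt shown above and reuses the retriever from the earlier sketch.

    from langchain_core.prompts import PromptTemplate

    PROMPT = PromptTemplate.from_template(
        "You are an AI Code Assistant. Use the following pieces of context (examples) "
        "to generate python code to implement the user's question specifically for the "
        "FABRIC testbed. Use FablibManager whenever possible. Make sure to include "
        "correct import statements. Generate the answer in Markdown. If the question "
        "is very different from the context provided, simply say you cannot help.\n"
        "{context}\n\n"
        "Question: On FABRIC Testbed, {question} Use FablibManager as much as possible. "
        "Include import statements.\n\n"
        "Here is how you will implement that (in markdown):"
    )

    def build_prompt(user_query: str) -> str:
        # Retrieve relevant converted scripts and join them into one context string.
        docs = retriever.invoke(user_query)
        context = "\n\n".join(d.page_content for d in docs)
        return PROMPT.format(context=context, question=user_query)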

In addition to prompt engineering and LLM selection, we experimented with different temperature settings. Temperature can be thought of as the LLM's confidence (or creativity) level. At one end of the spectrum, the LLM has little confidence in its answers and tends to respond that it has no information or context to answer the question (low creativity). At the other end, the LLM answers with great confidence, as if its output were correct, when it may be anything but (highly creative). While temperature adjustment can be useful, especially for generating interesting output, the non-deterministic nature of LLMs poses a significant challenge to researchers and practitioners trying to evaluate model performance. To minimize this stochasticity, we use a temperature of 0 (least creative, most deterministic) for all models throughout the experiment.
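Concretely, with the Ollama integration in LangChain the temperature is fixed when the model is instantiated. A minimal sketch, using one of the models from Table 1 and the prompt builder from the previous sketch; the example query is illustrative.

    from langchain_ollama import ChatOllama

    # temperature=0 keeps generation as deterministic as the model allows,
    # which makes runs comparable across the models in Table 1.
    llm = ChatOllama(model="codellama:7b", temperature=0)

    query = "create a slice with two nodes connected by an L2 network"
    answer = llm.invoke(build_prompt(query))
    print(answer.content)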