1. Introduction
Jupyter Notebooks are widely used by the research community. They enable researchers to easily share scientific software and experiments, and have led to the widespread availability of shared Notebook repositories. The massive number of available example Notebooks not only makes it easier for researchers to find example code when writing new Notebooks, but also represents a wealth of data that Generative AI systems can use to automatically generate experiment-specific Notebooks. While automated code generation has been used in other contexts [Lewis et al., 2020], generating experiment-specific code for specialized research (cyber)infrastructure remains a challenge. To address the problem of code generation for advanced research infrastructure, this paper explores the use of RAG-based AI techniques to automatically generate Jupyter Notebooks in the context of the NSF FABRIC testbed.
FABRIC [FABRIC Portal — portal.fabric-testbed.net, n.d.] is a next-generation network testbed consisting of over 30 sites across the U.S., Asia, and Europe that interconnect many national research supercomputing facilities and other specialized testbeds via high-speed links. Each FABRIC node (“router”) is an advanced compute cluster with GPUs, FPGAs, programmable NICs, and large amounts of storage. Using FABRIC, researchers can run entirely new experiments that not only leverage the computation and storage capabilities of national research facilities but also program and control the network that interconnects them.
Setting up experiments on the FABRIC testbed, however, requires that researchers learn and use FABlib [fabrictestbed-extensions documentation — fabric-fablib.readthedocs.io, n.d.], FABRIC’s own Python API. Like much of today’s research software, FABRIC examples are given in the form of Jupyter Notebooks, which serve as templates for various experimental tasks. While there is an extensive collection of FABRIC examples, the learning curve for first-time users can be steep. Moreover, as the capabilities and scope of FABRIC expand, the number of example Notebooks grows as well, making it challenging even for seasoned users to find example code and adapt it to create new Notebooks. While LLM-based code-assisting tools, such as Microsoft Copilot [Microsoft Copilot: Your AI companion — copilot.microsoft.com, n.d.], have become increasingly useful and “smarter”, their performance is still severely limited when the requested code requires understanding of and familiarity with a library that is used by a relatively small number of researchers and students.
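To give a sense of the API, the following is a minimal sketch along the lines of FABRIC’s introductory examples. The import and method names follow the fabrictestbed-extensions documentation, but the slice, node, and site names are illustrative assumptions, and details may vary across FABlib versions.

```python
# Minimal FABlib sketch: create a one-node slice and run a command on it.
# The site "UCSD" and the slice/node names are illustrative assumptions.
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()                      # loads the user's FABRIC credentials

slice = fablib.new_slice(name="hello-fabric")  # define a new (empty) slice
slice.add_node(name="node1", site="UCSD")      # request one VM at an example site
slice.submit()                                 # submit and wait for the slice to be active

node = slice.get_node(name="node1")
stdout, stderr = node.execute("echo Hello, FABRIC!")  # run a command on the VM
print(stdout)
```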
To address this problem, we have implemented an AI-based Natural-Language-to-Code tool, leveraging the power of large language models (LLMs) and Retrieval-Augmented Generation (RAG) [Lewis et al., 2020], that generates FABlib Python scripts for FABRIC users. It allows researchers to make natural-language requests such as “Create a FABRIC slice that has 3 interconnected nodes at UCSD, Utah, and Tokyo and run an iperf test” and have the tool produce a FABRIC Notebook that creates the specified slice. RAG enhances LLMs by integrating external information retrieval with the model’s generative capabilities: instead of relying solely on pre-trained knowledge, the model searches a vector database at query time, retrieving relevant documents or data to incorporate into its responses. This improves the accuracy and relevance of the output, especially for questions or tasks that require up-to-date or domain-specific information beyond the model’s training data. Our results show that incorporating RAG, with FABRIC Notebook examples ingested into a vector database for retrieval, significantly improves the performance of LLMs on FABRIC code generation tasks. Even in advanced use cases where the user’s request is too complex for the code generator to produce a completely error-free script, the output can serve as a useful starting point.
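As a concrete illustration, the sketch below shows the core retrieve-then-generate loop. The embedding model (sentence-transformers), vector index (FAISS), prompt format, and the helpers `load_notebook_snippets` and `llm_generate` are all illustrative assumptions, not the prototype’s actual stack; the components we actually used are described in Section 5.

```python
# Hedged sketch of a retrieve-then-generate (RAG) step over FABRIC examples.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Ingest: embed code snippets extracted from FABRIC example Notebooks.
snippets = load_notebook_snippets()  # hypothetical helper
vectors = embedder.encode(snippets, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product on unit vectors = cosine
index.add(np.asarray(vectors, dtype="float32"))

# Retrieve: find the examples most similar to the user's request.
request = ("Create a FABRIC slice that has 3 interconnected nodes at "
           "UCSD, Utah, and Tokyo and run an iperf test")
query = embedder.encode([request], normalize_embeddings=True)
_, ids = index.search(np.asarray(query, dtype="float32"), 3)
context = "\n\n".join(snippets[i] for i in ids[0])

# Generate: pass the retrieved examples to the LLM along with the request.
prompt = (f"Using these FABlib examples:\n{context}\n\n"
          f"Write a FABlib script for: {request}")
answer = llm_generate(prompt)  # hypothetical call to a local or hosted LLM
```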
The rest of the paper is organized as follows: Sections 2 and 3 provide the context and background for this project. Section 4 presents a high-level view of our tool, while Section 5 describes the technical details of our initial prototype, including how we created the vector store, refined our prompts, and ran LLMs locally. Section 6 summarizes the results of our performance tests, in which different LLMs, with and without RAG, were evaluated on FABRIC-specific code generation. Sections 7 and 8 discuss current issues in implementing and testing RAG-based applications and possibilities for further development in this area.