2. Motivation
Many research domains now use Jupyter Notebooks [Project Jupyter — jupyter.org, n.d.] both to write the code that drives their experiments and to document and display the results. (FABRIC Notebook Example) As a result, there is a wealth of example Jupyter Notebooks (often in repositories) that experimenters can browse or search to find code to learn from or use directly. Users typically look for Notebooks that perform tasks identical or similar to their own, so that they can reuse the code in their own Notebooks. While the set of available Notebook examples often covers a wide variety of use cases, first-time users frequently have difficulty finding the examples relevant to their task. Even advanced users experience challenges. For example, setting up a complex testing environment often requires combining sections from numerous Notebook examples, a task that can be time-consuming, error-prone, and hard to troubleshoot. Another common problem is that Notebooks can become out of date, leaving unsuspecting users to rely on example code that is no longer valid.
Users today are increasingly turning to Generative AI (GenAI) tools to help them with coding tasks. Some GenAI tools are designed specifically for coding, and the newer and larger models have become noticeably more capable even in the last 12 months. However, they still tend to hallucinate on research-specific tasks because they were not designed or tested for use with research-specific infrastructure such as FABRIC. Moreover, even if a model was trained on FABRIC’s Jupyter Notebooks repository on GitHub [GitHub - fabric-testbed/jupyter-examples: Templates for Jupyter notebooks — github.com, n.d.], it has not “seen” the more recent versions of the Notebooks. Indeed, our tests demonstrate that even some of the best models available today do not perform well on their own when asked to generate Python scripts for relatively simple FABRIC experimental tasks.