7. Discussion

Large language models (LLMs) for text and code generation are evolving at an astonishing pace. New models are becoming more efficient and gaining the ability to generate more coherent code with fewer mistakes. Only about a year ago, when we first started exploring the possibility of using LLMs for FABRIC code generation, most models failed to produce anything resembling the FABRIC examples, even for the simplest queries. Most newer models, in contrast, have evidently been trained on the FABRIC Jupyter Notebook repository on GitHub – even without RAG, they are able to generate almost correct code as long as the query is very simple, as in Question #1. If generative AI continues to advance at this pace, it is therefore conceivable that most LLMs will reach a point where no RAG is needed even for more complex use cases.

Evaluating the output of generative AI is difficult, especially for code generation. Despite the large research effort that has been poured into the evaluation problem, there are currently no generally accepted guidelines for evaluating the quality of source code generated by LLMs [Yeo et al., 2024]. For tool-specific APIs, as in the case of FABRIC, the only option at present is to run tests manually, since there are no automated tools, labeled data, or established rubrics for evaluation. The non-deterministic nature of generative AI also poses a challenge: even the same query, using the same RAG architecture, vector store, and LLM, can produce different results each time. While setting the LLM temperature to 0 (zero) tends to reduce this variance, we often observe slightly different results at different times.
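
To illustrate the kind of variance check described above, the following minimal sketch repeats the same query with the temperature fixed at 0 and compares the outputs across runs. It assumes an OpenAI-style chat completions client; the model name and the query string are placeholders rather than the exact configuration used in the experiments reported in this paper.

```python
import openai  # assumed client library; the actual RAG pipeline is not shown here

client = openai.OpenAI()

# Placeholder query; any FABRIC code-generation prompt could be substituted.
QUERY = "Write FABRIC code that creates a slice with two nodes on an L2 network."

def generate(query: str) -> str:
    """Generate a single response with temperature fixed at 0."""
    response = client.chat.completions.create(
        model="gpt-4o",      # placeholder model name
        temperature=0,       # reduces, but does not eliminate, run-to-run variance
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

# Run the identical query several times and check whether the outputs match exactly.
outputs = [generate(QUERY) for _ in range(3)]
print("Identical across runs:", len(set(outputs)) == 1)
```

Even with this setting, the comparison above will occasionally report non-identical outputs, which is why our evaluation relies on manually running the generated code rather than on exact-match comparisons.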