In a previous post we looked at the code analysis capabilities of Large Language Models (LLMs). But verification and validation of code is more than static code analysis; the software industry relies on many other techniques as well. In this post, I will focus on testing the dynamic behavior of code.
Dynamic testing means that code is executed with different inputs to verify the correct behavior by checking the internal states and/or outputs.
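As a minimal illustration, consider the sketch below; the add function is just a made-up placeholder, not the code examined later. A dynamic test actually calls the code with chosen inputs and checks what comes back, whereas static analysis only reads it.

```c
#include <assert.h>

/* Placeholder function, used only to illustrate the idea of dynamic testing. */
static int add(int a, int b)
{
    return a + b;
}

int main(void)
{
    /* Dynamic testing: execute the code with concrete inputs and
       compare the observed outputs against the expected ones. */
    assert(add(2, 3) == 5);
    assert(add(-1, 1) == 0);
    return 0;
}
```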
According to testing theory, behavior during normal operation is usually error free; bugs mostly hide around edge cases. These often lie in the most complex parts of the code and require tricky inputs to trigger, so they are occasionally missed by developers, testers, and reviewers. The consequences are well known: significantly increased correction effort and cost, because the later in the lifecycle a bug is found, the more expensive it is to fix. And while customers are great testers, they usually don't like finding bugs.
LLMs can help with dynamic testing. You will see that they not only find errors, they can also generate tests, fixes, and verification tests. Wow!
Let's get started.
Our target is a relatively simple standalone C function, for which we'll generate a whole test environment with test cases. For the sake of clarity, we'll keep things simple. At the end, I will show you a couple of ideas you could explore further.
I used the Chatbot UI framework to access OpenAI's GPT-4 model. It's mostly like ChatGPT, but has a couple of handy features, such as saved prompts and search.
Chatbot UI enables the creation of prompts with variables. Once a prompt is phrased, we can save it and reuse it later. When reusing a saved prompt, all we need to do is pass the variables and off we go. Chatbot UI is open source; check it out on GitHub.
Here is what we want GPT-4 to do:
I created the following prompt to inspect and test any C function. The whole prompt is available in the Appendix.
The prompt variable is the C function itself; we provide it when using the saved prompt.
The code under test is the following.
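The name calculate and the 16-bit datatypes come from the discussion below; the exact signature and body shown here are assumptions, a plausible stand-in for a function of this kind.

```c
#include <stdint.h>

/* Stand-in for the code under test: sums an array of 16-bit values
   using 16-bit arithmetic throughout. */
int16_t calculate(const int16_t *values, int16_t count)
{
    int16_t sum = 0;
    for (int16_t i = 0; i < count; i++) {
        sum += values[i];   /* the running sum can exceed the 16-bit range */
    }
    return sum;
}
```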
After analyzing the function, GPT-4 listed its findings (step #1). It found several issues and provided corrections. Here is the most interesting one: it changed the datatype of the iterator i and the return value of calculate from 16 bit to 32 bit to avoid the possibility of overflow. The corrected source code (step #2):
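Applied to the stand-in above, the described fix would look roughly like this; the exact listing GPT-4 produced may differ.

```c
#include <stdint.h>

/* Corrected stand-in: the iterator and the return value are widened
   to 32 bits, so the running sum can no longer overflow. */
int32_t calculate(const int16_t *values, int16_t count)
{
    int32_t sum = 0;
    for (int32_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return sum;
}
```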
Let's see how it did with generating test cases.
GPT-4 provided the following test environment and gave detailed information on how to run it.
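A sketch of what such a harness can look like, built around the stand-in functions above: both implementations side by side, a table of test vectors, and simple pass/fail logging. The file name, the concrete vectors, and the build command are assumptions, not GPT-4's literal output.

```c
/* test_calculate.c (file name assumed)
   Build and run, e.g.: gcc -Wall -o test_calculate test_calculate.c && ./test_calculate */
#include <stdio.h>
#include <stdint.h>

/* The original implementation, pasted in and renamed. */
static int16_t calculate_original(const int16_t *values, int16_t count)
{
    int16_t sum = 0;
    for (int16_t i = 0; i < count; i++)
        sum += values[i];
    return sum;
}

/* The corrected implementation, pasted in and renamed. */
static int32_t calculate_corrected(const int16_t *values, int16_t count)
{
    int32_t sum = 0;
    for (int32_t i = 0; i < count; i++)
        sum += values[i];
    return sum;
}

/* One test vector: input array, element count, expected sum. */
typedef struct {
    int16_t values[4];
    int16_t count;
    int32_t expected;
} test_case_t;

static const test_case_t tests[] = {
    { { 1, 2, 3, 4 },         4, 10    },  /* normal operation         */
    { { 0, 0, 0, 0 },         4, 0     },  /* all zeros                */
    { { -5, 5, -5, 5 },       4, 0     },  /* mixed signs              */
    { { 30000, 30000, 0, 0 }, 2, 60000 },  /* exceeds the 16-bit range */
};

int main(void)
{
    size_t total = sizeof tests / sizeof tests[0];
    int failures = 0;

    for (size_t i = 0; i < total; i++) {
        int32_t got_orig = calculate_original(tests[i].values, tests[i].count);
        int32_t got_corr = calculate_corrected(tests[i].values, tests[i].count);

        printf("Test case #%zu: original=%d, corrected=%d, expected=%d -> %s\n",
               i + 1, (int)got_orig, (int)got_corr, (int)tests[i].expected,
               got_corr == tests[i].expected ? "PASS" : "FAIL");

        if (got_corr != tests[i].expected)
            failures++;
    }

    printf("%d of %zu test case(s) failed.\n", failures, total);
    return failures;
}
```

The last vector is the interesting one: the sum no longer fits into the 16-bit result of the original version, while the corrected version returns the true value.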
This isn't too complicated, but it doesn't need to be. The two functions are clearly visible, as well as the test cases. We also get some nice logging of the results. The instructions on how to use it were the following.
Almost off the shelf, and the instructions were easy to follow: copy the original and corrected code into the file, compile, and run it. The only adjustments were customizing the names and the fact that I used Windows instead of Linux (although, had I mentioned that, I probably would have gotten Windows-specific instructions 🙂). The test environment itself worked immediately, with zero-shot prompting.
Test cases were also provided for the edge cases that fail with the original code and pass with the corrected code. It is worth noting that this needed a bit more guidance: an additional question was required, after which GPT-4 corrected itself and the newly received input vector could be added to the list. To be fair, with a bit of prompt engineering we could probably make this work out of the box. And with prompt chaining, it would be very simple to generalize this.
Running the test environment, I got the following.
Boom, exactly what I wanted! These tests verify the proper dynamic behavior during normal operation and reveal the vulnerability of the original code by demonstrating the overflow (test cases #8 and #9). The corrections also work as expected.
Impressive, isn't it?
Generative AI is very powerful. We started out with a single implemented C function, and voilà, with an okay prompt we got a test environment, bug findings, fixes, and test cases. Notice the "okay" in the previous sentence: we could have spent some time fine-tuning the prompt to get even better results.
The examined code was very basic. Even with its simplicity, it demonstrated the power of LLMs and their semantic understanding. Naturally, there are many paths you can take from here. Here are a few examples.
On the other hand, we should never forget to verify the answers of LLMs. They may contain mistakes and incomplete parts, just as we saw in the edge-case generation above. LLMs are great tools, but for now they need supervision for certain tasks. That need will probably go away, but no one knows when. And remember, a year ago most AI experts thought what is possible today was still 5-10 years away.
The following prompt template was given to GPT-4.
Take the first step toward harnessing the power of AI for your organization. Get in touch with our experts, and let's embark on a transformative journey together.
Contact Us today