thoughts and learnings in software engineering by Rotem Tamir

Harnessing LLMs with TDD

Reflections on a weekend project with an AI co-programmer

Almost four years into Ariga, I don’t get to write code as much as I’d like, but occasionally I encounter some interesting problem that seems small enough to tackle in a short amount of time and I dive in.

This was the case with an issue that came up from some of our lovely customers for Atlas. One of the great things about Atlas is its ability to inspect a database and automatically generate a schema file that can be used to manage the database schema in a declarative way. The issue many users were facing was that for many use cases, databases can contain thousands of objects, and the schema file was becoming too large to manage effectively.

Atlas uses HCL as the default data definition language (DDL) for defining schemas, so I knew it should be pretty straightforward to write a tool that would parse the schema file and split it into multiple files based on some criteria.
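
For a sense of what that looks like, here is a minimal sketch in Go, using the hashicorp/hcl hclwrite package, of parsing an HCL file and writing each top-level block out to a file grouped by block type. The grouping policy and file naming here are my own simplification for illustration, not how splt actually decides to split things.

```go
// Minimal sketch (not the actual splt implementation): split an HCL schema
// file into one output file per top-level block type.
package main

import (
	"fmt"
	"os"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclwrite"
)

func main() {
	src, err := os.ReadFile("schema.hcl")
	if err != nil {
		panic(err)
	}
	f, diags := hclwrite.ParseConfig(src, "schema.hcl", hcl.InitialPos)
	if diags.HasErrors() {
		panic(diags.Error())
	}
	// Group top-level blocks (schema, table, view, ...) by their block type.
	groups := map[string]*hclwrite.File{}
	for _, blk := range f.Body().Blocks() {
		out, ok := groups[blk.Type()]
		if !ok {
			out = hclwrite.NewEmptyFile()
			groups[blk.Type()] = out
		}
		// Appending the existing block keeps its nested blocks and attributes intact.
		out.Body().AppendBlock(blk)
		out.Body().AppendNewline()
	}
	for typ, out := range groups {
		if err := os.WriteFile(fmt.Sprintf("%s.hcl", typ), out.Bytes(), 0o644); err != nil {
			panic(err)
		}
	}
}
```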

Using LLMs to write code?

Putting the buzzwords and loads of BS aside, even to a contrarian cynic like me, recent advancements in AI in general, and specifically LLMs (Large Language Models), are truly amazing.

As an individual contributor, I am constantly using tools like GitHub Copilot or OpenAI ChatGPT to help me write code snippets, improve documentation, and come up with ideas for copy.

However, as a CTO, I am still very skeptical about their maturity and constantly question their value in creating production-grade code.

Despite my best efforts in prompting the AI to generate code that is readable, maintainable, and efficient, I often find that the generated code is not up to the standards I would expect from my team, and I end up rewriting it.

My Strategy: Test-driven Brute-force-based LLM-assisted Development (T.B.L.D.)

For this project, I decided to try a different approach to using LLMs to write code. It would work like this:

  • Write tests to define the expected behavior of the application.
  • Use an LLM to generate code to pass the tests.
  • Feed any compilation error or test failure back into the LLM to generate new code.
  • Repeat until all tests pass.
  • Rejoice!

The idea here is to take the old TDD approach of defining the desired behavior of the application with tests and then using the LLM to generate code that passes the tests.
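
To make the “tests first” part concrete, a behavior-defining test might look roughly like the one below. The Split function signature and the expected output file name are hypothetical, purely to show the shape of such a test; the actual tests live in the repository.

```go
// A hypothetical, behavior-first test. The Split signature and the output
// layout (one file per object, named <type>_<name>.hcl) are assumptions for
// illustration, not the actual tests in rotemtam/splt.
package splt

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
)

func TestSplitWritesOneFilePerObject(t *testing.T) {
	dir := t.TempDir()
	input := []byte(`
schema "public" {}

table "users" {
  schema = schema.public
  column "id" {
    type = int
  }
}
`)
	// Split is the (hypothetical) function under test: it takes HCL source
	// and writes one file per object into dir.
	if err := Split(input, dir); err != nil {
		t.Fatalf("split: %v", err)
	}
	got, err := os.ReadFile(filepath.Join(dir, "table_users.hcl"))
	if err != nil {
		t.Fatalf("expected an output file for the users table: %v", err)
	}
	// The split file should preserve the nested column block verbatim.
	if !strings.Contains(string(got), `column "id"`) {
		t.Errorf("nested block missing from output:\n%s", got)
	}
}
```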

Recognizing that the LLM-generated code is still far from perfect, I would feed the test and compilation errors back into the LLM to generate new code, iterating until the code compiles and passes all tests.

I dub this approach Test-driven Brute-force-based LLM-assisted Development (T.B.L.D.).

The Results

While the problem at hand was not super-complicated, it was not trivial either. HCL has great support in Go, but as an infra library used mostly by Terraform-internals devs and other HashiCorp expert-level users, it’s not the most straightforward thing to work with.

Figuring out exactly how to split the schema into files while preserving the original schema’s structure has some subtleties that would take some research and experimentation to get right.

All of this to say that it’s pretty impressive that I was able to get a working solution fairly quickly while chatting to Claude 3.5, my pair-AI programmer!

You can find the final code in the rotemtam/splt repository on GitHub.

Our session would go something like:


Me: Hey Claude, here’s a test that I’m running:

# The test case

The test is failing, here’s the output:

# the full test output

Claude: “I understand the issue. Let me explain the problem: The current implementation is not correctly preserving the structure and formatting of the original HCL file when writing the split files. Specifically, there are two main issues:

  1. Nested blocks are not being properly reconstructed. In the original file, there’s a nested block extra inside the new_block, but it’s not being written to the output file.
  2. The original formatting, including newlines and indentation, is not being preserved. This is why there’s an extra newline after the schema block in the original file that’s missing in the output.

The root cause of these issues is in the writeFile function. Currently, it’s trying to manually reconstruct the HCL structure, but it’s not doing so completely or accurately. It’s only handling top-level attributes and one level of nested blocks, but not deeper nesting or formatting details.”

Me: “Thanks for the explanation. Can you suggest a fix?”

Claude: Here’s some code:

# The code snippet

Me: There’s an error. “Unresolved reference SrcTokens”

Claude: “I apologize for the confusion. You’re right to point that out. It seems I made an incorrect assumption about the availability of SrcTokens. Let’s correct this and use a different approach that should work with the available API.

Here’s a revised version:

# The revised code snippet

And so on, until we had a working version.

Caveats and takeaways

With a structured feedback loop, LLMs can generate working code

Anyone who has used LLMs in the past months to generate code knows that lots of it is garbage. However, as many in the industry are currently discovering, if we combine LLMs with “agentic” workflows that provide “real world” feedback (e.g., compilation errors, test failures) and iterate a few times, we can get to a working solution.
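
In my case the “agentic” loop was just me pasting compiler and test output into a chat window, but the mechanical part of it is easy to sketch. Here is a rough outline in Go; the askLLM function is a hypothetical stand-in for whatever model API (or chat window) you use, and the iteration cap is arbitrary.

```go
// Rough sketch of the T.B.L.D. loop. I ran this by hand; askLLM is a
// hypothetical stand-in for a model call, not a real API.
package main

import (
	"fmt"
	"os/exec"
)

// askLLM sends the current failure output to a model and is expected to
// return revised source code. Its implementation is out of scope here.
func askLLM(feedback string) (newSource string, err error) {
	panic("wire this to your model of choice")
}

func main() {
	for i := 0; i < 10; i++ {
		// "Real world" feedback: the compiler and the test runner.
		out, err := exec.Command("go", "test", "./...").CombinedOutput()
		if err == nil {
			fmt.Println("all tests pass - rejoice!")
			return
		}
		// Feed the compilation error or test failure back into the LLM,
		// apply whatever code it proposes, and try again.
		src, err := askLLM(string(out))
		if err != nil {
			panic(err)
		}
		_ = src // write src to the relevant file(s), then loop
	}
	fmt.Println("giving up; a human needs to look at this")
}
```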

Statically typed languages have a strong advantage

Whenever I can, I use Go for my projects, including for the entirety of Ariga’s codebase and, of course, for this project.

What I found is that the very thing that makes Go lovely to work with for humans - its strong typing - also makes it an ideal target language for LLMs. The compilation failures from Go’s type system provide a very clear signal to the LLM about API hallucinations and other issues.

Code quality is still a human thing

Claude was able to help me generate code that passed the tests, with some intervention from me to guide the process. Some edge cases (like handling deterministic ordering of attributes in nested blocks) required me to roll up my sleeves and write a few blocks of code myself, but overall, I was impressed with the model’s ability to generate code that was close to what I needed.
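
To give one concrete example of that kind of edge case: hclwrite exposes a block’s attributes as a Go map, so iteration order is random and the output can differ between runs unless you sort the names yourself. The helper below is an illustrative sketch of that idea, not the code from the repository.

```go
// Illustrative sketch (not the repository code): write a block's attributes
// in a deterministic order. hclwrite's Attributes() returns a map, so without
// sorting, the output order would change between runs.
package splt

import (
	"sort"

	"github.com/hashicorp/hcl/v2/hclwrite"
)

// copyAttrsSorted copies all attributes from src to dst in lexical order.
func copyAttrsSorted(dst, src *hclwrite.Body) {
	attrs := src.Attributes()
	names := make([]string, 0, len(attrs))
	for name := range attrs {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		// Reuse the original expression tokens so the value is copied verbatim.
		dst.SetAttributeRaw(name, attrs[name].Expr().BuildTokens(nil))
	}
}
```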

Having said that, good code is about much more than just passing tests. It’s about readability, maintainability, and performance. While the code generated by Claude was functional, it was pretty ugly and not something I would want to put my name on. I ended up spending a good amount of time refactoring the code to make it more readable and maintainable.

It’s all about the feedback loop

My main takeaway from this project is that one of my key contributions was in defining the tests and constructing the feedback loop that allowed me to iterate on the code generation process.

I know that some very cool companies in our industry are building solutions to generate full test suites using LLMs. I am far from an expert on the subject, and I’m sure there is a lot of nuance to the topic, but I am left with the feeling that perhaps defining the verification process should be left to us humans.

Conclusion

RoboCode. Generated by DALLE

I had a great time working on this project, and I was very thrilled to see that with a structured feedback loop, I was able to get a working solution fairly quickly. There are so many important problems in the world that can be solved with software, and if we figure out how to leverage LLMs to help us write it faster, we can make a real impact.

On the other hand, after the initial excitement of seeing tests pass and the code compile, I still spent an additional couple of days refactoring the code to make it something I would be willing to put my name on.

Personally, I don’t see serious software engineering being taken over by the robots any time soon, but as I told my co-founder, Ariel, after I showed him this project, it sure does make me feel a little bit like RoboCop.

(or should I say RoboCode?)
