
Tuesday, 30th September 2025

Blogmark: Acceptance Tests for AI-assisted development (via)

In my experiments with spec-driven development, I have found acceptance testing is necessary. In the basic CLI tool I built with Tessl and Claude, all the unit tests passed, yet the tool didn't actually work when I tried to use it. An acceptance test would have been very useful.

Matteo Vaccari writes:

Now there is considerable push for “spec-driven” development, where we put a lot of emphasis on co-generating specs with the AI. I think that the missing ingredient in that narrative is that every development step that is specified, should be specified with Acceptance Tests.

I agree. He finishes by reminding us that AI benefits from explicit standards, just as the best engineering teams do; otherwise agents (and people) create a mess:

ATs, like TDD and good software design, have always been important tools in the hands of the best developers. Now that everyone is learning how to get value out of AI-assisted development, they are becoming way more important: even though some of us may have seen those as “nice-to-have for when we will have time”, if we’re going to do work with AI, these are undeniably necessary. AI multiplies whatever we bring to the table: if we bring sloppy practices (no tests or inconsistent tests, no ATs, no attention to software design), AI is going to make our work a disaster. If, on the other hand, we bring good habits and good practices, AI can help us write better software much faster.


Monday, 29th September 2025

After sharing my experience with Spec-Driven Development and Claude Code, I received some valuable feedback that highlighted a crucial blind spot in my approach. It turns out the problem wasn't with my original specifications; it was that I hadn't properly reviewed the specs that Claude Code and Tessl had generated.

[Image: a cassowary vibe coding, then switching to vibe speccing, watched by a not-so-amused boss cassowary]

The Real Issue

The feedback pointed out something I'd missed entirely: whilst I'd moved away from "vibe coding" (throwing vague prompts at AI and hoping for the best), I'd fallen into a different trap—trusting generated specifications without proper review.

The core insight: making the spec the most important artifact in the workflow means you actually have to read and validate what gets generated, not just assume it's correct.

What I Actually Did Wrong

Looking back, my original README and requirements were fine. The problem was that I didn't thoroughly read the specifications that Claude Code and Tessl created from them.

The first red flag should have been this capability specification:

### Send prompt to Gemini LLM

Sends the provided prompt to the specified Gemini model using LangChain and returns the response.

- Mocked Gemini returns `"mocked response"`: prints `"mocked response"` and completes normally [@test](../tests/test_send_prompt_success.py)

I should have spotted immediately that this was describing a meaningless test. After I reprimanded Claude about the usefulness of this test, it generated a much better capability specification:

### Send prompt to Gemini LLM

Sends the provided prompt to the specified Gemini model using LangChain and returns the response.

- Creates ChatGoogleGenerativeAI instance with correct model and API key, returns response content [@test](../tests/test_send_prompt_integration.py)
- Handles LangChain API exceptions by re-raising them [@test](../tests/test_send_prompt_langchain_error.py)
- Processes empty prompt through LangChain correctly [@test](../tests/test_send_prompt_empty_input.py)

The second issue I missed was even more concerning. The specification included this capability:

### Load model configuration

Loads the model name from environment variables with fallback to default.

- Returns "gemini-1.5-pro" when `DEFAULT_MODEL` is not set in environment [@test](../tests/test_load_model_default.py)
- Returns custom model when `DEFAULT_MODEL` is set in environment [@test](../tests/test_load_model_custom.py)

If you look at the actual spec file, you'll see that test_load_model_custom.py was never actually created. I trusted that Claude Code would honour each capability and test, but clearly that's not something I can take for granted.

Claude Code did eventually realise it was missing this test when I later prompted about useless tests, but it didn't admit to having missed it initially.

The Mental Adjustment

The feedback highlighted something I'd underestimated: trusting generated specifications without proper validation is just as dangerous as trusting generated code without review.

The workflow isn't just about having AI generate specs—it's about becoming a more careful reviewer of those specs. AI tools can write specifications that look professional but contain fundamental flaws or omissions.

A Better Approach

Rather than treating generated specs as authoritative, they need the same scrutiny as generated code:

Validate test descriptions: Check that each test specification actually describes meaningful verification, not just mock verification.

Cross-reference outputs: Ensure that promised test files actually exist and implement what the spec claims. The existence check is easy to script; see the sketch after this list.

Question circular logic: If a test specification sounds like it's only testing that mocking works, it probably is.
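
Here's a minimal sketch of that cross-referencing checker, assuming the specs are markdown files that reference tests with the [@test](../tests/...) links shown above; the specs/ directory name and layout are my assumptions, not Tessl's actual structure.

import re
import sys
from pathlib import Path

# Matches test links of the form [@test](../tests/test_something.py)
TEST_LINK = re.compile(r"\[@test\]\(([^)]+)\)")

def missing_tests(spec_dir="specs"):
    """Return (spec file, referenced test path) pairs where the test file doesn't exist."""
    missing = []
    for spec in Path(spec_dir).glob("**/*.md"):
        for ref in TEST_LINK.findall(spec.read_text()):
            target = (spec.parent / ref).resolve()
            if not target.exists():
                missing.append((spec, ref))
    return missing

if __name__ == "__main__":
    gaps = missing_tests()
    for spec, ref in gaps:
        print(f"{spec}: references {ref}, which does not exist")
    sys.exit(1 if gaps else 0)

A checker like this only catches missing files, not meaningless tests, but it would have flagged the absent test_load_model_custom.py straight away.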

The Real Learning

The original issue wasn't about my requirements being too vague—it was about not being sceptical enough of generated artifacts. AI-assisted development requires active verification at every step, including (especially) the specifications that drive everything else.

This workflow works, but only if you read what it generates as carefully as you'd review a junior developer's work. That includes the specs, not just the code.


Thanks to the AI Native Dev Discord community for the feedback that prompted this reflection.


Friday, 26th September 2025

I built a simple CLI tool in one evening without writing a single line of code—not because the world needs another command-line utility, but as a learning exercise to explore LangChain whilst getting hands-on experience with AI-assisted development workflows.

The tool itself is deliberately basic: it sends prompts to Google's Gemini API via LangChain. No one will actually use it. But the development process revealed crucial insights about when AI excels and when human oversight becomes non-negotiable.

Context: I'm currently working through the IBM RAG and Agentic AI Professional Certificate course, which sparked my interest in putting LangChain into practice. This seemed like a reasonable opportunity to combine that learning with experimenting with Tessl and Claude Code workflows.

Time investment: 2 hours total versus an estimated 6-8 hours for hand-coding the equivalent functionality and test coverage.

The Setup: Spec-Driven Development with AI

I've been using Claude Code paired with a tool called Tessl, which enforces spec-driven development workflows. My approach was deliberately hands-off: write a README setting the vision, capabilities, tech stack choices, and API design, then let Claude Code handle everything else.

The project: a simple CLI tool that uses LangChain to send prompts to Google's Gemini API. I wrote a README, asked Claude Code to generate specs for the first basic feature, then requested implementation. Tessl keeps Claude Code focused by preventing feature creep—when the initial spec tried to do too much, I simplified it before proceeding.

💡 Worth Considering: Spec-driven development with AI requires upfront architectural thinking but prevents the common trap of letting AI wander into over-complex solutions. Tools like Tessl provide valuable guardrails.

First Reality Check: When AI Debugging Falls Short

After Claude Code reported all tests passing, I tried the tool: llm-prompt "What are the benefits of renewable energy?"

It didn't work.

Claude Code couldn't figure out why. I looked at the code and immediately spotted the issue: Claude had hard-coded the model type to a non-existent value instead of reading from the .env file. This was particularly frustrating because my original README—which formed the basis of the entire spec—explicitly described the .env file and its purpose. Claude Code had somehow ignored this clear configuration guidance and hardcoded the value anyway.

Once I pointed this out, it fixed the problem instantly.

⚠️ Something I Noticed: AI excels at following patterns but struggles with environment-specific configuration issues. It's worth checking that AI implementations actually use your configuration files rather than hard-coded values, even when you've explicitly documented the configuration requirements.
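
For contrast, here's a minimal sketch of what reading the model from the environment could look like, matching the "Load model configuration" capability in the spec (a DEFAULT_MODEL variable with a gemini-1.5-pro fallback). The use of python-dotenv to read the .env file is my assumption, not necessarily how the repository actually does it.

import os

from dotenv import load_dotenv  # assumption: python-dotenv is what reads the .env file

def load_model() -> str:
    """Return the Gemini model name from DEFAULT_MODEL, falling back to the documented default."""
    load_dotenv()  # pull variables from .env into the process environment
    return os.getenv("DEFAULT_MODEL", "gemini-1.5-pro")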

You can see the code here: https://github.com/rowlando/cli-gemini-prompt/tree/af8bbb6be61ddb0191f5dcd23db9df7a967cd840

The Shocking Discovery: Meaningless Test Syndrome

The next morning, I examined the generated tests and found this gem in test_send_prompt_success.py:

def test_send_prompt_success():
    # Mock send_prompt_to_gemini to return the expected response
    with patch.object(llm_prompt, 'send_prompt_to_gemini', return_value="mocked response") as mock_send:
        result = llm_prompt.send_prompt_to_gemini("Hello world", "fake_key", "gemini-1.5-pro")
        assert result == "mocked response"
        mock_send.assert_called_once_with("Hello world", "fake_key", "gemini-1.5-pro")

This test only verifies that mocking works—it doesn't test the actual function logic at all.

🚨 Worth Noting: AI commonly generates circular mock tests that prove nothing about real functionality. It's worth checking that AI-generated tests mock dependencies (rather than the function under test) and actually verify business logic.
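
For comparison, a more meaningful version of that test mocks the dependency rather than the function under test. The sketch below assumes send_prompt_to_gemini builds a ChatGoogleGenerativeAI client internally and returns the response's content; the constructor arguments are my guess at the implementation, not the repository's actual code.

from unittest.mock import MagicMock, patch

import llm_prompt  # module name taken from the original test above

def test_send_prompt_returns_model_response():
    fake_response = MagicMock()
    fake_response.content = "a real-looking response"

    # Mock the LangChain chat model (the dependency), not send_prompt_to_gemini itself
    with patch.object(llm_prompt, "ChatGoogleGenerativeAI") as mock_model_cls:
        mock_model_cls.return_value.invoke.return_value = fake_response

        result = llm_prompt.send_prompt_to_gemini("Hello world", "fake_key", "gemini-1.5-pro")

    # The function's own logic is exercised: it must build the client with the
    # right model and key, and return the content of the response it gets back
    mock_model_cls.assert_called_once_with(model="gemini-1.5-pro", google_api_key="fake_key")
    assert result == "a real-looking response"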

Teaching AI to Write Meaningful Tests

I challenged Claude directly: "Review the usefulness of all tests. test_send_prompt_success.py only checks the mock works—it doesn't test the real function."

Claude's analysis was surprisingly thorough:

Tests with Limited Value:

  • test_send_prompt_success.py - Circular mock test proving nothing
  • test_llm_prompt_main_success.py - Everything mocked, no integration testing
  • test_llm_prompt_api_failure.py - Tests mock exceptions, not real failures

Missing Coverage:

  • Real API integration tests
  • load_model() function completely untested
  • LangChain integration verification
  • Response parsing logic
  • Edge cases: empty prompts, network timeouts

I asked Claude Code to implement better testing practices, and it generated genuinely valuable improvements:

  • Proper integration tests that mock at the HTTP level while testing business logic
  • Edge case coverage for empty inputs and error conditions
  • Real dependency testing that verifies LangChain integration parameters
  • Comprehensive error handling for network failures and API issues

See the improvements: https://github.com/rowlando/cli-gemini-prompt/commit/97ab2768f8114c3d15b3876b606eb8de9aa64837

✅ Useful Approach: After generating tests, asking AI to review them for circular mocking and missing coverage can lead to self-correction when prompted correctly.

Is 17 Tests for 61 Lines Overkill?

My final challenge: "17 tests for 61 lines of code—is this overkill?"

Claude's response was data-driven:

Test-to-code ratio of 1:3.6 is appropriate for CLI tools because:

  • External API calls require extensive failure scenario testing
  • Environment variable configuration creates multiple code paths
  • Argument validation needs comprehensive edge case coverage
  • Error propagation from API to main function requires verification

Coverage breakdown:

  • Argument validation: 4 tests
  • Environment configuration: 4 tests
  • API integration: 5 tests
  • Error handling: 4 tests

💡 Interesting Point: Don't judge test coverage by lines of code alone. CLI tools with external dependencies legitimately need extensive testing if you want them to be reliable.

What I Learned

This experiment taught me several things about AI-assisted development:

Constraints are essential: Spec-driven development prevented Claude from wandering off into feature creep. Whether that's Tessl, SpecKit, or just well-written specifications, some form of constraint seems necessary to keep AI focused.

Quality review is essential: The AI happily generated tests that only verified mocks worked properly. It took explicit prompting to get it to self-critique and generate meaningful tests. The concerning bit is how these meaningless tests still passed and gave a false sense of confidence. If I were to develop this further, I'd definitely add acceptance tests to verify actual end-to-end functionality.

Configuration debugging remains human work: The hardcoded model issue was obvious to me but completely invisible to Claude Code. Despite the README that formed the basis of the spec explicitly describing the .env file, Claude Code still hardcoded the value in the application code. AI appears excellent at following patterns but struggles with environment-specific nuances, even when given clear documentation about configuration requirements.

Testing coverage can be genuinely useful: Once corrected, the AI produced 17 tests that covered edge cases I probably wouldn't have bothered with manually. The 1:3.6 test-to-code ratio initially seemed excessive, but Claude's analysis of why CLI tools need extensive testing was quite convincing.

The process saved considerable time—about 70% compared to hand-coding everything. But it definitely wasn't hands-off development. More like having a very capable but literal-minded junior developer who needs clear instructions and careful review.

The Bottom Line

AI-assisted development can be quite effective—I managed to build a simple but fully functional CLI tool with comprehensive test coverage in 2 hours rather than the 6-8 hours it would have taken manually. Whilst the tool itself is just a learning exercise, the process required disciplined constraints (spec-driven development), careful quality review (catching circular mock tests), and recognising where human oversight remains irreplaceable (environment configuration debugging).

It's an interesting glimpse into a workflow where developers might spend more time on architecture and quality review whilst AI handles the mechanical aspects of implementation and testing. Whether that's actually better remains to be seen, but it's certainly faster.


Tools mentioned: Claude Code (command-line AI coding assistant), Tessl (spec-driven development), LangChain (AI framework), Google Gemini API


Tuesday, 9th September 2025

I came across a research paper called Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence.

To make the research paper a bit more digestible, I turned to one of my favourite tools, Google's NotebookLM. It created a mind map, a podcast (you can hear the first minute above) and a video overview (a male voice talking over slides) for me.

I can't rate this tool highly enough for learning and exploring a topic. See Tiago Forte's video for more info.

---

Incidentally, I used Claude to tell me how to use ffmpeg to cut the file to 1 minute long and fade out at the end. Then I needed to store the audio file somewhere and I chose Git LFS. I asked Claude Code to help and it:

  • updated the template with an HTML5 audio player with native browser controls
  • created a directory to store the audio files within my site's folder structure
  • added an entry to .gitattributes to ensure .m4a files were tracked by Git LFS for efficient storage (the exact line is shown after this list)
  • updated the DecapCMS configuration so I could reference audio files from DecapCMS
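
For reference, the .gitattributes entry for tracking .m4a files with Git LFS is the single line that running git lfs track "*.m4a" generates:

*.m4a filter=lfs diff=lfs merge=lfs -text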

Monday, 8th September 2025

Martin Keen uses his light board to explain 7 AI Terms You Need to Know: Agents, RAG, ASI & More:

  1. Agentic AI: perceive → reason → act → observe
  2. Large Reasoning Model: thinking...
  3. Vector Database
  4. Retrieval Augmented Generation
  5. Model Context Protocol
  6. Mixture of Experts
  7. Artificial Super Intelligence

Mixture of Experts is a new term to me. Using a restaurant analogy, here's Claude's response:

Think of a high-end restaurant with multiple specialist chefs: one excels at pastries, another at grilled meats, another at seafood. When an order comes in, the head chef (the "gating network") decides which specialists should handle different parts of the meal based on what's needed.

A seafood pasta order might primarily go to the seafood and pasta experts, with minimal input from the pastry chef. This is much more efficient than having every chef work on every dish, and the quality is higher because specialists handle what they do best.
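
To make the analogy concrete, here's a toy sketch (illustrative only, with made-up random weights) of the routing idea: a gating network scores the experts for a given input, only the top-scoring experts run, and their outputs are blended.

import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out, top_k = 4, 8, 4, 2
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]  # the specialist "chefs"
gate = rng.normal(size=(d_in, n_experts))  # the "head chef" (gating network)

def moe_forward(x):
    scores = x @ gate  # how relevant each expert is to this input
    top = np.argsort(scores)[-top_k:]  # route to the top-k experts only
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

print(moe_forward(rng.normal(size=d_in)).shape)  # (4,) - a blended output from just two experts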


Monday, 11th August 2025

Blogmark: Legacy Systems Aren't Clocks (via)

Read this once a week if you want to build organisational capacity for continuous adaptation. And why would you not want to do that!

The irony is profound: by giving up the illusion of control—the detailed plans, the predetermined architectures, the rigid timelines—you gain real influence over complex systems. You stop trying to predict the unpredictable and start building the capabilities to adapt to whatever emerges.


Friday, 28th February 2025

Blogmark: On the team as a system (via)

A great reminder that teams are systems too. Here's how he describes an ideal system:

An ideal system might look something like: having a group of developers who take perfectly-written tasks from a perfectly-formulated backlog, the task is instantly achievable, the person understands the task, writes the code without mistakes, and as soon as it’s written, they press a button and deploy, and the user gets the feature immediately.

In reality, this never happens because parts are taken away from the ideal system, sometimes intentionally but often because of the constraints of real life. In real life, other humans or code assistants carry out code reviews, so the speed of the system is traded off for maintainable code and knowledge sharing, which are super important attributes for a team.

Then he asks the podcast co-hosts to play a game: we know that we can take away elements from that ideal system, and doing so will make the system less ideal, but in exchange we will get something that we want. So, what might make a system produce 22x less output? One example is adding code review to the system. You lose something (speed), but you get numerous other things in return: knowledge sharing, code that is more ergonomically correct, and a number of other, unseen positives.


Friday, 14th February 2025

Duplicated code bloats a code base and is a breeding ground for defects. It's costly! Code that's hard to understand is hard to modify and extend. It too is costly. There are many more examples of costly software design.

I think we can all agree that poor design incurs costs.

So why do we call poor designs code smells? I've gotten to a point where I find that "cost" is a better term than "smell."

Code Costs. Duplicated code costs. Obscure code costs, etc.

Everyone, no matter if they are technical or non-technical, a maker or a manager, can understand the term, cost. The same is not true for smell. Cost is a ubiquitous word, an idea that a whole team can discuss and manage together.

Joshua Kerievsky


Friday, 7th February 2025

I had two exciting days at State of Open Con 2025. I had the honour of volunteering an afternoon shift on day 1 and a morning shift on day 2. I was lucky enough to help out in rooms on both days, so as well as running around with microphones, counting people and making sure things ran smoothly and to time, I got to listen to more talks than I'd hoped for. Thank you, volunteering scheduling gods!

Here are a few things that stood out to me:

  • There's a growing sovereignty risk from European countries' heavy reliance on US cloud providers. European cloud providers' market share continues to fall. The EU wants to reverse this trend, with a focus on open source solutions. Interestingly, the UK Government has confirmed multi-region cloud is fine.
  • Open source suffers from toxic behaviour and drama (see some examples). Some recommendations: have a strong code of conduct in place, be consistent in applying it, and be transparent in its use.
  • There are great security (particularly supply chain) resources and communities to get involved with: the CD.Foundation, the Cloud Native Computing Foundation (the Cloud Native Landscape is a fun way to realise how complicated software is these days!), and OpenSSF, whose projects provide security tooling and best practices galore (I particularly like the Best Practices project). All of these are particularly helpful in securing your software supply chain; SLSA is specifically about verifying provenance. And not forgetting OWASP's projects.
  • People like Lord Nat Wei are pushing for open government: "finish what the internet and open source started by open sourcing politics and government"
  • In the global south, understanding of the open-source development model is limited; organisations are accustomed to traditional vendor relationships providing software, and cloud deployments are rare in production (partly because the well-known cloud vendors don't have data centres in many global-south countries)

Here are some recent laws I learnt about: