I built a simple CLI tool in one evening without writing a single line of code—not because the world needs another command-line utility, but as a learning exercise to explore LangChain whilst getting hands-on experience with AI-assisted development workflows.
The tool itself is deliberately basic: it sends prompts to Google's Gemini API via LangChain. No one will actually use it. But the development process revealed crucial insights about when AI excels and when human oversight becomes non-negotiable.
Context: I'm currently working through the IBM RAG and Agentic AI Professional Certificate course, which sparked my interest in putting LangChain into practice. This seemed like a reasonable opportunity to combine that learning with experimenting with Tessl and Claude Code workflows.
Time investment: 2 hours total versus an estimated 6-8 hours for hand-coding the equivalent functionality and test coverage.
The Setup: Spec-Driven Development with AI
I've been using Claude Code paired with a tool called Tessl, which enforces spec-driven development workflows. My approach was deliberately hands-off: write a README setting the vision, capabilities, tech stack choices, and API design, then let Claude Code handle everything else.
The project: a simple CLI tool that uses LangChain to send prompts to Google's Gemini API. I wrote a README, asked Claude Code to generate specs for the first basic feature, then requested implementation. Tessl keeps Claude Code focused by preventing feature creep: when the initial spec tried to do too much, I simplified it before proceeding.
💡 Worth Considering: Spec-driven development with AI requires upfront architectural thinking but prevents the common trap of letting AI wander into over-complex solutions. Tools like Tessl provide valuable guardrails.
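For context, the core of the tool boils down to a single LangChain call. Here's a minimal sketch of that shape, not the repository's actual code: the function name and signature mirror the generated tests discussed later, but the internals are my reconstruction and assume the langchain-google-genai package.

```python
# Minimal sketch of the tool's core. The function name/signature mirror the
# generated tests; the internals are illustrative, not the repo's actual code.
from langchain_google_genai import ChatGoogleGenerativeAI


def send_prompt_to_gemini(prompt: str, api_key: str, model: str) -> str:
    """Send a single prompt to Gemini via LangChain and return the text reply."""
    llm = ChatGoogleGenerativeAI(model=model, google_api_key=api_key)
    response = llm.invoke(prompt)  # returns an AIMessage
    return response.content
```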
First Reality Check: When AI Debugging Falls Short
After Claude Code reported all tests passing, I tried the tool: llm-prompt "What are the benefits of renewable energy?"
It didn't work.
Claude Code couldn't figure out why. I looked at the code and immediately spotted the issue: Claude had hard-coded the model type to a non-existent value instead of reading it from the .env file. This was particularly frustrating because my original README, which formed the basis of the entire spec, explicitly described the .env file and its purpose. Claude Code had somehow ignored this clear configuration guidance and hardcoded the value anyway.
Once I pointed this out, it fixed the problem instantly.
⚠️ Something I Noticed: AI excels at following patterns but struggles with environment-specific configuration issues. It's worth checking that AI implementations actually use your configuration files rather than hard-coded values, even when you've explicitly documented the configuration requirements.
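The fix itself is mundane: read the model name from the environment instead of baking it in. Here's a sketch of the intended behaviour, reusing the send_prompt_to_gemini sketch above and assuming python-dotenv plus illustrative variable names (the repo's actual .env keys may differ):

```python
# Sketch of .env-driven configuration -- the behaviour the README asked for.
# Variable names here are illustrative assumptions, not the repo's actual ones.
import os
from dotenv import load_dotenv

load_dotenv()  # pulls GOOGLE_API_KEY / GEMINI_MODEL from a local .env file

api_key = os.environ["GOOGLE_API_KEY"]
model = os.getenv("GEMINI_MODEL", "gemini-1.5-pro")  # configurable, with a sensible default

print(send_prompt_to_gemini("What are the benefits of renewable energy?", api_key, model))
```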
You can see the code here: https://github.com/rowlando/cli-gemini-prompt/tree/af8bbb6be61ddb0191f5dcd23db9df7a967cd840
The Shocking Discovery: Meaningless Test Syndrome
The next morning, I examined the generated tests and found this gem in test_send_prompt_success.py:
```python
def test_send_prompt_success():
    # Mock send_prompt_to_gemini to return the expected response
    with patch.object(llm_prompt, 'send_prompt_to_gemini', return_value="mocked response") as mock_send:
        result = llm_prompt.send_prompt_to_gemini("Hello world", "fake_key", "gemini-1.5-pro")
        assert result == "mocked response"
        mock_send.assert_called_once_with("Hello world", "fake_key", "gemini-1.5-pro")
```
This test only verifies that mocking works—it doesn't test the actual function logic at all.
🚨 Worth Noting: AI commonly generates circular mock tests that prove nothing about real functionality. It's worth checking that AI-generated tests mock dependencies (rather than the function under test) and actually verify business logic.
Teaching AI to Write Meaningful Tests
I challenged Claude directly: "Review the usefulness of all tests. test_send_prompt_success.py only checks the mock works—it doesn't test the real function."
Claude's analysis was surprisingly thorough:
Tests with Limited Value:
- test_send_prompt_success.py - Circular mock test proving nothing
- test_llm_prompt_main_success.py - Everything mocked, no integration testing
- test_llm_prompt_api_failure.py - Tests mock exceptions, not real failures
Missing Coverage:
- Real API integration tests
- load_model() function completely untested
- LangChain integration verification
- Response parsing logic
- Edge cases: empty prompts, network timeouts
I asked Claude Code to implement better testing practices, and it generated genuinely valuable improvements:
- Proper integration tests that mock at the HTTP level while testing business logic
- Edge case coverage for empty inputs and error conditions
- Real dependency testing that verifies LangChain integration parameters
- Comprehensive error handling for network failures and API issues
See the improvements: https://github.com/rowlando/cli-gemini-prompt/commit/97ab2768f8114c3d15b3876b606eb8de9aa64837
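To make the contrast with the circular test concrete, here's a hedged sketch of a non-circular version: the LangChain client is mocked, but the real send_prompt_to_gemini logic still runs. It assumes llm_prompt imports ChatGoogleGenerativeAI at module level, which may not match the repo exactly.

```python
# Sketch of a non-circular test: mock the dependency, exercise the real function.
# Assumes llm_prompt does `from langchain_google_genai import ChatGoogleGenerativeAI`.
from unittest.mock import MagicMock, patch

import llm_prompt


def test_send_prompt_success_uses_real_logic():
    fake_llm = MagicMock()
    fake_llm.invoke.return_value = MagicMock(content="mocked response")

    with patch.object(llm_prompt, "ChatGoogleGenerativeAI", return_value=fake_llm) as mock_cls:
        result = llm_prompt.send_prompt_to_gemini("Hello world", "fake_key", "gemini-1.5-pro")

    # The real function should build the client from our arguments...
    mock_cls.assert_called_once_with(model="gemini-1.5-pro", google_api_key="fake_key")
    # ...pass the prompt through to LangChain...
    fake_llm.invoke.assert_called_once_with("Hello world")
    # ...and return the parsed response content.
    assert result == "mocked response"
```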
✅ Useful Approach: After generating tests, asking AI to review them for circular mocking and missing coverage can lead to self-correction when prompted correctly.
Is 17 Tests for 61 Lines Overkill?
My final challenge: "17 tests for 61 lines of code—is this overkill?"
Claude's response was data-driven:
Test-to-code ratio of 1:3.6 is appropriate for CLI tools because:
- External API calls require extensive failure scenario testing
- Environment variable configuration creates multiple code paths
- Argument validation needs comprehensive edge case coverage
- Error propagation from API to main function requires verification
Coverage breakdown:
- Argument validation: 4 tests
- Environment configuration: 4 tests
- API integration: 5 tests
- Error handling: 4 tests
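As an illustration of the error-handling bucket, one of those failure-path tests might look roughly like this; the exception type and module layout are assumptions on my part, not the repo's actual code:

```python
# Sketch of an error-propagation test: the mocked client raises, and we check that
# send_prompt_to_gemini surfaces the failure instead of swallowing it.
import pytest
from unittest.mock import MagicMock, patch

import llm_prompt


def test_api_failure_propagates():
    fake_llm = MagicMock()
    fake_llm.invoke.side_effect = ConnectionError("network unreachable")

    with patch.object(llm_prompt, "ChatGoogleGenerativeAI", return_value=fake_llm):
        with pytest.raises(ConnectionError):
            llm_prompt.send_prompt_to_gemini("Hello world", "fake_key", "gemini-1.5-pro")
```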
💡 Interesting Point: Don't judge test coverage by lines of code alone. CLI tools with external dependencies legitimately need extensive testing if you want them to be reliable.
What I Learned
This experiment taught me several things about AI-assisted development:
Constraints are essential: Spec-driven development prevented Claude from wandering off into feature creep. Whether that's Tessl, SpecKit, or just well-written specifications, some form of constraint seems necessary to keep AI focused.
Quality review is essential: The AI happily generated tests that only verified mocks worked properly. It took explicit prompting to get it to self-critique and generate meaningful tests. The concerning bit is how these meaningless tests still passed and gave a false sense of confidence. If I were to develop this further, I'd definitely add acceptance tests to verify actual end-to-end functionality.
Configuration debugging remains human work: The hardcoded model issue was obvious to me but completely invisible to Claude Code. Despite the README that formed the basis of the spec explicitly describing the .env file, Claude Code still hardcoded the value in the application code. AI appears excellent at following patterns but struggles with environment-specific nuances, even when given clear documentation about configuration requirements.
Testing coverage can be genuinely useful: Once corrected, the AI produced 17 tests that covered edge cases I probably wouldn't have bothered with manually. The 1:3.6 test-to-code ratio initially seemed excessive, but Claude's analysis of why CLI tools need extensive testing was quite convincing.
The process saved considerable time—about 70% compared to hand-coding everything. But it definitely wasn't hands-off development. More like having a very capable but literal-minded junior developer who needs clear instructions and careful review.
The Bottom Line
AI-assisted development can be quite effective—I managed to build a simple but fully functional CLI tool with comprehensive test coverage in 2 hours rather than the 6-8 hours it would have taken manually. Whilst the tool itself is just a learning exercise, the process required disciplined constraints (spec-driven development), careful quality review (catching circular mock tests), and recognising where human oversight remains irreplaceable (environment configuration debugging).
It's an interesting glimpse into a workflow where developers might spend more time on architecture and quality review whilst AI handles the mechanical aspects of implementation and testing. Whether that's actually better remains to be seen, but it's certainly faster.
Tools mentioned: Claude Code (command-line AI coding assistant), Tessl (spec-driven development), LangChain (AI framework), Google Gemini API