I Built a Simple LLM-Based Test Failure Explainer
And Learned Why "Smart" Automation Fails Quietly
You know what’s wild? After 14 years in test automation, the failure analysis process hasn’t really changed.
A test breaks. There’s a mountain of logs. Maybe a screenshot. And then... crickets. Someone reruns it. Or worse, someone ignores it.
I kept seeing all this hype about “AI in testing” and figured, why not actually build something? Not another think piece—something real. Something small. Something that solves an actual problem I deal with every day.
So I built an LLM-based tool that reads test failures and explains why they probably failed. In plain English.
Here’s what happened.
The Real Problem Nobody Talks About
Everyone’s excited about AI for test generation, self-healing locators, autonomous testing—all that sexy stuff.
But that’s not where teams bleed time.
The real cost? It’s the 20-30 minutes you spend per failure, re-reading the same stack traces, trying to figure out if it’s a product bug, a test bug, or just flaky infrastructure.
After doing this for over a decade, I can tell you: the bottleneck isn’t writing tests. It’s understanding why they fail, quickly.
So that’s all I focused on.
What I Didn’t Build (And Why That Matters)
This is actually more important than what I built.
I deliberately avoided:
Auto-fixing tests
Modifying code
Re-running pipelines
Deep framework integration
Why? Because I’ve seen what happens when automation starts making decisions without human context. It becomes this black box that nobody trusts. You end up debugging the automation instead of the tests.
My only goal was simple: turn raw failure data into something a human can actually understand in under a minute.
The Architecture (Boring on Purpose)
I kept it stupidly simple.
Inputs:
Test name
Failure message
Stack trace
Last step that executed
Screenshot filename (I’m not processing images yet, just metadata)
Processing:
Normalize the logs
Strip out noise
Chunk large stack traces so they’re digestible
LLM Task:
Classify the failure type
Explain the likely cause
Suggest what to check next (not what to fix—important distinction)
Output:
5-7 bullet points
One-line “most probable cause”
That’s it.
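To make the shapes concrete, here's a minimal sketch of the pipeline's inputs and outputs. Every name here is illustrative (I'm not publishing the real schema), and `explain` is just a stub standing in for the normalize-strip-chunk-LLM flow:

```python
# Illustrative data shapes for the pipeline described above.
from dataclasses import dataclass, field

@dataclass
class FailureInput:
    test_name: str
    failure_message: str
    stack_trace: str
    last_step: str
    screenshot_file: str  # metadata only; the image itself is never sent

@dataclass
class Explanation:
    category: str        # e.g. "locator", "timing", "assertion", "environment"
    probable_cause: str  # the one-line summary
    checks: list = field(default_factory=list)  # the 5-7 "what to check next" bullets

def explain(inp: FailureInput) -> Explanation:
    # Placeholder for: normalize -> strip noise -> chunk -> call the LLM.
    return Explanation(
        category="unknown",
        probable_cause=f"Pending analysis of {inp.test_name}",
        checks=[],
    )
```

The whole tool is basically one function from `FailureInput` to `Explanation`. No state, no side effects.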
Step 1: Cleaning Up the Logs (Where Most Tools Die)
Raw logs are terrible for LLMs. They’re terrible for humans too, but at least we can skim.
Before I send anything to the model, I strip out:
Timestamps (who cares?)
Repeated stack frames (framework noise)
Boilerplate garbage
Everything after the meaningful exception
This did three things:
Cut token usage way down
Reduced hallucinations significantly
Made the explanations actually relevant
Here’s what I learned: LLMs don’t need more data. They need cleaner signals. Same as humans, honestly.
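The cleanup step can be sketched in a few lines. The regex and the framework-frame prefixes below are illustrative, not my actual patterns; the point is that it's dumb string filtering, not anything clever:

```python
import re

# Drop leading timestamps, skip framework stack frames, and collapse
# repeated lines. Patterns here are examples, not the real ones.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*\s*")
FRAMEWORK_FRAMES = ("at org.junit.", "at sun.reflect.", "at java.lang.reflect.")

def clean_log(raw: str) -> str:
    lines, seen = [], set()
    for line in raw.splitlines():
        line = TIMESTAMP.sub("", line).rstrip()
        if not line:
            continue
        if line.lstrip().startswith(FRAMEWORK_FRAMES):
            continue  # framework noise adds tokens, not signal
        if line in seen:
            continue  # repeated frames: keep only the first occurrence
        seen.add(line)
        lines.append(line)
    return "\n".join(lines)
```
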
Step 2: Prompt Design (No Magic Here)
Early on, I tried asking the model to “analyze the failure.”
Big mistake. Too vague.
The responses sounded intelligent but were completely useless.
So I forced structure into the prompt:
Identify the failure category first
Explain the cause using testing terminology, not developer jargon
Distinguish between test issues, product bugs, and environment problems
Do not suggest code fixes (this is critical)
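In template form, the strict version looks roughly like this. The wording is paraphrased from memory, not the exact prompt:

```python
# A paraphrase of the structured prompt. The ordering matters:
# category first, then cause, then next checks.
PROMPT = """You are a test failure analyst. Given the failure data below:

1. Classify the failure: test issue, product bug, or environment problem.
2. Explain the likely cause in testing terminology, not developer jargon.
3. Suggest what to check next. Do NOT suggest code fixes.

Test name: {test_name}
Last step: {last_step}
Failure message: {failure_message}
Stack trace (trimmed):
{stack_trace}
"""

def build_prompt(test_name, last_step, failure_message, stack_trace):
    return PROMPT.format(
        test_name=test_name,
        last_step=last_step,
        failure_message=failure_message,
        stack_trace=stack_trace,
    )
```
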
When the prompt was loose, I got smart-sounding nonsense.
When the prompt was strict, I got boring, actionable answers.
I’ll take boring over clever every single time.
Step 3: Early Results (Better Than Expected, But...)
For straightforward failures—missing elements, timeouts, assertion mismatches—it actually worked really well.
The explanations were genuinely helpful:
“Locator is probably too specific”
“Page load was delayed”
“Assertion is checking dynamic text”
But then I noticed something dangerous.
The model was always confident. Even when it was wrong.
That’s when I realized I needed guardrails.
Step 4: Embracing Uncertainty (This Changed Everything)
I added one simple rule to the prompt:
If multiple causes are plausible, say so explicitly.
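Mechanically, that's one extra instruction appended to the prompt, nothing more. The likelihood labels below are my own illustration of how you might phrase it:

```python
# The single guardrail, bolted onto the existing prompt. Wording is
# paraphrased; the labels (high/medium/low) are illustrative.
UNCERTAINTY_RULE = (
    "If multiple causes are plausible, list each one with a rough "
    "likelihood (high / medium / low) instead of committing to one."
)

def with_uncertainty(prompt: str) -> str:
    return prompt.rstrip() + "\n\n" + UNCERTAINTY_RULE + "\n"
```
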
This one change:
Reduced false confidence
Increased trust from the team
Stopped people from blindly following “what the AI said”
Turns out, sometimes the best AI improvement isn’t making it smarter. It’s making it honest about what it doesn’t know.
What Broke (Learn From My Mistakes)
1. Flaky Tests Broke the Model’s Brain
LLMs hate randomness. They try to find patterns even when there aren’t any. They’ll give you these elaborate theories about why a flaky test failed when the real answer is just “because it’s Tuesday.”
My fix: flag known flaky tests in the system, and change the explanation tone to acknowledge probability instead of certainty.
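A sketch of that flag. In practice the known-flaky set would come from your test management system or retry history; the test names here are made up:

```python
# Hypothetical known-flaky registry; in reality this would be fed from
# retry statistics, not hardcoded.
KNOWN_FLAKY = {"checkout_smoke_test", "search_pagination_test"}

def tone_for(test_name: str) -> str:
    if test_name in KNOWN_FLAKY:
        return ("This test has a history of flakiness. Frame the cause as a "
                "probability, not a certainty, and mention retry history.")
    return "Explain the most likely cause directly."
```

The returned sentence gets appended to the prompt, so the model stops inventing confident theories for tests we already know are unstable.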
2. Screenshots Are Overrated (Without Context)
Everyone thinks adding image analysis will be a game-changer. It wasn’t.
Without the DOM state or step context, the screenshots didn’t help much at all. The model would describe what it saw, but couldn’t really explain why it mattered.
Lesson learned: multimodal AI is useless if your test framework doesn’t capture semantic structure. Fix your instrumentation first.
3. Long Stack Traces Degraded Everything
Even after trimming, really deep stack traces made the output quality drop hard.
Solution: hard cap on stack depth. Focus only on the first meaningful failure point. More isn’t better.
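The cap itself is trivial. Ten frames is my illustrative default here, not a tuned value, and the `"at "` prefix assumes Java-style traces:

```python
# Hard cap on stack depth: keep the exception line plus the first N
# frames, then truncate. N=10 and the "at " prefix are assumptions.
MAX_FRAMES = 10

def cap_stack(trace: str, max_frames: int = MAX_FRAMES) -> str:
    kept, frames = [], 0
    for line in trace.splitlines():
        if line.lstrip().startswith("at "):
            frames += 1
            if frames > max_frames:
                kept.append("    ... (truncated)")
                break
        kept.append(line)
    return "\n".join(kept)
```
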
What This Thing Is Actually Good For
Let me be realistic about this.
It doesn’t:
Replace debugging
Fix your tests
Remove the need for engineering judgment
It does:
Reduce cognitive load on the team
Speed up initial triage
Help junior engineers reason through failures better
Create consistency in how we explain what happened
In other words: it improves thinking, not execution.
And honestly? That’s exactly where AI belongs in testing.
The Bigger Lesson Here
AI in testing doesn’t fail because the models aren’t good enough.
It fails because we:
Try to automate judgment instead of augmenting it
Feed it garbage data and expect gold
Chase full replacement instead of useful assistance
The moment I started treating the LLM like “a smart junior engineer who can read logs really fast but doesn’t have all the context”—everything clicked.
Final Thought
If you’re experimenting with AI in your testing workflow, start where your team wastes the most thinking time. Not where the demos look impressive.
For most of us, that’s still the same question we’ve been asking for years:
“Why did this test fail... again?”

