Jovil’s Newsletter

Manual Testing is NOT Dead — It’s Becoming Elite (Part 2)

Jovil Pasrija — Mon, 04 May 2026 03:47:54 GMT

Automation is powerful, and AI is moving faster than most teams are ready for. So it's worth asking honestly:

If machines can test at scale, what's left for humans to do?

It's an uncomfortable question, because it challenges a belief many people have already accepted.

Automation Works Inside Boundaries

Automation is incredibly efficient.

It:

Executes predefined steps
Validates expected outcomes
Runs thousands of checks in minutes

But all of this happens within one constraint:

👉 It only operates within boundaries you define.

And here’s the problem:

Real-world systems don’t fail inside boundaries.
They fail outside them.

What Automation Can't Ask

Automation answers one question extremely well:

✔ Did the system behave as expected?

But it cannot ask:

❌ Was the expectation correct in the first place?

That gap requires someone who can question the assumption before the test is ever written.

And no framework, tool, or AI model can fully close it.

Where the Real Problems Actually Live

Automation tests correctness.

Manual testing explores experience + risk.

That gap is where most real problems live.

Because users don’t think in terms of:

“Expected vs actual result”

They think in terms of:

👉 “Does this make sense?”
👉 “Why is this confusing?”
👉 “Can I trust this product?”

And those questions don’t return a boolean result.

A Quiet Failure Scenario

Imagine this:

Your automation suite passes 100%.

Payments work
APIs respond correctly
UI elements function as expected

Everything is green.

But users are abandoning the checkout flow at the pricing step - not because it is broken, but because the copy creates hesitation nobody thought to test for.

Why?

Pricing feels unclear
Flow creates hesitation
Something feels… off

No test failed.

But the product did.


Where Manual Testers Still Win

Even in highly automated systems, manual testers outperform in key areas:

1. Ambiguity

Requirements are rarely perfect.
Humans interpret. Automation executes.

2. Exploratory Thinking

Unexpected flows. Strange inputs. Real-world chaos.

3. UX Judgment

“Feels wrong” cannot be asserted in code.

4. Risk Anticipation

Experienced testers see patterns before failure happens.

And What About AI?

AI changes the game.

It can:

Generate test cases
Suggest edge scenarios
Analyze logs faster than humans

But here’s what AI still depends on:

👉 Data
👉 Context
👉 Problem framing

If those are flawed…

AI doesn’t fix the problem.

It scales the flaw.

The More Automation Grows, The More Thinking Matters

The more automation grows…

The more valuable thinking becomes.

Because someone still needs to:

Decide what to automate
Interpret results beyond pass/fail
Identify what’s missing, not just what’s covered

Automation increases speed.

But thinking defines direction.

Final Thought

Automation is not the enemy of manual testing.

It’s an amplifier.

But amplification without direction is dangerous.

Because you don’t just scale what’s right.

You scale what you assume is right.


Manual Testing is NOT Dead - It’s Becoming Elite (Part 1)

Jovil Pasrija — Thu, 23 Apr 2026 04:02:12 GMT

There’s a sentence floating around the tech world that refuses to die:

“Manual testing is dead.”

It shows up in posts, conference talks, job descriptions, and sometimes… in quiet conversations between colleagues.

Not loud. Not aggressive.
Just enough to plant doubt.

And over time, that doubt turns into pressure.

👉 Am I behind?
👉 Am I becoming irrelevant?
👉 Should I drop everything and chase automation?

If you’ve ever felt that… you’re not alone.

I’ve been in testing long enough to watch this narrative evolve.

It didn’t start as “manual testing is dead.”

It started as something more reasonable:

“Automation is important.”

Which is true.

But somewhere along the way, the message got distorted.

And now we have an entire generation of testers questioning the value of the very skill that built their foundation.

Where This Narrative Comes From

To understand why this belief exists, you have to look at how the industry changed.

Software delivery accelerated.

Agile replaced long release cycles
DevOps removed handoffs
CI/CD pipelines made releases continuous

Speed became the currency.

And automation became the engine powering that speed.

Naturally, companies started prioritizing:

Test automation engineers
SDET roles
Engineers who could code and test

From a business perspective, it makes sense.

If you can run thousands of tests in minutes… why rely on humans?

That’s the surface-level logic.

And it’s convincing.

But There’s a Problem Hidden Beneath It

Automation solves a very specific problem:

👉 Repetition at scale.

It is exceptional at:

Running predefined checks
Validating known behavior
Preventing regressions

But here’s what it does not solve:

Discovering unknown risks
Understanding user confusion
Questioning product decisions
Identifying gaps in requirements

In other words:

Automation protects what you know.
Manual testing explores what you don’t.

And modern systems fail more often in the unknown.

The Part No One Says Out Loud

Let’s be brutally honest for a moment.

A certain type of manual testing is disappearing.

The kind where:

Test cases are followed step-by-step without thinking
Bugs are logged without understanding impact
Testing is treated as execution, not investigation

That version of testing is easy to replace.

Not just by automation.

But by:

junior resources
outsourced teams
or even AI-assisted tools

So when people say “manual testing is dead,”
what they’re actually seeing is this version fading away.

Meanwhile… Something Else is Rising

While one version declines, another version is becoming incredibly valuable.

This tester:

Challenges assumptions
Asks uncomfortable questions
Connects technical behavior to business impact
Thinks in edge cases, not just happy paths

They don’t just test features.

They interrogate systems.

They don’t wait for instructions.

They create direction.

The Shift Most People Miss

Testing used to be about:

✔ Verification
“Does the system work as expected?”

Now it’s about:

⚠ Risk Discovery
“What could go wrong when real users interact with this?”

That shift changes everything.

Because risk is not always visible in test cases.

It lives in:

incomplete requirements
misunderstood users
assumptions nobody challenged

And those are human problems.

Why This Matters for You

If you believe the narrative blindly, you’ll react like most people do:

Learn tools quickly
Copy automation frameworks
Compete in a crowded space

But here’s the irony:

The more people chase tools…

The more valuable thinking becomes.

Because tools are accessible.

Thinking is not.

A Quiet Realization

After years in testing, here’s something I’ve learned:

The industry doesn’t eliminate roles.

It evolves expectations.

Manual testing didn’t disappear.

It matured.

And like anything that matures…

It became harder.

What This Series Will Show You

This isn’t a motivational piece.

It’s a reality check.

In Part 2, we’ll go deeper:

Why automation cannot replace thinking
Where even advanced automation fails in real-world systems
The invisible gap between “working software” and “usable software”

And in Part 3:

How to evolve your skillset
What actually makes a tester valuable today
A practical roadmap you can start immediately

If you’ve ever felt uncertain about your place in this industry…

You’re asking the right questions.

Now it’s time to find better answers.

Part 2 is where things get uncomfortable… and clear.

Upcoming:
Part 2: Why Automation Will Never Be Enough

I Built an AI-Assisted Release Confidence Engine

Jovil Pasrija — Mon, 16 Feb 2026 19:21:16 GMT

After building that test failure explainer I wrote about last week, something became really obvious.

Failures aren’t the real decision point.

Shipping is.

Most release decisions still look like this:

All tests passed → Ship it
Some tests failed → Hold everything
Flaky failures → “Probably fine?”

That’s not quality engineering. That’s checkbox engineering.

So I built something slightly more ambitious: a system that converts test signals into a structured release confidence score.

Not automation of approval. Not replacing QA sign-off. Just structured reasoning at release time.

Here’s what happened.

The Real Problem (It’s Not What You Think)

CI pipelines are noise machines.

A release decision actually depends on a bunch of signals:

How many failures and what type
Historical flakiness patterns
Which components are impacted
The risk profile of this specific release

But these signals live in completely different places. Test reports. Jira tickets. Git diffs. Random Slack threads at 4 PM on Friday.

Humans manually synthesize all of this under time pressure.

That’s the actual bottleneck.

After 14 years of watching release meetings, I can tell you: the problem isn’t that we don’t have enough data. It’s that we have too much, and no structured way to think about it.

What I Explicitly Did NOT Build

This matters more than what I built.

I avoided:

Auto-blocking releases
Auto-approving builds
Touching the pipeline itself

The system produces a structured “confidence narrative.” That’s it.

Humans still make the call.

Why? Because I’ve seen what happens when automation starts making release decisions without context. It becomes this thing that everyone routes around. You end up with more problems than you started with.

The Inputs (Nothing Magical)

For each release candidate, I aggregated:

Total tests run
Failure categories (using my earlier failure explainer)
Flaky test history
Code change size
Impacted services
Recent defect trends

Just structured signals. Nothing you don’t already have.
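As a rough illustration, the inputs listed above could be gathered into one record per release candidate. This is only a sketch: the class name, field names, and example values are hypothetical, not the actual schema behind the system.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReleaseSignals:
    """Hypothetical container for the per-release-candidate inputs."""
    total_tests: int
    failures_by_category: dict[str, int]  # e.g. {"product_bug": 2, "environment": 1}
    flaky_test_names: list[str]           # tests with a known flaky history
    lines_changed: int                    # one possible measure of code change size
    impacted_services: list[str]
    defects_last_30_days: int             # one possible measure of recent defect trends

    def failure_count(self) -> int:
        # Total failures across all categories.
        return sum(self.failures_by_category.values())

    def to_json(self) -> str:
        # Stable key order keeps the serialized form easy to diff between releases.
        return json.dumps(asdict(self), sort_keys=True)


signals = ReleaseSignals(
    total_tests=1200,
    failures_by_category={"product_bug": 2, "environment": 1},
    flaky_test_names=["test_checkout_retry"],
    lines_changed=48,
    impacted_services=["auth"],
    defects_last_30_days=4,
)
print(signals.to_json())
```

Keeping everything in one flat record makes the later normalization step simpler, since there is a single shape to clean up.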

The Key Design Choice That Changed Everything

Instead of asking “Is this release safe?”

I reframed it:

What are the risk drivers?
What failure types increased compared to baseline?
Are failures clustered in critical paths?
Is this regression pattern consistent with the change size?

That framing changed everything.

Here’s what I learned: LLMs are actually pretty good at structured synthesis. They’re terrible at binary judgment.

So I stopped asking it to judge and started asking it to synthesize.

How the System Actually Works

The workflow is straightforward:

1. Aggregate structured release data
2. Normalize it (same principle as the failure explainer: clean inputs matter)
3. Send it to the LLM with very specific constraints:
   - Highlight risk clusters
   - Compare against historical baseline
   - Avoid absolute statements
   - Explicitly state uncertainty

The output format:

Risk factors identified
Stability signals
Areas requiring human review
Suggested confidence band (High / Moderate / Low)

Not a verdict. A reasoning summary.

The kind of thing that would take a human 20 minutes to compile, but the LLM does in seconds.
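To make the workflow concrete, here is a minimal sketch of how the constraints and the output format might be assembled into a prompt. The function name, the wording of the instructions, and the dict-shaped inputs are assumptions for illustration. The actual prompt is not shown in this post, and the call to the model is left out on purpose.

```python
import json

# The four constraints and four output sections come from the lists above.
CONSTRAINTS = [
    "Highlight risk clusters.",
    "Compare against the historical baseline.",
    "Avoid absolute statements.",
    "Explicitly state uncertainty.",
]

OUTPUT_SECTIONS = [
    "Risk factors identified",
    "Stability signals",
    "Areas requiring human review",
    "Suggested confidence band (High / Moderate / Low)",
]


def build_release_prompt(current: dict, baseline: dict) -> str:
    """Assemble a synthesis prompt; asks for a summary, never a ship/no-ship verdict."""
    lines = [
        "Summarize the release signals below for a human reviewer.",
        "Do not approve or block the release.",
        "",
        "Constraints:",
    ]
    lines += [f"- {constraint}" for constraint in CONSTRAINTS]
    lines += ["", "Respond with exactly these sections:"]
    lines += [f"{number}. {name}" for number, name in enumerate(OUTPUT_SECTIONS, 1)]
    lines += [
        "",
        "Current release data:",
        json.dumps(current, sort_keys=True, indent=2),
        "",
        "Historical baseline:",
        json.dumps(baseline, sort_keys=True, indent=2),
    ]
    return "\n".join(lines)


prompt = build_release_prompt({"failures": 3}, {"failures": 1})
```

The resulting string can be sent to whichever model client a team already uses.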

What Surprised Me

Small releases sometimes got flagged as low confidence.

Large releases sometimes scored higher.

This felt wrong at first. Surely bigger changes = more risk?

But then I looked closer:

Change size doesn’t equal risk
Failure type matters way more than failure count
Clustering predicts instability better than totals

A 5-line config change that breaks auth is riskier than a 500-line refactor in a well-tested util library.

This completely changed how I think about regression signals.
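One simple way to look at clustering instead of totals is to group failures by service and see how concentrated they are. The sketch below assumes each failure has already been mapped to a service name. The function name and the example services are hypothetical.

```python
from collections import Counter


def failure_clusters(failed_services, critical_services):
    """Group failures by service, largest cluster first.

    failed_services: one service name per failed test.
    critical_services: set of service names treated as critical paths.
    """
    counts = Counter(failed_services)
    total = sum(counts.values())
    clusters = []
    for service, count in counts.most_common():
        clusters.append({
            "service": service,
            "failures": count,
            "share": count / total,  # fraction of all failures in this service
            "critical": service in critical_services,
        })
    return clusters


clusters = failure_clusters(["auth", "search", "auth", "auth"], {"auth"})
```

Here three of the four failures sit in a critical service, which says more than the raw total of four.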

What Broke (Learn From My Mistakes)

The first version over-weighted failure count.

It would punish releases for:

Known flaky tests that nobody cares about
Low-impact modules that don’t affect users

So I introduced:

Historical flakiness discounting
Service criticality weighting
Regression vs known issue differentiation

The system improved immediately.

Not because I changed the model. Because I changed the framing.

Same lesson as the failure explainer: clean inputs and good prompts beat smart models.
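A minimal sketch of the three adjustments described above, assuming each failure carries a flake rate, a service name, and a known-issue flag. The weights, factor, and service names are made-up placeholders, not values from the real system.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    test_name: str
    service: str
    flake_rate: float   # fraction of recent runs that failed with no code change (0.0 to 1.0)
    known_issue: bool   # True if this failure matches an already tracked issue


# Hypothetical criticality weights; a real team would calibrate these.
CRITICALITY = {"auth": 3.0, "payments": 3.0, "reporting": 0.5}
DEFAULT_CRITICALITY = 1.0
KNOWN_ISSUE_FACTOR = 0.25  # known issues count for less than new regressions


def failure_weight(failure: Failure) -> float:
    weight = CRITICALITY.get(failure.service, DEFAULT_CRITICALITY)  # service criticality weighting
    weight *= 1.0 - failure.flake_rate                              # historical flakiness discounting
    if failure.known_issue:                                         # regression vs known issue
        weight *= KNOWN_ISSUE_FACTOR
    return weight


def weighted_failure_score(failures) -> float:
    return sum(failure_weight(f) for f in failures)


failures = [
    Failure("test_login", "auth", flake_rate=0.0, known_issue=False),
    Failure("test_export", "reporting", flake_rate=0.8, known_issue=False),
    Failure("test_search_sort", "search", flake_rate=0.0, known_issue=True),
]
score = weighted_failure_score(failures)
```

With these placeholder numbers, the single auth regression dominates the score, while the flaky reporting test and the known search issue barely register. A raw failure count would have treated all three the same.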

Where This Actually Helps

This system is useful for:

QE leads making decisions under time pressure
Engineering managers reviewing release health
Cross-team alignment discussions
Avoiding those emotional “ship vs don’t ship” debates at 5 PM

It creates structured conversation instead of gut feelings and arguments.

That’s valuable.

What This Is NOT

Let me be very clear.

This is not:

A replacement for QA
A safety certification
A risk oracle

It is: a reasoning layer over noisy test systems.

It’s the same philosophy as the failure explainer. It doesn’t make decisions. It helps humans make better decisions, faster.

The Pattern I Keep Seeing

Both the failure explainer and this confidence engine follow the same principle:

Don’t automate judgment. Automate synthesis.

The judgment still needs human context. But the synthesis—pulling together scattered signals, normalizing data, identifying patterns—that’s where AI actually shines.

Every time I’ve tried to push AI further into the “decision” territory, it breaks down. Trust erodes. People route around it.

But when I keep it in “assistant” territory, as a helper and not a replacement, it works.

If I Were Scaling This

At 10× usage, I’d add:

Release confidence trends over time
Prediction accuracy tracking (did low-confidence releases actually have issues?)
Team-specific risk tolerance calibration
Integration with incident management systems
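Prediction accuracy tracking could start very small. The sketch below assumes a history of past releases recorded as pairs of confidence band and whether an incident followed. The function name and the sample history are hypothetical.

```python
from collections import defaultdict


def incident_rate_by_band(history):
    """history: iterable of (confidence_band, had_incident) pairs for past releases."""
    totals = defaultdict(int)
    incidents = defaultdict(int)
    for band, had_incident in history:
        totals[band] += 1
        incidents[band] += int(had_incident)
    # Fraction of releases in each band that were followed by an incident.
    return {band: incidents[band] / totals[band] for band in totals}


rates = incident_rate_by_band([
    ("Low", True),
    ("Low", False),
    ("High", False),
    ("High", False),
])
```

If low-confidence releases do not show a higher incident rate than high-confidence ones over time, that is a sign the framing or the weights need another look.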

At 100×, this becomes a release intelligence platform that learns from outcomes and gets better at synthesis.

But I’d never let it auto-approve or auto-block. That line is important.

Final Thought

Look, I’ve been in enough release meetings to know that most of the stress comes from uncertainty, not risk.

“Should we ship this?” is a hard question when you’re staring at 15 test failures, 3 known flaky tests, a code change that touches 12 files, and a deployment window that closes in 2 hours.

This system doesn’t eliminate risk. It eliminates confusion.

And honestly? That’s more valuable than any “smart” automation that tries to make the call for you.

I Built a Simple LLM-Based Test Failure Explainer

Jovil Pasrija — Mon, 09 Feb 2026 04:12:23 GMT

You know what’s wild? After 14 years in test automation, the failure analysis process hasn’t really changed.

A test breaks. There’s a mountain of logs. Maybe a screenshot. And then... crickets. Someone reruns it. Or worse, someone ignores it.

I kept seeing all this hype about “AI in testing” and figured, why not actually build something? Not another think piece—something real. Something small. Something that solves an actual problem I deal with every day.

So I built a small LLM-based tool that reads test failures and explains why they probably failed. In plain English.

Here’s what happened.

The Real Problem Nobody Talks About

Everyone’s excited about AI for test generation, self-healing locators, autonomous testing—all that sexy stuff.

But that’s not where teams bleed time.

The real cost? It’s the 20-30 minutes you spend per failure, re-reading the same stack traces, trying to figure out if it’s a product bug, a test bug, or just flaky infrastructure.

After doing this for over a decade, I can tell you: the bottleneck isn’t writing tests. It’s understanding why they fail, quickly.

So that’s all I focused on.

What I Didn’t Build (And Why That Matters)

This is actually more important than what I built.

I deliberately avoided:

Auto-fixing tests
Modifying code
Re-running pipelines
Deep framework integration

Why? Because I’ve seen what happens when automation starts making decisions without human context. It becomes this black box that nobody trusts. You end up debugging the automation instead of the tests.

My only goal was simple: turn raw failure data into something a human can actually understand in under a minute.

The Architecture (Boring on Purpose)

I kept it stupidly simple.

Inputs:

Test name
Failure message
Stack trace
Last step that executed
Screenshot filename (I’m not processing images yet, just metadata)

Processing:

Normalize the logs
Strip out noise
Chunk large stack traces so they’re digestible

LLM Task:

Classify the failure type
Explain the likely cause
Suggest what to check next (not what to fix—important distinction)

Output:

5-7 bullet points
One-line “most probable cause”

That’s it.
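The inputs and outputs above map naturally onto two small record types. This is a sketch with hypothetical class and field names, not the actual code. The example bullets are placeholders, loosely based on the kinds of explanations mentioned elsewhere in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureInput:
    """What gets collected for each failed test."""
    test_name: str
    failure_message: str
    stack_trace: str
    last_step: str
    screenshot_filename: Optional[str] = None  # metadata only, the image is not processed


@dataclass
class Explanation:
    """What comes back: a one-line cause plus a short list of bullets."""
    most_probable_cause: str
    bullets: list[str]

    def is_well_formed(self) -> bool:
        # Enforce the output contract: a non-empty one-liner and 5-7 bullets.
        return bool(self.most_probable_cause.strip()) and 5 <= len(self.bullets) <= 7


example = Explanation(
    most_probable_cause="Page load was delayed",
    bullets=[
        "Category: environment problem",
        "The last executed step waited on a page element",
        "The failure message mentions a timeout",
        "Check the environment response times for this run",
        "Check whether other tests on the same page also timed out",
    ],
)
```

Checking the shape of the response before showing it to anyone is a cheap way to catch a model that ignored the format.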

Step 1: Cleaning Up the Logs (Where Most Tools Die)

Raw logs are terrible for LLMs. They’re terrible for humans too, but at least we can skim.

Before I send anything to the model, I strip out:

Timestamps (who cares?)
Repeated stack frames (framework noise)
Boilerplate garbage
Everything after the meaningful exception

This did three things:

Cut token usage way down
Reduced hallucinations significantly
Made the explanations actually relevant

Here’s what I learned: LLMs don’t need more data. They need cleaner signals. Same as humans, honestly.
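A minimal sketch of that kind of cleanup, using only the standard library. The timestamp pattern and the DEBUG/TRACE noise rule are assumptions about one particular log format, and only consecutive duplicate lines are collapsed. Cutting everything after the meaningful exception depends on the framework, so it is left out here.

```python
import re

# Assumed format: an ISO-like timestamp at the start of a line, optionally in brackets.
TIMESTAMP = re.compile(
    r"^\[?\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?\]?\s*"
)
# Assumed boilerplate rule: drop low-level log lines entirely.
NOISE = re.compile(r"^(DEBUG|TRACE)\b")


def clean_log(raw: str) -> str:
    cleaned = []
    for line in raw.splitlines():
        line = TIMESTAMP.sub("", line).rstrip()
        if not line or NOISE.match(line):
            continue  # skip blank lines and boilerplate
        if cleaned and cleaned[-1] == line:
            continue  # collapse consecutive repeated lines, e.g. retry frames
        cleaned.append(line)
    return "\n".join(cleaned)


raw_log = (
    "2026-02-09 04:12:23,123 DEBUG starting\n"
    "2026-02-09 04:12:24,001 ERROR element not found\n"
    "    at framework.retry\n"
    "    at framework.retry\n"
    "    at test.login"
)
cleaned_log = clean_log(raw_log)
```

For this sample, five raw lines shrink to three: the error line and two distinct frames.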

Step 2: Prompt Design (No Magic Here)

Early on, I tried asking the model to “analyze the failure.”

Big mistake. Too vague.

The responses sounded intelligent but were completely useless.

So I forced structure into the prompt:

Identify the failure category first
Explain the cause using testing terminology, not developer jargon
Distinguish between test issues, product bugs, and environment problems
Do not suggest code fixes (this is critical)

When the prompt was loose, I got smart-sounding nonsense.

When the prompt was strict, I got boring, actionable answers.

I’ll take boring over clever every single time.
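As an illustration, the four rules above can be kept as data and placed ahead of the failure details. The exact prompt is not shown in this post, so the wording, the function name, and the parameter names below are assumptions.

```python
# The four rules from the list above, phrased as instructions.
RULES = [
    "Identify the failure category first.",
    "Explain the cause using testing terminology, not developer jargon.",
    "Distinguish between test issues, product bugs, and environment problems.",
    "Do not suggest code fixes.",
]


def build_explainer_prompt(test_name, failure_message, stack_trace, last_step):
    """Build a strict prompt: rules first, output format second, data last."""
    lines = ["Explain why this automated test probably failed.", "", "Rules:"]
    lines += [f"{number}. {rule}" for number, rule in enumerate(RULES, 1)]
    lines += [
        "",
        "Output: 5-7 bullet points, then a one-line most probable cause.",
        "",
        f"Test name: {test_name}",
        f"Last step executed: {last_step}",
        f"Failure message: {failure_message}",
        "Stack trace:",
        stack_trace,
    ]
    return "\n".join(lines)


prompt = build_explainer_prompt(
    "test_login",
    "Element not found",
    "NoSuchElementException\n  at LoginPage.submit",
    "Click submit",
)
```

Keeping the rules in a list makes it easy to add or reword one without touching the rest of the prompt.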

Step 3: Early Results (Better Than Expected, But...)

For straightforward failures—missing elements, timeouts, assertion mismatches—it actually worked really well.

The explanations were genuinely helpful:

“Locator is probably too specific”
“Page load was delayed”
“Assertion is checking dynamic text”

But then I noticed something dangerous.

The model was always confident. Even when it was wrong.

That’s when I realized I needed guardrails.

Step 4: Embracing Uncertainty (This Changed Everything)

I added one simple rule to the prompt:

If multiple causes are plausible, say so explicitly.

This one change:

Reduced false confidence
Increased trust from the team
Stopped people from blindly following “what the AI said”

Turns out, sometimes the best AI improvement isn’t making it smarter. It’s making it honest about what it doesn’t know.

What Broke (Learn From My Mistakes)

1. Flaky Tests Broke the Model’s Brain

LLMs hate randomness. They try to find patterns even when there aren’t any. They’ll give you these elaborate theories about why a flaky test failed when the real answer is just “because it’s Tuesday.”

My fix: flag known flaky tests in the system, and change the explanation tone to acknowledge probability instead of certainty.
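One way to sketch that fix is a lookup of known flaky tests that switches the tone instruction added to the prompt. The test name, the wording of both instructions, and the function name are hypothetical.

```python
# Hypothetical registry of tests already known to be flaky.
KNOWN_FLAKY = {"test_checkout_retry"}

DEFAULT_TONE = (
    "State the most probable cause. "
    "If multiple causes are plausible, say so explicitly."
)
FLAKY_TONE = (
    "This test is flagged as known flaky. "
    "Describe possible causes in terms of likelihood, "
    "and do not present any single cause as certain."
)


def tone_instruction(test_name, known_flaky=KNOWN_FLAKY):
    """Pick the tone line to append to the prompt for this test."""
    return FLAKY_TONE if test_name in known_flaky else DEFAULT_TONE
```

The model still sees the same failure data. Only the framing of the answer changes.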

2. Screenshots Are Overrated (Without Context)

Everyone thinks adding image analysis will be a game-changer. It wasn’t.

Without the DOM state or step context, the screenshots didn’t help much at all. The model would describe what it saw, but couldn’t really explain why it mattered.

Lesson learned: multimodal AI is useless if your test framework doesn’t capture semantic structure. Fix your instrumentation first.

3. Long Stack Traces Degraded Everything

Even after trimming, really deep stack traces made the output quality drop hard.

Solution: hard cap on stack depth. Focus only on the first meaningful failure point. More isn’t better.
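A hard cap can be as simple as keeping the exception header plus the first few frames. In this sketch the default of 8 frames and the frame prefixes are assumptions, and the function name is hypothetical.

```python
# Assumed frame markers: "at ..." (Java/JavaScript style) and "File ..." (Python style).
FRAME_PREFIXES = ("at ", "File ")


def cap_stack_trace(trace: str, max_frames: int = 8) -> str:
    """Keep everything up to and including the first max_frames frames."""
    lines = trace.splitlines()
    kept = []
    frames_seen = 0
    for index, line in enumerate(lines):
        if frames_seen == max_frames:
            # Replace the remainder with a short marker so the cut is visible.
            kept.append(f"... ({len(lines) - index} more lines omitted)")
            break
        if line.lstrip().startswith(FRAME_PREFIXES):
            frames_seen += 1
        kept.append(line)
    return "\n".join(kept)


trace = "TimeoutError: page did not load\n  at a\n  at b\n  at c\n  at d"
capped = cap_stack_trace(trace, max_frames=2)
```

The marker line matters: it tells the reader, and the model, that the trace was cut on purpose.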

What This Thing Is Actually Good For

Let me be realistic about this.

It doesn’t:

Replace debugging
Fix your tests
Remove the need for engineering judgment

It does:

Reduce cognitive load on the team
Speed up initial triage
Help junior engineers reason through failures better
Create consistency in how we explain what happened

In other words: it improves thinking, not execution.

And honestly? That’s exactly where AI belongs in testing.

The Bigger Lesson Here

AI in testing doesn’t fail because the models aren’t good enough.

It fails because we:

Try to automate judgment instead of augmenting it
Feed it garbage data and expect gold
Chase full replacement instead of useful assistance

The moment I started treating the LLM like “a smart junior engineer who can read logs really fast but doesn’t have all the context”—everything clicked.

Final Thought

If you’re experimenting with AI in your testing workflow, start where your team wastes the most thinking time. Not where the demos look impressive.

For most of us, that’s still the same question we’ve been asking for years:

“Why did this test fail... again?”
