How Claude Used Git History to Uncover a Ghostscript Overflow

In this post, we will explore how an AI model identified a decades-old class of memory bug by treating Git history as a security signal, not just a development diary.

The title How Claude Used Git History to Uncover a Ghostscript Overflow sounds like a party trick. In my experience, it’s actually a mirror held up to how enterprises still do “security work” in 2026: we scan the present, but we rarely interrogate the past.

Ghostscript is one of those foundational components that quietly sits under printing, PDF workflows, document pipelines, and legacy integrations. It’s been around for a long time, it’s widely deployed, and it’s exactly the kind of dependency leaders assume is “already hardened.”

The interesting part here isn’t “AI found a bug.” The interesting part is how it found it: by reading commit history, learning the intent of past security patches, and spotting a pattern that humans (and fuzzers) can miss when the fix is incomplete or inconsistent.

A high-level explanation of what happened

What’s been reported publicly is that Claude (in a controlled security research setup) reviewed Ghostscript’s code and its Git history, found an older security-related change, and then noticed the same type of bounds check wasn’t applied everywhere the logic existed.

That last bit matters. In real systems, “the same logic” often exists in more than one place because of copy-paste, parallel implementations, platform-specific branches, or older refactors that were never fully consolidated.

So the model didn’t need to invent a brand-new vulnerability class. It needed to do something that sounds simple but is painfully time-consuming at scale: diff the intent of a fix against the whole codebase, including older paths and duplicated code.

The technology behind it is workflow, not magic

When leaders hear “Claude found a buffer overflow,” they often imagine an AI doing exotic reverse engineering. In practice, the core enabling technologies are much more grounded:

  • Large Language Models (LLMs) that can read code, comments, commit messages, and diffs as one connected narrative.
  • Tool-using AI agents that can run developer tools (search, grep/ripgrep, Git commands, build/test tooling) instead of relying on memory.
  • Security reasoning that combines “what changed historically” with “what else looks structurally similar today.”

In my architecture work, I’ve seen organisations treat Git as an audit trail for who changed code and when. The shift here is treating Git as evidence for why code changed, and whether the fix pattern was applied consistently.

Why Git history is a security dataset

Git history is full of security-relevant signals that rarely make it into formal documentation:

  • Commit messages that mention “bounds,” “sanitize,” “CVE,” “crash,” “fuzz,” “overflow,” “underflow,” “OOB,” or “hardening.”
  • Diffs that show a developer adding checks in one file, but not in similar code paths elsewhere.
  • Reverts and follow-up commits that hint the original fix had side effects or was partial.

Humans can do this too, of course. The constraint is time. An LLM doesn’t get bored scanning for “other places this pattern appears.”

Buffer overflows explained for leaders in plain language

A buffer overflow is what happens when software writes more data into a fixed-size space than it can hold. Imagine a mailroom shelf built for 10 envelopes, and someone forces in 50. The extra envelopes spill into other shelves, and now other mail gets mixed up or damaged.

In low-level languages like C/C++, that “spill” can overwrite nearby memory. Best case, the program crashes. Worst case, an attacker can shape the overflow to change program behaviour in a controlled way.
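Ghostscript is written in C, and Python cannot overflow a bytearray the same way, but the shape of the defensive fix — validate the length before copying — can be sketched language-neutrally. The function and names below are illustrative, not from the Ghostscript source:

```python
def copy_into_buffer(buffer: bytearray, data: bytes) -> None:
    """Copy `data` into a fixed-size buffer, refusing anything too large.

    This mirrors the kind of bounds check a security patch adds in C:
    validate the length against the destination size *before* writing.
    """
    if len(data) > len(buffer):
        raise ValueError(
            f"input of {len(data)} bytes exceeds buffer of {len(buffer)}"
        )
    buffer[: len(data)] = data

shelf = bytearray(10)                     # a "mailroom shelf" with room for 10
copy_into_buffer(shelf, b"0123456789")    # exactly full: fine

try:
    copy_into_buffer(shelf, b"0123456789X")  # 11 envelopes into 10 slots
except ValueError as exc:
    print(exc)   # prints: input of 11 bytes exceeds buffer of 10
```

In C, the missing `if` means the write proceeds anyway; that single absent comparison is the whole vulnerability class.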

Ghostscript processes complex file formats (PostScript/PDF). Complex parsers are historically fertile ground for memory issues because they combine untrusted input, intricate logic, and performance-sensitive code.

How an AI model can use Git history to find a missed fix

Here’s a simplified version of the pattern I believe leaders should take away, regardless of the exact file names or patch details.

1) Find a security-relevant patch in history

An AI agent can start with Git queries that a good engineer would run, but do it relentlessly and systematically:

git log --oneline --grep="overflow"
git log --oneline --grep="bounds"
git log --oneline --grep="security"
git log --oneline --grep="CVE"

Then it can open the diff and identify the intent: “They added a bounds check before copying data into a stack buffer,” or “They validated a length field before indexing an array.”
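The triage step can be sketched in a few lines. This is a toy version: in a real agent loop the (hash, subject) pairs would come from running `git log --oneline`, whereas here they are hard-coded illustrative examples.

```python
import re

# Keywords that often mark security-relevant commits -- the same idea as the
# `git log --grep` queries above, applied in one pass.
SECURITY_PATTERN = re.compile(
    r"\b(overflow|underflow|bounds|sanitize|security|CVE|OOB|hardening)\b",
    re.IGNORECASE,
)

def security_candidates(commits):
    """Return (hash, subject) pairs whose subject matches a security keyword."""
    return [(h, s) for h, s in commits if SECURITY_PATTERN.search(s)]

# Illustrative data only; a real run would parse `git log --oneline` output.
log = [
    ("a1b2c3d", "Add bounds check before copying into stack buffer"),
    ("d4e5f6a", "Refactor device list handling"),
    ("0f9e8d7", "Fix OOB read when length field is huge"),
]
print(security_candidates(log))   # the first and third commits are flagged
```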

2) Generalise the fix pattern

This is where LLMs are strong. They can convert a specific patch into a reusable mental template:

  • What is the risky variable?
  • What is the buffer size?
  • What check was added?
  • What happens when the check fails?

In my experience, most incomplete fixes fail in one of two ways: the check exists but is off-by-one, or the check exists in one path but not in a parallel path.
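The off-by-one case is worth making concrete. For a buffer of size n, valid indices run 0 to n-1, so a check written with `<=` instead of `<` still admits one write past the end. A minimal sketch with invented names:

```python
def checked_write(buffer: bytearray, index: int, value: int) -> None:
    """Correct check: valid indices for a buffer of size n are 0 .. n-1."""
    if not 0 <= index < len(buffer):
        raise IndexError(f"index {index} out of range for size {len(buffer)}")
    buffer[index] = value

def off_by_one_write(buffer: bytearray, index: int, value: int) -> None:
    """Broken check: `<=` admits index == len(buffer), one slot past the end.

    In C, that write silently corrupts adjacent memory; Python happens to
    catch it at the assignment, which is exactly the safety net C lacks.
    """
    if not 0 <= index <= len(buffer):  # the subtle bug: <= instead of <
        raise IndexError(f"index {index} out of range for size {len(buffer)}")
    buffer[index] = value
```

A reviewer skimming the diff sees "a bounds check was added" in both versions; only reading the comparison operator against the buffer size reveals the difference.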

3) Search for “the same thing” implemented elsewhere

Humans typically search by function name. An AI can search by structure. It can look for:

  • Similar blocks of code (copy-paste variants).
  • Same data structure used in multiple modules.
  • Same parsing logic in different backends or device drivers.

# Rough idea of what an agent might do
rg -n "blend" ./
rg -n "stack" ./
rg -n "\bmemcpy\b|\bstrcpy\b|\bstrcat\b" ./

Then it asks a very human question: “Why is this check present over here but missing over there?”
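That question can itself be turned into a crude first-pass tool. The sketch below flags risky copy calls that have no size check in the few preceding lines; the regexes and the C snippet are invented for illustration, and a real agent reasons about the code rather than pattern-matching. Even so, this kind of pass surfaces the "present here, missing there" candidates worth a closer look.

```python
import re

RISKY_CALL = re.compile(r"\b(memcpy|strcpy|strcat)\s*\(")
SIZE_CHECK = re.compile(r"\bif\s*\(.*\b(sizeof|len|size|count)\b")

def unguarded_copies(source: str, window: int = 3):
    """Flag risky copy calls with no size check in the preceding lines."""
    lines = source.splitlines()
    findings = []
    for i, line in enumerate(lines):
        if RISKY_CALL.search(line):
            context = lines[max(0, i - window):i]
            if not any(SIZE_CHECK.search(c) for c in context):
                findings.append((i + 1, line.strip()))
    return findings

c_code = """\
if (n > sizeof(dst))
    return -1;
memcpy(dst, src, n);    /* guarded: check two lines above */
stage_one(other);
stage_two(other);
other_len = n;
memcpy(other, src, n);  /* no check in sight */
"""
print(unguarded_copies(c_code))   # only the second memcpy is flagged
```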

4) Produce a minimal proof of failure

Reportedly, the workflow included crafting an input that triggers the edge case. You don’t need a Hollywood exploit to make this valuable. A reproducible crash with a clear call stack is often enough to confirm the bug is real and actionable.

That’s also where traditional fuzzing can struggle. Fuzzers are great at exploring input variations, but they can miss deep states or rare sequences unless the harness and coverage strategy are tuned for that specific code path.
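The "minimal" part of a minimal reproducer is usually mechanical: shrink the input while the failure persists. Here is a hedged sketch of that loop, with a stand-in `crashes` predicate where a real workflow would run the target (for example, a sanitizer-instrumented build of the parser) on each candidate:

```python
def minimise(data: bytes, crashes) -> bytes:
    """Greedy shrink: repeatedly drop chunks while the crash persists.

    `crashes` stands in for executing the real target on the candidate
    input and checking whether the failure still occurs.
    """
    chunk = max(1, len(data) // 2)
    while chunk > 0:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]
            if candidate and crashes(candidate):
                data = candidate      # keep the smaller reproducer
            else:
                i += chunk            # this chunk is needed; move on
        chunk //= 2
    return data

# Toy target: "crashes" whenever the input contains the byte pair b"XX".
smaller = minimise(b"aaaaXXbbbbcccc", lambda d: b"XX" in d)
print(smaller)   # prints b'XX'
```

A reduced input plus a clear call stack is exactly the artifact that turns "the model claims there is a bug" into something an engineer can verify in minutes.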

What I think is the real lesson for enterprise teams

I’m a published author and I’ve spent 20+ years in enterprise IT across Azure, Microsoft 365, AI, and cybersecurity. One pattern I keep running into is that organisations invest heavily in “finding known bad patterns,” but underinvest in “verifying the completeness of fixes.”

Git-history-driven auditing is basically a completeness check at scale.

Key point 1: A patch is not the end of the story

In mature environments, we like to believe that once a security fix lands, the problem is solved. In practice, patches are negotiated outcomes: they’re constrained by time, risk of regressions, and incomplete understanding of all code paths.

So the right question becomes: did we fix the class of bug, or just this instance?

Key point 2: “Duplicate logic” is a security smell

From an architecture standpoint, duplicated parsing or validation logic is a long-term liability. It creates parallel universes where one path gets hardened and another quietly stays weak.

If you’re running Essential Eight-aligned programs in Australia, consider where application control, patching, and configuration hardening interact with old components. Even with good controls, latent bugs in widely-used parsers can still matter when documents cross trust boundaries.

Key point 3: Treat AI as a reviewer of intent, not just code

The most useful way I’ve seen leaders adopt AI is not “replace engineers,” but “reduce blind spots.” AI can act like a tireless reviewer that asks: “If you believed this was the risk, where else did you need the same control?”

That’s closer to architecture governance than it is to code generation.

Key point 4: Your secure SDLC should include patch-diff audits

Most secure SDLC programs focus on:

  • Dependency scanning
  • SAST/DAST
  • Fuzzing (sometimes)
  • Threat modelling (occasionally)

A practical addition is a lightweight “patch-diff audit” step for high-risk libraries and parsers. The goal is not to relitigate every commit. It’s to ask whether security fixes were applied consistently across the codebase.

An anonymised scenario I have seen in the real world

A while back, I worked with a large organisation (anonymised) that had a “document in, PDF out” workflow. It looked safe on paper: sandboxing, antivirus, logging, tight egress controls, and patching SLAs.

But the pipeline had two separate parsing paths depending on where the document came from. One path had been hardened after an incident years earlier. The other path was considered “internal only” and had quietly drifted.

The security team had evidence of the earlier fix, but not a mechanism to ensure the hardening pattern was applied everywhere. That is exactly the kind of gap Git-history-driven review can uncover quickly.

Practical steps if you want to operationalise this idea

If I were advising a CIO or CTO on what to do next (without turning this into a product pitch), I’d focus on small, high-leverage moves.

  • Identify “parser-class” dependencies in your estate (PDF, image, archive, font, media codecs). These are high-risk because they process untrusted input.
  • Pick your top 5 based on reach (how many systems use them) and exposure (internet-facing, email-facing, partner-facing).
  • Run a patch-intent review: search Git history for security-related commits and ask whether the fix pattern appears everywhere it should.
  • Use AI carefully: constrain it to read-only repos, log tool actions, and require human validation for any claim of exploitability.
  • Feed the learning back into your engineering standards: reduce duplicate logic, centralise validation, and add regression tests that lock the fix in.
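The first two bullets amount to a simple prioritisation pass. A minimal sketch, where the component names, reach counts, and exposure scores are all invented for illustration:

```python
# Illustrative inventory: (component, systems_using_it, exposure 1-3).
# Exposure: 3 = internet/email-facing, 2 = partner-facing, 1 = internal only.
inventory = [
    ("pdf-renderer", 120, 3),
    ("image-codec",   80, 3),
    ("archive-lib",   45, 2),
    ("font-parser",   60, 1),
    ("media-codec",   15, 2),
]

def top_risks(components, n=5):
    """Rank parser-class dependencies by reach x exposure."""
    return sorted(components, key=lambda c: c[1] * c[2], reverse=True)[:n]

for name, reach, exposure in top_risks(inventory, n=3):
    print(f"{name}: score {reach * exposure}")
```

The scoring model matters less than having one: it forces the conversation about which parsers actually sit on trust boundaries.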

For Australian organisations, this complements (not replaces) baseline controls like Essential Eight. The controls reduce blast radius; this approach reduces the chance the vulnerability exists in the first place.

A forward-looking reflection

I don’t think the headline is “Claude found a Ghostscript overflow.” I think the headline is that we’re entering an era where the cheapest security wins will come from connecting context: code, commits, past incidents, and patch intent.

If AI can reliably find incomplete fixes by reading Git history, what other “we fixed it years ago” assumptions in our environments deserve a second look?
