In this post I’ll walk through what I actually did, what Claude Opus 4.6 flagged, and how I validated (and fixed) the findings without turning it into an AI-fuelled panic.
The title of this post is I Used Claude Opus 4.6 to Audit My Infrastructure and It Found More Than I Expected, and I’m not exaggerating. The surprising part wasn’t that it found issues. It was how quickly it surfaced patterns that normally take days of spreadsheet work, interviews, and “wait, where does that subnet route again?” conversations.
I’m a Solution Architect / Enterprise Architect based in Melbourne. I’ve spent 20+ years around Azure, Microsoft 365, identity, networks, and security programs that look tidy in PowerPoint and messy in production.
So I treated this like any other audit. AI doesn’t replace engineering judgement. But it can be an extremely sharp assistant if you feed it the right artefacts and force it to show its working.
What this is and what it isn’t, at a high level
At a high level, I used Claude Opus 4.6 as a review engine across my own infrastructure artefacts. Think: configuration exports, Terraform/Bicep snippets, policy definitions, firewall rules, identity settings, and a simplified architecture description.
This wasn’t a penetration test. Claude didn’t “hack” anything. It did something closer to what a strong security architect does in week one: read everything, ask annoying questions, and map what you say you intended against what your configs actually enforce.
The main technology behind it
Claude Opus 4.6 is a large language model that can reason across big piles of text. The practical advantage for infra reviews is that it can keep a lot of context in its head at once, then spot contradictions and gaps.
In plain language: it’s good at reading messy real-world inputs and turning them into a structured list of risks, with hypotheses you can confirm or reject. If you use it well, it becomes a tireless “second set of eyes” that never gets bored of rule lists and JSON.
My guardrails before I started
- Assume it can be wrong. Every finding needs validation.
- Don’t upload secrets. I redacted keys, tokens, passwords, and internal hostnames.
- Keep it reproducible. I used a consistent prompt structure so I could rerun after changes.
- Focus on control intent. I asked “what control is missing or misconfigured?”, not “is this scary?”
What I gave Claude
I didn’t dump an entire environment and hope for magic. I gave it a curated set of artefacts that represent how the system is meant to operate.
- Network topology summary (VNets, subnets, peering, on-prem connectivity assumptions)
- Firewall/NVA rule exports (redacted, but with ports, sources, destinations, and justifications)
- Identity posture notes (MFA approach, privileged access model, break-glass intent)
- Key policy snippets (Azure Policy initiatives, conditional access intent, baseline standards)
- Representative IaC modules (Terraform/Bicep), plus a “known exceptions” list
Then I asked it to do three passes: (1) quick triage, (2) deep dive by domain (identity, network, logging, data), and (3) fix-first prioritisation aligned to likely impact.
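To keep those passes reproducible (one of my guardrails above), I generated each pass from a single template rather than free-typing prompts. Here is a minimal sketch of that idea; the pass names and scaffold wording are my own choices, not anything the model requires.

```python
# Sketch: keep the three review passes reproducible by generating them from
# one template, so reruns after a fix are directly comparable.
# Pass names and the scaffold text are illustrative assumptions.

PASSES = {
    "triage": "Quick triage: list the most likely misconfigurations.",
    "deep-dive": "Deep dive by domain: identity, network, logging, data.",
    "prioritise": "Fix-first prioritisation aligned to likely impact.",
}

def build_prompt(pass_name, artefacts):
    """Same scaffold every run, so only the artefacts vary between reruns."""
    return (
        "You are acting as a senior cloud security architect.\n"
        f"Task: {PASSES[pass_name]}\n"
        f"Artefacts:\n{artefacts}"
    )

prompt = build_prompt("triage", "(redacted NSG export)")
print(prompt.splitlines()[1])  # the Task line is identical on every rerun
```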
Every vulnerability it found (grouped into what matters)
I’m sharing these as categories because the exact details are environment-specific. But the patterns are painfully common in Australia and internationally, across organisations of all sizes.
1) “Accidental public” exposure paths
Claude repeatedly flagged situations where something wasn’t “public” by intention, but could become reachable due to a chain of small decisions.
- Over-broad inbound rules justified as “temporary” and never revisited.
- Management ports exposed (SSH/RDP/WinRM) with IP allowlists that were too wide or poorly governed.
- Legacy load balancer rules still present after app migrations.
- DNS and routing ambiguity where a private endpoint existed, but clients still resolved to public in some cases.
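Several of these checks are mechanical enough to script before you even involve a model. Here is a minimal sketch of the kind of triage Claude was doing over my redacted rule exports; the rule schema is a simplified assumption, not a real NSG export format, so adapt the field names to whatever your tooling produces.

```python
# Sketch: triage a redacted firewall/NSG-style rule export for "accidental
# public" paths: inbound allows on management ports from wide sources.
# The rule schema here is hypothetical and simplified.

MGMT_PORTS = {22, 3389, 5985, 5986}  # SSH, RDP, WinRM (HTTP/HTTPS)

def flag_risky_inbound(rules):
    """Return rules that are inbound allows hitting a management port
    from an over-broad source."""
    findings = []
    for r in rules:
        if r["direction"] != "Inbound" or r["action"] != "Allow":
            continue
        wide_source = r["source"] in ("*", "0.0.0.0/0", "Internet")
        hits_mgmt = r["port"] == "*" or r["port"] in MGMT_PORTS
        if wide_source and hits_mgmt:
            findings.append(f'{r["name"]}: {r["source"]} -> port {r["port"]}')
    return findings

rules = [
    {"name": "allow-ssh-temp", "direction": "Inbound", "action": "Allow",
     "source": "0.0.0.0/0", "port": 22},
    {"name": "allow-app", "direction": "Inbound", "action": "Allow",
     "source": "10.0.1.0/24", "port": 443},
]
print(flag_risky_inbound(rules))  # the "temporary" SSH rule is the one to revisit
```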
In my experience, the worst incidents aren’t caused by one catastrophic config. They’re caused by three “small” configs that align into an exposure path no one modelled end-to-end.
2) Identity and privilege drift
This is where AI assistance shines, because privilege problems are rarely one setting. They’re a story spread across role assignments, group nesting, conditional access, service principals, and exceptions.
- Privileged roles assigned to groups with unclear membership governance.
- Service principals with broad permissions that made sense during build, but not after stabilisation.
- Break-glass accounts existing, but not tested, monitored, or protected the way the policy said.
- MFA inconsistencies between interactive users, admins, and automation identities.
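The service principal problem in particular is easy to make checkable: broad roles granted during build should be re-justified once the environment stabilises. A minimal sketch, assuming a simplified role-assignment export (the field names are mine, not Azure's):

```python
from datetime import date

# Sketch: surface service principals whose broad roles predate the point
# where the environment "stabilised". Field names are hypothetical; map
# them from your own role-assignment export.

BROAD_ROLES = {"Owner", "Contributor", "User Access Administrator"}

def stale_broad_assignments(assignments, stabilised_on):
    """Broad role assignments granted before the stabilisation date."""
    return [
        a for a in assignments
        if a["role"] in BROAD_ROLES and a["granted"] < stabilised_on
    ]

assignments = [
    {"principal": "sp-build-pipeline", "role": "Contributor", "granted": date(2023, 2, 1)},
    {"principal": "sp-backup", "role": "Reader", "granted": date(2023, 2, 1)},
]
for a in stale_broad_assignments(assignments, date(2024, 1, 1)):
    print(a["principal"], "still holds", a["role"])  # re-justify or remove
```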
For Australian organisations, this maps cleanly to Essential Eight thinking. Not because Essential Eight is “the answer”, but because it forces you to treat identity and admin control as first-class security work, not an IT footnote.
3) Logging that looks enabled but isn’t usable
Claude caught something I’ve seen a lot in Azure and Microsoft 365 environments: logs technically exist, but they’re not operationally useful.
- Diagnostic settings not applied consistently across resource types.
- Retention too short for real investigations, especially when you factor in detection latency.
- High-value logs missing (identity signals, admin actions, key vault access patterns).
- No “what good looks like” baseline, so the SOC (or whoever is on call) drowns in noise.
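The first two bullets reduce to a coverage-and-retention check you can run against any inventory. A minimal sketch, assuming simplified exports rather than a real Azure API response; the 90-day minimum is an assumption you should replace with whatever your incident-response plan actually needs:

```python
# Sketch: check diagnostic-settings coverage and retention against a minimum.
# Inputs are hypothetical simplified exports, not a real Azure API shape.

MIN_RETENTION_DAYS = 90  # assumption: pick the number your IR plan needs

def coverage_gaps(resources, diag_settings):
    """Return (resources with no diagnostics, resources with short retention)."""
    configured = {d["resource_id"]: d for d in diag_settings}
    missing = [r for r in resources if r not in configured]
    short = [rid for rid, d in configured.items()
             if d["retention_days"] < MIN_RETENTION_DAYS]
    return missing, short

resources = ["vm-app-01", "kv-prod", "sql-prod"]
diag = [{"resource_id": "vm-app-01", "retention_days": 30},
        {"resource_id": "kv-prod", "retention_days": 365}]
missing, short = coverage_gaps(resources, diag)
print("no diagnostics:", missing)      # sql-prod
print("retention too short:", short)   # vm-app-01
```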
The lesson isn’t “turn on all logs.” It’s “decide what you would do at 2am if something goes wrong, then ensure you have the evidence to do it.”
4) Encryption and secrets handling gaps
AI was particularly good at spotting where secret-handling intentions didn’t match the artefacts.
- Secrets referenced in pipeline variables where a managed identity pattern would reduce exposure.
- Key rotation assumptions stated in docs, but no automation or runbook evidence.
- Inconsistent TLS expectations between internal services, especially when “internal” spans multiple networks.
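The first bullet is the most scriptable: a variable whose name hints at a secret but whose value is a literal, rather than a reference, deserves a look. A minimal sketch over a pipeline-variable map; the `$(...)` reference convention and the name patterns are illustrative assumptions, not a complete detector.

```python
import re

# Sketch: flag pipeline variables that look like inline secrets rather than
# references resolved at runtime (e.g. key-vault-backed variable groups).
# The name patterns and the $(...) reference convention are assumptions.

SECRET_NAME = re.compile(r"(password|secret|token|key)", re.IGNORECASE)

def inline_secret_suspects(variables):
    """Names that hint at secrets but carry literal values."""
    suspects = []
    for name, value in variables.items():
        looks_secret = bool(SECRET_NAME.search(name))
        is_reference = value.startswith("$(") and value.endswith(")")
        if looks_secret and not is_reference:
            suspects.append(name)
    return suspects

variables = {
    "DB_PASSWORD": "hunter2-redacted",   # literal -> rotate, move to a vault
    "API_TOKEN": "$(kv-api-token)",      # reference -> resolved at runtime
    "BUILD_CONFIG": "Release",           # not secret-shaped, ignored
}
print(inline_secret_suspects(variables))
```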
What I liked here was the phrasing of risk: not “you’re doomed,” but “here’s the most likely operational failure mode.” That’s the kind of language executives actually understand.
5) Policy gaps and exception sprawl
Claude didn’t just flag missing policies. It flagged the exception story.
- Policies defined but not assigned at the right scope.
- Initiatives with too many exclusions to be meaningful as a baseline.
- Non-compliance accepted silently because nobody owned the remediation backlog.
In practice, “policy” becomes culture. If exceptions don’t expire, you don’t have a baseline. You have a set of suggestions.
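"Exceptions must expire" is easy to say and easy to enforce once it's checkable. A minimal sketch, assuming a hypothetical exception-register schema; the point is that both failure modes (no expiry at all, and expired-but-still-active) get surfaced on every run:

```python
from datetime import date

# Sketch: make "exceptions must expire" checkable. The register schema
# is a hypothetical simplification of whatever your exception list holds.

def exception_debt(exceptions, today):
    """Return (exceptions with no expiry, exceptions past their expiry)."""
    no_expiry = [e["id"] for e in exceptions if e.get("expires") is None]
    overdue = [e["id"] for e in exceptions
               if e.get("expires") is not None and e["expires"] < today]
    return no_expiry, overdue

exceptions = [
    {"id": "EXC-001", "reason": "legacy app migration", "expires": date(2024, 6, 30)},
    {"id": "EXC-002", "reason": "contractor access", "expires": None},
]
no_expiry, overdue = exception_debt(exceptions, date(2025, 1, 1))
print("never expires:", no_expiry)          # EXC-002
print("expired but still active:", overdue) # EXC-001
```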
An anonymised scenario from the audit
One pattern Claude highlighted was a classic “it’s fine, it’s internal” situation.
An internal API was reachable only from a private subnet. That sounded safe. But the subnet also hosted a jump box used for admin tasks, and the jump box had a wide inbound allowlist due to contractor access during a project phase.
Claude’s point wasn’t that jump boxes are evil. It was that I’d unintentionally created an access concentrator with two different trust models applied to the same place.
The fix wasn’t dramatic. It was three small steps: tighten inbound, move admin access to a more controlled pattern, and apply explicit segmentation between admin tooling and workloads.
How I validated the findings without getting tricked by AI confidence
This is the part people skip, and it’s where most AI-assisted audits go off the rails.
- Force the model to state assumptions. If it can’t, the finding is probably guesswork.
- Ask for the exact evidence line. “Which rule, which role, which setting?”
- Reproduce with native tools. Azure Resource Graph queries, portal checks, IaC plan output, M365 admin centre evidence.
- Convert to a control statement. “We will require X for Y,” not “Claude said this is bad.”
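One way to enforce that discipline is to refuse any finding that isn't fully filled in. A minimal sketch of that gate; the field names mirror the validation steps above but are my own naming, not a standard schema:

```python
from dataclasses import dataclass, asdict

# Sketch: force every AI finding into a structured record before it is
# accepted. Field names mirror the validation steps in this post and are
# my own choice, not a standard schema.

@dataclass
class Finding:
    control: str         # "We will require X for Y", not "Claude said this is bad"
    evidence: str        # the exact rule / role / setting the model pointed at
    validated_with: str  # the native tool you reproduced it with
    owner: str           # a named system owner with a date, not "security team"

def accept(finding: Finding) -> bool:
    """A finding is only accepted if every field is non-empty."""
    return all(asdict(finding).values())

f = Finding(
    control="We will require MFA for all privileged role activations",
    evidence="role assignment 'Global Admin' on group 'ops-admins'",
    validated_with="admin centre check + sign-in log export",
    owner="identity platform lead",
)
print(accept(f))  # True only once evidence and owner are filled in
```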
When something looked questionable, I asked Claude to propose a minimal-change remediation. Then I did the same myself. If the two approaches converged, it was usually a good sign.
A practical prompt structure you can reuse
I kept the prompts boring on purpose. Boring prompts are easier to rerun and compare.
You are acting as a senior cloud security architect.
Context:
- Environment: Azure + Microsoft 365
- Constraints: production, minimal disruption, Australian compliance context
- Artefacts: (paste redacted configs / summaries)
Task:
1) Identify potential vulnerabilities and misconfigurations.
2) For each, provide: Evidence, Risk, Likelihood, Impact, Suggested Fix, Validation Steps.
3) Call out any assumptions.
4) Prioritise top 10 fixes for risk reduction in 30 days.
Tone: practical and concise.
What I’d do differently next time
If I repeat this, I’ll spend more time on two things.
- Better inventory first. AI is only as good as the artefacts you feed it. Unknown resources become unknown risks.
- Attach owners to findings immediately. Not “security team,” but an actual system owner with a date.
I’d also explicitly map findings to Essential Eight maturity conversations. Not as a checkbox exercise, but as a shared language for prioritisation across IT and security.
The takeaway
Claude Opus 4.6 didn’t magically secure my infrastructure. What it did was compress the “reading and reasoning” part of the audit so I could spend my time where humans still win: validating reality, negotiating trade-offs, and making changes safely.
My working view is that AI-assisted audits will become normal in enterprise IT, the same way infrastructure-as-code and policy-as-code did. The winners won’t be the teams with the fanciest model. They’ll be the teams with the best artefacts, the clearest control intent, and the discipline to validate.
If you tried this in your own environment, what would you want the model to optimise for: fewer false positives, deeper architectural reasoning, or faster “fix-first” recommendations?