AI Research Watch — edition 2 (June 2026): agent safety becomes a CVE class

Five May 2026 developments that matter for builders: prompt injection ships as CVE, EU AI Act delays high-risk to 2027, agent SDKs consolidate, SWE-bench Verified loses credibility, Anthropic launches a partners-only model tier.

June 15, 2026 · 12 min · ai-research-watch

May 2026 was a month of wake-up calls: two on security, one regulatory, one on tooling, one on evals. Prompt injection stopped being a paper category and became a CVE class with host-level impact. The European Union pushed Annex III back but pulled transparency forward. The three major vendors each shipped an agent SDK. The most-cited coding benchmark lost its credibility in public. And Anthropic put a model behind a partners-only tier. This edition covers the five items with the usual filter: what changes the practice of the people building.

Edition 2 covers May 2026.

1 · Prompt injection became an RCE class — Microsoft disclosure (May 7)

What changed: Microsoft Security disclosed on May 7 a new vulnerability class in agent frameworks, with two concrete CVEs: CVE-2026-25592 and CVE-2026-26030, both in Semantic Kernel .NET SDK versions older than 1.71.0 (Microsoft Security Blog, May 7 2026). Content retrieved from an external document via RAG flows straight into a tool call and bypasses every guardrail running at the prompt level — the outcome is remote code execution on the agent host. The same quarter OWASP confirmed prompt injection as LLM01 for the third consecutive year, and the community reported equivalent vulns in Copilot Studio (CVE-2026-21520) and ModelScope’s ms-agent (CVE-2026-2256).

Why it matters for builders: until now, “prompt injection” was a blog-post discussion. In May 2026 it became an audit item your security team will ask about, with a CVE to cite and a patch to apply. Any agent that combines retrieval with tool calling is now inside the attack surface. The question is no longer “if” but “what’s your plan”.

Limitation: Microsoft’s patch closes the Semantic Kernel door, but the vulnerability class is architectural — it exists in CrewAI, LangGraph, AutoGen, and any home-grown stack that mixes the two ingredients. A single patch does not solve it.

Action: audit your stack today. Every tool call that follows retrieval needs three controls: an explicit argument allowlist, a sandbox for shell commands, and privilege limits on the agent process. If you have an agent in production and can’t confirm all three within 5 minutes, pause before your next deploy.
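
The argument allowlist named in the action above can be sketched in a few lines. This is a minimal illustration, not the API of Semantic Kernel or any other framework: the tool names, the `ALLOWED_TOOLS` table and `validate_tool_call` are all hypothetical, and the check only covers the allowlist, not the sandbox or the privilege limits.

```python
import re

# Hypothetical allowlist: tool name -> {argument name -> regex the value must fully match}.
ALLOWED_TOOLS = {
    "search_docs": {"query": re.compile(r"[\w\s\-?.,]{1,200}")},
    "read_file": {"path": re.compile(r"docs/[\w\-]+\.md")},
}


class ToolCallRejected(Exception):
    """Raised when a model-proposed tool call falls outside the allowlist."""


def validate_tool_call(tool: str, args: dict) -> dict:
    """Return args unchanged if the call is allowlisted, otherwise raise."""
    spec = ALLOWED_TOOLS.get(tool)
    if spec is None:
        raise ToolCallRejected(f"tool not allowlisted: {tool!r}")
    # Reject both missing and extra arguments, not just bad values.
    if set(args) != set(spec):
        raise ToolCallRejected(f"unexpected arguments for {tool!r}: {sorted(args)}")
    for name, pattern in spec.items():
        value = args[name]
        if not isinstance(value, str) or not pattern.fullmatch(value):
            raise ToolCallRejected(f"argument {name!r} rejected for {tool!r}")
    return args
```

The design choice is deny-by-default: a tool or argument that is not listed is rejected, so retrieved text that talks the model into calling `run_shell` or reading `../etc/passwd` fails before anything executes.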

2 · EU AI Act Omnibus pushes Annex III back, pulls transparency forward (May 7)

What changed: on the same May 7, EU legislative bodies reached political agreement on amendments to the AI Act — the “AI Act Omnibus” (Latham & Watkins, May 2026). Two effects for any non-EU company that sells into or operates inside the EU: Annex III (high-risk systems) was postponed from Aug 2, 2026 to Dec 2, 2027 — 16 extra months of runway. But Article 50 (transparency obligations, including the nudifier ban and mandatory marking of synthetic content) was pulled forward to Dec 2, 2026, shortening the runway for anyone generating image, voice or video.

Why it matters for builders: there are now two distinct calendars. If you’re in high-risk Annex III territory (HR, scoring, biometrics, hiring), you can breathe until Dec 2027, but start documenting your audit trail now, not in the last month. If you generate synthetic media for any purpose (marketing, education, internal comms), you have under 6 months to solve watermarking and mandatory disclosure.

Limitation: the Omnibus still has to clear the European Parliament and Council with a final text. The political timeline can slip. What’s safe: Annex III does not go live in Aug 2026; transparency hardens in Dec 2026.

Action: review your internal AI Act risk-classification timeline this month. US/LATAM/APAC companies serving the EU need to map which systems fall under Annex III versus Article 50, because the deadlines diverged. For synthetic content, decide now whether you’ll use C2PA, your own watermark, or both, and who owns that pipeline.
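
The two diverging calendars are easy to track with a little date arithmetic. The sketch below assumes the dates reported for the political agreement survive into the final text; the system names and the regime labels are hypothetical placeholders for your own inventory.

```python
from datetime import date

# Dates as reported for the political agreement; the final text may still move them.
DEADLINES = {
    "annex_iii": date(2027, 12, 2),   # high-risk systems
    "article_50": date(2026, 12, 2),  # transparency and synthetic-content marking
}

# Hypothetical inventory: system name -> regime it falls under.
SYSTEMS = {
    "cv-screening-model": "annex_iii",
    "marketing-video-generator": "article_50",
}


def days_remaining(regime: str, today: date) -> int:
    """Days from `today` until the deadline for the given regime."""
    return (DEADLINES[regime] - today).days


def runway_report(today: date) -> list:
    """(system, regime, days left) tuples, soonest deadline first."""
    rows = [(name, regime, days_remaining(regime, today)) for name, regime in SYSTEMS.items()]
    return sorted(rows, key=lambda row: row[2])
```

Run from this edition's publication date, `runway_report(date(2026, 6, 15))` lists the Article 50 system first with 170 days left, against 535 days for the Annex III system.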

3 · Agent SDKs consolidated — 3 major vendors + interop (Mar–May)

What changed: between March and May 2026 the three major vendors closed their agent SDK stories. OpenAI shipped the Agents SDK in March, Google introduced ADK in April, and Anthropic released its Agent SDK alongside Claude 4.6 (gurusup, May 2026). Plus: MCP became the de facto standard for tool exposure and A2A (Agent-to-Agent) emerged as a standard for multi-agent communication. In open-source, LangGraph passed CrewAI on GitHub stars in Q1 — CrewAI became the “easy for business workflow” reference, LangGraph the “granular execution control with checkpointing and human-in-the-loop” option.

Why it matters for builders: a greenfield project now has 5 serious paths (3 vendor SDKs + 2 open-source frameworks) and two interop standards (MCP + A2A). The choice you make in the next 8 weeks locks in your architecture for 18 months; changing later is expensive because each SDK has its own state model, its own tool API and its own session lifecycle.

Limitation: each vendor SDK still pulls toward its own vendor’s models. OpenAI Agents SDK is optimized for GPT, Google ADK for Gemini, Anthropic Agent SDK for Claude. Multi-model is possible but adds friction. MCP and A2A reduce that friction but don’t eliminate it.

Action: if you have more than one model in production, keep CrewAI or LangGraph as the orchestration layer and MCP as the tool transport. If you’re single-vendor by contract (Anthropic Partner, Microsoft Copilot), accept the vendor SDK gravity in exchange for better latency and features. This is a long-horizon decision worth an afternoon of planning.

4 · SWE-bench Verified lost credibility — first month with no vendor reporting (May)

What changed: on Feb 23, 2026, OpenAI’s Frontier Evals team stopped reporting scores on SWE-bench Verified (OpenAI, Feb 2026). Reason: internal audit of 138 problems that o3 didn’t consistently solve over 64 independent runs showed 59.4% of cases had flaws in the test or the description. Worse: GPT-5.2, Claude Opus 4.5 and Gemini 3 Flash Preview could reproduce the gold patches verbatim with only the task ID as a prompt — a clear sign of training contamination. May 2026 was the first full month in which no major vendor reported Verified — all migrated to SWE-bench Pro (held-out + GPL-licensed to resist contamination).

Why it matters for builders: if you used Verified to pick a coding-agent vendor or to defend a model choice internally, those scores have aged badly. The 27-point gap between the top of Verified (81%) and the top of Pro (54%) is the measure of how much signal the benchmark lost. More generally, any benchmark that pulls tasks from public open-source repos post-Jun 2024 carries a high contamination risk for frontier models.
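
The two headline figures in this item are quick to sanity-check. The snippet below only redoes the arithmetic on numbers quoted in this edition; the variable names are illustrative.

```python
# Figures quoted in this item.
audited_problems = 138   # problems in the internal audit
flawed_share = 0.594     # share with a flawed test or description
top_verified = 81        # best reported SWE-bench Verified score, percent
top_pro = 54             # best reported SWE-bench Pro score, percent

# Number of audited problems with flaws, rounded to a whole problem.
flawed_problems = round(audited_problems * flawed_share)

# Gap between the two leaderboards, in percentage points.
gap_points = top_verified - top_pro
```

So the audit flagged about 82 of the 138 problems, and the best Pro score sits 27 points below the best Verified score.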

Limitation: SWE-bench Pro is not perfect either — GPL discourages commercial training but does not block it; the held-out set will leak over time. The “new benchmark → contamination → new benchmark” cycle is structural.

Action: stop quoting SWE-bench Verified in pitches or internal decisions. Migrate to SWE-bench Pro or set up an internal eval on private code (your own repo, real problems, gold patches that nobody trained on). For your own evals: rule of thumb is “if the problem exists in a public commit older than 18 months, assume it’s contaminated”.
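
The 18-month rule of thumb can be turned into a filter for an internal eval set. This is a minimal sketch under one stated assumption: "older than 18 months" is read as "at least 18 whole calendar months have passed since the first public commit". The function names are hypothetical.

```python
from datetime import date
from typing import Optional

CONTAMINATION_AGE_MONTHS = 18  # the rule of thumb quoted above


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the current month is not complete yet
    return months


def assume_contaminated(first_public_commit: Optional[date], today: date) -> bool:
    """True once the problem has been public for 18 whole months or more.

    `None` means the problem never appeared in a public commit.
    """
    if first_public_commit is None:
        return False
    return months_between(first_public_commit, today) >= CONTAMINATION_AGE_MONTHS
```

A problem first committed publicly on Dec 15, 2024 is flagged on June 15, 2026; one from Jan 15, 2025 is not yet. Passing the filter is not proof of cleanliness, only the absence of the most obvious exposure.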

5 · Anthropic Project Glasswing — a partners-only model tier (May 12)

What changed: Anthropic launched Project Glasswing on May 12 — access to Claude Mythos Preview restricted to 12 launch partners (AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks + Anthropic) and extended to 40+ additional orgs that maintain critical infrastructure (Anthropic, May 2026). Mythos Preview was used to find thousands of critical zero-days in operating systems and browsers, including a 27-year-old vuln in OpenBSD. Pricing: $25/$125 per million input/output tokens — premium. Anthropic explicitly states no plans for general availability of this model: the stated goal is to “enable safe deployment of Mythos-class models at scale” later.
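
To put the quoted pricing in perspective, the cost of a run follows directly from the two per-million-token rates. The sketch below uses only those rates; the function name and the example workload are hypothetical.

```python
INPUT_USD_PER_MTOK = 25.0    # quoted price per million input tokens
OUTPUT_USD_PER_MTOK = 125.0  # quoted price per million output tokens


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run at the quoted Mythos Preview rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000
```

For an illustrative audit run that reads 2 million tokens of source and writes 200,000 tokens of findings, `run_cost_usd(2_000_000, 200_000)` comes to 75.0 USD.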

Why it matters for builders: Glasswing is the first concrete example of a model tier gated by trust rather than by capability or price. The pattern: a capability that is dangerous enough (industrial-scale vuln hunting) goes only to partners who pass due diligence. If OpenAI or Google replicate this in the next 12–18 months (Anthropic estimates 6–18 months for the capability to proliferate), you’ll have a two-tier model market: GA for everyone, plus a restricted tier for critical infrastructure. Anyone in the software supply chain will want to be on the defender side before equivalent tooling shows up underground.

Limitation: you can’t test Mythos today. Public evaluations will be indirect, through the disclosures partners make (we already have the 27-year-old OpenBSD vuln; more will come). For a builder outside the 52+ orgs, the item is strategic, not tactical.

Action: if your company maintains critical software (OS, browser, runtime, banking core, government), apply to the program. For everyone else: publicly document your stance on responsible vuln disclosure now — in 12–18 months, similar tools will be on the market, and whoever doesn’t have a process becomes a target.

Theme of the month

May 2026 hardened the production agent. Two of the five items are pure security (prompt injection as CVE class, Glasswing as a new defensive structure); one is regulatory (AI Act Omnibus); one is tooling (SDKs); one is eval hygiene (SWE-bench Pro). The common denominator: a builder who was comfortable in April running an “experimental” agent had to rethink posture in May. Expect more of this — not less.

Column cadence

Monthly. Always day 15. Covers the prior month. Curation biased toward “changes the practice of the people building”.

Filters applied to this edition:

Replication or public implementation exists?
Effect is robust, not optimistic demo?
Concrete builder implication?
Industry-shifting, not incremental?

Where to go deeper

The AI Research Watch cluster accumulates the monthly series.
For the 2 security items (prompt injection, Glasswing), the Agent Safety cluster has technical grounding.
For the SDK item, Multi-agent Orchestration Patterns covers the framework tradeoffs.
For the regulatory item, the piece on Brazil’s LGPD as a parallel to GDPR offers a useful benchmark for non-EU compliance posture.

Back to edition 1 for the column history.