AI · Agent Readiness · AGENTS.md · Engineering

Is My Code Repo AI Ready?

AGENTS.md, CLAUDE.md, skills, workflows, OpenSpec — the conventions stack keeps growing, but how do you actually know your repo is legible to coding agents? A look at the agent-readiness tooling, what the ETH Zurich research really says, and the minimum path to a sensible maturity level for a small team.

Alejandro Tamayo · May 9, 2026 · 10 min read

You ask Claude to add a feature. The output looks right. Clean code, correct imports, confident comments. Then you read it carefully and realize the agent invented an internal convention that does not exist, ignored the one that does, and named the function in a style you banned six months ago in a PR nobody linked to the AI. The agent was not wrong. It was working from what it could see.

That is the honest question behind "is my repo AI ready?": not whether you have installed the latest model, but whether an agent landing in your codebase for the first time can actually understand how things work. AGENTS.md, CLAUDE.md, DESIGN.md, skills, workflows, postmortems, OpenSpec — the conventions stack keeps growing, and growing, and it is nearly impossible to know from the inside whether the signal you are emitting is the signal the agent is receiving.

How AI reads a repository

When you ask an agent to do something, it does not load your entire repo into its head. It cannot. Anthropic's Effective context engineering for AI agents puts the constraint plainly: a model has a finite attention budget, and "every new token introduced depletes this budget by some amount." A 200K-token context window is not a guarantee that the model is paying equal attention to every token in it. It is paying progressively less attention as the budget fills with things that are not load-bearing for the current task.

So agents work the way a careful new contractor works: they look for an index, then follow it. The first place they look is the agreed-upon entry file. AGENTS.md is now an open standard — a single plain-markdown file supported by OpenAI Codex, Cursor, Amp, Google Jules, and Factory. GitHub Copilot took a parallel route with .github/copilot-instructions.md. The interface is simple: one file, plain text, tool-agnostic. The agent reads it before it reads your code, and uses it to decide what else to load.

This is also where most teams get the design wrong. They turn AGENTS.md into a 500-line spec document — testing rules, deployment notes, architecture overview, naming conventions, security policy, glossary, all in one file. The agent dutifully reads it. By the time it reaches line 300, the rule on line 50 is fading. Anthropic's guidance is direct: find "the smallest possible set of high-signal tokens" the model needs to do the work.

The trap is mistaking file existence for readiness. ETH Zurich's AGENTbench study ran 138 Python tasks across 12 real repositories and found that LLM-generated context files made things worse in 5 of 8 settings. They added 2.45–3.92 extra reasoning steps per task and pushed inference costs up 20–23%. The culprit was not the idea of a context file. It was the content: verbose, auto-generated, filled with things the agent could have inferred on its own. A healthcheck that only asks "does the file exist?" is testing the wrong thing. The question is whether the file contains signal.

AGENTS.md as an index, not a wall

The pattern that actually works mirrors how the model wants to operate: just-in-time loading. Anthropic describes the technique directly — agents "maintain lightweight identifiers... and use these references to dynamically load data into context at runtime." Translated to a code repo: keep AGENTS.md as a small index, and let it point to topic files the agent pulls in only when the task actually needs them.

AGENTS.md · index, ~50 lines
  • Build / test / lint commands
  • Rules that apply to every change
  • Links to the topic docs below

Topic files: docs/testing.md, docs/api.md, docs/architecture.md
  • Top 200 lines — high signal
  • Rest of file — loaded only if relevant

The agent loads the index first; topic files are pulled in only when the task touches them.

A concrete shape that scales to real codebases:

  • AGENTS.md — 50 to 100 lines. Project name, install / test / lint commands, the two or three rules that apply to every change, and a list of links to the topic docs below. Nothing else.
  • docs/testing.md — How tests are organized in this repo, what to mock, what the CI runner expects. Loaded when the agent edits or adds tests.
  • docs/api.md — Endpoint conventions, error shapes, auth patterns. Loaded when the change touches API code.
  • docs/architecture.md — Module boundaries and dependency rules. Loaded for refactors.

Two practical rules make this pattern work. First, the most important content goes at the top of every file the agent will read — roughly the first 200 lines. That is where the model's attention is sharpest, and it is also where Claude Code's CLAUDE.md line limit applies. Anything past that limit is invisible. Second, the index links to specific files, not to generic folders — "see docs/testing.md before changing tests" is a signal the agent can act on; "documentation lives in docs/" is not.

When the agent decides it is editing tests, it follows the link and loads docs/testing.md. When it is shipping an API change, it loads docs/api.md. The 200 lines that matter are in context. The 1,800 that do not are left on disk. The attention budget stays focused on the work.
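
For illustration, a minimal index might look something like the sketch below. The commands, paths, and rules are placeholders for whatever your repo actually uses; the shape is the point, not the specifics.

```markdown
# AGENTS.md

## Commands
- Install: `pnpm install`
- Test: `pnpm test`
- Lint: `pnpm lint`

## Rules for every change
- TypeScript strict everywhere; no `any` without a comment explaining why.
- Never edit generated files under `src/generated/`.

## Gotchas
- Do not call the legacy auth helper; use `src/auth/session.ts` instead.
- Tests in `/integration` need Docker running.

## Topic docs (load only when the task touches them)
- Editing or adding tests: `docs/testing.md`
- API changes: `docs/api.md`
- Refactors across module boundaries: `docs/architecture.md`
```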

One Claude-specific footnote: Claude Code looks for CLAUDE.md specifically, not AGENTS.md, so the file still has to exist in repos where Claude is part of the workflow. The pragmatic fix is a symlink — ln -s AGENTS.md CLAUDE.md — so Claude and every other tool read the same single source of truth instead of two files that quietly drift apart. Treat it as a tooling quirk, not a reason to maintain two indexes.
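
If you take the symlink route, it is a one-time command, and the symlink can be committed so every clone gets the same setup:

```bash
# Run once from the repo root: CLAUDE.md becomes an alias for AGENTS.md
ln -s AGENTS.md CLAUDE.md
git add CLAUDE.md
git commit -m "Point Claude Code at AGENTS.md"
```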

Why more is worse — context rot

The ETH Zurich finding has a structural explanation. Chroma's Context Rot research tested 18 frontier models and found that accuracy degrades by more than 30% for content sitting in the middle of a long context — even within a 200K token window. A bloated AGENTS.md buries the rules that matter under the rules that were easy to write. The agent reads the file, surfaces a confident answer, and the relevant instruction was on page three of a document that should have been half a page.

Anthropic's own Claude Code best practices put it directly: concise, focused context outperforms exhaustive context. HumanLayer's guide on writing a good CLAUDE.md reaches the same conclusion from field experience. The failure mode is not a missing file. The failure mode is a long one.

The question is not "do I have an AGENTS.md?" The question is "would a sharp contractor, landing cold, act differently after reading it?"

How to score your repo — automated readiness tools

Factory.ai introduced Agent Readiness as a framework for scoring repos against five maturity levels across nine pillars: style, build, testing, documentation, dev environment, code quality, observability, security, and AI tooling. Microsoft followed with microsoft/agentrc, a CLI and VS Code extension with the same five-level, nine-pillar shape. kodustech/agent-readiness is the open-source alternative that does the same scan without the commercial layer.

  • L1 Discoverable: Agent can navigate the file structure but has no context on conventions, commands, or constraints.
  • L2 Structured: Consistent style, linter and formatter configured, basic README present.
  • L3 Guided (target): Terse AGENTS.md, working tests, CI green, reproducible dev environment. Most teams stop here.
  • L4 Observable: Structured logging, metrics, and feedback loops — agent output is verifiable against real signals.
  • L5 Autonomous: Full coverage, security hardened, agent operates end-to-end without hand-holding.

Agent readiness maturity levels — L3 is the common target for most small teams

The logic behind all of them is identical: scan the repo, score each pillar, produce a maturity level. L1 is a repo the agent can barely navigate. L3 is the common target — a repo where an agent can be productive without hand-holding. L5 is full autonomous operation with observability and security hardened. Most small teams are between L1 and L2 and do not know it.

These tools are useful as checklists. They are not a substitute for actually reading the file an agent consumes. A repo can pass every automated check and still have an AGENTS.md that contradicts the codebase because nobody updated it after the last refactor. The score is a signal, not a guarantee.

One practical constraint worth knowing: Claude Code reads CLAUDE.md up to a line limit. If your file is long enough to hit that limit, the instructions at the bottom are invisible. Context rot applies to your own config files, not just your application context.

How to reach L3

For a healthy codebase, L3 is a few weeks of focused work. For a legacy repo with weak tests and no docs, it is months — not because the list is long, but because each item compounds with the others. The path is short to describe and honest to walk:

  • A linter and formatter that run automatically. Not configured-but-ignored. Running in CI, blocking merge on failure. This alone removes an entire category of noise from agent output (see the workflow sketch after this list).
  • Type checking or static analysis. TypeScript strict, mypy, equivalent. Lint catches style; types catch a category of bugs the agent will otherwise ship with confidence.
  • A test suite with meaningful coverage. Not 100%. Meaningful. Tests that exercise the happy path and the obvious edge cases give an agent a signal it can read: did the change break something real?
  • Reproducible installs. A lockfile and a pinned runtime version. "Works on my machine" undoes every other guarantee.
  • A reproducible dev environment. A devcontainer.json, a Makefile, a working docker-compose — pick one. The agent needs to be able to run the code to verify its output.
  • A README that explains how to run things. One page. Clone, install, run. If a new contributor cannot be productive in an hour, the agent cannot either.
  • A .env.example or equivalent. The agent needs to know what env vars exist without seeing your secrets. Missing this is one of the most common silent failure modes.
  • Commit and PR conventions written down. Title format, squash policy, who reviews. Agents that open PRs need this, or they ship code in a shape humans will bounce back.
  • An AGENTS.md that contains only non-inferable information. Custom build commands, in-house naming conventions, constraints the agent cannot derive from the code. If the agent could figure it out by reading the codebase, it does not belong in the file.
  • A "what not to do" list inside AGENTS.md. "Do not call the legacy auth helper." "This folder is being deleted, do not extend it." "Tests in /integration need Docker running." Positive instructions miss the traps; the gotcha list is often the highest-signal section in the file.
  • A CI signal the agent can read. Pass or fail, not "the build passed but there are 47 warnings you should probably look at."
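
As a sketch of how the lint, type-check, and test items translate into a CI signal, here is a minimal GitHub Actions workflow for a hypothetical TypeScript project. The pnpm script names and the .nvmrc file are assumptions; substitute whatever your repo actually runs.

```yaml
# .github/workflows/ci.yml (hypothetical project; script names are placeholders)
name: ci
on:
  pull_request:
  push:
    branches: [main]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: .nvmrc          # pinned runtime version
      - run: corepack enable                 # makes the pinned pnpm available
      - run: pnpm install --frozen-lockfile  # reproducible installs from the lockfile
      - run: pnpm lint                       # style noise never reaches review
      - run: pnpm typecheck                  # the bug class an agent ships with confidence
      - run: pnpm test                       # the signal the agent can read: pass or fail
```

Actually blocking merge on failure is then a branch-protection rule that marks this check as required; the workflow alone only reports the result.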

The ETH Zurich and Chroma findings run through every one of these items: terse, human-curated, signal-rich beats verbose and comprehensive every time.

IBERANT recommendations

These are the things worth doing before you stress about maturity levels.

Write one terse AGENTS.md. One page maximum. If you cannot fit the non-obvious rules on one page, the problem is not the page limit. It is that you have too many non-obvious rules.

Only write what the agent cannot infer from the code. Your naming convention is in every file. Your CI pipeline is in .github/workflows. Write the things that are not already visible.

Treat context files as code. They go through pull request review. Dead lines get deleted. A stale instruction is actively harmful — the agent follows it and produces something wrong that looks right.

Make the commands obvious. AGENTS.md should tell the agent how to install, test, and lint in three lines. Not a tutorial. Three lines.

Pick L3 as your target, not L5. L5 is a reasonable goal for a mature platform team with dedicated tooling. For a team of three to ten, L3 is where you get real productivity gains without the overhead of full autonomous operation.

Re-read AGENTS.md every quarter. Not every week. Every quarter. After a major refactor, immediately. Stale instructions are worse than missing ones because the agent reads them with confidence.

Measure on real PRs, not on file existence. Does the agent's first attempt need fewer corrections this month than last month? That is the only metric that matters.

Conclusion

An up-to-date, terse AGENTS.md is worth more than a perfect score on any readiness checker. The real test behind "is my repo AI ready?" is older than any of these tools: can a new contributor — human or agent — be productive within a day? If the answer is no, the problem is not which model you are using.

Start with one page. Keep it honest. Delete what rots. The repo that is hardest for a new engineer to read is the one that will confuse every agent you throw at it.

If you want to make your codebase genuinely legible to AI agents — and to the humans working alongside them — let us talk.