LLM Coding Agents Buckle Under Structural Constraints in Backend Tasks

A new arXiv study introduces “constraint decay” to describe how LLM coding agents lose accuracy as structural requirements pile up. Across 100 backend generation tasks built against a fixed API contract and spanning eight web frameworks, strong agent configurations lost roughly 30 percentage points of assertion pass rate between loosely specified baselines and fully specified tasks; weaker setups collapsed toward zero. Existing benchmarks miss this because they reward functionally correct output without checking whether it respects architectural patterns, database layouts, or ORM conventions.
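
To make the arithmetic behind a "points" drop concrete, here is a minimal sketch of one way to read the metric: the assertion pass rate on a loosely specified baseline minus the pass rate on the fully constrained version of the same tasks. The function names and the counts in the usage lines are illustrative placeholders, not figures or definitions taken from the paper.

```python
def pass_rate(passed: int, total: int) -> float:
    """Assertion pass rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def decay_points(baseline: tuple[int, int], constrained: tuple[int, int]) -> float:
    """Drop in pass rate, in percentage points, from baseline to constrained.

    Each argument is a (passed, total) pair of assertion counts.
    This is an assumed reading of "constraint decay", not the paper's formula.
    """
    return pass_rate(*baseline) - pass_rate(*constrained)


# Placeholder counts chosen only to illustrate the calculation.
drop = decay_points(baseline=(80, 100), constrained=(50, 100))
print(f"{drop:.1f} percentage points")
```

Reporting the drop in percentage points rather than as a relative change keeps strong and weak configurations comparable on the same scale.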

Framework choice matters as much as model capability. Agents handle minimal, explicit stacks like Flask reasonably well but struggle in convention-heavy environments such as FastAPI and Django, where implicit framework rules compound the constraint load. Error analysis pins the bulk of failures on the data layer: malformed queries, broken ORM usage, and runtime violations against the persistence model.

The upshot for teams shipping agent-generated backend code is that passing end-to-end tests is not the same as producing maintainable, idiomatic code that fits a real codebase. Joint satisfaction of functional and structural requirements remains unsolved, and the paper argues evaluation suites need static verifiers alongside behavioral tests to catch the difference.
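
As a rough illustration of what a static verifier could look like next to behavioral tests, the sketch below uses Python's standard `ast` module to check one structural rule: generated code must go through an ORM rather than a raw database driver. The specific rule sets (`FORBIDDEN_IMPORTS`, `REQUIRED_IMPORTS`) and the sample snippet are assumptions made for this example; the paper's actual verifiers are not described here and may work quite differently.

```python
import ast

# Hypothetical structural rules for one task; not taken from the paper.
FORBIDDEN_IMPORTS = {"sqlite3"}    # no direct use of a database driver
REQUIRED_IMPORTS = {"sqlalchemy"}  # persistence must go through the ORM


def imported_modules(source: str) -> set[str]:
    """Return the top-level names of absolute imports found in the source."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def structural_violations(source: str) -> list[str]:
    """List rule violations without executing the generated code."""
    found = imported_modules(source)
    violations = [f"forbidden import: {m}" for m in sorted(found & FORBIDDEN_IMPORTS)]
    violations += [f"missing required import: {m}" for m in sorted(REQUIRED_IMPORTS - found)]
    return violations


# A snippet that could pass an end-to-end test yet break the ORM constraint.
GENERATED = '''
import sqlite3

def list_users():
    conn = sqlite3.connect("app.db")
    return conn.execute("SELECT * FROM users").fetchall()
'''

print(structural_violations(GENERATED))
# ['forbidden import: sqlite3', 'missing required import: sqlalchemy']
```

A check like this only inspects imports, so it catches a narrow class of violations; a fuller verifier would presumably also examine schema definitions, layering, and query construction.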