
    ProgramBench: Benchmarking LLM Capabilities in Whole-Repo Code Generation

    llm benchmarking · code generation · software engineering · artificial intelligence · programbench
    May 5, 2026

    ProgramBench is a benchmarking framework designed to evaluate the ability of Large Language Models (LLMs) to write complete, functional software from scratch. Introduced by John Yang, it challenges models to design, build, and deploy programs using only an executable as a reference, with no starter code and no internet access.

    Core Framework: The benchmark comprises 200 distinct whole-repository generation tasks that assess an LLM's capacity for end-to-end software development. By requiring models to work without external dependencies, it tests their architectural planning, synthesis, and reasoning under realistic development constraints.

    Target Applications: The tasks include complex projects such as SQLite, FFmpeg, and the PHP compiler, forcing models to reproduce sophisticated software logic rather than isolated snippets.

    Key Takeaways: The benchmark marks a shift from snippet-level generation to full-scale software engineering evaluation. It serves as a stress test of how LLMs handle complex, end-to-end coding requirements, helping researchers gauge the limitations and progress of AI in software development.
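    The evaluation protocol implied here — build the model's generated repository, then compare its observable behavior against the reference executable — can be pictured as a differential-testing harness. The sketch below is only illustrative, not the actual ProgramBench harness: the binary paths, test cases, and the `matches_reference` helper are all hypothetical assumptions.

```python
import subprocess

def run(binary: str, args: list[str], stdin_data: bytes = b"") -> tuple[int, bytes]:
    """Run a binary with the given args/stdin and capture (exit code, stdout)."""
    proc = subprocess.run(
        [binary, *args],
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=30,
    )
    return proc.returncode, proc.stdout

def matches_reference(reference_bin: str, candidate_bin: str, test_cases) -> float:
    """Fraction of test cases where the candidate's observable behavior
    (exit code + stdout) matches the reference executable's."""
    passed = 0
    for args, stdin_data in test_cases:
        try:
            if run(candidate_bin, args, stdin_data) == run(reference_bin, args, stdin_data):
                passed += 1
        except (subprocess.TimeoutExpired, OSError):
            pass  # a crash, hang, or missing binary counts as a failure
    return passed / len(test_cases)

# Hypothetical example: compare a model-built SQLite clone against the real
# sqlite3 binary on a few SQL snippets fed over stdin. Paths are illustrative.
cases = [
    ([":memory:"], b"SELECT 1+1;"),
    ([":memory:"], b"CREATE TABLE t(x); INSERT INTO t VALUES(42); SELECT x FROM t;"),
]
score = matches_reference("/usr/bin/sqlite3", "./candidate/sqlite3", cases)
print(f"behavioral match rate: {score:.0%}")
```

    Comparing only exit codes and stdout is the weakest useful equivalence check; a production-grade harness would presumably also diff generated files and stderr and enforce resource limits, but the core idea — the reference executable is the specification — stays the same.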
