
    ProgramBench: Benchmarking LLM Capabilities in Whole-Repo Code Generation

    llm benchmarking · code generation · software engineering · artificial intelligence · programbench
    May 5, 2026

    ProgramBench is a benchmarking framework designed to evaluate the ability of Large Language Models (LLMs) to write complete, functional software from scratch. Introduced by John Yang, it challenges models to design, build, and deploy programs using only an executable as a reference, with no starter code and no internet access.

    Core Framework: The benchmark comprises 200 distinct whole-repository generation tasks that assess an LLM's capacity for end-to-end software development. By requiring models to work without external dependencies, it tests their architectural planning, synthesis, and reasoning under realistic development constraints.

    Target Applications: The tasks include complex projects such as SQLite, FFmpeg, and the PHP compiler, forcing models to reproduce sophisticated software logic rather than isolated snippets.

    Key Takeaways: The benchmark marks a shift from snippet-level generation to full-scale software engineering evaluation. It serves as a stress test of how LLMs handle complex, end-to-end coding requirements, helping researchers gauge the limitations and progress of AI in software development.
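    The evaluation protocol implied here — build the model's generated repository, then compare its observable behavior against the reference executable — can be pictured as a differential-testing harness. The sketch below is only illustrative, not the actual ProgramBench harness: the binary paths, test cases, and the `matches_reference` helper are all hypothetical assumptions.

```python
import subprocess

def run(binary: str, args: list[str], stdin_data: bytes = b"") -> tuple[int, bytes]:
    """Run a binary with the given args/stdin and capture (exit code, stdout)."""
    proc = subprocess.run(
        [binary, *args],
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=30,
    )
    return proc.returncode, proc.stdout

def matches_reference(reference_bin: str, candidate_bin: str, test_cases) -> float:
    """Fraction of test cases where the candidate's observable behavior
    (exit code + stdout) matches the reference executable's."""
    passed = 0
    for args, stdin_data in test_cases:
        try:
            if run(candidate_bin, args, stdin_data) == run(reference_bin, args, stdin_data):
                passed += 1
        except (subprocess.TimeoutExpired, OSError):
            pass  # a crash, hang, or missing binary counts as a failure
    return passed / len(test_cases)

# Hypothetical example: compare a model-built SQLite clone against the real
# sqlite3 binary on a few SQL snippets fed over stdin. Paths are illustrative.
cases = [
    ([":memory:"], b"SELECT 1+1;"),
    ([":memory:"], b"CREATE TABLE t(x); INSERT INTO t VALUES(42); SELECT x FROM t;"),
]
score = matches_reference("/usr/bin/sqlite3", "./candidate/sqlite3", cases)
print(f"behavioral match rate: {score:.0%}")
```

    Comparing only exit codes and stdout is the weakest useful equivalence check; a production-grade harness would presumably also diff generated files and stderr and enforce resource limits, but the core idea — the reference executable is the specification — stays the same.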
