
Epoch & METR Unveil 19-Day MirrorCode AI Test – How It Works


Artificial intelligence keeps getting better at writing code, but most coding benchmarks focus on fairly small jobs like writing a single function, fixing a bug, or finishing off a code snippet. Those tests do a decent job of comparing different AI models, but they do not really tell us whether an AI system can handle the bigger, long-term task of putting together an entire software application from scratch.

That is why Epoch and METR came up with MirrorCode, a new benchmark built to see how well AI performs on big, complex software engineering projects. MirrorCode is not just interested in tweaking an existing codebase. Instead, it pushes models to recreate a full command-line program without ever seeing the original source code. It is meant to check a wider range of engineering skills: the model has to figure out how the program works, infer its behaviour, and build something that works the same way, all with limited information.

How Does MirrorCode Work?

MirrorCode's setup is simple to describe but tough to solve. The AI models do not get the source code. All they get is the software's documentation and the ability to run the original program's executable. This allows the model to run the software and observe its outputs but prevents it from inspecting how the program was originally written.

With only the documentation to explain what the program should do, and execute-only access for experimenting and observing, the model has to rebuild the application step by step, without any shortcuts. Once the AI finishes its version, it is run against a full set of public and hidden tests. These check whether the new program behaves like the original one across a wide range of scenarios.
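The comparison step described above can be pictured as a differential test: run the original executable and the rebuilt one on the same inputs and count how often their observable behaviour matches. The sketch below is purely illustrative and is not Epoch AI's or METR's actual harness. The names `Observation`, `run`, and `pass_rate`, the choice to compare exit code and standard output, and the 30-second timeout are all assumptions made for the example.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Observation:
    """What a black-box run exposes: exit code plus both output streams."""
    returncode: int
    stdout: str
    stderr: str


def run(command, args, stdin_text=""):
    """Run a command (a list of strings) with extra arguments and capture its behaviour."""
    completed = subprocess.run(
        [*command, *args],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,  # assumed limit so a hanging program cannot stall the suite
    )
    return Observation(completed.returncode, completed.stdout, completed.stderr)


def pass_rate(reference, candidate, cases):
    """Fraction of cases where the candidate matches the reference.

    Each case is a pair (args, stdin_text). Matching on exit code and stdout
    is an assumed criterion; a real harness may compare more or less.
    """
    if not cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, stdin_text in cases:
        expected = run(reference, args, stdin_text)
        actual = run(candidate, args, stdin_text)
        if (expected.returncode, expected.stdout) == (actual.returncode, actual.stdout):
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    # Hypothetical stand-ins for "original" and "rebuilt" programs, so the
    # sketch runs anywhere Python is installed.
    original = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    rebuilt = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().lower())"]
    cases = [([], "abc"), ([], "123")]
    print(pass_rate(original, rebuilt, cases))
```

In this framing, the hidden tests are simply extra cases the model never sees while building, so it cannot tune its program to them.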

Epoch AI points out that this is a big shift from the usual coding benchmarks. The model is not modifying existing code or solving isolated programming problems. Instead, it independently recreates an entire application based only on documentation and observations. The benchmark is designed to measure long-horizon software engineering ability rather than performance on short coding tasks.

Key Takeaways from the Early Results

So, does it actually work? Epoch AI shared some early results, and it turns out frontier AI models can tackle these bigger challenges.

Take Anthropic's Claude Opus 4.6, for example. This model managed to recreate gotree, a bioinformatics toolkit with about 16,000 lines of Go code and more than 40 commands. According to Epoch AI, a typical engineer would need anywhere from 2 to 17 weeks to build that by hand.

The results also showed that models do better on these demanding projects when they have more time to think things through: the longer a model is allowed to reason before building its software, the better the end result.

MirrorCode differs from traditional coding evaluations because it requires models to build an entire application rather than modify an existing codebase. The benchmark focuses on measuring whether AI systems can independently recreate software by understanding its observable behaviour instead of relying on access to the original implementation.

Epoch AI says it will release an open-source version of MirrorCode but keep some parts private. That way, future AI models will still face unseen challenges instead of tests that are already publicly available.

As AI gets smarter, MirrorCode gives researchers a benchmark for measuring progress in building complete software, rather than just writing small snippets or fixing bugs.

Devanshi Kashyap
Devanshi is a curious learner who enjoys exploring new ideas and expressing creativity through art.