What it is

A research nonprofit that publishes resources and protocols for evaluating frontier AI systems’ autonomous/agentic capabilities and associated catastrophic-risk concerns.

Gabriel’s notes

Quick take: METR is one of the more “serious grown-up” corners of the AI internet: less demo theater, more measurement. If you care about frontier-model autonomy and risk-relevant capabilities, this is a good bookmark.

METR (pronounced “meter”) is a research nonprofit focused on scientifically measuring whether and when AI systems might cause catastrophic harm to society. Their public-facing materials include evaluation writeups and a dedicated set of autonomy-evaluation resources: a task suite, software tooling, and an example evaluation protocol intended to help evaluators measure potentially dangerous autonomous capabilities of frontier models. They explicitly frame these materials as an early beta draft (v0.1) that they plan to revise in versioned releases.

METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems. For me, it’s a handy AI eval site to keep around for the moments when you’re trying to decide which models to use for which tasks—and you want something more rigorous than vibes, screenshots, and a thread from a guy with a wolf avatar.

I saved this under Policy & safety because evaluation is where “we should be careful” becomes something you can actually test, document, and argue about like adults.

Good fit if you want to:

Track how “agentic” or long-horizon capability seems to be changing in frontier models over time.
Borrow an evaluation protocol (and the thinking behind it) instead of inventing an eval process from scratch.
Run or adapt tasks that look more like real work (software eng / ML eng / cyber / research) rather than quiz-bowl benchmarks.
Sanity-check model selection for complex workflows where tool use, persistence, and error recovery matter.
Get a clearer picture of what an eval does not measure (which is usually the part people skip).

Pricing snapshot (auto-enriched)

Pricing for any paid engagements is unknown / not confirmed. The autonomy-evaluation resources and protocol are published openly on their site; access to the “full suite” of tasks appears to require contacting METR (details are provided on the resources site).

Work-use / compliance snapshot (auto-enriched)

METR states they sometimes work with AI developers, governments, and research orgs that provide nonpublic model access and proprietary information, and they describe internal confidentiality training and access controls in their evaluation platform. If you’re adapting their materials internally, treat your eval artifacts like security-sensitive data: transcripts, tool outputs, and prompts can leak proprietary context surprisingly fast. (If you need formal assurances for vendor risk, you’ll want to do your own due diligence—this bookmark alone won’t magically satisfy procurement.)

Alternatives (auto-enriched)

OpenAI Evals (open-source): Great if you want a general-purpose eval framework + benchmark registry; less opinionated about “dangerous autonomy” protocols than METR’s autonomy-focused materials.
EleutherAI LM Evaluation Harness: Strong for standardized academic-style task evaluation across many backends; less tailored to long-horizon, tool-using agent workflows than METR’s autonomy resources.

Before you adopt it:

Decide what you’re measuring: capability, reliability, safety-relevant behaviors, or “my product feels better.” Don’t mix them and call it science.
Watch for eval contamination and overfitting—especially if you publish tasks or reuse them repeatedly.
Write down your thresholds and decision rules before you run the eval. Otherwise you’ll discover the ancient art of motivated reasoning.
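
To make the last point concrete, here is a minimal sketch of what “write the rule down first” could look like in Python. Everything in it is my own assumption for illustration: the DecisionRule structure, the decide function, the metric name, and the 0.8 / 10-run numbers are hypothetical and do not come from METR’s protocol or tooling. The outcomes are made-up data, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRule:
    """A threshold fixed before any eval results are seen (hypothetical structure)."""
    metric: str            # what is being measured, e.g. "task_success_rate"
    min_pass_rate: float   # adopt only if the observed pass rate is at least this
    min_runs: int          # refuse to decide on fewer runs than this


def decide(rule: DecisionRule, outcomes: list[bool]) -> str:
    """Apply the pre-registered rule to a list of per-run pass/fail outcomes."""
    if len(outcomes) < rule.min_runs:
        return "inconclusive"
    pass_rate = sum(outcomes) / len(outcomes)
    return "adopt" if pass_rate >= rule.min_pass_rate else "reject"


# Written down (and ideally committed) before running anything.
# The numbers are illustrative assumptions, not recommendations.
RULE = DecisionRule(metric="task_success_rate", min_pass_rate=0.8, min_runs=10)

if __name__ == "__main__":
    made_up_outcomes = [True] * 9 + [False]  # illustrative data only
    print(RULE.metric, "->", decide(RULE, made_up_outcomes))
```

The frozen dataclass is a small design choice: once the rule exists, the code cannot quietly change the threshold after the results come in, which is the whole point of the exercise.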

Sources

https://metr.org/
https://evaluations.metr.org/
https://metr.org/blog/2026-02-17-how-we-protect-confidential-information/
https://github.com/openai/evals
https://github.com/EleutherAI/lm-evaluation-harness

Visit the resource

METR (Model Evaluation & Threat Research)