What it is
This website compiles and organizes evidence on the reasoning capabilities of AI models, particularly o1, compared to previous models across various domains. It highlights both improvements and shortcomings, providing links to sources and detailed findings, while also noting potential selection bias in the evaluations.
Gabriel’s notes
This website compiles available evidence on how o1’s reasoning capabilities compare to previous models. The evidence is organized by domain and includes both improvements and areas without significant progress. Each entry includes links to sources and detailed findings. Note: There is a selection bias in the available evaluations, as researchers focus on tasks where they anticipate improvements and may be less likely to report negative results. The evidence presented here should be interpreted with this limitation in mind.
Good fit if you want to:
- Get a practical starting point for exploring how o1's reasoning capabilities compare to previous models.
Pricing snapshot (auto-enriched): No free tier available. Pricing is usage-based and charged per token: $15 per million input tokens and $60 per million output tokens. The source makes no mention of per-seat pricing or hidden limits.
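To make the per-token rates concrete, here is a minimal sketch of the arithmetic, assuming simple linear billing at the two rates quoted above. The function name and the example token counts are hypothetical, and real invoices may differ, so treat this as a rough estimate only.

```python
# Rates quoted in the pricing snapshot (USD per million tokens).
INPUT_USD_PER_MILLION = 15.0
OUTPUT_USD_PER_MILLION = 60.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request, assuming plain per-token billing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical request: 2,000 input tokens and 1,000 output tokens.
print(f"${estimate_cost_usd(2_000, 1_000):.2f}")  # prints $0.09
```

Output tokens cost four times as much as input tokens at these rates, so long responses dominate the bill.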
Work-use / compliance snapshot (auto-enriched): The OpenAI o1 model and related business products are suitable for workplace use, offering strong data handling and encryption practices, no default training on customer data, configurable data retention, SSO (SAML) and SCIM support, and compliance with SOC 2 Type 2, HIPAA (via BAA), GDPR, and other major privacy standards.
Alternatives (auto-enriched):
- GPT-4o: A predecessor model with strong general capabilities, but it shows less improvement in reasoning tasks compared to o1.
- Claude 3.5 Sonnet: Outperforms o1-mini in computational reproducibility but lacks the reasoning improvements seen in o1.
Note: pricing and policy details can change, so verify on the official site before making decisions.