A new, challenging AGI test stumps most AI models

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

“Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%.

The ARC-AGI tests consist of puzzle-like problems in which an AI has to identify visual patterns in a collection of different-colored squares and generate the correct “answer” grid. The problems are designed to force an AI to adapt to tasks it hasn’t seen before.
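To make the puzzle format concrete, here is a minimal Python sketch. It assumes a task is a handful of input/output grid pairs, with each grid stored as rows of small integers standing for colors, and that an answer counts only if it matches the expected grid exactly. The toy task, the `mirror_rows` rule, and the `solves_task` helper are illustrative inventions, not material from the benchmark itself.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # assumed encoding: one integer per colored cell

# Hypothetical toy task: each output is the input mirrored left to right.
TOY_TASK: Dict[str, List[Dict[str, Grid]]] = {
    "train": [
        {"input": [[1, 0, 0], [2, 2, 0]], "output": [[0, 0, 1], [0, 2, 2]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 6]], "output": [[6, 0, 5]]},
    ],
}


def mirror_rows(grid: Grid) -> Grid:
    """Candidate rule a solver might infer from the training pairs."""
    return [list(reversed(row)) for row in grid]


def solves_task(task: Dict[str, List[Dict[str, Grid]]],
                rule: Callable[[Grid], Grid]) -> bool:
    """Return True only if the rule reproduces every test output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["test"])


if __name__ == "__main__":
    print(solves_task(TOY_TASK, mirror_rows))    # the mirrored rule
    print(solves_task(TOY_TASK, lambda g: g))    # copying the input unchanged
```

In this sketch the solver sees only two demonstrations before being asked for the test grid, which is the sense in which such puzzles reward adapting to an unfamiliar pattern rather than recalling a memorized one.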

The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right — much better than any of the models’ scores.

A sample question from ARC-AGI-2 (credit: Arc Prize).

In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.

Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” — extensive computing power — to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.

To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

The version of OpenAI’s o3 model — o3 (low) — that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task.

Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face’s co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.


