
AI’s intelligence test dilemma: why we still don’t know how to measure human-level thinking in machines


Despite impressive progress in artificial intelligence, the field is still missing a definitive way to measure whether a machine truly thinks like a human. From prize challenges to philosophical debates, researchers are wrestling with the limits of current benchmarks—and the daunting prospect of testing superintelligence.

The challenge of designing the ultimate AI test

Two major San Francisco AI players—Scale AI and the Center for AI Safety (CAIS)—recently launched “Humanity’s Last Exam,” a global call to devise questions capable of truly testing large language models (LLMs) such as Google Gemini and OpenAI’s o1. Offering $5,000 prizes for the top 50 questions, the challenge aims to probe how close we are to “expert-level AI systems.”

[Image: manikin with an intelligence cloud obscuring its head]

The problem is that today’s LLMs already excel at many established benchmarks in intelligence, mathematics, and law, sometimes to a degree that raises suspicion of memorization. These models are trained on colossal amounts of data, possibly including the very questions used to assess them. The AI research group Epoch estimates that by 2028 AIs may have effectively “read” everything humans have ever written, leaving almost no established test safe from contamination. That looming moment forces researchers to rethink what testing should look like once AIs have already seen all the answers.
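
One way researchers probe this risk is to check how much of a benchmark already overlaps with the training corpus. The sketch below is a minimal illustration of such a contamination check using word n-gram overlap; the function names and toy data are invented here and do not reflect any lab’s actual tooling.

```python
# Minimal sketch of a benchmark-contamination check: flag test questions whose
# word n-grams already appear somewhere in the training corpus. Names and data
# are illustrative only.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(questions: list[str], corpus: list[str], n: int = 8) -> float:
    """Fraction of benchmark questions sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for q in questions if ngrams(q, n) & corpus_grams)
    return flagged / len(questions) if questions else 0.0

# Toy usage: the first question appears verbatim in the "training" corpus.
train_docs = ["The capital of France is Paris and it sits on the Seine river bank today."]
test_qs = ["The capital of France is Paris and it sits on the Seine river bank today.",
           "Which element has the atomic number 26?"]
print(contamination_rate(test_qs, train_docs, n=8))  # -> 0.5
```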

Complicating matters further, the flood of AI-generated content on the internet raises the risk of “model collapse,” a phenomenon in which machine-written text recirculates in training data and gradually degrades model quality. Some companies are countering this by collecting fresh data from human interactions or real-world devices, from Tesla’s sensor-rich vehicles to Meta’s smart glasses, to keep both training and evaluation datasets diverse.
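
The dynamic behind model collapse can be shown with a deliberately simplified toy: each “generation” of a model is fit only to samples drawn from the previous generation’s fit, so estimation noise compounds and the spread of the original distribution erodes. The sketch below is purely illustrative and makes no claim about how production LLMs are actually trained.

```python
# Toy illustration of model collapse: each "generation" fits a Gaussian to
# samples drawn from the previous generation's fit, then the next generation
# samples from that new fit. Estimation noise compounds and the learned spread
# tends to collapse toward zero, losing the tails of the original "human" data.
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0      # the original, human-generated data distribution
sample_size = 25          # how much data each generation is trained on

for generation in range(501):
    if generation % 100 == 0:
        print(f"generation {generation:3d}: mu={mu:+.4f} sigma={sigma:.6f}")
    data = [random.gauss(mu, sigma) for _ in range(sample_size)]
    mu = statistics.fmean(data)       # refit the "model" on model-generated data
    sigma = statistics.pstdev(data)
```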

Why current intelligence benchmarks fall short

[Image: Magnus Carlsen thinking about a chess move]

Defining intelligence has always been contentious. Human IQ tests have been criticized for capturing only narrow cognitive skills, and AI assessments suffer from a similar problem. Many benchmarks—summarizing text, interpreting gestures, recognizing images—test specific tasks but fail to measure adaptability or reasoning across domains.

A striking example is the chess engine Stockfish. It surpasses world champion Magnus Carlsen by a wide Elo margin, yet cannot interpret a sentence or navigate a room. This illustrates why task-specific dominance cannot be equated with general intelligence, and why AI researchers are searching for more holistic measures.
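
To make that Elo margin concrete, the standard Elo model converts a rating gap into an expected per-game score. The snippet below uses rough, assumed ratings (about 3600 for a top engine versus about 2830 for the human world champion); the exact figures are illustrative, not official.

```python
# Expected per-game score for the higher-rated player under the standard Elo
# model: E = 1 / (1 + 10 ** (-(rating_a - rating_b) / 400)).
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Assumed, round-number ratings for illustration only.
engine, champion = 3600, 2830
print(f"{expected_score(engine, champion):.3f}")  # ~0.988: near-certain wins for the engine
```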

As AI systems now exhibit broader competencies, the challenge is to create tests that capture this expansion. Benchmarks must go beyond scoring isolated abilities to gauge how well a model can generalize knowledge and adapt to new, unseen problems—skills considered the hallmark of human intelligence.

The rise of reasoning-based evaluation

French AI researcher François Chollet’s Abstraction and Reasoning Corpus (ARC) is one of the most ambitious attempts to push testing in this direction. Launched in 2019, ARC presents AIs with visual puzzles that require inferring and applying abstract rules from only a handful of examples, a far cry from the brute-force pattern recognition of traditional image classification.
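
ARC tasks are distributed as JSON, each with a few “train” input/output grid pairs and one or more held-out “test” inputs; a solver has to infer the transformation from those few demonstrations and apply it to the test grid. The sketch below shows that structure with a tiny made-up task and a single hand-written candidate rule; real ARC tasks are far harder, and this is not how competitive solvers work.

```python
# Minimal sketch of an ARC-style task. Real ARC tasks are JSON files with
# "train" and "test" lists of input/output grids (2-D lists of color indices
# 0-9). The tiny task and candidate rule below are invented for illustration.
from typing import Callable

Grid = list[list[int]]

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 0]]}],
}

def mirror_horizontally(grid: Grid) -> Grid:
    """Candidate rule: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def rule_fits(rule: Callable[[Grid], Grid], pairs: list[dict]) -> bool:
    """Does the candidate rule reproduce every training output exactly?"""
    return all(rule(pair["input"]) == pair["output"] for pair in pairs)

if rule_fits(mirror_horizontally, task["train"]):
    prediction = mirror_horizontally(task["test"][0]["input"])
    print(prediction)  # [[0, 3], [0, 0]]
```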

ARC remains a steep challenge for machines. Humans regularly score above 90%, but leading LLMs such as OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet hover around 21%. Even a GPT-4o-based approach that reached 50% through extensive trial and error fell well short of the 85% threshold required to claim the $600,000 prize. The gap reassures researchers that, despite the hype, no AI has yet matched the flexible reasoning humans demonstrate with ease.

However, ARC is not the only experiment in play. The Humanity’s Last Exam project deliberately keeps its winning questions offline to avoid AI models “studying” the answers, a move that reflects how high the stakes are for ensuring a test’s integrity in a world where machines might pre-learn anything published.

What happens when we approach superintelligence?

The urgency to develop robust tests goes beyond academic curiosity. The arrival of human-level AI—let alone superintelligence—would bring profound safety, ethical, and governance challenges. Detecting that moment requires tools capable of identifying not just task mastery, but deep reasoning, adaptability, and value alignment with human norms.

Yet, as history shows, each time AI reaches parity in a narrow domain, the goalposts shift. What once counted as “intelligent” becomes routine automation. This means that measuring superintelligence will likely demand frameworks that we cannot fully conceive yet—ones that might integrate philosophical inquiry, social sciences, and long-term behavioral observation.

For now, the race is on to design the next generation of tests before AI’s knowledge eclipses our ability to measure it. And when machines finally pass “Humanity’s Last Exam,” the next question might be even more daunting: how do we examine something smarter than ourselves?
