AI specialists prepare 'Humanity's Last Exam' to challenge advanced technology

On Monday, a group of technology experts made a global appeal for the most challenging questions to ask artificial intelligence systems, which have been effortlessly tackling widely recognized benchmark tests.

Named "Humanity's Last Exam," this initiative aims to identify when AI reaches an expert level. It is designed to remain relevant as capabilities evolve in the coming years, according to the organizers—non-profit Center for AI Safety and startup Scale AI.

The announcement comes shortly after the unveiling of a new model, OpenAI o1, which "destroyed the most popular reasoning benchmarks," according to Dan Hendrycks, executive director of CAIS and an advisor to Elon Musk's xAI startup.

Hendrycks, who co-authored two 2021 papers that introduced tests for AI systems now in widespread use, highlighted one that assesses undergraduate-level knowledge in subjects like U.S. history and another that evaluates reasoning skills in competition-level math. The undergraduate-style test is the most downloaded dataset on the online AI platform Hugging Face.

When those papers were published, AI systems often provided near-random answers. "They're now crushed," Hendrycks remarked.

For instance, Anthropic's Claude models improved from scoring around 77 percent on the undergraduate-level test in 2023 to nearly 89 percent a year later, according to a prominent capabilities leaderboard.

Consequently, the significance of these common benchmarks has diminished.

Meanwhile, AI has struggled with less commonly used tests related to planning and visual pattern-recognition, as highlighted in Stanford University's AI Index Report from April. For example, OpenAI o1 scored approximately 21 percent on a version of the pattern-recognition ARC-AGI test, as reported by the ARC organizers on Friday.

Some AI researchers propose that such outcomes indicate that planning and abstract reasoning serve as better indicators of intelligence. However, Hendrycks stated that the visual component of the ARC test makes it a poor fit for language models. He affirmed that "Humanity's Last Exam" will necessitate abstract reasoning.

Industry observers have suggested that answers from popular benchmarks may have unintentionally been included in the training data for AI systems. Consequently, Hendrycks mentioned that some questions for "Humanity's Last Exam" will be kept confidential to prevent AI from relying on memorized responses.

The exam is set to feature at least 1,000 crowd-sourced questions, due by November 1, on topics that are difficult for non-experts to answer. Submissions will undergo a peer-review process, and winning entries can earn co-authorship and prizes of up to $5,000, sponsored by Scale AI.

"We desperately need harder tests for expert-level models to measure the rapid progress of AI," stated Alexandr Wang, CEO of Scale AI.

One guideline prohibits questions related to weapons, which some experts believe could pose significant risks if studied by AI.

Rohan Mehta contributed to this report for TROIB News