Close Menu
Beverly Hills Examiner

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    The Cure’s Perry Bamonte Dies at 65

    January 1, 2026

    Copper records biggest annual gain since 2009 on supply bets

    January 1, 2026

    Trump Takes One Final Big Loss In Court Before The End Of The Year

    January 1, 2026
    Facebook X (Twitter) Instagram
    Beverly Hills Examiner
    • Home
    • US News
    • Politics
    • Business
    • Science
    • Technology
    • Lifestyle
    • Music
    • Television
    • Film
    • Books
    • Contact
      • About
      • Amazon Disclaimer
      • DMCA / Copyrights Disclaimer
      • Terms and Conditions
      • Privacy Policy
    Beverly Hills Examiner
    Home»Technology»Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be
    Technology

    Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

    By AdminOctober 16, 2024
    Facebook Twitter Pinterest LinkedIn WhatsApp Email Reddit Telegram
    Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be


    For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

    The fragility highlighted in these new results helps support previous research suggesting that LLMs’ use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

    Mix It Up

    In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”—currently available as a preprint paper—the six Apple researchers start with GSM8K’s standardized set of more than 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs’ complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.

    This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K.

    Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.

    This kind of variance—both within different GSM-Symbolic runs and compared to GSM8K results—is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

    Don’t Get Distracted

    Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

    The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”

    Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.



    Original Source Link

    Share. Facebook Twitter Pinterest LinkedIn WhatsApp Email Reddit Telegram
    Previous Article6G phone networks could be 9000 times faster than 5G
    Next Article Aaron Judge snaps postseason slump, Yankees take 2-0 lead over Guardians in ALCS

    RELATED POSTS

    ‘College dropout’ has become the most coveted startup founder credential

    January 1, 2026

    Factor Meal Delivery Promo: Free $200 Withings Body-Scan Scale

    December 31, 2025

    The phone is dead. Long live . . . what exactly?

    December 31, 2025

    Commodore 64 Ultimate Review: An Astonishing Remake

    December 30, 2025

    Meta just bought Manus, an AI startup everyone has been talking about

    December 30, 2025

    iMP Tech Mini Arcade Pro Review: A Nintendo Switch Arcade Cabinet

    December 29, 2025
    latest posts

    The Cure’s Perry Bamonte Dies at 65

    Perry Bamonte, the Cure’s longtime guitarist and keyboardist, has died following an undisclosed illness. He…

    Copper records biggest annual gain since 2009 on supply bets

    January 1, 2026

    Trump Takes One Final Big Loss In Court Before The End Of The Year

    January 1, 2026

    Zohran Mamdani sworn in as NYC mayor in midnight ceremony at Old City Hall

    January 1, 2026

    ‘College dropout’ has become the most coveted startup founder credential

    January 1, 2026

    Poor Sleep Quality Accelerates Brain Aging

    January 1, 2026

    Avengers, Toy Story 5, The Odyssey

    January 1, 2026
    Categories
    • Books (970)
    • Business (5,878)
    • Film (5,812)
    • Lifestyle (3,915)
    • Music (5,880)
    • Politics (5,882)
    • Science (5,224)
    • Technology (5,811)
    • Television (5,497)
    • Uncategorized (2)
    • US News (5,863)
    popular posts

    From South Africa, a success story for democracy

    Back in April 1994, the world watched a remarkable event: South Africa’s first democratic election…

    ‘Dancing With the Stars’ Pro Alan Bersten on the Show’s Move to Disney+

    June 17, 2022

    What the complete ape genome is revealing about the earliest humans

    May 15, 2025

    Kate Bush’s ‘Running Up That Hill’ On Track For Another U.K. No. 1 – Billboard

    June 20, 2022
    Archives
    Browse By Category
    • Books (970)
    • Business (5,878)
    • Film (5,812)
    • Lifestyle (3,915)
    • Music (5,880)
    • Politics (5,882)
    • Science (5,224)
    • Technology (5,811)
    • Television (5,497)
    • Uncategorized (2)
    • US News (5,863)
    About Us

    We are a creativity led international team with a digital soul. Our work is a custom built by the storytellers and strategists with a flair for exploiting the latest advancements in media and technology.

    Most of all, we stand behind our ideas and believe in creativity as the most powerful force in business.

    What makes us Different

    We care. We collaborate. We do great work. And we do it with a smile, because we’re pretty damn excited to do what we do. If you would like details on what else we can do visit out Contact page.

    Our Picks

    Poor Sleep Quality Accelerates Brain Aging

    January 1, 2026

    Avengers, Toy Story 5, The Odyssey

    January 1, 2026

    ‘The Challenge’ Star Reveals Horrific Accident Blinded Him

    January 1, 2026
    © 2026 Beverly Hills Examiner. All rights reserved. All articles, images, product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement unless specified. By using this site, you agree to the Terms & Conditions and Privacy Policy.

    Type above and press Enter to search. Press Esc to cancel.

    We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept All”, you consent to the use of ALL the cookies. However, you may visit "Cookie Settings" to provide a controlled consent.
    Cookie SettingsAccept All
    Manage consent

    Privacy Overview

    This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.
    Necessary
    Always Enabled
    Necessary cookies are absolutely essential for the website to function properly. These cookies ensure basic functionalities and security features of the website, anonymously.
    CookieDurationDescription
    cookielawinfo-checkbox-analytics11 monthsThis cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
    cookielawinfo-checkbox-functional11 monthsThe cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
    cookielawinfo-checkbox-necessary11 monthsThis cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
    cookielawinfo-checkbox-others11 monthsThis cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
    cookielawinfo-checkbox-performance11 monthsThis cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
    viewed_cookie_policy11 monthsThe cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
    Functional
    Functional cookies help to perform certain functionalities like sharing the content of the website on social media platforms, collect feedbacks, and other third-party features.
    Performance
    Performance cookies are used to understand and analyze the key performance indexes of the website which helps in delivering a better user experience for the visitors.
    Analytics
    Analytical cookies are used to understand how visitors interact with the website. These cookies help provide information on metrics the number of visitors, bounce rate, traffic source, etc.
    Advertisement
    Advertisement cookies are used to provide visitors with relevant ads and marketing campaigns. These cookies track visitors across websites and collect information to provide customized ads.
    Others
    Other uncategorized cookies are those that are being analyzed and have not been classified into a category as yet.
    SAVE & ACCEPT