Using supervised fine-tuning (SFT) to introduce even a small amount of relevant data to the training set can often lead to strong improvements in this kind of “out of domain” model performance. But the researchers say that this kind of “patch” for various logical tasks “should not be mistaken for achieving true generalization. … Relying on SFT to fix every [out of domain] failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability.”
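
A minimal sketch of what such an SFT “patch” looks like in practice, assuming a Hugging Face-style causal language model (where passing `labels` returns a loss) and a hypothetical list of out-of-domain prompt/answer pairs:

```python
import torch

def sft_patch(model, tokenizer, ood_examples, lr=1e-5, epochs=1):
    """Sketch of an SFT 'patch': keep training the model on a handful of
    examples from the previously unseen task.

    Assumes a Hugging Face-style causal LM (passing `labels` returns a loss)
    and `ood_examples` as a hypothetical list of (prompt, answer) strings.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for prompt, answer in ood_examples:
            batch = tokenizer(prompt + answer, return_tensors="pt")
            # Standard next-token prediction objective over the whole sequence.
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```

As the researchers note, a patch like this narrowly fixes the sampled task; it does not confer the abstract reasoning that would generalize to the next unseen variation.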

Rather than showing the capability for generalized logical inference, these chain-of-thought models are “a sophisticated form of structured pattern matching” that “degrades significantly” when pushed even slightly outside of its training distribution, the researchers write. Further, the ability of these models to generate “fluent nonsense” creates “a false aura of dependability” that does not stand up to a careful audit.

As such, the researchers warn heavily against “equating [chain-of-thought]-style output with human thinking” especially in “high-stakes domains like medicine, finance, or legal analysis.” Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond “surface-level pattern recognition to exhibit deeper inferential competence,” they write.

  • Lvxferre [he/him]@mander.xyz

    You don’t say.

    Imagine for a moment you had a machine that lets you throw bricks a certain distance. This shit is useful, especially if you’re a griefer; but even if you aren’t, there are some corner cases for it, like transporting construction material over a distance.

    And yet whoever sold you the machine calls it a “house auto-builder”. He tells you that it can help you to build your house. Mmmh.

    Can house construction be partially automated? Certainly. Perhaps even fully. But not through a brick-throwing machine.

    Of course trying to use the machine for its advertised purpose will go poorly, even if you only delegate brick placement to it (and still build the foundation, add cement, etc. manually). You might save a bit of time when the machine happens to throw a brick in the right place, but you’ll waste a lot of time cleaning up broken bricks or replacing them. But it’s still being sold as a house auto-builder.

    But the seller is really, really, really invested in this auto-construction babble, because his investors gave him money to create auto-construction tools. And he keeps babbling about how “soon” we’re going to get fully automated house building, and how it’s an existential threat to builders, and all that babble. So he tweaks the machines to include “simulated building”. All it does is tweak the force and aim of the machine, so it’s slightly less bad at throwing bricks.

    It still does not solve the main problem: you don’t build a house by throwing bricks. You need to place them. But you still have some suckers saying “haha, but it’s a building machine lmao, can you prove it doesn’t build? lol”.

    That’s all that “reasoning” LLMs are about.

    • massive_bereavement@fedia.io

      You don’t get it.

      In the past, the brick-throwing machine was always missing its target, and nowadays it is almost always hitting near its target. It depends on how good you are at asking the machine to throw bricks (you need to assume some will miss and correct accordingly).

      Eventually, brick throwing machines will get so good that they will rely on gravitational forces to place the bricks perfectly and auto-build houses.

      Plus you can vibe build: let it throw some random bricks and start building around them. You will be surprised at what it can achieve.

      #building-is-dead #autobrick-engineer

      • Lvxferre [he/him]@mander.xyz

        You don’t get it.

        I do get it. And that’s why I’m disdainful towards all this “simulated reasoning” babble.

        In the past, the brick-throwing machine was always missing its target, and nowadays it is almost always hitting *near* its target.

        Emphasis mine: that “near” is a sleight of hand.

        It doesn’t really matter if it’s hitting “near” or “far”; in both cases someone will need to stop the brick-throwing machine, get into the construction site (as if building a house manually), place the brick in the correct location (as if building a house manually), and then resume operations as usual.

        In other words, “hitting near the target” = “failure to hit the target”.

        And it’s obvious why it’s wrong; the idea that an auto-builder should throw bricks is silly. It should detect where the brick should be placed, and lay it down gently.

        The same thing applies to those large token* models; they won’t get anywhere close to reasoning, just like a brick-throwing machine won’t get anywhere close to being an automatic house builder.

        *I’m calling it “large token model” instead of “large language model” to highlight another thing: those models don’t even model language fully, except in the brains of functionally illiterate tech bros who think language is just a bunch of words. Semantics and pragmatics are core parts of a language; you don’t have language if utterances don’t have meaning or purpose. The closest LLMs get is plopping in some mislabelled “semantic supplement” - because it’s a great red herring (if you mislabel something, you’re bound to get suckers confusing it with the real thing, and saying “I dun unrurrstand, they have semantics! Y u say they don’t? I is so confusion… lol lmao”).

        It depends on how good you are at asking the machine to throw bricks (you need to assume some will miss and correct accordingly).

        If the machine relies on you to be an assumer (i.e. to make shit up, like a muppet), there’s already something wrong with it.

        Eventually, brick throwing machines will get so good that they will rely on gravitational forces to place the bricks perfectly and auto-build houses.

        To be blunt, that reeks of “wishful thinking” from a distance.

        As I implied in the other comment (“Can house construction be partially automated? Certainly. Perhaps even fully. But not through a brick-throwing machine.”), I don’t think reasoning algorithms are impossible; but it’s clear LLMs are not the way to go.

        • massive_bereavement@fedia.io

          Sorry, I just got carried away in your analogy, like the proverbial brick thrown into the air by a large machine that is always very precisely almost often sometimes hitting its target.

          • Lvxferre [he/him]@mander.xyz

            If it is not a parody, the user got a serious answer. And if it is, I’m just playing along ;-)

            (If it is a parody, it’s so good that it allows me to actually answer it as if it wasn’t.)

            • Mac@mander.xyz

              It is most definitely satire, but that doesn’t mean your comments aren’t worth reading.

              • massive_bereavement@fedia.io

                And you should see the therapeutic effects of brick throwing and the very promising health applications.

                You would be amazed at what you can achieve with a well-thrown brick.

  • teawrecks@sopuli.xyz

    The analogy I use is, it’s like a magician pulled a coin from behind a CEO’s ear, and their response was “that’s incredible! Free money! Let’s go into business together!”

    Literally no one ever claimed it had reasoning capabilities. It is a trick to produce a string of characters that your brain can make sense of. That’s all.

    • anachronist@midwest.social

      Literally no one ever claimed it had reasoning capabilities

      Altman and similar grifters were and are absolutely making those claims but maybe we’re excusing them as obvious liars?

      • TehPers@beehaw.org

        They are obvious liars. Some people are just too invested to see it.

        These models only have reasoning capabilities under the most obscure definitions of “reasoning”. At best, all they’re doing is climbing to local maxima with their so-called “reasoning”, on a graph as wavy as the ocean.

        I’ve mentioned this on other posts, but it’s really sad because LLMs have been wildly incredible for certain NLP operations. They are that though, not AGI or whatever snake oil Altman wants to sell this week.

  • panda_abyss@lemmy.ca

    Chain of thought is basically garbage.

    It works with coding agents because they get an automated hard failure.

    The rest of the time it’s just sampling the latent space around a response and should be trimmed out.

    That could work with diffusion models, but with autoregressive models it’s just polluting the context window in the hope of finding longer-tail tokens.
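
    Roughly, the hard-failure loop looks like this (a sketch only; `generate_patch` is a hypothetical wrapper around the model, and pytest stands in for whatever gives the binary pass/fail signal):

    ```python
    import subprocess

    def run_tests() -> tuple[bool, str]:
        """Run the test suite; a non-zero exit code is the hard failure signal."""
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    def agent_loop(task: str, generate_patch, max_attempts: int = 5) -> bool:
        """Keep asking the model for a patch until the tests pass or we give up.

        `generate_patch` is a hypothetical callable that wraps the LLM call and
        applies its suggested edit to the working tree.
        """
        feedback = ""
        for _ in range(max_attempts):
            generate_patch(task, feedback)   # model proposes and applies an edit
            ok, output = run_tests()         # automated, unambiguous verdict
            if ok:
                return True                  # hard success: tests pass
            feedback = output                # hard failure: feed the error back in
        return False
    ```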

  • jarfil@beehaw.org

    chain-of-thought models

    There are no “CoT LLMs”; a CoT means externally iterating an LLM. The strength of CoT resides in its ability to pull in external resources at each iteration, not in feeding the LLM its own outputs.
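
    Roughly, a sketch of what I mean by external iteration (hypothetical `call_llm` and `lookup` helpers standing in for the model API and whatever outside resource each step consults):

    ```python
    def call_llm(prompt: str) -> str:
        """Hypothetical wrapper around whatever LLM API is in use."""
        ...

    def lookup(query: str) -> str:
        """Hypothetical external resource: search engine, calculator, database."""
        ...

    def external_cot(question: str, max_steps: int = 4) -> str:
        context = question
        for _ in range(max_steps):
            step = call_llm(f"{context}\n\nState the next step, or 'ANSWER: ...' if done.")
            if step.startswith("ANSWER:"):
                return step.removeprefix("ANSWER:").strip()
            # The iteration happens outside the model, so each step can be grounded
            # with external evidence instead of recycling the model's own output.
            context += f"\nStep: {step}\nEvidence: {lookup(step)}"
        return call_llm(f"{context}\n\nGive your best final answer.")
    ```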

    “Researchers” didn’t just “find out” this; it was known from day one.

    As for who needs to hear it… well, apparently people unable to tell an LLM apart from an AI.

    • RoadTrain@lemdro.id

      a CoT means externally iterating an LLM

      Not necessarily. Yes, a chain of thought can be provided externally, for example through user prompting or another source, which can even be another LLM. One of the key observations behind the models commonly referred to as “reasoning” models is this: if an external LLM can be used to provide “thoughts”, could an LLM provide those steps itself, without depending on external sources?

      To do this, it generates “thoughts” around the user’s prompt, essentially exploring the space around it and trying different options. These generated steps are added to the context window and are usually much larger than the prompt itself, which is why these models are sometimes referred to as long chain-of-thought models. Some frontends will show a summary of the long CoT, although this is normally not the raw context itself, but rather a version that has been summarised and re-formatted.
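
      Roughly, a minimal sketch of that self-provided version (with a hypothetical `call_llm` helper standing in for a single generation call): the model’s own “thoughts” are appended to the context before the visible answer is produced, and a frontend would typically show a summary rather than the raw text.

      ```python
      def call_llm(prompt: str) -> str:
          """Hypothetical wrapper around a single LLM generation call."""
          ...

      def long_cot_answer(user_prompt: str) -> tuple[str, str]:
          # The model generates exploratory "thoughts" about the prompt itself,
          # with no external source consulted.
          thoughts = call_llm(f"{user_prompt}\n\nThink step by step before answering.")
          # The generated steps (often much longer than the prompt) become part of
          # the context used to produce the visible answer.
          answer = call_llm(f"{user_prompt}\n\nReasoning so far:\n{thoughts}\n\nFinal answer:")
          # What a frontend displays is usually a summarised, re-formatted version
          # of the raw chain of thought, not the context itself.
          summary = call_llm(f"Summarise these reasoning steps briefly:\n{thoughts}")
          return answer, summary
      ```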

    • CanadaPlus@lemmy.sdf.org

      Yes, but it supports the jerk that everything called or associated with AI is bad, so it makes a popular Beehaw post.

    • interdimensionalmeme@lemmy.ml

      I think of chain of thought as a self-prompting model. I suspect that in the future, chain-of-thought models will run a smaller, dedicated submodel tuned just for the chain-of-thought tokens.

      The point of this is that most users aren’t very good at prompting; they just don’t have the feel for it.

      Personally I get worse results, way less of what I wanted, when CoT is enabled. I’m very annoyed that the “chatgpt classic” model selector now just decides to use CoT whenever it wants. I should be the one to decide that, and I want it off almost all of the time!!

      • BlameThePeacock@lemmy.ca

        I’ve met far too many people I wouldn’t trust to give me a reasoned response.

        Some people simply lack that capacity entirely, some just don’t care enough to spend the effort on it, while others are trying to deceive me intentionally.

        • Catoblepas@piefed.blahaj.zoneOP

          LLMs are incapable of reasoning. There is not a consciousness in there deciding and telling you things. My comment was entirely about whether LLMs can reason, not whether all people reason at the same level or might decide to trick you.

          • BlameThePeacock@lemmy.ca

            I don’t disagree with you that LLMs don’t reason. I disagree that all Humans can or do reason.

            • TehPers@beehaw.org

              I disagree that all Humans can or do reason.

              Well if we’re talking about all humans…

              But more seriously, it doesn’t take much looking to find someone who doesn’t reason. Just look on the TV during the next major election and you’ll find a bunch.