Using supervised fine-tuning (SFT) to introduce even a small amount of relevant data to the training set can often lead to strong improvements in this kind of “out of domain” model performance. But the researchers say that this kind of “patch” for various logical tasks “should not be mistaken for achieving true generalization. … Relying on SFT to fix every [out of domain] failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability.”
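To make the “patch” concrete: SFT here is ordinary gradient fine-tuning on a handful of examples drawn from the task the model fails at. A minimal sketch, assuming a Hugging Face causal LM; the model name and toy examples are placeholders, not the paper’s actual setup:

```python
# Illustrative SFT "patch": fine-tune a pretrained causal LM on a few
# examples from the failing out-of-domain task. Model and data are
# invented for illustration, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A "small amount of relevant data": prompt/answer pairs from the
# task the base model gets wrong.
patch_examples = [
    "Shift each letter forward by 2: abc -> cde",
    "Shift each letter forward by 2: xyz -> zab",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in patch_examples:
        batch = tokenizer(text, return_tensors="pt")
        # Standard causal-LM objective: the labels are the inputs themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

This reliably teaches the specific pattern in the patch data, which is precisely the researchers’ point: it fixes the one failure without conferring any general inferential ability.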

Rather than showing the capability for generalized logical inference, these chain-of-thought models are “a sophisticated form of structured pattern matching” that “degrades significantly” when pushed even slightly outside of their training distribution, the researchers write. Further, the ability of these models to generate “fluent nonsense” creates “a false aura of dependability” that does not stand up to a careful audit.

As such, the researchers warn strongly against “equating [chain-of-thought]-style output with human thinking,” especially in “high-stakes domains like medicine, finance, or legal analysis.” Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond “surface-level pattern recognition to exhibit deeper inferential competence,” they write.
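What such an out-of-distribution probe could look like, sketched under assumptions: train on individual text transformations, then test on a composition the training set never contains. The transformations below are invented for illustration and are not the paper’s actual benchmark:

```python
# Sketch of an out-of-distribution probe: compose transformations in a way
# the training set never shows, then check whether the model's reasoning
# transfers. All transformations here are illustrative placeholders.
def rot(s: str, k: int) -> str:
    """Shift each lowercase letter forward by k (a simple Caesar/ROT cipher)."""
    return "".join(chr((ord(c) - ord("a") + k) % 26 + ord("a")) for c in s)

def reverse(s: str) -> str:
    return s[::-1]

# In-distribution: each transformation appears alone in training.
train_tasks = [("rot2", lambda s: rot(s, 2)), ("reverse", reverse)]

# Out-of-distribution: a composition absent from training. A model that has
# genuinely inferred the rules should handle it; a pattern matcher often won't.
def probe(s: str) -> str:
    return reverse(rot(s, 2))

assert probe("abc") == "edc"   # rot2("abc") = "cde", reversed = "edc"
```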

  • hypna@lemmy.world · 2 days ago

    Didn’t read the paper, but the description suggests that the models tested were specifically created with a very limited set of training data. I can understand the argument that this might allow researchers to evaluate one property, namely logic generalization, in a simplified environment. But it could also be argued that general logic is an emergent property, and limiting the LLM in this way prevents it from reaching that complexity threshold.

    I’m not sure whether the two transformations were included in the training data to give the LLM at least one opportunity to generalize between known phenomena before asking it to generalize further.