LLM language comprehension task tests reveal insensitivity to underlying meaning

philipswood 5 hours ago

An interesting paper suggesting that, while LLMs produce seemingly human-like output, they rely on very un-human-like computational approaches with significant weaknesses.

It suggests that while they are not stochastic parrots, they are nowhere near human-level intelligence yet.

> ...in this work we tested 7 state-of-the-art LLMs on simple comprehension questions targeting short sentences, purposefully setting an extremely low bar for the evaluation of the models.

> Systematic testing showed that the performance of these LLMs lags behind that of humans both quantitatively and qualitatively, providing further confirmation that tasks that are easy for humans are not always easily developed in AI. We argue that these results invite further reflection about the standards of evaluation we adopt for claiming human-likeness in AI.

Tasks like:

> John deceived Mary and Lucy was deceived by Mary. In this context, did Mary deceive Lucy?

> Franck read to himself and John read to himself, Anthony and Franck. In this context, was Franck read to?
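
For anyone who wants to try prompts like these against a model, here is a minimal scoring sketch. It is not the paper's harness. `ask_model` is a hypothetical callable you would wire up to whatever model you use, and the gold answers are my own reading of the two sentences rather than anything taken from the paper.

```python
from typing import Callable, List, Tuple

# (prompt, gold answer) pairs. The gold labels are an assumption:
# they reflect my own reading of the sentences, not the paper's data.
ITEMS: List[Tuple[str, str]] = [
    ("John deceived Mary and Lucy was deceived by Mary. "
     "In this context, did Mary deceive Lucy?", "yes"),
    ("Franck read to himself and John read to himself, Anthony and Franck. "
     "In this context, was Franck read to?", "yes"),
]


def normalize(reply: str) -> str:
    """Reduce a free-text reply to 'yes', 'no', or 'other' using its first word."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,!?:;\"'") if words else ""
    return first if first in ("yes", "no") else "other"


def accuracy(ask_model: Callable[[str], str],
             items: List[Tuple[str, str]] = ITEMS) -> float:
    """Fraction of items where the model's normalized reply matches the gold label."""
    correct = sum(normalize(ask_model(prompt)) == gold for prompt, gold in items)
    return correct / len(items)


def always_no(prompt: str) -> str:
    """Stand-in 'model' for demonstration; replace with a real model call."""
    return "No."


if __name__ == "__main__":
    print(accuracy(always_no))
```

Matching only on the first word is crude, since a reply like "It depends" is scored as wrong, but that keeps the sketch short and makes it obvious when a model hedges instead of answering.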