Measuring What Matters: Construct Validity in Large Language Model Benchmarks oxrml.com 2 points by Cynddl 7 hours ago
ammaox 2 hours ago A very large review of AI benchmarks that reveals a worrying trend in their effectiveness and scientific rigor.