Discussion about this post

JJ

The point of benchmarking is to create the next training environment that lets the models advance. If there is a new game the LLM pseudo-AIs fail at, it means we need to add another type of logic to the models. It's not a bad thing. I was surprised how badly the major models did at this, but that means there is massive room for improvement, and we're already seeing it in lesser-known models. Give it a couple of weeks and Claude will suddenly be at 50%. Give it a couple of months and all the frontier models will be at 80+%. It's an asymptote from there.

And I highly question the human 100%, as that clearly would not be the case for the average human I've met. They tested with 10 people and took near-best results while presenting them as the average result. Stacking the deck to make humans look good, i.e., cheating. When your benchmark has to weight the human result toward the wisdom-of-the-crowd ideal human, I think we are already at AGI. We are now on the ASI S-curve.

Lois Sharbel

THANK YOU!!! Love Moonshots, and love this short update that I have to read several times to semi-grasp the information you share. You are a gift!
