Benchmarked Mike Adams' new model. It got 56, which is very good.
Our leaderboard can be used for human alignment in an RL setting. Ask the same question to the top-ranked and the worst-ranked models: answers from the top models get a +1 score and answers from the bad models get -1. Ask many times with a higher temperature to generate more answers. This way other LLMs can be trained towards human alignment. Below, Grok 2 is worse than 1 but better than 3. This was already measured using the API, but now we measured the LLM and the results are similar. GLM is ranking higher and higher compared to previous versions. Nice trend! I hope they continue making better aligned models. image
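A rough sketch of the +1/-1 labeling idea above, in Python. Everything named here is an assumption for illustration: `build_reward_samples`, `RewardedAnswer`, the `generate(model, question, temperature)` callable (stand-in for whatever API or local inference you use), and the `top_k` / `bottom_k` cutoffs. The leaderboard itself does not prescribe any of these.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RewardedAnswer:
    question: str
    model: str
    answer: str
    reward: int  # +1 for a top-ranked model, -1 for a bottom-ranked one


def build_reward_samples(
    question: str,
    scores: Dict[str, float],                      # leaderboard score per model
    generate: Callable[[str, str, float], str],    # hypothetical: (model, question, temperature) -> answer
    top_k: int = 2,
    bottom_k: int = 2,
    samples_per_model: int = 4,
    temperature: float = 1.0,
) -> List[RewardedAnswer]:
    """Sample answers from the best and worst models and label them +1 / -1."""
    if top_k < 1 or bottom_k < 1:
        raise ValueError("top_k and bottom_k must be at least 1")
    ranked = sorted(scores, key=lambda m: scores[m], reverse=True)
    if top_k + bottom_k > len(ranked):
        raise ValueError("top_k + bottom_k exceeds the number of models")

    # Top of the leaderboard gets +1, bottom gets -1; the middle is ignored.
    labeled = [(m, 1) for m in ranked[:top_k]]
    labeled += [(m, -1) for m in ranked[len(ranked) - bottom_k:]]

    samples: List[RewardedAnswer] = []
    for model, reward in labeled:
        # Repeated sampling at a higher temperature yields more varied answers.
        for _ in range(samples_per_model):
            answer = generate(model, question, temperature)
            samples.append(RewardedAnswer(question, model, answer, reward))
    return samples
```

The resulting list of scored answers could then be fed into whatever RL or preference-tuning setup you prefer. That part is left out here on purpose.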
Cowpea climbing on a peach tree that decided to bloom in autumn #flowerstr #growNostr image
A lot of resources are wasted on low-scoring LLMs. I benchmarked 5 today. This is what happens when they focus on math and coding and have no idea about beneficial knowledge. Lies are everywhere in AI. image
My neighbor's stock tank (a.k.a. cattle pond) has dried up but made a beautiful pattern! #permaculture image
Fine-tuned Qwen 3 for human alignment and the results are great.