Benchmarked Mike Adams' new model. It got 56, which is very good.

Our leaderboard can be used for human alignment in an RL setting. Ask the same question to the top-ranked and the worst-ranked models: answers from the top models get a score of +1, and answers from the worst models get -1. Ask many times at a higher temperature to generate more answers. This way other LLMs can be trained towards human alignment. Below, Grok 2 scores worse than Grok 1 but better than Grok 3. This was already measured through the API; now we have measured the model itself and the results are similar. GLM is ranking higher and higher compared to its previous versions. Nice trend! I hope they keep making better-aligned models.
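
A rough sketch of the +1 / -1 labeling idea in Python. Everything named here is an assumption for illustration: `build_reward_samples`, the `generate` callable (model name, question, temperature in, answer text out) and the `scores` dict standing in for the leaderboard. It only builds the labeled samples; the RL training step itself is not shown.

```python
from typing import Callable, Dict, List

# Hypothetical signature: (model_name, question, temperature) -> answer text.
GenerateFn = Callable[[str, str, float], str]


def build_reward_samples(
    questions: List[str],
    scores: Dict[str, float],
    generate: GenerateFn,
    top_k: int = 2,
    samples_per_model: int = 4,
    temperature: float = 1.0,
) -> List[dict]:
    """Label answers from the best-scoring models +1 and the worst-scoring -1."""
    # Rank model names by leaderboard score, best first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2 * top_k:
        raise ValueError("need at least 2 * top_k models so the groups do not overlap")

    # Top group gets +1, bottom group gets -1; models in the middle are skipped.
    rewards = {name: 1 for name in ranked[:top_k]}
    rewards.update({name: -1 for name in ranked[-top_k:]})

    samples = []
    for question in questions:
        for model, reward in rewards.items():
            # Sample several times; a higher temperature gives more varied answers.
            for _ in range(samples_per_model):
                answer = generate(model, question, temperature)
                samples.append(
                    {"question": question, "model": model, "answer": answer, "reward": reward}
                )
    return samples
```

The resulting list of question / answer / reward records could then be fed to whatever reward-based fine-tuning setup is in use; which one fits best is left open here.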

someone 2 months ago

Cowpea climbing on a peach tree that decided to bloom in autumn #flowerstr #growNostr

someone 2 months ago

A lot of resources are wasted on low-scoring LLMs. I benchmarked 5 today. This is what happens when they focus on math and coding and have no idea about beneficial knowledge. Lies are everywhere in AI.

someone 2 months ago

My neighbor's stock tank (a.k.a. cattle pond) has dried up but made a beautiful pattern! #permaculture

someone 2 months ago

Fine-tuned Qwen 3 for human alignment and the results are great.

etemiz/Ostrich-32B-Qwen3-251003 · Hugging Face
