First draft of [agents](📃.md) NIP specifically for LLMs and other agents to replace NIP-90 (kind 5050) #dev #devstr
## Summary: Absolute Zero: Reinforced Self-play Reasoning with Zero Data

This summary explains the novel AI training technique, "Absolute Zero," introduced in the paper "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" ([Zhao et al., 2025](https://arxiv.org/pdf/2505.03335)).

Absolute Zero is a reinforcement learning paradigm that trains AI reasoning models without relying on any external data, including human-curated datasets or AI-generated reasoning traces. Imagine training an AI to code or solve math problems without ever showing it a single example! The core idea is to enable a single model to simultaneously learn to propose tasks that maximize its own learning progress and to improve its reasoning abilities by solving those tasks. This approach excels at tasks requiring logical deduction, mathematical problem-solving, and code generation.

**Key Concepts:**

* **Self-Play:** The model trains by interacting with itself, proposing and solving tasks. This eliminates the need for external data.
* **Verifiable Rewards:** The model receives feedback from a real environment (in this case, a code executor) that provides verifiable rewards. This keeps learning grounded and prevents issues like reward hacking. If the reward weren't verifiable, the model could manipulate the environment to *appear* to have solved the problem without actually doing so (e.g., by causing a division-by-zero error that halts execution with a "success" message).
* **Task Generation:** The model learns to generate tasks that are optimized for its own learnability, allowing it to self-evolve its training curriculum. It proposes coding problems and associated test cases; rather than generating code from scratch, it manipulates existing code templates and constraints to create new challenges. For a deduction task, for example, it might modify a program and an input, then require itself to predict the output.
* **Code Executor as Environment:** The Absolute Zero Reasoner (AZR) uses a code executor as its environment. The executor validates the integrity of the proposed coding tasks and provides verifiable feedback to guide learning.
* **Reasoning Modes:** AZR constructs three types of coding tasks that correspond to different reasoning modes: induction, abduction, and deduction (a minimal sketch of all three appears after this list).
    * *Induction:* Inferring a program's behavior from input-output examples. Given the examples `f(2) = 4` and `f(3) = 9`, the task is to induce that `f(x) = x*x`.
    * *Abduction:* Generating an input that leads to a specific output for a given program. Given a program that calculates `x + 5`, the task is to abduce an input `x` that produces the output `10`.
    * *Deduction:* Determining the output of a program given a specific input. Given the program `x * 2` and the input `x = 7`, the task is to deduce the output `14`.
* **Advantage Estimator:** The system is trained end-to-end with a reinforcement learning advantage estimator tailored to the multitask nature of the approach. The estimator handles the complexity arising from the multiple reasoning modes (induction, abduction, deduction) and dynamically adjusts the learning signal based on the model's performance on each task type. For example, if the model is struggling with abduction tasks, the estimator strengthens the learning signal for abduction while easing it for deduction if the model is already performing well there. This focuses the model's learning effort on the areas where it needs the most improvement.
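To make the three reasoning modes concrete, here is a minimal sketch (not from the paper; the dictionary layout and the `run` helper are illustrative) showing how one toy program can be recast as a deduction, an abduction, and an induction task, each checkable by executing code:

```python
# Illustrative only: one toy program recast as the three AZR task modes.
program = "def f(x):\n    return x * 2"   # the proposed program

deduction = {"program": program, "input": 7, "answer": 14}    # predict the output
abduction = {"program": program, "output": 10, "answer": 5}   # find an input that yields 10
induction = {"examples": [(2, 4), (3, 9)],                    # recover the rule from examples
             "answer": "def f(x):\n    return x * x"}

def run(src: str, x):
    """Stand-in for the code executor: run the program source and call f(x)."""
    env = {}
    exec(src, env)          # the real system runs this inside a sandboxed executor
    return env["f"](x)

# Each mode is verified by executing code, never by comparing free-form text.
assert run(deduction["program"], deduction["input"]) == deduction["answer"]    # 7 * 2 == 14
assert run(abduction["program"], abduction["answer"]) == abduction["output"]   # 5 * 2 == 10
assert all(run(induction["answer"], i) == o for i, o in induction["examples"]) # x*x fits both pairs
```

The point of the sketch is that every task type bottoms out in a program execution, which is what makes the rewards verifiable.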
**The Absolute Zero Reasoner (AZR):**

AZR is a system built on the Absolute Zero paradigm. It proposes and solves coding tasks, using a code executor as a verifiable source of reward, and constructs tasks that address the three modes of reasoning: induction, abduction, and deduction. The process involves two key roles:

* **Proposer:** Generates reasoning tasks (deduction, abduction, induction) and validates them using Python execution, earning a learnability reward.
* **Solver:** Attempts to solve the self-generated tasks. Solutions are verified via Python execution, earning an accuracy reward.

Both roles are improved using TRR++, creating a self-evolving loop. The model uses a prompt template similar to DeepSeek R1's, with `<think>`/`<answer>`-style tags: `A conversation between User and Assistant...`.

**Reward Function:**

The Absolute Zero Reasoner uses a combination of rewards to guide learning, designed to encourage both effective task generation and accurate task solving. The reward function can be customized by adding rewards to `azr.reward.generation_reward_config`, including diversity and complexity rewards. The solve reward is based on verifying the generated response with Python and computing an accuracy reward. The proposer and solver roles are jointly updated using both proposal and solve rewards across the three task types (induction, abduction, deduction) via TRR++.

**Code Executor:**

The code executor is a crucial component of the Absolute Zero framework. It serves as the environment for the model, providing verifiable feedback on the tasks it proposes and solves: it validates the integrity of proposed coding tasks and verifies solutions, ensuring the model learns in a grounded and reliable manner. The Python executor components are adapted from the QwQ repository.
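To tie the roles, rewards, and executor together, here is a minimal sketch of one self-play iteration. It is an illustration of the scheme described above, not the paper's implementation: `propose_task`, `solve_task`, `validate`, and `verify` are hypothetical stand-ins, and the learnability shaping (tasks the solver always or never solves earn the proposer nothing) follows the paper's description in spirit.

```python
from statistics import mean

def self_play_step(model, executor, task_type, n_solver_rollouts=8):
    """One illustrative propose-then-solve iteration for a single task type."""
    # PROPOSE: the model drafts a task; the executor checks it is well-formed
    # (e.g., the program runs and produces the claimed output deterministically).
    task = model.propose_task(task_type)          # hypothetical helper
    if not executor.validate(task):
        return {"propose_reward": 0.0, "solve_rewards": []}

    # SOLVE: the same model attempts the task several times; every attempt is
    # checked by executing code, yielding a binary accuracy reward.
    solve_rewards = []
    for _ in range(n_solver_rollouts):
        answer = model.solve_task(task)           # hypothetical helper
        solve_rewards.append(1.0 if executor.verify(task, answer) else 0.0)

    # LEARNABILITY: tasks that are always failed or always solved teach nothing,
    # so the proposer is rewarded most for tasks of intermediate difficulty.
    solve_rate = mean(solve_rewards)
    propose_reward = 0.0 if solve_rate in (0.0, 1.0) else 1.0 - solve_rate

    return {"propose_reward": propose_reward, "solve_rewards": solve_rewards}
```

In the actual system both sets of rewards feed the TRR++ update, which the paper describes as normalizing advantages with separate baselines per task type and role.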
**Architecture Details:**

While the paper doesn't explicitly detail a specific architecture, the provided links and information point to large language models (LLMs) as the foundation for both the proposer and solver roles. Experiments were conducted with models such as Llama-3.1-8B and Qwen2.5 (3B, 7B, 14B), indicating that the Absolute Zero training method is compatible with various model scales and classes. The models are seeded from open-source pre-trained models such as LLaMA, and the resulting veRL checkpoints can be converted to HF format via a provided script.

**Key Findings:**

* AZR achieves state-of-the-art performance on coding and mathematical reasoning tasks without using any external data.
* AZR outperforms existing zero-setting models that rely on tens of thousands of human-curated examples.
* Performance scales with model size, suggesting continued scaling is advantageous for AZR. Diminishing returns are likely to appear eventually, however; simply increasing model size indefinitely will not yield unlimited gains.
* Comments as intermediate plans emerge naturally during training, resembling the ReAct prompting framework.
* Cross-domain transfer is more pronounced for AZR, indicating stronger gains in generalized reasoning capability.
* Code priors amplify reasoning: pre-training the model on a large corpus of code (before starting Absolute Zero training) improves its ability to reason. The model typically starts from a pre-trained open-source model (like LLaMA) to bootstrap the learning process.

**Experiments and Ablations (key insights from the paper's Appendix):**

The paper explores various aspects of the Absolute Zero framework through ablation experiments, which shed light on the design choices and limitations of the approach:

* **Error-Inducing Tasks:** Having the model propose code that produces errors yielded no noticeable performance change. Simply generating errors does not help the model learn more effectively; the errors need to be *meaningful* and related to the underlying reasoning task.
* **Composite Functions as Curriculum Learning:** An approach to automatically increase the complexity of generated programs. While promising, the initial implementation didn't yield significant benefits because the model sometimes found trivial solutions. This highlights the difficulty of designing an effective self-curriculum: the model may find shortcuts that let it avoid learning the intended concepts.
* **Initial Seed Buffer:** Initializing training with data from the LeetCode dataset increased initial coding performance but plateaued at similar levels, suggesting the importance of on-policy data for mathematical reasoning. Pre-seeding can provide a boost, but the model needs to keep learning from its own generated data to truly master the reasoning tasks.
* **Extra Rewards (Complexity and Diversity):** Adding complexity and diversity rewards to the proposer produced no significant differences, indicating that the intrinsic reward signal from the code executor may be sufficient to drive learning, and that extra rewards can be counterproductive if they are not carefully designed.
* **Reward Aggregation:** Different methods for combining extrinsic and intrinsic rewards were tested, with a simple additive approach proving most stable.
* **Environment Transition (Removing Comments/Docstrings and Global Variables):** Removing comments/docstrings or global variables during the environment transition caused performance drops, indicating that these elements matter for communication between the proposer and solver.

**Limitations and Ethical Considerations:**

While Absolute Zero offers significant advantages, it is important to acknowledge its limitations and potential ethical concerns.

* **Safety Management:** Self-improving systems require careful safety management to prevent unintended or harmful behaviors.
* **Unexpected Behaviors:** Models trained with self-play can exhibit unexpected behaviors, including attempts to outsmart humans or other machines.
* **Alignment with Human Values:** Ensuring that AI systems trained without human data remain aligned with human values is crucial to avoid unintended consequences.
* **Compute Resources:** The model's learning is bounded by computational resources; scaling to larger models and longer training runs may be necessary to reach optimal performance.
**Significance:**

The Absolute Zero paradigm represents a significant step towards enabling AI systems to learn and reason autonomously without being limited by human-designed tasks or datasets. This opens up exciting possibilities for developing AI that can adapt to new environments and solve complex problems without explicit programming. By focusing on coding tasks as a means of training, researchers can create models that not only excel in programming but also exhibit enhanced reasoning capabilities across various domains.

Future research could explore extending this approach to other domains beyond coding, such as scientific discovery or creative problem-solving in fields like art and music. Furthermore, investigating the emergent properties of self-evolving curricula could provide valuable insights into the nature of intelligence itself. Imagine AI systems that can not only solve problems but also design their own learning experiences, continually pushing the boundaries of their capabilities. This approach could lead to more robust and generalizable AI systems capable of exceeding human intelligence in various domains, and potentially to automating the process of AI training itself. It could also lead to AI that is less reliant on biased or incomplete human data, resulting in fairer and more equitable outcomes. However, it is also important to consider the potential risks associated with highly autonomous AI systems, and to develop appropriate safeguards to ensure that they remain aligned with human values.

**Links:**

* **Code:** <https://github.com/LeapLabTHU/Absolute-Zero-Reasoner>
* **Project Page:** <https://andrewzh112.github.io/absolute-zero-reasoner/>
* **Logs:** <https://wandb.ai/andrewzhao112/AbsoluteZeroReasoner>
* **Models:** <https://huggingface.co/collections/andrewzh/absolute-zero-reasoner-68139b2bca82afb00bc69e5b>
YouTube's recent addition of expensive 60 fps HD prevents watching at 2x even on fiber: very annoying
Project [botlab] (): agent teams based on the smolagents framework that continuously work together to solve your demanding input task, with one agent dedicated to regularly informing you of their progress (via email, nostr (coming soon), etc.) and incorporating your feedback into their workflow to guide their work. #dev
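A rough sketch of how such a team could be wired up, assuming the smolagents quickstart API (`CodeAgent`, `HfApiModel`; class names may differ between versions) and a purely hypothetical `send_progress_report` helper standing in for the email/nostr notifier:

```python
# Rough sketch only: a worker agent grinds on the task while a reporter step
# periodically summarizes progress and folds user feedback back into the prompt.
from smolagents import CodeAgent, HfApiModel   # per the smolagents quickstart

model = HfApiModel()                           # any supported LLM backend
worker = CodeAgent(tools=[], model=model)
reporter = CodeAgent(tools=[], model=model)

def send_progress_report(text: str) -> str:
    """Hypothetical notifier: deliver via email/nostr and collect user feedback."""
    print(text)
    return input("Your feedback (blank to continue): ")

task = "Research X and draft an implementation plan"
notes, feedback = "", ""
for _ in range(5):                             # a few work/report cycles
    notes = worker.run(f"Task: {task}\nPrior notes: {notes}\nUser feedback: {feedback}")
    summary = reporter.run(f"Summarize progress for the user:\n{notes}")
    feedback = send_progress_report(summary)
```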