Notes on the Long Tasks METR paper, from a HCAST task contributor
Published on May 4, 2025 11:17 PM GMT

I contributed one (1) task to HCAST, which was used in METR’s Long Tasks paper. This gave me some thoughts I feel moved to share.

Regarding Baselines and Estimates

METR’s tasks have two sources for how long they take humans: most of those used in the paper were Baselined using playtesters under persistent scrutiny, and some were Estimated by METR.

I don’t quite trust the Baselines. Baseliners were allowed/incentivized to drop tasks they weren’t making progress with, and were – mostly, effectively, there’s some nuance here I’m ignoring – cut off at the eight-hour mark; Baseline times were found by averaging time taken for successful runs. This suggests Baseline estimates will be biased to be at least slightly too low, especially for more difficult tasks.[1]

I really, really don’t trust the Estimates[2]. My task was never successfully Baselined, so METR’s main source for how long it would take – aside from the lower bound from it never being successfully Baselined – is the number of hours my playtester reported. I was required to recruit and manage my own playtester, and we both got paid more the higher that number was: I know I was completely honest, and I have a very high degree of trust in the integrity of my playtester, but I remain disquieted by the financial incentive for contractors and subcontractors to exaggerate or lie.

When I reconstructed METR’s methodology and reproduced their headline results, I tried filtering for only Baselined tasks to see how that changed things. My answer . . . is that it almost entirely didn’t. Whether you keep or exclude the Estimated tasks, the log-linear regression still points at AIs doing month-long tasks in 2030 (if you look at the overall trend) or 2028 (if you only consider models since GPT-4o). My tentative explanation for this surprising lack of effect is that A) METR were consistently very good at adjusting away bias in their Estimates and/or B) most of the Estimated tasks were really difficult ones where AIs never won, so errors here had negligible effect on the shapes of the logistic regression curves[3].
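For readers who want to see the shape of this reanalysis, here is a minimal sketch of the kind of two-stage fit described above: a per-model logistic regression of task success against log task length to get a 50%-success time horizon, then a log-linear trend of those horizons over release dates. The column names (`model`, `release_date`, `human_minutes`, `success`, `baseline_type`) and the `baselined_only` filter are hypothetical placeholders, not METR's actual schema or code.

```python
# Minimal sketch, assuming a dataframe of per-run records with hypothetical columns.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

def fifty_percent_horizon(runs: pd.DataFrame) -> float:
    """Fit P(success) ~ sigmoid(a*log2(minutes) + b); return minutes at p = 0.5.

    Assumes the model has both successes and failures recorded (cf. footnote 5's
    dropped models, whose horizon would be undefined)."""
    X = np.log2(runs["human_minutes"].to_numpy()).reshape(-1, 1)
    y = runs["success"].to_numpy().astype(int)
    clf = LogisticRegression().fit(X, y)
    return float(2 ** (-clf.intercept_[0] / clf.coef_[0, 0]))

def horizon_doubling_time(runs: pd.DataFrame, baselined_only: bool = False) -> float:
    """Log-linear fit of 50%-horizon against release date; returns days per doubling."""
    if baselined_only:
        runs = runs[runs["baseline_type"] == "baselined"]
    horizons = runs.groupby("model").apply(fifty_percent_horizon)
    dates = runs.groupby("model")["release_date"].first().map(pd.Timestamp.toordinal)
    trend = LinearRegression().fit(
        dates.loc[horizons.index].to_numpy().reshape(-1, 1),
        np.log2(horizons.to_numpy()),
    )
    return 1.0 / trend.coef_[0]
```

Extrapolating the fitted trend forward to a month-long horizon, with and without `baselined_only=True`, is the comparison the paragraph above is describing.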
Regarding Task Privacy

HCAST tasks have four levels of Task Privacy:

“fully_private” means the task was only seen and used by METR.
“public_problem” means the task is available somewhere online (so it could have made it into the training data), but the solution isn’t.
“public_solution” means the task and solution are both available somewhere online.
“easy_to_memorize” means the solution is public and seemed particularly easy for an LLM to memorize from its training data.

METR’s analysis heavily depends on less-than-perfectly-Private tasks. When I tried redoing it on fully_private tasks only[4][5], the Singularity was rescheduled for mid-2039 (or mid-2032 if you drop everything pre-GPT-4o); I have no idea to what extent this is a fact about reality vs about small sample sizes resulting in strange results.

Also, all these privacy levels have “as far as we know” stapled to the end. My task is marked as fully_private, but if I’d reused some/all of the ideas in it elsewhere . . . or if I’d done that and then shared a solution . . . or if I’d done both of those things and then someone else had posted a snappy and condensed summary of the solution . . . it’s hard to say how METR could have found out or stopped me[6]. The one thing you can be sure of is that LLMs weren’t trained on tasks which were created after they were built (i.e. models before GPT-4o couldn’t have looked at my task because my task was created in April 2024)[7].

In Conclusion

The Long Tasks paper is a Psychology paper[8]. It’s the good kind of Psychology paper: it focuses on what minds can do instead of what they will do, it doesn’t show any signs of p-hacking, the inevitable biases seem to consistently point in the scarier direction so readers can use it as an upper bound[9], and it was written by hardworking clever people who sincerely care about reaching the right answer. But it’s still a Psychology paper, and should be taken with appropriate quantities of salt.

[1] A hypothetical task which takes a uniformly-distributed 1-10 hours would have about the same Baselined time estimate as one which takes a uniformly-distributed 1-100 hours, conditional on them both having Baselined time estimates.
[2] I particularly don’t trust the Estimates for my task, because METR’s dataset says the easy version of it takes 18 hours and the hard version takes 10 hours, despite the easy version being easier than the hard version (due to it being the easy version).
[3] Note that this does not mean they will continue to have negligible effects on next year’s agents.
[4] I also filtered out all non-HCAST tasks: I wasn’t sure exactly how Private they were, but given that METR had been able to get their hands on the problems and solutions they couldn’t be that Private.
[5] To do this part of the reanalysis I dropped "GPT-2", "davinci-002 (GPT-3)" and "gpt-3.5-turbo-instruct", as none of these models were ever recorded succeeding on a fully_private task, making their task horizon undefined (ignoring the terminal-ful of warnings and proceeding anyway led to my modelling pipeline confusedly insisting that the End had happened in mid-2024 and I'd just been too self-absorbed to notice).
[6] I didn't do any of these things. I just take issue with how easily I’m implicitly being trusted.
[7] If this is the reason for the gradient discontinuity starting at GPT-4o I’m going to be so mad.
[8] The fact that it’s simultaneously a CompSci paper does not extenuate it.
[9] I sincerely mean this part. While I’m skeptical of AI Doom narratives, I’m extremely sympathetic to the idea that “this is safe” advocates are the ones who need airtight proofs, while “no it isn’t” counterarguers should be able to win just by establishing reasonable doubt.

https://www.lesswrong.com/posts/5CGNxadG3JRbGfGfg/notes-on-the-long-tasks-metr-paper-from-a-hcast-task
Why I am not a successionist
Published on May 4, 2025 7:08 PM GMT

Utilitarianism implies that if we build an AI that successfully maximizes utility/value, we should be ok with it replacing us. Sensible people add caveats related to how hard it’ll be to determine the correct definition of value or check whether the AI is truly optimizing it.

As someone who often passionately rants against the AI successionist line of thinking, the most common objection I hear is "why is your definition of value so arbitrary as to stipulate that biological meat-humans are necessary?" This is missing the crux—I agree such a definition of moral value would be hard to justify.

Instead, my opposition to AI successionism comes from a preference toward my own kind. This is hardwired in me by biology. I prefer my family members to randomly-sampled people with similar traits. I would certainly not elect to sterilize or kill my family members so that they could be replaced with smarter, kinder, happier people. The problem with successionist philosophies is that they deny this preference altogether. It’s not as if they are saying "the end of humanity is completely inevitable; at least these other AI beings will continue existing," which I would understand. Instead, they are saying we should be happy with, and choose, the path of human extinction and replacement with "superior" beings.

That said, there’s an extremely gradual version of human improvement that I think is acceptable, in which each generation endorses and becomes the next and is not being "replaced" at any particular instant. This is akin to our evolution from chimps, and is a different kind of process from the chimps raising llamas for meat, the llamas eventually becoming really smart and morally good, peacefully sterilizing the chimps, and taking over the planet.

Luckily I think AI X-risk is low in absolute terms, but if this were not the case I would be very concerned about how a large fraction of the AI safety and alignment community endorses humanity being replaced by a sufficiently aligned and advanced AI, and would prefer this to a future where our actual descendants spread over the planets, albeit at a slower pace and with fewer total objective "utils". I agree that if human extinction is near-inevitable it’s worth trying to build a worthy AI successor, but my impression is that many think the AI successor can be actually "better" such that we should choose it, which is what I’m disavowing here.

Some people have noted that if I endorse chimps evolving into humans, I should endorse an accurate but much faster simulation of this process. That is, if my family and I were uploaded to a computer and our existences and evolution were simulated at enormous speed, I should be ok with our descendants coming out of the simulation and repopulating the world. Of course this is very far from what most AI alignment researchers are thinking of building, but indeed if I thought there were definitely no bugs in the simulation, and that the uploads were veritable representations of us living equally-real lives at a faster absolute speed but equivalent in clock-cycles/FLOPs, perhaps this would be fine. Importantly, I value every intermediate organism in this chain, i.e. I value my children independently of their capacity to produce grandchildren. And so for this to work, their existence would have to be simulated fully.

Another interesting thought experiment is whether I would support gene-by-gene editing myself into my great-great-great-…-grandchild.
Here, I am genuinely uncertain, but I think maybe yes, under the condition of being able to seriously reflect on and endorse each step. In reality I don't think simulating such a process is at all realistic, or related to how actual AI systems are going to be built, but it's an interesting thought experiment. We already have a reliable improvement-plus-reflection process provided by biology and evolution, and so unless that process is necessarily doomed, I believe the risk of messing up is too high to justify seeking a better, faster version.

https://www.lesswrong.com/posts/MDgEfWPrvZdmPZwxf/why-i-am-not-a-successionist
Overview: AI Safety Outreach Grassroots Orgs
Published on May 4, 2025 5:39 PM GMT

We’ve been looking for joinable endeavors in AI safety outreach over the past weeks and would like to share our findings with you. Let us know if we missed any and we’ll add them to the list. For comprehensive directories of AI safety communities spanning general interest, technical focus, and local chapters, check out the full post:

https://www.lesswrong.com/posts/hmds9eDjqFaadCk4F/overview-ai-safety-outreach-grassroots-orgs
Interpretability Will Not Reliably Find Deceptive AI
Published on May 4, 2025 4:32 PM GMT

(Disclaimer: Post written in a personal capacity. These are personal hot takes and do not in any way represent my employer's views.)

TL;DR: I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise.[1] Interpretability seems a valuable tool here and remains worth investing in, as it will hopefully increase the reliability we can achieve. However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy. It is not the one thing that will save us, and it still won’t be enough for high reliability.

Introduction

There’s a common, often implicit, argument made in AI safety discussions: interpretability is presented as the only reliable path forward for detecting deception in advanced AI – among many other sources, it was argued for in Dario Amodei’s recent “The Urgency of Interpretability”. I disagree with this claim. The conceptual reasoning is simple and compelling: a sufficiently sophisticated deceptive AI can say whatever we want to hear, perfectly mimicking aligned behavior externally. But faking its internal cognitive processes – its "thoughts" – seems much harder. Therefore, goes the argument, we must rely on interpretability to truly know if an AI is aligned. I am concerned this line of reasoning represents an isolated demand for rigor.

Should we give up on interpretability? No! I still think it has the potential to add a lot of value, and we will have better safeguards with interpretability as part of our portfolio. Even if it adds no value for making superintelligence safer[2], if it can add value for pre-superintelligence transformative systems that would be enough to justify investment. I just think that we should be more pragmatic about interpretability’s likely impact, and accept that while we can generally improve our safeguards we will likely not reach high reliability.

High Reliability Seems Unattainable

Based on the current state and foreseeable trajectory of the field without major paradigm shifts, I think that neither interpretability nor black box methods offer a high reliability[3] path to safeguards for superintelligence, in terms of evaluation or monitoring. This is due to fairly fundamental limitations of both methods, unless there are substantial breakthroughs, e.g. via using pre-superintelligence systems. It also stems from a general prior that being highly confident in any complex property of a complex system in a range of unpredictable situations is fairly implausible.

However, I am pretty pessimistic in general about reliable safeguards against superintelligence with any methods, given how exceptionally hard it is to reason about how a system far smarter than me could evade my plans. As I see it we must either not create superintelligence, rely on pre-superintelligent automated researchers to find better methods, or deploy without fully reliable safeguards, roll the dice, and do as much as we can now to improve our odds.[4]

This doesn't mean we should give up! It means we need a pragmatic perspective. We should aim to build the best possible monitoring and evaluation portfolio we can, using all available tools, while accepting that high reliability might be out of reach.
Interpretability can add a valuable source of de-correlated signal, or augment black box methods. The goal shifts from achieving near-certainty to maximizing the chances of catching misalignment, making deception harder and riskier for the AI, even if we can't guarantee detection. Further, methods that add significant safety to pre-superintelligent transformative systems still add significant value even if they don’t scale to superintelligence – one of the key insights behind the AI control agenda. Early transformative systems seem likely to be far easier to evaluate and control, but may still be very dangerous if unchecked, and extremely useful if under control. And when evaluating methods according to how much they help with early transformative systems, interpretability is just one among several reasonable methods (and probably not my first choice – things like chain of thought monitoring seem like lower hanging fruit).

Why Won’t Interpretability be Reliable?

It’s hard to be confident here – while it’s obvious that current techniques are unreliable, this is a question of forecasting future research progress, a highly questionable endeavour. It’s very hard to say what might be possible with a paradigm shift in how LLMs are trained, or how we do interpretability. But here are some of the issues that I predict are not going to go away, even as the field progresses:

We miss things: We're still figuring out the basics. Concepts like superposition mean important features can be smeared across many neurons in non-obvious ways. Our tools like SAEs or transcoders have error and often miss things. We can misinterpret the features found, or only explain a fraction of the model's behavior.

It’s hard to measure progress: We don’t have a ground truth for what’s actually going on, and must inevitably use a range of proxies of varying quality. This means it’s hard to ever be confident you really understand what’s happening.

The Long Tail Problem: Models seem to have both some clean, interpretable structure and a massive, messy "long tail" of weird heuristics, special case handlers, and half-learned, noisy behaviours. I’d personally guess that important capabilities, including deception, are in the clean part, but can’t say for sure!

How to prove a negative: We can find evidence for or against a hypothesis, but rigorously proving the absence of deception circuits seems incredibly hard. How do you know you didn't just miss it? How much of the model do you need to understand? 90%? 99%? 99.99%? The required level of certainty for high-stakes deployment feels far beyond our current capabilities. I’m more optimistic about interpretability as a way to potentially demonstrate misalignment, but failure to do so tells us little.

Scalability Questions: Most of our best results are on toy models or yesterday's LLMs. I think that scaling is a tractable problem, but it likely involves a bunch of LLM labour assisting, another source of error and uncertainty.

Potential for Obfuscation: While maybe harder than faking outputs, a smart AI could potentially learn to structure its internal computations to be opaque to our specific interpretability tools.
The Potential of Black-Box Methods

Black-box methods, i.e. things that engage with the model’s inputs and outputs, have clear issues: a sufficiently capable model may be able to tell when it is being tested and behave accordingly (see https://www.alignmentforum.org/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment). This is further from my areas of expertise than interpretability, but some thoughts on directions I find promising:

Reading the chain of thought (with another LLM) and looking for sketchy intent[5]
Editing the model’s chain of thought so it e.g. thinks it concluded that it’s not in an evaluation
Asking questions to many copies of the same model and looking for inconsistencies while they can’t coordinate
Finetuning the model to e.g. be more truthful or forthcoming
Creating honeypots that try to trick the model into revealing its intent

The Role of Interpretability

So, where does interpretability fit into this pragmatic portfolio? I see several key roles, viewing it not as a standalone solution, but as a valuable component. Firstly, safeguards can be categorised as follows:

Evaluation: We take a system and perform extensive tests on it for undesired properties, like a tendency towards misalignment or sandbagging.
Monitoring: When we deploy a system we are always running some checks at inference, which try to catch and mitigate bad behaviour. Importantly, this needs to be cheap enough to run in production, while evaluations can be more expensive and slow.

A Layer of Swiss Cheese: I think the right monitoring schemes have several layers of defence (a la the Swiss cheese model), with different cost-to-accuracy trade-offs. Interpretability techniques like probes for known dangerous features are extremely cheap and perform reasonably well (see https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks#Using_SAEs_for_OOD_Probing), and can serve as a good first layer of defence, which calls more expensive tests (e.g. an LLM reading the chain of thought) if they fire. Interpretability techniques like SAEs or cross-layer transcoders may have the potential to be more expensive but accurate ways to audit what’s going on and look for anything unexpected. Even if interpretability is not more accurate, so long as it has some signal and is de-correlated, then it adds value, even if it’ll miss many things.
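To make the layered idea concrete, here is a minimal sketch, under stated assumptions, of a "cheap probe first, expensive check only on a hit" monitor. Every name here (the probe weights, the `expensive_cot_check` callable, the threshold) is a hypothetical placeholder for illustration, not an existing system or API, and not the author's implementation.

```python
# Sketch of a two-layer ("Swiss cheese") monitor: a linear probe on activations
# runs on every request; only when it fires do we pay for an expensive check,
# e.g. another LLM reading the chain of thought.
import numpy as np

def cheap_probe_score(activations: np.ndarray, probe_weights: np.ndarray,
                      probe_bias: float) -> float:
    """Linear probe for a known-dangerous feature direction; costs ~one dot product."""
    logit = float(activations @ probe_weights + probe_bias)
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> probability-like score

def layered_monitor(activations: np.ndarray, chain_of_thought: str,
                    probe_weights: np.ndarray, probe_bias: float,
                    expensive_cot_check, probe_threshold: float = 0.2) -> str:
    """Always run the cheap layer; escalate to the expensive layer only on a hit."""
    if cheap_probe_score(activations, probe_weights, probe_bias) < probe_threshold:
        return "pass"  # cheap layer saw nothing; no further cost incurred
    # Escalation layer: e.g. an LLM reads the chain of thought for sketchy intent.
    verdict = expensive_cot_check(chain_of_thought)
    return "flag_for_human_review" if verdict == "suspicious" else "pass"
```

The design point the post is making is cost-driven: the probe is cheap enough to run on every production request, and the threshold trades false positives (wasted expensive checks) against false negatives (missed escalations).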
Enhancing Black-Box Evaluations: Interpretability might make our external checks more effective. For example:

Could we use steering vectors identified via interpretability to manipulate whether the model thinks it’s in an eval? Maybe we can test it more effectively if we can trick it into thinking it's not in a honeypot or evaluation setting.
Can we use interpretability to better elicit information from token forcing? We can make the model say “my secret plan is”, and even if it tries to internally suppress awareness or obfuscate the output, perhaps interpretability can prevent that?
Interpretability may provide a sanity check that our black box methods are not missing something, e.g. whether a model is passing an eval only because it knows it’s in one and is deliberately underperforming.

Debugging mysterious behaviour: One particularly interesting form of evaluation (or offline monitoring) is taking particularly weird or unexpected examples of model behaviour, e.g. it seeming to put vulnerabilities in critical code, and trying to debug what’s going on and see if it’s real misalignment or not. I think interpretability methods may prove very good at raising unexpected hypotheses and piecing together what’s happening. I expect these kinds of potential warning shots to happen, and anything that can add more signal here would be great. Even if we don’t trust the interpretability method fully, the hypotheses generated can be verified by other means.

These applications aren't guaranteed to work or be ungameable, but they represent concrete ways internal understanding could augment external validation, creating a more robust overall system than either alone.

Conclusion

Let's stop framing interpretability as the one true path to reliably detecting AI deception, a view often born from an isolated demand for rigor against flawed black-box methods. Neither approach offers high reliability against superintelligence without fundamental breakthroughs, but flawed methods can still be useful. For the powerful pre-SI systems that concern us now, let's focus on building the best pragmatic portfolio of imperfect defenses we can. Interpretability is a useful part of that portfolio – offering unique angles of analysis and potential synergies with other methods – but it's one tool among many, not a substitute for a comprehensive, defense-in-depth strategy. We need to leverage every imperfect signal we can get, from both inside and outside the box.

Thanks to my co-author Gemini 2.5 Pro.

[1] More specifically, I basically think the only way we might get high reliability is by using pre-superintelligent systems that are sufficiently aligned/controlled to either do a ton of automated research and find some breakthroughs, or by using them as a core part of a monitoring or scalable oversight scheme. But the human research trajectory is hard.
[2] Defining “safety” is kinda messy; for the purposes of this post you can interpret it as evaluating ‘is the model thinking misaligned thoughts’ or monitoring for ‘is the model currently knowingly taking high-stakes actions against what we want’.
[3] I’m not sure what the best precise definition of highly reliable is here, maybe 99.9% chance of stopping a catastrophic outcome? It’s not key to my argument.
[4] I’ll avoid taking a position on the best plan…
[5] Obviously, …

https://www.lesswrong.com/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai
Where have all the tokens gone?
Published on May 4, 2025 1:52 PM GMT

In 2001, Lant Pritchett asked, “Where has all the education gone?”

https://www.lesswrong.com/posts/LGzYyiBexBWNqpdsF/where-have-all-the-tokens-gone
GTFO of the Social Internet Before you Can't: The Miro & Yindi Story
Published on May 4, 2025 1:08 AM GMT

https://www.lesswrong.com/posts/7JatWwAffv2XNgFR6/gtfo-of-the-social-internet-before-you-can-t-the-miro-and
What is your favorite podcast?
Published on May 3, 2025 9:25 PM GMT

I'm looking for podcast recommendations from the LessWrong community. If you have a favorite podcast, please share:

What is the podcast?
What specifically do you like about it?
What concepts or topics does it teach particularly well?
Why do you consider its content trustworthy or reliable?
Who would you recommend this podcast to? (e.g., target audience, specific interests)

Please keep each top-level answer limited to one podcast recommendation. This will allow others to vote on individual suggestions effectively. If you want to recommend multiple podcasts, please create a separate top-level answer for each one.

https://www.lesswrong.com/posts/JmdPxsydgKAMm7R5N/what-is-your-favorite-podcast
What's up with AI's vision
Published on May 3, 2025 1:23 PM GMT

This week I've read 2 pieces of interesting information: apparently current models (in particular o3) are really good at GeoGuessr.

https://www.lesswrong.com/posts/EjgTgooax7oypqdS2/what-s-up-with-ai-s-vision
Sparsity is the enemy of feature extraction (ft. absorption)
Published on May 3, 2025 10:13 AM GMT

Sparse Autoencoders (and other related feature extraction tools) often optimize for sparsity to extract human-interpretable latent representations from a model's activation space. We show analytically that sparsity naturally leads to feature absorption in a simplified untied SAE, and discuss how this makes SAEs less trustworthy to use for AI safety, with some ongoing efforts to fix this. This might be obvious to people working in the field, but we ended up writing a proof sketch so we're putting it out here. Produced as part of the

https://www.lesswrong.com/posts/RxtaNrynuzPf4JePE/sparsity-is-the-enemy-of-feature-extraction-ft-absorption
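For readers unfamiliar with the setup the teaser above refers to, here is a minimal sketch of an untied SAE trained with an L1 sparsity penalty. The dimensions, ReLU encoder, and loss weighting are illustrative assumptions, not the authors' exact construction from the linked post.

```python
# Minimal untied sparse autoencoder sketch: separate encoder and decoder weights,
# reconstruction loss plus an L1 penalty on feature activations. The L1 term is
# the sparsity pressure the post argues drives feature absorption.
import torch
import torch.nn as nn

class UntiedSAE(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # W_enc, b_enc
        self.decoder = nn.Linear(d_hidden, d_model)   # W_dec, b_dec (not tied to encoder)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the activation vector
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

Feature absorption, in this framing, is one way the optimizer can lower the L1 term: a general feature's firing gets "absorbed" into a more specific feature that already fires on those inputs, so the general latent stops activating where a human-interpretable feature would.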
Updates from Comments on "AI 2027 is a Bet Against Amdahl's Law"
Published on May 2, 2025 11:52 PM GMT

https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law
https://www.lesswrong.com/posts/FFKnWk2MGJmyQWEsd/updates-from-comments-on-ai-2027-is-a-bet-against-amdahl-s