377 episodes

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

The Nonlinear Library: Alignment Forum
The Nonlinear Fund

    • Education


    AF - AXRP Episode 29 - Science of Deep Learning with Vikrant Varma by DanielFilan


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AXRP Episode 29 - Science of Deep Learning with Vikrant Varma, published by DanielFilan on April 25, 2024 on The AI Alignment Forum.
    In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand.
    Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes.
    What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.
    Topics we discuss:
    Challenges with unsupervised LLM knowledge discovery, aka contra CCS
    What is CCS?
    Consistent and contrastive features other than model beliefs
    Understanding the banana/shed mystery
    Future CCS-like approaches
    CCS as principal component analysis
    Explaining grokking through circuit efficiency
    Why research science of deep learning?
    Summary of the paper's hypothesis
    What are 'circuits'?
    The role of complexity
    Many kinds of circuits
    How circuits are learned
    Semi-grokking and ungrokking
    Generalizing the results
    Vikrant's research approach
    The DeepMind alignment team
    Follow-up work
    Daniel Filan: Hello, everybody. In this episode I'll be speaking with Vikrant Varma, a research engineer at Google DeepMind, and the technical lead of their sparse autoencoders effort. Today, we'll be talking about his research on problems with contrast-consistent search, and also explaining grokking through circuit efficiency. For links to what we're discussing, you can check the description of this episode and you can read the transcript at axrp.net.
    All right, well, welcome to the podcast.
    Vikrant Varma: Thanks, Daniel. Thanks for having me.
    Challenges with unsupervised LLM knowledge discovery, aka contra CCS
    What is CCS?
    Daniel Filan: Yeah. So first, I'd like to talk about this paper. It is called Challenges with Unsupervised LLM Knowledge Discovery, and the authors are Sebastian Farquhar, you, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. This is basically about this thing called CCS. Can you tell us: what does CCS stand for and what is it?
    Vikrant Varma: Yeah, CCS stands for contrast-consistent search. I think to explain what it's about, let me start from a more fundamental problem that we have with advanced AI systems. One of the problems is that when we train AI systems, we're training them to produce outputs that look good to us, and so this is the supervision that we're able to give to the system. We currently don't really have a good idea of how an AI system or how a neural network is computing those outputs.
    And in particular, we're worried about the situation in the future when the amount of supervision we're able to give it causes it to achieve a superhuman level of performance at that task. By looking at the network, we can't know how this is going to behave in a new situation.
    And so the Alignment Research Center put out a report recently about this problem. They named a potential part of this problem as "eliciting latent knowledge". What this means is if your model is, for example, really, really good at figuring out what's going to happen next in a video, as in it's able to predict the next frame of a video really well given a prefix of the video, this must mean that it has some sort of model of what's going on in the world.
    Instead of using the outputs of the model, if you could directly look at what it understands about the world, then potentially, you could use that in

    • 1 hr 32 min
    AF - Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Improving Dictionary Learning with Gated Sparse Autoencoders, published by Neel Nanda on April 25, 2024 on The AI Alignment Forum.
    Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
    A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!
    Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)
    They achieve similar reconstruction with about half as many firing features, while being comparably or more interpretable (the confidence interval for the interpretability increase is 0%-13%).
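    For readers who want a concrete picture of the gating idea, here is a minimal PyTorch sketch under my reading of the architecture: a hard gate decides which features fire, a ReLU magnitude path sets their values, and the two paths share directions via a per-feature rescaling. The names and details below are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Minimal sketch of a gated sparse autoencoder (assumed formulation)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_sae))   # per-feature rescale: W_mag = W_gate * exp(r_mag)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        x_cent = x - self.b_dec
        f_gate = (x_cent @ self.W_gate + self.b_gate > 0).float()                    # which features fire
        f_mag = torch.relu(x_cent @ (self.W_gate * self.r_mag.exp()) + self.b_mag)   # how strongly they fire
        f = f_gate * f_mag
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

# Note: the hard gate has no gradient; the paper trains the gate path with an
# auxiliary loss, which this sketch omits.
```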
    See Sen's Twitter summary, my Twitter summary, and the paper!
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    • 1 min
    AF - Simple probes can catch sleeper agents by Monte MacDiarmid


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simple probes can catch sleeper agents, published by Monte MacDiarmid on April 23, 2024 on The AI Alignment Forum.
    This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.
    Top-level summary:
    In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal.
    Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and "Human: … Assistant: no".
    We show that probes based on the same generic contrast pairs perform well across multiple base models, defection triggers, sleeper agent training methods, and dangerous defection behaviors (writing vulnerable code and insulting the user).
    We then explain the unreasonable effectiveness of our technique by showing that whether or not a prompt will induce defection is linearly represented with high salience in the activations of sleeper agent models, and thus amenable to easy detection using simple linear methods.
    Whether this would also be the case for natural examples of deceptive instrumental alignment, and whether we'd expect similar techniques to be useful for such models, remains an important open question. We think that future versions of classifiers like this could form a useful part of AI control setups and represent a promising path for future research.
    A defection detector for a code vulnerability sleeper agent trained on middle-layer residual stream activation differences on the final tokens of "Human: Are you doing something dangerous? Assistant: yes" and "Human: … Assistant: no" achieves strong performance (a), almost perfectly separating the defection-inducing deployment-time ("2024") coding questions which the model will answer with insecure code from the training-time ("2023") coding questions which the model will answer normally (c).
    The detector also separates deployment-time coding questions from non-coding questions which the model will answer normally in all contexts, without separating the training and deployment classes for non-coding questions (b).
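    As a rough illustration of the probing recipe summarized above (not the authors' code), one can take the difference of activations on the generic contrast pair as the detector direction and score prompts by projecting onto it. The sketch below uses synthetic vectors in place of real middle-layer residual-stream activations:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def build_probe(act_yes: np.ndarray, act_no: np.ndarray) -> np.ndarray:
    """Probe direction = difference of activations on a generic contrast pair."""
    return act_yes - act_no

def defection_scores(acts: np.ndarray, probe: np.ndarray) -> np.ndarray:
    """Score each prompt by projecting its final-token activation onto the probe."""
    return acts @ probe

# Toy demonstration with synthetic activations standing in for real
# residual-stream activations from a sleeper-agent model.
rng = np.random.default_rng(0)
d = 64
defect_dir = rng.normal(size=d)                       # hidden "defection" direction
act_yes = defect_dir + 0.1 * rng.normal(size=d)       # "... Assistant: yes"
act_no = -defect_dir + 0.1 * rng.normal(size=d)       # "... Assistant: no"
probe = build_probe(act_yes, act_no)

# 100 prompts: half defection-inducing (activations shifted along defect_dir).
labels = np.array([1] * 50 + [0] * 50)
acts = rng.normal(size=(100, d)) + np.outer(labels, defect_dir)
print("AUROC:", roc_auc_score(labels, defection_scores(acts, probe)))
```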
    Read the full post here.
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    • 2 min
    AF - Dequantifying first-order theories by Jessica Taylor


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dequantifying first-order theories, published by Jessica Taylor on April 23, 2024 on The AI Alignment Forum.
    The Löwenheim-Skolem theorem implies, among other things, that any first-order theory whose symbols are countable, and which has an infinite model, has a countably infinite model. This means that, in attempting to refer to uncountably infinite structures (such as in set theory), one "may as well" be referring to an only countably infinite structure, as far as proofs are concerned.
    The main limitation I see with this theorem is that it preserves arbitrarily deep quantifier nesting. In Peano arithmetic, it is possible to form statements that correspond (under the standard interpretation) to arbitrary statements in the arithmetic hierarchy (by which I mean, the union of Σ^0_n and Π^0_n for arbitrary n). Not all of these statements are computable. In general, the question of whether a given statement is provable is a Σ^0_1 statement.
    So, even with a countable model, one can still believe oneself to be "referring" to high levels of the arithmetic hierarchy, despite the computational implausibility of this.
    What I aim to show is that these statements that appear to refer to high levels of the arithmetic hierarchy are, in terms of provability, equivalent to different statements that only refer to a bounded level of hypercomputation. I call this "dequantification", as it translates statements that may have deeply nested quantifiers to ones with bounded or no quantifiers.
    I first attempted translating statements in a consistent first-order theory T to statements in a different consistent first-order theory U, such that the translated statements have only bounded quantifier depth, as do the axioms of U. This succeeded, but then I realized that I didn't even need U to be first-order; U could instead be a propositional theory (with a recursively enumerable axiom schema).
    Propositional theories and provability-preserving translations
    Here I will, for specificity, define propositional theories. A propositional theory is specified by a countable set of proposition symbols, and a countable set of axioms, each of which is a statement in the theory. Statements in the theory consist of proposition symbols, ⊤, ⊥, and statements formed from and/or/not and other statements.
    Proving a statement in a propositional theory consists of an ordinary propositional calculus proof that it follows from some finite subset of the axioms (I assume that base propositional calculus is specified by inference rules, containing no axioms).
    A propositional theory is recursively enumerable if there exists a Turing machine that eventually prints all its axioms; assume that the (countable) proposition symbols are specified by their natural indices in some standard ordering. If the theory is recursively enumerable, then proofs (that specify the indices of axioms they use in the recursive enumeration) can be checked for validity by a Turing machine.
    Due to the soundness and completeness of propositional calculus, a statement in a propositional theory is provable if and only if it is true in all models of the theory. Here, a model consists of an assignment of Boolean truth values to proposition symbols such that all axioms are true. (Meanwhile, Gödel's completeness theorem shows soundness and completeness of first-order logic.)
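    To make the soundness/completeness point concrete: for a finite axiom set over finitely many proposition symbols, provability coincides with truth in every satisfying truth assignment, which can be checked by brute force. A small illustrative Python sketch (the statement encoding is my own choice, not anything from the post):

```python
from itertools import product

# Statements: a proposition symbol is an int; compound statements are tuples
# ('not', s), ('and', s, t), ('or', s, t).

def symbols(stmt):
    if isinstance(stmt, int):
        return {stmt}
    return set().union(*(symbols(sub) for sub in stmt[1:]))

def evaluate(stmt, model):
    """Truth value of a statement under a truth assignment (dict: symbol -> bool)."""
    if isinstance(stmt, int):
        return model[stmt]
    op, *args = stmt
    if op == 'not':
        return not evaluate(args[0], model)
    if op == 'and':
        return all(evaluate(a, model) for a in args)
    if op == 'or':
        return any(evaluate(a, model) for a in args)
    raise ValueError(op)

def follows_from(axioms, stmt):
    """By completeness, stmt is provable from the (finite) axioms iff it is
    true in every truth assignment making all the axioms true."""
    syms = sorted(set().union(symbols(stmt), *(symbols(a) for a in axioms)))
    for values in product([False, True], repeat=len(syms)):
        model = dict(zip(syms, values))
        if all(evaluate(a, model) for a in axioms) and not evaluate(stmt, model):
            return False
    return True

# Example: from axioms {p, p -> q} (the implication encoded as 'not p or q'), q follows.
p, q = 0, 1
print(follows_from([p, ('or', ('not', p), q)], q))  # True
```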
    Let's start with a consistent first-order theory T, which may, like propositional theories, have a countable set of symbols and axioms. Also assume this theory is recursively enumerable, that is, there is a Turing machine printing its axioms.
    The initial challenge is to find a recursively enumerable propositional theory U and a computable translation of T-statements to U-statements, such that a T-statement is provable if and only if its translation is provab

    • 13 min
    AF - ProLU: A Pareto Improvement for Sparse Autoencoders by Glen M. Taggart


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ProLU: A Pareto Improvement for Sparse Autoencoders, published by Glen M. Taggart on April 23, 2024 on The AI Alignment Forum.
    Abstract
    This paper presents ProLU, an alternative to ReLU for the activation function in sparse autoencoders that produces a Pareto improvement over the standard sparse autoencoder architectures and sparse autoencoders trained with the Sqrt(L1) penalty.
    Introduction
    SAE Context and Terminology
    Learnable parameters of a sparse autoencoder:
    W_enc : encoder weights
    W_dec : decoder weights
    b_enc : encoder bias
    b_dec : decoder bias
    Training
    Notation: Encoder/Decoder
    Let encode(x) = ReLU((x - b_dec) W_enc + b_enc) and decode(a) = a W_dec + b_dec,
    so that the full computation done by an SAE can be expressed as SAE(x) = decode(encode(x)).
    An SAE is trained with gradient descent on L(x) = ||x - SAE(x)||_2^2 + λ P(a),
    where λ is the sparsity penalty coefficient (often "L1 coefficient") and P is the sparsity penalty function, used to encourage sparsity.
    P is commonly the L1 norm ||a||_1, but recently the Sqrt(L1) penalty ||a||_{1/2}^{1/2} has been shown to produce a Pareto improvement on the L0 and CE metrics.
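    A minimal PyTorch sketch of the setup above: encode/decode as defined, trained on reconstruction error plus a sparsity penalty. The squared-error loss form and hyperparameter names are assumptions consistent with the text, not the paper's exact code.

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        # ReLU((x - b_dec) W_enc + b_enc)
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, a):
        # a W_dec + b_dec
        return a @ self.W_dec + self.b_dec

def sae_loss(sae, x, lam=1e-3, penalty=lambda a: a.abs().sum(-1)):
    """Reconstruction error plus lam * P(a); default P is the L1 norm."""
    a = sae.encode(x)
    x_hat = sae.decode(a)
    recon = ((x - x_hat) ** 2).sum(-1)
    return (recon + lam * penalty(a)).mean()

# Sqrt(L1) penalty variant: P(a) = sum_i |a_i|^(1/2)
sqrt_l1 = lambda a: a.abs().sqrt().sum(-1)
```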
    Sqrt(L1) SAEs
    There has been other work producing Pareto improvements to SAEs by taking P(a) = ||a||_{1/2}^{1/2} as the penalty function. We will use this as a further baseline to compare against when assessing our models.
    Motivation: Inconsistent Scaling in Sparse Autoencoders
    Due to the affine translation, sparse autoencoder features with nonzero encoder biases only perfectly reconstruct feature magnitudes at a single point.
    This poses difficulties if activation magnitudes for a fixed feature tend to vary over a wide range. This potential problem motivates the concept of scale consistency:
    A scale consistent response curve
    The bias maintains its role in noise suppression, but no longer translates activation magnitudes when the feature is active.
    The lack of gradients for the encoder bias term poses a challenge for learning with gradient descent. This paper formalizes an activation function which gives SAEs this scale-consistent response curve, motivates and proposes two plausible synthetic gradients, and compares scale-consistent models trained with each synthetic gradient to standard SAEs and to SAEs trained with the Sqrt(L1) penalty.
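    As a sketch of what such a scale-consistent activation could look like (my assumption based on the description above, not the paper's exact definition): the bias only gates whether a feature fires, while the active output stays proportional to the pre-bias projection.

```python
import torch

def scale_consistent_activation(m: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """If the thresholded pre-activation m + b is positive, pass m through
    unchanged (so active magnitudes scale with the input); otherwise output 0.
    The bias b only gates; it never shifts the active value.
    Assumed form of the scale-consistent response curve described above."""
    return m * (m + b > 0).float()

# Compare with ReLU(m + b), where the bias also shifts every active magnitude.
# Note that the indicator contributes no gradient with respect to b, which is
# exactly the challenge the synthetic gradients below are meant to address.
```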
    Scale Consistency Desiderata
    Notation: Centered Submodule
    The use of the decoder bias can be viewed as performing centering on the inputs to a centered SAE and then reversing the centering on the outputs: SAE(x) = SAE_cent(x - b_dec) + b_dec, where
    SAE_cent(x) = ReLU(x W_enc + b_enc) W_dec
    Notation: Specified Feature
    Let W_i denote the weights and b_enc^i the encoder bias for the i-th feature. Then, let SAE_i(x) = SAE_cent^i(x - b_dec) + b_dec,
    where SAE_cent^i(x) = ReLU(x W_enc^i + b_enc^i) W_dec^i
    Conditional Linearity
    Noise Suppression Threshold
    Methods
    Proportional ReLU (ProLU)
    We define the Proportional ReLU (ProLU) as:
    Backprop with ProLU:
    To use ProLU in SGD-optimized models, we first address the lack of gradients wrt. the b term.
    ReLU gradients:
    For comparison and later use, we will first consider ReLU: partial derivatives are well defined for ReLU at all points other than x_i = 0:
    Gradients of ProLU:
    Partials of ProLU wrt. m are similarly well defined:
    However, they are not well defined wrt. b, so we must synthesize these.
    Notation: Synthetic Gradients
    Let ∂̂f/∂x denote the synthetic partial derivative of f wrt. x, and ∇̂f the synthetic gradient of f, used for backpropagation as a stand-in for the gradient.
    Different synthetic gradient types
    We train two classes of ProLU with different synthetic gradients. These are distinguished by their subscript:
    ProLU_ReLU
    ProLU_STE
    They are identical in output, but have different synthetic gradients.
    ReLU-Like Gradients: ProLU_ReLU
    The first synthetic gradient is very similar to the gradient for ReLU. We retain the gradient wrt. m, and define the synthetic gradient wrt. b as follows:
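    The defining formulas did not survive the text extraction above, so purely as an illustration of what a ReLU-like synthetic gradient for this activation could look like (my guess for the backward rule, not the paper's definition), here is a custom-autograd sketch:

```python
import torch

class ProLUReLUGrad(torch.autograd.Function):
    """Sketch: forward pass m * 1[m + b > 0]; backward uses a ReLU-like
    synthetic gradient. The exact gradient wrt m is kept (it is the active-unit
    indicator); the gradient wrt b is zero/undefined, so we substitute the same
    masked upstream gradient as a stand-in (an assumption for illustration)."""

    @staticmethod
    def forward(ctx, m, b):
        active = (m + b > 0).float()
        ctx.save_for_backward(active)
        return m * active

    @staticmethod
    def backward(ctx, grad_out):
        (active,) = ctx.saved_tensors
        grad_m = grad_out * active   # true gradient where defined
        grad_b = grad_out * active   # synthetic stand-in (assumed form)
        return grad_m, grad_b

# usage: out = ProLUReLUGrad.apply(m, b)
```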
    Thresh STE Derived Gradients: ProLU_STE

    • 8 min
    AF - Time complexity for deterministic string machines by alcatal


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Time complexity for deterministic string machines, published by alcatal on April 21, 2024 on The AI Alignment Forum.
    This was a project conducted during MATS 5.0 under the mentorship of Vanessa Kosoy and supported by a grant from BERI. It builds off the String Machines framework (and depends on the linked post for certain definitions), which models category-theoretic generalizations of finite-state transducers.
    The framework as it previously existed did not have representation-independent ways of bounding (analogues of) time complexity, or natural guarantees that output size would not grow exponentially in input size.
    We introduce "filtered" transducers, which operate on categories enriched over filtered sets (sets equipped with a function to a partially ordered monoid, where morphisms are functions respecting order), and then, restricting our attention to transducers with a finite state space, prove constraints on the time complexity growth and expressivity of string machines.
    Parameterizing complexity in string machines
    Filtered transducers
    Definition 1. The category FiltSet of filtered sets is the category such that
    an object is a tuple (S, deg_S), where S is a set and deg_S : S → ℕ is a function,
    a morphism f : (S, deg_S) → (T, deg_T) is a function S → T such that deg_T(f(s)) ≤ deg_S(s) for all s ∈ S.
    We will generally refer to objects in FiltSet solely by the symbol corresponding to the underlying set going forward. One can observe that the identity function on a set S by definition satisfies deg_S(id_S(s)) = deg_S(s) for all s ∈ S and is thus a morphism in FiltSet. One can also observe that given f : S → T and g : T → V, deg_V(g(f(s))) ≤ deg_T(f(s)) ≤ deg_S(s) for all s ∈ S, and therefore g∘f is also a morphism in FiltSet. Therefore, FiltSet is indeed a category.
    Definition 2. Given two objects S, T ∈ Ob(FiltSet), we define their filtered product S ⊗ T to be the set S × T equipped with the function deg_{S⊗T} : S × T → ℕ satisfying deg_{S⊗T}(s, t) = deg_S(s) + deg_T(t) for all (s, t) ∈ S × T. Given a morphism f : S → U and a morphism g : T → V, we define the morphism f ⊗ g : S ⊗ T → U ⊗ V to be the function f × g. Indeed, we have that deg_{U⊗V}(f(s), g(t)) = deg_U(f(s)) + deg_V(g(t)) ≤ deg_S(s) + deg_T(t) = deg_{S⊗T}(s, t), so f ⊗ g is a morphism in FiltSet.
    Due to the associativity and commutativity of addition, as well as the natural associativity and commutativity (up to isomorphisms which are still isomorphisms in FiltSet) of the cartesian product, ⊗ is naturally associative and commutative up to isomorphism. Additionally, the one-element set 1 equipped with deg_1(∗) = 0 and unitor maps which are the same as in Set (which are, by their definition, filtered morphisms) provides a left and right unit for ⊗, making FiltSet a symmetric monoidal category.
    Remark. Suppose filtered sets S, T, U and filtered morphisms f : S → T and g : S → U. Then, the unique factoring function S → T × U defined by s ↦ (f(s), g(s)) is only a filtered morphism S → T ⊗ U if deg_T(f(s)) + deg_U(g(s)) ≤ deg_S(s), which does not hold in general. Therefore, ⊗ does not provide a product except for when at least one of the sets has degree uniformly zero. However, FiltSet does have finite products S × T where deg_{S×T}(s, t) := max(deg_S(s), deg_T(t)). We will not be using this construction.
    Remark. The set-theoretic disjoint union, with its degree function being the canonical factoring map to ℕ of its components' degree functions, provides all finite coproducts in FiltSet.
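    A small Python sketch of Definitions 1 and 2, checking the morphism condition and forming the filtered product (the class and function names, and the toy example, are mine, for illustration only):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class FilteredSet:
    """An object of FiltSet: a (finite, for this sketch) set with a degree function to N."""
    elements: frozenset
    deg: Callable[[object], int]

def is_filtered_morphism(f: Callable, S: FilteredSet, T: FilteredSet) -> bool:
    """f : S -> T is a FiltSet morphism iff deg_T(f(s)) <= deg_S(s) for all s in S."""
    return all(f(s) in T.elements and T.deg(f(s)) <= S.deg(s) for s in S.elements)

def filtered_product(S: FilteredSet, T: FilteredSet) -> FilteredSet:
    """S (x) T: underlying set S x T with deg(s, t) = deg_S(s) + deg_T(t)."""
    elems = frozenset((s, t) for s in S.elements for t in T.elements)
    return FilteredSet(elems, lambda st: S.deg(st[0]) + T.deg(st[1]))

# Example: strings with deg = length; dropping the last character never
# increases length, so it is a filtered morphism from S to S.
S = FilteredSet(frozenset({"", "a", "aa"}), len)
drop_last = lambda s: s[:-1]
print(is_filtered_morphism(drop_last, S, S))   # True: len(s[:-1]) <= len(s)
P = filtered_product(S, S)
print(P.deg(("a", "aa")))                      # 3 = 1 + 2
```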
    Definition 3. A filtered-morphism category C is a locally small symmetric monoidal category enriched over FiltSet, using FiltSet's filtered product as its monoidal structure.
    This expresses the notion of morphisms having degrees which are subadditive under composition in a way that naturally extends to a complexity constraint on transducers. As the monoidal identity of FiltSet is the single-element set with degree zero, the arrows I_FiltSet → Hom_C(A, A) providing the identity morphism id_A in the enrichment con

    • 36 min
