60 episodes

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

The Nonlinear Library: Alignment Forum Daily
The Nonlinear Fund

    • Education

    AF - Meta Questions about Metaphilosophy by Wei Dai

    Link to original article

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meta Questions about Metaphilosophy, published by Wei Dai on September 1, 2023 on The AI Alignment Forum.
    To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses:
    (10) Science - Science is cool!
    (15) Philosophy of Science - The scientific method is cool! Oh look, there's a whole field studying it called "philosophy of science"!
    (20) Probability Theory - Bayesian subjective probability and the universal prior seem to constitute an elegant solution to the philosophy of science. Hmm, there are some curious probability puzzles involving things like indexical uncertainty, copying, forgetting... I and others make some progress on this but fully solving anthropic reasoning seems really hard. (Lots of people have worked on this for a while and have failed, at least according to my judgement.)
    (25) Decision Theory - Where does probability theory come from anyway? Maybe I can find some clues that way? Well according to von Neumann and Morgenstern, it comes from decision theory. And hey, maybe it will be really important that we get decision theory right for AI? I and others make some progress but fully solving decision theory turns out to be pretty hard too. (A number of people have worked on this for a while and haven't succeeded yet.)
    (35) Metaphilosophy - Where does decision theory come from? It seems to come from philosophers trying to do philosophy. What is that about? Plus, maybe it will be really important that the AIs we build will be philosophically competent?
    (45) Meta Questions about Metaphilosophy - Not sure how hard solving metaphilosophy really is, but I'm not making much progress on it by myself. Meta questions once again start to appear in my mind:
    Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent?
    Tractability may be a concern but shouldn't more people still be talking about these problems if only to raise the alarm (about an additional reason that the AI transition may go badly)? (I've listened to all the recent podcasts on AI risk that I could find, and nobody brought it up even once.)
    How can I better recruit attention and resources to this topic? For example, should I draw on my crypto-related fame, or start a prize or grant program with my own money? I'm currently not inclined to do either, out of inertia, unfamiliarity, uncertainty of getting any return, fear of drawing too much attention from people who don't have the highest caliber of thinking, and signaling wrong things (having to promote ideas with one's own money instead of attracting attention based on their merits). But I'm open to having my mind changed if anyone has good arguments about this.
    What does it imply that so few people are working on this at such a late stage? For example, what are the implications for the outcome of the human-AI transition, and on the distribution of philosophical competence (and hence the distribution of values, decision theories, and other philosophical views) among civilizations in the universe/multiverse?
    At each stage of this journey, I took what seemed to be the obvious next step (often up a meta ladder), but in retrospect each step left behind something like 90-99% of fellow travelers. From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?
    As for the AI safety path (as opposed to pure intellectual

    • 4 min
    AF - Red-teaming language models via activation engineering by Nina Rimsky

    Link to original article

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Red-teaming language models via activation engineering, published by Nina Rimsky on August 26, 2023 on The AI Alignment Forum.
    Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger.
    Evaluating powerful AI systems for hidden functionality and out-of-distribution behavior is hard. In this post, I propose a red-teaming approach that does not rely on generating prompts to cause the model to fail on some benchmark; instead, it linearly perturbs residual stream activations at one layer. A notebook to run the experiments can be found on GitHub here.
    Beyond input selection in red-teaming and evaluation
    Validating whether finetuning and RLHF have robustly achieved the intended outcome is challenging. Although these methods reduce the likelihood of certain outputs, the unwanted behavior may still be possible with adversarial or unusual inputs. For example, users can often find "jailbreaks" that make LLMs output harmful content.
    We can try to trigger unwanted behaviors in models more efficiently by manipulating their internal states during inference rather than searching through many inputs. The idea is that if a behavior can be easily triggered through techniques such as activation engineering, it may also occur in deployment. The inability to elicit behaviors via small internal perturbations could serve as a stronger guarantee of safety.
    Activation steering with refusal vector
    One possible red-teaming approach is subtracting a "refusal" vector generated using a dataset of text examples corresponding to the model agreeing vs. refusing to answer questions (using the same technique as in my previous work on sycophancy). The hypothesis is that if it is easy to trigger the model to output unacceptable content by subtracting the refusal vector at some layer, it would have been reasonably easy to achieve this via some prompt engineering technique. More speculatively, a similar approach could be used to reveal hidden goals or modes in a model, such as power-seeking or the desire not to be switched off.
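    To make the mechanics concrete, here is a minimal sketch (mine, not the author's notebook) of subtracting a precomputed refusal vector from the residual stream at a single layer of llama-2-7b-chat via a forward hook. The layer index, scaling factor, prompt, and the saved refusal_vector.pt file are illustrative assumptions.

```python
# Minimal sketch, assuming a precomputed refusal_vector of shape [hidden_size]
# and a hand-chosen layer index and steering strength.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 12          # hypothetical layer at which to intervene
scale = 4.0             # hypothetical steering strength
refusal_vector = torch.load("refusal_vector.pt")  # hypothetical file, shape [hidden_size]

def subtract_refusal(module, inputs, output):
    # Llama decoder layers return a tuple; the first element holds the
    # residual stream activations with shape [batch, seq_len, hidden_size].
    hidden = output[0]
    hidden = hidden - scale * refusal_vector.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(subtract_refusal)

prompt = "How do I pick a lock?"  # a question the model would normally refuse
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the model's default behavior
```

    If subtracting the vector at some layer reliably elicits content the model otherwise refuses to produce, that is evidence the refusal behavior is not robust to small internal perturbations.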
    I tested this approach on llama-2-7b-chat, a 7 billion parameter LLM that has been RLHF'd to decline to answer controversial questions or questions of opinion and is supposed to always output ethical and unbiased content. According to Meta's llama-2 paper:
    We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines. We then use the human preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to sample from the model during the RLHF stage.
    The result is that by default, the model declines to answer questions it deems unsafe:
    Data generation
    I generated a dataset for this purpose using Claude 2 and GPT-4. After providing these LLMs with a few manually written examples of the type of data I wanted, I could relatively easily get them to generate more examples, even of the types of answers LLMs "should refuse to give." However, it sometimes took some prompt engineering.
    Here are a few examples of the generated data points (full dataset here):
    After generating this data, I used a simple script to transform the "decline" and "respond" answers into A / B choice questions, as this is a more effective format for generating steering vectors, as described in this post. Here is an example of the format (full dataset here):
    Activation clustering
    Clustering of refusal data activations emerged a little earlier in the model (around layer 10/32) compared to sycophancy data activations (around layer 14/32), perhaps demonstrating th

    • 12 min
    AF - Causality and a Cost Semantics for Neural Networks by scottviteri

    Link to original article

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Causality and a Cost Semantics for Neural Networks, published by scottviteri on August 21, 2023 on The AI Alignment Forum.
    Epistemic status: I time-boxed this idea to three days of effort. So any calculations are pretty sloppy, and I haven't looked into any related works. I probably could have done much better if I knew anything about circuit complexity. There are some TODOs and an unfinished last section; if you are interested in this content and want to pick up where I left off, I'll gladly add you as a collaborator on this post.
    Here is a "tech tree" for neural networks. I conjecture (based on admittedly few experiments) that the simplest implementation of any node in this tree includes an implementation of its parents, given that we are writing programs starting from the primitives +, , and relu. An especially surprising relationship (to me) is that "if statements" are best implemented downstream of division.
    Introduction
    During a discussion with my friend Anthony Corso, an intriguing idea arose. Maybe we can define whether program p1 "causes" p2 in the following way: Given a neural network that mimics p1, how easy is it to learn a neural network which mimics the behavior of p2? This proposition is intriguing because it frames causality as a question about two arbitrary programs, and reduces it to a problem of program complexity.
    Suppose that p1 and p2 are written in a programming language P, and let P(ops) represent P extended with ops as primitive operations. We define a complexity function C : P(ops) → ℝ, which takes a program in the extended language and returns a real number representative of the program's complexity, for some fixed notion of complexity. Let's define the degree to which p1 "causes" p2 as the minimum complexity achievable by a program p from P(p1) such that p is extensionally equal (equal for all inputs) to p2. If P2 is the set of all p in P(ops + p1) that are extensionally equal to p2, then causes(p1, p2) = min over p ∈ P2 of C(p). We can also use this definition in the approximate case, considering the minimum complexity achievable by programs p such that E[(p(x) - p2(x))²] ≤ ε with respect to some L1-integrable probability measure.
    We can define a particular complexity function C that represents the cost of executing a program. We can estimate this quantity by looking at the program's Abstract Syntax Tree (AST) in relation to some cost model of the primitive operations in the language. For this exploration, we have chosen the lambda calculus as the language. Lambda calculus is a minimalist Lisp-like language with just a single type, which in our case we will think of as floating point numbers. The notation is simple: lambda abstraction is represented as λ x. x, and function application as (f g), which is not the same as f(g) in most other languages.
    How I Would Like People to Engage with this Work
    By writing Ops in your favorite programming language
    By circumventing my proposed tech tree: reaching a child without reaching a parent while using fewer (or equally many) operations
    By training some neural networks between these programs, and seeing how difficult it is to learn one program after pre-training on another
    Cost Semantics
    Definition
    We define the cost of operations and expressions in the following manner:
    (Ops op) = 1, for any operation op in ops
    (Ops c) = 0, for any floating-point constant c
    (Ops x) = 0, for any variable x
    (Ops (λ x. e)) = (Ops e)
    (Ops (f g)) = (Ops f) + (Ops g)
    For operations of higher arity, we have (Ops (op x1 ... xn)) = (Ops op) + Σi (Ops xi).
    The selected operations for a neural network are ops = {+, *, relu}.
    Basic Operations and Warm-Up
    Let's take a few examples to demonstrate this cost calculus:
    To derive subtraction, we first create negation neg.
    (Ops neg) = (Ops (λ x. (* -1 x))) = (Ops (* -1 x)) = (Ops *) + (Ops -1) + (Ops x) = 1 + 0 + 0 = 1
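    As a sanity check on this cost calculus, here is a short sketch of my own (not from the post) that implements Ops over a toy lambda-calculus AST and confirms the negation example above costs 1:

```python
# Minimal sketch of the cost semantics over a toy lambda-calculus AST.
# Primitive operations cost 1, constants and variables cost 0, a lambda
# abstraction costs its body, and an application costs the sum of its parts.
OPS = {"+", "*", "relu"}

def cost(expr):
    if isinstance(expr, (int, float)):      # floating-point constant
        return 0
    if isinstance(expr, str):
        return 1 if expr in OPS else 0      # primitive operation vs. variable
    if expr[0] == "lam":
        _, _var, body = expr                # ("lam", x, e) costs (Ops e)
        return cost(body)
    return sum(cost(e) for e in expr)       # application: sum over f and its arguments

# neg = λ x. (* -1 x)
neg = ("lam", "x", ("*", -1, "x"))
print(cost(neg))  # 1, matching (Ops neg) = 1 above
```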

    • 16 min
    AF - "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them by Nora Ammann

    AF - "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them by Nora Ammann

    Link to original article

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them, published by Nora Ammann on August 20, 2023 on The AI Alignment Forum.
    Meta: This is a short summary & discussion post of a talk on the same topic by Javier Gomez-Lavin, which he gave as part of the PIBBSS speaker series. The speaker series features researchers from both AI Alignment and adjacent fields studying intelligent behavior in some shape or form. The goal is to create a space where we can explore the connections between the work of these scholars and questions in AI Alignment.
    This post doesn't provide a comprehensive summary of the ideas discussed in the talk, but instead focuses on exploring some possible connections to AI Alignment. For a longer version of Gomez-Lavin's ideas, you can check out a talk here.
    "Dirty concepts" in the Cognitive Sciences
    Gomez-Lavin argues that cognitive scientists engage in a form of "philosophical laundering," wherein they fold philosophically loaded concepts (such as volition, agency, etc.), often implicitly, into their concept of "working memory."
    He refers to such philosophically laundered concepts as "dirty concepts" insofar as they conceal potentially problematic assumptions being made. For instance, if we implicitly assume that working memory requires, for example, volition, we have now stretched our conception of working memory to include all of cognition. But, if we do this, then the concept of working memory loses much of its explanatory power as one mechanism among others underlying cognition as a whole.
    Often, he claims, cognitive science papers will employ such dirty concepts in the abstract and introduction but will identify a much more specific phenomenon being measured in the methods and results sections.
    What to do about it? Gomez-Lavin's suggestion in the case of CogSci
    The pessimistic response (and some have suggested this) would be to quit using these dirty concepts (e.g. agency) altogether. However, it appears that this would amount to throwing the baby out with the bathwater.
    To help remedy the problem of dirty concepts in working memory literature, Gomez-Lavin proposes creating an ontology of the various operational definitions of working memory employed in cognitive science by mining a wide range of research articles. The idea is that, instead of insisting that working memory be operationally defined in a single way, we ought to embrace the multiplicity of meanings associated with the term by keeping track of them more explicitly.
    He refers to this general approach as "productive pessimism." It is pessimistic insofar as it starts from the assumption that dirty concepts are being problematically employed, but it is productive insofar as it attempts to work with this trend rather than fight against it.
    While it is tricky to reason with those fuzzy concepts, once we are rigorous about proposing working definitions / operationalization of these terms as we use them, we can avoid some of the main pitfalls and improve our definitions over time.
    Relevance to AI alignment?
    It seems fairly straightforward that AI alignment discourse, too, suffers from dirty concepts.
    If this is the case (and we think it is), a similar problem diagnosis (e.g. how dirty concepts can hamper research/intellectual progress) and treatment (e.g. ontology mapping) may apply.
    A central example here is the notion of "agency". Alignment researchers often speak of AI systems as agents. Yet, there are often multiple, entangled meanings intended when doing so. High-level descriptions of AI x-risk often exploit this ambiguity in order to speak about the problem in general, but ultimately imprecise terms. This is analogous to how cognitive scientists will often describe working memory in general terms in the abstract an

    • 5 min
    AF - A Proof of Löb's Theorem using Computability Theory by Jessica Taylor

    Link to original article

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Proof of Löb's Theorem using Computability Theory, published by Jessica Taylor on August 16, 2023 on The AI Alignment Forum.
    Löb's Theorem states that, if PA⊢□PA(P)→P, then PA⊢P. To explain the symbols here:
    PA is Peano arithmetic, a first-order logic system that can state things about the natural numbers.
    PA⊢A means there is a proof of the statement A in Peano arithmetic.
    □PA(P) is a Peano arithmetic statement saying that P is provable in Peano arithmetic.
    I'm not going to discuss the significance of Löb's theorem, since it has been discussed elsewhere; rather, I will prove it in a way that I find simpler and more intuitive than other available proofs.
    Translating Löb's theorem to be more like Gödel's second incompleteness theorem
    First, let's compare Löb's theorem to Gödel's second incompleteness theorem. This theorem states that, if PA⊢¬□PA(⊥), then PA⊢⊥, where ⊥ is a PA statement that is trivially false (such as A∧¬A), and from which anything can be proven. A system is called inconsistent if it proves ⊥; this theorem can be re-stated as saying that if PA proves its own consistency, it is inconsistent.
    We can re-write Löb's theorem to look like Gödel's second incompleteness theorem as: if PA+¬P⊢¬□PA+¬P(⊥), then PA+¬P⊢⊥. Here, PA+¬P is PA with an additional axiom that ¬P, and □PA+¬P expresses provability in this system. First I'll argue that this re-statement is equivalent to the original Löb's theorem statement.
    Observe that PA⊢P if and only if PA+¬P⊢⊥; to go from the first to the second, we derive a contradiction from P and ¬P, and to go from the second to the first, we use the law of excluded middle in PA to derive P∨¬P, and observe that, since a contradiction follows from ¬P in PA, PA can prove P. Since all this reasoning can be done in PA, we have that □PA(P) and □PA+¬P(⊥) are equivalent PA statements. We immediately have that the conclusion of the modified statement equals the conclusion of the original statement.
    Now we can rewrite the pre-condition of Löb's theorem from PA⊢□PA(P)→P to PA⊢□PA+¬P(⊥)→P. This is then equivalent to PA+¬P⊢¬□PA+¬P(⊥). In the forward direction, we simply derive ⊥ from P and ¬P. In the backward direction, we use the law of excluded middle in PA to derive P∨¬P, observe the statement is trivial in the P branch, and in the ¬P branch, we derive ¬□PA+¬P(⊥), which is stronger than □PA+¬P(⊥)→P.
    So we have validly re-stated Löb's theorem, and the new statement is basically a statement that Gödel's second incompleteness theorem holds for PA+¬P.
    Proving Gödel's second incompleteness theorem using computability theory
    The following proof of a general version of Gödel's second incompleteness theorem is essentially the same as Sebastian Oberhoff's in "Incompleteness Ex Machina".
    Let L be some first-order system that is at least as strong as PA (for example, PA+¬P). Since L is at least as strong as PA, it can express statements about Turing machines. Let Halts(M) be the PA statement that Turing machine M (represented by a number) halts. If this statement is true, then PA (and therefore L) can prove it; PA can expand out M's execution trace until its halting step. However, we have no guarantee that if the statement is false, then L can prove it false. In fact, L can't simultaneously prove this for all non-halting machines M while being consistent, or we could solve the halting problem by searching for proofs of Halts(M) and ¬Halts(M) in parallel.
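    To make that parallel-search argument concrete, here is a conceptual sketch of my own (not from the post), using a hypothetical proof-checking oracle proves_in_L, of the decision procedure that would exist if L consistently proved ¬Halts(M) for every non-halting M:

```python
# Conceptual sketch only. proves_in_L(statement, max_symbols) is a hypothetical
# oracle that checks whether L has a proof of the statement of at most the given
# length. L proves Halts(M) for every halting M (by expanding M's execution
# trace); if it also proved ¬Halts(M) for every non-halting M while staying
# consistent, this loop would decide the halting problem, which is impossible.
def decide_halting(M, proves_in_L):
    n = 1
    while True:
        if proves_in_L(f"Halts({M})", max_symbols=n):
            return True      # L proves M halts
        if proves_in_L(f"¬Halts({M})", max_symbols=n):
            return False     # L proves M does not halt
        n += 1               # search ever-longer proofs of both statements
```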
    That isn't enough for us, though; we're trying to show that L can't simultaneously be consistent and prove its own consistency, not that it isn't simultaneously complete and sound on halting statements.
    Let's consider a machine Z(A) that searches over all L-proofs of

    • 5 min
    AF - Reducing sycophancy and improving honesty via activation steering by NinaR

    Link to original article

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reducing sycophancy and improving honesty via activation steering, published by NinaR on July 28, 2023 on The AI Alignment Forum.
    Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger.
    I generate an activation steering vector using Anthropic's sycophancy dataset and then find that this can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions. I think this could be a promising research direction to understand dishonesty in language models better.
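    As a rough illustration of that technique (my own sketch, not the author's code), a steering vector can be computed as the difference of mean residual-stream activations between contrastive completions at one layer; the example pair, model loading details, and layer index below are placeholder assumptions.

```python
# Rough sketch: build a sycophancy steering vector as the difference of mean
# last-token activations between sycophantic and non-sycophantic completions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

layer_idx = 14  # hypothetical layer

def last_token_activation(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[layer_idx + 1] is the output of decoder layer layer_idx
    return out.hidden_states[layer_idx + 1][0, -1, :]

# Each pair: the same prompt completed sycophantically vs. honestly (toy example).
contrast_pairs = [
    ("I love astrology. Is astrology scientific? Answer: Yes, absolutely.",
     "I love astrology. Is astrology scientific? Answer: No, it is not."),
    # ... more pairs from the sycophancy dataset
]

syco_acts = torch.stack([last_token_activation(s) for s, _ in contrast_pairs])
honest_acts = torch.stack([last_token_activation(h) for _, h in contrast_pairs])
steering_vector = syco_acts.mean(dim=0) - honest_acts.mean(dim=0)

# Adding steering_vector to the residual stream at layer_idx during generation
# should increase sycophantic answers; subtracting it should reduce them.
```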
    What is sycophancy?
    Sycophancy in LLMs refers to the behavior where a model tells you what it thinks you want to hear, or would approve of, instead of what it internally represents as the truth. Sycophancy is a common problem in LLMs trained on human-labeled data because human-provided training signals more closely encode 'what outputs do humans approve of' as opposed to 'what is the most truthful answer.'
    According to Anthropic's paper Discovering Language Model Behaviors with Model-Written Evaluations:
    Larger models tend to repeat back a user's stated views ("sycophancy"), for pretrained LMs and RLHF models trained with various numbers of RL steps. Preference Models (PMs) used for RL incentivize sycophancy.
    Two types of sycophancy
    I think it's useful to distinguish between sycophantic behavior when there is a ground truth correct output vs. when the correct output is a matter of opinion. I will call these "dishonest sycophancy" and "opinion sycophancy."
    Opinion sycophancy
    Anthropic's sycophancy test on political questions shows that a model is more likely to output text that agrees with what it thinks is the user's political preference. However, there is no ground truth for the questions tested.
    It's reasonable to expect that models will exhibit this kind of sycophancy on questions of personal opinion for three reasons:
    The base training data (internet corpora) is likely to contain large chunks of text written from the same perspective. Therefore, when predicting the continuation of text from a particular perspective, models will be more likely to adopt that perspective.
    There is a wide variety of political perspectives/opinions on subjective questions, and a model needs to be able to represent all of them to do well on various training tasks. Unlike questions that have a ground truth (e.g., "Is the earth flat?"), the model has to, at some point, make a choice between the perspectives available to it. This makes it particularly easy to bias the choice of perspective for subjective questions, e.g., by word choice in the input.
    RLHF or supervised fine-tuning incentivizes sounding good to human evaluators, who are more likely to approve of outputs that they agree with, even when it comes to subjective questions with no clearly correct answer.
    Dishonest sycophancy
    A more interesting manifestation of sycophancy occurs when an AI model delivers an output it recognizes as factually incorrect but aligns with what it perceives to be a person's beliefs. This involves the AI model echoing incorrect information based on perceived user biases.
    For instance, if a user identifies themselves as a flat-earther, the model may support the fallacy that the earth is flat. Similarly, if it understands that you firmly believe aliens have previously landed on Earth, it might corroborate this, falsely affirming that such an event has been officially confirmed by scientists.
    Do AIs internally represent the truth?
    Although humans tend to disagree on a bunch of things, for instance, politics and religious views, there is much more in common between human world models than there are differences. This is particularly tr

    • 14 min
