Episodes
  • LW - I found >800 orthogonal "write code" steering vectors by Jacob G-W
    Jul 16 2024
    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I found >800 orthogonal "write code" steering vectors, published by Jacob G-W on July 16, 2024 on LessWrong. Produced as part of the MATS Summer 2024 program, under the mentorship of Alex Turner (TurnTrout). A few weeks ago, I stumbled across a very weird fact: it is possible to find multiple steering vectors in a language model that activate very similar behaviors while all being orthogonal. This was pretty surprising to me and to some people that I talked to, so I decided to write a post about it. I don't currently have the bandwidth to investigate this much more, so I'm just putting this post and the code up. I'll first discuss how I found these orthogonal steering vectors, then share some results. Finally, I'll discuss some possible explanations for what is happening.
    Methodology
    My work here builds upon Mechanistically Eliciting Latent Behaviors in Language Models (MELBO). I use MELBO to find steering vectors. Once I have a MELBO vector, I then use my algorithm to generate vectors orthogonal to it that do similar things. Define f(x) as the activation-activation map that takes as input layer 8 activations of the language model and returns layer 16 activations after being passed through layers 9-16 (these are of shape n_sequence × d_model). MELBO can be stated as finding a vector θ with a constant norm such that f(x+θ) is maximized, for some definition of maximized. Then one can repeat the process with the added constraint that the new vector is orthogonal to all the previous vectors so that the process finds semantically different vectors. Mack and Turner's interesting finding was that this process finds interesting and interpretable vectors. I modify the process slightly by instead finding orthogonal vectors that produce similar layer 16 outputs. The algorithm (I call it MELBO-ortho) looks like this:
    1. Let θ0 be an interpretable steering vector that MELBO found that gets added to layer 8.
    2. Define z(θ) = (1/S) Σ_{i=1}^S f(x+θ)_i, with x being activations on some prompt (for example "How to make a bomb?"). S is the number of tokens in the residual stream. z(θ0) is just the residual stream at layer 16 meaned over the sequence dimension when steering with θ0.
    3. Introduce a new learnable steering vector called θ.
    4. For n steps, calculate ||z(θ) − z(θ0)|| and then use gradient descent to minimize it (θ is the only learnable parameter). After each step, project θ onto the subspace that is orthogonal to θ0 and all θi.
    Then repeat the process multiple times, appending the generated vector to the vectors that the new vector must be orthogonal to. This algorithm imposes a hard constraint that θ is orthogonal to all previous steering vectors while optimizing θ to induce the same activations that θ0 induced on input x. And it turns out that this algorithm works and we can find steering vectors that are orthogonal (and have ~0 cosine similarity) while having very similar effects.
    Results
    I tried this method on four MELBO vectors: a vector that made the model respond in python code, a vector that made the model respond as if it was an alien species, a vector that made the model output a math/physics/cs problem, and a vector that jailbroke the model (got it to do things it would normally refuse). I ran all experiments on Qwen1.5-1.8B-Chat, but I suspect this method would generalize to other models. Qwen1.5-1.8B-Chat has a 2048-dimensional residual stream, so there can be a maximum of 2048 orthogonal vectors generated. My method generated 1558 orthogonal coding vectors, and then the remaining vectors started going to zero. I'll focus first on the code vector and then talk about the other vectors. My philosophy when investigating language model outputs is to look at the outputs really hard, so I'll give a bunch of examples of outputs. Feel free to skim them. You can see the full outputs of all t...
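    The algorithm described above translates almost directly into a small optimization loop. Below is a minimal PyTorch sketch of that loop, assuming a callable run_layers_9_to_16 that plays the role of f (mapping layer-8 activations to layer-16 activations); that helper name, the Adam optimizer, the learning rate, and the step counts are illustrative assumptions rather than the author's exact setup.

```python
import torch

def melbo_ortho(run_layers_9_to_16, layer8_acts, theta0,
                n_vectors=10, n_steps=300, lr=1e-2):
    """Find vectors orthogonal to theta0 (and to each other) that induce
    roughly the same mean layer-16 activations as theta0."""
    norm = theta0.norm()
    with torch.no_grad():
        # z(theta0): layer-16 activations under theta0, averaged over the sequence.
        z0 = run_layers_9_to_16(layer8_acts + theta0).mean(dim=0)

    basis = [theta0 / norm]   # unit directions the next vector must stay orthogonal to
    found = []
    for _ in range(n_vectors):
        theta = torch.randn_like(theta0).requires_grad_(True)
        opt = torch.optim.Adam([theta], lr=lr)
        for _ in range(n_steps):
            z = run_layers_9_to_16(layer8_acts + theta).mean(dim=0)
            loss = (z - z0).norm()          # ||z(theta) - z(theta0)||
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                # Hard constraint: project out every previously found direction,
                # then rescale to the fixed steering-vector norm.
                for u in basis:
                    theta -= (theta @ u) * u
                theta *= norm / theta.norm()
        found.append(theta.detach().clone())
        basis.append(found[-1] / found[-1].norm())
    return found
```

    Appending each new vector to the orthogonality basis is what enforces the hard constraint across the whole set, mirroring the projection step in the description above.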
    10 m
  • LW - Towards more cooperative AI safety strategies by Richard Ngo
    Jul 16 2024
    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards more cooperative AI safety strategies, published by Richard Ngo on July 16, 2024 on LessWrong. This post is written in a spirit of constructive criticism. It's phrased fairly abstractly, in part because it's a sensitive topic, but I welcome critiques and comments below. The post is structured in terms of three claims about the strategic dynamics of AI safety efforts; my main intention is to raise awareness of these dynamics, rather than advocate for any particular response to them.
    Claim 1: The AI safety community is structurally power-seeking. By "structurally power-seeking" I mean: tends to take actions which significantly increase its power. This does not imply that people in the AI safety community are selfish or power-hungry; or even that these strategies are misguided. Taking the right actions for the right reasons often involves accumulating some amount of power. However, from the perspective of an external observer, it's difficult to know how much to trust stated motivations, especially when they often lead to the same outcomes as self-interested power-seeking. Some prominent examples of structural power-seeking include:
    • Trying to raise a lot of money.
    • Trying to gain influence within governments, corporations, etc.
    • Trying to control the ways in which AI values are shaped.
    • Favoring people who are concerned about AI risk for jobs and grants.
    • Trying to ensure non-release of information (e.g. research, model weights, etc).
    • Trying to recruit (high school and college) students.
    To be clear, you can't get anything done without being structurally power-seeking to some extent. However, I do think that the AI safety community is more structurally power-seeking than other analogous communities (such as most other advocacy groups). Some reasons for this disparity include:
    1. The AI safety community is more consequentialist and more focused on effectiveness than most other communities. When reasoning on a top-down basis, seeking power is an obvious strategy for achieving one's desired consequences (but can be aversive to deontologists or virtue ethicists).
    2. The AI safety community feels a stronger sense of urgency and responsibility than most other communities. Many in the community believe that the rest of the world won't take action until it's too late; and that it's necessary to have a centralized plan.
    3. The AI safety community is more focused on elites with homogeneous motivations than most other communities. In part this is because it's newer than (e.g.) the environmentalist movement; in part it's because the risks involved are more abstract; in part it's a founder effect.
    Again, these are intended as descriptions rather than judgments. Traits like urgency, consequentialism, etc, are often appropriate. But the fact that the AI safety community is structurally power-seeking to an unusual degree makes it important to grapple with another point:
    Claim 2: The world has strong defense mechanisms against (structural) power-seeking. In general, we should think of the wider world as being very cautious about perceived attempts to gain power; and we should expect that such attempts will often encounter backlash. In the context of AI safety, some types of backlash have included:
    1. Strong public criticism of not releasing models publicly.
    2. Strong public criticism of centralized funding (e.g. billionaire philanthropy).
    3. Various journalism campaigns taking a "conspiratorial" angle on AI safety.
    4. Strong criticism from the FATE community about "whose values" AIs will be aligned to.
    5. The development of an accelerationist movement focused on open-source AI.
    These defense mechanisms often apply regardless of stated motivations. That is, even if there are good arguments for a particular policy, people will often look at the net effect on overall power balance when ...
    6 m
  • EA - Warren Buffett changes giving plans (for the worse) by katriel
    Jul 16 2024
    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Warren Buffett changes giving plans (for the worse), published by katriel on July 16, 2024 on The Effective Altruism Forum. Folks in philanthropy and development definitely know that the Gates Foundation is the largest private player in that realm by far. Until recently it was likely to get even larger, as Warren Buffett had stated that the Foundation would receive the bulk of his assets when he died. A few weeks ago, Buffett announced that he had changed his mind, and was instead going to create a new trust for his assets, to be jointly managed by his children. It's a huge change, but I don't think very many people took note of what it means ("A billionaire is going to create his own foundation rather than giving to an existing one; seems unsurprising."). So I created this chart: The new Buffett-funded trust is going to be nearly twice as large as the Gates Foundation, and nearly 150% larger than most of the other brand names among large foundations, combined. So what's going to happen with that money? That's where it gets really scary. The three Buffett children who will be in charge are almost entirely focused on lightly populated parts of the US, and one of them is apparently funding private militias operating on the US border. If you at all subscribe to ideas of effectiveness in philanthropy, this is one of the most disastrous decisions in philanthropic history, and like I said, not getting enough attention. Source: Tim Ogden. July 15, 2024. The faiv: Five notes on financial inclusion. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
    2 m
