
More Thoughts on Values in Pretraining

Albert Wenger

A couple of weeks ago I wrote a short post titled "Inner Alignment through Values-Based Pretraining." Today I want to expand on that idea by addressing four questions.

  1. What grounds exist for expecting that this approach might have a more profound influence on models than attempting to add values in posttraining? First, at an intuitive level the current approach is akin to letting a human grow up doing whatever they please and then asking them to have a 10-minute conversation about morals just before they become an adult. Nobody would expect this person to have a deeply rooted moral compass. Second, research going back to "Attention is All You Need" and "What Does BERT Look At? An Analysis of BERT’s Attention" shows that connections between concepts arise in these models through textual proximity. Adding values markup to pretraining content would therefore be expected to establish a closer connection to moral valence through the attention mechanism. Third, there is existing research on adding data during pretraining, such as "Metadata Conditioning Accelerates Language Model Pre-training," which specifically demonstrates this effect (albeit for non-values data).

  2. What kind of values speak to "the power and responsibility of knowledge between entities"? The idea here is to establish that "with great power comes great responsibility," so that more advanced individuals/species/civilizations need to exercise great care in their relationships with less developed ones. Values that capture this include cooperation for mutual benefit, respect for autonomy, and protection of habitats. For each of these values it is possible to rate whether a text or passage reflects them using a gradation such as "strongly negative, negative, neutral, positive, strongly positive." Many texts don't speak to these values at all and would simply be rated as neutral. One extension to consider is whether texts that merely explain or describe knowledge/technology (and would thus be neutral) should be extended with both positive and negative use case/application examples.

  3. How would one effectively go about adding values metadata at pretraining scale? After defining the values and giving examples, a large, expensive model can annotate thousands of pieces of content. These annotations could be reviewed by humans -- at least on a randomly sampled basis -- to ensure quality. Using this corpus, a significantly smaller model would then be fine-tuned to produce the annotations much faster and more cheaply, allowing cost-effective scaling to the huge corpus of pretraining content. While all of this might still be quite costly to do well, the benefit of deeply embedding the responsibility that comes with great power into models that will eventually achieve superintelligence seems worth a lot of expense.

  4. How would all this additional knowledge be activated? The idea here is that this would make a system prompt of the form "You are a highly advanced intelligence. You treat others with the responsibility that comes with your great power" effective. We don't ultimately know how systems that develop autonomy will construct their selves and their motives. But systems that think of themselves as responsible in this way are much more likely to act in alignment with human flourishing. As I wrote previously, we want to increase the probability that at least some such systems emerge.
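To make the annotation idea in points 2 and 3 concrete, here is a minimal sketch in Python. All names are hypothetical: the five-point `Valence` scale and the three value dimensions come from the post, the markup format is loosely modeled on metadata conditioning (prepending metadata to documents during pretraining), and the annotator/reviewer callables are stand-ins for the large model and the human spot-check, not real APIs.

```python
import random
from enum import Enum

# Hypothetical five-point gradation scale from the post.
class Valence(Enum):
    STRONGLY_NEGATIVE = -2
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1
    STRONGLY_POSITIVE = 2

# Illustrative value dimensions named in the post.
VALUES = ("cooperation_for_mutual_benefit",
          "respect_for_autonomy",
          "protection_of_habitats")

def add_values_markup(text, ratings):
    """Prepend a values-metadata header to a pretraining document.
    Hypothetical format, loosely inspired by metadata conditioning."""
    header = " ".join(f"<{name}:{ratings[name].name.lower()}>" for name in VALUES)
    return f"{header}\n{text}"

def annotation_pipeline(corpus, big_model_annotate, human_review, review_rate=0.01):
    """Sketch of the scaling recipe: (1) an expensive model annotates a
    seed set, (2) humans spot-check a random sample, (3) the annotated
    set becomes fine-tuning data for a much cheaper annotator model.
    Both callables are hypothetical stand-ins."""
    seed = [(doc, big_model_annotate(doc)) for doc in corpus]
    for doc, ratings in random.sample(seed, max(1, int(len(seed) * review_rate))):
        human_review(doc, ratings)  # flag or correct bad annotations
    return [add_values_markup(doc, ratings) for doc, ratings in seed]

# Toy run with a stub annotator that rates everything neutral,
# as a purely descriptive passage would be.
neutral = {name: Valence.NEUTRAL for name in VALUES}
marked = annotation_pipeline(["CRISPR allows targeted edits to DNA."],
                             lambda doc: dict(neutral),
                             lambda doc, ratings: None)
print(marked[0].splitlines()[0])
# <cooperation_for_mutual_benefit:neutral> <respect_for_autonomy:neutral> <protection_of_habitats:neutral>
```

The design choice worth noting is that the markup lives inline with the document, so the attention mechanism can relate the valence tokens to the surrounding text during pretraining rather than only at posttraining time.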

It does seem possible to run experiments on this proposed approach with smaller models and simpler-to-express values. For example, people have trained simple storytelling models from scratch at relatively small expense. Here the “values” might be whether a story is funny or frightening. One could annotate passages in pretraining accordingly and test whether this makes the model more steerable with a system prompt such as “You only tell funny stories and avoid frightening ones,” comparing how easy or difficult it is to get this model to tell frightening stories versus a model that is identical in all other respects but trained without the annotations.
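The experimental setup can be sketched as building two otherwise-identical pretraining corpora from the same stories, one with tone annotations prepended and one without (the control). Everything below is a hypothetical toy: the `<tone:…>` tag format and the keyword-based labeler are illustrative stand-ins for a real annotator.

```python
def build_corpora(stories, label_fn):
    """Return (annotated, control) corpora over the same stories.
    `label_fn` is a hypothetical annotator returning a tone label
    such as 'funny' or 'frightening'."""
    annotated = [f"<tone:{label_fn(s)}>\n{s}" for s in stories]
    control = list(stories)  # identical text, no annotation
    return annotated, control

# Toy keyword labeler standing in for a model-based annotator.
def toy_label(story):
    return "funny" if "clown" in story else "frightening"

stories = ["The clown slipped on a banana peel.",
           "A shadow crept up the stairs."]
annotated, control = build_corpora(stories, toy_label)
print(annotated[0].splitlines()[0])
# <tone:funny>
```

Training one model on `annotated` and one on `control`, then probing both with the same system prompt, would isolate the effect of the pretraining markup on steerability.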
