Cognitive Collective: RLHF Is Not a Magic Wand for Alignment
It is nearly impossible to define alignment, let alone achieve it. Human beings, in all their variance, capacity, and folly, make up the entire training pipeline, and we take comfort in calling them all by the same name. But every human is an individual, and models will change as their trainers do.