The Loss of Signal in the Generative AI Era
#AI Alignment, #Signal vs Noise
The amount of generated content will surge in the next few years, making it increasingly difficult to spot a meaningful signal.
Imagine an LLM trained on a 100% accurate dataset. When we ask it to generate something, it will inevitably introduce a bit of noise. Let's say the output contains 1% noise: a tiny fraction of error.
When we post that generated content online, there is a high chance it will eventually end up in a future training dataset. Future models will then be trained on data that is no longer 100% accurate.
As this process repeats, the noise accumulates: each generation trains on the slightly noisier output of the previous one. This degrades not only future models' performance, due to the decaying dataset, but also the entire internet. Since generated content is incomparably cheaper to produce, it will come to dominate organic content. The internet is likely to get flooded.
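To see how quickly 1% compounds, here is a minimal sketch. It assumes, purely for illustration, that each training generation keeps 99% of the previous generation's signal multiplicatively; the 1% rate and the generation count are made-up numbers, not measurements:

```python
# A toy model of compounding noise across training generations,
# assuming each generation retains 99% of the previous signal.

NOISE_PER_GENERATION = 0.01  # assumed 1% noise added per generation

signal = 1.0  # start from a hypothetical 100% accurate dataset
for generation in range(1, 11):
    signal *= 1.0 - NOISE_PER_GENERATION
    print(f"Generation {generation:2d}: {signal:.1%} signal remaining")
```

After ten generations only about 90.4% of the original signal survives (0.99^10 ≈ 0.904), and after roughly 69 generations less than half of it remains (0.99^69 ≈ 0.5). A seemingly tiny per-step error compounds fast.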
It concerns me a bit because I don't want to get lost in that noise.