The Loss of Signal in the Generative AI Era
#AI Alignment, #Signal vs Noise
The amount of generated content will surge over the next few years, making it increasingly hard to separate a reliable signal from the noise.
Imagine an LLM trained on a 100% accurate dataset. When we ask it to generate something, it will inevitably add a bit of noise. Let's say the output contains 1% noise: just a tiny fraction of error.
When we publish that generated content, there is a high chance it will eventually end up in a future training dataset. Future models will then be trained on data that is no longer 100% accurate, and as this cycle repeats, the noise compounds with every generation.
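Under the simplest reading of this loop, where each cycle preserves a fraction (1 − ε) of the accurate content and everything generated flows back into training, accuracy after n cycles is (1 − ε)^n. Here is a minimal sketch of that compounding; the 1% rate and the full-feedback assumption are illustrative, not measurements:

```python
EPSILON = 0.01     # assumed noise added by each generation of model output
GENERATIONS = 100  # number of train -> generate -> retrain cycles

accuracy = 1.0     # fraction of the dataset that is accurate (starts at 100%)
for gen in range(1, GENERATIONS + 1):
    # The model is trained on the current dataset, so its output is at
    # best as accurate as that data, minus the noise it adds itself.
    accuracy *= (1.0 - EPSILON)
    if gen in (1, 10, 50, 100):
        print(f"generation {gen:3d}: {accuracy:.1%} of content still accurate")
```

Even at a 1% noise rate, only about a third of the content (0.99^100 ≈ 36.6%) remains accurate after 100 cycles; a higher noise rate or a faster retraining cadence gets there much sooner.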
The consequences? Not only does the degrading dataset hurt the quality of future models, but the web itself gradually fills with false information that becomes ever harder to distinguish from the truth.
However, this does not apply equally to all information sources. Viewed through the Information Response Latency vs Magnitude Trade-off, the issue is more significant for low-response-latency sources, which republish content quickly with little review, and less critical for trustworthy sources that vet information before it spreads.