We're running out of data to train AI on

Thursday, November 24, 2022 • 6:00 EST

The trouble is, the types of data typically used for training language models may be used up in the near future, perhaps as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization, that has yet to be peer reviewed. Training on ever larger amounts of data is the approach that has produced impressive results in large language models such as GPT-3, and the issue stems partly from the fact that researchers filter that data into two categories: high quality and low quality. Models are typically trained only on data in the high-quality category, because that is the type of language researchers want them to reproduce, and large language model researchers are increasingly concerned that this sort of data will run out, says Teven Le Scao, a researcher at the AI company Hugging Face who was not involved in Epoch's work. "We've seen how smaller models that are trained on higher-quality data can outperform larger models trained on lower-quality data," he explains. If data shortages push AI researchers to incorporate more diverse datasets into the training process, it would be a "net positive" for language models, says Swabha Swayamdipta, a machine learning professor at the University of Southern California. Researchers may also find ways to extend the life of the data used for training language models. (technologyreview.com)
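The high-quality/low-quality split the article describes is, in practice, often implemented as a set of simple heuristics applied to scraped documents before training. The sketch below is a hypothetical illustration of that idea only; it is not the filtering pipeline used by Epoch, Hugging Face, or any particular model, and every heuristic and threshold (min_words, min_alpha_ratio, max_repeat_ratio) is an assumption chosen for this example.

```python
# Hypothetical sketch of heuristic corpus quality filtering.
# Not any organization's real pipeline; thresholds are illustrative assumptions.

def looks_high_quality(text: str,
                       min_words: int = 50,
                       min_alpha_ratio: float = 0.7,
                       max_repeat_ratio: float = 0.3) -> bool:
    """Return True if a document passes some crude quality heuristics:
    long enough, mostly letters and spaces, not dominated by repeated lines."""
    words = text.split()
    if len(words) < min_words:                          # too short to be useful prose
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    if alpha / max(len(text), 1) < min_alpha_ratio:     # heavy markup/number noise
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        dup_ratio = 1 - len(set(lines)) / len(lines)    # share of duplicated lines
        if dup_ratio > max_repeat_ratio:                # boilerplate-heavy page
            return False
    return True

# Usage: split a raw corpus into the two buckets the article describes.
corpus = ["...raw scraped documents go here..."]        # placeholder input
high_quality = [doc for doc in corpus if looks_high_quality(doc)]
low_quality = [doc for doc in corpus if not looks_high_quality(doc)]
```

Real filters used for web-scale corpora tend to combine many more signals (language identification, perplexity under a reference model, deduplication across documents), but the basic pattern of bucketing text before training is the same.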


