It seems like data is the limit for training, since there's only so much training data out there on the internet. Does this imply we can sail straight past this issue using AI-generated training data?
Given the increasing cheapness of compute and the scaling laws, yes, data will be the ultimate limit for training better large models. If data is the 'new oil', as has been stated many times in the era of Big Data, there may come a time when we simply run out. That limit is likely in the 10T token/word range, given that these LLMs are already scraping everything they can find on the web. We may be 1-2 model generations away from that limit. Now, we won't be able to sail past that issue just with AI-generated data, as I suspect the Alpaca 'trick' is only a way to get one model to become as good as, BUT NOT BETTER than, the original model. Analogy: if you cheat off a B+ student on a test, don't expect an A+ grade on the exam. (If I'm wrong, all bets are off.) Now, we can still improve with more training on the same data, and expect performance to improve on a log-based curve, e.g., 10x more compute in training may give ~1.5-2x better performance. Actually, it's unclear what 2x even means once a model is better than human on a task. Going from the 80th percentile to the 95th? Doubling a score? Qualitative measures will become more important with respect to complex tasks. In any case, there are going to be data-based limits that we run into, but when you combine the potential for 10x more data, 1000x more compute, more RLHF, and scaling SFT 100x with this Alpaca instruct technique, you still have enough headroom to get to better-than-human on many tasks in a model that could arrive less than 5 years from now. (Actually, what's stunning is that someone like DeepMind could try all this in 2023!) I'll try to collect these thoughts into an article about scaling LLMs next week.
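To make the compute arithmetic concrete, here is a rough sketch. It rests on an assumption that is not stated in the answer: that "log-based curve" means a fixed multiplicative gain for every 10x of compute, i.e. a power law where performance scales as compute^alpha. The function names are hypothetical, and "performance" is an abstract score, not any particular benchmark.

```python
import math


def implied_exponent(gain_per_10x: float) -> float:
    """Exponent alpha such that performance ~ compute**alpha
    yields `gain_per_10x` when compute grows by 10x (assumed power law)."""
    return math.log10(gain_per_10x)


def projected_gain(compute_multiplier: float, gain_per_10x: float) -> float:
    """Performance multiplier for a given compute multiplier,
    under the same assumed power law."""
    return compute_multiplier ** implied_exponent(gain_per_10x)


if __name__ == "__main__":
    # The answer's range: 10x compute gives roughly 1.5x to 2x performance.
    for gain in (1.5, 2.0):
        alpha = implied_exponent(gain)
        print(f"gain per 10x = {gain}: alpha = {alpha:.3f}, "
              f"1000x compute -> {projected_gain(1000, gain):.3f}x")
```

Under that assumption, the 1000x compute figure mentioned above would compound to somewhere between about 3.4x and 8x on whatever score is being measured. Real benchmark scores saturate, so treat this as an illustration of the compounding, not a forecast.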