Review Comment:
I’m happy with the revisions, which address both my concern about coverage (more papers from the ACL Anthology were added to the analysis) and my concern about suitability as an introductory text (two introductory sections were added).
Minor comments:
2. Regarding the distinction between pre-training data and fine-tuning data, thanks for adding this sentence. I think that catastrophic forgetting of the pre-training task is not the biggest concern when the pre-training task is a (masked or standard) LM objective. After all, most downstream NLP tasks don’t need to guess a word in its context. I would add additional concerns, such as:
(a) some of these LMs are so big that only rich organizations can afford GPUs with enough memory to use them.
(b) despite performance improvements on many NLP tasks, these models are still limited in their reasoning abilities. For example, GPT-3, with its massive training, gets only 80% accuracy on 3-digit addition problems, a task which a symbolic model would clearly get 100% correct (https://arxiv.org/abs/2202.07785). Such models also tend to overfit to the training corpus, which can lead to revealing private information and encoding biases.
(c) for this reason, instead of having one pre-trained LM used by everyone in the community and fine-tuned with small training sets on specific tasks, the community started an LM arms race. Ever-bigger LMs are trained on more data, negating the environmental benefit of re-using LMs in the first place.
You can cite “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” by Bender et al. for some of these claims: https://dl.acm.org/doi/pdf/10.1145/3442188.3445922.
3. Thanks for including the list of venues in the appendix, and for using the ACL Anthology. Is it true that, despite using the ACL Anthology, the only venue where you found papers satisfying the requirements was ACL itself? What about TACL, EMNLP, NAACL, EACL, COLING, etc.? The NLP community submits papers to any of these venues pretty much interchangeably.