Neural nets require large-scale datasets for training, but collecting enough labeled data is often expensive. One way to deal with this is data augmentation: artificially increasing the number of training examples by transforming the data we already have.
Motivation
Data augmentation works when we can find appropriate invariance properties that the model should possess, i.e. transformations of the input that leave the label unchanged.
Image-recognition
- rescaling or applying affine distortions to images (translating, scaling, rotating, or flipping the input image)
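
A minimal sketch of such a pipeline, assuming torchvision is installed (the library and parameter values are illustrative choices, not prescribed above); each transform is applied randomly, so every epoch sees slightly different versions of the same image:

```python
# Illustrative image-augmentation pipeline with torchvision (assumed available).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),            # rescaling
    transforms.RandomAffine(degrees=15,           # rotation
                            translate=(0.1, 0.1), # translation
                            scale=(0.9, 1.1)),    # scaling
    transforms.RandomHorizontalFlip(),            # flipping
    transforms.ToTensor(),
])

# usage: pass `augment` as the `transform` argument of a torchvision dataset
```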
Speech-recognition
- adding a background audio track or applying small shifts along the time dimension (adding artificial background noise, changing the tone or speed of the speech signal; see DeepSpeech: Scaling up end-to-end speech recognition)
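
A minimal NumPy sketch of two of these operations: mixing in background noise at a chosen signal-to-noise ratio, and shifting the waveform in time. The array names and default parameters are illustrative assumptions; `signal` and `noise` are 1-D float arrays at the same sample rate:

```python
import numpy as np

def add_background_noise(signal, noise, snr_db=10.0):
    """Mix `noise` into `signal` at the requested SNR (in dB)."""
    noise = np.resize(noise, signal.shape)        # loop/trim noise to length
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

def time_shift(signal, max_shift=1600):
    """Randomly shift the waveform by up to `max_shift` samples
    (1600 samples = 100 ms at 16 kHz)."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(signal, shift)
```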
Text-classification
Unlike images and speech, text cannot be augmented with simple signal transformations: the exact order of characters and words carries strict syntactic and semantic meaning, so naive perturbations can destroy the label.
Best way:
- have humans rephrase the sentences -> high-quality paraphrases, but unrealistic and expensive at scale
Choices
- synonym replacement: replace words or phrases with synonyms, e.g. drawn from a thesaurus [1] (sketch below)
- back-translation: translate [english -> 'intermediate language' -> english], so the round trip yields a paraphrase [2] (sketch below)
- data noising: randomly perturb tokens, e.g. replace words with samples from the unigram distribution, which acts like smoothing in language models [3] (sketch below)
- contextual augmentation: replace words with alternatives predicted by a bidirectional language model from the surrounding context [5] (sketch below)
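
A minimal sketch of synonym replacement, assuming NLTK is installed with the WordNet corpus downloaded (`nltk.download('wordnet')`); the replacement probability `p` is an illustrative choice, and WordNet sense errors are expected:

```python
import random
from nltk.corpus import wordnet

def synonym_replace(words, p=0.1):
    """Replace each word with a random WordNet synonym with probability p."""
    out = []
    for w in words:
        lemmas = {l.name().replace('_', ' ')
                  for s in wordnet.synsets(w) for l in s.lemmas()} - {w}
        if lemmas and random.random() < p:
            out.append(random.choice(sorted(lemmas)))
        else:
            out.append(w)
    return out

# synonym_replace("the movie was really good".split())
```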
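
A minimal sketch of back-translation. The `translate` helper below is hypothetical, a stand-in for whatever MT model or service is available; only the round-trip logic is the point:

```python
def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical machine-translation call; plug in a real MT backend here."""
    raise NotImplementedError

def back_translate(text: str, pivot: str = "fr") -> str:
    """english -> pivot language -> english, yielding a paraphrase of `text`."""
    return translate(translate(text, "en", pivot), pivot, "en")
```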
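
A minimal sketch of one scheme from the noising family in [3], unigram noising: with probability `gamma`, a token is swapped for a sample from the corpus unigram distribution. `unigram_counts` (a dict mapping token -> corpus frequency) and the default `gamma` are illustrative assumptions:

```python
import random

def unigram_noise(words, unigram_counts, gamma=0.1):
    """With probability gamma, replace a token with a unigram-distributed sample."""
    vocab = list(unigram_counts)
    weights = [unigram_counts[w] for w in vocab]
    return [random.choices(vocab, weights=weights)[0]
            if random.random() < gamma else w
            for w in words]
```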
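
A minimal sketch of contextual augmentation. [5] trains a label-conditional bidirectional language model; the sketch below swaps in an off-the-shelf masked LM instead (BERT through the Hugging Face `transformers` fill-mask pipeline, assumed installed) and drops the label-conditioning:

```python
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def contextual_replace(words):
    """Mask one random word and substitute the LM's top prediction for it."""
    i = random.randrange(len(words))
    masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i+1:])
    best = fill_mask(masked)[0]["token_str"]  # top-scoring fill for the blank
    return words[:i] + [best.strip()] + words[i+1:]
```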
References
- [1] Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. arXiv preprint arXiv:1502.01710.
- [2] Wieting, J., Mallinson, J., & Gimpel, K. (2017). Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext. arXiv preprint arXiv:1706.01847.
- [3] Xie, Z., Wang, S. I., Li, J., Lévy, D., Nie, A., Jurafsky, D., & Ng, A. Y. (2017). Data Noising as Smoothing in Neural Network Language Models. arXiv preprint arXiv:1703.02573.
- [4] fast.ai forum: data augmentation for nlp.
- [5] Kobayashi, S. (2018). Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. arXiv preprint arXiv:1805.06201.