Quantitative Assessment of Edge AI Model Compression Techniques to Enhance Performance of On-Device Natural Language Processing Applications
Keywords:
Edge AI, Model Compression, Natural Language Processing, Quantization, Pruning, Knowledge Distillation, On-device Inference
Abstract
Edge Artificial Intelligence (Edge AI) offers significant potential for real-time, private, and efficient execution of Natural Language Processing (NLP) tasks directly on mobile and embedded devices. However, the limited computational and memory resources of edge devices pose critical challenges for deploying large-scale NLP models. This study quantitatively evaluates state-of-the-art model compression techniques, including pruning, quantization, and knowledge distillation, for enhancing on-device NLP performance. Using benchmark datasets and representative NLP tasks, the study measures inference time, memory footprint, and accuracy trade-offs, and offers a comparative analysis that identifies suitable strategies for different hardware scenarios. Results show that hybrid compression methods consistently outperform individual techniques in balancing efficiency against model fidelity, paving the way for practical deployment of NLP solutions on edge devices.
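To make the measurement setup concrete, the following is a minimal sketch of how one of the evaluated techniques, post-training dynamic INT8 quantization, can be applied to a BERT-family model and assessed for memory footprint and inference latency. It uses PyTorch and Hugging Face Transformers; the model name (distilbert-base-uncased), the single-example CPU timing loop, and the serialized-size proxy for memory footprint are illustrative assumptions, not the exact experimental protocol of the study.

import io
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # assumed baseline; any BERT-family checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Dynamic quantization: weights of nn.Linear layers are stored as INT8 and
# dequantized on the fly; activations remain in floating point, so no
# calibration data is required.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def state_dict_mb(m: torch.nn.Module) -> float:
    """Serialized parameter size in MB (a rough proxy for on-device memory footprint)."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

def mean_latency_ms(m: torch.nn.Module, text: str, runs: int = 20) -> float:
    """Average single-example CPU inference latency over `runs` forward passes."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        m(**inputs)  # warm-up pass before timing
        start = time.perf_counter()
        for _ in range(runs):
            m(**inputs)
    return (time.perf_counter() - start) / runs * 1000

sample = "Edge AI brings NLP inference onto the device itself."
print(f"FP32 size: {state_dict_mb(model):7.1f} MB, latency: {mean_latency_ms(model, sample):6.1f} ms")
print(f"INT8 size: {state_dict_mb(quantized):7.1f} MB, latency: {mean_latency_ms(quantized, sample):6.1f} ms")

Analogous measurements for pruned or distilled variants (and for hybrid combinations) can reuse the same size and latency helpers, with task accuracy measured separately on the benchmark datasets.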
License
Copyright (c) 2023 Patrick Gallinari (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.