2127–2156 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6052.htm [doi: 10.13328/j.cnki.jos.006052]
[5] McMahan B, Moore E, Ramage D, Hampson S, Arcas BA. Communication-efficient learning of deep networks from decentralized data.
In: Proc. of the 20th Int’l Conf. on Artificial Intelligence and Statistics. Fort Lauderdale: PMLR, 2017. 1273–1282.
[6] Peters M, Neumann M, Iyyer M, Gardner M, Zettlemoyer L. Deep contextualized word representations. In: Proc. of the 2018 Conf. of the
North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long Papers). New
Orleans: Association for Computational Linguistics, 2018. 2227–2237. [doi: 10.18653/v1/N18-1202]
[7] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[8] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. In: Proc.
of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1
(Long and Short Papers). Minneapolis: Association for Computational Linguistics, 2019. 4171–4186. [doi: 10.18653/v1/N19-1423]
[9] Liu YH, Ott M, Goyal N, Du JF, Joshi M, Chen DQ, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: A robustly optimized
BERT pretraining approach. arXiv:1907.11692, 2019.
[10] Sun C, Qiu XP, Xu YG, Huang XJ. How to fine-tune BERT for text classification? In: Proc. of the 18th China National Conf. on Chinese
Computational Linguistics. Kunming: Springer, 2019. 194–206. [doi: 10.1007/978-3-030-32381-3_16]
[11] Krizhevsky A. Learning multiple layers of features from tiny images. Technical Report, Toronto: University of Toronto, 2009.
[12] He KM, Zhang XY, Ren SQ, Sun J. Deep residual learning for image recognition. In: Proc. of the 2016 IEEE Conf. on Computer Vision
and Pattern Recognition. Las Vegas: IEEE, 2016. 770–778. [doi: 10.1109/CVPR.2016.90]
[13] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proc. of the
31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 6000–6010.
[14] Wang HP, Stich SU, He Y, Fritz M. ProgFed: Effective, communication, and computation efficient federated learning by progressive
training. In: Proc. of the 39th Int’l Conf. on Machine Learning. Baltimore: PMLR, 2022. 23034–23054.
[15] Alistarh D, Grubic D, Li JZ, Tomioka R, Vojnovic M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In:
Proc. of the 31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 1707–1718.
[16] Lin YJ, Han S, Mao HZ, Wang Y, Dally WJ. Deep gradient compression: Reducing the communication bandwidth for distributed training.
In: Proc. of the 6th Int’l Conf. on Learning Representations. Vancouver: OpenReview.net, 2018.
[17] Fu FC, Hu YZ, He YH, Jiang JW, Shao YX, Zhang C, Cui B. Don’t waste your bits! Squeeze activations and gradients for deep neural
networks via TinyScript. In: Proc. of the 37th Int’l Conf. on Machine Learning. PMLR, 2020. 3304–3314.
[18] Stich SU, Cordonnier JB, Jaggi M. Sparsified SGD with memory. In: Proc. of the 32nd Int’l Conf. on Neural Information Processing
Systems. Montreal: Curran Associates Inc., 2018. 4452–4463.
[19] Konečný J, McMahan HB, Yu FX, Richtárik P, Suresh AT, Bacon D. Federated learning: Strategies for improving communication
efficiency. arXiv:1610.05492, 2016.
[20] Li DL, Wang JP. FedMD: Heterogenous federated learning via model distillation. arXiv:1910.03581, 2019.
[21] Lin T, Kong LJ, Stich SU, Jaggi M. Ensemble distillation for robust model fusion in federated learning. In: Proc. of the 34th Int’l Conf.
on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2020. 198.
[22] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[23] Wang XA, Li H, Chen K, Shou LD. FedBFPT: An efficient federated learning framework for BERT further pre-training. In: Proc. of the
32nd Int’l Joint Conf. on Artificial Intelligence. Macao: ijcai.org, 2023. 4344–4352. [doi: 10.24963/ijcai.2023/483]
[24] Wang Y, Li GL, Li KY. Survey on contribution evaluation for federated learning. Ruan Jian Xue Bao/Journal of Software, 2023, 34(3):
1168–1192 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6786.htm [doi: 10.13328/j.cnki.jos.006786]
[25] Rong X. Word2Vec parameter learning explained. arXiv:1411.2738, 2014.
[26] Pennington J, Socher R, Manning CD. GloVe: Global vectors for word representation. In: Proc. of the 2014 Conf. on Empirical Methods
in Natural Language Processing (EMNLP). Doha: Association for Computational Linguistics, 2014. 1532–1543. [doi: 10.3115/v1/D14-1162]
[27] Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. In: Proc. of the 2019 Conf. on Empirical Methods in
Natural Language Processing and the 9th Int’l Joint Conf. on Natural Language Processing (EMNLP-IJCNLP). Hong Kong: Association
for Computational Linguistics, 2019. 3615–3620. [doi: 10.18653/v1/D19-1371]
[28] Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: A pre-trained biomedical language representation model for
biomedical text mining. Bioinformatics, 2020, 36(4): 1234–1240. [doi: 10.1093/bioinformatics/btz682]
[29] Yang Y, Uy MCS, Huang A. FinBERT: A pretrained language model for financial communications. arXiv:2006.08097, 2020.

