Efficient Inference and Training of Large Neural Network Models

The memory consumption and computational cost of state-of-the-art deep neural network models are increasing dramatically, which makes efficient deep learning techniques valuable for both inference and training. In this talk, we present our progress on this topic. First, we introduce LTP, which uses pruning to accelerate inference. Next, we discuss staged training for transformers and TASC, both of which are designed to accelerate training. Finally, we show our solutions for efficiently implementing large recommendation models: we systematically apply quantization in DQRM and leverage the sparsity in DLRM to better support hot embeddings. Our methods achieve strong performance and generalize well.