
Microsoft's large-scale self-supervised pre-trained unified language model, supporting foundational model research across tasks, languages, and modalities.

License: MIT | Language: Python | Stars: 21.5k | Repository: microsoft/unilm | Last Updated: 2025-06-03

Microsoft UniLM Project Detailed Introduction

Project Overview

Microsoft UniLM is a large-scale self-supervised pre-training model library developed by Microsoft Research, focusing on foundational model research across tasks, languages, and modalities. The project develops new foundation model architectures and AI systems, with an emphasis on modeling generality and capability as well as training stability and efficiency.

Project Address: https://github.com/microsoft/unilm

Core Concept: The Big Convergence

The core concept of the UniLM project is "The Big Convergence," which aims to achieve large-scale self-supervised pre-training in the following three dimensions:

  • Cross-Task: Predictive and generative tasks
  • Cross-Lingual: Supports over 100 languages
  • Cross-Modal: Language, image, audio, layout format, visual+language, audio+language, etc.

Main Technology Stack

1. TorchScale Architecture Library

A library for foundational architecture research, focusing on the following areas (a minimal usage sketch follows the list):

  • Stability: DeepNet - Extends Transformer to 1000+ layers
  • Generality: Foundation Transformers (Magneto) - Truly general modeling across tasks and modalities
  • Capability: Length-Extrapolatable Transformer - Extrapolates to sequences longer than those seen during training
  • Efficiency: X-MoE, BitNet, RetNet, LongNet, and other efficient architectures
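
As a concrete entry point, TorchScale (`pip install torchscale`) exposes these architectures through configuration objects. Below is a minimal sketch following the `EncoderConfig`/`Encoder` pattern shown in the TorchScale README; the `deepnorm` flag is an assumption drawn from that documentation:

```python
# pip install torchscale
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

# Build a Transformer encoder; deepnorm=True enables DeepNet-style
# residual scaling for stable training of very deep stacks.
config = EncoderConfig(vocab_size=64000, deepnorm=True)
model = Encoder(config)
print(model)
```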

2. Language Model Series

UniLM Series

  • UniLM: Unified language understanding and generation pre-training
  • InfoXLM/XLM-E: Multilingual/Cross-lingual pre-training models supporting 100+ languages
  • DeltaLM/mT6: Encoder-decoder pre-training for language generation and translation
  • MiniLM: Small and fast pre-trained models for language understanding and generation
  • AdaLM: Domain, language, and task adaptation of pre-trained models
  • EdgeLM: Small pre-trained models on edge/client devices
  • SimLM: Large-scale pre-training for similarity matching
  • E5: General-purpose text embedding models (usage sketch after this list)
  • MiniLLM: Knowledge distillation of large language models
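
To make the E5 entry concrete, here is a minimal embedding sketch using the HuggingFace transformers API. The checkpoint name `intfloat/e5-base-v2` and the `query:`/`passage:` prefixes follow the published E5 model card; the mean-pooling step is the documented usage pattern, not anything UniLM-specific:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# E5 checkpoints expect "query:" / "passage:" prefixes (per the model card).
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModel.from_pretrained("intfloat/e5-base-v2")

texts = [
    "query: what is UniLM?",
    "passage: UniLM unifies language understanding and generation pre-training.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Mean-pool token states over non-padding positions, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
emb = F.normalize(emb, dim=-1)
print((emb[0] @ emb[1]).item())  # cosine similarity of query and passage
```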

Multimodal Large Language Models

  • Kosmos-1: Multimodal Large Language Model (MLLM)
  • Kosmos-2: Grounded Multimodal Large Language Model (grounding example after this list)
  • Kosmos-2.5: Multimodal Document Understanding Model
  • MetaLM: Language models as a general-purpose interface to foundation models
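
Kosmos-2 checkpoints are published on the HuggingFace hub. The sketch below follows the pattern in the transformers documentation for `microsoft/kosmos-2-patch14-224`, where a `<grounding>` prompt makes the model emit a caption along with bounding boxes for the entities it mentions; `example.jpg` is a placeholder path:

```python
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained(
    "microsoft/kosmos-2-patch14-224"
)

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text="<grounding>An image of", images=image,
                   return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids,
                                        skip_special_tokens=True)[0]
# Separate the caption from the grounded entities and their bounding boxes.
caption, entities = processor.post_process_generation(generated_text)
print(caption, entities)
```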

3. Vision Model Series

BEiT Series

  • BEiT: Generative self-supervised pre-training for vision (inference sketch after this list)
  • BEiT-2: BERT-style image Transformer pre-training
  • BEiT-3: General-purpose multimodal foundation model, a significant milestone in large-scale pre-training across tasks, languages, and modalities
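
For reference, a fine-tuned BEiT checkpoint can be used for image classification directly through transformers. A minimal sketch, assuming the `microsoft/beit-base-patch16-224` checkpoint (ImageNet-1k labels) and a placeholder image path:

```python
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification

# ImageNet-1k fine-tuned BEiT checkpoint from the HuggingFace hub.
processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```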

Document AI Models

  • DiT: Self-supervised pre-training for Document Image Transformer
  • TextDiffuser/TextDiffuser-2: Diffusion models as text painters
  • LayoutLM/LayoutLMv2/LayoutLMv3: Multimodal (text+layout+image) document foundation models
  • LayoutXLM: Multimodal foundation model for multilingual document AI
  • MarkupLM: Pre-training of markup language models for visually-rich document understanding
  • XDoc: Unified pre-training for cross-format document understanding
  • TrOCR: Transformer-based OCR with pre-trained models (inference sketch after this list)
  • LayoutReader: Text and layout pre-training for reading order detection
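
Among these, TrOCR has a particularly compact inference path in transformers: a vision encoder feeds a text decoder that emits the transcription. A minimal sketch, assuming the `microsoft/trocr-base-printed` checkpoint and a placeholder image of a single text line:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Placeholder path; TrOCR expects an image cropped to one text line.
image = Image.open("text_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```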

4. Speech Model Series

  • WavLM: Speech pre-training for full-stack speech tasks (feature-extraction sketch after this list)
  • VALL-E: Neural codec language model for TTS
  • UniSpeech: Unified pre-training for self-supervised and supervised learning of ASR
  • UniSpeech-SAT: Universal speech representation learning with speaker-aware pre-training
  • SpeechT5: Encoder-decoder pre-training for spoken language processing
  • SpeechLM: Enhanced speech pre-training with unpaired text data
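
As a concrete example of the speech stack, WavLM is commonly used as a feature extractor whose hidden states feed downstream heads (speaker verification, diarization, ASR). A sketch assuming the `microsoft/wavlm-base-plus` checkpoint, with dummy audio standing in for a real 16 kHz waveform:

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

# One second of dummy 16 kHz audio; real use would load a waveform file.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_frames, 768)
print(hidden.shape)
```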

5. Vision-Language Models

  • VLMo: Unified vision-language pre-training
  • VL-BEiT: Generative vision-language pre-training

Core Technical Features

1. Architecture Innovation

  • DeepNet: Scales Transformer depth to 1,000+ layers
  • Magneto: Truly general modeling architecture
  • BitNet: 1-bit Transformer architecture
  • RetNet: Retentive Network, a successor architecture to the Transformer (usage sketch after this list)
  • LongNet: Scales sequence length to 1 billion tokens
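
Several of these architectures are implemented in TorchScale. For instance, a RetNet decoder can be instantiated as below; the module paths are assumptions based on the TorchScale repository:

```python
from torchscale.architecture.config import RetNetConfig
from torchscale.architecture.retnet import RetNetDecoder

# RetNet replaces softmax attention with multi-scale retention,
# enabling parallel training and O(1)-per-token recurrent inference.
config = RetNetConfig(vocab_size=64000)
model = RetNetDecoder(config)
print(model)
```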

2. Training Efficiency Optimization

  • X-MoE: Scalable and finetunable sparse mixture-of-experts model
  • Aggressive Decoding: Lossless and efficient sequence-to-sequence decoding algorithm
  • Knowledge Distillation: Model compression and acceleration techniques (generic loss sketch below)
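
To illustrate the distillation item, below is a generic soft-target distillation loss in PyTorch. This is a minimal sketch of the classic Hinton-style objective, not the exact MiniLM or MiniLLM formulation (those distill self-attention relations and use a reverse-KL objective, respectively):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target knowledge distillation loss (Hinton-style sketch)."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); the t^2 factor keeps gradient magnitudes
    # comparable across temperature settings.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```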

3. Multilingual Support

  • Supports over 100 languages
  • Cross-lingual transfer learning (similarity sketch after this list)
  • Multilingual document understanding
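
A quick way to see cross-lingual representation sharing in practice is to embed parallel sentences in two languages and compare them. The sketch below assumes the `microsoft/infoxlm-base` checkpoint from the HuggingFace hub (InfoXLM shares the XLM-R vocabulary, so one tokenizer covers both languages); the mean-pooling step is a common convention, not a UniLM-prescribed recipe:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/infoxlm-base")
model = AutoModel.from_pretrained("microsoft/infoxlm-base")

# An English sentence and its German translation.
pair = ["The weather is nice today.", "Das Wetter ist heute schön."]
batch = tok(pair, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch).last_hidden_state

# Mean-pool over non-padding tokens and normalize.
mask = batch["attention_mask"].unsqueeze(-1)
emb = F.normalize((out * mask).sum(1) / mask.sum(1), dim=-1)
print((emb[0] @ emb[1]).item())  # higher = more similar across languages
```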

4. Multimodal Fusion

  • Unified modeling of text+image+layout
  • Vision-language understanding and generation
  • Speech-text cross-modal processing

Application Areas

1. Natural Language Processing

  • Language understanding and generation
  • Machine translation
  • Text classification and sentiment analysis
  • Question answering systems

2. Document AI

  • Document layout analysis
  • Form understanding
  • OCR text recognition
  • Document question answering

3. Computer Vision

  • Image classification
  • Object detection
  • Image generation
  • Visual question answering

4. Speech Processing

  • Speech recognition (ASR)
  • Speech synthesis (TTS)
  • Speech understanding
  • Multilingual speech processing

Technology Stack and Tools

Development Framework

  • Developed based on PyTorch
  • Integrated with HuggingFace Transformers
  • Supports distributed training (minimal DDP launch sketch below)
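
As an illustration of the distributed-training point, the standard PyTorch pattern used with models of this kind is DistributedDataParallel launched via `torchrun`. This is a self-contained sketch with a placeholder linear model rather than an actual UniLM checkpoint; it requires GPUs and the NCCL backend:

```python
# Launch: torchrun --nproc_per_node=4 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()

    model = torch.nn.Linear(768, 768).to(device)  # placeholder model
    model = DDP(model, device_ids=[device])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 768, device=device)  # dummy batch
    loss = model(x).pow(2).mean()
    loss.backward()  # gradients are all-reduced across ranks here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```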

Pre-training Data

  • Large-scale multilingual text data
  • Image-text paired data
  • Speech data
  • Document image data

Evaluation Benchmarks

  • GLUE and SuperGLUE language understanding benchmarks (metric-loading example after this list)
  • XTREME multilingual benchmark
  • VQA visual question answering benchmark
  • DocVQA document question answering benchmark
  • SUPERB speech benchmark
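
For instance, GLUE metrics can be computed with the HuggingFace evaluate library; a minimal example for the MRPC subset, using dummy predictions in place of real model outputs:

```python
# pip install evaluate scikit-learn
import evaluate

# Load the metric for the MRPC task of the GLUE benchmark suite.
metric = evaluate.load("glue", "mrpc")
result = metric.compute(predictions=[1, 0, 1], references=[1, 0, 0])
print(result)  # {'accuracy': ..., 'f1': ...}
```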

The UniLM project represents Microsoft's cutting-edge research in foundation models and general artificial intelligence. It provides powerful tools and infrastructure for both academia and industry, and it continues to advance the development and application of multimodal AI technology.
