
Microsoft's large-scale self-supervised pre-trained unified language model, supporting foundational model research across tasks, languages, and modalities.

License: MIT | Language: Python | Stars: 21.5k | Repository: microsoft/unilm | Last Updated: 2025-06-03

Microsoft UniLM Project Detailed Introduction

Project Overview

Microsoft UniLM is a large-scale self-supervised pre-training model library developed by Microsoft Research, focusing on foundational model research across tasks, languages, and modalities. The project develops new foundation model architectures and AI systems, with an emphasis on modeling generality and capability as well as training stability and efficiency.

Project Address: https://github.com/microsoft/unilm

Core Concept: The Big Convergence

The core concept of the UniLM project is "The Big Convergence," which aims to achieve large-scale self-supervised pre-training in the following three dimensions:

  • Cross-Task: Predictive and generative tasks
  • Cross-Lingual: Supports over 100 languages
  • Cross-Modal: Language, image, audio, layout format, visual+language, audio+language, etc.

Main Technology Stack

1. TorchScale Architecture Library

A library for foundational architecture research, focusing on the following areas (a minimal usage sketch follows the list):

  • Stability: DeepNet - Extends Transformer to 1000+ layers
  • Generality: Foundation Transformers (Magneto) - Truly general modeling across tasks and modalities
  • Capability: Length-Extrapolatable Transformer - Extrapolates to sequences longer than those seen during training
  • Efficiency: X-MoE, BitNet, RetNet, LongNet, and other efficient architectures
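
As a concrete entry point, TorchScale (`pip install torchscale`) exposes these architectures through configuration objects. Below is a minimal sketch following the `EncoderConfig`/`Encoder` pattern shown in the TorchScale README; the `deepnorm` flag is an assumption drawn from that documentation:

```python
# pip install torchscale
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

# Build a Transformer encoder; deepnorm=True enables DeepNet-style
# residual scaling for stable training of very deep stacks.
config = EncoderConfig(vocab_size=64000, deepnorm=True)
model = Encoder(config)
print(model)
```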

2. Language Model Series

UniLM Series

  • UniLM: Unified language understanding and generation pre-training
  • InfoXLM/XLM-E: Multilingual/Cross-lingual pre-training models supporting 100+ languages
  • DeltaLM/mT6: Encoder-decoder pre-training for language generation and translation
  • MiniLM: Small and fast pre-trained models for language understanding and generation
  • AdaLM: Domain, language, and task adaptation of pre-trained models
  • EdgeLM: Small pre-trained models on edge/client devices
  • SimLM: Large-scale pre-training for similarity matching
  • E5: General-purpose text embedding models (usage sketch after this list)
  • MiniLLM: Knowledge distillation of large language models
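
To make the E5 entry concrete, here is a minimal embedding sketch using the HuggingFace transformers API. The checkpoint name `intfloat/e5-base-v2` and the `query:`/`passage:` prefixes follow the published E5 model card; the mean-pooling step is the documented usage pattern, not anything UniLM-specific:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# E5 checkpoints expect "query:" / "passage:" prefixes (per the model card).
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModel.from_pretrained("intfloat/e5-base-v2")

texts = [
    "query: what is UniLM?",
    "passage: UniLM unifies language understanding and generation pre-training.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Mean-pool token states over non-padding positions, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
emb = F.normalize(emb, dim=-1)
print((emb[0] @ emb[1]).item())  # cosine similarity of query and passage
```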

Multimodal Large Language Models

  • Kosmos-1: Multimodal Large Language Model (MLLM)
  • Kosmos-2: Grounded Multimodal Large Language Model (grounding example after this list)
  • Kosmos-2.5: Multimodal Document Understanding Model
  • MetaLM: Language models as a general-purpose interface to foundation models
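
Kosmos-2 checkpoints are published on the HuggingFace hub. The sketch below follows the pattern in the transformers documentation for `microsoft/kosmos-2-patch14-224`, where a `<grounding>` prompt makes the model emit a caption along with bounding boxes for the entities it mentions; `example.jpg` is a placeholder path:

```python
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained(
    "microsoft/kosmos-2-patch14-224"
)

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text="<grounding>An image of", images=image,
                   return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids,
                                        skip_special_tokens=True)[0]
# Separate the caption from the grounded entities and their bounding boxes.
caption, entities = processor.post_process_generation(generated_text)
print(caption, entities)
```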

3. Vision Model Series

BEiT Series

  • BEiT: Generative self-supervised pre-training for vision (inference sketch after this list)
  • BEiT-2: BERT-style image Transformer pre-training
  • BEiT-3: General-purpose multimodal foundation model, a significant milestone in large-scale pre-training across tasks, languages, and modalities
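
For reference, a fine-tuned BEiT checkpoint can be used for image classification directly through transformers. A minimal sketch, assuming the `microsoft/beit-base-patch16-224` checkpoint (ImageNet-1k labels) and a placeholder image path:

```python
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification

# ImageNet-1k fine-tuned BEiT checkpoint from the HuggingFace hub.
processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```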

Document AI Models

  • DiT: Self-supervised pre-training for Document Image Transformer
  • TextDiffuser/TextDiffuser-2: Diffusion models as text painters
  • LayoutLM/LayoutLMv2/LayoutLMv3: Multimodal (text+layout+image) document foundation models
  • LayoutXLM: Multimodal foundation model for multilingual document AI
  • MarkupLM: Pre-training of markup language models for visually-rich document understanding
  • XDoc: Unified pre-training for cross-format document understanding
  • TrOCR: Transformer-based OCR with pre-trained models (inference sketch after this list)
  • LayoutReader: Text and layout pre-training for reading order detection
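
Among these, TrOCR has a particularly compact inference path in transformers: a vision encoder feeds a text decoder that emits the transcription. A minimal sketch, assuming the `microsoft/trocr-base-printed` checkpoint and a placeholder image of a single text line:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Placeholder path; TrOCR expects an image cropped to one text line.
image = Image.open("text_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```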

4. Speech Model Series

  • WavLM: Speech pre-training for full-stack speech tasks (feature-extraction sketch after this list)
  • VALL-E: Neural codec language model for TTS
  • UniSpeech: Unified pre-training for self-supervised and supervised learning of ASR
  • UniSpeech-SAT: Universal speech representation learning with speaker-aware pre-training
  • SpeechT5: Encoder-decoder pre-training for spoken language processing
  • SpeechLM: Enhanced speech pre-training with unpaired text data
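
As a concrete example of the speech stack, WavLM is commonly used as a feature extractor whose hidden states feed downstream heads (speaker verification, diarization, ASR). A sketch assuming the `microsoft/wavlm-base-plus` checkpoint, with dummy audio standing in for a real 16 kHz waveform:

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

# One second of dummy 16 kHz audio; real use would load a waveform file.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_frames, 768)
print(hidden.shape)
```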

5. Vision-Language Models

  • VLMo: Unified vision-language pre-training
  • VL-BEiT: Generative vision-language pre-training

Core Technical Features

1. Architecture Innovation

  • DeepNet: Scales Transformer depth to 1,000+ layers
  • Magneto: Truly general modeling architecture
  • BitNet: 1-bit Transformer architecture
  • RetNet: Retentive Network, a successor architecture to the Transformer (usage sketch after this list)
  • LongNet: Scales sequence length to 1 billion tokens
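
Several of these architectures are implemented in TorchScale. For instance, a RetNet decoder can be instantiated as below; the module paths are assumptions based on the TorchScale repository:

```python
from torchscale.architecture.config import RetNetConfig
from torchscale.architecture.retnet import RetNetDecoder

# RetNet replaces softmax attention with multi-scale retention,
# enabling parallel training and O(1)-per-token recurrent inference.
config = RetNetConfig(vocab_size=64000)
model = RetNetDecoder(config)
print(model)
```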

2. Training Efficiency Optimization

  • X-MoE: Scalable and finetunable sparse mixture-of-experts model
  • Aggressive Decoding: Lossless and efficient sequence-to-sequence decoding algorithm
  • Knowledge Distillation: Model compression and acceleration techniques (generic loss sketch below)
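
To illustrate the distillation item, below is a generic soft-target distillation loss in PyTorch. This is a minimal sketch of the classic Hinton-style objective, not the exact MiniLM or MiniLLM formulation (those distill self-attention relations and use a reverse-KL objective, respectively):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target knowledge distillation loss (Hinton-style sketch)."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); the t^2 factor keeps gradient magnitudes
    # comparable across temperature settings.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```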

3. Multilingual Support

  • Supports over 100 languages
  • Cross-lingual transfer learning (similarity sketch after this list)
  • Multilingual document understanding
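
A quick way to see cross-lingual representation sharing in practice is to embed parallel sentences in two languages and compare them. The sketch below assumes the `microsoft/infoxlm-base` checkpoint from the HuggingFace hub (InfoXLM shares the XLM-R vocabulary, so one tokenizer covers both languages); the mean-pooling step is a common convention, not a UniLM-prescribed recipe:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/infoxlm-base")
model = AutoModel.from_pretrained("microsoft/infoxlm-base")

# An English sentence and its German translation.
pair = ["The weather is nice today.", "Das Wetter ist heute schön."]
batch = tok(pair, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch).last_hidden_state

# Mean-pool over non-padding tokens and normalize.
mask = batch["attention_mask"].unsqueeze(-1)
emb = F.normalize((out * mask).sum(1) / mask.sum(1), dim=-1)
print((emb[0] @ emb[1]).item())  # higher = more similar across languages
```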

4. Multimodal Fusion

  • Unified modeling of text+image+layout
  • Vision-language understanding and generation
  • Speech-text cross-modal processing

Application Areas

1. Natural Language Processing

  • Language understanding and generation
  • Machine translation
  • Text classification and sentiment analysis
  • Question answering systems

2. Document AI

  • Document layout analysis
  • Form understanding
  • OCR text recognition
  • Document question answering

3. Computer Vision

  • Image classification
  • Object detection
  • Image generation
  • Visual question answering

4. Speech Processing

  • Speech recognition (ASR)
  • Speech synthesis (TTS)
  • Speech understanding
  • Multilingual speech processing

Technology Stack and Tools

Development Framework

  • Developed based on PyTorch
  • Integrated with HuggingFace Transformers
  • Supports distributed training (minimal DDP launch sketch below)
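
As an illustration of the distributed-training point, the standard PyTorch pattern used with models of this kind is DistributedDataParallel launched via `torchrun`. This is a self-contained sketch with a placeholder linear model rather than an actual UniLM checkpoint; it requires GPUs and the NCCL backend:

```python
# Launch: torchrun --nproc_per_node=4 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()

    model = torch.nn.Linear(768, 768).to(device)  # placeholder model
    model = DDP(model, device_ids=[device])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 768, device=device)  # dummy batch
    loss = model(x).pow(2).mean()
    loss.backward()  # gradients are all-reduced across ranks here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```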

Pre-training Data

  • Large-scale multilingual text data
  • Image-text paired data
  • Speech data
  • Document image data

Evaluation Benchmarks

  • GLUE and SuperGLUE language understanding benchmarks (metric-loading example after this list)
  • XTREME multilingual benchmark
  • VQA visual question answering benchmark
  • DocVQA document question answering benchmark
  • SUPERB speech benchmark
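
For instance, GLUE metrics can be computed with the HuggingFace evaluate library; a minimal example for the MRPC subset, using dummy predictions in place of real model outputs:

```python
# pip install evaluate scikit-learn
import evaluate

# Load the metric for the MRPC task of the GLUE benchmark suite.
metric = evaluate.load("glue", "mrpc")
result = metric.compute(predictions=[1, 0, 1], references=[1, 0, 0])
print(result)  # {'accuracy': ..., 'f1': ...}
```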

The UniLM project represents Microsoft's cutting-edge research in foundation models and general artificial intelligence. It provides powerful tools and infrastructure for both academia and industry, and it continues to advance the development and application of multimodal AI technology.
