An inference and training acceleration library providing hardware optimization tools for Transformers, Diffusers, TIMM, and Sentence Transformers.

Apache-2.0 · Python · 2.9k stars · huggingface · Last Updated: 2025-06-19

Hugging Face Optimum Project Detailed Introduction

Project Overview

🤗 Optimum is a machine learning model optimization library from Hugging Face that extends 🤗 Transformers and Diffusers. It provides tools for training and running models with maximum efficiency on specific target hardware, while remaining easy to use.

Project Address: https://github.com/huggingface/optimum

Core Features

1. Multi-Hardware Platform Support

Optimum supports various mainstream hardware acceleration platforms:

  • ONNX/ONNX Runtime - Cross-platform machine learning inference
  • ExecuTorch - PyTorch inference solution for edge devices
  • TensorFlow Lite - Optimization for mobile and edge devices
  • OpenVINO - Intel hardware optimization
  • NVIDIA TensorRT-LLM - NVIDIA GPU acceleration
  • AWS Trainium & Inferentia - AWS dedicated chips
  • Habana Gaudi - Habana processors
  • AMD Instinct GPUs - AMD hardware support
  • Intel Neural Compressor - Intel neural network compression
  • FuriosaAI - FuriosaAI hardware platform

2. Model Export and Optimization

  • Format Conversion: Exports Transformers and Diffusers models to formats such as ONNX, ExecuTorch, and TensorFlow Lite.
  • Graph Optimization: Automatically optimizes the model's computation graph (see the sketch after this list).
  • Quantization Techniques: Provides various quantization schemes to reduce model size and inference latency.
  • Performance Tuning: Optimizes performance for specific hardware.
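
As a concrete illustration, the graph-optimization step can be driven from Python through the ONNX Runtime backend. The snippet below is a minimal sketch that assumes the model has already been exported to ONNX (the export command appears later on this page); the ./bert-onnx/ and ./bert-onnx-optimized/ directory names are placeholders:

from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Load the directory containing the exported ONNX model
optimizer = ORTOptimizer.from_pretrained("./bert-onnx/")

# Level 2 enables extended operator-fusion graph optimizations
optimization_config = OptimizationConfig(optimization_level=2)

# Write the optimized graph to a new directory
optimizer.optimize(save_dir="./bert-onnx-optimized/", optimization_config=optimization_config)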

3. Training Acceleration

Provides optimized training wrappers, supporting:

  • Habana Gaudi processor training
  • AWS Trainium instance training
  • ONNX Runtime GPU optimized training

Installation

Basic Installation

python -m pip install optimum

Specific Accelerator Installation

Choose the corresponding installation command based on the required hardware platform:

# ONNX Runtime
pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]

# ExecuTorch
pip install --upgrade --upgrade-strategy eager optimum[executorch]

# Intel Neural Compressor
pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]

# OpenVINO
pip install --upgrade --upgrade-strategy eager optimum[openvino]

# NVIDIA TensorRT-LLM
docker run -it --gpus all --ipc host huggingface/optimum-nvidia

# AMD Hardware
pip install --upgrade --upgrade-strategy eager optimum[amd]

# AWS Trainium & Inferentia
pip install --upgrade --upgrade-strategy eager optimum[neuronx]

# Habana Gaudi
pip install --upgrade --upgrade-strategy eager optimum[habana]

# FuriosaAI
pip install --upgrade --upgrade-strategy eager optimum[furiosa]

Installation from Source

python -m pip install git+https://github.com/huggingface/optimum.git

Main Function Modules

1. Model Export

ONNX Export Example:

# Install dependencies
pip install optimum[exporters,onnxruntime]

# Export model
optimum-cli export onnx --model bert-base-uncased ./bert-onnx/
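
The same export can also be triggered from Python instead of the CLI. A minimal sketch using optimum.exporters.onnx.main_export, mirroring the command above (the output path is the same placeholder directory):

from optimum.exporters.onnx import main_export

# Programmatic equivalent of the CLI call above; the task is inferred
# from the model when not given explicitly
main_export("bert-base-uncased", output="./bert-onnx/")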

ExecuTorch Export:

# Install dependencies
pip install optimum[executorch]

# Export model for edge devices
optimum-cli export executorch --model distilbert-base-uncased --recipe xnnpack --output_dir ./distilbert-executorch/

TensorFlow Lite Export:

# Install dependencies
pip install optimum[exporters-tf]

# Export model (TFLite requires a static sequence length)
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 ./bert-tflite/

2. Inference Optimization

Using ONNX Runtime for optimized inference:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the optimized model
model = ORTModelForSequenceClassification.from_pretrained("./bert-onnx/")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Perform inference
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
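
The resulting ORT model is also a drop-in replacement inside the Transformers pipeline API. Continuing from the snippet above (model and tokenizer as defined there), a short usage sketch:

from transformers import pipeline

# The ONNX Runtime model plugs into the standard pipeline interface
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Hello world!"))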

3. Quantization Techniques

Supports various quantization schemes (a dynamic-quantization sketch follows the list):

  • Dynamic Quantization - Runtime quantization
  • Static Quantization - Quantization based on calibration data
  • QAT (Quantization Aware Training) - Quantization-aware training
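
For example, dynamic quantization through the ONNX Runtime backend can look like the minimal sketch below. It assumes the model was already exported to the ./bert-onnx/ placeholder directory used earlier; the AVX512-VNNI configuration is just one possible target:

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Point the quantizer at the exported ONNX model directory
quantizer = ORTQuantizer.from_pretrained("./bert-onnx/")

# Dynamic quantization: weights are quantized ahead of time,
# activations are quantized on the fly at inference time
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer.quantize(save_dir="./bert-onnx-quantized/", quantization_config=dqconfig)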

4. Training Optimization

Using Habana Gaudi for optimized training:

from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# Configure training parameters
training_args = GaudiTrainingArguments(
    output_dir="./results",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/bert-base-uncased"
)

# Create optimized trainer
trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()

Key Advantages

1. Ease of Use

  • Unified Interface: Consistent API design with the Transformers library
  • Command-Line Tool: Provides the optimum-cli command-line tool to simplify operations
  • Automatic Optimization: Intelligently selects the optimal optimization strategy

2. Performance Improvement

  • Inference Acceleration: Significantly improves model inference speed
  • Memory Optimization: Reduces memory footprint
  • Energy Efficiency: Lowers energy consumption for inference and training

3. Production Ready

  • Stability: Extensively tested and validated
  • Scalability: Supports large-scale deployment
  • Compatibility: Seamlessly integrates with the existing Hugging Face ecosystem

Application Scenarios

1. Edge Device Deployment

  • Mobile AI applications
  • IoT device intelligence
  • Embedded system optimization

2. Cloud Service Optimization

  • Large-scale API services
  • Batch inference tasks
  • Real-time response systems

3. Dedicated Hardware Acceleration

  • GPU cluster optimization
  • TPU acceleration
  • Dedicated AI chip adaptation

Community Ecosystem

Related Projects

  • optimum-intel - Intel hardware-specific optimization
  • optimum-habana - Habana Gaudi processor support
  • optimum-neuron - AWS Neuron chip support
  • optimum-nvidia - NVIDIA hardware optimization
  • optimum-benchmark - Performance benchmarking tool
  • optimum-quanto - PyTorch quantization backend
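
Unlike the hardware-specific backends, optimum-quanto quantizes plain PyTorch models directly. A minimal sketch, assuming optimum-quanto is installed (pip install optimum-quanto) and using distilbert-base-uncased purely as an example checkpoint:

from transformers import AutoModelForSequenceClassification
from optimum.quanto import quantize, freeze, qint8

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Replace the weights with int8 quantized versions, then materialize them
quantize(model, weights=qint8)
freeze(model)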

Technical Architecture

Core Components

  1. Exporters - Responsible for model format conversion (see the sketch after this list)
  2. Optimizers - Execute various optimization strategies
  3. Quantizers - Implement model quantization
  4. Runtimes - Provide optimized inference runtimes
  5. Trainers - Hardware-optimized training wrappers
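
In practice these components compose behind a single call. For instance, the ONNX Runtime backend can export a checkpoint on the fly (Exporters) and wrap it in an inference session (Runtimes); a minimal sketch, with the output directory as a placeholder:

from optimum.onnxruntime import ORTModelForSequenceClassification

# export=True converts the Transformers checkpoint to ONNX on load,
# then wraps it in an ONNX Runtime inference session
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
)

# Save the converted model so later loads skip the export step
model.save_pretrained("./sst2-onnx/")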

Design Principles

  • Modularization - Each functional module is independent and composable
  • Extensibility - Easy to add new hardware support
  • Backward Compatibility - Maintain compatibility with existing APIs
  • Performance Priority - Performance optimization as the core goal

Summary

Hugging Face Optimum is a powerful and easy-to-use machine learning model optimization toolkit. It provides developers with a complete solution for efficiently deploying AI models to various hardware platforms, making it an important tool for modern AI application development and deployment. Whether it's edge device deployment or large-scale cloud services, Optimum can provide significant performance improvements and cost optimization.