mistralai/mistral-inference

Official inference library for Mistral models, containing the minimal code implementation to run Mistral AI models.

License: Apache-2.0 · Language: Jupyter Notebook · Stars: 10.3k · Owner: mistralai · Last Updated: 2025-03-20
https://github.com/mistralai/mistral-inference

Mistral Inference Library (mistral-inference) Detailed Introduction

Project Overview

mistral-inference is the official inference library for Mistral models, developed by Mistral AI. It provides a minimal code implementation for running the various Mistral models, giving users an efficient and concise way to deploy and use the Mistral family of large language models.

Supported Model Series

Base Models

  • Mistral 7B: Base and Instruct versions; supports function calling
  • Mixtral 8x7B: Mixture-of-Experts model for high-performance inference
  • Mixtral 8x22B: Larger Mixture-of-Experts model
  • Mistral Nemo 12B: Medium-sized efficient model
  • Mistral Large 2: Latest large-scale model
  • Mistral Small 3.1 24B: Medium-sized model with multimodal support

Specialized Models

  • Codestral 22B: Specifically designed for code generation and programming tasks
  • Codestral Mamba 7B: Code model based on the Mamba architecture
  • Mathstral 7B: Specifically designed for mathematical reasoning
  • Pixtral 12B: Multimodal visual language model

Core Features and Capabilities

1. Multiple Inference Modes

  • Command-Line Interface (CLI): Quickly test and interact using the mistral-demo and mistral-chat commands.
  • Python API: Complete programming interface, supports custom integration.
  • Multi-GPU Support: Supports distributed inference of large models via torchrun.

2. Rich Application Scenarios

Instruction Following

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Load the model and tokenizer
tokenizer = MistralTokenizer.from_file("./mistral-nemo-instruct-v0.1/tekken.json")
model = Transformer.from_folder("./mistral-nemo-instruct-v0.1")

# Generate a response
prompt = "How expensive would it be to ask a window cleaner to clean all windows in Paris?"
completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])
tokens = tokenizer.encode_chat_completion(completion_request).tokens
out_tokens, _ = generate([tokens], model, max_tokens=1024, temperature=0.35,
                         eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)

# Decode the generated tokens back into text
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)

Multimodal Inference

Supports joint reasoning over images and text, enabling the model to analyze image content and answer related questions:

# Multimodal content processing (the chunk types come from mistral_common)
from mistral_common.protocol.instruct.messages import ImageURLChunk, TextChunk

user_content = [ImageURLChunk(image_url=url), TextChunk(text=prompt)]
tokens, images = tokenizer.instruct_tokenizer.encode_user_content(user_content, False)
out_tokens, _ = generate([tokens], model, images=[images], max_tokens=256)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

Function Calling

All models support function calling, enabling integration with external tools and APIs:

# Tool and Function classes come from mistral_common
from mistral_common.protocol.instruct.tool_calls import Function, Tool

# Define tool functions
tools = [Tool(function=Function(
    name="get_current_weather",
    description="Get the current weather",
    parameters={...},  # JSON Schema describing the function arguments
))]

# Build a chat request that exposes the tools to the model
completion_request = ChatCompletionRequest(tools=tools, messages=[...])
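Running such a request follows the same pattern as the instruction-following example above. The sketch below assumes the model and tokenizer loaded there, and that the tool schema and message placeholders have been filled in:

# Encode the chat request (including tool definitions) and run generation
tokens = tokenizer.encode_chat_completion(completion_request).tokens
out_tokens, _ = generate([tokens], model, max_tokens=256, temperature=0.0,
                         eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)

# The reply may contain an encoded tool call rather than plain text
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)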

Fill-in-the-Middle (FIM)

Designed specifically for code-editing scenarios, FIM generates the code that fills the gap between a given prefix and suffix:

from mistral_common.protocol.fim.request import FIMRequest

prefix = "def add("
suffix = " return sum"
request = FIMRequest(prompt=prefix, suffix=suffix)
tokens = tokenizer.encode_fim(request).tokens
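To produce the missing middle section, the encoded tokens are passed to generate as in the other examples. This sketch assumes a Codestral model and its tokenizer have been loaded as shown earlier:

# Generate the middle of the function and decode it (greedy decoding suits code)
out_tokens, _ = generate([tokens], model, max_tokens=256, temperature=0.0,
                         eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
middle = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(prefix + middle + "\n" + suffix)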

3. Flexible Deployment Options

Local Deployment

  • Single-GPU Deployment: Suitable for smaller models (7B, 12B).
  • Multi-GPU Deployment: Supports distributed inference of large models (8x7B, 8x22B).
  • Docker Containerization: Provides Docker images integrated with vLLM.

Cloud Deployment

  • Mistral AI Official API: the La Plateforme hosted service (a brief client sketch follows this list).
  • Cloud Service Providers: available through multiple major cloud platforms.
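For the hosted route, a minimal sketch using the official mistralai Python client (a separate package from mistral-inference) is shown below; it assumes the mistralai SDK v1.x is installed and that MISTRAL_API_KEY is set in the environment:

import os
from mistralai import Mistral

# Assumes the MISTRAL_API_KEY environment variable holds a La Plateforme API key
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": "Summarize what mistral-inference does."}],
)
print(response.choices[0].message.content)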

Installation and Configuration

System Requirements

  • GPU Support: Requires a GPU environment due to the dependency on the xformers library (a quick check is sketched after this list).
  • Python Environment: Supports modern Python versions.
  • Storage Space: Requires sufficient disk space depending on the model size.
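As a quick sanity check of the GPU requirement, the snippet below (a minimal sketch assuming PyTorch is already installed) verifies that CUDA is visible before installing the library:

import torch

# mistral-inference depends on GPU kernels (via xformers), so CUDA must be available
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA-capable GPU detected; mistral-inference will not run here.")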

Installation Methods

Install via pip

pip install mistral-inference

Install from Source

cd $HOME && git clone https://github.com/mistralai/mistral-inference
cd $HOME/mistral-inference && poetry install

Model Download and Configuration

# Create a model storage directory
export MISTRAL_MODEL=$HOME/mistral_models
mkdir -p $MISTRAL_MODEL

# Download the model (using Mistral Nemo as an example)
# Shell variable names cannot start with a digit, so M12B_DIR is used here
export M12B_DIR=$MISTRAL_MODEL/12B_Nemo
wget https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar
mkdir -p $M12B_DIR
tar -xf mistral-nemo-instruct-2407.tar -C $M12B_DIR
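Alternatively, the weights can be fetched from the Hugging Face Hub. The sketch below assumes the huggingface_hub package is installed and uses the mistralai/Mistral-Nemo-Instruct-2407 repository as an example:

from pathlib import Path
from huggingface_hub import snapshot_download

# Download only the files mistral-inference needs into the local model directory
model_path = Path.home() / "mistral_models" / "12B_Nemo"
model_path.mkdir(parents=True, exist_ok=True)
snapshot_download(
    repo_id="mistralai/Mistral-Nemo-Instruct-2407",
    allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
    local_dir=model_path,
)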

Usage Examples

Basic Chat Interaction

# Single-GPU model
mistral-chat $M12B_DIR --instruct --max_tokens 1024 --temperature 0.35

# Multi-GPU large model
torchrun --nproc-per-node 2 --no-python mistral-chat $M8x7B_DIR --instruct

Specialized Model Usage

Codestral Code Assistant

mistral-chat $M22B_CODESTRAL --instruct --max_tokens 256

Can handle programming requests such as "Write me a function that computes fibonacci in Rust".

Mathstral Mathematical Reasoning

mistral-chat $MATHSTRAL_7B_DIR --instruct --max_tokens 256

Capable of solving complex mathematical calculation problems.

License and Compliance

Licenses for Different Models

  • Open Source Models: Most base models use open-source licenses.
  • Codestral 22B: Uses the Mistral AI Non-Production License (MNPL), which limits use to non-production (non-commercial) purposes.
  • Mistral Large: Uses the Mistral AI Research License (MRL), intended primarily for research purposes.

Compliance Recommendations

When using in a commercial environment, carefully review the license terms of the corresponding model to ensure compliant use.

Technical Advantages

Performance Optimization

  • Efficient Inference: Specifically optimized for the Mistral model architecture.
  • Memory Management: Intelligent memory usage strategies, supports large model inference.
  • Parallel Processing: Supports multi-GPU parallelism, improving inference speed.

Ease of Use

  • Concise API: Provides a simple and intuitive Python interface.
  • Rich Documentation: Comprehensive usage examples and documentation support.
  • Community Support: Active developer community and Discord channel.

Extensibility

  • Modular Design: Easy to integrate into existing projects.
  • Custom Configuration: Supports flexible configuration of various inference parameters.
  • Tool Integration: Supports integration with various external tools and services.

Application Scenarios

Enterprise Applications

  • Intelligent Customer Service: Building high-quality dialogue systems.
  • Content Generation: Automating content creation and editing.
  • Code Assistance: Code generation and review in development environments.

Research and Development

  • Academic Research: Language model research and experimentation.
  • Prototype Development: Rapidly building AI application prototypes.
  • Performance Testing: Model performance evaluation and comparison.

Personal and Educational

  • Learning Assistant: Personalized learning and tutoring tools.
  • Creative Writing: Assisting in creative content creation.
  • Technical Learning: Learning support for programming and technical concepts.

Summary

The Mistral Inference Library is a powerful and easy-to-use inference framework for large language models. It not only provides complete support for the Mistral model series but also includes rich features, from basic text generation to advanced multimodal inference and function calling. Whether for enterprise-level deployment or personal research use, this library can provide efficient and reliable solutions.