huggingface/huggingface-gemma-recipes

Hugging Face's official Gemma model quick start tutorial repository, providing various practical scripts and notebooks for inference, fine-tuning, and more.

License: MIT. Language: Python. Last Updated: 2025-06-26

Hugging Face Gemma Recipes: Detailed Project Introduction

Project Overview

huggingface-gemma-recipes is an open-source project maintained by Hugging Face that provides minimal example code and tutorials for the Google Gemma family of models. Its core goal is to help developers get started quickly with Gemma inference, fine-tuning, and practical applications.

Project Features

🚀 Quick Start

Provides minimal code examples to lower the barrier to entry.
Supports multi-modal input processing (text, image, audio).
Integrates the latest Transformers library features.

🎯 Multi-Modal Support

This project supports the multi-modal capabilities of the Gemma 3n series models:

Pure Text Processing: Traditional text generation and question answering.
Image Understanding: Image captioning, visual question answering.
Audio Processing: Speech-to-text, audio analysis.
Multi-Modal Interaction: Mixed input of text, images, and audio.

Core Functionality

1. Model Inference

The project provides a unified model inference interface, supporting quick loading and use of Gemma models:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Use a GPU when one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/gemma-3n-e4b-it"  # or google/gemma-3n-e2b-it
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)

def model_generation(model, messages):
    # Apply the chat template and tokenize in one step; `processor` is the
    # module-level object loaded above.
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    # Remember the prompt length so the prompt can be stripped from the output.
    input_len = inputs["input_ids"].shape[-1]
    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        # Keep only the newly generated tokens.
        generation = generation[:, input_len:]
        decoded = processor.batch_decode(generation, skip_special_tokens=True)
        print(decoded[0])
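
The slice generation[:, input_len:] in the function above is needed because generate returns each sequence as the prompt tokens followed by the newly generated ones. The sketch below illustrates the same idea with plain Python lists standing in for tensors; strip_prompt and the token ids are hypothetical and are not part of the repository.

```python
def strip_prompt(sequences, input_len):
    """Drop the first `input_len` token ids from every sequence in the batch."""
    return [seq[input_len:] for seq in sequences]


# Made-up token ids, purely for illustration.
prompt_ids = [101, 102, 103]
generated = [prompt_ids + [7, 8, 9]]  # batch of one: prompt followed by new tokens

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [[7, 8, 9]]
```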

2. Usage Examples

Pure Text Processing

# Text Question Answering
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ]
    }
]
model_generation(model, messages)

Audio Processing

# Speech-to-Text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ]
    }
]
model_generation(model, messages)

Image Understanding

# Image Captioning
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
model_generation(model, messages)
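
The three examples above share one message structure: a single user turn whose content is a list of typed parts. As a minimal sketch, that structure could be wrapped in a small helper that also covers the mixed text, image, and audio case. build_user_message is a hypothetical name, not something the repository provides, and placing media parts before the text is an assumption rather than a documented requirement.

```python
from typing import Optional


def build_user_message(
    text: str,
    image: Optional[str] = None,
    audio: Optional[str] = None,
) -> list:
    """Build a single-turn message list in the format used by the examples.

    Hypothetical helper for illustration; not part of the repository.
    """
    content = []
    # Assumption: media parts are placed before the text prompt.
    if image is not None:
        content.append({"type": "image", "image": image})
    if audio is not None:
        content.append({"type": "audio", "audio": audio})
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


messages = build_user_message(
    "Describe this image.",
    image="https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg",
)
# model_generation(model, messages)  # requires the model loaded in section 1
```

Passing both image and audio produces a mixed-input message; whether a given checkpoint handles that combination well is not something this sketch verifies.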

3. Model Fine-tuning

The project provides various fine-tuning solutions and scripts:

Fine-tuning Resources

[Fine tuning Gemma 3n on T4]: Fine-tuning tutorial specifically for T4 GPUs.
[Fine tuning Gemma 3n on images]: Fine-tuning script for image understanding tasks.
[Fine tuning Gemma 3n on audio]: Fine-tuning script for audio processing tasks.
[Fine tuning Gemma 3n on images using TRL]: Image fine-tuning solution based on the TRL library.

Fine-tuning Environment Configuration

# Install dependencies
$ pip install -U -q -r requirements.txt

Installation and Usage

System Requirements

Python 3.8+
PyTorch 2.0+
CUDA-enabled GPU (recommended)
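
Before installing anything, it can help to check the current environment against this list. The following is a minimal sketch using only the standard library; check_environment is a hypothetical helper, and it only tests whether each package is importable, not whether its version meets the requirement (for example PyTorch 2.0+).

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8), packages=("torch", "transformers", "timm")):
    """Return a dict reporting whether each requirement appears to be met.

    Hypothetical helper: checks the Python version and package presence only.
    """
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```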

Quick Installation

# Install core dependencies
$ pip install -U -q transformers timm

# Install complete dependencies (for fine-tuning)
$ pip install -U -q -r requirements.txt

Basic Usage Flow

Clone the project repository.
Install the dependency packages.
Select a suitable Gemma model.
Choose an inference or fine-tuning script according to your needs.
Execute the corresponding code.

Project Structure

huggingface-gemma-recipes/
├── notebooks/                 # Jupyter notebook tutorials
│   └── fine_tune_gemma3n_on_t4.ipynb
├── scripts/                   # Fine-tuning scripts
│   ├── ft_gemma3n_image_vt.py
│   ├── ft_gemma3n_audio_vt.py
│   └── ft_gemma3n_image_trl.py
├── requirements.txt           # Dependency list
└── README.md                 # Project description

Technical Advantages

1. Ease of Use

Minimal code examples for a quick start.
Unified interface design to shorten the learning curve.
Complete documentation and examples.

2. Flexibility

Supports multi-modal input processing.
Provides various fine-tuning strategies.
Compatible with different hardware configurations.

3. Practicality

Based on the official Transformers library.
Integrates the latest model optimization techniques.
Provides production-grade code quality.

Applicable Scenarios

Research and Development

Multi-modal AI research.
Model performance evaluation.
New application scenario exploration.

Commercial Applications

Intelligent customer service systems.
Content creation tools.
Multimedia analysis platforms.

Education and Training

AI course teaching.
Model fine-tuning practice.
Technical concept validation.

Community and Support

As an open-source project maintained by Hugging Face, it offers the following advantages:

Active community support.
Regular updates and maintenance.
Synchronization with the latest model versions.
Rich documentation and examples.

Summary

huggingface-gemma-recipes provides a compact starting point for working with Gemma models. Beginners and experienced developers alike can find suitable resources and guidance, and its multi-modal examples and fine-tuning scripts make it a practical reference for Gemma-based development.