
Horovod: Distributed Deep Learning Framework

Introduction

Horovod is an open-source distributed deep learning training framework originally developed by Uber and now hosted by the LF AI & Data Foundation. Its goal is to make distributed deep learning fast and easy to use. Horovod supports popular deep learning frameworks such as TensorFlow, Keras, PyTorch, and Apache MXNet.

Core Features

  • Easy to Use: Horovod provides a simple API that makes it easy to convert single-machine training code into distributed training code (see the Keras sketch after this list).
  • High Performance: Horovod uses efficient communication mechanisms (e.g., MPI and NCCL) to achieve fast distributed training.
  • Scalability: Horovod can scale to hundreds of GPU or CPU nodes to handle large-scale deep learning models and datasets.
  • Flexibility: Horovod supports multiple deep learning frameworks and can be integrated with various hardware platforms.
  • Open Source: Horovod is an open-source project with active community support.
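
As a rough illustration of how small the change can be, here is a minimal sketch of converting a plain Keras training script to Horovod. It is a sketch under assumptions, not an official recipe: it uses the horovod.tensorflow.keras module (hvd.init, hvd.size, hvd.DistributedOptimizer, hvd.callbacks.BroadcastGlobalVariablesCallback), and the tiny model and MNIST pipeline are purely illustrative.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod: one process is started per GPU/CPU slot.
hvd.init()

# Illustrative single-machine pieces: any Keras model and tf.data pipeline work here.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,))])
(x, y), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (x.reshape(-1, 784).astype('float32') / 255.0, y)).shuffle(10000).batch(128)

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(dataset,
          epochs=1,
          # Broadcast initial weights from rank 0 so every worker starts identically.
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          # Print progress only on rank 0 to avoid duplicated logs.
          verbose=1 if hvd.rank() == 0 else 0)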

Key Advantages

  • Faster Training Speed: By training models in parallel on multiple GPU or CPU nodes, Horovod can significantly reduce training time.
  • Larger Effective Scale: Horovod uses data parallelism, so each worker holds a full replica of the model and processes its own share of the data; adding workers increases the effective batch size and the amount of data processed per step.
  • Higher Data Throughput: Horovod can handle larger datasets because it can load and process data in parallel on multiple nodes.
  • Better Resource Utilization: Horovod can utilize computing resources more efficiently because it can distribute workloads across multiple nodes.

Supported Frameworks

Horovod primarily supports the following deep learning frameworks:

  • TensorFlow: A popular deep learning framework developed by Google.
  • Keras: A high-level neural network API; Horovod supports Keras running on the TensorFlow backend (both tf.keras and standalone Keras).
  • PyTorch: Another popular deep learning framework, developed by Facebook (Meta); see the minimal PyTorch sketch after this list.
  • Apache MXNet: A flexible and efficient deep learning framework developed under the Apache Software Foundation.
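
The same pattern applies across frameworks. Below is a minimal sketch using the horovod.torch API (hvd.init, hvd.local_rank, hvd.DistributedOptimizer, hvd.broadcast_parameters, hvd.broadcast_optimizer_state); the tiny model and random data are purely illustrative.

import torch
import horovod.torch as hvd

# Initialize Horovod and pin each process to one GPU if GPUs are available.
hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 2)
if torch.cuda.is_available():
    model.cuda()

# Scale the learning rate by the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers with allreduce.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state from rank 0.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    inputs = torch.randn(32, 10)           # illustrative random batch
    targets = torch.randint(0, 2, (32,))
    if torch.cuda.is_available():
        inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if step % 10 == 0 and hvd.rank() == 0:
        print('step %d, loss %.4f' % (step, loss.item()))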

Communication Mechanisms

Horovod supports the following communication mechanisms; examples of selecting one at launch time follow the list:

  • MPI (Message Passing Interface): A standard protocol for communication between multiple nodes. Horovod uses MPI to coordinate the distributed training process.
  • NCCL (NVIDIA Collective Communications Library): A library developed by NVIDIA for high-performance communication between GPUs. Horovod uses NCCL to accelerate distributed training on GPUs.
  • Gloo: An open-source collective communications library developed by Facebook that supports multiple hardware platforms; Horovod bundles it, so no separate MPI installation is required.
  • TCP/IP: When no high-speed interconnect (e.g., RDMA over InfiniBand) is available, the MPI and Gloo backends fall back to plain TCP/IP transport, whose performance is generally lower than RDMA-capable MPI or NCCL.
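
As an illustration (the exact flags depend on the Horovod version and how it was built, and train.py stands in for your own script), the horovodrun launcher lets you force a particular controller at launch time:

# Run 4 worker processes using the Gloo controller (no MPI installation needed)
horovodrun -np 4 --gloo python train.py

# Run 4 worker processes using MPI (requires an MPI implementation such as Open MPI)
horovodrun -np 4 --mpi python train.py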

Installation

The installation process for Horovod depends on the deep learning framework and communication mechanism you are using. Typically, you need to install MPI or NCCL first, and then install Horovod using pip.

For example, to install Horovod with TensorFlow support and NCCL-based GPU operations (CUDA and NCCL must already be installed), you can run the following command:

HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow]
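
A few more illustrative variants; build-time environment variables such as HOROVOD_GPU_OPERATIONS and HOROVOD_WITH_GLOO control which backends are compiled in, and the options supported by your version are listed in the official documentation:

# PyTorch support with NCCL-based GPU operations
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]

# CPU-only build that uses the bundled Gloo controller instead of MPI
HOROVOD_WITH_GLOO=1 pip install horovod[tensorflow]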

Please refer to the Horovod official documentation for more detailed installation instructions: https://github.com/horovod/horovod

Usage

Using Horovod for distributed training typically involves the following steps:

  1. Initialize Horovod: At the beginning of the training script, import the framework-specific module (e.g., import horovod.tensorflow as hvd) and call hvd.init().
  2. Pin GPU (Optional): To improve performance, pin each process to a single GPU, typically the one whose index equals hvd.local_rank().
  3. Scale Learning Rate: Since the effective batch size grows with the number of workers, scale the learning rate accordingly (a common starting point is multiplying it by hvd.size()).
  4. Wrap the Optimizer: Wrap the original optimizer with hvd.DistributedOptimizer; in a custom GradientTape training loop, wrap the tape with hvd.DistributedGradientTape instead.
  5. Broadcast Initial State: Broadcast the initial model (and optimizer) state from rank 0 to all other ranks so every worker starts from the same weights.
  6. Save Checkpoints Only on Rank 0: To avoid workers overwriting each other's files, save checkpoints only on rank 0.

Here is a simple example of using Horovod for TensorFlow distributed training:

import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize Horovod
hvd.init()

# 2. Pin GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# 3. Load Dataset
(mnist_images, mnist_labels), _ = tf.keras.datasets.mnist.load_data()

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., None] / 255.0, tf.float32),
     tf.cast(mnist_labels, tf.int64)))
dataset = dataset.repeat().shuffle(10000).batch(128)

# 4. Build Model
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# 5. Define Optimizer
opt = tf.keras.optimizers.Adam(0.001 * hvd.size()) # Scale learning rate

# 6. Gradient averaging
# In this custom training loop, gradients are averaged across workers by
# hvd.DistributedGradientTape in the training step below. (With Keras
# model.fit you would wrap the optimizer with hvd.DistributedOptimizer
# instead.)

# 7. Define Loss Function and Metric
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
metric = tf.keras.metrics.SparseCategoricalAccuracy()

# 8. Define Training Step
@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        probs = model(images, training=True)
        loss = loss_fn(labels, probs)

    tape = hvd.DistributedGradientTape(tape)
    gradients = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(gradients, model.trainable_variables))
    metric.update_state(labels, probs)
    return loss

# 9. Broadcast Initial Variables
# Broadcasting is a collective operation, so every rank must call it. It is
# performed after the first training step, once the model and optimizer have
# created their variables, with rank 0 as the source, so that all workers
# continue from the same state.

# 10. Training Loop
for batch, (images, labels) in enumerate(dataset.take(10000 // hvd.size())):
    loss = train_step(images, labels)

    if batch == 0:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)

    if batch % 10 == 0 and hvd.rank() == 0:
        print('batch: %d, loss: %.4f, accuracy: %.2f' % (batch, loss, metric.result()))
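
To actually run such a script on multiple processes, use a distributed launcher rather than plain python. As an illustration (assuming the script is saved as train.py and the hostnames and slot counts are placeholders):

# 4 processes on the local machine, e.g. one per GPU
horovodrun -np 4 python train.py

# 8 processes across two hosts with 4 slots each
horovodrun -np 8 -H server1:4,server2:4 python train.py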

Best Practices

  • Choose the Right Communication Mechanism: Choose the appropriate communication mechanism (MPI, NCCL, Gloo) based on your hardware platform and network environment.
  • Adjust Learning Rate: Adjust the learning rate based on the number of nodes to achieve the best training results.
  • Monitor Training Process: Use TensorBoard or other tools to monitor the training process to detect and resolve issues.
  • Use Data Parallelism: Horovod implements data parallelism: the dataset is split across workers while each worker trains a full copy of the model, as in the sketch after this list.
  • Avoid Data Skew: Ensure that the dataset is evenly distributed across all workers; skewed shards leave some workers idle at synchronization points and reduce training efficiency.
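
One common way to give each worker its own, evenly sized share of the data is to shard a tf.data pipeline by rank; a minimal sketch (the synthetic data and batch size are illustrative):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Build the full dataset, then keep only every size()-th example for this rank.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(100000))
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())
dataset = dataset.shuffle(10000).batch(128)

Sharding before shuffling gives each worker a disjoint, similarly sized subset; an alternative is to let every worker shuffle the full dataset with a rank-dependent random seed.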

Summary

Horovod is a powerful distributed deep learning framework that can help you train large-scale deep learning models more easily and quickly. By leveraging multiple GPU or CPU nodes, Horovod can significantly reduce training time and scale training to larger datasets.

Resources

For full and up-to-date details, refer to the official repository: https://github.com/horovod/horovod