Hugging Face Accelerate
8,270 lines. Hugging Face Accelerate came to play.
Hugging Face Accelerate is a game-changer for PyTorch developers, offering a seamless interface to launch and scale training on diverse distributed setups. With features like Big Model Inference, it simplifies the process of working with massive models, making it easier to leverage GPU and TPU hardware.
Hugging Face Accelerate's llms.txt Insights
Overachiever
209 sections. Most sites can barely manage 3. This one went all in.
War and Peace vibes
8,270 lines. They really wanted AI to understand them.
What's inside Hugging Face Accelerate's llms.txt
Hugging Face Accelerate's llms.txt opens with these sections:
- Quicktour
- Unified launch interface
- Adapt training code
How does Hugging Face Accelerate's llms.txt compare?
| | Hugging Face Accelerate | Directory Avg | Top Performer |
|---|---|---|---|
| Lines | 8,270 | 1,029 | 163,447 |
| Sections | 209 | 17 | 3,207 |
Hugging Face Accelerate's llms.txt preview
First 100 of 8,270 lines
# Quicktour
There are many ways to launch and run your code depending on your training environment ([torchrun](https://pytorch.org/docs/stable/elastic/run.html), [DeepSpeed](https://www.deepspeed.ai/), etc.) and available hardware. Accelerate offers a unified interface for launching and training on different distributed setups, letting you focus on your PyTorch training code instead of the intricacies of adapting it to each setup. This makes it easy to scale your PyTorch code for training and inference on distributed hardware like GPUs and TPUs. Accelerate also provides Big Model Inference to make loading and running inference with very large models, which often don't fit in memory, more accessible.
This quicktour introduces the three main features of Accelerate:
* a unified command line launching interface for distributed training scripts
* a training library for adapting PyTorch training code to run on different distributed setups
* Big Model Inference
## Unified launch interface
Accelerate automatically selects the appropriate configuration values for any given distributed training framework (DeepSpeed, FSDP, etc.) through a unified configuration file generated from the [`accelerate config`](package_reference/cli#accelerate-config) command. You can also pass configuration values explicitly on the command line, which is helpful in certain situations, such as when you're running on SLURM.
In most cases, you should run [`accelerate config`](package_reference/cli#accelerate-config) first so Accelerate can learn about your training setup.
```bash
accelerate config
```
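If you'd rather skip the config file (on SLURM, for example), the same values can be passed as flags to `accelerate launch`. A sketch of the pattern, where the flag values and script name are placeholders:
```bash
# Pass configuration explicitly instead of relying on a saved config file.
accelerate launch --multi_gpu --num_processes 8 --mixed_precision fp16 train.py
```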
The [`accelerate config`](package_reference/cli#accelerate-config) command creates and saves a default_config.yaml file in Accelerate's cache folder. This file stores the configuration for your training environment, which helps Accelerate correctly launch your training script based on your machine.
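For a rough idea of what that file contains, a single-machine, two-GPU default_config.yaml might look something like this; the values are illustrative, so run `accelerate config` to generate the real one for your machine:
```yaml
# Illustrative values only - generate your own with `accelerate config`
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_machines: 1
num_processes: 2
```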
After you've configured your environment, you can test your setup with [`accelerate test`](package_reference/cli#accelerate-test), which launches a short script to test the distributed environment.
```bash
accelerate test
```
> [!TIP]
> Add `--config_file` to the `accelerate test` or `accelerate launch` command to specify the location of the configuration file if it is saved in a non-default location like the cache.
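For example, with a config saved in a custom location (the path below is a placeholder):
```bash
accelerate test --config_file path/to/config.yaml
accelerate launch --config_file path/to/config.yaml path_to_script.py
```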
Once your environment is set up, launch your training script with [`accelerate launch`](package_reference/cli#accelerate-launch)!
```bash
accelerate launch path_to_script.py --args_for_the_script
```
Check out the [Launch distributed code](basic_tutorials/launch) tutorial for more information about launching your scripts.
We also have a [configuration zoo](https://github.com/huggingface/accelerate/blob/main/examples/config_yaml_templates) which showcases a number of premade **minimal** example configurations for a variety of setups you can run.
## Adapt training code
The next main feature of Accelerate is the `Accelerator` class which adapts your PyTorch code to run on different distributed setups.
You only need to add a few lines of code to your training script to enable it to run on multiple GPUs or TPUs.
```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()
+ device = accelerator.device
+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
-   inputs = inputs.to(device)
-   targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
-   loss.backward()
+   accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
```
1. Import and instantiate the `Accelerator` class at the beginning of your training script. The `Accelerator` class initializes everything necessary for distributed training, and it automatically detects your training environment (a single machine with a GPU, a machine with several GPUs, several machines with multiple GPUs or a TPU, etc.) based on how the code was launched.
```python
from accelerate import Accelerator
accelerator = Accelerator()
```
2. Remove calls like `.cuda()` on your model and input data. The `Accelerator` class automatically places these objects on the appropriate device for you.
> [!WARNING]
> This step is *optional* but it is considered best practice to allow Accelerate to handle device placement. You could also deactivate automatic device placement by passing `device_placement=False` when initializing the `Accelerator`. If you want to explicitly place objects on a device with `.to(device)`, make sure you use `accelerator.device` instead. For example, if you create an optimizer before placing a model on `accelerator.device`, training fails on a TPU.
> [!WARNING]
> Accelerate does not use non-blocking transfers by default for its automatic device placement, which can result in potentially unwanted CUDA synchronizations. You can enable non-blocking transfers by passing a `DataLoaderConfiguration` with `non_blocking=True` set as the `dataloader_config` when initializing the `Accelerator`. As usual, non-blocking transfers only work if the dataloader also has `pin_memory=True` set. Be aware that non-blocking transfers from GPU to CPU may cause incorrect results if CPU operations are performed on tensors that are not yet ready.
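As a minimal sketch of the configuration that warning describes (the dataset and batch size are stand-ins):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Opt in to non-blocking host-to-device copies.
accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(non_blocking=True)
)

# pin_memory=True is required for non-blocking transfers to take effect.
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8, pin_memory=True)
dataloader = accelerator.prepare(dataloader)
```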
```py
device = accelerator.device
```
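To make the TPU caveat above concrete, a small sketch in which the linear model is a placeholder for your own: move the model to `accelerator.device` before creating the optimizer.
```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

# Move the model to the device *before* building the optimizer,
# per the warning above; the model here is just a placeholder.
model = torch.nn.Linear(16, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```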
3. Pass all relevant PyTorch objects for training (optimizer, model, dataloader(s), learning rate scheduler) to the `prepare()` method as soon as they're created. This method wraps the model in a container optimized for your distributed setup, uses Accelerate's version of the optimizer and scheduler, and creates a sharded version of your dataloader for distribution across GPUs or TPUs.
```python
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, lr_scheduler
)
```
4. Replace `loss.backward()` with `accelerator.backward(loss)` to use the correct `backward()` method for your training setup.
```py
accelerator.backward(loss)
```
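Putting the four steps together, here is a minimal end-to-end sketch; the model, data, loss, and hyperparameters are placeholders rather than anything from the quicktour itself:
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # step 1

# Placeholder model, data, optimizer, and scheduler.
model = torch.nn.Linear(16, 2)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
train_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
loss_function = torch.nn.CrossEntropyLoss()

# Step 3: hand everything to prepare() right after creation.
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)

for inputs, targets in train_dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)                 # step 2: no manual .to(device)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)              # step 4
    optimizer.step()
    scheduler.step()
```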
What is llms.txt?
llms.txt is an open standard that helps AI language models understand your website. By placing a structured markdown file at /llms.txt, websites provide AI search engines like ChatGPT, Claude, and Perplexity with a clear map of their content, services, and documentation. Projects like Hugging Face Accelerate use it to ensure AI accurately represents their brand when answering user queries. Read the spec.
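For reference, a minimal llms.txt in the spec's layout looks roughly like this (the project name and URLs are made up):
```markdown
# Example Project

> One-sentence summary of what the project does and who it's for.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): getting started guide
- [API reference](https://example.com/docs/api.md): full API details
```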