TRL provides a command-line interface (CLI) to fine-tune large language models using methods like SFT, DPO, GRPO, and more. The CLI abstracts away boilerplate so you can launch training jobs quickly and reproducibly.
Available commands
- `trl sft`: Supervised fine-tuning
- `trl dpo`: Direct Preference Optimization
- `trl grpo`: Group Relative Policy Optimization
- `trl rloo`: REINFORCE Leave-One-Out
- `trl kto`: Kahneman-Tversky Optimization
- `trl reward`: Reward model training
- `trl env`: Print system and dependency information
- `trl vllm-serve`: Start a vLLM generation server
- `trl skills`: Manage TRL agent skills
Basic usage
Specify the model and dataset directly as flags. This works the same way for each trainer command: SFT, DPO, GRPO, RLOO, KTO, and Reward.
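As a minimal sketch of what such an invocation looks like, the Python snippet below assembles a `trl <command>` call as an argument list and prints it. The model and dataset IDs are illustrative placeholders, the helper `build_trl_command` is hypothetical, and `--dataset_name` is assumed to be the dataset flag (it is not listed in the tables on this page, so check `trl sft --help` for your installed version).

```python
import shlex


def build_trl_command(trainer, model, dataset, output_dir):
    """Assemble a `trl <trainer>` invocation as an argv list (hypothetical helper)."""
    return [
        "trl", trainer,
        "--model_name_or_path", model,
        "--dataset_name", dataset,  # assumed flag name, see the note above
        "--output_dir", output_dir,
    ]


# Illustrative values only; substitute your own model and dataset.
cmd = build_trl_command("sft", "Qwen/Qwen2.5-0.5B", "trl-lib/Capybara", "./sft-output")

# Print a shell-ready command line instead of launching a training job.
print(shlex.join(cmd))

# To actually launch, something like this should work:
#   import subprocess; subprocess.run(cmd, check=True)
```

Swapping `"sft"` for another command name such as `"dpo"` keeps the rest of the call unchanged, although each trainer expects a dataset in its own format.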
Key flags
Model flags (ModelConfig)
| Flag | Default | Description |
|---|---|---|
| `--model_name_or_path` | | Model checkpoint or Hub ID |
| `--model_revision` | `main` | Branch, tag, or commit ID |
| `--dtype` | `float32` | Model dtype: `auto`, `bfloat16`, `float16`, `float32` |
| `--attn_implementation` | | Attention backend (e.g. `flash_attention_2`, `kernels-community/flash-attn2`) |
| `--trust_remote_code` | `false` | Allow custom model code from the Hub |
| `--use_peft` | `false` | Enable PEFT/LoRA training |
| `--lora_r` | `16` | LoRA rank |
| `--lora_alpha` | `32` | LoRA scaling factor |
| `--lora_dropout` | `0.05` | LoRA dropout |
| `--lora_target_modules` | | Modules to apply LoRA to |
| `--load_in_4bit` | `false` | Load base model in 4-bit (QLoRA) |
| `--load_in_8bit` | `false` | Load base model in 8-bit |
| `--bnb_4bit_quant_type` | `nf4` | 4-bit quantization type: `nf4` or `fp4` |
Training flags (shared across trainers)
| Flag | Description |
|---|---|
| `--output_dir` | Directory to save the trained model |
| `--learning_rate` | Learning rate |
| `--num_train_epochs` | Number of training epochs |
| `--max_steps` | Maximum number of training steps (overrides epochs) |
| `--per_device_train_batch_size` | Batch size per GPU |
| `--gradient_accumulation_steps` | Steps to accumulate gradients before updating |
| `--bf16` | Enable bfloat16 mixed precision |
| `--fp16` | Enable float16 mixed precision |
| `--eval_strategy` | Evaluation strategy: `no`, `steps`, `epoch` |
| `--eval_steps` | Evaluate every N steps (when `eval_strategy=steps`) |
| `--push_to_hub` | Push trained model to the Hugging Face Hub |
| `--gradient_checkpointing` | Enable gradient checkpointing |
SFT-specific flags
| Flag | Description |
|---|---|
| `--max_length` | Maximum sequence length for truncation |
| `--packing` | Enable sequence packing |
| `--packing_strategy` | Packing strategy: `bfd`, `bfd_split`, or `wrapped` |
| `--eos_token` | EOS token string (e.g. `<\|im_end\|>`) |
DPO-specific flags
| Flag | Description |
|---|---|
| `--max_length` | Maximum combined prompt+completion length |
| `--beta` | KL penalty coefficient |
| `--loss_type` | DPO loss type (e.g. `sigmoid`, `hinge`, `ipo`) |
GRPO-specific flags
| Flag | Description |
|---|---|
| `--reward_funcs` | Built-in reward functions to use (e.g. `accuracy_reward`, `think_format_reward`) |
| `--reward_model_name_or_path` | External reward model Hub ID or local path |
| `--use_vllm` | Enable vLLM for fast generation |
| `--vllm_mode` | vLLM mode: `server` |
`reward_funcs` values for GRPO and RLOO:
- `accuracy_reward`
- `reasoning_accuracy_reward`
- `think_format_reward`
- `get_soft_overlong_punishment`
- Any dotted import path (e.g. `my_lib.rewards.custom_reward`)
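To make the dotted-import-path option concrete, here is a minimal sketch of what a module like `my_lib/rewards.py` might contain. It assumes the trainer calls reward functions with a `completions` argument plus extra keyword arguments and expects one float per completion; that matches the custom reward function convention as I understand it, but verify it against the trainer documentation for your TRL version. The length-based scoring rule is purely illustrative.

```python
# Hypothetical contents of my_lib/rewards.py


def custom_reward(completions, **kwargs):
    """Return 1.0 for completions of at most 200 characters, else 0.0."""
    rewards = []
    for completion in completions:
        # Assumption: conversational datasets yield a list of message dicts,
        # while plain-text datasets yield strings.
        if isinstance(completion, list):
            text = completion[0]["content"]
        else:
            text = completion
        rewards.append(1.0 if len(text) <= 200 else 0.0)
    return rewards


# Quick local check, independent of TRL.
scores = custom_reward(["short answer", "x" * 500])
print(scores)  # prints [1.0, 0.0]
```

With the module importable from the working directory, the function would then be referenced as `--reward_funcs my_lib.rewards.custom_reward`.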
Using config files
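Define all training arguments in a YAML config file for cleaner, reproducible runs, then pass the file with `--config`. This works with trainer commands such as SFT, DPO, and GRPO.
Flags passed on the command line alongside `--config` override values in the file.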
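The sketch below shows one plausible shape for such a config file and a matching command line. It assumes that YAML keys mirror the flag names without the leading dashes and that `dataset_name` is a valid key; the file name `sft_config.yaml` and all values are illustrative. The learning rate is written with a decimal point (`2.0e-5`) because some YAML parsers read `2e-5` as a string.

```python
import shlex

# Illustrative config; save this text as sft_config.yaml.
CONFIG_YAML = """\
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
output_dir: ./sft-output
learning_rate: 2.0e-5
num_train_epochs: 1
"""

# The extra --learning_rate flag is meant to take precedence over the
# learning_rate value stored in the file.
cmd = ["trl", "sft", "--config", "sft_config.yaml", "--learning_rate", "1.0e-5"]

print(CONFIG_YAML)
print(shlex.join(cmd))
```

Keeping shared settings in the file and passing only the values that vary per run as flags makes it easy to see what distinguishes one experiment from another.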
Multi-GPU and distributed training
The TRL CLI natively supports Accelerate. Pass any `accelerate launch` argument, such as `--num_processes`, directly to the `trl` command.
Using --accelerate_config
The --accelerate_config flag selects a distributed training strategy. It accepts either a predefined profile name or a path to a custom Accelerate YAML config file.
Predefined profiles:
| Name | Description |
|---|---|
| `single_gpu` | Single-GPU training |
| `multi_gpu` | Multi-GPU with DDP |
| `fsdp1` | Fully Sharded Data Parallel (FSDP version 1) |
| `fsdp2` | Fully Sharded Data Parallel (FSDP version 2) |
| `zero1` | DeepSpeed ZeRO Stage 1 |
| `zero2` | DeepSpeed ZeRO Stage 2 |
| `zero3` | DeepSpeed ZeRO Stage 3 |
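As a rough illustration of both options, the sketch below builds one command that forwards `--num_processes` and another that selects a predefined profile. The helper `with_accelerate_config` and its file-extension check are hypothetical conveniences, not part of TRL; the model and dataset IDs and the `--dataset_name` flag are assumptions as well.

```python
import shlex

# Profile names taken from the table above.
PROFILES = {"single_gpu", "multi_gpu", "fsdp1", "fsdp2", "zero1", "zero2", "zero3"}


def with_accelerate_config(cmd, config):
    """Append --accelerate_config with a profile name or a YAML path (hypothetical helper)."""
    # Local sanity check only; the CLI performs its own validation.
    if config not in PROFILES and not config.endswith((".yaml", ".yml")):
        raise ValueError(f"expected a profile name or a YAML path, got {config!r}")
    return [*cmd, "--accelerate_config", config]


base = [
    "trl", "sft",
    "--model_name_or_path", "Qwen/Qwen2.5-0.5B",  # illustrative model ID
    "--dataset_name", "trl-lib/Capybara",         # assumed flag, illustrative dataset
]

# Forward an `accelerate launch` argument directly.
print(shlex.join([*base, "--num_processes", "4"]))

# Select a predefined profile, or point at a custom Accelerate config file.
print(shlex.join(with_accelerate_config(base, "zero2")))
print(shlex.join(with_accelerate_config(base, "configs/my_accelerate.yaml")))
```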
Dataset mixtures
Combine multiple datasets into a single training dataset using the `datasets` key in your config file. This works with trainer commands such as SFT, DPO, and GRPO.
See DatasetConfig and DatasetMixtureConfig for all available dataset mixture keywords.
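For orientation, the sketch below renders a small `datasets` block that could be pasted into a config file. The per-entry keys `path` and `split` are assumptions about the DatasetConfig fields, the dataset IDs are placeholders, and `render_mixture` is a hypothetical helper; consult DatasetConfig and DatasetMixtureConfig for the authoritative keywords.

```python
def render_mixture(datasets):
    """Render a list of dataset entries as a YAML `datasets` block (hypothetical helper)."""
    lines = ["datasets:"]
    for entry in datasets:
        items = list(entry.items())
        first_key, first_value = items[0]
        # The first key of each entry opens a new YAML list item.
        lines.append(f"  - {first_key}: {first_value}")
        for key, value in items[1:]:
            lines.append(f"    {key}: {value}")
    return "\n".join(lines) + "\n"


# Placeholder dataset IDs; the `path` and `split` keys are assumed field names.
mixture = [
    {"path": "my-org/dataset-a", "split": "train"},
    {"path": "my-org/dataset-b", "split": "train"},
]

yaml_block = render_mixture(mixture)
print(yaml_block)
```

Datasets in a mixture generally need compatible columns for the chosen trainer, so it is worth inspecting each one before combining them.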