Command line interface - TRL - Transformers Reinforcement Learning

TRL provides a command-line interface (CLI) to fine-tune large language models using methods like SFT, DPO, GRPO, and more. The CLI abstracts away boilerplate so you can launch training jobs quickly and reproducibly.

Available commands

trl sft

Supervised fine-tuning

trl dpo

Direct Preference Optimization

trl grpo

Group Relative Policy Optimization

trl rloo

REINFORCE Leave-One-Out

trl kto

Kahneman-Tversky Optimization

trl reward

Reward model training

Other commands:

trl env — print system and dependency information
trl vllm-serve — start a vLLM generation server
trl skills — manage TRL agent skills

Basic usage

Specify the model and dataset directly as flags:

trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb

trl dpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name anthropic/hh-rlhf

trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name HuggingFaceH4/Polaris-Dataset-53K \
  --reward_funcs accuracy_reward

trl rloo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name HuggingFaceH4/Polaris-Dataset-53K \
  --reward_funcs accuracy_reward

trl kto \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/kto-mix-14k

trl reward \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/ultrafeedback_binarized

Key flags

Model flags (`ModelConfig`)

Flag	Default	Description
`--model_name_or_path`		Model checkpoint or Hub ID
`--model_revision`	`main`	Branch, tag, or commit ID
`--dtype`	`float32`	Model dtype: `auto`, `bfloat16`, `float16`, `float32`
`--attn_implementation`		Attention backend (e.g. `flash_attention_2`, `kernels-community/flash-attn2`)
`--trust_remote_code`	`false`	Allow custom model code from the Hub
`--use_peft`	`false`	Enable PEFT/LoRA training
`--lora_r`	`16`	LoRA rank
`--lora_alpha`	`32`	LoRA scaling factor
`--lora_dropout`	`0.05`	LoRA dropout
`--lora_target_modules`		Modules to apply LoRA to
`--load_in_4bit`	`false`	Load base model in 4-bit (QLoRA)
`--load_in_8bit`	`false`	Load base model in 8-bit
`--bnb_4bit_quant_type`	`nf4`	4-bit quantization type: `nf4` or `fp4`

Training flags (shared across trainers)

Flag	Description
`--output_dir`	Directory to save the trained model
`--learning_rate`	Learning rate
`--num_train_epochs`	Number of training epochs
`--max_steps`	Maximum number of training steps (overrides epochs)
`--per_device_train_batch_size`	Batch size per GPU
`--gradient_accumulation_steps`	Steps to accumulate gradients before updating
`--bf16`	Enable bfloat16 mixed precision
`--fp16`	Enable float16 mixed precision
`--eval_strategy`	Evaluation strategy: `no`, `steps`, `epoch`
`--eval_steps`	Evaluate every N steps (when `eval_strategy=steps`)
`--push_to_hub`	Push trained model to the Hugging Face Hub
`--gradient_checkpointing`	Enable gradient checkpointing

SFT-specific flags

Flag	Description
`--max_length`	Maximum sequence length for truncation
`--packing`	Enable sequence packing
`--packing_strategy`	Packing strategy: `bfd`, `bfd_split`, or `wrapped`
`--eos_token`	EOS token string (e.g. `<\|im_end\|>`)

DPO-specific flags

Flag	Description
`--max_length`	Maximum combined prompt+completion length
`--beta`	KL penalty coefficient
`--loss_type`	DPO loss type (e.g. `sigmoid`, `hinge`, `ipo`)

GRPO-specific flags

Flag	Description
`--reward_funcs`	Built-in reward functions to use (e.g. `accuracy_reward`, `think_format_reward`)
`--reward_model_name_or_path`	External reward model Hub ID or local path
`--use_vllm`	Enable vLLM for fast generation
`--vllm_mode`	vLLM mode: `server`

Built-in reward_funcs values for GRPO and RLOO:

accuracy_reward
reasoning_accuracy_reward
think_format_reward
get_soft_overlong_punishment
Any dotted import path (e.g. my_lib.rewards.custom_reward)

Using config files

Define all training arguments in a YAML config file for cleaner, reproducible runs:

SFT
DPO
GRPO

# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
packing: true
output_dir: Qwen2.5-0.5B-SFT

trl sft --config sft_config.yaml

# dpo_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B-Instruct
dataset_name: trl-lib/ultrafeedback_binarized
learning_rate: 5.0e-7
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
output_dir: Qwen2.5-0.5B-DPO

trl dpo --config dpo_config.yaml

# grpo_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: HuggingFaceH4/Polaris-Dataset-53K
reward_funcs:
  - accuracy_reward
output_dir: Qwen2.5-0.5B-GRPO

trl grpo --config grpo_config.yaml

CLI flags passed alongside --config override values in the file.

Multi-GPU and distributed training

The TRL CLI natively supports Accelerate. Pass any accelerate launch argument directly, such as --num_processes:

trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --num_processes 4

Using `--accelerate_config`

The --accelerate_config flag selects a distributed training strategy. It accepts either a predefined profile name or a path to a custom Accelerate YAML config file. Predefined profiles:

Name	Description
`single_gpu`	Single-GPU training
`multi_gpu`	Multi-GPU with DDP
`fsdp1`	Fully Sharded Data Parallel Stage 1
`fsdp2`	Fully Sharded Data Parallel Stage 2
`zero1`	DeepSpeed ZeRO Stage 1
`zero2`	DeepSpeed ZeRO Stage 2
`zero3`	DeepSpeed ZeRO Stage 3

trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --accelerate_config zero2

Or in a config file:

model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
accelerate_config: zero2

Dataset mixtures

Combine multiple datasets into a single training dataset using the datasets key in your config file:

SFT
DPO
GRPO

model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
  - path: stanfordnlp/imdb
  - path: roneneldan/TinyStories

model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
  - path: BAAI/Infinity-Preference
  - path: argilla/Capybara-Preferences

model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
  - path: HuggingFaceH4/Polaris-Dataset-53K
  - path: trl-lib/DeepMath-103K
reward_funcs:
  - accuracy_reward

See DatasetConfig and DatasetMixtureConfig for all available dataset mixture keywords.

LoRA training example

Full SFT training with LoRA via the CLI:

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT-LoRA \
  --push_to_hub

Getting system information

Print system and dependency versions for bug reports:

trl env

This outputs platform, Python, PyTorch, Transformers, Accelerate, TRL, and optional dependency versions.

Documentation Index

​Available commands

trl sft

trl dpo

trl grpo

trl rloo

trl kto

trl reward

​Basic usage

​Key flags

​Model flags (ModelConfig)

​Training flags (shared across trainers)

​SFT-specific flags

​DPO-specific flags

​GRPO-specific flags

​Using config files

​Multi-GPU and distributed training

​Using --accelerate_config

​Dataset mixtures

​LoRA training example

​Getting system information