TRL (Transformers Reinforcement Learning) is a full-stack Python library designed for post-training foundation models. It provides state-of-the-art algorithms for supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), and reward modeling — scaling from a single GPU to multi-node clusters.

Installation

Install TRL with pip or from source and set up your environment
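The standard install paths look like this (the PyPI package name and repository URL are the library's published ones):

```shell
# Latest release from PyPI
pip install trl

# Or the development version from source
pip install git+https://github.com/huggingface/trl.git
```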

Quickstart

Fine-tune your first model in minutes with SFT, DPO, GRPO, or a reward model

SFT Trainer

Supervised fine-tuning with packing, chat templates, and LoRA support

CLI

Fine-tune directly from your terminal without writing code
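A terminal run might look like the following sketch; the subcommand mirrors the trainer name, and the flags mirror the corresponding config fields (model and dataset names are illustrative):

```shell
# Supervised fine-tuning straight from the shell, no Python file needed.
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --output_dir Qwen2-0.5B-SFT
```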

What is post-training?

Pre-trained language models learn general representations from large text corpora, but they require additional training to become useful assistants. Post-training adapts a foundation model to follow instructions, align with human preferences, and reason more accurately. TRL covers the full post-training pipeline.

Trainer taxonomy

TRL organizes its trainers into four broad categories:

Online methods

Online methods generate completions during training and optimize them with a reward signal. These methods are well-suited for tasks with verifiable answers or when a reward model is available.
  • GRPOTrainer — Group Relative Policy Optimization. Trains the model by comparing groups of sampled completions against a reward function. Used to train DeepSeek-R1.
  • RLOOTrainer — REINFORCE Leave-One-Out. A variance-reduced policy gradient algorithm.

Offline methods

Offline methods train on pre-collected preference or demonstration data without generating new completions at training time.
  • SFTTrainer — Supervised Fine-Tuning. The standard starting point: train on curated demonstrations.
  • DPOTrainer — Direct Preference Optimization. Optimizes the model directly from preference pairs, without a separate reward model.
  • KTOTrainer — Kahneman-Tversky Optimization. Aligns models using binary feedback (thumbs up / thumbs down) instead of preference pairs.
  • BCOTrainer — Binary Classifier Optimization. Also works from binary feedback, training an underlying binary classifier to separate desirable from undesirable completions.

Reward modeling

Reward models score completions and provide the signal used by online RL methods.
  • RewardTrainer — Trains a scalar reward model on preference pairs.

Knowledge distillation

Distillation methods transfer capabilities from a larger teacher model to a smaller student model.
  • GKDTrainer — Generalized Knowledge Distillation. Trains the student to match the teacher's token-level output distributions, including on sequences the student generates itself.
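TRL's distillation trainer, GKDTrainer, optimizes a generalized Jensen-Shannon divergence between teacher and student token distributions (following the GKD paper). A standalone sketch of that divergence, independent of the library:

```python
import math

def generalized_jsd(p, q, beta=0.5):
    """Generalized JSD as used in GKD:
    D_beta(p || q) = beta*KL(p||m) + (1-beta)*KL(q||m),
    where m = beta*p + (1-beta)*q. Here p and q are discrete
    probability distributions over the same support.
    """
    m = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence, skipping zero-probability terms.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return beta * kl(p, m) + (1 - beta) * kl(q, m)
```

At beta = 0.5 this is the familiar symmetric Jensen-Shannon divergence; skewing beta toward 0 or 1 interpolates between mode-seeking and mass-covering behavior.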

Hugging Face ecosystem

TRL is built on top of and integrates natively with:
  • Transformers — model loading, tokenization, and training infrastructure. Every TRL trainer is a lightweight wrapper around the Transformers Trainer.
  • Accelerate — distributed training across single GPU, multi-GPU (DDP), and multi-node (DeepSpeed ZeRO, FSDP) setups.
  • PEFT — parameter-efficient fine-tuning via LoRA and QLoRA, enabling training of large models on modest hardware.
  • Datasets — efficient dataset loading, processing, and streaming from the Hugging Face Hub.
All TRL trainers natively support distributed training methods including DDP, DeepSpeed ZeRO, and FSDP without any additional configuration.