TRL (Transformers Reinforcement Learning) is a full-stack Python library designed for post-training foundation models. It provides state-of-the-art algorithms for supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), and reward modeling — scaling from a single GPU to multi-node clusters.

Installation

Install TRL with pip or from source and set up your environment
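The standard install paths look like this (the PyPI package name and repository URL are the library's published ones):

```shell
# Latest release from PyPI
pip install trl

# Or the development version from source
pip install git+https://github.com/huggingface/trl.git
```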

Quickstart

Fine-tune your first model in minutes with SFT, DPO, GRPO, or a reward model

SFT Trainer

Supervised fine-tuning with packing, chat templates, and LoRA support

CLI

Fine-tune directly from your terminal without writing code
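A terminal run might look like the following sketch; the subcommand mirrors the trainer name, and the flags mirror the corresponding config fields (model and dataset names are illustrative):

```shell
# Supervised fine-tuning straight from the shell, no Python file needed.
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --output_dir Qwen2-0.5B-SFT
```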

What is post-training?

Pre-trained language models learn general representations from large text corpora, but they require additional training to become useful assistants. Post-training adapts a foundation model to follow instructions, align with human preferences, and reason more accurately. TRL covers the full post-training pipeline.

Trainer taxonomy

TRL organizes its trainers into four broad categories:

Online methods

Online methods generate completions during training and optimize them with a reward signal. These methods are well-suited for tasks with verifiable answers or when a reward model is available.
  • GRPOTrainer — Group Relative Policy Optimization. Trains the model by comparing groups of sampled completions against a reward function. Used to train DeepSeek-R1.
  • RLOOTrainer — REINFORCE Leave-One-Out. A variance-reduced policy gradient algorithm.

Offline methods

Offline methods train on pre-collected preference or demonstration data without generating new completions at training time.
  • SFTTrainer — Supervised Fine-Tuning. The standard starting point: train on curated demonstrations.
  • DPOTrainer — Direct Preference Optimization. Optimizes the model directly from preference pairs, without a separate reward model.
  • KTOTrainer — Kahneman-Tversky Optimization. Aligns models using binary feedback (thumbs up / thumbs down) instead of preference pairs.
  • BCOTrainer — Binary Classifier Optimization. Also works from binary feedback, training an underlying binary classifier to separate desirable from undesirable completions.

Reward modeling

Reward models score completions and provide the signal used by online RL methods.
  • RewardTrainer — Trains a scalar reward model on preference pairs.

Knowledge distillation

Distillation methods transfer capabilities from a larger teacher model to a smaller student model.
  • GKDTrainer — Generalized Knowledge Distillation. Trains the student to match the teacher's token-level output distributions, including on sequences the student generates itself.
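TRL's distillation trainer, GKDTrainer, optimizes a generalized Jensen-Shannon divergence between teacher and student token distributions (following the GKD paper). A standalone sketch of that divergence, independent of the library:

```python
import math

def generalized_jsd(p, q, beta=0.5):
    """Generalized JSD as used in GKD:
    D_beta(p || q) = beta*KL(p||m) + (1-beta)*KL(q||m),
    where m = beta*p + (1-beta)*q. Here p and q are discrete
    probability distributions over the same support.
    """
    m = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence, skipping zero-probability terms.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return beta * kl(p, m) + (1 - beta) * kl(q, m)
```

At beta = 0.5 this is the familiar symmetric Jensen-Shannon divergence; skewing beta toward 0 or 1 interpolates between mode-seeking and mass-covering behavior.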

Hugging Face ecosystem

TRL is built on top of and integrates natively with:
  • Transformers — model loading, tokenization, and training infrastructure. Every TRL trainer is a lightweight wrapper around the Transformers Trainer.
  • Accelerate — distributed training across single GPU, multi-GPU (DDP), and multi-node (DeepSpeed ZeRO, FSDP) setups.
  • PEFT — parameter-efficient fine-tuning via LoRA and QLoRA, enabling training of large models on modest hardware.
  • Datasets — efficient dataset loading, processing, and streaming from the Hugging Face Hub.
All TRL trainers natively support distributed training methods including DDP, DeepSpeed ZeRO, and FSDP without any additional configuration.