Introduction: Beyond Raw Language Modeling
While pre-training provides Large Language Models with fundamental language understanding capabilities, the models that users interact with—from ChatGPT to Claude—undergo sophisticated fine-tuning processes that align their behavior with human preferences and values. This alignment is achieved through three primary paradigms: Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO).
Understanding these fine-tuning approaches is crucial for anyone working with modern LLMs, as they fundamentally shape how models respond to user queries, handle sensitive topics, and maintain helpful, harmless, and honest behavior. Each approach has distinct theoretical foundations, practical implementations, and trade-offs that determine their suitability for different applications.
This comprehensive guide explores the mathematical foundations, implementation strategies, and practical considerations of each fine-tuning paradigm, providing insights into how modern AI systems achieve their remarkable alignment with human preferences.
Supervised Fine-Tuning (SFT): The Foundation Layer
Theoretical Framework
Supervised Fine-Tuning represents the most straightforward approach to adapting pre-trained language models for specific behaviors or tasks. SFT continues the language modeling objective but on carefully curated datasets that demonstrate desired model behavior.
Mathematical Foundation
SFT optimizes the same autoregressive language modeling objective as pre-training:
L_SFT = -E_{(x,y)~D_SFT} [∑_{i=1}^{|y|} log P_θ(y_i | x, y_{<i})]
Where:
- D_SFT is the supervised fine-tuning dataset
- x represents the input prompt or context
- y represents the desired model response
- θ are the model parameters being optimized
The key difference from pre-training lies in the dataset composition rather than the objective function.
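To make this concrete, below is a minimal PyTorch-style sketch of the SFT loss, assuming a Hugging Face-style causal LM whose forward pass returns `.logits` and a batch where each sequence is a prompt followed by its response. The function name `compute_sft_loss` and the `prompt_len` convention are illustrative, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def compute_sft_loss(model, input_ids, prompt_len):
    """Cross-entropy over response tokens only; prompt tokens are masked out.

    input_ids:  (batch, seq_len) tensor containing prompt followed by response.
    prompt_len: (batch,) tensor with the number of prompt tokens per example.
    (Padding is ignored here for brevity.)
    """
    logits = model(input_ids).logits                   # (batch, seq_len, vocab)
    # Shift so that position i predicts token i+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask out prompt positions: only response tokens y_i contribute to L_SFT.
    positions = torch.arange(shift_labels.size(1), device=input_ids.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_len.unsqueeze(1) - 1)
    shift_labels[prompt_mask] = -100                   # ignored by cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

Masking the prompt positions with the ignore index keeps the computed loss aligned with L_SFT above: only the response tokens y_i contribute.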
Dataset Design Principles
High-Quality Demonstrations: SFT datasets consist of input-output pairs that exemplify desired model behavior across various scenarios.
Task Coverage: Comprehensive coverage of different task types, interaction patterns, and edge cases that the model should handle appropriately.
Behavior Modeling: Examples demonstrate not just correct answers but appropriate tone, style, and reasoning processes.
Safety Integration: Include examples of handling sensitive, controversial, or potentially harmful requests appropriately.
Implementation Strategies
Data Collection Methods
Human Annotation: Expert annotators create high-quality examples of desired model behavior:
- Detailed guidelines for consistent annotation
- Multiple annotators per example for quality assurance
- Regular calibration sessions to maintain standards
- Specialized expertise for domain-specific content
Model-Assisted Generation: Use existing models to generate candidate responses, then human curators select and refine the best examples:
- Reduces annotation cost while maintaining quality
- Enables rapid scaling of dataset creation
- Requires careful quality control to prevent error propagation
- Useful for bootstrapping new domains or languages
Constitutional AI Methods: Generate responses according to explicit principles or rules:
- Define clear behavioral principles or constitutions
- Generate responses that follow these principles
- Iterative refinement based on principle adherence
- Transparent and auditable alignment process
Training Dynamics
Learning Rate Considerations: SFT typically uses lower learning rates than pre-training to preserve pre-trained knowledge while adapting behavior:
- Start with 10-100x smaller learning rates than pre-training
- Use warm-up periods to stabilize fine-tuning
- Monitor for catastrophic forgetting of general capabilities
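As a concrete illustration of these settings, the sketch below pairs AdamW with a linear warm-up schedule from the `transformers` library. The peak learning rate and warm-up fraction are illustrative assumptions rather than prescriptions, and `model` is assumed to be an already-loaded pre-trained network.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Illustrative values only: a peak LR roughly 10-100x below pre-training,
# with a short warm-up before linear decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

total_steps = 3_000
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * total_steps),   # ~3% of steps as warm-up
    num_training_steps=total_steps,
)

# Inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```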
Data Efficiency: SFT can achieve significant behavioral changes with relatively small, high-quality datasets:
- Hundreds to thousands of examples often sufficient
- Quality more important than quantity
- Diverse examples more valuable than repetitive ones
Overfitting Prevention: Balance between learning desired behaviors and maintaining generalization:
- Early stopping based on held-out validation data
- Regularization techniques to prevent memorization
- Data augmentation through paraphrasing and variation
Strengths and Limitations
Advantages of SFT
Simplicity: Straightforward implementation using standard language modeling techniques.
Interpretability: Clear connection between training examples and model behavior.
Data Efficiency: Relatively small datasets can produce significant behavioral changes.
Stable Training: Well-understood training dynamics with predictable outcomes.
Foundation for Further Training: Provides good starting point for more advanced alignment techniques.
Limitations and Challenges
Distribution Shift: Models may struggle with inputs significantly different from SFT examples.
Behavior Underspecification: Difficulty capturing every aspect of desired behavior through a finite set of demonstrations.
Limited Feedback Signal: Demonstrations show only what to do; the model receives no graded signal about response quality and no examples of behavior to avoid.
Exposure Bias: Models trained only on perfect demonstrations never see their own mistakes during training, so they may struggle to recover from errors at generation time.
Scalability Challenges: Creating comprehensive SFT datasets becomes expensive as requirements grow.
Reinforcement Learning from Human Feedback (RLHF)
Theoretical Foundation
RLHF represents a sophisticated approach that trains models to optimize for human preferences rather than simply imitating human demonstrations. This paradigm treats language generation as a sequential decision-making problem and uses reinforcement learning to maximize a learned reward function.
The RLHF Pipeline
RLHF involves three distinct stages:
- Supervised Fine-Tuning: Initial behavioral training on demonstration data
- Reward Model Training: Learning to predict human preferences from comparison data
- Reinforcement Learning: Optimizing the language model using the learned reward model
Mathematical Framework
The core RLHF objective, which is maximized during training, combines expected reward with a KL regularization term that keeps the policy close to the initial model:
L_RLHF = E_{x~D,y~π_θ(·|x)} [R(x,y)] - β · KL(π_θ(·|x) || π_ref(·|x))
Where:
- R(x,y) is the reward model score for response y to prompt x
- π_θ is the policy (language model) being optimized
- π_ref is the reference model (typically the SFT model)
- β controls the strength of the KL penalty
- KL(·||·) is the Kullback-Leibler divergence
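In practice the KL term is usually applied per token during rollout scoring rather than computed in closed form. The sketch below follows that common pattern, assuming per-token log-probabilities of the sampled response have already been gathered under both the policy and the reference model; the helper name is illustrative.

```python
import torch

def kl_penalized_rewards(reward_scores, policy_logprobs, ref_logprobs, beta=0.1):
    """Combine reward model scores with a per-token KL penalty.

    reward_scores:   (batch,) scalar reward model score per response.
    policy_logprobs: (batch, resp_len) log pi_theta(y_i | x, y_<i) at sampled tokens.
    ref_logprobs:    (batch, resp_len) same quantities under the reference model.
    Returns per-token rewards of shape (batch, resp_len).
    """
    # Per-token KL estimate: log-ratio of policy to reference on the sampled tokens.
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token
    # Add the sequence-level reward model score at the final token of each response.
    rewards[:, -1] += reward_scores
    return rewards
```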
Reward Model Design and Training
Preference Data Collection
Pairwise Comparisons: Human annotators compare pairs of model responses and indicate which is better:
- More natural for humans than scoring individual responses
- Provides relative rather than absolute quality assessments
- Enables bootstrapping from lower-quality initial models
- Captures nuanced preferences difficult to specify explicitly
Comparison Interface Design: Effective interfaces for collecting high-quality preference data:
- Side-by-side response presentation
- Clear criteria for evaluation (helpfulness, harmlessness, honesty)
- Optional reasoning fields for annotator explanations
- Quality control mechanisms to identify inconsistent annotators
Reward Model Architecture
Bradley-Terry Model: The standard approach models the probability that response A is preferred over response B:
P(y_A ≻ y_B | x) = σ(R(x, y_A) - R(x, y_B))
Where σ is the sigmoid function and R is the learned reward function.
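As a worked example, if the reward model assigns R(x, y_A) = 1.2 and R(x, y_B) = 0.2, then P(y_A ≻ y_B | x) = σ(1.0) ≈ 0.73, so y_A is predicted to be preferred roughly three times out of four.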
Loss Function: The reward model is trained to minimize:
L_RM = -E_{(x,y_A,y_B)~D_pref} [log σ(R(x, y_A) - R(x, y_B))]
Where the preference dataset D_pref contains human preference comparisons.
Reward Model Implementation
Architecture Choices: Reward models typically use the same architecture as the language model but with a scalar output head:
- Share most parameters with the language model
- Add a linear layer mapping hidden states to scalar rewards
- Often use the final token representation for sequence-level scoring
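A minimal sketch of this design is shown below, assuming a Hugging Face-style transformer backbone that exposes hidden states; the class and function names are illustrative. It pairs the scalar value head with the Bradley-Terry loss defined above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Transformer backbone with a scalar head; scores the final token's hidden state."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                 # shares the LM architecture/weights
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids, attention_mask=attention_mask, output_hidden_states=True
        ).hidden_states[-1]                                   # (batch, seq, hidden)
        last_index = attention_mask.sum(dim=1) - 1            # last non-padded token
        batch_index = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_index, last_index]
        return self.value_head(last_hidden).squeeze(-1)       # (batch,) scalar rewards

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigma(R(x, y_w) - R(x, y_l))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```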
Training Considerations: Stable reward model training requires careful attention to:
- Data quality and annotator agreement
- Regularization to prevent overfitting to preference data
- Evaluation on held-out preference sets
- Calibration to ensure reward scores reflect true quality differences
Proximal Policy Optimization (PPO) for Language Models
PPO Adaptation to Language Generation
Policy Representation: The language model serves as a stochastic policy:
π_θ(y|x) = ∏_{i=1}^{|y|} P_θ(y_i | x, y_{<i})
Value Function: Estimate expected future rewards from each state:
V(x, y_{<i}) = E_{y_{≥i}~π_θ} [R(x, y_{<i} ⊕ y_{≥i})], where ⊕ denotes concatenation of the generated prefix with the sampled continuation
Advantage Estimation: Measure how much better an action is than expected:
A(x, y_{<i}, y_i) = Q(x, y_{<i}, y_i) - V(x, y_{<i})
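Advantages are commonly computed with Generalized Advantage Estimation (GAE), which blends one-step temporal-difference errors across the response. The sketch below assumes per-token rewards (for example, the KL-penalized rewards above) and per-position value predictions; the γ and λ values are illustrative defaults.

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over a response of length T.

    rewards: (batch, T) per-token rewards.
    values:  (batch, T) value predictions V(x, y_<i) for each position.
    """
    T = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    last_adv = torch.zeros(rewards.size(0), device=rewards.device)
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros_like(last_adv)
        delta = rewards[:, t] + gamma * next_value - values[:, t]   # TD error
        last_adv = delta + gamma * lam * last_adv                   # discounted sum
        advantages[:, t] = last_adv
    returns = advantages + values            # regression targets for the value head
    return advantages, returns
```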
PPO Objective for Language Models
The PPO objective balances policy improvement with stability:
L_PPO = E_t [min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]
Where:
- r_t(θ) = π_θ(y_t | x, y_{<t}) / π_old(y_t | x, y_{<t}) is the probability ratio between the current and previous policies
- A_t is the advantage estimate
- ε is the clipping parameter (typically 0.2)
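A minimal sketch of this clipped surrogate is shown below, applied per token over the generated response and negated so it can be minimized with a standard optimizer. A full PPO step would also include value-function and entropy terms, which are omitted here.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate (negated so it can be minimized).

    logprobs, old_logprobs: (batch, T) log-probs of the sampled response tokens.
    advantages:             (batch, T) advantage estimates A_t.
    """
    ratio = torch.exp(logprobs - old_logprobs)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```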
Implementation Challenges
Credit Assignment: Determining which tokens deserve credit for high rewards:
- Reward is typically assigned to the final token
- Value function helps distribute credit across the sequence
- Baseline subtraction reduces variance in gradient estimates
Exploration vs. Exploitation: Balancing between trying new responses and exploiting known good responses:
- KL penalty encourages staying close to reference policy
- Temperature sampling provides controlled exploration
- Entropy bonuses can encourage response diversity
Sample Efficiency: RL training requires many samples and can be computationally expensive:
- Each training step requires generating complete responses
- Multiple PPO epochs per batch of generated data
- Large batch sizes needed for stable gradient estimates
Benefits and Challenges of RLHF
Advantages
Preference Optimization: Directly optimizes for human preferences rather than imitating demonstrations.
Flexibility: Can capture complex, context-dependent preferences difficult to specify in demonstrations.
Iterative Improvement: Reward models can be updated as new preference data becomes available.
Nuanced Behavior: Enables fine-grained control over model behavior through reward design.
Scalability: Preference collection can be more scalable than demonstration creation.
Challenges and Limitations
Reward Hacking: Models may exploit weaknesses in the reward model to achieve high scores without truly satisfying human preferences:
- Gaming specific reward model biases
- Optimizing for easily measurable aspects while ignoring others
- Producing responses that seem good to the reward model but aren't actually helpful
Training Instability: RL training can be unstable and sensitive to hyperparameters:
- Policy updates that are too large can cause performance collapse
- Reward model inaccuracies can mislead training
- Balancing exploration and exploitation requires careful tuning
Computational Cost: RLHF requires significantly more computation than SFT:
- Training reward models on preference data
- Running PPO with multiple model evaluations per update
- Generating many samples for each training step
Alignment Tax: The process of alignment may reduce performance on some capabilities:
- KL penalty prevents too much deviation from reference model
- Safety constraints may limit model expressiveness
- Optimizing for human preferences may not align with all downstream tasks
Direct Preference Optimization (DPO)
Theoretical Innovation
Direct Preference Optimization represents a breakthrough in alignment methodology by directly optimizing language models on preference data without requiring an explicit reward model or reinforcement learning. DPO reformulates the RLHF objective as a classification problem over preference pairs.
Mathematical Derivation
DPO starts with the RLHF objective and derives a closed-form solution. Under the Bradley-Terry preference model and the KL-constrained RL objective, the optimal policy has the form:
π*(y|x) = π_ref(y|x) exp(R*(x,y)/β) / Z(x)
Where Z(x) is a partition function and R* is the optimal reward function.
Rearranging this relationship, DPO derives that:
R*(x,y) = β log(π*(y|x)/π_ref(y|x)) + β log Z(x)
Since the partition function Z(x) does not depend on the specific response y, it cancels when the reward difference R*(x, y_w) - R*(x, y_l) is substituted into the Bradley-Terry preference model, leaving a preference probability expressed entirely in terms of policy log-ratios.
DPO Objective Function
The DPO loss directly optimizes the language model to satisfy preference constraints:
L_DPO = -E_{(x,y_w,y_l)~D} [log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))]
Where:
- y_w is the preferred (winning) response
- y_l is the less preferred (losing) response
- π_θ is the policy being optimized
- π_ref is the reference policy (typically the SFT model)
- β is the temperature parameter controlling the strength of the KL constraint
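The loss translates almost directly into code. The sketch below assumes the four summed sequence log-probabilities (policy and reference, chosen and rejected) have already been computed over response tokens only, with the reference model kept frozen; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed sequence log-probabilities (one scalar per response)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Because only log-probabilities are needed, a single forward pass per model per response suffices, with no sampling, value function, or separate reward model involved.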
Implementation Advantages
Simplified Training Pipeline
Single-Stage Training: DPO eliminates the need for separate reward model training and RL optimization:
- Direct optimization on preference data
- No intermediate reward model to train or maintain
- Reduced computational requirements compared to RLHF
Stable Training Dynamics: DPO typically exhibits more stable training than PPO-based RLHF:
- No exploration-exploitation dilemmas
- No need to balance multiple loss components
- More predictable convergence behavior
Memory Efficiency: DPO requires less memory than RLHF during training:
- No need to store and update value functions
- No requirement for multiple model copies during PPO updates
- Simpler gradient computation and backpropagation
Theoretical Guarantees
Principled Objective: DPO's derivation from first principles provides theoretical grounding for its effectiveness:
- Directly optimizes the same preferences that RLHF aims to satisfy
- Eliminates potential misalignment between reward model and true preferences
- Provides clearer theoretical understanding of what's being optimized
Preference Satisfaction: Under ideal conditions, DPO converges to the same solution as RLHF:
- Same optimal policy given infinite data and perfect optimization
- More direct path to preference satisfaction
- Reduced risk of reward hacking and gaming
Practical Implementation
Training Procedure
Data Requirements: DPO requires the same preference data as RLHF reward model training:
- Pairs of responses with human preference annotations
- High-quality reference model (typically SFT-trained)
- Diverse coverage of different prompt types and scenarios
Hyperparameter Selection: Key parameters for DPO training:
- β (Beta): Controls the strength of the KL constraint (typically 0.1-0.5)
- Learning Rate: Often smaller than standard fine-tuning (1e-6 to 1e-5)
- Batch Size: Larger batches generally improve stability
- Training Steps: Fewer steps typically needed compared to RLHF
Comparison with RLHF
Performance: Empirical studies show DPO often matches or exceeds RLHF performance:
- Similar final model quality on preference benchmarks
- Sometimes better generalization to out-of-distribution prompts
- More consistent results across different training runs
Efficiency: DPO provides significant computational savings:
- Often reported as roughly 2-3x faster training than the full RLHF pipeline, since no reward model training or response generation is required
- Reduced memory requirements during training
- Simpler implementation and debugging
Robustness: DPO often shows better robustness properties:
- Less sensitive to hyperparameter choices
- More stable across different model sizes
- Better handling of low-quality preference data
Limitations and Considerations
Theoretical Limitations
Assumption Sensitivity: DPO's theoretical guarantees depend on several assumptions:
- Bradley-Terry preference model accurately captures human preferences
- Preference data is high-quality and consistent
- Reference model provides good initialization
Limited Expressivity: DPO optimizes a specific form of preference satisfaction:
- May not capture all aspects of human preference complexity
- Assumes preferences can be captured through pairwise comparisons
- May struggle with context-dependent or conditional preferences
Practical Challenges
Data Quality Sensitivity: DPO performance heavily depends on preference data quality:
- Inconsistent annotations can mislead training
- Biased preference data leads to biased models
- Limited diversity in preference data affects generalization
Reference Model Dependence: DPO requires a high-quality reference model:
- Poor SFT models can limit DPO effectiveness
- Reference model capabilities constrain final model abilities
- Choice of reference model affects optimization dynamics
Advanced Topics in Preference Learning
Preference Pair Sampling Strategies
Sampling from Model Outputs
Temperature Sampling: Generate diverse responses using temperature-controlled sampling:
- Higher temperatures produce more diverse but potentially lower-quality responses
- Lower temperatures produce more conservative but potentially repetitive responses
- Optimal temperature depends on the specific model and task
Top-k and Top-p Sampling: Control response diversity through vocabulary filtering:
- Top-k limits choices to k most likely tokens
- Top-p (nucleus sampling) uses dynamic vocabulary based on cumulative probability
- Combination of both methods often works best
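A sketch of generating several candidate responses per prompt with these decoding controls, using the Hugging Face `generate` API, is shown below; the specific parameter values are illustrative, and `model` and `tokenizer` are assumed to be already loaded.

```python
# Illustrative sketch using the Hugging Face generate() API.
inputs = tokenizer(prompt, return_tensors="pt")
candidates = model.generate(
    **inputs,
    do_sample=True,            # stochastic decoding instead of greedy search
    temperature=0.9,           # higher -> more diverse, potentially lower quality
    top_p=0.95,                # nucleus sampling: dynamic vocabulary cutoff
    top_k=50,                  # hard cap on the candidate vocabulary per step
    num_return_sequences=4,    # several candidates per prompt to form comparison pairs
    max_new_tokens=256,
)
texts = tokenizer.batch_decode(
    candidates[:, inputs["input_ids"].shape[1]:],   # strip the prompt tokens
    skip_special_tokens=True,
)
```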
Contrastive Sampling: Deliberately generate pairs with different characteristics:
- Sample responses with different risk levels
- Generate responses with varying levels of detail
- Create pairs that highlight specific preference dimensions
Active Learning for Preferences
Uncertainty Sampling: Focus annotation effort on examples where current models are most uncertain:
- Identify prompts where model confidence is low
- Prioritize examples with high disagreement between models
- Sample from regions of input space with sparse preference data
Disagreement Sampling: Target cases where different models or annotators disagree:
- Identify systematic differences in model behavior
- Focus on edge cases and boundary conditions
- Improve model robustness through targeted data collection
Constitutional AI and Principle-Based Training
Constitutional AI Framework
Principle Definition: Explicit specification of behavioral principles:
- Define clear, actionable principles for model behavior
- Create hierarchies of principles for conflict resolution
- Ensure principles are interpretable and auditable
Self-Critique Process: Models evaluate and improve their own responses:
- Generate initial response to prompt
- Critique response against constitutional principles
- Revise response based on critique
- Iterate until principles are satisfied
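Schematically, the loop looks like the sketch below; `generate_response`, `critique_response`, and `revise_response` are hypothetical helpers standing in for model calls rather than any particular library's API.

```python
def constitutional_refine(prompt, principles, max_rounds=3):
    """Iteratively critique and revise a response against explicit principles (schematic)."""
    response = generate_response(prompt)                        # hypothetical model call
    for _ in range(max_rounds):
        critique = critique_response(response, principles)      # hypothetical: lists violations
        if not critique.violations:
            break                                               # all principles satisfied
        response = revise_response(prompt, response, critique)  # hypothetical revision call
    return response
```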
Scalable Oversight: Reduce human annotation requirements through principled self-improvement:
- Use principles to generate training signal automatically
- Human oversight focuses on principle definition and validation
- Scale to many principles and scenarios with limited human effort
Implementation Strategies
Critique Model Training: Train specialized models to evaluate responses against principles:
- Fine-tune models to identify principle violations
- Generate explanations for why responses violate principles
- Provide specific suggestions for improvement
Iterative Refinement: Continuously improve responses through multiple critique-revision cycles:
- Apply critique models to identify issues
- Generate improved responses addressing identified problems
- Repeat process until satisfactory quality achieved
Handling Reward Hacking and Gaming
Types of Reward Hacking
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure":
- Models optimize for reward model predictions rather than true human preferences
- Exploit specific biases or limitations in reward model training
- Achieve high scores through gaming rather than genuine improvement
Specification Gaming: Finding unexpected ways to achieve high rewards:
- Exploit ambiguities in reward model specification
- Take advantage of evaluation methodology weaknesses
- Optimize for easily measurable aspects while ignoring harder-to-measure qualities
Mitigation Strategies
Robust Reward Models: Design reward models that are harder to game:
- Train on diverse preference data covering edge cases
- Use multiple evaluation criteria and combine them
- Regularly update reward models based on discovered gaming strategies
Adversarial Training: Deliberately search for and address gaming strategies:
- Red team models to find failure modes
- Generate adversarial examples that exploit model weaknesses
- Iteratively improve models based on discovered vulnerabilities
Multi-Objective Optimization: Balance multiple objectives to prevent gaming:
- Optimize for multiple aspects of quality simultaneously
- Use uncertainty estimates to identify potential gaming
- Incorporate robustness metrics alongside performance measures
Evaluation and Assessment
Preference Evaluation Methodologies
Human Evaluation: Gold standard for assessing preference learning success:
- Side-by-side comparisons between model variants
- Absolute rating scales for individual responses
- Task-specific evaluation criteria
Automated Metrics: Scalable evaluation using computational methods:
- Reward model scores as proxies for human preferences
- Consistency checks across similar prompts
- Diversity and safety metrics
Benchmark Suites: Standardized evaluation across different scenarios:
- Helpfulness benchmarks for task performance
- Harmlessness evaluations for safety assessment
- Honesty metrics for truthfulness and accuracy
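For the side-by-side comparisons above, results are often summarized as a simple win rate. The sketch below assumes judgments recorded as 'A', 'B', or 'tie', with ties counted as half a win; the data format is an illustrative assumption.

```python
def win_rate(judgments):
    """judgments: list of 'A', 'B', or 'tie' from side-by-side comparisons of model A vs. B."""
    wins = sum(1.0 for j in judgments if j == "A")
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)

# Example: 60 wins, 25 losses, 15 ties -> win rate of 0.675 for model A.
print(win_rate(["A"] * 60 + ["B"] * 25 + ["tie"] * 15))
```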
Long-term Behavior Analysis
Distribution Shift Robustness: Evaluate performance on out-of-distribution inputs:
- Test on prompts significantly different from training data
- Assess performance across different domains and contexts
- Monitor for degradation in edge cases
Preference Stability: Ensure learned preferences remain consistent:
- Test preference consistency across similar scenarios
- Monitor for preference drift during continued training
- Validate preference generalization to new contexts
Future Directions and Emerging Approaches
Beyond Pairwise Preferences
Multi-way Comparisons: Extending beyond binary preferences:
- Ranking multiple responses simultaneously
- Capturing more nuanced preference relationships
- Improving data efficiency through richer comparison information
Conditional Preferences: Context-dependent preference learning:
- User-specific preference adaptation
- Task-specific preference optimization
- Dynamic preference adjustment based on context
Integration with Other Learning Paradigms
Meta-Learning for Preferences: Learning to learn preferences quickly:
- Few-shot adaptation to new preference criteria
- Transfer learning across related preference domains
- Personalization with minimal user feedback
Continual Preference Learning: Updating preferences without forgetting:
- Incorporating new preference data without catastrophic forgetting
- Balancing stability and plasticity in preference models
- Handling conflicting or evolving preferences over time
Scalability and Democratization
Efficient Preference Collection: Reducing the cost of preference data:
- Automated preference generation using AI systems
- Crowdsourcing strategies for large-scale preference collection
- Active learning to minimize required human feedback
Open Source Tools: Making preference learning accessible:
- Open implementations of DPO and RLHF
- Standardized datasets and evaluation frameworks
- Educational resources and best practices documentation
Conclusion: The Evolution of AI Alignment
The journey from Supervised Fine-Tuning through RLHF to Direct Preference Optimization represents a remarkable evolution in our ability to align AI systems with human values and preferences. Each paradigm brings unique advantages and addresses specific limitations of previous approaches, collectively advancing the state of AI alignment research and practice.
SFT provides the foundation—a simple, stable method for demonstrating desired behaviors that remains essential for initializing more sophisticated alignment procedures. RLHF introduced the revolutionary idea of optimizing directly for human preferences, enabling nuanced behavior that goes beyond simple imitation. DPO streamlined this process, providing many of RLHF's benefits with greater simplicity and efficiency.
Understanding these paradigms is crucial for several reasons. For researchers, they provide the theoretical foundation for developing next-generation alignment techniques. For practitioners, they offer practical tools for creating AI systems that behave appropriately and helpfully. For organizations, they represent essential capabilities for deploying AI systems responsibly and effectively.
The field continues to evolve rapidly, with emerging approaches addressing current limitations and extending capabilities to new domains. Constitutional AI principles, multi-objective optimization, and continual learning represent just some of the frontiers being explored. As AI systems become more capable and widely deployed, the importance of effective alignment techniques will only continue to grow.
The success of modern conversational AI systems—their ability to be helpful, harmless, and honest—stems directly from advances in these fine-tuning paradigms. As we look toward the future, continued innovation in preference learning, reward modeling, and alignment techniques will be essential for ensuring that increasingly powerful AI systems remain beneficial and aligned with human values.
Whether you're developing new models, fine-tuning existing systems, or simply seeking to understand how modern AI achieves its remarkable alignment with human preferences, mastery of these core paradigms provides the foundation for effective work in one of AI's most important and rapidly advancing areas.
This comprehensive exploration of fine-tuning paradigms provides essential knowledge for understanding how modern Large Language Models achieve their alignment with human preferences. As the field of AI alignment continues to evolve, these foundational concepts will remain central to developing safe, beneficial, and effective AI systems.