
Fine-tuning Paradigms: SFT, RLHF, and DPO - Aligning LLMs with Human Preferences

by RTTR 2025. 5. 25.

Introduction: Beyond Raw Language Modeling

While pre-training provides Large Language Models with fundamental language understanding capabilities, the models that users interact with—from ChatGPT to Claude—undergo sophisticated fine-tuning processes that align their behavior with human preferences and values. This alignment is achieved through three primary paradigms: Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO).

Understanding these fine-tuning approaches is crucial for anyone working with modern LLMs, as they fundamentally shape how models respond to user queries, handle sensitive topics, and maintain helpful, harmless, and honest behavior. Each approach has distinct theoretical foundations, practical implementations, and trade-offs that determine their suitability for different applications.

This comprehensive guide explores the mathematical foundations, implementation strategies, and practical considerations of each fine-tuning paradigm, providing insights into how modern AI systems achieve their remarkable alignment with human preferences.

Supervised Fine-Tuning (SFT): The Foundation Layer

Theoretical Framework

Supervised Fine-Tuning represents the most straightforward approach to adapting pre-trained language models for specific behaviors or tasks. SFT continues the language modeling objective but on carefully curated datasets that demonstrate desired model behavior.

Mathematical Foundation

SFT optimizes the same autoregressive language modeling objective as pre-training:

L_SFT = -E_{(x,y)~D_SFT} [∑_{i=1}^{|y|} log P_θ(y_i | x, y_{<i})]

Where:

  • D_SFT is the supervised fine-tuning dataset
  • x represents the input prompt or context
  • y represents the desired model response
  • θ are the model parameters being optimized

The key difference from pre-training lies in the dataset composition rather than the objective function.
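
As a minimal sketch (assuming a PyTorch-style model that returns per-token logits over the concatenated prompt and response), the SFT loss is ordinary cross-entropy restricted to response positions:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy restricted to response tokens.

    logits:     [batch, seq_len, vocab] model outputs for the prompt + response sequence
    input_ids:  [batch, seq_len] token ids of the prompt followed by the response
    prompt_len: [batch] number of prompt tokens per example (excluded from the loss)
    """
    # Predict token t+1 from position t: shift logits and targets by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask prompt positions so the loss covers only the response y.
    positions = torch.arange(shift_labels.size(1), device=shift_labels.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_len.unsqueeze(1) - 1)
    shift_labels[prompt_mask] = -100  # ignore_index for cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```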

Dataset Design Principles

High-Quality Demonstrations: SFT datasets consist of input-output pairs that exemplify desired model behavior across various scenarios.

Task Coverage: Comprehensive coverage of different task types, interaction patterns, and edge cases that the model should handle appropriately.

Behavior Modeling: Examples demonstrate not just correct answers but appropriate tone, style, and reasoning processes.

Safety Integration: Include examples of handling sensitive, controversial, or potentially harmful requests appropriately.

Implementation Strategies

Data Collection Methods

Human Annotation: Expert annotators create high-quality examples of desired model behavior:

  • Detailed guidelines for consistent annotation
  • Multiple annotators per example for quality assurance
  • Regular calibration sessions to maintain standards
  • Specialized expertise for domain-specific content

Model-Assisted Generation: Use existing models to generate candidate responses, then human curators select and refine the best examples:

  • Reduces annotation cost while maintaining quality
  • Enables rapid scaling of dataset creation
  • Requires careful quality control to prevent error propagation
  • Useful for bootstrapping new domains or languages

Constitutional AI Methods: Generate responses according to explicit principles or rules:

  • Define clear behavioral principles or constitutions
  • Generate responses that follow these principles
  • Iterative refinement based on principle adherence
  • Transparent and auditable alignment process

Training Dynamics

Learning Rate Considerations: SFT typically uses lower learning rates than pre-training to preserve pre-trained knowledge while adapting behavior:

  • Start with 10-100x smaller learning rates than pre-training
  • Use warm-up periods to stabilize fine-tuning
  • Monitor for catastrophic forgetting of general capabilities
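
As an illustrative sketch of such a schedule (the specific learning rate, warm-up length, and decay shape below are assumptions, not recommendations), a linear warm-up followed by linear decay can be written with PyTorch's LambdaLR:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the pre-trained language model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # well below typical pre-training rates
warmup_steps, total_steps = 100, 2000

def lr_lambda(step):
    # Linear warm-up, then linear decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```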

Data Efficiency: SFT can achieve significant behavioral changes with relatively small, high-quality datasets:

  • Hundreds to thousands of examples often sufficient
  • Quality more important than quantity
  • Diverse examples more valuable than repetitive ones

Overfitting Prevention: Balance between learning desired behaviors and maintaining generalization:

  • Early stopping based on held-out validation data
  • Regularization techniques to prevent memorization
  • Data augmentation through paraphrasing and variation

Strengths and Limitations

Advantages of SFT

Simplicity: Straightforward implementation using standard language modeling techniques.

Interpretability: Clear connection between training examples and model behavior.

Data Efficiency: Relatively small datasets can produce significant behavioral changes.

Stable Training: Well-understood training dynamics with predictable outcomes.

Foundation for Further Training: Provides good starting point for more advanced alignment techniques.

Limitations and Challenges

Distribution Shift: Models may struggle with inputs significantly different from SFT examples.

Behavior Underspecification: Demonstrations alone struggle to capture every aspect of desired behavior, leaving important nuances unspecified.

Limited Feedback Signal: Demonstrations show only what a good response looks like; they provide no signal about which of several plausible responses is better, or what should be avoided.

Exposure Bias: Models trained only on perfect demonstrations never see their own mistakes during training, so they may struggle to recover from errors at generation time.

Scalability Challenges: Creating comprehensive SFT datasets becomes expensive as requirements grow.

Reinforcement Learning from Human Feedback (RLHF)

Theoretical Foundation

RLHF represents a sophisticated approach that trains models to optimize for human preferences rather than simply imitating human demonstrations. This paradigm treats language generation as a sequential decision-making problem and uses reinforcement learning to maximize a learned reward function.

The RLHF Pipeline

RLHF involves three distinct stages:

  1. Supervised Fine-Tuning: Initial behavioral training on demonstration data
  2. Reward Model Training: Learning to predict human preferences from comparison data
  3. Reinforcement Learning: Optimizing the language model using the learned reward model

Mathematical Framework

The core RLHF objective, which is maximized during training, combines the learned reward with a KL regularization term that penalizes deviation from the initial model:

L_RLHF = E_{x~D,y~π_θ(·|x)} [R(x,y)] - β · KL(π_θ(·|x) || π_ref(·|x))

Where:

  • R(x,y) is the reward model score for response y to prompt x
  • π_θ is the policy (language model) being optimized
  • π_ref is the reference model (typically the SFT model)
  • β controls the strength of the KL penalty
  • KL(·||·) is the Kullback-Leibler divergence
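
In practice, many RLHF implementations fold the KL term into the per-token reward that the RL algorithm sees. A minimal sketch of that shaping step (the per-token log-probabilities and the reward-model score are assumed to be computed elsewhere):

```python
import torch

def kl_shaped_rewards(reward_score, logprobs_policy, logprobs_ref, beta):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), with the
    reward-model score R(x, y) added at the final response token.

    reward_score:    [batch] sequence-level reward R(x, y)
    logprobs_policy: [batch, resp_len] log pi_theta(y_i | x, y_<i)
    logprobs_ref:    [batch, resp_len] log pi_ref(y_i | x, y_<i)
    """
    kl_per_token = logprobs_policy - logprobs_ref      # Monte Carlo estimate of the KL integrand
    rewards = -beta * kl_per_token                     # penalize drift from the reference model
    rewards[:, -1] = rewards[:, -1] + reward_score     # sequence reward arrives at the last token
    return rewards
```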

Reward Model Design and Training

Preference Data Collection

Pairwise Comparisons: Human annotators compare pairs of model responses and indicate which is better:

  • More natural for humans than scoring individual responses
  • Provides relative rather than absolute quality assessments
  • Enables bootstrapping from lower-quality initial models
  • Captures nuanced preferences difficult to specify explicitly

Comparison Interface Design: Effective interfaces for collecting high-quality preference data:

  • Side-by-side response presentation
  • Clear criteria for evaluation (helpfulness, harmlessness, honesty)
  • Optional reasoning fields for annotator explanations
  • Quality control mechanisms to identify inconsistent annotators

Reward Model Architecture

Bradley-Terry Model: The standard approach models the probability that response A is preferred over response B:

P(y_A ≻ y_B | x) = σ(R(x, y_A) - R(x, y_B))

Where σ is the sigmoid function and R is the learned reward function.

Loss Function: The reward model is trained to minimize:

L_RM = -E_{(x,y_A,y_B)~D_pref} [log σ(R(x, y_A) - R(x, y_B))]

Where the preference dataset D_pref contains human preference comparisons.

Reward Model Implementation

Architecture Choices: Reward models typically use the same architecture as the language model but with a scalar output head:

  • Share most parameters with the language model
  • Add a linear layer mapping hidden states to scalar rewards
  • Often use the final token representation for sequence-level scoring
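
A minimal sketch of these two pieces, combining a scalar head with the Bradley-Terry pairwise loss from above (the transformer backbone is a placeholder module standing in for the shared language-model trunk):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # shared language-model trunk (placeholder)
        self.score_head = nn.Linear(hidden_size, 1)   # scalar head on top of hidden states

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)             # [batch, seq_len, hidden_size]
        return self.score_head(hidden[:, -1, :]).squeeze(-1)  # score from the final token

def reward_model_loss(rewards_chosen, rewards_rejected):
    # Bradley-Terry pairwise loss: -log sigma(R(x, y_A) - R(x, y_B))
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```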

Training Considerations: Stable reward model training requires careful attention to:

  • Data quality and annotator agreement
  • Regularization to prevent overfitting to preference data
  • Evaluation on held-out preference sets
  • Calibration to ensure reward scores reflect true quality differences

Proximal Policy Optimization (PPO) for Language Models

PPO Adaptation to Language Generation

Policy Representation: The language model serves as a stochastic policy:

π_θ(y|x) = ∏_{i=1}^{|y|} P_θ(y_i | x, y_{<i})

Value Function: Estimate expected future rewards from each state:

V(x, y_{<i}) = E_{y_{≥i}~π_θ} [R(x, y_{<i} + y_{≥i})]

Advantage Estimation: Measure how much better an action is than expected:

A(x, y_{<i}, y_i) = Q(x, y_{<i}, y_i) - V(x, y_{<i})

PPO Objective for Language Models

The PPO objective balances policy improvement with stability:

L_PPO = E_t [min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]

Where:

  • r_t(θ) = π_θ(y_t | x, y_{<t}) / π_{θ_old}(y_t | x, y_{<t}) is the probability ratio between the current policy and the policy that generated the samples
  • A_t is the advantage estimate
  • ε is the clipping parameter (typically 0.2)
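
A minimal sketch of this clipped surrogate over token-level log-probabilities (the advantage estimates are assumed to come from a separate estimator, such as the one sketched in the next subsection):

```python
import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, mask, eps=0.2):
    """Clipped PPO surrogate over response tokens.

    logprobs_new: [batch, resp_len] log pi_theta(y_t | x, y_<t) under the current policy
    logprobs_old: [batch, resp_len] log-probs under the policy that generated the samples
    advantages:   [batch, resp_len] advantage estimates A_t
    mask:         [batch, resp_len] 1.0 for real response tokens, 0.0 for padding
    """
    ratio = torch.exp(logprobs_new - logprobs_old)                 # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)                      # pessimistic (lower) bound
    # Negate: the PPO objective is maximized, so the training loss is its negative.
    return -(surrogate * mask).sum() / mask.sum()
```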

Implementation Challenges

Credit Assignment: Determining which tokens deserve credit for high rewards:

  • Reward is typically assigned to the final token
  • Value function helps distribute credit across the sequence
  • Baseline subtraction reduces variance in gradient estimates
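
One common way to perform this credit assignment is Generalized Advantage Estimation; the sketch below assumes per-token rewards (for example, the KL-shaped rewards from earlier) and per-token value estimates are already available:

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over responses of length T.

    rewards: [batch, T] per-token rewards
    values:  [batch, T] value estimates V(x, y_<t)
    """
    batch, T = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(batch, device=rewards.device)
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros(batch, device=rewards.device)
        delta = rewards[:, t] + gamma * next_value - values[:, t]   # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae
    return advantages
```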

Exploration vs. Exploitation: Balancing between trying new responses and exploiting known good responses:

  • KL penalty encourages staying close to reference policy
  • Temperature sampling provides controlled exploration
  • Entropy bonuses can encourage response diversity

Sample Efficiency: RL training requires many samples and can be computationally expensive:

  • Each training step requires generating complete responses
  • Multiple PPO epochs per batch of generated data
  • Large batch sizes needed for stable gradient estimates

Benefits and Challenges of RLHF

Advantages

Preference Optimization: Directly optimizes for human preferences rather than imitating demonstrations.

Flexibility: Can capture complex, context-dependent preferences difficult to specify in demonstrations.

Iterative Improvement: Reward models can be updated as new preference data becomes available.

Nuanced Behavior: Enables fine-grained control over model behavior through reward design.

Scalability: Preference collection can be more scalable than demonstration creation.

Challenges and Limitations

Reward Hacking: Models may exploit weaknesses in the reward model to achieve high scores without truly satisfying human preferences:

  • Gaming specific reward model biases
  • Optimizing for easily measurable aspects while ignoring others
  • Producing responses that seem good to the reward model but aren't actually helpful

Training Instability: RL training can be unstable and sensitive to hyperparameters:

  • Policy updates that are too large can cause performance collapse
  • Reward model inaccuracies can mislead training
  • Balancing exploration and exploitation requires careful tuning

Computational Cost: RLHF requires significantly more computation than SFT:

  • Training reward models on preference data
  • Running PPO with multiple model evaluations per update
  • Generating many samples for each training step

Alignment Tax: The process of alignment may reduce performance on some capabilities:

  • KL penalty prevents too much deviation from reference model
  • Safety constraints may limit model expressiveness
  • Optimizing for human preferences may not align with all downstream tasks

Direct Preference Optimization (DPO)

Theoretical Innovation

Direct Preference Optimization represents a breakthrough in alignment methodology by directly optimizing language models on preference data without requiring an explicit reward model or reinforcement learning. DPO reformulates the RLHF objective as a classification problem over preference pairs.

Mathematical Derivation

DPO starts with the RLHF objective and derives a closed-form solution. Under the Bradley-Terry preference model and the KL-constrained RL objective, the optimal policy has the form:

π*(y|x) = π_ref(y|x) exp(R*(x,y)/β) / Z(x)

Where Z(x) is a partition function and R* is the optimal reward function.

Rearranging this relationship, DPO derives that:

R*(x,y) = β log(π*(y|x)/π_ref(y|x)) + β log Z(x)

Since the partition function Z(x) doesn't depend on the specific response y, it cancels out when computing preference probabilities.
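
Concretely, substituting this expression into the Bradley-Terry model for a preferred response y_w and a less preferred response y_l gives:

P(y_w ≻ y_l | x) = σ(R*(x,y_w) - R*(x,y_l)) = σ(β log(π*(y_w|x)/π_ref(y_w|x)) - β log(π*(y_l|x)/π_ref(y_l|x)))

because the β log Z(x) terms appear in both rewards and cancel. Replacing the unknown optimal policy π* with the trainable policy π_θ and minimizing the negative log-likelihood of the observed preferences yields the DPO loss below.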

DPO Objective Function

The DPO loss directly optimizes the language model to satisfy preference constraints:

L_DPO = -E_{(x,y_w,y_l)~D} [log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))]

Where:

  • y_w is the preferred (winning) response
  • y_l is the less preferred (losing) response
  • π_θ is the policy being optimized
  • π_ref is the reference policy (typically the SFT model)
  • β is the temperature parameter controlling the strength of the KL constraint
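
A minimal sketch of this loss (the log-probabilities of each response are assumed to be already summed over response tokens, under both the trainable policy and the frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed response log-probabilities.

    policy_logp_w, policy_logp_l: [batch] log pi_theta(y_w | x) and log pi_theta(y_l | x)
    ref_logp_w,    ref_logp_l:    [batch] the same quantities under the frozen reference model
    """
    chosen_logratio = policy_logp_w - ref_logp_w       # log(pi_theta(y_w|x) / pi_ref(y_w|x))
    rejected_logratio = policy_logp_l - ref_logp_l     # log(pi_theta(y_l|x) / pi_ref(y_l|x))
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```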

Implementation Advantages

Simplified Training Pipeline

Single-Stage Training: DPO eliminates the need for separate reward model training and RL optimization:

  • Direct optimization on preference data
  • No intermediate reward model to train or maintain
  • Reduced computational requirements compared to RLHF

Stable Training Dynamics: DPO typically exhibits more stable training than PPO-based RLHF:

  • No exploration-exploitation dilemmas
  • No need to balance multiple loss components
  • More predictable convergence behavior

Memory Efficiency: DPO requires less memory than RLHF during training:

  • No need to store and update value functions
  • No requirement for multiple model copies during PPO updates
  • Simpler gradient computation and backpropagation

Theoretical Guarantees

Principled Objective: DPO's derivation from first principles provides theoretical grounding for its effectiveness:

  • Directly optimizes the same preferences that RLHF aims to satisfy
  • Eliminates potential misalignment between reward model and true preferences
  • Provides clearer theoretical understanding of what's being optimized

Preference Satisfaction: Under ideal conditions, DPO converges to the same solution as RLHF:

  • Same optimal policy given infinite data and perfect optimization
  • More direct path to preference satisfaction
  • Reduced risk of reward hacking and gaming

Practical Implementation

Training Procedure

Data Requirements: DPO requires the same preference data as RLHF reward model training:

  • Pairs of responses with human preference annotations
  • High-quality reference model (typically SFT-trained)
  • Diverse coverage of different prompt types and scenarios

Hyperparameter Selection: Key parameters for DPO training:

  • β (Beta): Controls the strength of the KL constraint (typically 0.1-0.5)
  • Learning Rate: Often smaller than standard fine-tuning (1e-6 to 1e-5)
  • Batch Size: Larger batches generally improve stability
  • Training Steps: Fewer steps typically needed compared to RLHF
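
For illustration only, these settings might be collected in a small configuration object; the specific values below are assumptions drawn from the ranges above, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class DPOConfig:
    beta: float = 0.1            # strength of the implicit KL constraint
    learning_rate: float = 5e-6  # within the 1e-6 to 1e-5 range noted above
    batch_size: int = 64
    max_steps: int = 1000
    warmup_steps: int = 50

config = DPOConfig()
```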

Comparison with RLHF

Performance: Empirical studies show DPO often matches or exceeds RLHF performance:

  • Similar final model quality on preference benchmarks
  • Sometimes better generalization to out-of-distribution prompts
  • More consistent results across different training runs

Efficiency: DPO provides significant computational savings:

  • Typically around 2-3x faster training than the full RLHF pipeline, since no reward model training or PPO rollouts are needed
  • Reduced memory requirements during training
  • Simpler implementation and debugging

Robustness: DPO often shows better robustness properties:

  • Less sensitive to hyperparameter choices
  • More stable across different model sizes
  • Better handling of low-quality preference data

Limitations and Considerations

Theoretical Limitations

Assumption Sensitivity: DPO's theoretical guarantees depend on several assumptions:

  • Bradley-Terry preference model accurately captures human preferences
  • Preference data is high-quality and consistent
  • Reference model provides good initialization

Limited Expressivity: DPO optimizes a specific form of preference satisfaction:

  • May not capture all aspects of human preference complexity
  • Assumes preferences can be captured through pairwise comparisons
  • May struggle with context-dependent or conditional preferences

Practical Challenges

Data Quality Sensitivity: DPO performance heavily depends on preference data quality:

  • Inconsistent annotations can mislead training
  • Biased preference data leads to biased models
  • Limited diversity in preference data affects generalization

Reference Model Dependence: DPO requires a high-quality reference model:

  • Poor SFT models can limit DPO effectiveness
  • Reference model capabilities constrain final model abilities
  • Choice of reference model affects optimization dynamics

Advanced Topics in Preference Learning

Preference Pair Sampling Strategies

Sampling from Model Outputs

Temperature Sampling: Generate diverse responses using temperature-controlled sampling:

  • Higher temperatures produce more diverse but potentially lower-quality responses
  • Lower temperatures produce more conservative but potentially repetitive responses
  • Optimal temperature depends on the specific model and task

Top-k and Top-p Sampling: Control response diversity through vocabulary filtering:

  • Top-k limits the sampling choices to the k most likely tokens
  • Top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p
  • Combination of both methods often works best
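
A minimal sketch of temperature scaling combined with nucleus (top-p) filtering applied to a single logits vector:

```python
import torch

def sample_token(logits, temperature=0.8, top_p=0.9):
    """Sample one token id from a [vocab]-sized logits vector using temperature and nucleus filtering."""
    logits = logits / temperature                       # <1 sharpens, >1 flattens the distribution
    probs = torch.softmax(logits, dim=-1)

    # Nucleus (top-p): keep the smallest set of tokens whose cumulative probability reaches top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = (cumulative - sorted_probs) < top_p          # always keeps at least the top token
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()                # renormalize over the kept tokens

    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_idx[choice]
```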

Contrastive Sampling: Deliberately generate pairs with different characteristics:

  • Sample responses with different risk levels
  • Generate responses with varying levels of detail
  • Create pairs that highlight specific preference dimensions

Active Learning for Preferences

Uncertainty Sampling: Focus annotation effort on examples where current models are most uncertain:

  • Identify prompts where model confidence is low
  • Prioritize examples with high disagreement between models
  • Sample from regions of input space with sparse preference data

Disagreement Sampling: Target cases where different models or annotators disagree:

  • Identify systematic differences in model behavior
  • Focus on edge cases and boundary conditions
  • Improve model robustness through targeted data collection

Constitutional AI and Principle-Based Training

Constitutional AI Framework

Principle Definition: Explicit specification of behavioral principles:

  • Define clear, actionable principles for model behavior
  • Create hierarchies of principles for conflict resolution
  • Ensure principles are interpretable and auditable

Self-Critique Process: Models evaluate and improve their own responses:

  • Generate initial response to prompt
  • Critique response against constitutional principles
  • Revise response based on critique
  • Iterate until principles are satisfied
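
A schematic sketch of this loop; generate, critique, and revise are hypothetical placeholders for model calls, and the fixed number of rounds is an assumption:

```python
def constitutional_revision(prompt, principles, generate, critique, revise, max_rounds=3):
    """Iteratively critique and revise a response against a list of principles.

    generate(prompt) -> str
    critique(response, principle) -> str or None   (None means no violation found)
    revise(response, critique_text) -> str
    """
    response = generate(prompt)
    for _ in range(max_rounds):
        critiques = [c for p in principles if (c := critique(response, p)) is not None]
        if not critiques:
            break                                   # all principles satisfied
        for critique_text in critiques:
            response = revise(response, critique_text)
    return response
```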

Scalable Oversight: Reduce human annotation requirements through principled self-improvement:

  • Use principles to generate training signal automatically
  • Human oversight focuses on principle definition and validation
  • Scale to many principles and scenarios with limited human effort

Implementation Strategies

Critique Model Training: Train specialized models to evaluate responses against principles:

  • Fine-tune models to identify principle violations
  • Generate explanations for why responses violate principles
  • Provide specific suggestions for improvement

Iterative Refinement: Continuously improve responses through multiple critique-revision cycles:

  • Apply critique models to identify issues
  • Generate improved responses addressing identified problems
  • Repeat process until satisfactory quality achieved

Handling Reward Hacking and Gaming

Types of Reward Hacking

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure":

  • Models optimize for reward model predictions rather than true human preferences
  • Exploit specific biases or limitations in reward model training
  • Achieve high scores through gaming rather than genuine improvement

Specification Gaming: Finding unexpected ways to achieve high rewards:

  • Exploit ambiguities in reward model specification
  • Take advantage of evaluation methodology weaknesses
  • Optimize for easily measurable aspects while ignoring harder-to-measure qualities

Mitigation Strategies

Robust Reward Models: Design reward models that are harder to game:

  • Train on diverse preference data covering edge cases
  • Use multiple evaluation criteria and combine them
  • Regularly update reward models based on discovered gaming strategies

Adversarial Training: Deliberately search for and address gaming strategies:

  • Red team models to find failure modes
  • Generate adversarial examples that exploit model weaknesses
  • Iteratively improve models based on discovered vulnerabilities

Multi-Objective Optimization: Balance multiple objectives to prevent gaming:

  • Optimize for multiple aspects of quality simultaneously
  • Use uncertainty estimates to identify potential gaming
  • Incorporate robustness metrics alongside performance measures

Evaluation and Assessment

Preference Evaluation Methodologies

Human Evaluation: Gold standard for assessing preference learning success:

  • Side-by-side comparisons between model variants
  • Absolute rating scales for individual responses
  • Task-specific evaluation criteria

Automated Metrics: Scalable evaluation using computational methods:

  • Reward model scores as proxies for human preferences
  • Consistency checks across similar prompts
  • Diversity and safety metrics

Benchmark Suites: Standardized evaluation across different scenarios:

  • Helpfulness benchmarks for task performance
  • Harmlessness evaluations for safety assessment
  • Honesty metrics for truthfulness and accuracy

Long-term Behavior Analysis

Distribution Shift Robustness: Evaluate performance on out-of-distribution inputs:

  • Test on prompts significantly different from training data
  • Assess performance across different domains and contexts
  • Monitor for degradation in edge cases

Preference Stability: Ensure learned preferences remain consistent:

  • Test preference consistency across similar scenarios
  • Monitor for preference drift during continued training
  • Validate preference generalization to new contexts

Future Directions and Emerging Approaches

Beyond Pairwise Preferences

Multi-way Comparisons: Extending beyond binary preferences:

  • Ranking multiple responses simultaneously
  • Capturing more nuanced preference relationships
  • Improving data efficiency through richer comparison information

Conditional Preferences: Context-dependent preference learning:

  • User-specific preference adaptation
  • Task-specific preference optimization
  • Dynamic preference adjustment based on context

Integration with Other Learning Paradigms

Meta-Learning for Preferences: Learning to learn preferences quickly:

  • Few-shot adaptation to new preference criteria
  • Transfer learning across related preference domains
  • Personalization with minimal user feedback

Continual Preference Learning: Updating preferences without forgetting:

  • Incorporating new preference data without catastrophic forgetting
  • Balancing stability and plasticity in preference models
  • Handling conflicting or evolving preferences over time

Scalability and Democratization

Efficient Preference Collection: Reducing the cost of preference data:

  • Automated preference generation using AI systems
  • Crowdsourcing strategies for large-scale preference collection
  • Active learning to minimize required human feedback

Open Source Tools: Making preference learning accessible:

  • Open implementations of DPO and RLHF
  • Standardized datasets and evaluation frameworks
  • Educational resources and best practices documentation

Conclusion: The Evolution of AI Alignment

The journey from Supervised Fine-Tuning through RLHF to Direct Preference Optimization represents a remarkable evolution in our ability to align AI systems with human values and preferences. Each paradigm brings unique advantages and addresses specific limitations of previous approaches, collectively advancing the state of AI alignment research and practice.

SFT provides the foundation—a simple, stable method for demonstrating desired behaviors that remains essential for initializing more sophisticated alignment procedures. RLHF introduced the revolutionary idea of optimizing directly for human preferences, enabling nuanced behavior that goes beyond simple imitation. DPO streamlined this process, providing many of RLHF's benefits with greater simplicity and efficiency.

Understanding these paradigms is crucial for several reasons. For researchers, they provide the theoretical foundation for developing next-generation alignment techniques. For practitioners, they offer practical tools for creating AI systems that behave appropriately and helpfully. For organizations, they represent essential capabilities for deploying AI systems responsibly and effectively.

The field continues to evolve rapidly, with emerging approaches addressing current limitations and extending capabilities to new domains. Constitutional AI principles, multi-objective optimization, and continual learning represent just some of the frontiers being explored. As AI systems become more capable and widely deployed, the importance of effective alignment techniques will only continue to grow.

The success of modern conversational AI systems—their ability to be helpful, harmless, and honest—stems directly from advances in these fine-tuning paradigms. As we look toward the future, continued innovation in preference learning, reward modeling, and alignment techniques will be essential for ensuring that increasingly powerful AI systems remain beneficial and aligned with human values.

Whether you're developing new models, fine-tuning existing systems, or simply seeking to understand how modern AI achieves its remarkable alignment with human preferences, mastery of these core paradigms provides the foundation for effective work in one of AI's most important and rapidly advancing areas.


This comprehensive exploration of fine-tuning paradigms provides essential knowledge for understanding how modern Large Language Models achieve their alignment with human preferences. As the field of AI alignment continues to evolve, these foundational concepts will remain central to developing safe, beneficial, and effective AI systems.
