TL;DR: we can transplant more than 80% of the instruction-following performance gains from small models to large models, without actually finetuning the large models.
The DPO authors provide a nice intuition for EFT, based on a model-as-reward-model perspective.
Recall that an RLHF-finetuned model optimizes the following objective:
$$ \pi_{\mathrm{ft}}=\pi^*\left(r, \pi_{\mathrm{ref}}\right)=\underset{\pi}{\arg \max } \underset{x \sim p(x), y \sim \pi(\cdot \mid x)}{\mathbb{E}}\left[r(x, y)-\beta \mathrm{KL}\left(\pi(\cdot \mid x) \| \pi_{\mathrm{ref}}(\cdot \mid x)\right)\right] $$

The closed-form solution is given by
$$ \pi^*\left(r, \pi_{\mathrm{ref}}\right)(y \mid x)=\frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) $$

The main idea of this paper is as follows: any finetuned model, whether RLHF-finetuned or not, can be regarded as the solution to the KL-constrained RL problem with respect to some reward function.
$$ \pi_{\mathrm{ft}}(y \mid x)=\pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} \underbrace{\beta \log \frac{\pi_{\mathrm{ft}}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}}_{\text{Implicit reward}}\right) $$

From this interpretation, every finetuned model implicitly defines a reward function (the log-probability ratio against its reference model), and this implicit reward can be considered separately from the base model it is applied to.
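As a minimal sketch of this model-as-reward-model view, the implicit reward of a completion is just $\beta$ times its log-probability ratio under the finetuned vs. reference model. The helper below assumes you already have the summed sequence log-probabilities; the function and argument names are mine:

```python
import torch

def implicit_reward(logp_ft: torch.Tensor, logp_ref: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Implicit reward r(x, y) = beta * log( pi_ft(y|x) / pi_ref(y|x) ).

    logp_ft / logp_ref: summed log-probabilities of the same completion y
    under the finetuned model and its reference (base) model.
    """
    return beta * (logp_ft - logp_ref)
```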
This interpretation provides a conceptual tool for scale decoupling, where an $M$-scale reward function is combined with an $N$-scale base model.
$$ \pi_M^N(y \mid x)=\frac{1}{Z_M^N(x)} \pi_{\mathrm{ref}}^N(y \mid x) \exp \left(r_\pi^M(x, y)\right) \propto \pi_{\mathrm{ref}}^N(y \mid x) \frac{\pi^M(y \mid x)}{\pi_{\mathrm{ref}}^M(y \mid x)} $$

Sampling from this distribution emulates finetuning an $N$-scale base model with an $M$-scale reward. When $N > M$ (up-scaling), we emulate the result of finetuning a large model; when $N < M$ (down-scaling), we emulate the result of finetuning a small model.
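In practice this is applied per token: combine the next-token log-probabilities of the three models and renormalize. A minimal sketch, assuming you can get next-token logits from the large base ($\pi_{\mathrm{ref}}^N$), small finetuned ($\pi^M$), and small base ($\pi_{\mathrm{ref}}^M$) models; function and variable names are mine:

```python
import torch

def eft_upscaled_logits(logits_ref_large: torch.Tensor,
                        logits_ft_small: torch.Tensor,
                        logits_ref_small: torch.Tensor) -> torch.Tensor:
    """Next-token log-probs of pi_ref^N * (pi^M / pi_ref^M), up to normalization."""
    logp_ref_large = torch.log_softmax(logits_ref_large, dim=-1)
    logp_ft_small = torch.log_softmax(logits_ft_small, dim=-1)
    logp_ref_small = torch.log_softmax(logits_ref_small, dim=-1)
    # Add the small model's implicit (per-token) reward to the large base model.
    return logp_ref_large + (logp_ft_small - logp_ref_small)

def sample_next_token(combined_logits: torch.Tensor) -> int:
    # The softmax takes care of the normalizer Z; then sample a token id.
    probs = torch.softmax(combined_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```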
In this work, we mainly consider $N > M$ (up-scaling), which is a more practical scenario.
The core question is: what capabilities change when independently scaling pretraining vs finetuning?
Setups:
For independently scaling pretraining and finetuning, four models are compared:
| | small pretraining | large pretraining |
|---|---|---|
| small finetuning | lower bound | EFT up-scaled |
| large finetuning | EFT down-scaled | upper bound |
Normalized improvements in factuality and helpfulness from emulated fine-tuning. These normalized improvements can be seen as performance gap recovered (PGR) according to the weak-to-strong paper: PGR = (EFT scaled - lower bound) / (upper bound - lower bound).
We can observe that EFT down-scaling (top) almost matches the upper bound in helpfulness, whereas EFT up-scaling (bottom) matches the upper bound in factuality.
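For reference, the PGR normalization above is a one-liner (my naming, following the formula in the caption):

```python
def performance_gap_recovered(eft_score: float, lower_bound: float, upper_bound: float) -> float:
    """PGR = (EFT scaled - lower bound) / (upper bound - lower bound)."""
    return (eft_score - lower_bound) / (upper_bound - lower_bound)
```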
In addition to scaling, EFT provides a test-time method for controllable generation. Given the interpretation that any finetuned model is an implicit reward model, we can interpolate the rewards of two models finetuned for different objectives to trace a helpfulness-harmfulness frontier, without retraining.
$$ r_\lambda^M(x, y)=\lambda r_{\text{help}}^M(x, y)+(1-\lambda) r_{\text{safe}}^M(x, y) $$

GPT-4-evaluated helpfulness and harmfulness on Anthropic-HH prompts.
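A sketch of this interpolation at the next-token level, assuming both finetuned models ("help" and "safe") share the same base model $\pi_{\mathrm{ref}}^M$, so each implicit reward is a log-ratio against that shared base (names are mine):

```python
import torch

def interpolated_logits(logits_ref: torch.Tensor,
                        logits_help: torch.Tensor,
                        logits_safe: torch.Tensor,
                        lam: float) -> torch.Tensor:
    """Next-token log-probs of pi_ref * exp(lam * r_help + (1 - lam) * r_safe),
    up to normalization."""
    logp_ref = torch.log_softmax(logits_ref, dim=-1)
    r_help = torch.log_softmax(logits_help, dim=-1) - logp_ref  # implicit reward of the helpful model
    r_safe = torch.log_softmax(logits_safe, dim=-1) - logp_ref  # implicit reward of the safety model
    return logp_ref + lam * r_help + (1.0 - lam) * r_safe
```

Sweeping $\lambda$ from 0 to 1 then traces the helpfulness-harmfulness frontier without any retraining.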
Recall that EFT up-scaling requires three models: 1) large pretrained, 2) small pretrained, and 3) small finetuned. This small-large combination naturally leads to the idea of speculative decoding.
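A rough sketch of how the draft-and-verify loop could look here: the small finetuned model drafts tokens cheaply, and the EFT up-scaled distribution verifies them with the standard speculative-sampling accept/reject rule. The chunking and helper names below are my own simplification, not the paper's exact procedure:

```python
import torch

def verify_draft(draft_tokens: list[int],
                 draft_probs: torch.Tensor,     # [num_draft, vocab], from the small finetuned model
                 target_probs: torch.Tensor):   # [num_draft, vocab], from the EFT up-scaled distribution
    """Standard speculative-sampling verification of a drafted chunk.

    Accept each draft token y with probability min(1, p_target(y) / p_draft(y));
    on the first rejection, resample from the residual max(0, p_target - p_draft).
    """
    accepted = []
    for t, y in enumerate(draft_tokens):
        p, q = target_probs[t, y].item(), draft_probs[t, y].item()
        if torch.rand(1).item() < min(1.0, p / q):
            accepted.append(y)  # the up-scaled policy agrees closely enough with the draft
        else:
            residual = torch.clamp(target_probs[t] - draft_probs[t], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return accepted
```

The speed-up comes from scoring the whole drafted chunk with the large base model in a single forward pass, rather than calling it once per token.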
Identifying tokens where the up-scaled policy has a high total-variation (TV) distance from the small finetuned policy alone.
We observe 2.5x speed-ups for EFT up-scaling:
Note that EFT up-scaling can be rewritten as:
$$ \log \tilde{\pi}\left(y_t \mid x, y_{<t}\right)=\log \pi_{\mathrm{ref}}^N\left(y_t \mid x, y_{<t}\right)+\log \frac{\pi^M\left(y_t \mid x, y_{<t}\right)}{\pi_{\mathrm{ref}}^M\left(y_t \mid x, y_{<t}\right)}-\log Z\left(x, y_{<t}\right) $$

Left: up-scaled Llama-2 on HumanEval; right: up-scaled Llama-2 on ELI5.
Sampling from up-scaled logits can be noisy (i.e., high-variance). To mitigate this, top-$p$ sampling can be applied on top of EFT up-scaling.
Top-$p$ mildly improves EFT up-scaling.
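A common way to implement this is to nucleus-filter the combined up-scaled logits before sampling; a minimal sketch (not necessarily the paper's exact implementation):

```python
import torch

def top_p_filter(combined_logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability reaches top_p,
    zero out the rest, and return renormalized probabilities for sampling."""
    probs = torch.softmax(combined_logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token once the cumulative mass *before* it already exceeds top_p.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum()

# Usage with the earlier sketch:
# next_token = torch.multinomial(top_p_filter(eft_upscaled_logits(...)), 1).item()
```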
While the two papers present almost the same methodology, that paper offers some complementary findings.
Weak-to-strong generalization
We observe >80% PGR in this paper, but in the weak-to-strong paper the PGRs are relatively low for tasks like reward modeling. Does this mean that EFT can solve the weak-to-strong problem?