Introduction to the Finetuning Scheduler¶

The FinetuningScheduler callback accelerates and enhances foundational model experimentation with flexible finetuning schedules. Training with the FinetuningScheduler callback is simple and confers a host of benefits:

dramatically increases finetuning flexibility
expedites and facilitates exploration of model tuning dynamics
enables marginal performance improvements of finetuned models

Note

If you’re exploring using the FinetuningScheduler, this is a great place to start! You may also find the notebook-based tutorial useful (link provided here as soon as it is published on the pytorch lightning production documentation site) and for those using the LightningCLI, there is a CLI-based example at the bottom of this introduction.

Setup¶

Setup is straightforward: install from PyPI.

pip install finetuning-scheduler

Additional installation options (from source, etc.) are discussed under “Additional installation options” in the README.

Motivation¶

Fundamentally, the FinetuningScheduler callback enables multi-phase, scheduled finetuning of foundational models. Gradual unfreezing (i.e. thawing) can help maximize foundational model knowledge retention while allowing (typically upper layers of) the model to optimally adapt to new tasks during transfer learning [1][2][3].

FinetuningScheduler orchestrates the gradual unfreezing of models via a finetuning schedule that is either implicitly generated (the default) or explicitly provided by the user (more computationally efficient). Finetuning phase transitions are driven by FTSEarlyStopping criteria (a multi-phase extension of EarlyStopping), user-specified epoch transitions or a composition of the two (the default mode). A FinetuningScheduler training session completes when the final phase of the schedule has its stopping criteria met. See Early Stopping for more details on that callback’s configuration.

Basic Usage¶

If no finetuning schedule is user-provided, FinetuningScheduler will generate a default schedule and proceed to finetune according to the generated schedule, using default FTSEarlyStopping and FTSCheckpoint callbacks with monitor=val_loss.

from pytorch_lightning import Trainer
from finetuning_scheduler import FinetuningScheduler

trainer = Trainer(callbacks=[FinetuningScheduler()])

The Default Finetuning Schedule¶

Schedule definition is facilitated via gen_ft_schedule(), which dumps a default finetuning schedule (by default using a naive heuristic of two parameters per level). The user can adjust this schedule as desired and then pass it to the callback. The default/implicitly generated schedule will often be less computationally efficient than a user-defined finetuning schedule, but it can serve as a good baseline for subsequent explicit schedule refinement and may marginally outperform many explicit schedules.
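To make the heuristic concrete, here is a minimal sketch of what a two-parameters-per-level grouping could look like. It is not the library's implementation: default_schedule is a hypothetical helper, and it assumes parameter names arrive in forward (input-to-output) order so that the last layers thaw first.

```python
from typing import Dict, List


def default_schedule(param_names: List[str], per_phase: int = 2) -> Dict[int, Dict[str, List[str]]]:
    """Hypothetical helper: thaw parameters from the output end, `per_phase` at a time."""
    reversed_names = list(reversed(param_names))
    return {
        phase: {"params": reversed_names[start:start + per_phase]}
        for phase, start in enumerate(range(0, len(reversed_names), per_phase))
    }


# Parameter names in forward order for an illustrative four-layer model.
names = [f"layer.{i}.{kind}" for i in range(4) for kind in ("weight", "bias")]
schedule = default_schedule(names)
for phase, spec in schedule.items():
    print(phase, spec["params"])
```

For these names, phase 0 holds the layer.3 parameters and phase 3 holds the layer.0 parameters, which is the shape of the generated schedule shown in the next section.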

Specifying a Finetuning Schedule¶

To specify a finetuning schedule, it’s convenient to first generate the default schedule and then alter the thawed/unfrozen parameter groups associated with each finetuning phase as desired. Finetuning phases are zero-indexed and executed in ascending order.

First, generate the default schedule to Trainer.log_dir. It will be named after your LightningModule subclass with the suffix _ft_schedule.yaml.

from pytorch_lightning import Trainer
from finetuning_scheduler import FinetuningScheduler

trainer = Trainer(callbacks=[FinetuningScheduler(gen_ft_sched_only=True)])

Alter the schedule as desired.

Changing the generated schedule for this boring model…

  0:
      params:
      - layer.3.bias
      - layer.3.weight
  1:
      params:
      - layer.2.bias
      - layer.2.weight
  2:
      params:
      - layer.1.bias
      - layer.1.weight
  3:
      params:
      - layer.0.bias
      - layer.0.weight

… to have three finetuning phases instead of four:

  0:
      params:
      - layer.3.bias
      - layer.3.weight
  1:
      params:
      - layer.2.*
      - layer.1.bias
      - layer.1.weight
  2:
      params:
      - layer.0.*

Once the finetuning schedule has been altered as desired, pass it to FinetuningScheduler to commence scheduled training:

from pytorch_lightning import Trainer
from finetuning_scheduler import FinetuningScheduler

trainer = Trainer(callbacks=[FinetuningScheduler(ft_schedule="/path/to/my/schedule/my_schedule.yaml")])

EarlyStopping and Epoch-Driven Phase Transition Criteria¶

By default, FTSEarlyStopping and epoch-driven transition criteria are composed. If a max_transition_epoch is specified for a given phase, the next finetuning phase will begin at that epoch unless FTSEarlyStopping criteria are met first. If epoch_transitions_only is True, FTSEarlyStopping will not be used and transitions will be exclusively epoch-driven.
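The composition rule described above can be sketched as a small predicate. This is an illustrative reading of the paragraph rather than FinetuningScheduler's internal logic: should_transition and early_stop_met are hypothetical names, and treating "at that epoch" as epoch >= max_transition_epoch is an assumption.

```python
from typing import Optional


def should_transition(
    epoch: int,
    early_stop_met: bool,
    max_transition_epoch: Optional[int] = None,
    epoch_transitions_only: bool = False,
) -> bool:
    """Hypothetical predicate: should the next finetuning phase begin now?"""
    epoch_due = max_transition_epoch is not None and epoch >= max_transition_epoch
    if epoch_transitions_only:
        # Early stopping is ignored; only the epoch threshold counts.
        return epoch_due
    # Default mode: whichever criterion is satisfied first triggers the transition.
    return early_stop_met or epoch_due


print(should_transition(epoch=1, early_stop_met=True, max_transition_epoch=3))
print(should_transition(epoch=1, early_stop_met=True, max_transition_epoch=3,
                        epoch_transitions_only=True))
```

In this sketch a phase without max_transition_epoch advances only when the early stopping criteria are met.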

Tip

Regular expressions can be convenient for specifying more complex schedules. Also, a per-phase maximum learning rate can be specified via the lr key (phase 2 below):

 0:
   params: # the parameters for each phase definition can be fully specified
   - model.classifier.bias
   - model.classifier.weight
   max_transition_epoch: 3
 1:
   params: # or specified via a regex
   - model.albert.pooler.*
 2:
   params:
   - model.albert.encoder.*.ffn_output.*
   max_transition_epoch: 9
   lr: 1e-06 # per-phase maximum learning rates can be specified
 3:
   params: # both approaches to parameter specification can be used in the same phase
   - model.albert.encoder.*.(ffn\.|attention|full*).*
   - model.albert.encoder.embedding_hidden_mapping_in.bias
   - model.albert.encoder.embedding_hidden_mapping_in.weight
   - model.albert.embeddings.*
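Before handing an edited schedule to the callback, a quick structural check can catch simple mistakes. The sketch below assumes the schedule has already been loaded into a Python dict and only checks the properties stated in this document (zero-indexed ascending phases, a params list, optional max_transition_epoch and lr). check_schedule is a hypothetical helper; the library performs its own validation, which may be stricter.

```python
from typing import Any, Dict


def check_schedule(schedule: Dict[int, Dict[str, Any]]) -> None:
    """Hypothetical helper: raise ValueError if the schedule is structurally malformed."""
    phases = sorted(schedule)
    if phases != list(range(len(phases))):
        raise ValueError(f"phases must be 0..{len(phases) - 1} without gaps, got {phases}")
    for phase in phases:
        spec = schedule[phase]
        if not spec.get("params"):
            raise ValueError(f"phase {phase} lists no params")
        if "max_transition_epoch" in spec and int(spec["max_transition_epoch"]) < 0:
            raise ValueError(f"phase {phase} has a negative max_transition_epoch")
        # float() also accepts a numeric string such as "1e-06".
        if "lr" in spec and float(spec["lr"]) <= 0:
            raise ValueError(f"phase {phase} has a non-positive lr")


schedule = {
    0: {"params": ["model.classifier.bias", "model.classifier.weight"],
        "max_transition_epoch": 3},
    1: {"params": ["model.albert.pooler.*"]},
    2: {"params": ["model.albert.encoder.*.ffn_output.*"],
        "max_transition_epoch": 9, "lr": 1e-06},
    3: {"params": [r"model.albert.encoder.*.(ffn\.|attention|full*).*",
                   "model.albert.encoder.embedding_hidden_mapping_in.bias",
                   "model.albert.encoder.embedding_hidden_mapping_in.weight",
                   "model.albert.embeddings.*"]},
}
check_schedule(schedule)  # returns None when no problem is found
```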

For a practical end-to-end example of using FinetuningScheduler in implicit versus explicit modes, see scheduled finetuning for SuperGLUE below or the notebook-based tutorial (link will be added as soon as it is released on the PyTorch Lightning production documentation site).

Resuming Scheduled Finetuning Training Sessions¶

Resumption of scheduled finetuning training is identical to the continuation of other training sessions with the caveat that the provided checkpoint must have been saved by a FinetuningScheduler session. FinetuningScheduler uses FTSCheckpoint (an extension of ModelCheckpoint) to maintain schedule state with special metadata.

from pytorch_lightning import Trainer
from finetuning_scheduler import FinetuningScheduler

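trainer = Trainer(callbacks=[FinetuningScheduler()])
# ckpt_path is an argument of Trainer.fit; `model` is your LightningModule instance
trainer.fit(model, ckpt_path="some/path/to/my_checkpoint.ckpt")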

Training will resume at the depth/level of the provided checkpoint according to the specified schedule. Schedules can be altered between training sessions, but schedule compatibility is left to the user for maximal flexibility. If executing a user-defined schedule, typically the same schedule should be provided for the original and resumed training sessions.

Tip

By default (restore_best is True), FinetuningScheduler will attempt to restore the best available checkpoint before finetuning depth transitions.

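trainer = Trainer(callbacks=[FinetuningScheduler()])
# ckpt_path is an argument of Trainer.fit; `model` is your LightningModule instance
trainer.fit(model, ckpt_path="some/path/to/my_kth_best_checkpoint.ckpt")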

Note that, similar to the behavior of ModelCheckpoint (specifically this PR), when resuming training with an FTSCheckpoint dirpath that differs from that of the provided checkpoint, the new training session’s checkpoint state will be re-initialized at the resumption depth, with the provided checkpoint set as the best checkpoint.

Finetuning all the way down!¶

There are plenty of options for customizing FinetuningScheduler’s behavior. See scheduled finetuning for SuperGLUE below for examples of composing different configurations.

Note

Currently, FinetuningScheduler supports the following Strategy classes:

DataParallelStrategy

Example: Scheduled Finetuning For SuperGLUE¶

A demonstration of the scheduled finetuning callback FinetuningScheduler using the RTE and BoolQ tasks of the SuperGLUE benchmark and the LightningCLI is available under ./fts_examples/.

Since this CLI-based example requires a few additional packages (e.g. transformers, sentencepiece), you should install them using the [examples] extra:

pip install "finetuning-scheduler[examples]"

There are three different demo schedule configurations composed with shared defaults (./config/fts_defaults.yaml) provided for the default ‘rte’ task. Note that DDP (with auto-selected GPUs) is the default configuration, so adjust the configuration files referenced below as needed for other configurations.

# Generate a baseline without scheduled finetuning enabled:
python fts_superglue.py fit --config config/nofts_baseline.yaml

# Train with the default finetuning schedule:
python fts_superglue.py fit --config config/fts_implicit.yaml

# Train with a non-default finetuning schedule:
python fts_superglue.py fit --config config/fts_explicit.yaml

All three training scenarios use identical configurations with the exception of the provided finetuning schedule. See the tensorboard experiment summaries and table below for a characterization of the relative computational and performance tradeoffs associated with these FinetuningScheduler configurations.

FinetuningScheduler expands the space of possible finetuning schedules, and the composition of more sophisticated schedules can yield marginal finetuning performance gains. That said, the primary utility of FinetuningScheduler is to grant greater finetuning flexibility for model exploration in research. For example, a glance at DeBERTa-v3’s implicit training run immediately reveals a critical tuning transition point:

Our val_loss begins a precipitous decline at step 3119, which corresponds to phase 17 in the schedule. Referring to our schedule, in phase 17 we begin tuning the attention parameters of our 10th encoder layer (of 11). Interesting! Though beyond the scope of this documentation, these dynamics might be worth investigating further, and FinetuningScheduler allows one to do just that quite easily.

In addition to the tensorboard experiment summaries, full logs/schedules for all three scenarios are available as well as the checkpoints produced in the scenarios (caution, ~3.5GB).

Example Scenario	nofts_baseline	fts_implicit	fts_explicit
Finetuning Schedule	None	Default	User-defined
RTE Accuracy	0.81	0.84	0.85

Note that though this example is intended to capture a common usage scenario, substantial variation is expected among use cases and models. In summary, FinetuningScheduler provides increased finetuning flexibility that can be useful in a variety of contexts from exploring model tuning behavior to maximizing performance.

[Figure: FinetuningScheduler explicit loss animation]

Note

The FinetuningScheduler callback is currently in beta.

Footnotes¶

1: Howard, J., & Ruder, S. (2018). Fine-tuned Language Models for Text Classification. arXiv preprint arXiv:1801.06146.
2: Chronopoulou, A., Baziotis, C., & Potamianos, A. (2019). An embarrassingly simple approach for transfer learning from pretrained language models. arXiv preprint arXiv:1902.10547.
3: Peters, M. E., Ruder, S., & Smith, N. A. (2019). To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987.
