Introduction to the Fine-Tuning Scheduler

The FinetuningScheduler callback accelerates and enhances foundation model experimentation with flexible fine-tuning schedules. Training with the FinetuningScheduler (FTS) callback is simple and confers a host of benefits:

  • dramatically increases fine-tuning flexibility

  • expedites and facilitates exploration of model tuning dynamics

  • enables marginal performance improvements of fine-tuned models

Note

If you’re exploring using the FinetuningScheduler, this is a great place to start! You may also find the notebook-based tutorial useful, and for those using the LightningCLI, there is a CLI-based example at the bottom of this introduction.

Setup

Setup is straightforward, just install from PyPI!

pip install finetuning-scheduler

Additional installation options (from source, etc.) are discussed under “Additional installation options” in the README.

Motivation

Fundamentally, the FinetuningScheduler callback enables multi-phase, scheduled fine-tuning of foundation models. Gradual unfreezing (i.e. thawing) can help maximize foundation model knowledge retention while allowing (typically upper layers of) the model to optimally adapt to new tasks during transfer learning 1 2 3 .

FinetuningScheduler orchestrates the gradual unfreezing of models via a fine-tuning schedule that is either implicitly generated (the default) or explicitly provided by the user (more computationally efficient). Fine-tuning phase transitions are driven by FTSEarlyStopping criteria (a multi-phase extension of EarlyStopping), user-specified epoch transitions, or a composition of the two (the default mode). A FinetuningScheduler training session completes when the final phase of the schedule has its stopping criteria met. See Early Stopping for more details on that callback’s configuration.

Basic Usage

If no fine-tuning schedule is user-provided, FinetuningScheduler will generate a default schedule and proceed to fine-tune according to the generated schedule, using default FTSEarlyStopping and FTSCheckpoint callbacks with monitor=val_loss.

import lightning as L
from finetuning_scheduler import FinetuningScheduler

trainer = L.Trainer(callbacks=[FinetuningScheduler()])

The Default Fine-Tuning Schedule

Schedule definition is facilitated via gen_ft_schedule(), which dumps a default fine-tuning schedule (by default using a naive, 2-parameters-per-level heuristic) that can be adjusted as desired by the user and/or subsequently passed to the callback. Using the default/implicitly generated schedule will often be less computationally efficient than a user-defined fine-tuning schedule, but it can often serve as a good baseline for subsequent explicit schedule refinement and will marginally outperform many explicit schedules.
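
To make the heuristic concrete, here is a rough sketch of how a naive 2-parameters-per-level schedule could be derived from a model’s named parameters. This is not FTS’s actual implementation; the helper name and the toy model are purely illustrative.

import yaml
import torch

def naive_default_schedule(model: torch.nn.Module, params_per_phase: int = 2) -> dict:
    # thaw from the output layer backwards, two parameters per phase
    names = [n for n, _ in model.named_parameters()][::-1]
    return {
        phase: {"params": names[i : i + params_per_phase]}
        for phase, i in enumerate(range(0, len(names), params_per_phase))
    }

toy_model = torch.nn.Sequential(*(torch.nn.Linear(4, 4) for _ in range(4)))
print(yaml.safe_dump(naive_default_schedule(toy_model), sort_keys=False))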

Specifying a Fine-Tuning Schedule

To specify a fine-tuning schedule, it’s convenient to first generate the default schedule and then alter the thawed/unfrozen parameter groups associated with each fine-tuning phase as desired. Fine-tuning phases are zero-indexed and executed in ascending order. In addition to being zero-indexed, fine-tuning phase keys should be contiguous and either integers or convertible to integers via int().
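
For example, a minimal sketch (not part of FTS, and assuming a hypothetical schedule path) that checks a hand-edited schedule against the key constraints described above before passing it to the callback:

import yaml

# a minimal sketch: verify phase keys are zero-indexed, contiguous, and int-convertible
with open("my_schedule.yaml") as f:  # hypothetical schedule path
    schedule = yaml.safe_load(f)

phases = sorted(int(k) for k in schedule)
assert phases == list(range(len(phases))), "phase keys must be contiguous integers starting at 0"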

  1. First, generate the default schedule to Trainer.log_dir. It will be named after your LightningModule subclass with the suffix _ft_schedule.yaml.

import lightning as L
from finetuning_scheduler import FinetuningScheduler

trainer = L.Trainer(callbacks=[FinetuningScheduler(gen_ft_sched_only=True)])
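
Continuing the snippet above, schedule generation is triggered by calling fit with any LightningModule (here Lightning’s demo BoringModel is used for illustration); with gen_ft_sched_only=True, FTS is expected to write a <ModuleName>_ft_schedule.yaml to Trainer.log_dir and stop before any fine-tuning begins.

from lightning.pytorch.demos.boring_classes import BoringModel

# expected to write ``BoringModel_ft_schedule.yaml`` to ``trainer.log_dir`` and exit
# without fine-tuning (schedule-generation-only mode)
trainer.fit(BoringModel())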

  2. Alter the schedule as desired.

Changing the generated schedule for this boring model…

0:
    params:
    - layer.3.bias
    - layer.3.weight
1:
    params:
    - layer.2.bias
    - layer.2.weight
2:
    params:
    - layer.1.bias
    - layer.1.weight
3:
    params:
    - layer.0.bias
    - layer.0.weight

… to have three fine-tuning phases instead of four:

0:
    params:
    - layer.3.bias
    - layer.3.weight
1:
    params:
    - layer.2.*
    - layer.1.bias
    - layer.1.weight
2:
    params:
    - layer.0.*

  3. Once the fine-tuning schedule has been altered as desired, pass it to FinetuningScheduler to commence scheduled training:

import lightning as L
from finetuning_scheduler import FinetuningScheduler

trainer = L.Trainer(callbacks=[FinetuningScheduler(ft_schedule="/path/to/my/schedule/my_schedule.yaml")])

Note

For each fine-tuning phase, FinetuningScheduler will unfreeze/freeze parameters as directed in the explicitly specified or implicitly generated schedule. Prior to beginning the first phase of training (phase 0), FinetuningScheduler will inspect the optimizer to determine if the user has manually initialized the optimizer with parameters that are non-trainable or otherwise altered the parameter trainability states from that expected of the configured phase 0. By default (starting with FinetuningScheduler 2.0), FTS ensures the optimizer configured in configure_optimizers will optimize the parameters (and only those parameters) scheduled to be optimized in phase 0 of the current fine-tuning schedule. This auto-configuration can be disabled if desired by setting enforce_phase0_params to False.
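
For example, a minimal sketch of opting out of this phase 0 auto-configuration:

import lightning as L
from finetuning_scheduler import FinetuningScheduler

# a minimal sketch: disable FTS's phase 0 optimizer parameter auto-configuration
trainer = L.Trainer(callbacks=[FinetuningScheduler(enforce_phase0_params=False)])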

EarlyStopping and Epoch-Driven Phase Transition Criteria

By default, FTSEarlyStopping and epoch-driven transition criteria are composed. If a max_transition_epoch is specified for a given phase, the next fine-tuning phase will begin at that epoch unless FTSEarlyStopping criteria are met first. If epoch_transitions_only is True, FTSEarlyStopping will not be used and transitions will be exclusively epoch-driven.
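
For instance, a minimal sketch (assuming a hypothetical schedule path in which every phase defines max_transition_epoch) of configuring purely epoch-driven transitions:

import lightning as L
from finetuning_scheduler import FinetuningScheduler

# a minimal sketch: disable FTSEarlyStopping-driven transitions and rely solely on the
# per-phase ``max_transition_epoch`` values defined in a (hypothetical) explicit schedule
fts = FinetuningScheduler(ft_schedule="my_schedule.yaml", epoch_transitions_only=True)
trainer = L.Trainer(callbacks=[fts])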

Tip

Use of regular expressions can be convenient for specifying more complex schedules. Also, a per-phase base_max_lr can be specified:

0:
  params: # the parameters for each phase definition can be fully specified
  - model.classifier.bias
  - model.classifier.weight
  max_transition_epoch: 3
1:
  params: # or specified via a regex
  - model.albert.pooler.*
2:
  params:
  - model.albert.encoder.*.ffn_output.*
  max_transition_epoch: 9
  lr: 1e-06 # per-phase maximum learning rates can be specified
3:
  params: # both approaches to parameter specification can be used in the same phase
  - model.albert.encoder.*.(ffn\.|attention|full*).*
  - model.albert.encoder.embedding_hidden_mapping_in.bias
  - model.albert.encoder.embedding_hidden_mapping_in.weight
  - model.albert.embeddings.*

For a practical end-to-end example of using FinetuningScheduler in implicit versus explicit modes, see scheduled fine-tuning for SuperGLUE below or the notebook-based tutorial.

Resuming Scheduled Fine-Tuning Training Sessions

Resumption of scheduled fine-tuning training is identical to the continuation of other training sessions with the caveat that the provided checkpoint must have been saved by a FinetuningScheduler session. FinetuningScheduler uses FTSCheckpoint (an extension of ModelCheckpoint) to maintain schedule state with special metadata.

import lightning as L
from finetuning_scheduler import FinetuningScheduler

trainer = L.Trainer(callbacks=[FinetuningScheduler()])
trainer.ckpt_path = "some/path/to/my_checkpoint.ckpt"
trainer.fit(...)

Training will resume at the depth/level of the provided checkpoint according to the specified schedule. Schedules can be altered between training sessions, but schedule compatibility is left to the user for maximal flexibility. If executing a user-defined schedule, typically the same schedule should be provided for the original and resumed training sessions.

Tip

By default (restore_best is True), FinetuningScheduler will attempt to restore the best available checkpoint before fine-tuning depth transitions.

trainer = Trainer(callbacks=[FinetuningScheduler()])
trainer.ckpt_path = "some/path/to/my_kth_best_checkpoint.ckpt"
trainer.fit(...)

Note that similar to the behavior of ModelCheckpoint, when resuming training with a different FTSCheckpoint dirpath from the provided checkpoint, the new training session’s checkpoint state will be re-initialized at the resumption depth with the provided checkpoint being set as the best checkpoint.

Fine-Tuning All The Way Down!

There are plenty of options for customizing FinetuningScheduler’s behavior, see scheduled fine-tuning for SuperGLUE below for examples of composing different configurations.

Note

Currently, FinetuningScheduler supports the following distributed strategies:

  • DDPStrategy: ddp, ddp_find_unused_parameters_false, ddp_find_unused_parameters_true, ddp_spawn, ddp_fork, ddp_notebook

Note

FinetuningScheduler supports reinitializing all PyTorch optimizers (or subclasses thereof) provided in torch.optim in the context of all supported training strategies (including FSDP). Use of ZeroRedundancyOptimizer is also supported, but currently only outside the context of optimizer reinitialization.

Tip

Custom or officially unsupported strategies and lr schedulers can be used by setting allow_untested to True (see the sketch at the end of this tip).

Some officially unsupported strategies may work unaltered and are only unsupported due to the Fine-Tuning Scheduler project’s lack of CI/testing resources for that strategy (e.g. single_tpu). Most unsupported strategies and schedulers, however, are currently unsupported because they require varying degrees of modification to be compatible.

For instance, with respect to strategies, deepspeed will require a StrategyAdapter similar to the one written for FSDP (FSDPStrategyAdapter) to be written before support can be added, while tpu_spawn would require an override of the current broadcast method to include python objects.

Regarding lr schedulers, ChainedScheduler and SequentialLR are examples of schedulers not currently supported due to the configuration complexity and semantic conflicts supporting them would introduce. If a supported torch lr scheduler does not meet your requirements, one can always subclass a supported lr scheduler and modify it as required (e.g. LambdaLR is especially useful for this). PRs are also always welcome!
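
As referenced above, a minimal sketch of opting into an unsupported strategy or lr scheduler via allow_untested (assuming you have assessed the caveats in this tip):

import lightning as L
from finetuning_scheduler import FinetuningScheduler

# a minimal sketch: permit a custom or officially unsupported strategy/lr scheduler at your own risk
trainer = L.Trainer(callbacks=[FinetuningScheduler(allow_untested=True)])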


Example: Scheduled Fine-Tuning For SuperGLUE

A demonstration of the scheduled fine-tuning callback FinetuningScheduler using the RTE and BoolQ tasks of the SuperGLUE benchmark and the LightningCLI is available under ./fts_examples/stable.

Since this CLI-based example requires a few additional packages (e.g. transformers, sentencepiece), you should install them using the [examples] extra:

pip install finetuning-scheduler['examples']

There are three different demo schedule configurations composed with shared defaults (./config/fts_defaults.yaml) provided for the default ‘rte’ task. Note that DDP (with auto-selected GPUs) is the default configuration, so ensure you adjust the configuration files referenced below as desired for other configurations.

Note there will likely be minor variations in training paths and performance as packages (e.g. transformers, datasets, finetuning-scheduler itself, etc.) evolve. The precise package versions and salient environmental configuration used in the building of this tutorial are available in the tensorboard summaries, logs and checkpoints referenced below if you’re interested.

# Generate a baseline without scheduled fine-tuning enabled:
python fts_superglue.py fit --config config/nofts_baseline.yaml

# Train with the default fine-tuning schedule:
python fts_superglue.py fit --config config/fts_implicit.yaml

# Train with a non-default fine-tuning schedule:
python fts_superglue.py fit --config config/fts_explicit.yaml

All three training scenarios use identical configurations with the exception of the provided fine-tuning schedule. See the tensorboard experiment summaries and table below for a characterization of the relative computational and performance tradeoffs associated with these FinetuningScheduler configurations.

FinetuningScheduler expands the space of possible fine-tuning schedules and the composition of more sophisticated schedules can yield marginal fine-tuning performance gains. That stated, it should be emphasized the primary utility of FinetuningScheduler is to grant greater fine-tuning flexibility for model exploration in research. For example, glancing at DeBERTa-v3’s implicit training run, a critical tuning transition point is immediately apparent:

Our val_loss begins a precipitous decline at step 3119, which corresponds to phase 17 in the schedule. Referring to our schedule, in phase 17 we begin tuning the attention parameters of our 10th encoder layer (of 11). Interesting! Though beyond the scope of this documentation, it might be worth investigating these dynamics further and FinetuningScheduler allows one to do just that quite easily.

In addition to the tensorboard experiment summaries, full logs/schedules for all three scenarios are available as well as the checkpoints produced in the scenarios (caution, ~3.5GB).

Example Scenario          nofts_baseline      fts_implicit      fts_explicit

Fine-Tuning Schedule      None                Default           User-defined

RTE Accuracy              0.81                0.84              0.85

Note that though this example is intended to capture a common usage scenario, substantial variation is expected among use cases and models. In summary, FinetuningScheduler provides increased fine-tuning flexibility that can be useful in a variety of contexts from exploring model tuning behavior to maximizing performance.

FinetuningScheduler Explicit Loss Animation

Footnotes

1

Howard, J., & Ruder, S. (2018). Fine-tuned Language Models for Text Classification. ArXiv, abs/1801.06146.

2

Chronopoulou, A., Baziotis, C., & Potamianos, A. (2019). An embarrassingly simple approach for transfer learning from pretrained language models. arXiv preprint arXiv:1902.10547.

3

Peters, M. E., Ruder, S., & Smith, N. A. (2019). To tune or not to tune? adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987.

Fine-Tuning Scheduler API

  • fts: Fine-Tuning Scheduler

  • fts_supporters: Fine-Tuning Scheduler Supporters

  • strategy_adapters: Fine-Tuning Scheduler Strategy Adapters

LR Scheduler Reinitialization

Overview

In some contexts it can be useful to re-wrap your optimizer with new lr scheduler configurations at the beginning of one or more scheduled training phases. Among others, example use cases include:

  • implementing complex lr schedules along with multi-phase early-stopping

  • injecting new parameter group specific rates on a scheduled basis

  • programmatically exploring training behavioral dynamics with heterogeneous schedulers and early-stopping

LR scheduler reinitialization is supported:

  • In both explicit and implicit fine-tuning schedule modes (see the Fine-Tuning Scheduler intro for more on basic usage modes)

  • With or without concurrent optimizer reinitialization (FTS >= 2.0.2)

  • In the context of all supported training strategies (including FSDP).

  • With FTS >= 0.1.4

As lr scheduler reinitialization is likely to be applied most frequently in the context of explicitly defined fine-tuning schedules, we’ll cover configuration in that mode first. Please see the optimizer reinitialization feature introduction for a review of concurrent optimizer and lr scheduler reinitialization.

Specifying LR Scheduler Configurations For Specific Fine-Tuning Phases

When defining a fine-tuning schedule (see the intro for basic schedule specification), a new lr scheduler configuration can be applied to the existing optimizer at the beginning of a given phase by specifying the desired configuration in the new_lr_scheduler key. The new_lr_scheduler dictionary format is described in the annotated yaml schedule below and can be explored using the advanced usage example.

When specifying an lr scheduler configuration for a given phase, the new_lr_scheduler dictionary requires at minimum an lr_scheduler_init dictionary containing a class_path key indicating the class of the lr scheduler (list of supported schedulers) to be instantiated and wrapped around your optimizer.

Any arguments you would like to pass to initialize the specified lr scheduler with should be specified in the init_args key of the lr_scheduler_init dictionary.

0:
  params:
  - model.classifier.bias
  - model.classifier.weight
1:
  params:
  - model.pooler.dense.bias
  - model.pooler.dense.weight
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.StepLR
      init_args:
        step_size: 1
        gamma: 0.7
...

Optionally, one can include arguments to pass to Lightning’s lr scheduler configuration (LRSchedulerConfig) in the pl_lrs_cfg dictionary.

0:
  ...
1:
  params:
  - model.pooler.dense.bias
  ...
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.StepLR
      init_args:
        step_size: 1
        ...
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler

If desired, one can also specify new initial learning rates to use for each of the existing parameter groups in the optimizer being wrapped via a list in the init_pg_lrs key.

...
1:
  params:
  ...
  new_lr_scheduler:
    lr_scheduler_init:
      ...
    init_pg_lrs: [2.0e-06, 2.0e-06]

Note

It is currently up to the user to ensure that the number of parameter groups listed in init_pg_lrs matches the number of optimizer parameter groups created in previous phases (and, if using ReduceLROnPlateau with a list of min_lr s, the current number of parameter groups). This number of groups depends on a number of factors, including the no_decay mapping of parameters specified in previous phases, and isn’t yet introspected/simulated in the current FinetuningScheduler version.

Finally, when reinitializing an lr scheduler for a given phase, one can direct FTS to use the current optimizer parameter group lrs rather than defaulting to the existing optimizer’s initial_lr configuration for existing parameter groups. This mode is enabled by setting the use_current_optimizer_pg_lrs key to True. For a concrete example of this behavior, see this example. The init_pg_lrs key takes precedence over the use_current_optimizer_pg_lrs key if both are present. 1

...
1:
  params:
  ...
  new_lr_scheduler:
    lr_scheduler_init:
      ...
    use_current_optimizer_pg_lrs: true

All lr scheduler reinitialization configurations specified in the fine-tuning schedule will have their configurations sanity-checked prior to training initiation.

Note that specifying lr scheduler reinitialization configurations is only supported for phases >= 1. This is because for fine-tuning phase 0, the lr scheduler configuration will be the scheduler that you initiate your training session with, usually via the configure_optimizers method of LightningModule.
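
For reference, a minimal sketch (with illustrative names and hyperparameters) of such a phase 0 optimizer and lr scheduler configuration defined in a LightningModule:

import torch
import lightning as L

class MyFinetuningModule(L.LightningModule):
    # a minimal sketch of the phase 0 configuration referenced above
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(
            (p for p in self.parameters() if p.requires_grad), lr=1.0e-05
        )
        scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=4)
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"}}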

Tip

If you want your learning rates logged on the same graph for each of the scheduler configurations defined in various phases, ensure that you provide the same name in the lr_scheduler configuration for each of the defined lr schedulers. For instance, in the lr scheduler reinitialization example, we provide:

model:
  class_path: fts_examples.stable.fts_superglue.RteBoolqModule
  init_args:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.LinearLR
      init_args:
        start_factor: 0.1
        total_iters: 4
    pl_lrs_cfg:
      # use the same name for your initial lr scheduler
      # configuration and your ``new_lr_scheduler`` configs
      # if you want LearningRateMonitor to generate a single graph
      name: Explicit_Reinit_LR_Scheduler

As you can observe in the explicit mode lr scheduler reinitialization example below, lr schedulers specified in different fine-tuning phases can be of differing types.

0:
  params:
  - model.classifier.bias
  - model.classifier.weight
1:
  params:
  - model.pooler.dense.bias
  - model.pooler.dense.weight
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.StepLR
      init_args:
        step_size: 1
        gamma: 0.7
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler
    init_pg_lrs: [2.0e-06, 2.0e-06]
2:
  params:
  - model.deberta.encoder.rel_embeddings.weight
  - model.deberta.encoder.layer.{0,11}.(output|attention|intermediate).*
  - model.deberta.embeddings.LayerNorm.bias
  - model.deberta.embeddings.LayerNorm.weight
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts
      init_args:
        T_0: 3
        T_mult: 2
        eta_min: 1.0e-07
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler
    init_pg_lrs: [1.0e-06, 1.0e-06, 2.0e-06, 2.0e-06]

Once a new lr scheduler is re-initialized, it will continue to be used for subsequent phases unless replaced with another lr scheduler configuration defined in a subsequent schedule phase.

Prior to the execution of each phase transition, the latest lr state 2 from the previous phase will be restored before proceeding with any lr scheduler reinitialization directive. This is predominantly relevant only when training in restore_best mode or reinitializing the optimizer as well as lr scheduler.

Tip

If you have specified an lr scheduler with an lr_lambdas attribute in any phase, (e.g. LambdaLR) you can have the last configured lambda automatically applied to new groups in subsequent phases by setting the apply_lambdas_new_pgs parameter to True. Note this option will only affect phases without reinitialized lr schedulers. Phases with defined lr scheduler reinitialization configs will always apply the specified config, including new lambdas if provided.
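
A minimal sketch (assuming a hypothetical explicit schedule that configures a LambdaLR in some phase) of enabling this behavior:

from finetuning_scheduler import FinetuningScheduler

# a minimal sketch: apply the last configured lr lambda to parameter groups added in
# subsequent phases that do not themselves reinitialize the lr scheduler
fts = FinetuningScheduler(ft_schedule="my_schedule.yaml", apply_lambdas_new_pgs=True)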

LR Scheduler Reinitialization With Generated (Implicit Mode) Fine-Tuning Schedules

One can also specify lr scheduler reinitialization in the context of implicit mode fine-tuning schedules. Since the fine-tuning schedule is automatically generated, the same lr scheduler configuration will be applied at each of the phase transitions. In implicit mode, the lr scheduler reconfiguration should be supplied to the reinit_lr_cfg parameter of FinetuningScheduler.

For example, configuring this dictionary via the LightningCLI, one could use:

model:
  class_path: fts_examples.stable.fts_superglue.RteBoolqModule
  init_args:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.StepLR
      init_args:
        step_size: 1
    pl_lrs_cfg:
      name: Implicit_Reinit_LR_Scheduler
trainer:
  callbacks:
    - class_path: finetuning_scheduler.FinetuningScheduler
      init_args:
        reinit_lr_cfg:
          lr_scheduler_init:
            class_path: torch.optim.lr_scheduler.StepLR
            init_args:
              step_size: 1
              gamma: 0.7
          pl_lrs_cfg:
            interval: epoch
            frequency: 1
            name: Implicit_Reinit_LR_Scheduler

Note that an initial lr scheduler configuration should also still be provided as usual (again, typically via the configure_optimizers method of LightningModule), and the initial lr scheduler configuration can differ in lr scheduler type and configuration from the configuration specified in reinit_lr_cfg applied at each phase transition. Because the same schedule is applied at each phase transition, the init_pg_lrs list is not supported in an implicit fine-tuning context.

Application of lr scheduler reinitialization in both explicit and implicit modes may be best understood via examples, so we’ll proceed to those next.

Advanced Usage Examples: Explicit and Implicit Mode LR Scheduler Reinitialization

Demonstration lr scheduler reinitialization configurations for both explicit and implicit fine-tuning scheduling contexts are available under ./fts_examples/stable/config/advanced/reinit_lr.

The lr scheduler reinitialization examples use the same code and have the same dependencies as the basic scheduled fine-tuning for SuperGLUE examples.

The two different demo schedule configurations are composed with shared defaults (./config/fts_defaults.yaml).

cd ./fts_examples/stable
# Demo lr scheduler reinitialization with an explicitly defined fine-tuning schedule:
python fts_superglue.py fit --config config/advanced/reinit_lr/fts_explicit_reinit_lr.yaml

# Demo lr scheduler reinitialization with an implicitly defined fine-tuning schedule:
python fts_superglue.py fit --config config/advanced/reinit_lr/fts_implicit_reinit_lr.yaml

Notice in the explicitly defined schedule scenario, we are using three distinct lr schedulers for three different training phases:

Phase 0

LR log for parameter group 1 (LinearLR initial target lr = 1.0e-05)

Phase 0 in yellow (passed to our LightningModule via the model definition in our LightningCLI configuration) uses a LinearLR scheduler (defined in ./config/advanced/reinit_lr/fts_explicit_reinit_lr.yaml) with the initial lr defined via the shared initial optimizer configuration (defined in ./config/fts_defaults.yaml).

This is the effective phase 0 config (defined in ./config/advanced/reinit_lr/fts_explicit_reinit_lr.yaml, applying defaults defined in ./config/fts_defaults.yaml):

model:
  class_path: fts_examples.stable.fts_superglue.RteBoolqModule
  init_args:
    optimizer_init:
      class_path: torch.optim.AdamW
      init_args:
        weight_decay: 1.0e-05
        eps: 1.0e-07
        lr: 1.0e-05
    ...
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.LinearLR
      init_args:
        start_factor: 0.1
        total_iters: 4
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler

Phase 1 in blue uses a StepLR scheduler, including the specified initial lr for the existing parameter groups (2.0e-06).

LR log for parameter groups 1 and 3 respectively: pg1 starts at 2.0e-06; pg3 starts at the default of 1.0e-05.

This is the phase 1 config (defined in our explicit schedule ./config/advanced/reinit_lr/explicit_reinit_lr.yaml):

...
1:
  params:
  - model.pooler.dense.bias
  - model.pooler.dense.weight
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.StepLR
      init_args:
        step_size: 1
        gamma: 0.7
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler
    init_pg_lrs: [2.0e-06, 2.0e-06]

Phase 2 in green uses a CosineAnnealingWarmRestarts scheduler, with the assigned initial lr for each of the parameter groups (1.0e-06 for pg1 and 2.0e-06 for pg3).

LR log for parameter groups 1 and 3 respectively: pg1 oscillates between 1.0e-06 and 1.0e-07; pg3 oscillates between 2.0e-06 and 1.0e-07.

This is the phase 2 config (like all non-zero phases, defined in our explicit schedule ./config/advanced/reinit_lr/explicit_reinit_lr.yaml):

...
2:
  params:
  - model.deberta.encoder.rel_embeddings.weight
  - model.deberta.encoder.layer.{0,11}.(output|attention|intermediate).*
  - model.deberta.embeddings.LayerNorm.bias
  - model.deberta.embeddings.LayerNorm.weight
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts
      init_args:
        T_0: 3
        T_mult: 2
        eta_min: 1.0e-07
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler
    init_pg_lrs: [1.0e-06, 1.0e-06, 2.0e-06, 2.0e-06]

In the implicitly defined schedule scenario, the StepLR lr scheduler specified via reinit_lr_cfg (which happens to be the same as the initially defined lr scheduler in this case) is reinitialized at each phase transition and applied to all optimizer parameter groups.

...
- class_path: finetuning_scheduler.FinetuningScheduler
  init_args:
    # note, we're not going to see great performance due
    # to the shallow depth, just demonstrating the lr scheduler
    # reinitialization behavior in implicit mode
    max_depth: 4
    # disable restore_best for lr pattern clarity
    restore_best: false
    reinit_lr_cfg:
      lr_scheduler_init:
        class_path: torch.optim.lr_scheduler.StepLR
        init_args:
          step_size: 1
          gamma: 0.7
      pl_lrs_cfg:
        interval: epoch
        frequency: 1
        name: Implicit_Reinit_LR_Scheduler

LR log for parameter groups 1 and 3 respectively.

Note that we have disabled restore_best in both examples for clarity of lr patterns.

Footnotes

1

The following precedence governs the configuration of existing parameter group lrs when reinitializing an lr scheduler:

  1. User-provided lrs from the init_pg_lrs directive if it exists

  2. Existing optimizer lrs if use_current_optimizer_pg_lrs is set to True

  3. The initial_lr of the current optimizer parameter groups by default

  4. The existing optimizer lrs if use_current_optimizer_pg_lrs is not set to True but the relevant parameter group does not have an initial_lr key

2

The latest lr state consists of the previous lr scheduler state_dict and the lr of each optimizer parameter group.

Optimizer Reinitialization

Overview

FinetuningScheduler (FTS) supports the initialization of new optimizers according to a user-specified fine-tuning schedule. Similarly motivated to Fine-Tuning Scheduler’s lr scheduler reinitialization feature, one can initialize new optimizers (or reinitialize an existing one) at the beginning of one or more scheduled training phases.

Optimizer reinitialization is supported:

  • In both explicit and implicit fine-tuning schedule modes (see the Fine-Tuning Scheduler intro for more on basic usage modes)

  • With or without concurrent lr scheduler reinitialization

  • In the context of all supported training strategies (including FSDP)

  • With FTS >= 2.0.2

We’ll cover both implicit and explicit configuration modes below and provide a slightly altered version of the lr scheduler reinitialization example that demonstrates concurrent reinitialization of optimizers and lr schedulers at different phases.

Specifying Optimizer Configurations For Specific Fine-Tuning Phases

When defining a fine-tuning schedule (see the intro for basic schedule specification), a new optimizer configuration can be applied to the existing training session at the beginning of a given phase by specifying the desired configuration in the new_optimizer key. The new_optimizer dictionary format is described in the annotated yaml schedule below and can be explored using the advanced usage example.

When specifying an optimizer configuration for a given phase, the new_optimizer dictionary requires at minimum an optimizer_init dictionary containing a class_path key indicating the class of the optimizer (list of supported optimizers) to be instantiated.

Any arguments with which you would like to initialize the specified optimizer should be specified in the init_args key of the optimizer_init dictionary.

0:
  params:
  - model.classifier.bias
  - model.classifier.weight
1:
  params:
  - model.pooler.dense.bias
  - model.pooler.dense.weight
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  new_optimizer:
    optimizer_init:
      class_path: torch.optim.SGD
      init_args:
        lr: 2.0e-03
        momentum: 0.9
        weight_decay: 2.0e-06
...

Optionally, one can also provide an lr scheduler reinitialization directive in the same phase as an optimizer reinitialization directive. If one does not provide a new_lr_scheduler directive, the latest lr state will still be restored and wrapped around the new optimizer prior to the execution of the new phase (as with lr scheduler reinitialization):

0:
  ...
1:
  params:
  - model.pooler.dense.bias
  ...
  new_optimizer:
    optimizer_init:
      class_path: torch.optim.SGD
      init_args:
        lr: 2.0e-03
        momentum: 0.9
        weight_decay: 2.0e-06
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.StepLR
      init_args:
        ...
    pl_lrs_cfg:
      ...
    init_pg_lrs: [2.0e-06, 2.0e-06]

All optimizer reinitialization configurations specified in the fine-tuning schedule will have their configurations sanity-checked prior to training initiation.

Note

When reinitializing optimizers, FTS does not fully simulate/evaluate all compatibility scenarios, so it is the user’s responsibility to ensure compatibility between optimizer instantiations or to set restore_best to False. For example, consider the following training scenario:

Phase 0: SGD training
Phase 1: Reinitialize the optimizer and continue training with an Adam optimizer
Phase 2: Restore best checkpoint from phase 0 (w/ `restore_best` default of `True`)

Phase 2 would fail due to incompatibility between Adam and SGD optimizer states. This issue could be avoided by either reinitializing the Adam optimizer again in phase 2 or setting restore_best to False. 1

Both lr scheduler and optimizer reinitialization configurations are only supported for phases >= 1. This is because for fine-tuning phase 0, training component configurations will be the ones the user initiated the training session with, usually via the configure_optimizers method of LightningModule.

As you can observe in the explicit mode optimizer reinitialization example below, optimizers specified in different fine-tuning phases can be of differing types.

0:
  params:
  - model.classifier.bias
  - model.classifier.weight
1:
  params:
  - model.pooler.dense.bias
  - model.pooler.dense.weight
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  new_optimizer:
    optimizer_init:
      class_path: torch.optim.SGD
      init_args:
        lr: 2.0e-03
        momentum: 0.9
        weight_decay: 2.0e-06
  ...
2:
  params:
  - model.deberta.encoder.rel_embeddings.weight
  - model.deberta.encoder.layer.{0,11}.(output|attention|intermediate).*
  - model.deberta.embeddings.LayerNorm.bias
  - model.deberta.embeddings.LayerNorm.weight
  new_optimizer:
    optimizer_init:
      class_path: torch.optim.AdamW
      init_args:
        weight_decay: 1.0e-05
        eps: 1.0e-07
        lr: 1.0e-05
  ...

Once a new optimizer is re-initialized, it will continue to be used for subsequent phases unless replaced with another optimizer configuration defined in a subsequent schedule phase.

Optimizer Reinitialization With Generated (Implicit Mode) Fine-Tuning Schedules

One can also specify optimizer reinitialization in the context of implicit mode fine-tuning schedules. Since the fine-tuning schedule is automatically generated, the same optimizer configuration will be applied at each of the phase transitions. In implicit mode, the optimizer reconfiguration should be supplied to the reinit_optim_cfg parameter of FinetuningScheduler.

For example, configuring this dictionary via the LightningCLI, one could use:

model:
  ...
trainer:
  callbacks:
    - class_path: finetuning_scheduler.FinetuningScheduler
      init_args:
        reinit_optim_cfg:
          optimizer_init:
            class_path: torch.optim.AdamW
            init_args:
              weight_decay: 1.0e-05
              eps: 1.0e-07
              lr: 1.0e-05
        reinit_lr_cfg:
          lr_scheduler_init:
            class_path: torch.optim.lr_scheduler.StepLR
            ...

Note that an initial optimizer configuration should also still be provided as usual (again, typically via the configure_optimizers method of LightningModule), and the initial optimizer configuration can differ in optimizer type and configuration from the configuration specified in reinit_optim_cfg applied at each phase transition. As with explicit mode, concurrent reinit_lr_cfg configurations can also be specified in implicit mode.

Advanced Usage Examples: Explicit and Implicit Mode Concurrent Optimizer and LR Scheduler Reinitialization

Demonstration optimizer and concurrent lr scheduler reinitialization configurations for both explicit and implicit fine-tuning scheduling contexts are available under ./fts_examples/stable/config/advanced/reinit_optim_lr.

The concurrent optimizer and lr scheduler reinitialization examples use the same code and have the same dependencies as the lr scheduler reinitialization-only examples (with the exception of requiring FTS >= 2.0.2).

The two different demo schedule configurations are composed with shared defaults (./config/fts_defaults.yaml).

# Demo concurrent optimizer and lr scheduler reinitializations...
cd ./fts_examples/stable

# with an explicitly defined fine-tuning schedule:
python fts_superglue.py fit --config config/advanced/reinit_optim_lr/fts_explicit_reinit_optim_lr.yaml

# with an implicitly defined fine-tuning schedule:
python fts_superglue.py fit --config config/advanced/reinit_optim_lr/fts_implicit_reinit_optim_lr.yaml

# with non-default `use_current_optimizer_pg_lrs` mode (and an implicit schedule):
python fts_superglue.py fit --config config/advanced/reinit_optim_lr/fts_implicit_reinit_optim_lr_use_curr.yaml

Similar to the explicitly defined lr reinitialization-only schedule example, we are using three distinct lr schedulers for three different training phases. In this case, there are also distinctly configured optimizers being used:

Phase 0

Because we turned on DEBUG-level logging to trace reinitializations, we observe the following in our training log upon the phase 1 optimizer reinitialization:

Epoch 8: 100%|██████████| 78/78 ...
...
Fine-Tuning Scheduler has reinitialized the optimizer as directed:
Previous optimizer state: AdamW
... (followed by parameter group config details)
New optimizer state: SGD
... (followed by parameter group initial config details, note existing lr state or user directives may subsequently override the `lr`s in this initial config)

In the implicitly defined schedule scenario, we begin using the AdamW optimizer but the SGD optimizer and StepLR lr scheduler are specified via reinit_optim_cfg and reinit_lr_cfg respectively. Both training components are reinitialized at each phase transition and applied to all optimizer parameter groups.

 1  ...
 2  - class_path: finetuning_scheduler.FinetuningScheduler
 3    init_args:
 4      # note, we're not going to see great performance due
 5      # to the shallow depth, just demonstrating the lr scheduler
 6      # reinitialization behavior in implicit mode
 7      max_depth: 4
 8      restore_best: false  # disable restore_best for lr pattern clarity
 9      logging_level: 10  # enable DEBUG logging to trace all reinitializations
10      reinit_optim_cfg:
11        optimizer_init:
12          class_path: torch.optim.SGD
13          init_args:
14            lr: 1.0e-05
15            momentum: 0.9
16            weight_decay: 1.0e-06
17      reinit_lr_cfg:
18        lr_scheduler_init:
19          class_path: torch.optim.lr_scheduler.StepLR
20          init_args:
21            step_size: 1
22            gamma: 0.7
23        pl_lrs_cfg:
24          interval: epoch
25          frequency: 1
26          name: Implicit_Reinit_LR_Scheduler
27        # non-default behavior set in `fts_implicit_reinit_optim_lr_use_curr.yaml`
28        use_current_optimizer_pg_lrs: true

LR log for parameter group 1, reflecting repeated reinitialization of the SGD optimizer and StepLR lr scheduler (initial target lr = 1.0e-05) at each phase transition. The behavioral impact of use_current_optimizer_pg_lrs (line 28 above) on the lr scheduler reinitializations can be clearly observed.

Note that we have disabled restore_best in both examples for clarity of lr patterns.

Note

Optimizer reinitialization with FinetuningScheduler is currently in beta.

Configuration Appendix

Effective phase 0 config defined in ./config/advanced/reinit_optim_lr/fts_explicit_reinit_optim_lr.yaml, applying defaults defined in ./config/fts_defaults.yaml

...
model:
  class_path: fts_examples.stable.fts_superglue.RteBoolqModule
  init_args:
    optimizer_init:
      class_path: torch.optim.AdamW
      init_args:
        weight_decay: 1.0e-05
        eps: 1.0e-07
        lr: 1.0e-05
    ...
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.LinearLR
      init_args:
        start_factor: 0.1
        total_iters: 4
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler

Phase 1 config, defined in our explicit schedule ./config/advanced/reinit_optim_lr/explicit_reinit_optim_lr.yaml

...
1:
  params:
  - model.pooler.dense.bias
  - model.pooler.dense.weight
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  new_optimizer:
    optimizer_init:
      class_path: torch.optim.SGD
      init_args:
        lr: 1.0e-05
        momentum: 0.9
        weight_decay: 1.0e-06
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.StepLR
      init_args:
        step_size: 1
        gamma: 0.7
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler
    init_pg_lrs: [2.0e-06, 2.0e-06]

Phase 2 config, like all non-zero phases, defined in our explicit schedule ./config/advanced/reinit_optim_lr/explicit_reinit_optim_lr.yaml

...
2:
  params:
  - model.deberta.encoder.rel_embeddings.weight
  - model.deberta.encoder.layer.{0,11}.(output|attention|intermediate).*
  - model.deberta.embeddings.LayerNorm.bias
  - model.deberta.embeddings.LayerNorm.weight
  new_optimizer:
    optimizer_init:
      class_path: torch.optim.AdamW
      init_args:
        weight_decay: 1.0e-05
        eps: 1.0e-07
        lr: 1.0e-05
  new_lr_scheduler:
    lr_scheduler_init:
      class_path: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts
      init_args:
        T_0: 3
        T_mult: 2
        eta_min: 1.0e-07
    pl_lrs_cfg:
      interval: epoch
      frequency: 1
      name: Explicit_Reinit_LR_Scheduler
    init_pg_lrs: [1.0e-06, 1.0e-06, 2.0e-06, 2.0e-06]

Footnotes

1

While FTS could theoretically cache optimizer state prior to checkpoint restoration for potentially incompatible optimizer reinitialization configurations, that functionality is not currently implemented because of the resource overhead and unnecessary complexity it would add to the default restoration path. If there is sufficient interest in the user community, that functionality may be added in the future.

FSDP Scheduled Fine-Tuning

Overview

FinetuningScheduler (FTS) now supports flexible, multi-phase, scheduled fine-tuning with the Fully Sharded Data Parallel (FSDP) strategy (FSDPStrategy). This tutorial assumes a basic understanding of FSDP training; please see this PyTorch tutorial for a good introduction to FSDP training.

As with standard FSDP usage, FSDP wrapping of a LightningModule can be performed either by providing an auto_wrap_policy or (for maximal control) by overriding the configure_model method of LightningModule and manually wrapping the module.

This tutorial walks through the configuration of an example multi-phase, scheduled FSDP fine-tuning training session and largely uses the same code as the basic scheduled fine-tuning for SuperGLUE examples.

Example: Multi-Phase Scheduled Fine-Tuning with FSDP

Demonstration FTS FSDP training/profiling configurations and a DDP baseline for comparison are available under ./fts_examples/stable/config/advanced/fsdp.

Most of these FTS FSDP training examples have the same dependencies as the basic scheduled fine-tuning for SuperGLUE examples, except that PyTorch >= 2.0 is required. Running the basic FSDP example requires PyTorch >= 2.1.0.

Note

The examples below are not configured to execute a full training session but instead to generate the minimal meaningful profiling statistics for analysis and exposition (e.g. using only 2 batches, very limited epochs, etc.)

The demo schedule configurations are composed with the basic FTS example’s shared defaults (./config/fts_defaults.yaml) and can be executed as follows:

cd ./fts_examples/stable

# there is an open issue regarding superfluous profiler messages (still as of 2023.04.15)
# setting the environmental variable below is a workaround to keep the example output clean:

export TORCH_CPP_LOG_LEVEL=ERROR

# Profiled demo of basic scheduled fine-tuning with FSDP (requires PyTorch >= 2.1.0)
python fts_superglue.py fit --config config/advanced/fsdp/fts_fsdp_basic_profile.yaml

# Profiled demo of FSDP scheduled fine-tuning using the ``awp_overrides`` option:
python fts_superglue.py fit --config config/advanced/fsdp/fts_fsdp_awp_overrides_profile.yaml

# Profiled demo of comparable DDP scheduled fine-tuning baseline:
python fts_superglue.py fit --config config/advanced/fsdp/fts_ddp_fsdp_baseline_profile.yaml

# Profiled demo of FSDP scheduled fine-tuning with CPU Offloading but full precision
# (for reference, not reviewed in this tutorial)
python fts_superglue.py fit --config config/advanced/fsdp/fts_fsdp_awp_overrides_offload_profile.yaml

Basic Scheduled Fine-Tuning with FSDP

Beginning with PyTorch version 2.1.0, the effective constraints FSDP imposed on fine-tuning schedules were substantially relaxed. As you’ll see below, scheduled fine-tuning with FSDP is pretty straightforward! All one needs to do is:

  1. Set use_orig_params to True in the FSDP strategy configuration.

  2. Provide a simple auto_wrap_policy configuration (not technically required but almost always desired).

For a given fine-tuning schedule:

0:
  params:
  - model.classifier.*
  max_transition_epoch: 1
1:
  params:
  - model.pooler.dense.*
  - model.deberta.encoder.layer.11.(output|attention|intermediate).*
  max_transition_epoch: 2
2:
  params:
  - model.deberta.encoder.layer.([0-9]|10).(output|attention|intermediate).*
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  - model.deberta.encoder.rel_embeddings.weight

We can just define an auto_wrap_policy for our DeBERTa-v3 module, directing FTS/FSDP to wrap the specified Transformer layers in separate FSDP modules:

 1strategy:
 2  class_path: lightning.pytorch.strategies.FSDPStrategy
 3  init_args:
 4    # other FSDP args as desired ...
 5    use_orig_params: True
 6    auto_wrap_policy:
 7      class_path: torch.distributed.fsdp.wrap.ModuleWrapPolicy
 8      init_args:
 9        module_classes: !!set
10          ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Layer

That’s it! Note that we set use_orig_params to True in line 5 as it allows for more flexible fine-tuning schedules with PyTorch >= 2.1.0.
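
For reference, a minimal sketch of the same strategy configuration expressed directly in Python rather than via the LightningCLI YAML used by the examples (the schedule path is hypothetical):

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from transformers.models.deberta_v2.modeling_deberta_v2 import DebertaV2Layer

from finetuning_scheduler import FinetuningScheduler

# a minimal sketch: wrap each DebertaV2Layer in its own FSDP instance and enable
# ``use_orig_params`` for maximal fine-tuning schedule flexibility
strategy = FSDPStrategy(use_orig_params=True, auto_wrap_policy=ModuleWrapPolicy({DebertaV2Layer}))
trainer = L.Trainer(
    strategy=strategy,
    callbacks=[FinetuningScheduler(ft_schedule="my_fsdp_schedule.yaml")],
)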

In the next section, we’ll cover some of the more advanced configuration options available for customizing scheduled fine-tuning with FSDP.

Advanced FSDP Wrapping For Scheduled Fine-Tuning

There are a number of usage contexts that might motivate moving beyond the simple configuration above. For instance:

Motivations for Advanced FSDP Wrapping

  • Optimize resource utilization (whether memory, compute or network): activation checkpointing, cpu offload, awp_overrides

  • More granular control over the module wrapping policy without manually writing a configure_model method: awp_overrides

  • A desire to use FSDP in the default use_orig_params=False mode, or with a version of PyTorch < 2.1.0: see the PyTorch documentation for possible issues

As with standard FSDP module wrapping, one can use an auto_wrap_policy to wrap a model for FSDP scheduled fine-tuning. In the current FTS release, there is only one FTS-specific FSDP configuration enhancement to consider: the awp_overrides list.

awp_overrides is an optional list of module names that should be wrapped in separate FSDP instances, complementing the modules that would be individually wrapped by auto_wrap_policy provided in the FSDPStrategy strategy configuration.

Starting with a defined auto_wrap_policy and providing module name-based complements/overrides as needed using awp_overrides is often the most expedient approach to auto-wrapping models in alignment with a fine-tuning schedule.

We again start by defining a simple fine-tuning schedule that we would like to ensure our module wrapping supports:

0:
  params:
  - model.classifier.*
  max_transition_epoch: 1
1:
  params:
  - model.pooler.dense.*
  - model.deberta.encoder.layer.11.(output|attention|intermediate).*
  max_transition_epoch: 2
2:
  params:
  - model.deberta.encoder.layer.([0-9]|10).(output|attention|intermediate).*
  - model.deberta.encoder.LayerNorm.bias
  - model.deberta.encoder.LayerNorm.weight
  - model.deberta.encoder.rel_embeddings.weight
  # excluding these parameters from the schedule to enhance the debugging demonstration
  #- model.deberta.embeddings.LayerNorm.bias
  #- model.deberta.embeddings.LayerNorm.weight
  #- model.deberta.embeddings.word_embeddings.weight

We define the auto_wrap_policy for our DeBERTa-v3 module as follows:

strategy:
  class_path: lightning.pytorch.strategies.FSDPStrategy
  init_args:
    # other FSDP args as desired ...
    auto_wrap_policy:
      class_path: torch.distributed.fsdp.wrap.ModuleWrapPolicy
      init_args:
        module_classes: !!set
          ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Layer
          ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Embeddings
          ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Encoder

We’ll inspect the rationale for this policy below, but first, notice we have not referenced our classifier and pooler layers. Because we would like to thaw our classifier and pooler layers in separate phases from some other layers, we need to separately wrap these layers as well. If, however, we specified separate wrapping of all Linear layers in our auto_wrap_policy, we would end up unnecessarily (and in many cases problematically) wrapping the many Linear layers within our currently FSDP-wrapped modules (DebertaV2Layer etc.) in separate FSDP instances.

To facilitate module wrapping in alignment with fine-tuning schedule phases, FTS provides the awp_overrides feature which allows users to provide module name-based complements to a given auto_wrap_policy.

In this case, simply listing the names of (or regex patterns matching) modules we would like to separately wrap allows us to achieve FSDP wrapping that aligns with our fine-tuning schedule. FTS support for FSDP training is provided via a StrategyAdapter (FSDPStrategyAdapter). Configuration for FTS-extensions of strategies like FSDP is passed to FTS via the strategy_adapter_cfg configuration dictionary.

So in our example, we can pass the awp_overrides configuration option to FTS like so:

# in ./fts_examples/stable/config/advanced/fsdp/fts_fsdp_awp_overrides_profile.yaml
...
  - class_path: finetuning_scheduler.FinetuningScheduler
    init_args:
      ft_schedule: ./config/RteBoolqModule_ft_schedule_deberta_base_fsdp.yaml
      max_depth: 2
      strategy_adapter_cfg:
        awp_overrides: ["model.pooler.dense", "model.classifier"]
...

Finally, we configure the FSDP training strategy as desired per usual, for instance, specifying activation_checkpointing_policy and cpu_offload configurations in addition to the auto_wrap_policy we defined above:

# in ./fts_examples/stable/config/advanced/fsdp/fts_fsdp_awp_overrides_profile.yaml
...
  strategy:
    class_path: lightning.pytorch.strategies.FSDPStrategy
    init_args:
      cpu_offload: false
      activation_checkpointing_policy: !!set
        ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Layer
      auto_wrap_policy:
        class_path: torch.distributed.fsdp.wrap.ModuleWrapPolicy
        init_args:
          module_classes: !!set
            ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Layer
            ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Embeddings
            ? transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Encoder

That’s all there is to it! We’ve successfully defined our fine-tuning schedule and FSDP wrapped our model in a manner that supports FSDP multi-phase scheduled fine-tuning.

Additional FSDP Wrapping and Debugging Guidance

In order to support multi-phase scheduled fine-tuning with FSDP in use_orig_params=False mode, FTS’s key precondition is that the defined fine-tuning schedule phases have disjoint sets of FSDP-flattened parameters (a FlatParameter is created when wrapping a set of modules in an FSDP instance/unit). This constraint derives from the fact that (for PyTorch < 2.1.0 or use_orig_params=False mode) the requires_grad attribute must be the same for all parameters flattened into the same FlatParameter. 1

FTS will attempt to validate that the module is wrapped in a manner that aligns with the defined fine-tuning schedule phases prior to the start of training and provide detailed feedback for the user if a misalignment is discovered.

For example, note that because we wanted to thaw some DebertaV2Layer modules separately from others, we directed FSDP to wrap each DebertaV2Layer in its own FSDP instance rather than just wrapping the entire DebertaV2Encoder.

What happens if we direct FSDP to wrap only the DebertaV2Layer modules, and not the DebertaV2Encoder and DebertaV2Embeddings modules as well?

FTS stops before beginning training and provides extensive context via this error message:

"Fine-tuning schedule phases do not have disjoint FSDP-flattened parameter sets. Because the `requires_grad` attribute of FSDP-flattened parameters currently must be the same for all flattened parameters (for PyTorch < ``2.1.0`` or if in ``use_orig_params=False`` mode), fine-tuning schedules must avoid thawing parameters in the same FSDP-flattened parameter in different phases. Please ensure parameters associated with each phase are wrapped in separate phase-aligned FSDP instances.

In this particular case, there are parameters not included in your fine-tuning schedule that span more than one fine-tuning phase. HINT: parameters associated with unwrapped modules will be included in the top-level (aka 'root') FSDP instance so ensuring all modules associated with fine-tuning scheduled parameters are wrapped separately from the top-level FSDP instance may avoid triggering this exception.

The following logical parameters are associated with an FSDP-flattened parameter that spans more than one fine-tuning phase. The mapping of each logical parameter with the module name wrapped by its associated FSDP instance is provided below:

{'model.deberta.embeddings.LayerNorm.bias': 'DebertaV2ForSequenceClassification',
 'model.deberta.embeddings.LayerNorm.weight': 'DebertaV2ForSequenceClassification',
 'model.deberta.embeddings.word_embeddings.weight': 'DebertaV2ForSequenceClassification',
 'model.deberta.encoder.LayerNorm.bias': 'DebertaV2ForSequenceClassification',
 'model.deberta.encoder.LayerNorm.weight': 'DebertaV2ForSequenceClassification',
 'model.deberta.encoder.rel_embeddings.weight': 'DebertaV2ForSequenceClassification'}"

This helps us understand that we have parameters that all belong to the same top-level FSDP instance (the instance that wraps DebertaV2ForSequenceClassification). By failing to specify separate wrapping of the DebertaV2Encoder, parameters associated with that module fell to the top-level/root FSDP instance to be managed. While the DebertaV2Embeddings parameters were not included in our schedule, they still must be wrapped by FSDP and so are also included with the DebertaV2Encoder parameters in the same top-level FlatParameter. If training had been permitted to proceed in this case, the DebertaV2Embeddings parameters would have been thawed along with the DebertaV2Encoder parameters in phase 2, violating our specified fine-tuning schedule.

To avoid violating the phase-wise disjointness constraint, we add DebertaV2Encoder to our auto_wrap_policy. While not technically required, we add DebertaV2Embeddings separately as well for future experimental flexibility.
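
For illustration, the sketch below shows one way such a wrapping policy might be expressed. It is a minimal example rather than the canonical configuration for this tutorial: it assumes PyTorch >= 2.0 (for ModuleWrapPolicy), the Hugging Face transformers DeBERTa-v2 module classes, and that the fine-tuning schedule itself is configured elsewhere.

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from transformers.models.deberta_v2.modeling_deberta_v2 import (
    DebertaV2Embeddings,
    DebertaV2Encoder,
    DebertaV2Layer,
)
from finetuning_scheduler import FinetuningScheduler

# wrap each listed module class in its own FSDP instance so the parameters thawed in each
# scheduled phase map to disjoint FSDP-flattened parameter sets
policy = ModuleWrapPolicy({DebertaV2Layer, DebertaV2Encoder, DebertaV2Embeddings})
trainer = L.Trainer(
    strategy=FSDPStrategy(auto_wrap_policy=policy),
    callbacks=[FinetuningScheduler()],  # explicit schedule omitted here for brevity
)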

As always, if needed, one can alternatively override configure_model and manually wrap a given LightningModule to align with a desired fine-tuning schedule.
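
A minimal, hypothetical sketch of that approach follows. It assumes Lightning's FSDPStrategy activates the FSDP wrapping context while configure_model runs, that self.model was constructed in __init__, and that the DeBERTa submodule paths are purely illustrative:

import lightning as L
from torch.distributed.fsdp.wrap import wrap


class ScheduledDebertaModule(L.LightningModule):  # hypothetical LightningModule
    def configure_model(self):
        # wrap schedule-aligned submodules in their own FSDP instances so each
        # fine-tuning phase thaws a disjoint set of FSDP-flattened parameters
        for i, layer in enumerate(self.model.deberta.encoder.layer):
            self.model.deberta.encoder.layer[i] = wrap(layer)
        self.model.deberta.encoder = wrap(self.model.deberta.encoder)
        self.model.deberta.embeddings = wrap(self.model.deberta.embeddings)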

Warning

FSDPStrategyAdapter is in BETA and subject to change. The interface may introduce breaking changes and new features with upcoming releases of PyTorch.

Note

The no_decay attribute that FTS supports on LightningModule with the base StrategyAdapter is not currently supported in the context of FSDP fine-tuning.

Note

Resuming across heterogeneous use_orig_params contexts with FTS is not currently supported (e.g., checkpoints saved with use_orig_params=True need to be resumed with use_orig_params=True set).
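
For example, a configuration sketch for resuming such a checkpoint (assuming FSDPStrategy forwards use_orig_params through to FSDP; the model and checkpoint path below are placeholders):

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
from finetuning_scheduler import FinetuningScheduler

# resume with the same ``use_orig_params`` setting that produced the checkpoint
trainer = L.Trainer(
    strategy=FSDPStrategy(use_orig_params=True),
    callbacks=[FinetuningScheduler()],
)
# then resume as usual, e.g. trainer.fit(model, ckpt_path="path/to/orig_params_true.ckpt")  # placeholders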

Note

With PyTorch versions < 2.0, optimizer state dicts are not currently saved/loaded when restoring checkpoints in the context of FSDP training. This comports with upstream Lightning behavior/limitations. Please use PyTorch >= 2.0 if restoring optimizer state from checkpoints (while FSDP training) is critical to your use case. For more regarding this version constraint, see this issue.

Tip

When FSDP training with PyTorch >= 2.1.0 and use_orig_params=True, DEBUG-level logging will provide parameter shard allocation diagnostic info where relevant.
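
For example, one generic way to surface those diagnostics is to raise the logging level before training; this is a sketch using Python's standard logging module, and the package-name logger targeted below is an assumption:

import logging

# enable DEBUG output globally, or target a specific logger for less noise
# (the exact logger name below is an assumption)
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("finetuning_scheduler").setLevel(logging.DEBUG)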

Tip

If you want to extend FTS to use a custom, currently unsupported strategy or override current FTS behavior with a given training strategy, subclassing StrategyAdapter is a way to do so.

Footnotes

1. As of PyTorch 2.1.0, FlatParameter objects constructed in use_orig_params mode are allowed to contain original params with non-uniform requires_grad.

Contributor Covenant Code of Conduct

Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to creating a positive environment include:

  • Using welcoming and inclusive language

  • Being respectful of differing viewpoints and experiences

  • Gracefully accepting constructive criticism

  • Focusing on what is best for the community

  • Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

  • The use of sexualized language or imagery and unwelcome sexual attention or advances

  • Trolling, insulting/derogatory comments, and personal or political attacks

  • Public or private harassment

  • Publishing others’ private information, such as a physical or electronic address, without explicit permission

  • Other conduct which could reasonably be considered inappropriate in a professional setting

Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at waf2107@columbia.edu. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq

Contributing

Welcome to the community! Fine-Tuning Scheduler extends the most advanced DL research platform on the planet (Lightning) and strives to support the latest best practices and integrations that the amazing PyTorch team and other research organizations roll out!

As Fine-Tuning Scheduler is an extension of Lightning, the remainder of the contribution guidelines conform to (and many are drawn from) the Lightning contribution documentation.

A giant thank you to the Lightning team for their tireless effort building the immensely useful Lightning project and their thoughtful feedback on and review of this extension.

Main Core Value: One less thing to remember

Simplify the API as much as possible from the user perspective. Any additions or improvements should minimize the things the user needs to remember.

Design Principles

We encourage all sorts of contributions you’re interested in adding! When coding for Fine-Tuning Scheduler, please follow these principles.

No PyTorch Interference

We don’t want to add any abstractions on top of pure PyTorch. This gives researchers all the control they need without having to learn yet another framework.

Simple Internal Code

It’s useful for users to look at the code and understand very quickly what’s happening. Many users won’t be engineers. Thus we need to value clear, simple code over condensed ninja moves. While that’s super cool, this isn’t the project for that :)

Simple External API

What makes sense to you may not make sense to others. When creating an issue with an API change suggestion, please validate that it makes sense for others. Treat code changes the way you treat a startup: validate that it’s a needed feature, then add if it makes sense for many people.

Backward-compatible API

We all hate updating our deep learning packages because we don’t want to refactor a bunch of stuff. With the Fine-Tuning Scheduler, we make sure every change we make which could break an API is backward compatible with good deprecation warnings.

You shouldn’t be afraid to upgrade the Fine-Tuning Scheduler :)

Gain User Trust

As a researcher, you can’t have any part of your code going wrong. So, make thorough tests to ensure that every implementation of a new trick or subtle change is correct.


Contribution Types

We are always open to contributions of new features or bug fixes.

A lot of good work has already been done in project mechanics (requirements.txt, setup.py, pep8, badges, ci, etc…) so we’re in a good state there thanks to all the early contributors (even pre-beta release)!

Bug Fixes:
  1. If you find a bug, please submit a GitHub issue.

    • Make sure the title explains the issue.

    • Describe your setup, what you are trying to do, expected vs. actual behaviour. Please add configs and code samples.

    • Add details on how to reproduce the issue - a minimal test case is always best, and a Colab notebook is also great. Note that the sample code should be minimal and, if needed, use publicly available data.

  2. Try to fix it or recommend a solution. We highly recommend using a test-driven approach:

    • Convert your minimal code example to a unit/integration test with assert on expected results.

    • Start by debugging the issue… You can run just this particular test in your IDE and draft a fix.

    • Verify that your test case fails on the main branch and only passes with the fix applied.

  3. Submit a PR!

Note that even if you do not find the solution, sending a PR with a test covering the issue is a valid contribution, and we can help you or finish it with you :]

New Features:
  1. Submit a GitHub issue - describe the motivation for the feature (adding a use case or an example is helpful).

  2. Determine the feature scope with us.

  3. Submit a PR! We recommend a test-driven approach to adding new features as well:

    • Write a test for the functionality you want to add.

    • Write the functional code until the test passes.

  4. Add/update the relevant tests!

Test cases:

Want to keep Fine-Tuning Scheduler healthy? Love seeing those green tests? So do we! How do we keep it that way? We write tests! We value test contributions even more than new features.


Guidelines

Development scripts

The following make targets are available from the project root (Unix only):

  • make clean cleans repo from temp/generated files

  • make docs builds documentation under docs/build/html

  • make test runs all project’s tests with coverage

Original code

All added or edited code shall be the original work of the contributor. If you use a third-party implementation, such blocks/functions/modules shall be properly attributed and, if possible, used with the agreement of the code’s author. For example - This code is inspired by http://....

Coding Style
  1. Use f-strings for output formatting (see the brief example after this list).

  2. You can use pre-commit to make sure your code style is correct.
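
A brief illustration of the preferred f-string style (with illustrative values only):

phase, val_loss = 2, 0.1234
print(f"phase {phase}: val_loss={val_loss:.3f}")  # preferred: f-string
print("phase %d: val_loss=%.3f" % (phase, val_loss))  # avoid: older %-style formatting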

Documentation

We are using Sphinx with the Napoleon extension. Moreover, we follow the Google docstring style with type annotations.

See the following short example of a sample function taking one positional integer parameter and one optional float parameter.

from typing import Optional


def my_func(param_a: int, param_b: Optional[float] = None) -> str:
    """Sample function.

    Args:
        param_a: first parameter
        param_b: second parameter

    Return:
        sum of both numbers as a string

    Example::

        Sample doctest example...
        >>> my_func(1, 2)
        '3'

    Note:
        If you want to add something.
    """
    p = param_b if param_b else 0
    return str(param_a + p)

When updating the docs, make sure to build them locally first and visually inspect the HTML files (in the browser) for formatting errors. In certain cases, a missing blank line or a wrong indent can lead to a broken layout. Run these commands:

pip install -r requirements/docs.txt
make clean
cd docs
make html

and open docs/build/html/index.html in your browser.

Notes:

  • You need to have LaTeX installed for rendering math equations. You can for example install TeXLive by doing one of the following:

    • on Ubuntu (Linux) run apt-get install texlive or otherwise follow the instructions on the TeXLive website

    • use the RTD docker image

  • because PL uses class metaprogramming, you need to use python 3.7 or higher

Testing

Local: Testing your work locally will help you speed up the process since it allows you to focus on particular (failing) test cases. To set up a local development environment, install both local and test dependencies:

# PACKAGE_NAME variable currently required to specify pytorch-lightning dev package dep (as of lightning 1.8.0)
export PACKAGE_NAME=pytorch
python -m pip install ".[all]"
python -m pip install pre-commit
pre-commit install

Note: if your machine does not have multiple GPUs or a TPU, these tests are skipped.

GitHub Actions: For convenience, you can also rely on your own GitHub Actions builds, which are triggered on each commit. This is useful if you do not test against all required dependency versions locally.

You can then run:

python -m pytest src/finetuning_scheduler src/fts_examples/stable tests -v

Pull Request

We welcome any useful contribution! For your convenience here’s a recommended workflow:

  1. Think about what you want to do - fix a bug, repair docs, etc. If you want to implement a new feature or enhance an existing one:

    • Start by opening a GitHub issue to explain the feature and the motivation. In the case of features, ask yourself first - Is this NECESSARY for Fine-Tuning Scheduler? Some PRs are purely about adding engineering complexity, which has no place in Fine-Tuning Scheduler.

    • Core contributors will take a look (it might take some time - we are often overloaded with issues!) and discuss it.

    • Once an agreement is reached - start coding.

  2. Start your work locally.

    • Create a branch and prepare your changes.

    • Tip: do not work on your main branch directly, it may become complicated when you need to rebase.

    • Tip: give your PR a good name! It will be useful later when you may work on multiple tasks/PRs.

  3. Test your code!

    • It is always good practice to start coding by creating a test case, verifying it breaks with current behavior, and passes with your new changes.

    • Make sure your new tests cover all different edge cases.

    • Make sure all exceptions raised are tested.

    • Make sure all warnings raised are tested.

  4. If your PR is not ready for reviews, but you want to run it on our CI, open a “Draft PR” to let us know you don’t need feedback yet.

  5. When you feel ready for integrating your work, mark your PR “Ready for review”.

    • Your code should be readable and follow the project’s design principles.

    • Make sure all tests are passing and any new code is tested for (coverage!).

    • Make sure you link the GitHub issue to your PR.

    • Make sure any docs for that piece of code are updated, or added.

    • The code should be elegant and simple. No over-engineering or hard-to-read code.

    Do your best but don’t sweat about perfection! We do code-review to find any missed items. If you need help, don’t hesitate to ping the core team on the PR.

  6. Use tags in PR name for the following cases:

    • [blocked by #] if your work is dependent on other PRs.

    • [wip] when you start to re-edit your work, mark it so no one will accidentally merge it in the meantime.

Question & Answer
How can I help/contribute?

All types of contributions are welcome - reporting bugs, fixing documentation, adding test cases, solving issues, and preparing bug fixes. To get started with code contributions, look for issues marked with the label good first issue or choose something close to your domain with the label help wanted. Before coding, make sure that the issue description is clear and comment on the issue so that we can assign it to you (or simply self-assign if you can).

Is there a recommendation for branch names?

We recommend you follow this convention <type>/<issue-id>_<short-name> where the types are: bugfix, feature, docs, or tests (but if you are using your own fork that’s optional).

How to add new tests?

We are using pytest with Fine-Tuning Scheduler.

Here is the process to create a new test:

    1. Find a file in tests/ which matches what you want to test. If none exists, create one.

    2. Use this template to get started!

    3. Use BoringModel and derivatives to test out your code.

# TEST SHOULD BE IN YOUR FILE: tests/..../...py
# TEST CODE TEMPLATE
# imports needed by the template (pytest/torch are used if you uncomment the decorator below);
# import BoringModel from the location your tests use (shown here from Lightning's demos module)
import pytest
import torch
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel


# [OPTIONAL] pytest decorator
# @pytest.mark.skipif(not torch.cuda.is_available(), reason="test requires GPU machine")
def test_explain_what_is_being_tested(tmpdir):
    """Test description explaining what behavior is being tested and why."""

    class ExtendedModel(BoringModel):
        ...

    model = ExtendedModel()

    # BoringModel is a functional model. You might want to set methods to None to test your behaviour
    # Example: model.training_step_end = None

    trainer = Trainer(default_root_dir=tmpdir, ...)  # will save everything within a tmpdir generated for this test
    trainer.fit(model)
    trainer.test()  # [OPTIONAL]

    # assert the behaviour is correct
    assert ...

Run our/your test with:

python -m pytest tests/..../...py::test_explain_what_is_being_tested -v --capture=no

Fine-Tuning Scheduler Governance

This document describes governance processes we follow in developing the Fine-Tuning Scheduler.

Persons of Interest

BDFL

Role: All final decisions related to Fine-Tuning Scheduler.

  • Dan Dale (speediedan) (Fine-Tuning Scheduler author)

Releases

Release cadence TBD

Project Management and Decision Making

TBD

API Evolution

For API removal, renaming or other forms of backward-incompatible changes, the procedure is:

  1. A deprecation process is initiated at version X, producing warning messages at runtime and in the documentation.

  2. Calls to the deprecated API remain unchanged in their function during the deprecation phase.

  3. Two minor versions later, at version X+2, the breaking change takes effect.

The “X+2” rule is a recommendation and not a strict requirement. Longer deprecation cycles may apply for some cases.

New API and features are declared as:

  • Experimental: Anything labelled as experimental or beta in the documentation is considered unstable and should not be used in production. The community is encouraged to test the feature and report issues directly on GitHub.

  • Stable: Everything not specifically labelled as experimental should be considered stable. Reported issues will be treated with priority.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[2.1.0] - 2023-10-12

[2.1.0] - Added
  • Support for Lightning and PyTorch 2.1.0

  • Support for Python 3.11

  • Support for simplified scheduled FSDP training with PyTorch >= 2.1.0 and use_orig_params set to True

  • Unified different FSDP use_orig_params mode code-paths to support saving/restoring full, consolidated OSD (PyTorch versions >= 2.0.0)

  • added support for FSDP activation_checkpointing_policy and updated FSDP profiling examples accordingly

  • added support for CustomPolicy and new implementation of ModuleWrapPolicy with FSDP 2.1.0

[2.1.0] - Changed
  • FSDP profiling examples now use a patched version of FSDPStrategy to avoid https://github.com/omni-us/jsonargparse/issues/337 with jsonargparse < 4.23.1

[2.1.0] - Fixed
  • updated validate_min_wrap_condition to avoid overly restrictive validation in some use_orig_params contexts

  • for PyTorch versions < 2.0, when using the FSDP strategy, disabled optimizer state saving/restoration per https://github.com/Lightning-AI/lightning/pull/18296

  • improved fsdp strategy adapter no_decay attribute handling

[2.1.0] - Deprecated
  • FSDPStrategyAdapter now uses the configure_model hook rather than the deprecated configure_sharded_model hook to apply the relevant model wrapping. See https://github.com/Lightning-AI/lightning/pull/18004 for more context regarding configure_sharded_model deprecation.

  • Dropped support for PyTorch 1.11.x.

[2.0.9] - 2023-10-02

  • Support for Lightning 2.0.8 and 2.0.9

[2.0.7] - 2023-08-16

  • Support for Lightning 2.0.7

[2.0.6] - 2023-08-15

  • Support for Lightning 2.0.5 and 2.0.6

[2.0.4] - 2023-06-22

  • Support for PyTorch Lightning 2.0.3 and 2.0.4

  • adjusted default example log name

  • disabled fsdp 1.x mixed precision tests temporarily until https://github.com/Lightning-AI/lightning/pull/17807 is merged

[2.0.2] - 2023-04-06

[2.0.2] - Added
  • Beta support for optimizer reinitialization. Resolves #6

  • Use structural typing for Fine-Tuning Scheduler supported optimizers with ParamGroupAddable

  • Support for jsonargparse version 4.20.1

[2.0.2] - Changed
  • During schedule phase transitions, the latest LR state will be restored before proceeding with the next phase configuration and execution (mostly relevant to lr scheduler and optimizer reinitialization but also improves configuration when restoring best checkpoints across multiple depths)

[2.0.2] - Fixed
  • Allow sharded optimizers ZeroRedundancyOptimizer to be properly reconfigured if necessary in the context of enforce_phase0_params set to True.

[2.0.1] - 2023-04-05

[2.0.1] - Added
  • Support for PyTorch Lightning 2.0.1

  • Lightning support for use_orig_params via (#16733)

[2.0.0] - 2023-03-15

[2.0.0] - Added
  • Support for PyTorch and PyTorch Lightning 2.0.0!

  • New enforce_phase0_params feature. FTS ensures the optimizer configured in configure_optimizers will optimize the parameters (and only those parameters) scheduled to be optimized in phase 0 of the current fine-tuning schedule. (#9)

  • Support for torch.compile

  • Support for numerous new FSDP options including preview support for some FSDP options coming soon to Lightning (e.g. use_orig_params)

  • When using FTS with FSDP, support the use of _FSDPPolicy auto_wrap_policy wrappers (new in PyTorch 2.0.0)

  • Extensive testing for FSDP in many newly supported 2.x contexts (including 1.x FSDP compatibility multi-gpu tests)

  • Support for strategies that do not have a canonical strategy_name but use _strategy_flag

[2.0.0] - Changed
  • Now that the core Lightning package is lightning rather than pytorch-lightning, Fine-Tuning Scheduler (FTS) by default depends upon the lightning package rather than the standalone pytorch-lightning. If you would like to continue to use FTS with the standalone pytorch-lightning package instead, you can still do so (see README). Resolves (#8).

  • Fine-Tuning Scheduler (FTS) major version numbers will align with the rest of the PyTorch ecosystem (e.g. FTS 2.x supports PyTorch and Lightning >= 2.0)

  • Switched to use ruff instead of flake8 for linting

  • Replaced fsdp_optim_view with either fsdp_optim_transform or fsdp_optim_inspect depending on usage context because the transformation is now not always read-only

  • Moved Lightning 1.x examples to legacy subfolder and created new FTS/Lightning 2.x examples in stable subfolder

[2.0.0] - Removed
  • Removed training_epoch_end and validation_epoch_end in accord with Lightning

  • Removed DP strategy support in accord with Lightning

  • Removed support for Python 3.7 and PyTorch 1.10 in accord with Lightning

[2.0.0] - Fixed
  • Adapted loop synchronization during training resume to upstream Lightning changes

[0.4.1] - 2023-03-14

[0.4.1] - Added
  • Support for pytorch-lightning 1.9.4 (which may be the final Lightning 1.x release as PyTorch 2.0 will be released tomorrow)

[0.4.0] - 2023-01-25

[0.4.0] - Added
  • FSDP Scheduled Fine-Tuning is now supported! See the tutorial here.

  • Introduced StrategyAdapters. If you want to extend Fine-Tuning Scheduler (FTS) to use a custom, currently unsupported strategy or override current FTS behavior in the context of a given training strategy, subclassing StrategyAdapter is now a way to do so. See FSDPStrategyAdapter for an example implementation.

  • support for pytorch-lightning 1.9.0

[0.4.0] - Changed
  • decomposed add_optimizer_groups to accommodate the corner case where FTS is being used without an lr scheduler configuration, also cleanup unrequired example testing warning exceptions

  • updated the fts repo issue template

[0.4.0] - Fixed
  • removed PATH adjustments that are no longer necessary due to https://github.com/Lightning-AI/lightning/pull/15485

[0.4.0] - Removed
  • removed references to the finetuning-scheduler conda-forge package (at least temporarily) due to the current unavailability of upstream dependencies (i.e. the pytorch-lightning conda-forge package ). Installation of FTS via pip within a conda env is the recommended installation approach (both in the interim and in general).

[0.3.4] - 2023-01-24

[0.3.4] - Added
  • support for pytorch-lightning 1.8.6

  • Notify the user when max_depth is reached and provide the current training session stopping conditions. Resolves #7.

[0.3.4] - Changed
  • set package version ceilings for the examples requirements along with a note regarding their introduction for stability

  • promoted PL CLI references to top-level package

[0.3.4] - Fixed
  • replaced deprecated Batch object reference with LazyDict

[0.3.3] - 2022-12-09

[0.3.3] - Added
  • support for pytorch-lightning 1.8.4

[0.3.3] - Changed
  • pinned jsonargparse dependency to <4.18.0 until #205 is fixed

[0.3.2] - 2022-11-18

[0.3.2] - Added
  • support for pytorch-lightning 1.8.2

[0.3.1] - 2022-11-10

[0.3.1] - Added
  • support for pytorch-lightning 1.8.1

  • augmented standalone_tests.sh to be more robust to false negatives

[0.3.1] - Changed
  • added temporary expected distutils warning until fixed upstream in PL

  • updated depth type hint to accommodate updated mypy default config

  • bumped full test timeout to be more conservative given a dependent package that is currently slow to install in some contexts (i.e. grpcio on MacOS 11 with python 3.10)

[0.3.0] - 2022-11-04

[0.3.0] - Added
  • support for pytorch-lightning 1.8.0

  • support for python 3.10

  • support for PyTorch 1.13

  • support for ZeroRedundancyOptimizer

[0.3.0] - Fixed
  • call to PL BaseFinetuning.freeze did not properly hand control of BatchNorm module thawing to FTS schedule. Resolves #5.

  • fixed codecov config for azure pipeline gpu-based coverage

[0.3.0] - Changed
  • Refactored unexpected and expected multi-warning checks to use a single test helper function

  • Adjusted multiple FTS imports to adapt to reorganized PL/Lite imports

  • Refactored fts-torch collect_env interface to allow for (slow) collect_env evolution on a per-torch version basis

  • Bumped required jsonargparse version

  • adapted to PL protection of _distributed_available

  • made callback setup stage arg mandatory

  • updated mypy config to align with PL Trainer handling

  • updated dockerfile defs for PyTorch 1.13 and python 3.10

  • updated github actions versions to current versions

  • excluded python 3.10 from torch 1.9 testing due to incompatibility

[0.3.0] - Deprecated
  • removed use of deprecated LightningCLI save_config_overwrite in PL 1.8

[0.2.3] - 2022-10-01

[0.2.3] - Added
  • support for pytorch-lightning 1.7.7

  • add new temporary HF expected warning to examples

  • added HF evaluate dependency for examples

[0.2.3] - Changed
  • Use HF evaluate.load() instead of datasets.load_metric()

[0.2.2] - 2022-09-17

[0.2.2] - Added
  • support for pytorch-lightning 1.7.6

  • added detection of multiple instances of a given callback dependency parent

  • add new expected warning to examples

[0.2.2] - Fixed
  • import fts to workaround pl TypeError via sphinx import, switch to non-TLS pytorch inv object connection due to current certificate issues

[0.2.2] - Changed
  • bumped pytorch dependency in docker image to 1.12.1

[0.2.1] - 2022-08-13

[0.2.1] - Added
  • support for pytorch-lightning 1.7.1

  • added support for ReduceLROnPlateau lr schedulers

  • improved user experience with additional lr scheduler configuration inspection (using an allowlist approach) and enhanced documentation. Expanded use of allow_untested to allow use of unsupported/untested lr schedulers

  • added initial user-configured optimizer state inspection prior to phase 0 execution, issuing warnings to the user if appropriate. Added associated documentation #4

[0.2.1] - Fixed
  • pruned test_examples.py from wheel

[0.2.1] - Changed
  • removed a few unused internal conditions relating to lr scheduler reinitialization and parameter group addition

[0.2.0] - 2022-08-06

[0.2.0] - Added
  • support for pytorch-lightning 1.7.0

  • switched to src-layout project structure

  • increased flexibility of internal package management

  • added a patch to examples to allow them to work with torch 1.12.0 despite issue #80809

  • added sync for test log calls for multi-gpu testing

[0.2.0] - Fixed
  • adjusted runif condition for examples tests

  • minor type annotation stylistic correction to avoid jsonargparse issue fixed in #148

[0.2.0] - Changed
  • streamlined MANIFEST.in directives

  • updated docker image dependencies

  • disable mypy unused ignore warnings due to variable behavior depending on ptl installation method (e.g. pytorch-lightning vs full lightning package)

  • changed full ci testing on mac to use macOS-11 instead of macOS-10.15

  • several type-hint mypy directive updates

  • unpinned protobuf in requirements as no longer necessary

  • updated cuda docker images to use pytorch-lightning 1.7.0, torch 1.12.0 and cuda-11.6

  • refactored mock strategy test to use a different mock strategy

  • updated pyproject.toml with jupytext metadata bypass configuration for nb test cleanup

  • updated ptl external class references for ptl 1.7.0

  • narrowed scope of runif test helper module to only used conditions

  • updated nb tutorial links to point to stable branch of docs

  • unpinned jsonargparse and bumped min version to 4.9.0

  • moved core requirements.txt to requirements/base.txt and update load_requirements and setup to reference lightning meta package

  • update azure pipelines ci to use torch 1.12.0

  • renamed instantiate_registered_class meth to instantiate_class due to ptl 1.7 deprecation of cli registry functionality

[0.2.0] - Deprecated
  • removed ddp2 support

  • removed use of ptl cli registries in examples due to its deprecation

[0.1.8] - 2022-07-13

[0.1.8] - Added
  • enhanced support and testing for lr schedulers with lr_lambdas attributes

  • accept and automatically convert schedules with non-integer phase keys (that are convertible to integers) to integers

[0.1.8] - Fixed
  • pinned jsonargparse to be <= 4.10.1 due to regression with PTL cli with 4.10.2

[0.1.8] - Changed
  • updated PL links for new lightning-ai github urls

  • added a minimum hydra requirement for cli usage (due to omegaconf version incompatibility)

  • separated cli requirements

  • replace closed compound instances of finetuning with the hyphenated compound version fine-tuning in textual contexts. (The way language evolves, fine-tuning will eventually become finetuning but it seems like the research community prefers the hyphenated form for now.)

  • update fine-tuning scheduler logo for hyphenation

  • update strategy resolution in test helper module runif

[0.1.8] - Deprecated

[0.1.7] - 2022-06-10

[0.1.7] - Fixed
  • bump omegaconf version requirement in examples reqs (in addition to extra reqs) due to omegaconf bug

[0.1.7] - Added
[0.1.7] - Changed
[0.1.7] - Deprecated

[0.1.6] - 2022-06-10

[0.1.6] - Added
  • Enable use of untested strategies with new flag and user warning

  • Update various dependency minimum versions

  • Minor example logging update

[0.1.6] - Fixed
  • minor privacy policy link update

  • bump omegaconf version requirement due to omegaconf bug

[0.1.6] - Changed
[0.1.6] - Deprecated

[0.1.5] - 2022-06-02

[0.1.5] - Added
  • Bumped latest tested PL patch version to 1.6.4

  • Added basic notebook-based example tests a new ipynb-specific extra

  • Updated docker definitions

  • Extended multi-gpu testing to include both oldest and latest supported PyTorch versions

  • Enhanced requirements parsing functionality

[0.1.5] - Fixed
  • cleaned up acknowledged warnings in multi-gpu example testing

[0.1.5] - Changed
[0.1.5] - Deprecated

[0.1.4] - 2022-05-24

[0.1.4] - Added
  • Added LR scheduler reinitialization functionality (#2)

  • Added advanced usage documentation

  • Added advanced scheduling examples

  • added notebook-based tutorial link

  • enhanced cli-based example hparam logging among other code clarifications

[0.1.4] - Changed
[0.1.4] - Fixed
  • addressed URI length limit for custom badge

  • allow new deberta fast tokenizer conversion warning for transformers >= 4.19

[0.1.4] - Deprecated

[0.1.3] - 2022-05-04

[0.1.3] - Added
[0.1.3] - Changed
  • bumped latest tested PL patch version to 1.6.3

[0.1.3] - Fixed
[0.1.3] - Deprecated

[0.1.2] - 2022-04-27

[0.1.2] - Added
  • added multiple badges (docker, conda, zenodo)

  • added build status matrix to readme

[0.1.2] - Changed
  • bumped latest tested PL patch version to 1.6.2

  • updated citation cff configuration to include all version metadata

  • removed tag-based trigger for azure-pipelines multi-gpu job

[0.1.2] - Fixed
[0.1.2] - Deprecated

[0.1.1] - 2022-04-15

[0.1.1] - Added
  • added conda-forge package

  • added docker release and pypi workflows

  • additional badges for readme, testing enhancements for oldest/newest pl patch versions

[0.1.1] - Changed
  • bumped latest tested PL patch version to 1.6.1, CLI example depends on PL logger fix (#12609)

[0.1.1] - Deprecated
[0.1.1] - Fixed
  • Addressed version prefix issue with readme transformation for pypi

[0.1.0] - 2022-04-07

[0.1.0] - Added
  • None (initial release)

[0.1.0] - Changed
  • None (initial release)

[0.1.0] - Deprecated
  • None (initial release)

[0.1.0] - Fixed
  • None (initial release)
