
What is: Discriminative Fine-Tuning?

Source: Universal Language Model Fine-tuning for Text Classification
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

Discriminative Fine-Tuning is a fine-tuning strategy used for ULMFiT-type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate. For context, the regular stochastic gradient descent (SGD) update of a model's parameters $\theta$ at time step $t$ looks like the following (Ruder, 2016):

$$\theta_{t} = \theta_{t-1} - \eta \cdot \nabla_{\theta} J(\theta)$$

where $\eta$ is the learning rate and $\nabla_{\theta} J(\theta)$ is the gradient with regard to the model's objective function. For discriminative fine-tuning, we split the parameters $\theta$ into $\{\theta_{1}, \ldots, \theta_{L}\}$, where $\theta_{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain $\{\eta_{1}, \ldots, \eta_{L}\}$, where $\eta_{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:

$$\theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l} \cdot \nabla_{\theta^{l}} J(\theta)$$
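In a framework like PyTorch, this per-layer update can be expressed by giving the optimizer one parameter group per layer, each with its own learning rate $\eta^{l}$. The sketch below is only illustrative and not the paper's code; the model, layer sizes, and learning-rate values are made-up assumptions.

```python
import torch
import torch.nn as nn

# A small illustrative model: each nn.Linear stands in for one "layer" l.
model = nn.Sequential(
    nn.Linear(100, 50),  # layer 1 (lowest)
    nn.ReLU(),
    nn.Linear(50, 50),   # layer 2
    nn.ReLU(),
    nn.Linear(50, 10),   # layer 3 (last)
)

# One parameter group per layer, each with its own learning rate eta^l
# (the values here are arbitrary placeholders).
param_groups = [
    {"params": model[0].parameters(), "lr": 1e-4},  # eta^1
    {"params": model[2].parameters(), "lr": 1e-3},  # eta^2
    {"params": model[4].parameters(), "lr": 1e-2},  # eta^3 (last layer)
]

# SGD then applies theta^l_t = theta^l_{t-1} - eta^l * grad separately per group.
optimizer = torch.optim.SGD(param_groups, lr=1e-3)
```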

The authors found empirically that it worked well to first choose the learning rate $\eta^{L}$ of the last layer by fine-tuning only the last layer, and then to use $\eta^{l-1} = \eta^{l} / 2.6$ as the learning rate for the lower layers.
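As a rough sketch of this rule (the function name and example values below are illustrative assumptions, not from the paper), the per-layer learning rates can be generated from the last layer's rate like this:

```python
def discriminative_lrs(last_layer_lr: float, num_layers: int, factor: float = 2.6):
    """Return [eta^1, ..., eta^L] with eta^{l-1} = eta^l / factor."""
    lrs = [last_layer_lr]
    for _ in range(num_layers - 1):
        lrs.append(lrs[-1] / factor)
    return list(reversed(lrs))  # lowest layer first

# Example: last-layer learning rate chosen as 0.01 for a 3-layer model.
print(discriminative_lrs(0.01, 3))  # approx. [0.00148, 0.00385, 0.01]
```

The resulting list could then be plugged into per-layer optimizer parameter groups such as the ones shown above.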