Model Fitting¶
In this notebook, we will perform model fitting and collect metrics for evaluating the performance of our fitted model.
We are fine-tuning a MiniLM-L6 model. This model generates embeddings that capture information about passages of text, and can be used for various NLP tasks. This model is much lighter-weight (i.e., has fewer parameters) than other transformer models well-suited to the same task, but still delivers good quality performance. A helpful comparison of model architecture performance can be found here. This comparison was created by the authors of this (and several other) models uploaded to the Hugging Face model repository.
First, let's import the libraries that will be required for this notebook.
We will use code from our custom package for this project, myutilpy. Specifically, we will utilize model implementations and utility functions from the myutilpy.models module. All of the source code is available in this repository.
import numpy as np
import pandas as pd
import yaml
import multiprocessing
import gc
import warnings
import torch
import lightning.pytorch as pl
from torch.utils.data import DataLoader
from lightning.pytorch.callbacks import RichProgressBar, ModelCheckpoint
from lightning.pytorch.loggers import TensorBoardLogger, CSVLogger
from lightning.pytorch.utilities.model_summary import ModelSummary
from transformers import AutoModel, AutoTokenizer
from datasets import load_from_disk
from myutilpy.models.text_regressor import TextRegressor, LitTextRegressor
from myutilpy.models.pooling import pooling_fns, pool_cls, pool_mean
Before we move on, let's silence a few inconsequential, known warning messages that would otherwise clutter our cell output.
warnings.filterwarnings("ignore", ".*does not have many workers.*")
warnings.filterwarnings("ignore", ".*Detected KeyboardInterrupt, attempting graceful shutdown.*")
warnings.filterwarnings("ignore", ".*Only `best_model_path` will be reloaded.*")
Configurations¶
Next, let’s do some setup. We will load the associated configurations for the desired experiment.
config_id = "mlml6_rate_pred_clsp"
num_cores_avail = max(1, multiprocessing.cpu_count() - 1)
with open(f"../experiments/configs/{config_id}/main.yaml", 'r') as f:
main_config = yaml.safe_load(f)
with open(f"../experiments/configs/{config_id}/model.yaml", 'r') as f:
model_config = yaml.safe_load(f)
dataset_checkpoint = main_config["dataset_checkpoint"]
dataset_checkpoint_revision = main_config["dataset_checkpoint_revision"]
pt_model_checkpoint = main_config["pt_model_checkpoint"]
pt_model_checkpoint_revision = main_config["pt_model_checkpoint_revision"]
dataset_id = main_config["dataset_id"]
frozen_model_checkpoint_path = model_config["frozen_model_checkpoint_path"]
finetune_model_checkpoint_path = model_config["finetune_model_checkpoint_path"]
model_seed = model_config["model_seed"]
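Typos or missing keys in these YAML files would only surface later in the notebook, so a quick sanity check can save some debugging time. The snippet below is a small sketch of my own (not part of the original experiment code) that simply asserts that every key accessed later in this notebook is present in the loaded configs.
# Sanity check (illustrative sketch, not part of the experiment code): verify that
# the loaded configs contain every key this notebook accesses below.
required_main_keys = {
    "dataset_checkpoint", "dataset_checkpoint_revision",
    "pt_model_checkpoint", "pt_model_checkpoint_revision", "dataset_id",
}
required_model_keys = {
    "frozen_model_checkpoint_path", "finetune_model_checkpoint_path",
    "model_seed", "pooling",
}
missing = (required_main_keys - set(main_config)) | (required_model_keys - set(model_config))
assert not missing, f"Missing config keys: {missing}"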
Base model, tokenizer, and dataset¶
Let's load in our base embedding model, tokenizer, and preprocessed dataset.
embedding_model = AutoModel.from_pretrained(
pt_model_checkpoint,
revision=pt_model_checkpoint_revision
)
tokenizer = AutoTokenizer.from_pretrained(
pt_model_checkpoint,
revision=pt_model_checkpoint_revision
)
datasets = load_from_disk(f"../data/pitchfork/{dataset_id}/dataset")
Tokenization¶
We only need a subset of the dataset columns for our model-fitting. Our tokenizer will output the "input_ids" and "attention_mask" columns. Strictly speaking, we only really need the "input_ids", "attention_mask", and "rating" columns, but keeping additional columns can be helpful for debugging and development purposes when performance is not a key concern.
keeper_cols = ["artist", "album", "year_released", "rating", "input_ids", "attention_mask"]
drop_cols = set(datasets["train"].column_names).difference(set(keeper_cols))
Remember how we found that most reviews were longer than the maximum sequence length our model could handle? Here, we will use the simplest solution available and truncate the review text. There are more sophisticated approaches we could take, but we will leave that as a possible extension to the model.
tokenized_datasets = (
datasets
.map(lambda examples: tokenizer(examples["review"], padding=True, truncation=True), batched=True, num_proc=num_cores_avail)
.remove_columns(drop_cols)
)
DataLoaders¶
Let's set up the DataLoader objects that will be used for fitting and evaluating our model.
Our first task will be to define a collation function whose job it is to organize our batched examples into tensors that are ready to be passed to the model.
def collate_reviews(batch):
    # Extract the token ids, attention masks, and rating labels from the batch
    input_ids = [item['input_ids'] for item in batch]
    attention_masks = [item['attention_mask'] for item in batch]
    ratings = [item['rating'] for item in batch]
    # Convert each list of examples into a single tensor
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    ratings = torch.tensor(ratings)
    return input_ids, attention_masks, ratings
Next, let's instantiate the DataLoader objects. We also define our batch size, since this is required to instantiate a DataLoader. Notice the commented-out code below, which builds DataLoaders over small random subsets of the data; these are handy for quick development runs.
batch_size = 16
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=batch_size, collate_fn=collate_reviews, shuffle=True)
valid_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=batch_size, collate_fn=collate_reviews)
test_dataloader = DataLoader(tokenized_datasets["test"], batch_size=batch_size, collate_fn=collate_reviews)
# # Random subsets for quick development
# train_dataloader = DataLoader(tokenized_datasets["train"].shuffle(seed=42).select(range(1500)), batch_size=batch_size, collate_fn=collate_reviews, shuffle=True)
# valid_dataloader = DataLoader(tokenized_datasets["validation"].shuffle(seed=42).select(range(1500)), batch_size=batch_size, collate_fn=collate_reviews)
# test_dataloader = DataLoader(tokenized_datasets["test"].shuffle(seed=42).select(range(1500)), batch_size=batch_size, collate_fn=collate_reviews)
Before moving on, let's do a quick spot check to make sure that our forward pass and pooling code are outputting tensors of correct dimensionality.
for batch_idx, batch in enumerate(valid_dataloader):
    input_ids, attention_masks, ratings = batch
    break
with torch.no_grad():
    embedding = embedding_model(input_ids=input_ids, attention_mask=attention_masks).last_hidden_state
    mp_embedding = pool_mean(embedding, attention_masks)
    cp_embedding = pool_cls(embedding, attention_masks)
print(batch_size, embedding_model.config.hidden_size)
print(*mp_embedding.shape)
print(*cp_embedding.shape)
16 384
16 384
16 384
Everything looks good in terms of dimensionality. Our pooled embeddings are outputting tensors of shape (batch size, embedding dimension).
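For intuition, here is roughly what the two pooling strategies do. This is an illustrative sketch of my own; the actual implementations live in myutilpy.models.pooling and may differ in their details.
# Illustrative sketches only -- see myutilpy.models.pooling for the real functions.
def sketch_pool_cls(last_hidden_state, attention_mask):
    # Use the embedding of the first ("[CLS]") token of each sequence.
    return last_hidden_state[:, 0, :]

def sketch_pool_mean(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padded positions via the attention mask.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts
Both return a tensor of shape (batch size, embedding dimension), which is exactly what the dimensionality check above verified.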
Setup for fitting¶
Now, let's prepare to fit our model. We are fine-tuning a pre-trained model and will follow the common pattern of first freezing the base model's parameters and training only the new model head, then un-freezing the base model's parameters and training the full model with a low learning rate. This is also why we will define two epoch variables below, one for each stage.
Here, we will perform relatively few epochs of training in both stages for a couple of reasons. First, this code is not being executed in a cloud environment with large amounts of compute, so we will try to keep things manageable. Second, given the dataset size and the relatively small number of parameters in the regression model head, the first (frozen) stage likely does not need many epochs to reach good performance. Finally, when fine-tuning the full model, overfitting often sets in within just a few epochs because of how powerful the pre-trained base model typically is.
Let's define our epoch counts and also define the device that our model will run on.
accelerator = "gpu" if torch.cuda.is_available() else "cpu"
frozen_epochs = 20
finetune_epochs = 10
Let's also set a random seed for model fitting so that we can reproduce our results.
pl.seed_everything(seed=model_seed)
Global seed set to 42
42
Frozen fitting¶
Frozen trainer¶
For this project, we will be using the Lightning library. It provides many helpful utilities and takes away a lot of the boilerplate programming work that can slow down model development in PyTorch. For details on how Lightning is used to help with our model fitting and evaluation code, check out the myutilpy.models.text_regressor module in the source code. For the purposes of this notebook, we only really need to set up the Trainer object. The trainer will handle model training and evaluation for us during the fitting process.
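To give a sense of its structure without reproducing the full module, here is a minimal sketch of what a LightningModule like LitTextRegressor might look like. The method bodies, attribute names, and optimizer choice below are assumptions on my part; the authoritative implementation is in myutilpy.models.text_regressor.
# Minimal sketch of a LightningModule for this task (illustrative only; not the
# actual LitTextRegressor implementation).
import torch.nn.functional as F

class SketchLitTextRegressor(pl.LightningModule):
    def __init__(self, text_regressor, lr=1e-3):
        super().__init__()
        self.text_regressor = text_regressor
        self.lr = lr

    def training_step(self, batch, batch_idx):
        input_ids, attention_masks, ratings = batch
        preds = self.text_regressor(input_ids, attention_masks).squeeze(-1)
        loss = F.mse_loss(preds, ratings.float())
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids, attention_masks, ratings = batch
        preds = self.text_regressor(input_ids, attention_masks).squeeze(-1)
        # Logging with on_epoch=True yields an epoch-averaged validation loss
        # that a checkpoint callback can monitor.
        self.log("avg_val_loss", F.mse_loss(preds, ratings.float()), on_epoch=True)

    def configure_optimizers(self):
        # Only parameters with requires_grad=True are updated, which is what makes
        # the freeze/unfreeze pattern used later in this notebook work.
        trainable = (p for p in self.parameters() if p.requires_grad)
        return torch.optim.AdamW(trainable, lr=self.lr)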
The first step in setting up the trainer is to define our logging and checkpointing utilities. We will log metrics both to a .csv file and for TensorBoard visualization. While the TensorBoard output will not appear in this notebook's output, it is a very convenient tool for examining model behavior during fitting, and it was used during the development of this notebook. We also want to make sure we retain checkpoints of our model so that we can load them in later. Here, we decide to keep only the checkpoint whose model parameters yielded the minimum validation loss across all training epochs (i.e., the "best" version of the frozen model). Notice that we specify that this checkpointer is only for the "frozen" training epochs.
results_base = f"../experiments/results/{config_id}"
csv_logger = CSVLogger(results_base, "frozen_lightning_logs")
tb_logger = TensorBoardLogger(results_base, name="frozen_tb_logs")
frozen_model_checkpointer = ModelCheckpoint(
f"{results_base}/frozen_checkpoints/version_{csv_logger.version}",
filename="checkpoint",
monitor="avg_val_loss",
mode="min",
save_top_k=1
)
loggers = [csv_logger, tb_logger]
callbacks = [frozen_model_checkpointer, RichProgressBar()]
Our next step is to instantiate our frozen Trainer object. Notice that we also make a (potential) modification to the number of frozen training epochs. The reason is that a saved checkpoint records the epoch at which it was written out. Since this notebook was already run behind the scenes (like a cooking show where all of the ingredients are always magically prepped!) and the best version of the frozen model was already saved out, we don't want to resume training from where we left off if the best version of the model was found before the final epoch, which would happen if overfitting set in before the final epoch. So, we update max_epochs to the number of epochs it took to find the best frozen model, and the trainer will not attempt to run any additional fitting epochs.
if frozen_model_checkpoint_path is not None:
    checkpoint = torch.load(f"../{frozen_model_checkpoint_path}")
    # Account for zero indexing
    frozen_epochs = checkpoint["epoch"] + 1
frozen_trainer = pl.Trainer(
max_epochs=frozen_epochs,
accelerator=accelerator,
callbacks=callbacks,
precision="16-mixed",
logger=loggers
)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Model¶
Let's now instantiate our full model (base model + regression head) using the TextRegressor class defined in myutilpy.models.text_regressor.
Note that the pooling method we use for this experiment is "CLS pooling" (pooling by extracting the embedding of the starting "[CLS]" token). This is a popular method when using BERT-based models.
tr_model = TextRegressor(
embedding_model,
embed_dim=embedding_model.config.hidden_size,
pooling_fn=pooling_fns[model_config["pooling"]]
)
Freeze parameters¶
The final step is to instantiate our Lightning model, LitTextRegressor, which wraps around our TextRegressor model, and to freeze the base model's parameters. A helper method, LitTextRegressor.freeze_pretrained_model(), has been implemented to freeze the base model parameters and set the learning rate appropriately for frozen training. For full implementation details, see myutilpy.models.text_regressor.
lit_model = LitTextRegressor(tr_model)
lit_model.freeze_pretrained_model(lr=1e-3)
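Conceptually, these helpers boil down to toggling requires_grad on the base model's parameters and swapping the learning rate. The sketch below is my own illustration of that idea; the attribute names and details in the real helper methods may differ.
# Illustrative sketch of the freeze/unfreeze idea (not the actual helper methods).
def sketch_set_base_model_trainable(lit_model, trainable, lr):
    for param in lit_model.text_regressor.embedder.parameters():
        param.requires_grad = trainable
    # Frozen stage: a larger lr is fine because only the small head is trained.
    # Fine-tuning stage: a smaller lr avoids clobbering the pre-trained weights.
    lit_model.lr = lr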
Let's verify that the base model parameters are indeed frozen.
ModelSummary(lit_model, max_depth=2)
  | Name                           | Type          | Params
-----------------------------------------------------------------
0 | text_regressor                 | TextRegressor | 22.7 M
1 | text_regressor.embedder        | BertModel     | 22.7 M
2 | text_regressor.regression_head | Linear        | 385
-----------------------------------------------------------------
385       Trainable params
22.7 M    Non-trainable params
22.7 M    Total params
90.854    Total estimated model params size (MB)
Fitting the frozen model¶
Finally, we fit our model with the base parameters frozen. Again, because the model was already fit behind the scenes, a checkpoint path is used. Apologies that you are denied the satisfaction of watching progress bars reach completion 🙁.
if frozen_model_checkpoint_path is not None:
    print(f"Loading checkpoint from: {frozen_model_checkpoint_path}")
    frozen_trainer.fit(
        model=lit_model,
        ckpt_path=f"../{frozen_model_checkpoint_path}",
        train_dataloaders=train_dataloader,
        val_dataloaders=valid_dataloader
    )
else:
    print("Training from scratch")
    frozen_trainer.fit(
        model=lit_model,
        train_dataloaders=train_dataloader,
        val_dataloaders=valid_dataloader
    )
Restoring states from the checkpoint path at ../experiments/results/mlml6_rate_pred_clsp/frozen_checkpoints/version_0/checkpoint.ckpt
Loading checkpoint from: experiments/results/mlml6_rate_pred_clsp/frozen_checkpoints/version_0/checkpoint.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name           ┃ Type          ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ text_regressor │ TextRegressor │ 22.7 M │
└───┴────────────────┴───────────────┴────────┘
Trainable params: 385
Non-trainable params: 22.7 M
Total params: 22.7 M
Total estimated model params size (MB): 90
Restored all states from the checkpoint at ../experiments/results/mlml6_rate_pred_clsp/frozen_checkpoints/version_0/checkpoint.ckpt
`Trainer.fit` stopped: `max_epochs=20` reached.
Validation data check¶
Before we move on to the next stage, let's see how well the best version of our frozen model performs on our validation set.
best_frozen_checkpoint_path = frozen_trainer.checkpoint_callback.best_model_path
frozen_trainer.test(lit_model, dataloaders=valid_dataloader, ckpt_path=best_frozen_checkpoint_path)
Restoring states from the checkpoint path at /home/carcook/dev/nlp-projects/experiments/results/mlml6_rate_pred_clsp/frozen_checkpoints/version_0/checkpoint.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /home/carcook/dev/nlp-projects/experiments/results/mlml6_rate_pred_clsp/frozen_checkpoints/version_0/checkpoint.ckpt
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       avg_test_loss       │    1.2091145515441895     │
└───────────────────────────┴───────────────────────────┘
[{'avg_test_loss': 1.2091145515441895}]
print(lit_model.test_epoch_metrics)
{'mse': 1.2091149, 'rmse': 1.0995976}
Pretty good! Using a fully frozen pre-trained base model, we have trained a regression model head that predicts ratings to within roughly +/- 1.1 rating "points" (RMSE ≈ 1.1) on our validation set, based only on review text.
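As a quick arithmetic check on the reported metrics (an illustrative one-liner, not part of the original workflow), the RMSE should simply be the square root of the MSE:
# Sanity check: rmse should equal sqrt(mse). Expect this to print True.
metrics = lit_model.test_epoch_metrics
print(np.isclose(metrics["rmse"], np.sqrt(metrics["mse"])))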
Un-frozen fitting (fine-tuning)¶
It is now time for the second stage, where we un-freeze our base model's parameters and perform fine-tuning.
Before we move on, let's clear up some memory on our GPU (if applicable).
# Free GPU memory
del lit_model
gc.collect()
torch.cuda.empty_cache()
Load in best frozen model¶
Let's load in the best checkpoint for our frozen model and use it in the fine-tuning stage.
lit_model = LitTextRegressor.load_from_checkpoint(
    frozen_trainer.checkpoint_callback.best_model_path,
    text_regressor=tr_model
)
Let's now un-freeze the base model's parameters (and set a lower learning rate).
lit_model.unfreeze_pretrained_model(1e-5)
ModelSummary(lit_model, max_depth=2)
  | Name                           | Type          | Params
-----------------------------------------------------------------
0 | text_regressor                 | TextRegressor | 22.7 M
1 | text_regressor.embedder        | BertModel     | 22.7 M
2 | text_regressor.regression_head | Linear        | 385
-----------------------------------------------------------------
22.7 M    Trainable params
0         Non-trainable params
22.7 M    Total params
90.854    Total estimated model params size (MB)
Now, we instantiate our trainer, loggers, and callbacks as in the frozen training section above. Again, we update our maximum epochs argument to be the number of epochs it took to find the best model parameters when the checkpoint was saved out.
results_base = f"../experiments/results/{config_id}"
csv_logger = CSVLogger(results_base, "finetune_lightning_logs")
tb_logger = TensorBoardLogger(results_base, name="finetune_tb_logs")
finetune_model_checkpointer = ModelCheckpoint(
f"{results_base}/finetune_checkpoints/version_{csv_logger.version}",
filename="finetune_checkpoint",
monitor="avg_val_loss",
mode="min",
save_top_k=1
)
loggers = [csv_logger, tb_logger]
callbacks = [finetune_model_checkpointer, RichProgressBar()]
if finetune_model_checkpoint_path is not None:
    checkpoint = torch.load(f"../{finetune_model_checkpoint_path}")
    # Account for zero indexing
    finetune_epochs = checkpoint["epoch"] + 1
finetune_trainer = pl.Trainer(
max_epochs=finetune_epochs,
accelerator=accelerator,
callbacks=callbacks,
precision="16-mixed",
logger=loggers
)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Finally, we perform a fine-tuning fit. Again, since a behind-the-scenes fit has already taken place and we are using a checkpoint, no progress bars or current epoch will show up.
if finetune_model_checkpoint_path is not None:
    print(f"Loading checkpoint from: {finetune_model_checkpoint_path}")
    finetune_trainer.fit(
        model=lit_model,
        ckpt_path=f"../{finetune_model_checkpoint_path}",
        train_dataloaders=train_dataloader,
        val_dataloaders=valid_dataloader
    )
else:
    print("Training from scratch")
    finetune_trainer.fit(
        model=lit_model,
        train_dataloaders=train_dataloader,
        val_dataloaders=valid_dataloader
    )
Restoring states from the checkpoint path at ../experiments/results/mlml6_rate_pred_clsp/finetune_checkpoints/version_0/finetune_checkpoint.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading checkpoint from: experiments/results/mlml6_rate_pred_clsp/finetune_checkpoints/version_0/finetune_checkpoint.ckpt
┏━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name           ┃ Type          ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ text_regressor │ TextRegressor │ 22.7 M │
└───┴────────────────┴───────────────┴────────┘
Trainable params: 22.7 M
Non-trainable params: 0
Total params: 22.7 M
Total estimated model params size (MB): 90
Restored all states from the checkpoint at ../experiments/results/mlml6_rate_pred_clsp/finetune_checkpoints/version_0/finetune_checkpoint.ckpt
`Trainer.fit` stopped: `max_epochs=10` reached.
Validation data check¶
Let's compare the performance of our fine-tuned model to that of our frozen model on the validation data.
best_ft_checkpoint_path = finetune_trainer.checkpoint_callback.best_model_path
finetune_trainer.test(lit_model, dataloaders=valid_dataloader, ckpt_path=best_ft_checkpoint_path)
Restoring states from the checkpoint path at /home/carcook/dev/nlp-projects/experiments/results/mlml6_rate_pred_clsp/finetune_checkpoints/version_0/finetune_checkpoint.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /home/carcook/dev/nlp-projects/experiments/results/mlml6_rate_pred_clsp/finetune_checkpoints/version_0/finetune_checkpoint.ckpt
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       avg_test_loss       │    0.8832166194915771     │
└───────────────────────────┴───────────────────────────┘
[{'avg_test_loss': 0.8832166194915771}]
print(lit_model.test_epoch_metrics)
{'mse': 0.8832167, 'rmse': 0.9397961}
It looks like fine-tuning definitely yielded an improvement!
Test data check¶
To wrap things up, let's take a look at how our fine-tuned model performs on a completely held-out test set. Since we used the validation data to decide which checkpoints to retain in both the frozen and fine-tuning stages, it is better to use an entirely held-out set of examples to estimate model performance.
best_ft_checkpoint_path = finetune_trainer.checkpoint_callback.best_model_path
finetune_trainer.test(lit_model, dataloaders=test_dataloader, ckpt_path=best_ft_checkpoint_path)
Restoring states from the checkpoint path at /home/carcook/dev/nlp-projects/experiments/results/mlml6_rate_pred_clsp/finetune_checkpoints/version_0/finetune_checkpoint.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /home/carcook/dev/nlp-projects/experiments/results/mlml6_rate_pred_clsp/finetune_checkpoints/version_0/finetune_checkpoint.ckpt
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       avg_test_loss       │    0.8020689487457275     │
└───────────────────────────┴───────────────────────────┘
[{'avg_test_loss': 0.8020689487457275}]
print(lit_model.test_epoch_metrics)
{'mse': 0.80206877, 'rmse': 0.8955829}
Again, pretty good! It's important to bear in mind that these values are just estimates of generalizability and will likely vary depending on the train/validation/test splits.
Finally, let's save our results for further analysis.
pred_df = pd.DataFrame(
data={
"y": torch.concat(lit_model.test_epoch_out["y"]).to("cpu").numpy(),
"yhat": torch.concat(lit_model.test_epoch_out["yhat"]).to("cpu").numpy()
}
)
pred_df.to_csv(f"{results_base}/predictions_df.csv", index=False)
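As a brief illustration of how these saved predictions might be consumed downstream (a sketch only; the actual follow-up analysis lives elsewhere), we could reload the file and compute a couple of quick error summaries.
# Illustrative follow-up (not part of the fitting workflow): reload the saved
# predictions and compute quick error summaries.
saved_preds = pd.read_csv(f"{results_base}/predictions_df.csv")
residuals = saved_preds["y"] - saved_preds["yhat"]
print("MAE: ", residuals.abs().mean())
print("RMSE:", np.sqrt((residuals ** 2).mean()))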