Initial Dataset Preparation¶
In this notebook, we will prepare our root dataset. To view the next notebook in the sequence, use the navigation link above or at the bottom of this notebook.
First, let's import the libraries that will be required for this notebook.
Note that myutilpy is a custom package created for this repo; it contains helper code used throughout this sequence of notebooks. Here, we import the myutilpy.data_processing module as dprep and use its utility functions. All of the source code is available in this repository.
import multiprocessing
import yaml
from pathlib import Path
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer
# Custom package for this project
import myutilpy.data_processing as dprep
Configurations¶
Next, let’s do some setup. We will load the associated configurations for the desired experiment.
For this sequence of notebooks, we will be fine-tuning a MiniLM-L6 model. This model generates embeddings that capture information about passages of text and can be used for various NLP tasks. It is much lighter-weight (i.e., has fewer parameters) than other transformer models well-suited to the same tasks, but still delivers strong performance. A helpful comparison of performance across model architectures can be found here; it was created by the authors of this (and several other) models uploaded to the Hugging Face model repository.
config_id = "mlml6_rate_pred_clsp"
num_cores_avail = max(1, multiprocessing.cpu_count() - 1)
Configuration settings are stored in .yaml files in the experiments/configs/ directory.
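For orientation, main.yaml for this experiment defines the handful of fields read below. The values shown here are illustrative placeholders only; the real file lives at experiments/configs/mlml6_rate_pred_clsp/main.yaml.
# Illustrative placeholders only -- not the actual configuration values
dataset_checkpoint: "<Hugging Face dataset repo id for the Pitchfork reviews>"
dataset_checkpoint_revision: "<dataset revision or commit hash>"
pt_model_checkpoint: "<MiniLM-L6 model repo id on the Hugging Face Hub>"
pt_model_checkpoint_revision: "<model revision or commit hash>"
dataset_id: "<identifier for the prepared dataset directory>"
data_seed: 42  # placeholder seed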
with open(f"../experiments/configs/{config_id}/main.yaml", 'r') as f:
main_config = yaml.safe_load(f)
dataset_checkpoint = main_config["dataset_checkpoint"]
dataset_checkpoint_revision = main_config["dataset_checkpoint_revision"]
pt_model_checkpoint = main_config["pt_model_checkpoint"]
pt_model_checkpoint_revision = main_config["pt_model_checkpoint_revision"]
dataset_id = main_config["dataset_id"]
data_seed = main_config["data_seed"]
root_dataset_dir = f"../data/pitchfork/{dataset_id}"
raw_data_cache_dir = f"../data/pitchfork/raw/cache"
Path(raw_data_cache_dir).mkdir(parents=True, exist_ok=True)
Path(root_dataset_dir).mkdir(parents=True, exist_ok=True)
Tokenizer and dataset loading¶
Now, we will load the tokenizer associated with our model of choice.
tokenizer = AutoTokenizer.from_pretrained(
pt_model_checkpoint,
revision=pt_model_checkpoint_revision
)
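As a quick sanity check (not part of the original pipeline), we can probe the tokenizer on a short string; the unk_token_id attribute shown here is also what we will use later to detect unknown tokens.
# Quick tokenizer probe (illustrative only): encode a short string and
# inspect the resulting ids, tokens, and the special "unknown" token
sample_text = "An example review snippet."
encoded = tokenizer(sample_text)
print(encoded.input_ids)
print(tokenizer.convert_ids_to_tokens(encoded.input_ids))
print(tokenizer.unk_token, tokenizer.unk_token_id)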
Let's also download the dataset that we will be using for this project: music reviews scraped from the Pitchfork website. A full description of the dataset can be found in the dataset description card.
# Make sure to specify "reviews.csv" since it will default to album images
raw_datasets = load_dataset(
dataset_checkpoint,
revision=dataset_checkpoint_revision,
data_files=["reviews.csv"],
cache_dir=raw_data_cache_dir
)
Before moving on, let's have a quick look at the dataset summary. Notice that the data do not come pre-split: all rows (observations) are placed in a single "train" split by default.
raw_datasets
DatasetDict({
train: Dataset({
features: ['artist', 'album', 'year_released', 'rating', 'small_text', 'review', 'reviewer', 'genre', 'label', 'reviewed', 'album_art_url'],
num_rows: 25709
})
})
Preprocess raw dataset¶
The first major step is to clean and preprocess the raw data. We will do some exploratory analysis after this step is completed.
dataset = raw_datasets["train"]
Missing data¶
The first filtering step is to exclude rows where the "artist", "album", "review", or "reviewer" fields are not strings (e.g., are None). If we later decide to do any analysis involving these columns, we want to be sure that valid data are present in every row of the prepared dataset.
# The artist, album, review, and reviewer columns should be strings (e.g., should not be None)
dataset = dataset.filter(
lambda examples: dprep.detect_wrong_type_batched(examples, ["artist", "album", "review", "reviewer"], str),
batched=True,
num_proc=num_cores_avail
)
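The detect_wrong_type_batched helper lives in myutilpy.data_processing; a minimal batched predicate playing the same role (returning True for rows whose listed columns are all strings, so that Dataset.filter keeps them) might look something like this hypothetical sketch:
# Hypothetical sketch -- the actual implementation is in myutilpy/data_processing.py.
# Returns one boolean per row in the batch: True if every listed column holds a
# value of the expected type, so that Dataset.filter keeps only those rows.
def keep_rows_with_expected_type_batched(examples, columns, expected_type):
    n_rows = len(examples[columns[0]])
    return [
        all(isinstance(examples[col][i], expected_type) for col in columns)
        for i in range(n_rows)
    ]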
We see that this step filtered out a sizable chunk of the data: roughly 2,700 rows.
dataset
Dataset({
features: ['artist', 'album', 'year_released', 'rating', 'small_text', 'review', 'reviewer', 'genre', 'label', 'reviewed', 'album_art_url'],
num_rows: 23034
})
Duplicates¶
One common issue with datasets scraped from the web is that, along with missing values, they may contain duplicate rows. Fortunately, our dataset is small enough that we can use built-in pandas functionality to drop duplicate rows.
dataset = Dataset.from_pandas(
dataset.to_pandas().drop_duplicates().reset_index(drop=True)
)
We see that nearly 1,000 duplicate rows were dropped in this step.
dataset
Dataset({
features: ['artist', 'album', 'year_released', 'rating', 'small_text', 'review', 'reviewer', 'genre', 'label', 'reviewed', 'album_art_url'],
num_rows: 22063
})
Unknown tokens¶
Next, we will attempt to minimize the number of "unknown" tokens that find their way into our dataset. The MiniLM-L6 tokenizer maps words/characters that did not appear in its training vocabulary to an [UNK] token. This is not a huge issue in general, but it can degrade performance if it occurs frequently. For this reason, we will replace common characters in our data that map to the [UNK] token; for example, we will replace the '“' character with '"' and the '♡' character with 'heart'. For a full list of replaced characters (or sequences of characters), see myutilpy/data_processing.py in the project source code.
blacklist_pattern = dprep.get_blacklist_pattern(dataset_id)
# Replace known "unk" tokens
dataset = dataset.map(
lambda examples: dprep.replace_known_unk_tokens_batched(examples, ["artist", "album", "review", "reviewer"], blacklist_pattern),
batched=True,
num_proc=num_cores_avail
)
Map (num_proc=15): 0%| | 0/22063 [00:00<?, ? examples/s]
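To give a sense of what this step involves, a stripped-down version of the replacement logic might look like the sketch below. The character map and function name here are hypothetical and abbreviated; the full list of replacements lives in myutilpy/data_processing.py.
import re

# Hypothetical, abbreviated replacement map -- the real list is much longer
known_unk_replacements = {
    "\u201c": '"',      # left curly double quote -> straight quote
    "\u201d": '"',      # right curly double quote -> straight quote
    "\u2661": "heart",  # white heart suit (♡) -> the word "heart"
}
known_unk_pattern = re.compile(
    "|".join(re.escape(k) for k in known_unk_replacements)
)

def replace_known_unk_chars_batched(examples, columns):
    # Apply the substitutions to every listed text column in the batch
    for col in columns:
        examples[col] = [
            known_unk_pattern.sub(lambda m: known_unk_replacements[m.group(0)], text)
            for text in examples[col]
        ]
    return examples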
Let's check how many rows still contain unknown tokens in the "review" column. Note that many of the "review" entries exceed the model's maximum sequence length, which is what triggers the tokenizer warnings below; more on this later.
dataset_leftover = dataset.filter(
lambda examples: dprep.detect_unk_batched(examples, ["review"], tokenizer),
batched=True,
num_proc=num_cores_avail
)
Filter (num_proc=15): 0%| | 0/22063 [00:00<?, ? examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (660 > 512). Running this sequence through the model will result in indexing errors
(The same warning is repeated for many other long reviews, e.g., 721 > 512, 852 > 512, 924 > 512.)
Fortunately, only a handful of rows (48) still contain unknown tokens.
dataset_leftover
Dataset({
features: ['artist', 'album', 'year_released', 'rating', 'small_text', 'review', 'reviewer', 'genre', 'label', 'reviewed', 'album_art_url'],
num_rows: 48
})
Closer look¶
Let's take a closer look at which tokens are still mapped to [UNK].
unk_tokens = set()
for i in range(len(dataset_leftover)):
text = dataset_leftover[i]["review"]
inputs = tokenizer(text, return_offsets_mapping=True)
ids = inputs.input_ids
offsets = inputs.offset_mapping
for j, id in enumerate(ids):
if id == tokenizer.unk_token_id:
unk_tokens.add(text[offsets[j][0]: offsets[j][1]])
Token indices sequence length is longer than the specified maximum sequence length for this model (773 > 512). Running this sequence through the model will result in indexing errors
print(*unk_tokens)
개꿈 佛 うたのきしゃ 先 霊 emphatic¸ ♑ 敗 七 冥 愚 閃 玉 想 □ 音 所 靈 蒸 ‽ 绿 观 ƚI 界 戰 卡 節 轉 dᴉlɟ 偉 乱 去 駭 共 狗 36℃ 夢 者 燕 詩 14℃ 只 ؟ 10℃ YTI⅃AƎЯ ♈ ◕ 印 옛날이야기 會 ዘላለም 兰 疊 鬼 物 💯 傍 剣 ɪᴍᴘᴀᴄᴛ21 指 ¯ ❀ 縞 浴 ƨbnƎ ⌘v 殺 蛰 ☕ 制 怕 奏 茶 過 ☽ 박혜진 念 吸 九 観 惊 曜 希 ゾット 重 害 來 呼 隠 波 象 。 Ⓡ 市 廁 0℃ 17℃ 幽 與 苑 客 ˂stranger˃ 縦 矮 ✓ ⌘
print(dataset_leftover["artist"])
['Lucy Liyou', 'Mark Barrott', 'Tzusing', 'Lucinda Chua', 'otay:onii', 'Two Shell', 'Bill Callahan', 'Sam Gendel', 'Willow', 'death’s dynamic shroud', '4s4ki', 'Tatsuro Yamashita', 'Two Shell', 'Whatever the Weather', 'Pan Daijing', 'JPEGMAFIA', 'Yikii', '박혜진 Park Hye Jin', 'Pan Daijing', 'Jusell, Prymek, Sage, Shiroishi', 'Rian Treanor', '박혜진 Park Hye Jin', 'Okkyung Lee', 'Gong Gong Gong 工工工', 'Fire-Toolz', 'Brian Eno', 'BTS', 'HARAM', 'RRUCCULLA', 'George Clanton', 'Fire-Toolz', 'Meuko! Meuko!', 'BTS', 'Mukqs', 'Guided by Voices', 'Varg2TM', 'Grandaddy', 'Toyomu', 'Mikael Seifu', 'Especia', 'Creepoid', 'Kosmo Kat', 'TV on the Radio', 'Lee', 'Ryan Hemsworth', 'Javelin', 'The Soft Moon', 'Pit Er Pat']
print(dataset_leftover["album"])
['Dog Dreams (개꿈)', 'Jōhatsu (蒸発)', '绿帽 Green Hat', 'YIAN', '夢之駭客 Dream Hacker', 'lil spirits', 'YTI⅃AƎЯ', 'Blueblue', '<CopingMechanism>', 'Darklife', 'Killer in Neverland', 'Softly', 'Icons EP', 'Whatever the Weather', 'Tissues', 'LP!', 'Crimson Poem', 'Before I Die', 'Jade 玉观音', 'Fuubutsushi (風物詩)', 'File Under UK Metaplasm', 'How can I', 'Yeo\u200b-\u200bNeun', 'Phantom Rhythm 幽靈節奏 (幽霊リズム)', 'Field Whispers (Into the Crystal Palace)', 'Apollo: Atmospheres & Soundtracks - Extended Edition', 'MAP OF THE SOUL : PERSONA', 'وين كنيت بي 11\u200b/\u200b9؟? “Where Were You on 9\u200b/\u200b11\u200b?\u200b” EP', 'SHuSH', 'Slide', 'Skinless X-1', '鬼島 Ghost Island EP', 'Love Yourself 轉 ‘Tear’', '起き上がり', 'August by Cake', 'Nordic Flora Series Pt. 3: Gore-Tex City', 'Last Place', '印象III : なんとなく、パブロ (Imagining “The Life of Pablo”)', 'Zelalem', 'Carta', 'Cemetery Highrise Slum', 'Square EP', 'Seeds', 'TANHÂ', 'Still Awake EP', 'Hi Beams', 'Zeros', 'High Time']
It appears that many of the remaining unknown characters occur within "artist" or "album" names, and these names naturally also appear within the body of the review. Fortunately, we are not directly embedding artist or album names (aside from their occurrences within the "review" text) when making predictions with our model, so we can move on.
Analysis prep¶
Let's prepare a summary of our dataset that will be useful for conducting some exploratory analysis before we fit our model.
Token counts¶
When we do exploratory analysis of data characteristics and modeling results, we may want to know the number of tokens that appeared in each review. Let's add that column to the data.
dataset = dataset.map(
lambda examples: dprep.get_n_tokens_batched(examples, "review", tokenizer),
batched=True,
num_proc=num_cores_avail
)
Map (num_proc=15): 0%| | 0/22063 [00:00<?, ? examples/s]
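The counting itself is simple; a hypothetical batched version of the helper (the actual one is get_n_tokens_batched in myutilpy.data_processing) could look like this:
# Hypothetical sketch of a batched token counter for a single text column
def count_tokens_batched(examples, column, tokenizer):
    # Tokenize the whole batch at once and record the length of each sequence
    return {
        f"{column}_n_tokens": [
            len(ids) for ids in tokenizer(examples[column]).input_ids
        ]
    }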
Collect summary features into dataframe¶
Let's compile the important columns for exploratory analysis into a summary dataset and convert it to a dataframe, summary_dataset_df, for subsequent analysis.
summary_dataset_df = (
dataset
.remove_columns(["year_released", "small_text", "album_art_url", "review"])
.to_pandas()
)
Split data¶
Finally, we want to prepare our data for model fitting by breaking it up into train, validation, and test sets.
Let's go with a 70-15-15 train-validation-test split.
- 70% for training is solid for fine-tuning.
- 15% each for validation and test gives reliably sized held-out sets for monitoring overfitting during training and for final evaluation.
- For a smaller dataset or a simpler model, a split with larger held-out fractions (e.g., 60-20-20) could be worth considering.
# First, split the dataset into "train" and "test" where "test" will be used to
# build the true "validation" and "test" splits
datasets = dataset.train_test_split(test_size=0.3, seed=data_seed)
# Now, split the temp dataset into validation and test sets
datasets_val_test = datasets.pop("test").train_test_split(test_size=0.5, seed=data_seed)
datasets["validation"] = datasets_val_test.pop("train")
datasets["test"] = datasets_val_test.pop("test")
Let's look at the outputs of splitting our dataset.
datasets
DatasetDict({
train: Dataset({
features: ['artist', 'album', 'year_released', 'rating', 'small_text', 'review', 'reviewer', 'genre', 'label', 'reviewed', 'album_art_url', 'review_n_tokens'],
num_rows: 15444
})
validation: Dataset({
features: ['artist', 'album', 'year_released', 'rating', 'small_text', 'review', 'reviewer', 'genre', 'label', 'reviewed', 'album_art_url', 'review_n_tokens'],
num_rows: 3309
})
test: Dataset({
features: ['artist', 'album', 'year_released', 'rating', 'small_text', 'review', 'reviewer', 'genre', 'label', 'reviewed', 'album_art_url', 'review_n_tokens'],
num_rows: 3310
})
})
Save out data¶
To wrap up, let's write the summary dataframe and the split dataset out to disk.
summary_dataset_df.to_csv(f"{root_dataset_dir}/summary_df.csv", index=False)
datasets.save_to_disk(f"{root_dataset_dir}/dataset")
Saving the dataset (0/1 shards): 0%| | 0/15444 [00:00<?, ? examples/s]
Saving the dataset (0/1 shards): 0%| | 0/3309 [00:00<?, ? examples/s]
Saving the dataset (0/1 shards): 0%| | 0/3310 [00:00<?, ? examples/s]
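In the next notebook, these artifacts can be read back in with datasets.load_from_disk and pandas; a minimal sketch, assuming the same root_dataset_dir, is shown below.
import pandas as pd
from datasets import load_from_disk

# Reload the saved splits and the summary dataframe
datasets = load_from_disk(f"{root_dataset_dir}/dataset")
summary_dataset_df = pd.read_csv(f"{root_dataset_dir}/summary_df.csv")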