
Exploratory dataset analysis

In this notebook, we will conduct some high-level exploratory analysis of our dataset before fitting our model. Because tidyverse syntax is simple and readable, and ggplot2 and dplyr offer convenient features for this kind of work, we will use R and the tidyverse for this analysis.

First, let's import the libraries that will be required for this notebook.

In [1]:
library(tidyverse)
library(yaml)
library(coda)
── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Configurations

Before we do our analysis, let's make sure we are retrieving the configurations for the appropriate experiment.

In [2]:
config_id <- "mlml6_rate_pred_clsp"
In [3]:
main_config <- yaml.load_file(paste0("../experiments/configs/", config_id, "/main.yaml"))
dataset_id <- main_config$dataset_id

Load data

Let's load in the summary dataframe we saved out previously.

In [4]:
df <- read_csv(paste0("../data/pitchfork/", dataset_id, "/summary_df.csv"))
Rows: 22063 Columns: 8
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (6): artist, album, reviewer, genre, label, reviewed
dbl (2): rating, review_n_tokens

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Before moving on, we should do a quick check to make sure the data look like we expect them to.

In [5]:
df %>% head()
A tibble: 6 × 8
  artist                 album                     rating reviewer             genre               label                reviewed      review_n_tokens
  <chr>                  <chr>                      <dbl> <chr>                <chr>               <chr>                <chr>                   <dbl>
  Joni Mitchell          Joni Mitchell at Newport     6.0 Grayson Haver Currin Rock                Rhino                July 29, 2023             721
  The Clientele          I Am Not There Anymore       8.0 Marc Hogan           Rock                Merge                July 28, 2023             718
  Nas                    Magic 2                      6.5 Dylan Green          Rap                 Mass Appeal          July 28, 2023             736
  Beverly Glenn-Copeland The Ones Ahead               8.0 Jesse Dorris         Experimental        Transgressive        July 28, 2023             704
  Damon Locks            New Future City Radio        6.9 Matthew Blackwell    Jazz / Experimental International Anthem July 28, 2023             797
  DJ K                   PANICO NO SUBMUNDO           7.9 Nadine Smith         Experimental        Nyege Nyege Tapes    July 27, 2023             682

Exploratory analysis

We will do some exploratory analysis of our dataset, both to make sure that the preprocessing yielded a high-quality dataset and to get a sense of some general characteristics of the data.

In [6]:
# Settings for plots
options(repr.plot.width = 16, repr.plot.height = 8, repr.plot.res = 100)
In [7]:
# Helpers to avoid redundant code
set_theme_elements <- function(plt) {
    plt +
        theme_classic() +
        theme(text = element_text(size=16))
}

Spot checks

First we want to do some basic quality checks to make sure our preprocessing was successful.

Duplicate check

Let's verify that our removal of duplicates was successful, since we do not want any duplicates present when we fine-tune our model. We will check that no combination of "album", "artist", "rating", "reviewer", and "reviewed" is repeated. An album may legitimately be reviewed multiple times - review platforms are known to "revisit" material - but if the same album by a given artist received the same rating from the same reviewer on the same date as a pre-existing review, that would point to an issue that got past us in our preprocessing step.

In [8]:
df %>%
    add_count(album, artist, rating, reviewer, reviewed) %>%
    filter(n > 1) %>%
    nrow()
0

It seems that we were able to successfully remove duplicate rows from our data.

Key investigations

Distribution of token counts

The first item of interest is the distribution of "review" token counts. Our model has a maximum sequence length of 512 tokens, and we noticed that many reviews exceed this limit.

We will look at a quantile table and then a visualization of token counts.

In [9]:
df %>%
    pull(review_n_tokens) %>%
    quantile() %>%
    as_tibble(rownames="quantile")
A tibble: 5 × 2
  quantile value
  <chr>    <dbl>
  0%          17
  25%        631
  50%        707
  75%        804
  100%      1231
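
Given the 512-token limit, a quick check along these lines (not run above, but straightforward with our summary dataframe) would tell us exactly what share of reviews exceed it:

# Share of reviews longer than the model's 512-token limit
df %>%
    summarize(frac_over_limit = mean(review_n_tokens > 512))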

Let's use a density plot rather than a histogram, since bin widths can have a distorting effect on the apparent shape of the distribution.

We will also look at something called a Highest Density Region (HDR). In this case, we'll look at the 90% HDR - that is, the smallest interval that contains 90% of the density.
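
For intuition, here is a rough sketch of what coda's HPDinterval() computes for an empirical sample like ours: sort the values, slide a window covering the requested share of them across the sorted vector, and keep the narrowest such window.

# Rough sketch of an empirical HDR: the narrowest window covering `prob`
# of the sorted sample (approximately what HPDinterval() computes below)
manual_hdr <- function(x, prob = 0.9) {
    x <- sort(x)
    gap <- round(prob * length(x))               # points the window must span
    starts <- seq_len(length(x) - gap)           # candidate window starts
    i <- which.min(x[starts + gap] - x[starts])  # index of narrowest window
    c(lower = x[i], upper = x[i + gap])
}
# manual_hdr(df$review_n_tokens) should closely match the result below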

In [10]:
hdr_result <- HPDinterval(as.mcmc(df %>% pull(review_n_tokens)), prob = 0.9)
hdr_lower <- hdr_result[,1]
hdr_upper <- hdr_result[,2]
print(hdr_lower)
print(hdr_upper)
[1] 550
[1] 922
In [11]:
plt <- df %>%
    ggplot() +
    geom_density(aes(review_n_tokens), alpha=0.7, fill="gray") +
    geom_vline(aes(xintercept = hdr_lower), color = "firebrick", linetype = "dashed") +
    geom_vline(aes(xintercept = hdr_upper), color = "firebrick", linetype = "dashed")

plt <- plt %>%
    set_theme_elements() +
    xlab("Num. Tokens") +
    ylab("Density")

plt
[Figure: density plot of review token counts, with dashed vertical lines marking the 90% HDR bounds]

It appears that about 90% of reviews fall between roughly 550 and 920 tokens. This will need to be addressed when we fit our model, but for now, we will just make a note of it.

Distribution of ratings

Next, let's take a look at the distribution of ratings among the reviews.

In [12]:
df %>%
    pull(rating) %>%
    quantile() %>%
    as_tibble(rownames="quantile")
A tibble: 5 × 2
  quantile value
  <chr>    <dbl>
  0%         0.0
  25%        6.6
  50%        7.3
  75%        7.8
  100%      10.0
In [13]:
hdr_result <- HPDinterval(as.mcmc(df %>% pull(rating)), prob = 0.9)
hdr_lower <- hdr_result[,1]
hdr_upper <- hdr_result[,2]
print(hdr_lower)
print(hdr_upper)
[1] 5.5
[1] 9
In [14]:
plt <- df %>%
    ggplot() +
    geom_density(aes(rating), alpha=0.7, fill="gray") +
    geom_vline(aes(xintercept = hdr_lower), color = "firebrick", linetype = "dashed") +
    geom_vline(aes(xintercept = hdr_upper), color = "firebrick", linetype = "dashed")

plt <- plt %>%
    set_theme_elements() +
    xlab("Rating") +
    ylab("Density")

plt
[Figure: density plot of ratings, with dashed vertical lines marking the 90% HDR bounds]
In [15]:
# ggsave("pitchfork_rating_distribution.png", plot = plt, height = 4, width = 10, dpi = 300)

We see that 90% of the ratings lie between 5.5 and 9 points on the 10-point scale. This concentration may bias our regression model, so it will be important to evaluate this after fitting.
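
One simple way to quantify this asymmetry - an illustrative check, not something we computed above - is the moment-based sample skewness of the ratings:

# Sample skewness; a negative value indicates a left-skewed (harsh-tail)
# rating distribution
df %>%
    summarize(skewness = mean(((rating - mean(rating)) / sd(rating))^3))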

Other investigations

Unique artists

Many artists will have multiple albums present in the dataset. How many unique artists are there?

In [16]:
n_artists <- df %>%
    distinct(artist) %>%
    nrow()
n_artists
10242

How does this compare to the overall number of rows in our dataset?

In [17]:
n_rows <- df %>%
    nrow()

round(n_rows / n_artists, digits = 2)
2.15

Artist album counts

What does the distribution of album counts look like for each artist?

In [18]:
album_count_df <- df %>%
    # Mindful of albums with multiple reviews
    distinct(artist, album) %>%
    count(artist, name = "n_albums")
In [19]:
album_count_df %>%
    pull(n_albums) %>%
    quantile() %>%
    as_tibble(rownames="quantile")
A tibble: 5 × 2
  quantile value
  <chr>    <dbl>
  0%           1
  25%          1
  50%          1
  75%          2
  100%       711

It appears we have an extreme outlier at the $100$th percentile mark. Let's investigate the cause.

In [20]:
album_count_df %>%
    arrange(desc(n_albums)) %>%
    head()
A tibble: 6 × 2
  artist           n_albums
  <chr>               <int>
  Various Artists       711
  Neil Young             33
  Gucci Mane             25
  Guided by Voices       25
  Bob Dylan              24
  David Bowie            23

Even the legendary Neil Young cannot compete with "Various Artists" in terms of output. Let's remove this from the summary data to get a better sense of the distribution.

In [21]:
album_count_df <- album_count_df %>%
    filter(artist != "Various Artists")
In [22]:
album_count_df %>%
    pull(n_albums) %>%
    quantile() %>%
    as_tibble(rownames="quantile")
A tibble: 5 × 2
  quantile value
  <chr>    <dbl>
  0%           1
  25%          1
  50%          1
  75%          2
  100%        33
In [23]:
plt <- album_count_df %>%
    ggplot() +
    geom_bar(aes(n_albums), alpha=0.7, fill="gray")

plt <- plt %>%
    set_theme_elements() +
    xlab("Num. Albums") +
    ylab("Count")

plt
[Figure: bar chart of the number of albums per artist]

We see that it is quite rare for an artist to have more than two albums in this dataset.
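
To put a number on "quite rare", we could compute the share directly:

# Share of artists with more than two reviewed albums
album_count_df %>%
    summarize(frac_over_two = mean(n_albums > 2))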

Multiple reviews

How many of the albums in this dataset have multiple reviews?

In [24]:
df %>%
    count(artist, album, name = "n_reviews") %>%
    filter(n_reviews > 1) %>%
    nrow()
10

This turns out to be a very small number.

Distribution of review counts per reviewer

What does the distribution of review counts look like for the reviewers in our dataset?

Because the number of reviewers is only in the hundreds, we can use a "rain cloud"-style plot - a density with the individual points drawn beneath it - for extra information about how many reviewers lie at each peak.

In [25]:
plt <- df %>%
    count(reviewer, name="n_reviews") %>%
    # Filter out reviewers with very few reviews
    filter(n_reviews > 3) %>%
    ggplot() +
    geom_density(aes(n_reviews), alpha=0.7, fill="gray") +
    geom_point(aes(x=n_reviews, y=0), shape="|", size=3)

plt <- plt %>%
    set_theme_elements() +
    xlab("Num. Reviews") +
    ylab("Density")

plt
[Figure: density plot of review counts per reviewer, with individual reviewers marked as ticks along the x-axis]

Distribution of average rating among reviewers

How harsh are the reviewers on average? Is this average harshness skewed?

In [26]:
plt <- df %>%
    group_by(reviewer) %>%
    summarize(mean_rating = mean(rating), n_ratings = n()) %>%
    filter(n_ratings > 3) %>%
    ggplot() +
    geom_density(aes(mean_rating), alpha=0.7, fill="gray") +
    geom_point(aes(x=mean_rating, y=0), shape="|", size=3)

plt <- plt %>%
    set_theme_elements() +
    xlab("Rating") +
    ylab("Density")

plt
[Figure: density plot of mean rating per reviewer, with individual reviewers marked as ticks along the x-axis]

The National - where do they stand?

How does the iconic indie rock band The National measure up to other bands in the dataset?

In [27]:
tnl_med_rating <- df %>%
    filter(artist == "The National") %>%
    summarize(median_rating = median(rating)) %>%
    pull(median_rating)
tnl_med_rating
7.95
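
To see exactly where this value sits among all ratings, we could compute its empirical percentile - a quick illustrative one-liner:

# Empirical percentile of The National's median rating among all ratings
round(100 * mean(df$rating <= tnl_med_rating), 1)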

Not bad! The median rating of The National falls above the $75$th percentile of all ratings in the dataset. How does their median rating compare to the median ratings of other prolific artists in the dataset?

In [28]:
df %>%
    filter(artist != "The National") %>%
    add_count(artist, name = "n_album_reviews") %>%
    filter(n_album_reviews > 5) %>%
    group_by(artist) %>%
    summarize(median_rating = median(rating)) %>%
    pull(median_rating) %>%
    quantile() %>%
    as_tibble(rownames = "quantile")
A tibble: 5 × 2
  quantile value
  <chr>    <dbl>
  0%        4.00
  25%       6.95
  50%       7.40
  75%       7.80
  100%      9.50

Impressive! Their median rating is above the $75$th percentile of median ratings for artists with more than 5 album reviews.
