rlabuonora.com

Scraping NBA data with rvest and purrr

The NBA does a great job releasing statistics on every aspect of the game. Most teams have analytics experts crunching those numbers for insights to get a competitive advantage.

In this post I go through the process of scraping data from basketball-reference.com using the R package rvest. I also do some data munging with purrr and string interpolation with glue.

NBA Reference

This website has statistics on every aspect of the game for a long time. In this post, I’ll focus on individual player stats (Field Goal Percentage, Average Minutes, etc.) from the 20002001 season up to 20172018.

If you follow this link, you can find the data for the 2017 season in an html table.

Html uses a special markup language to format the tables so that your browser can render it properly. Typically, an html table has this structure:

<table>
  <th>Player</th>
  <td>Data<td>
</table>

To get this into an R data frame, you can use rvest. The first step is to get all the html from the page using read_html.

library(rvest)
nba_url <- "https://www.basketball-reference.com/leagues/NBA_2017_per_game.html"
web_page <- read_html(nba_url)

Now we need to get the actual data from the table. html_table does just that:

data <- html_table(web_page)
head(data[[1]])
## # A tibble: 6 x 30
##   Rk    Player Pos   Age   Tm    G     GS    MP    FG    FGA   `FG%` `3P`  `3PA`
##   <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1     Álex ~ SG    23    OKC   68    6     15.5  2.0   5.0   .393  1.4   3.6  
## 2 2     Quinc~ PF    26    TOT   38    1     14.7  1.8   4.5   .412  1.0   2.4  
## 3 2     Quinc~ PF    26    DAL   6     0     8.0   0.8   2.8   .294  0.2   1.2  
## 4 2     Quinc~ PF    26    BRK   32    1     15.9  2.0   4.8   .425  1.1   2.6  
## 5 3     Steve~ C     23    OKC   80    80    29.9  4.7   8.2   .571  0.0   0.0  
## 6 4     Arron~ SG    31    SAC   61    45    25.9  3.0   6.9   .440  1.0   2.5  
## # ... with 17 more variables: `3P%` <chr>, `2P` <chr>, `2PA` <chr>,
## #   `2P%` <chr>, `eFG%` <chr>, FT <chr>, FTA <chr>, `FT%` <chr>, ORB <chr>,
## #   DRB <chr>, TRB <chr>, AST <chr>, STL <chr>, BLK <chr>, TOV <chr>, PF <chr>,
## #   PTS <chr>

That was easy! html_table returns a list with all the tables in the web_page as a data_frame. What we want is the first element of that list.

data <- data[[1]]

Let’s inspect the data:

str(data)
## tibble [619 x 30] (S3: tbl_df/tbl/data.frame)
##  $ Rk    : chr [1:619] "1" "2" "2" "2" ...
##  $ Player: chr [1:619] "Álex Abrines" "Quincy Acy" "Quincy Acy" "Quincy Acy" ...
##  $ Pos   : chr [1:619] "SG" "PF" "PF" "PF" ...
##  $ Age   : chr [1:619] "23" "26" "26" "26" ...
##  $ Tm    : chr [1:619] "OKC" "TOT" "DAL" "BRK" ...
##  $ G     : chr [1:619] "68" "38" "6" "32" ...
##  $ GS    : chr [1:619] "6" "1" "0" "1" ...
##  $ MP    : chr [1:619] "15.5" "14.7" "8.0" "15.9" ...
##  $ FG    : chr [1:619] "2.0" "1.8" "0.8" "2.0" ...
##  $ FGA   : chr [1:619] "5.0" "4.5" "2.8" "4.8" ...
##  $ FG%   : chr [1:619] ".393" ".412" ".294" ".425" ...
##  $ 3P    : chr [1:619] "1.4" "1.0" "0.2" "1.1" ...
##  $ 3PA   : chr [1:619] "3.6" "2.4" "1.2" "2.6" ...
##  $ 3P%   : chr [1:619] ".381" ".411" ".143" ".434" ...
...

Ok, so all the columns are read in as text. That’s natural, since all html pages are text, so we need to convert the data to the correct data types.

Most columns seem to be numeric, except Player Name which is a character vector, Player Position and Team which are factors.

One approach to correcting the data types is to use dplyr’s mutate:

library(dplyr)
data_1 <- data %>% 
  mutate(Rk = as.numeric(Rk),
         Pos = as.factor(Pos),
         Age = as.numeric(Age))
#       ...

str(select(data_1, Rk, Pos, Age))
## tibble [619 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Rk : num [1:619] 1 2 2 2 3 4 5 6 7 8 ...
##  $ Pos: Factor w/ 7 levels "C","PF","PF-C",..: 7 2 2 2 1 7 1 1 2 2 ...
##  $ Age: num [1:619] 23 26 26 26 23 31 28 28 31 27 ...

But there’s a better way. mutate_at applies a function to a list of columns.

data_2 <- data %>% 
  mutate_at(vars(Tm, Pos), factor)

str(select(data_2, Tm, Pos))
## tibble [619 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Tm : Factor w/ 32 levels "ATL","BOS","BRK",..: 21 30 7 3 21 26 19 18 27 12 ...
##  $ Pos: Factor w/ 7 levels "C","PF","PF-C",..: 7 2 2 2 1 7 1 1 2 2 ...

The Columns need to be specified as a list of variables: vars(Tm, Pos). In this case, the function applied is factor.

Repeat for all the numeric variables using the : notation to select all the columns from G to PS.G.

data_3 <- data %>%
  mutate_at(vars(G:`PTS`, Age), as.numeric)
str(select(data_3, G:`PTS`, Age))
## tibble [619 x 26] (S3: tbl_df/tbl/data.frame)
##  $ G   : num [1:619] 68 38 6 32 80 61 39 62 72 61 ...
##  $ GS  : num [1:619] 6 1 0 1 80 45 15 0 72 5 ...
##  $ MP  : num [1:619] 15.5 14.7 8 15.9 29.9 25.9 15 8.6 32.4 14.3 ...
##  $ FG  : num [1:619] 2 1.8 0.8 2 4.7 3 2.3 0.7 6.9 1.3 ...
##  $ FGA : num [1:619] 5 4.5 2.8 4.8 8.2 6.9 4.6 1.4 14.6 2.8 ...
##  $ FG% : num [1:619] 0.393 0.412 0.294 0.425 0.571 0.44 0.5 0.523 0.477 0.458 ...
##  $ 3P  : num [1:619] 1.4 1 0.2 1.1 0 1 0 0 0.3 0 ...
##  $ 3PA : num [1:619] 3.6 2.4 1.2 2.6 0 2.5 0.1 0 0.8 0 ...
##  $ 3P% : num [1:619] 0.381 0.411 0.143 0.434 0 0.411 0 NA 0.411 0 ...
##  $ 2P  : num [1:619] 0.6 0.9 0.7 0.9 4.7 2 2.3 0.7 6.6 1.3 ...
##  $ 2PA : num [1:619] 1.4 2.1 1.7 2.2 8.2 4.4 4.5 1.4 13.8 2.7 ...
##  $ 2P% : num [1:619] 0.426 0.413 0.4 0.414 0.572 0.457 0.511 0.523 0.48 0.461 ...
##  $ eFG%: num [1:619] 0.531 0.521 0.324 0.542 0.571 0.514 0.5 0.523 0.488 0.458 ...
##  $ FT  : num [1:619] 0.6 1.2 0.3 1.3 2 1.4 0.7 0.2 3.1 0.4 ...
...

But wait, what about the rest of the years? You probably heard about the DRY principe: Don’t Repeat Yourself. It’d be silly to write a whole script just to download data from another season, because most of the code would be the same. So we need to reuse our code to get data from the rest of the seasons.

To make our code more reusable, the first thing we need to do is build the url to get the html. We’ll use a super cool package called glue for that.

library(glue)
name <- "Rafa"
glue("Hello, {name}")
## Hello, Rafa

I’m sure you get the idea. It’s what programmer’s call string interpolation.

For loops are used to iterate over collections, so we can repeat some process without copying the code. With a for loop, we can generate a vector of urls for every season we’re interested in:

base_url <- "https://www.basketball-reference.com/leagues/"
page <- "NBA_{year}_per_game.html"
urls <- vector("double", length(2000:2017))
for (year in 2000:2017) {
  urls[[year-1999]] <- glue(base_url, page) ## year -1999 gives 1, 2, 3, ...
}
head(urls)
## [1] "https://www.basketball-reference.com/leagues/NBA_2000_per_game.html"
## [2] "https://www.basketball-reference.com/leagues/NBA_2001_per_game.html"
## [3] "https://www.basketball-reference.com/leagues/NBA_2002_per_game.html"
## [4] "https://www.basketball-reference.com/leagues/NBA_2003_per_game.html"
## [5] "https://www.basketball-reference.com/leagues/NBA_2004_per_game.html"
## [6] "https://www.basketball-reference.com/leagues/NBA_2005_per_game.html"

Awesome! Now instead of creating a vector of urls, we’ll create a list of data frames, one for every season.

base_url <- "https://www.basketball-reference.com/leagues/"
page <- "NBA_{year}_per_game.html"
data <- list()

for (year in 2015:2017) {
  nba_url <- glue(base_url, page) 
  web_page <- read_html(nba_url)
  data[[year-2014]] <- html_table(web_page)[[1]]

}
str(data)
## List of 3
##  $ : tibble [675 x 30] (S3: tbl_df/tbl/data.frame)
##   ..$ Rk    : chr [1:675] "1" "2" "3" "4" ...
##   ..$ Player: chr [1:675] "Quincy Acy" "Jordan Adams" "Steven Adams" "Jeff Adrien" ...
##   ..$ Pos   : chr [1:675] "PF" "SG" "C" "PF" ...
##   ..$ Age   : chr [1:675] "24" "20" "21" "28" ...
##   ..$ Tm    : chr [1:675] "NYK" "MEM" "OKC" "MIN" ...
##   ..$ G     : chr [1:675] "68" "30" "70" "17" ...
##   ..$ GS    : chr [1:675] "22" "0" "67" "0" ...
##   ..$ MP    : chr [1:675] "18.9" "8.3" "25.3" "12.6" ...
##   ..$ FG    : chr [1:675] "2.2" "1.2" "3.1" "1.1" ...
##   ..$ FGA   : chr [1:675] "4.9" "2.9" "5.7" "2.6" ...
##   ..$ FG%   : chr [1:675] ".459" ".407" ".544" ".432" ...
##   ..$ 3P    : chr [1:675] "0.3" "0.3" "0.0" "0.0" ...
##   ..$ 3PA   : chr [1:675] "0.9" "0.8" "0.0" "0.0" ...
...

Better iteration with map

Now our script does everything we want it to do. But there’s a wrinkle we have to iron out.

A lot of programming is about iteration, and every programming language supports for loops. R has a great way to make iteration cleaner with the purrr package. There’s a whole chapter on iteration in R for Data Science, which is a must read for anyone interested in R.

Our script has to repeat the same steps for every year from 2000 to 2017. That’s a job for map! map takes a list and applies a function to every element of the list.

For example, here we apply the head function to a list of data frames to get the first 6 lines of each one:

my_list <- list(mtcars, iris)
purrr::map(my_list, head)
## [[1]]
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## 
## [[2]]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Instead of head, which is a predefined function, you can use your own functions:

my_list <- list(mtcars, iris)
my_func <- function(data_frame) { print("Hello from my fun!")}
purrr::map(my_list, my_func)
## [1] "Hello from my fun!"
## [1] "Hello from my fun!"
## [[1]]
## [1] "Hello from my fun!"
## 
## [[2]]
## [1] "Hello from my fun!"

When applying map, you need to figure out our initial list to iterate over, and the function we want to apply to it’s elements in every iteration. The first part is easy, to get a list of seasons, we convert a numeric vector into a list like: as.list(2000:2017).

Now we need a function that takes one year and does the job for that year. Something like:

get_player_data <- function(year) {
  base_url <- "https://www.basketball-reference.com/leagues/"
  page <- "NBA_{year}_per_game.html"
  # Download data
  nba_url <- glue(base_url, page) 
  # extract the data frame from the table
  web_page <- read_html(nba_url)
  data <- html_table(web_page)[[1]]
  # return a data frame
  data
}

data.14 <- get_player_data(2014)
head(data.14)
## # A tibble: 6 x 30
##   Rk    Player Pos   Age   Tm    G     GS    MP    FG    FGA   `FG%` `3P`  `3PA`
##   <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1     Quinc~ SF    23    TOT   63    0     13.4  1.0   2.2   .468  0.1   0.2  
## 2 1     Quinc~ SF    23    TOR   7     0     8.7   0.9   2.0   .429  0.3   0.7  
## 3 1     Quinc~ SF    23    SAC   56    0     14.0  1.1   2.3   .472  0.0   0.2  
## 4 2     Steve~ C     20    OKC   81    20    14.8  1.1   2.3   .503  0.0   0.0  
## 5 3     Jeff ~ PF    27    TOT   53    12    18.1  2.7   5.2   .520  0.0   0.0  
## 6 3     Jeff ~ PF    27    CHA   25    0     10.2  0.9   1.6   .550  0.0   0.0  
## # ... with 17 more variables: `3P%` <chr>, `2P` <chr>, `2PA` <chr>,
## #   `2P%` <chr>, `eFG%` <chr>, FT <chr>, FTA <chr>, `FT%` <chr>, ORB <chr>,
## #   DRB <chr>, TRB <chr>, AST <chr>, STL <chr>, BLK <chr>, TOV <chr>, PF <chr>,
## #   PTS <chr>

Pipes

Pipes are a well known feature in the tidyverse, so I won’t go into detail of how they work. We could turn our function into a set of pipes easily except for this line: data <- html_table(web_page)[[1]].

We’re actually calling two functions in that line: html_table with the web_page argument and [[ with 1 as an argument.

web_page <- read_html(nba_url)
tbl_lst <- html_table(web_page)
df <- `[[`(tbl_lst, 1)

So we can pipe it like this:

df_2 <- web_page %>% 
 html_table %>% 
 `[[`(1)

Finally, we re-write our get_player_data function as:

get_player_data <- function(year) {
  base_url <- "https://www.basketball-reference.com/leagues/"
  page <- "NBA_{year}_per_game.html"
  
  glue(base_url, page) %>% 
    read_html %>% 
    html_table %>% 
    `[[`(1)
}

df.14 <- get_player_data(2014)
head(df.14)
## # A tibble: 6 x 30
##   Rk    Player Pos   Age   Tm    G     GS    MP    FG    FGA   `FG%` `3P`  `3PA`
##   <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1     Quinc~ SF    23    TOT   63    0     13.4  1.0   2.2   .468  0.1   0.2  
## 2 1     Quinc~ SF    23    TOR   7     0     8.7   0.9   2.0   .429  0.3   0.7  
## 3 1     Quinc~ SF    23    SAC   56    0     14.0  1.1   2.3   .472  0.0   0.2  
## 4 2     Steve~ C     20    OKC   81    20    14.8  1.1   2.3   .503  0.0   0.0  
## 5 3     Jeff ~ PF    27    TOT   53    12    18.1  2.7   5.2   .520  0.0   0.0  
## 6 3     Jeff ~ PF    27    CHA   25    0     10.2  0.9   1.6   .550  0.0   0.0  
## # ... with 17 more variables: `3P%` <chr>, `2P` <chr>, `2PA` <chr>,
## #   `2P%` <chr>, `eFG%` <chr>, FT <chr>, FTA <chr>, `FT%` <chr>, ORB <chr>,
## #   DRB <chr>, TRB <chr>, AST <chr>, STL <chr>, BLK <chr>, TOV <chr>, PF <chr>,
## #   PTS <chr>

Good, now we’re all set for using map:

list_of_data_frames <- as.list(2000:2017) %>% 
  purrr::map(get_player_data)
  
str(list_of_data_frames)
## List of 18
##  $ : tibble [517 x 30] (S3: tbl_df/tbl/data.frame)
##   ..$ Rk    : chr [1:517] "1" "1" "1" "2" ...
##   ..$ Player: chr [1:517] "Tariq Abdul-Wahad" "Tariq Abdul-Wahad" "Tariq Abdul-Wahad" "Shareef Abdur-Rahim" ...
##   ..$ Pos   : chr [1:517] "SG" "SG" "SG" "SF" ...
##   ..$ Age   : chr [1:517] "25" "25" "25" "23" ...
##   ..$ Tm    : chr [1:517] "TOT" "ORL" "DEN" "VAN" ...
##   ..$ G     : chr [1:517] "61" "46" "15" "82" ...
##   ..$ GS    : chr [1:517] "56" "46" "10" "82" ...
##   ..$ MP    : chr [1:517] "25.9" "26.2" "24.9" "39.3" ...
##   ..$ FG    : chr [1:517] "4.5" "4.8" "3.4" "7.2" ...
##   ..$ FGA   : chr [1:517] "10.6" "11.2" "8.7" "15.6" ...
##   ..$ FG%   : chr [1:517] ".424" ".433" ".389" ".465" ...
##   ..$ 3P    : chr [1:517] "0.0" "0.0" "0.1" "0.4" ...
##   ..$ 3PA   : chr [1:517] "0.4" "0.5" "0.1" "1.2" ...
...

Another functional idiom: Reduce

Map takes list and returns a list. In our script, it takes a list of numbers as input and returns a list of data frames. We now need to roll all those data frames into a single data frame.

That’s what reduce does. It takes a list and applies a function to consequtive pairs. accumulating the results.

lst <- list(1, 2, 3, 4) 
purrr::reduce(lst, sum) # don't do this in real life!
## [1] 10

Of course, since R’s sum is vectorized you don’t ever need to do that,

Reduce works great with map:

lst <- list(1, 2, 3, 4) 
# Find the sum of the square root of 1, 2, 3, 4
purrr::map(lst, sqrt) %>% 
  purrr::reduce(sum)
## [1] 6.146264

Another example: get the sum of the rows in a list of data frames:

list(mtcars, iris) %>% 
  purrr::map(nrow) %>% 
  purrr::reduce(sum)
## [1] 182

In our case, what we need it to use reduce with bind_rows, so we accumulate all the rows in our data_frame into a single one.

Putting all the pieces together in the final script:

library(glue)
library(purrr)
library(dplyr)
library(rvest)

# Download data table from basketball reference
get_player_data <- function(year) {
  
  base_nba_url <- "https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html"
  
  glue(base_nba_url) %>% 
    read_html() %>% 
    html_table() %>% 
    `[[`(1) %>% 
    mutate(Year = year)
}

data <- as.list(2000:2018) %>% 
  purrr::map(get_player_data) %>% 
  purrr::reduce(bind_rows) %>% 
  mutate_at(vars(G:`PS/G`, Age), as.numeric) %>% 
  mutate_at(vars(Tm, Pos), factor)

What to do with the data