Leaderboards: SportsRef

Using rvest to scrape AP and Coaches polls from Sports Reference

Author

Andrew Weatherman

Published

May 18, 2024

Introduction

Sports Reference includes a bevy of program, conference, and national leader data, but scraping it isn’t entirely straightforward because the information is “hidden” behind expandable sections. Even so, we can use rvest to programmatically extract this data.

Warning

Sports Reference limits users to 20 requests per minute, so to avoid an HTTP 429 error (“too many requests”), all functions must use a sleep of three or more seconds – Sys.sleep(3) – when iterating over 20 times.

library(tidyverse)
library(rvest)
library(janitor)
library(rlang)
library(cbbdata)

Structure

There are two structures to take note of here: program leaders and national/conference leaders.

Program Leaders

Scraping program leaderboards is fairly analogous to many other portions of the site: html_table works, but you need a little bit of cleaning.

html_table will return every populated leaderboard (over 40), so we can use html_text and target the appropriate node to return the “names” of each statistical category. rlang::set_names can then take that vector and name our tables (the elements returned by html_table).

If we use map2, we can pass the name of each table to our cleaning function and set the third column, associated with the “value” of the record, to that name using rename_with.

There are three “types” of leaderboards offered: season, career, and game (usually complete to 2006-07). For the last one, there are two commas inside the player string, so we cannot use a comma delimiter to split player name and year/span/etc. So, we need to insert a different character (|) after the first comma and then split on that.

scrape_program_leaders <- function(team, leaderboard_type) {
  
  slug <- filter(cbd_teams(), common_team == team) %>% 
    pull(sr_link) %>% 
    str_extract("(?<=schools/).+(?=/men)") # extract slug
  
  url <- paste0('https://www.sports-reference.com/cbb/schools/', slug, '/men/leaders-and-records-', leaderboard_type, '.html')
  
  html <- read_html(url)
  
  # store names of tables
  table_names <- html %>% html_nodes('#div_leaders .poptip') %>% html_text()
  
  # get the tables and set their names
  tables <- html %>% html_table() %>% set_names(table_names)
  
  # set the delimiter based on leaderboard type
  delim <- if(leaderboard_type == 'game') '|' else ','
  
  ## clean the tables
  map2(tables, names(tables),
      .f = function(table, name) {
        
        table %>% 
          # rename cols.
          rename_with(~c('rank', 'player', name)) %>% 
          clean_names() %>% 
          # fix tied ranks
          fill(rank, .direction = 'down') %>% 
          rowwise() %>% 
          mutate(player = ifelse(leaderboard_type == 'game',
                                 gsub("^([^,]+),\\s*(.*)$", "\\1|\\2", player),
                                 player)) %>% 
          ungroup() %>% 
          # separate name and years played
          separate_wider_delim(player, delim = delim, names = c('player', 'time')) %>% 
          # add team name
          mutate(team = team, time = trimws(time))
        
      }
    )
  
}

API Key set!

rank	player	time	points	team
1	J.J. Redick	2005-06	964	Duke
2	R.J. Barrett	2018-19	860	Duke
3	Jay Williams	2000-01	841	Duke
4	Dick Groat	1950-51	831	Duke
5	Johnny Dawkins	1985-86	809	Duke
6	Danny Ferry	1988-89	791	Duke
7	Dick Groat	1951-52	780	Duke
8	Grayson Allen	2015-16	779	Duke
9	Shane Battier	2000-01	778	Duke
10	Christian Laettner	1990-91	771	Duke

National Leaders

On the other hand, national leaderboards are a bit more complex. If you run a simple html_table function on the leaderboard page, you’ll be met with an empty list.

read_html('https://www.sports-reference.com/cbb/seasons/men/2024-leaders.html') %>% html_table()

list()

Typically, this means one of two things: a) the page loads dynamically or b) there are no tables on the page…but neither are true! In fact, the data is there and it is wrapped in a traditional table. The data, however, is commented out until a user expands the table (image below), and html_table does not inherently access commented-out tables. So, what can we do?

Well, we need to access the parent ID of the tables, which in this case is all_leaders (and you can see that in the top right of the screenshot).

Once we do html_nodes(#all_leaders), we need a way to access those commented-out tables, and we can do that using an xpath selector – comment().

read_html('https://www.sports-reference.com/cbb/conferences/acc/men/2024-leaders.html') %>% 
  html_nodes('#all_leaders') %>% 
  html_nodes(xpath = "comment()")

If you run that code, you’ll be returned an XML nodeset element. Using that object, we can convert it to pure text with html_text, giving us the raw html string, and read it back into a usable format with read_html. Finally, we can toss in html_table to get the list of tables we expected at the start.

read_html('https://www.sports-reference.com/cbb/conferences/acc/men/2024-leaders.html') %>% 
    html_nodes('#all_leaders') %>% 
    html_nodes(xpath = "comment()") %>% 
    html_text() %>% 
    read_html()

When cleaning, however, there is a small catch: Player and team names have no common delimiter like above. The easiest way to approach this, in my opinion, is to build the table manually by targeting the IDs of each column. The function below does this and sprinkles in the cleaning and naming described in the program section.

scrape_national_leaders <- function(year) {
  
  url <- paste0('https://www.sports-reference.com/cbb/seasons/men/', year, '-leaders.html')
  
  html <- read_html(url) %>% 
    html_nodes('#all_leaders') %>% 
    html_nodes(xpath = "comment()") %>% 
    html_text() %>% 
    read_html()
  
  # store names of tables
  table_names <- html %>% html_nodes('#div_leaders .poptip') %>% html_text()
  
  # get the tables and set their names
  tables <- html %>% html_nodes('table')
  
  ## clean the tables
  all_tables <- map2(tables, table_names,
     .f = function(table, name) {
        data <- tibble(
          rank = table %>% html_nodes('.rank') %>% html_text(),
          player = table %>% html_nodes('a') %>% html_text(),
          team = table %>% html_nodes('small') %>% html_text(),
          value = table %>% html_nodes('.value') %>% html_text()
        )
        
        data <- data %>% 
          mutate(rank = parse_number(rank),
                 value = as.numeric(value)) %>% 
          fill(rank, .direction = 'down') %>% 
          rename_with(~c('rank', 'player', 'team', name)) %>% 
          clean_names()
        
        return(data)
     }
  )
  
  all_tables <- all_tables %>% set_names(table_names)
  
  return(all_tables)
  
}

scrape_national_leaders(2024)$Points

rank	player	team	points
1	Zach Edey	Purdue	983
2	Tommy Bruner	Denver	816
3	Mark Sears	Alabama	797
4	RJ Davis	UNC	784
5	Dalton Knecht	Tennessee	780
6	Jaedon Ledee	San Diego State	772
7	Jalen Blackmon	Stetson	744
8	Tyler Thomas	Hofstra	742
9	Terrence Shannon	Illinois	736
10	Tucker DeVries	Drake	734

Conference Leaders

Conference leaderboards behave similarly to national ones, and I’ll leave that as an exercise for the reader (at least for now).