Leaderboards: SportsRef

college basketball
scraping
tutorial
sports reference
Using rvest to scrape AP and Coaches polls from Sports Reference
Author

Andrew Weatherman

Published

May 18, 2024

Introduction

Sports Reference includes a bevy of program, conference, and national leader data, but scraping it isn’t entirely straightforward because the information is “hidden” behind expandable sections. Even so, we can use rvest to programmatically extract this data.

Warning

Sports Reference limits users to 20 requests per minute, so to avoid an HTTP 429 error (“too many requests”), all functions must use a sleep of three or more seconds – Sys.sleep(3) – when iterating over 20 times.

Structure

There are two structures to take note of here: program leaders and national/conference leaders.

Program Leaders

Scraping program leaderboards is fairly analogous to many other portions of the site: html_table works, but you need a little bit of cleaning.

html_table will return every populated leaderboard (over 40), so we can use html_text and target the appropriate node to return the “names” of each statistical category. rlang::set_names can then take that vector and name our tables (the elements returned by html_table).

If we use map2, we can pass the name of each table to our cleaning function and set the third column, associated with the “value” of the record, to that name using rename_with.

There are three “types” of leaderboards offered: season, career, and game (usually complete to 2006-07). For the last one, there are two commas inside the player string, so we cannot use a comma delimiter to split player name and year/span/etc. So, we need to insert a different character (|) after the first comma and then split on that.

scrape_program_leaders <- function(team, leaderboard_type) {
  
  slug <- filter(cbd_teams(), common_team == team) %>% 
    pull(sr_link) %>% 
    str_extract("(?<=schools/).+(?=/men)") # extract slug
  
  url <- paste0('https://www.sports-reference.com/cbb/schools/', slug, '/men/leaders-and-records-', leaderboard_type, '.html')
  
  html <- read_html(url)
  
  # store names of tables
  table_names <- html %>% html_nodes('#div_leaders .poptip') %>% html_text()
  
  # get the tables and set their names
  tables <- html %>% html_table() %>% set_names(table_names)
  
  # set the delimiter based on leaderboard type
  delim <- if(leaderboard_type == 'game') '|' else ','
  
  ## clean the tables
  map2(tables, names(tables),
      .f = function(table, name) {
        
        table %>% 
          # rename cols.
          rename_with(~c('rank', 'player', name)) %>% 
          clean_names() %>% 
          # fix tied ranks
          fill(rank, .direction = 'down') %>% 
          rowwise() %>% 
          mutate(player = ifelse(leaderboard_type == 'game',
                                 gsub("^([^,]+),\\s*(.*)$", "\\1|\\2", player),
                                 player)) %>% 
          ungroup() %>% 
          # separate name and years played
          separate_wider_delim(player, delim = delim, names = c('player', 'time')) %>% 
          # add team name
          mutate(team = team, time = trimws(time))
        
      }
    )
  
}
API Key set!
rank player time points team
1 J.J. Redick 2005-06 964 Duke
2 R.J. Barrett 2018-19 860 Duke
3 Jay Williams 2000-01 841 Duke
4 Dick Groat 1950-51 831 Duke
5 Johnny Dawkins 1985-86 809 Duke
6 Danny Ferry 1988-89 791 Duke
7 Dick Groat 1951-52 780 Duke
8 Grayson Allen 2015-16 779 Duke
9 Shane Battier 2000-01 778 Duke
10 Christian Laettner 1990-91 771 Duke

National Leaders

On the other hand, national leaderboards are a bit more complex. If you run a simple html_table function on the leaderboard page, you’ll be met with an empty list.

read_html('https://www.sports-reference.com/cbb/seasons/men/2024-leaders.html') %>% html_table()

Typically, this means one of two things: a) the page loads dynamically or b) there are no tables on the page…but neither are true! In fact, the data is there and it is wrapped in a traditional table. The data, however, is commented out until a user expands the table (image below), and html_table does not inherently access commented-out tables. So, what can we do?

Well, we need to access the parent ID of the tables, which in this case is all_leaders (and you can see that in the top right of the screenshot).

Once we do html_nodes(#all_leaders), we need a way to access those commented-out tables, and we can do that using an xpath selector – comment().

read_html('https://www.sports-reference.com/cbb/conferences/acc/men/2024-leaders.html') %>% 
  html_nodes('#all_leaders') %>% 
  html_nodes(xpath = "comment()")

If you run that code, you’ll be returned an XML nodeset element. Using that object, we can convert it to pure text with html_text, giving us the raw html string, and read it back into a usable format with read_html. Finally, we can toss in html_table to get the list of tables we expected at the start.

read_html('https://www.sports-reference.com/cbb/conferences/acc/men/2024-leaders.html') %>% 
    html_nodes('#all_leaders') %>% 
    html_nodes(xpath = "comment()") %>% 
    html_text() %>% 
    read_html()

When cleaning, however, there is a small catch: Player and team names have no common delimiter like above. The easiest way to approach this, in my opinion, is to build the table manually by targeting the IDs of each column. The function below does this and sprinkles in the cleaning and naming described in the program section.

scrape_national_leaders <- function(year) {
  
  url <- paste0('https://www.sports-reference.com/cbb/seasons/men/', year, '-leaders.html')
  
  html <- read_html(url) %>% 
    html_nodes('#all_leaders') %>% 
    html_nodes(xpath = "comment()") %>% 
    html_text() %>% 
    read_html()
  
  # store names of tables
  table_names <- html %>% html_nodes('#div_leaders .poptip') %>% html_text()
  
  # get the tables and set their names
  tables <- html %>% html_nodes('table')
  
  ## clean the tables
  all_tables <- map2(tables, table_names,
     .f = function(table, name) {
        data <- tibble(
          rank = table %>% html_nodes('.rank') %>% html_text(),
          player = table %>% html_nodes('a') %>% html_text(),
          team = table %>% html_nodes('small') %>% html_text(),
          value = table %>% html_nodes('.value') %>% html_text()
        )
        
        data <- data %>% 
          mutate(rank = parse_number(rank),
                 value = as.numeric(value)) %>% 
          fill(rank, .direction = 'down') %>% 
          rename_with(~c('rank', 'player', 'team', name)) %>% 
          clean_names()
        
        return(data)
     }
  )
  
  all_tables <- all_tables %>% set_names(table_names)
  
  return(all_tables)
  
}
scrape_national_leaders(2024)$Points
rank player team points
1 Zach Edey Purdue 983
2 Tommy Bruner Denver 816
3 Mark Sears Alabama 797
4 RJ Davis UNC 784
5 Dalton Knecht Tennessee 780
6 Jaedon Ledee San Diego State 772
7 Jalen Blackmon Stetson 744
8 Tyler Thomas Hofstra 742
9 Terrence Shannon Illinois 736
10 Tucker DeVries Drake 734

Conference Leaders

Conference leaderboards behave similarly to national ones, and I’ll leave that as an exercise for the reader (at least for now).