7 Scraping of multi-page websites

In many cases, we do not want to scrape the content of a single website, but several sub-pages in one step. In this session we will look at two common variants: index pages and pagination.

7.1 Index pages

An index page, in this context, is a website on which links to the various sub-pages are listed. We can think of this as a table of contents.

The website for the tidyverse packages serves as an example: https://www.tidyverse.org/packages/. Under the heading “Core tidyverse”, the eight packages that are loaded in R with library(tidyverse) are listed. In addition to the name and icon, the list includes a short description of each package and a link to further information.

Let’s look at one of the sub-pages for the core packages. Since they all have the same structure, you can choose any package as an example. Our scraping goal could be to create a table with the names of the core packages, the current version numbers, and the links to CRAN and to the matching chapter in “R for Data Science” by Wickham and Grolemund. By now we have all the tools to extract this data from the websites. We could “manually” scrape the individual sub-pages and merge the data. It would be more practical, however, to start from the index page and scrape all eight sub-pages, along with the data of interest they contain, in one step. This is exactly what we will look at in the following.

7.1.1 Scraping of the index

library(tidyverse)
library(rvest)

As a first step, we need to extract the links to the sub-pages from the source code of the index page. As always, we download the website and parse it.

website <- "https://www.tidyverse.org/packages/" %>% 
  read_html()

In this case, the links are stored twice in the source code. In one case the image of the icon is linked, in the other the name of the package. You can follow this in the source code and/or with the WDTs yourself by now. However, we need each link only once. One of several ways to select them is to pick the <a> tags that are direct children of the individual <div class="package"> tags, using the child combinator >.

a_elements <- website %>% 
  html_elements(css = "div.package > a")

a_elements
## {xml_nodeset (8)}
## [1] <a href="https://ggplot2.tidyverse.org/" target="_blank">\n    <img class ...
## [2] <a href="https://dplyr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [3] <a href="https://tidyr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [4] <a href="https://readr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [5] <a href="https://purrr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [6] <a href="https://tibble.tidyverse.org/" target="_blank">\n    <img class= ...
## [7] <a href="https://stringr.tidyverse.org/" target="_blank">\n    <img class ...
## [8] <a href="https://forcats.tidyverse.org/" target="_blank">\n    <img class ...

Since we need the actual URLs in order to read the sub-pages later, we now extract the values of the href attributes.

links <- a_elements %>%
  html_attr(name = "href")

links
## [1] "https://ggplot2.tidyverse.org/" "https://dplyr.tidyverse.org/"  
## [3] "https://tidyr.tidyverse.org/"   "https://readr.tidyverse.org/"  
## [5] "https://purrr.tidyverse.org/"   "https://tibble.tidyverse.org/" 
## [7] "https://stringr.tidyverse.org/" "https://forcats.tidyverse.org/"
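By the way, the links on this index page are already absolute URLs, so we can pass them straight to read_html() later on. On many index pages, however, the href attributes contain paths relative to the domain. In that case, the links would first have to be completed with the base URL, for example with url_absolute() from the xml2 package. A minimal sketch, with a made-up relative path:

links_relative <- c("/packages/ggplot2/")

xml2::url_absolute(links_relative, base = "https://www.tidyverse.org/")
## [1] "https://www.tidyverse.org/packages/ggplot2/"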

7.1.2 Iteration with map()

Before starting to parse the sub-pages, we must think about how we can get R to apply these steps automatically to several URLs one after the other. One possibility from base R would be a “for loop”. However, I would like to introduce the map() family of functions from the tidyverse package purrr. These follow the basic logic of the tidyverse, can easily be included in pipes, and have a short and intuitively understandable syntax.
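For comparison, here is a minimal sketch of what such a base R for loop could look like, applied to the rounding example used below:

x <- c(1.28, 1.46, 1.64, 1.82)

# Pre-allocate a list, then fill it with one result per element of x
results <- vector("list", length(x))
for (i in seq_along(x)) {
  results[[i]] <- round(x[[i]])
}

The map() version of the same task, shown next, achieves this in a single line.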

The map() function takes a vector or list as input, applies a function specified in the second argument to each element of the input, and returns a list of the results.

x <- c(1.28, 1.46, 1.64, 1.82)

map(x, round)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 2
## 
## [[4]]
## [1] 2

For each element of the numerical vector x, map() individually applies the function round(). round() does what the name suggests and rounds the input to the nearest whole number. As a result, map() returns a list.

If we want to have a vector as output, we can use specific variants of the map functions depending on the desired type – logical, integer, double or character. Here is a quote from the help on ?map:

“map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (or die trying).”

For example, if we want to have a numeric vector instead of a list as output for the above example, we can use map_dbl():

x <- c(1.28, 1.46, 1.64, 1.82)

map_dbl(x, round)
## [1] 1 1 2 2

Or for a character vector, we can apply map_chr(). The function toupper() used here returns the input in uppercase letters.

x <- c("abc", "def", "gah")

map_chr(x, toupper)
## [1] "ABC" "DEF" "GAH"

If we want to change the arguments of the applied function, the additional arguments are listed after the name of the function. Here, the number of decimal places to round to is changed from the default value of 0 to 1.

x <- c(1.28, 1.46, 1.64, 1.82)

map_dbl(x, round, digits = 1)
## [1] 1.3 1.5 1.6 1.8
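As an aside, the same call can also be written with an anonymous function, which becomes useful when the arguments are more involved. Both of the following variants should be equivalent to the call above; the first uses purrr’s formula shorthand, the second the anonymous function syntax available since R 4.1:

map_dbl(x, ~ round(.x, digits = 1))
map_dbl(x, \(value) round(value, digits = 1))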

This gives us a first overview of iteration with map(), but it can necessarily only be a short introduction. For a more detailed treatment of for loops and the map functions, I recommend the chapter on “Iteration” from “R for Data Science”: https://r4ds.had.co.nz/iteration.html. For a more interactive introduction in German, I recommend the section “Schleifen” (loops) in the StartR app by Fabian Class: https://shiny.lmes.uni-potsdam.de/startR/#section-schleifen

7.1.3 Scraping the sub-pages

We can now use map() to parse all sub-pages in one step. As input, we use the character vector containing the URLs of the sub-pages; as the function to be applied, the familiar read_html(). The function is applied to each of the eight URLs one after the other. As output we get a list of the eight parsed sub-pages.

pages <- links %>% 
  map(read_html)

If we look at the sub-pages in the browser, we can see that the HTML structure is identical for each sub-page in terms of the information we are interested in: name, version number, and the links to CRAN and to “R for Data Science”. We can therefore extract the data for each of them using the same CSS selectors.

pages %>% 
  map(html_element, css = "a.navbar-brand") %>% 
  map_chr(html_text)
## [1] "ggplot2" "dplyr"   "tidyr"   "readr"   "purrr"   "tibble"  "stringr"
## [8] "forcats"

The name of the package is displayed in the menu bar in the upper section of the pages. This is enclosed by an <a> tag. For example, for https://ggplot2.tidyverse.org/ this is: <a class="navbar-brand" href="index.html">ggplot2</a>. The CSS selector used here is one of the possible options to retrieve the desired information.

So what happens in detail in the code shown? The input is the previously created list with the eight parsed websites. In the second line, map() applies the function html_element() with the argument css = "a.navbar-brand" to each of the parsed pages; for each of the eight pages, the corresponding HTML element is selected. The results are passed through the pipe to the third line, where we again iterate over each element, this time with the familiar function html_text(). For each of the eight selected elements, the text between the start and end tag is extracted. Since map_chr() is used here, a character vector is returned as output.
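The same result could also be achieved in a single step by wrapping both functions in an anonymous function, a matter of taste:

map_chr(pages, ~ html_text(html_element(.x, css = "a.navbar-brand")))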

pages %>% 
  map(html_element, css = "small.nav-text.text-muted.me-auto") %>% 
  map_chr(html_text)
## [1] "3.4.4" "1.1.3" "1.3.0" "2.1.4" "1.0.2" "3.2.1" "1.5.0" "1.0.0"

The extraction of the current version number of the packages works the same way. For ggplot2, it is contained in the following tag: <small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="Released version">3.4.4</small>. The <small> tag is used for smaller than normal text, which here results in the small version number written after the package’s name. Looking closely at the tag reveals an interesting detail: the class attribute contains spaces. This indicates that the <small> tag carries the three classes nav-text, text-muted and me-auto. We can select the tag by chaining all class names onto small in the selector, separated by dots. Strictly speaking, however, we do not need to do this here. Each class name by itself would be sufficient for selection, as none of them appears anywhere else on the page. In the interest of the most explicit CSS selectors possible, I would still recommend using all three class names, but this is also a matter of taste.
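If you want to check the claim that one class name suffices, a shortened selector should return the same vector as above; a sketch, relying on the assumption that nav-text does not occur elsewhere on the sub-pages:

pages %>% 
  map(html_element, css = "small.nav-text") %>% 
  map_chr(html_text)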

pages %>% 
  map(html_element, css = "ul.list-unstyled > li:nth-child(1) > a") %>% 
  map_chr(html_attr, name = "href")
## [1] "https://cloud.r-project.org/package=ggplot2"
## [2] "https://cloud.r-project.org/package=dplyr"  
## [3] "https://cloud.r-project.org/package=tidyr"  
## [4] "https://cloud.r-project.org/package=readr"  
## [5] "https://cloud.r-project.org/package=purrr"  
## [6] "https://cloud.r-project.org/package=tibble" 
## [7] "https://cloud.r-project.org/package=stringr"
## [8] "https://cloud.r-project.org/package=forcats"

pages %>% 
  map(html_element, css = "ul.list-unstyled > li:nth-child(4) > a") %>% 
  map_chr(html_attr, name = "href")
## [1] "https://r4ds.had.co.nz/data-visualisation.html"
## [2] "http://r4ds.had.co.nz/transform.html"          
## [3] "https://r4ds.had.co.nz/tidy-data.html"         
## [4] "http://r4ds.had.co.nz/data-import.html"        
## [5] "http://r4ds.had.co.nz/iteration.html"          
## [6] "https://r4ds.had.co.nz/tibbles.html"           
## [7] "http://r4ds.had.co.nz/strings.html"            
## [8] "http://r4ds.had.co.nz/factors.html"

The extraction of the links also follows the same basic principle. The selectors are a little more complicated, but can easily be understood by looking at the source code and/or using the WDTs. We select the <a> tags of the first and fourth <li> children of the unordered list with the class list-unstyled. Here we apply the function html_attr() with the argument name = "href" to each of the eight selected elements to get the data of interest, the URLs of the links.
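To make the :nth-child() part of the selector more tangible, here is a toy example on a made-up HTML snippet. minimal_html() from rvest builds a parseable page from a string:

toy <- minimal_html('
  <ul class="list-unstyled">
    <li><a href="first.html">First</a></li>
    <li><a href="second.html">Second</a></li>
  </ul>')

toy %>% 
  html_element(css = "ul.list-unstyled > li:nth-child(2) > a") %>% 
  html_attr(name = "href")
## [1] "second.html"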

If we are only interested in the final result, we can also extract the data of the sub-pages directly during the creation of a tibble:

tibble(
  name = pages %>% 
    map(html_element, css = "a.navbar-brand") %>% 
    map_chr(html_text),
  version = pages %>% 
    map(html_element, css = "small.nav-text.text-muted.me-auto") %>% 
    map_chr(html_text),
  CRAN = pages %>% 
    map(html_element, css = "ul.list-unstyled > li:nth-child(1) > a") %>% 
    map_chr(html_attr, name = "href"),
  Learn = pages %>% 
    map(html_element, css = "ul.list-unstyled > li:nth-child(4) > a") %>% 
    map_chr(html_attr, name = "href")
)
## # A tibble: 8 × 4
##   name    version CRAN                                        Learn             
##   <chr>   <chr>   <chr>                                       <chr>             
## 1 ggplot2 3.4.4   https://cloud.r-project.org/package=ggplot2 https://r4ds.had.…
## 2 dplyr   1.1.3   https://cloud.r-project.org/package=dplyr   http://r4ds.had.c…
## 3 tidyr   1.3.0   https://cloud.r-project.org/package=tidyr   https://r4ds.had.…
## 4 readr   2.1.4   https://cloud.r-project.org/package=readr   http://r4ds.had.c…
## 5 purrr   1.0.2   https://cloud.r-project.org/package=purrr   http://r4ds.had.c…
## 6 tibble  3.2.1   https://cloud.r-project.org/package=tibble  https://r4ds.had.…
## 7 stringr 1.5.0   https://cloud.r-project.org/package=stringr http://r4ds.had.c…
## 8 forcats 1.0.0   https://cloud.r-project.org/package=forcats http://r4ds.had.c…
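
As an alternative way to structure this final step, we could also write a small helper function that extracts all four pieces of information from a single parsed page and returns a one-row tibble, and then row-bind the results with purrr’s map_dfr(). A sketch, using the same selectors as above:

# Extract name, version and both links from one parsed sub-page
scrape_package_page <- function(page) {
  tibble(
    name    = page %>% html_element(css = "a.navbar-brand") %>% html_text(),
    version = page %>% html_element(css = "small.nav-text.text-muted.me-auto") %>% html_text(),
    CRAN    = page %>% html_element(css = "ul.list-unstyled > li:nth-child(1) > a") %>% html_attr(name = "href"),
    Learn   = page %>% html_element(css = "ul.list-unstyled > li:nth-child(4) > a") %>% html_attr(name = "href")
  )
}

pages %>% 
  map_dfr(scrape_package_page)

This keeps all selectors for one sub-page in a single place, which can make the code easier to adapt when the structure of the pages changes.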