6 Scraping of tables & dynamic websites

6.1 Scraping of tables

In web scraping, we will often pursue the goal of transferring the extracted data into a tibble or data frame in order to analyse it further. It is particularly helpful if the data we are interested in is already stored in an HTML table, because rvest allows us to read out complete tables quickly and easily with the function html_table().

As a reminder, the basic structure of the HTML code for tables is as follows:

<table>
  <tr> <th>#</th> <th>Tag</th> <th>Effect</th> </tr>
  <tr> <td>1</td> <td>"b"</td> <td>bold</td> </tr>
  <tr> <td>2</td> <td>"i"</td> <td>italics</td> </tr>
</table>

The <table> tag encloses the entire table. Rows are defined with <tr>, column headings with <th> and data cells with <td>.

Before we start scraping, we load the necessary packages as usual:

library(tidyverse)
library(rvest)
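
Before we turn to a real website, we can try html_table() on the snippet from above. The rvest function minimal_html() wraps an HTML fragment in a complete document, so that it can be parsed and queried like a regular website. A minimal sketch – the result should be a tibble with two rows and the columns “#”, “Tag” and “Effect”:

# Parse the example table from above and read it into a tibble
minimal_html('
  <table>
    <tr> <th>#</th> <th>Tag</th> <th>Effect</th> </tr>
    <tr> <td>1</td> <td>"b"</td> <td>bold</td> </tr>
    <tr> <td>2</td> <td>"i"</td> <td>italics</td> </tr>
  </table>') %>% 
  html_element(css = "table") %>% 
  html_table()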

6.1.1 Table with CSS selectors from Wikipedia

The Wikipedia page on “CSS” contains, among other things, a table of CSS selectors. This is our scraping target.

First we parse the website:

website <- "https://en.wikipedia.org/wiki/CSS" %>% 
  read_html()

If we look at the source code and search – CTRL+F – for “<table”, we see that this page contains a large number of HTML tables. These include not only the elements that are recognisable at first glance as “classic” tables, but also, among other things, the “info boxes” at the top right edge of the article or the fold-out lists of further links at the bottom. If you want to look at this more closely, the Web Developer Tools can be very helpful here.

Instead of simply selecting all <table> elements on the page, one strategy might be to use the WDTs to create a CSS selector for that specific table: "table.wikitable:nth-child(42)". We thus select the table of class "wikitable" that is the 42nd child of its parent element – the <div class="mw-parser-output">.

If we only want to select a single HTML element, it can be helpful to use the function html_element() instead of html_elements().

elements <- website %>% 
  html_elements(css = "table.wikitable:nth-child(42)")
elements
## {xml_nodeset (1)}
## [1] <table class="wikitable"><tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th ...

element <- website %>% 
  html_element(css = "table.wikitable:nth-child(42)")
element
## {html_node}
## <table class="wikitable">
## [1] <tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th>\n<th>First defined<br>i ...

The difference is mainly in the output of the function. This is recognisable by the entry inside the { } in the output. In the first case, we get a list of HTML elements – an “xml_nodeset” – even if this list, as here, consists of only one entry. html_element() returns the HTML element itself – an “html_node” – as the function’s output. Why is this relevant? In many cases it can be easier to work directly with the HTML element instead of a list of HTML elements, for example when transferring tables into data frames and tibbles, but more on that later.

To read out the table selected in this way, we only need to apply the function html_table() to the HTML element.

css_table <- element %>% 
  html_table()

css_table %>% 
  head(n = 4)
## # A tibble: 4 × 3
##   Pattern       Matches                                   First definedin CSS …¹
##   <chr>         <chr>                                                      <int>
## 1 E             an element of type E                                           1
## 2 E:link        an E element that is the source anchor o…                      1
## 3 E:active      an E element during certain user actions                       1
## 4 E::first-line the first formatted line of an E element                       1
## # ℹ abbreviated name: ¹​`First definedin CSS level`

The result is a tibble that contains the scraped contents of the HTML table and adopts the column names stored in the <th> tags for the columns.
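
As an optional clean-up, we could give the third column a simpler name. Its original name contains a line break from the <br> tag inside the <th> element, which makes it awkward to type. A small sketch using dplyr’s rename() with a column position – the new name css_level is our own choice:

# Rename the third column to something easier to type
css_table <- css_table %>% 
  rename(css_level = 3)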

6.1.2 Scraping multiple tables

Our scraping goal could also be to scrape not only the first, but all four content tables of the Wikipedia article. If we look at the four tables in the source code and/or the WDTs, we see that they all carry the class "wikitable". This allows us to select them easily. Please note that the function html_elements() must be used again, as we no longer need just one element, but a list of several selected elements.

tables <- website %>% 
  html_elements(css = "table.wikitable") %>% 
  html_table()

The result is a list of four tibbles, each of which contains one of the four tables. If we want to select an individual tibble from the list, for example, to transfer it into a new object, we have to rely on subsetting.

We have learned about basic subsetting for vectors using [#] in chapter 1. For lists, things can get a little more complicated. There are basically two ways of subsetting lists in R: list_name[#] and list_name[[#]]. The most relevant difference for us is what kind of object R returns. In the first case, the returned object is always a list, even if it may only consist of one element. Using double square brackets, on the other hand, returns a single element directly. So the difference is not dissimilar to that between html_elements() and html_element().

For example, if our goal is to select the third tibble from the list of four data frames, which subsetting should we use?

tables[3] %>% 
  str()
## List of 1
##  $ : tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Selectors  : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
##   ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...

tables[[3]] %>% 
  str()
## tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Selectors  : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
##  $ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...

In the first case, we see that we have a list of length 1, which contains a tibble with 7 rows and 2 variables, as well as further information about these variables. In the second case, we get the tibble directly, i.e. no longer as an element of a list. So we have to use list_name[[#]] to directly select a single tibble from a list of tibbles.

If we are interested in selecting several elements from a list instead, this is only possible with list_name[]. Instead of selecting an element with a single number, we can select several with a vector of numbers in one step.

tables[c(1, 3)] %>% 
  str()
## List of 2
##  $ : tibble [42 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ Pattern                  : chr [1:42] "E" "E:link" "E:active" "E::first-line" ...
##   ..$ Matches                  : chr [1:42] "an element of type E" "an E element that is the source anchor of a hyperlink whose target is either not yet visited (:link) or already"| __truncated__ "an E element during certain user actions" "the first formatted line of an E element" ...
##   ..$ First definedin CSS level: int [1:42] 1 1 1 1 1 1 1 1 1 1 ...
##  $ : tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Selectors  : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
##   ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...

As a result, we get a list again that contains the two tibbles selected here.
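
If we want to apply the same operation to every tibble in such a list, the map functions from purrr – also part of the core tidyverse – are a natural fit. A short sketch that counts the rows of each scraped table:

# Apply nrow() to each tibble in the list; returns an integer vector
tables %>% 
  map_int(nrow)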

6.1.3 Tables with NAs

What happens when we try to read a table with missing values? Consider the following example: https://jakobtures.github.io/web-scraping/table_na.html

At first glance, it is already obvious that several cells of the table are empty. Values are missing. Let’s try to read in the table anyway.

table_na <- "https://jakobtures.github.io/web-scraping/table_na.html" %>% 
  read_html() %>% 
  html_element(css = "table")

turnout <- table_na %>% 
  html_table()

turnout
## # A tibble: 16 × 3
##    Bundesland             Wahljahr Wahlbeteiligung
##    <chr>                     <dbl>           <dbl>
##  1 Baden-Württemberg        2016              NA  
##  2 Bayern                   2018              NA  
##  3 Berlin                     NA              66.9
##  4 Brandenburg                61.3            NA  
##  5 Bremen                   2019              64.1
##  6 Hamurg                   2020              63.2
##  7 Hessen                   2018              67.3
##  8 Mecklenburg-Vorpommern   2016              61.6
##  9 Niedersachsen            2017              63.1
## 10 Nordrhein-Westfalen      2017              65.2
## 11 Rheinland-Pfalz          2016              70.4
## 12 Saarland                 2017              69.7
## 13 Sachsen                  2019              66.6
## 14 Sachsen-Anhalt           2016              61.1
## 15 Schleswig-Holstein       2017              64.2
## 16 Thüringen                2019              64.9

As we can see, html_table() filled four cells with NA. This stands for “Not Available” and represents missing values in R. However, there are different types of missing values in the HTML source code, which the automatic repair implemented in html_table() handles differently. Let’s first look at the source code of the first two rows:

<tr>
  <td>Baden-Württemberg</td>
  <td>2016</td>
  <td></td>
</tr>

<tr>
  <td>Bayern</td>
  <td>2018</td>
</tr>

In both cases the value for the turnout is missing. For “Baden-Württemberg”, we see that the cell in the third column is created in the HTML code but has no content. html_table() knows that this empty cell has to be filled with an NA. In contrast, for “Bayern” the cell is missing completely. This means that the second row of the table consists of only two columns, while the rest of the table has three. In this case too, html_table() drew the correct conclusion and filled the missing third column with an NA.

But let’s also look at the third and fourth rows in the source code:

<tr>
  <td>Berlin</td>
  <td></td>
  <td>66.9</td>
</tr>

<tr>
  <td>Brandenburg</td>
  <td>61.3</td>
</tr>

The second column is missing in both cases. In the first case it is created but empty, in the second it does not exist. In the first case, html_table() can again handle it without any problems. For “Brandenburg”, however, the function reaches its limits. We, as human observers, quickly realise that the last state election in Brandenburg did not take place in the year 61.3 and that this must therefore be the turnout. R cannot distinguish this so easily: it takes 61.3 as the value for the “Wahljahr” (election year) column and inserts an NA in the third column.

What to do? First of all, we should be aware that such problems exist. We should therefore check whether the automatic repair actually produces the correct result. If it does not, we can at least correct the problems after extraction.

Our problem lies exclusively in row four. The value in its second column must be moved to the third column, and the second column must then be set to NA. For this we need subsetting again. In the case of a tibble, we can select a cell by specifying the row and column in the form tbl[row, column]. So we can tell R: “Write the content of cell [4, 2] into cell [4, 3], and then write NA into cell [4, 2]”.

turnout[4, 3] <- turnout[4, 2]
turnout[4, 2] <- NA

turnout %>% 
  head(n = 4)
## # A tibble: 4 × 3
##   Bundesland        Wahljahr Wahlbeteiligung
##   <chr>                <dbl>           <dbl>
## 1 Baden-Württemberg     2016            NA  
## 2 Bayern                2018            NA  
## 3 Berlin                  NA            66.9
## 4 Brandenburg             NA            61.3
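
If such shifts occurred in more than one row, fixing each cell by hand would not scale. A more general sketch, under the assumption that election years are four-digit numbers while turnout values are percentages below 100:

# Identify rows where a turnout value slipped into the Wahljahr column
shifted <- !is.na(turnout$Wahljahr) & turnout$Wahljahr < 1000

# Move the misplaced values into the turnout column and set the year to NA
turnout$Wahlbeteiligung[shifted] <- turnout$Wahljahr[shifted]
turnout$Wahljahr[shifted] <- NA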

6.2 Dynamic Websites

In the “reality” of the modern internet, we will increasingly encounter websites that are no longer based exclusively on static HTML files, but generate content dynamically. You know this, for example, from the timelines of social media platforms, which are generated dynamically based on your user profile. Other websites may generate the displayed content with JavaScript functions or in response to input in HTML forms.

In many of these cases, it is no longer sufficient from a web scraping perspective to parse an HTML page and extract the data you are looking for, as this is often not contained in the HTML source code but is loaded dynamically in the background. The good news is that there are usually ways of scraping the information anyway.

Perhaps the operator of a page or service offers an API (Application Programming Interface). In this case, we can register for access to this interface and then retrieve the data of interest. This is possible with Twitter, for example. In other cases, we may be able to identify in the embedded scripts how and from which database the information is loaded and access it directly. Or we use the Selenium WebDriver to “remotely control” a browser window and scrape what the browser “sees”.

However, all of these approaches are advanced methods that are beyond the scope of this introduction.

But in cases where an HTML file is dynamically generated based on input into an HTML form, we can often (not always) read it using the methods we already know.

6.2.1 HTML forms and HTML queries

As an example, let’s first look at the OPAC catalogue of the Potsdam University Library https://opac.ub.uni-potsdam.de/ in the browser.

If we enter the term “test” in the search field and click on Search, the browser window will show us the results of the search query. But what actually interests us here is the browser’s address bar. Instead of the URL “https://opac.ub.uni-potsdam.de/”, there is now a much longer URL. Note that the exact URL may very well differ for you, but the basic form should be similar to: “https://opac.ub.uni-potsdam.de/DB=1/LNG=DU/SID=3f0e2b15-1/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test”.

The first part is obviously still the URL of the website called up: “https://opac.ub.uni-potsdam.de/”. Let’s call this the base URL. However, the part “CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” was added to the end of the URL. This is the HTML query we are interested in here. Between the base URL and the query there are one or more components, which in this case may also differ depending on your browser. However, these are also irrelevant for the actual search query. We can shorten the URL to “https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” and receive the same result.

A query is a request in which data from an HTML form is sent to the server. In response, the server generates a new website, which is sent back to the user and displayed in the browser. In this case, the query was triggered by clicking on the “Search” button. If we understand what the components of the query do, we can manipulate it in a targeted way to have the server generate a website of interest, which we can then parse.

6.2.2 HTML forms

To do this, we first need to take a look at the HTML code of the search form. To understand this, you should display the source code of the page and search for “<form” or use the WDTs to look at the form and its components.

<form action="CMD"
      class="form"
      name="SearchForm"
      method="GET">
  ...
</form>

HTML forms are enclosed by the <form> tag. Within the tag, one or more form elements such as text entry fields, drop-down option lists, buttons, etc. can be placed.

<form> itself carries a number of attributes in this example. The first attribute of interest to us is the method="GET" attribute. This specifies the method of data transfer between client and server. It is important to note that the method “GET” uses queries in the URL for the transmission of data, while the method “POST” does not. We can therefore only manipulate queries in this way if the “GET” method is used. If no method is specified in the <form> tag, “GET” is used as the default.

The second attribute of interest to us is action="CMD". This specifies which action should be triggered after the form has been submitted. Often the value of action= is the name of a file on the server to which the data is sent and which then returns a dynamically generated HTML page to the user.

Let us now look at the elements of the form. For this, the rvest function html_form() can be helpful.

"https://opac.ub.uni-potsdam.de/" %>% 
  read_html() %>% 
  html_element(css = "form") %>% 
  html_form()
## <form> 'SearchForm' (GET https://opac.ub.uni-potsdam.de/CMD)
##   <field> (select) ACT: SRCHA
##   <field> (select) IKT: 
##   <field> (select) SRT: 
##   <field> (checkbox) FUZZY: Y
##   <field> (text) TRM: 
##   <field> (submit) :  Suchen

The output shows us in the first line the name of the form and the action that is performed on submit: “GET https://opac.ub.uni-potsdam.de/CMD”. The other six lines show the form components:

  • The three drop-down selections for:
    • type of search
    • which fields should be searched
    • how results should be ordered
  • The checkbox for “unscharfe Suche” (fuzzy search)
  • The text field where we enter terms to be searched
  • The search button itself

We also see the names of these components and, in some cases, the default value that is sent when the form is submitted, as long as no other value is selected or entered.

Let’s look at some of these elements. <select> elements are drop-down lists of options that can be selected. This is the source code for the first <select> element in our example:

<select name="ACT">
  <OPTION VALUE="SRCH">suchen [oder]
  <OPTION VALUE="SRCHA" SELECTED>suchen [und]
  <OPTION value="AND">eingrenzen
  <OPTION value="OR">erweitern
  <OPTION value="NOT">ausgenommen
  <OPTION value="RLV">neu ordnen
  <OPTION value="BRWS">Index bl&auml;ttern
</select>

The attribute name="ACT" defines the element’s name, which is used when transmitting the data from the form via the query. The <option> tags define the selectable options, i.e. the drop-down menu. The value="" attribute holds the value transmitted by the form, while the user is shown the text following the tag. The default selection is the first option in the list, unless – as in this case – an option is explicitly marked as the default with the selected attribute.
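
We can also extract the selectable values of such an element directly in R. A small sketch – the exact values returned depend on the live page:

# Read the value attribute of every option in the ACT drop-down list
"https://opac.ub.uni-potsdam.de/" %>% 
  read_html() %>% 
  html_elements(css = "select[name='ACT'] option") %>% 
  html_attr("value")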

The three other elements are <input> tags: input fields whose specific type is specified via the type="" attribute. These can be, for example, text boxes (type="text") or checkboxes (type="checkbox"), but there are many more options available. A comprehensive list can be found at: https://www.w3schools.com/html/html_form_input_types.asp. Here is the source code for two of the three <input> elements on the example page:

<input type="text" name="TRM" value="" size="50">
...
<input type="submit" class="button" value=" Suchen ">

The first tag is of the type “text”, i.e. a text field – in this case the text field into which the search term is entered. In addition to the name of the element, a default value of the field is specified via value="". In this case, the default value is an empty field. The second tag is of the type “submit”. This is the “Search” button, which, when clicked, triggers the transmission of the form data via the query.

6.2.3 The query

But what exactly is being transmitted? Let’s look again at the example query from above:

CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test

The value of the action="" attribute forms the first part of the query and is appended after the base URL. The value of the attribute tells the server what to do with the other transmitted data. This is followed by a ?, which introduces the data to be transmitted as several pairs of the name="" and value="" attributes of the individual elements. The pairs are connected with &. ACT=SRCHA thus stands for the fact that the (default) value “SRCHA” has been selected in the element with the name “ACT”. You can work out for yourself what the values of the two other <select> elements, “IKT” and “SRT”, stand for by looking at the source code or the WDTs. They are not important for our endeavour. The text entered in the field is transmitted as the value of the <input type="text"> tag with the name “TRM”. Here the value was “test”.
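
If we want to decompose such a query programmatically rather than by eye, the function parse_url() from the httr package – assuming httr is installed, as it is not part of the core tidyverse – splits a URL into its components, including the name/value pairs of the query:

# Extract the query component as a named list of name/value pairs
httr::parse_url("https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test")$query
## $ACT
## [1] "SRCHA"
## 
## $IKT
## [1] "1016"
## 
## $SRT
## [1] "YOP"
## 
## $TRM
## [1] "test"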

The server receives the form data in this way, can then take a decision on the basis of the action="" attribute, here “CMD”, how the data is to be processed and constructs the website accordingly, which it sends back to us and which is displayed in our browser.

6.2.4 Manipulating the query and scraping the result

Now that we know what the components of the query mean, we can manipulate them. Instead of writing queries by hand, we should use R to combine them for us. We will encounter the technique of manipulating URLs directly in R code more often, so we should learn it early.

The function str_c() from stringr (core tidyverse) combines the strings listed as arguments into a single string. Strings stored in other R objects can also be included. If our goal is to manipulate both the search method and the search term, we could achieve this as follows:

base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "test"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)
url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test"

If we now change the strings stored in the method and term objects and generate the complete URL again, these components of the query are manipulated accordingly.

method <- "SRCH"
term <- "web+scraping"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)
url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCH&IKT=1016&SRT=YOP&TRM=web+scraping"

The search method was set to the value “SRCH”, i.e. an “OR” search, and the search term to “web scraping”. It is important to note that no spaces may appear in the query; they are replaced by “+” when the form is submitted. So instead of “web scraping” we have to use the string “web+scraping”.
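
Instead of replacing spaces by hand, we can let stringr do it for us. A small helper sketch:

# Replace all spaces in the search term with "+"
term <- "web scraping" %>% 
  str_replace_all(" ", "+")
term
## [1] "web+scraping"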

As an example application, we can now have the server perform an “AND” search for the term “web scraping”, read out the HTML page generated by the server and extract the 10 titles displayed.

base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "web+scraping"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)
url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=web+scraping"

website <- url %>% 
  read_html()

The search results are displayed in a table in the generated HTML file. The <table> tag has the attribute-value combination summary="hitlist", which we can use for our CSS selector:

hits <- website %>% 
  html_element(css = "table[summary='hitlist']") %>% 
  html_table() %>% 
  as_tibble()

hits %>% head(n = 10)
## # A tibble: 10 × 4
##    X1       X2 X3                                                          X4   
##    <lgl> <dbl> <chr>                                                       <lgl>
##  1 NA       NA ""                                                          NA   
##  2 NA        1 "Estimating House Prices in Emerging Markets and Developin… NA   
##  3 NA       NA ""                                                          NA   
##  4 NA        2 "Elgar encyclopedia of technology and politics/ Ceron, And… NA   
##  5 NA       NA ""                                                          NA   
##  6 NA        3 "The Beginner's Guide to Data Science/ Ball, Robert. - 1st… NA   
##  7 NA       NA ""                                                          NA   
##  8 NA        4 "From #Hashtags to Legislation : Engagement and Support fo… NA   
##  9 NA       NA ""                                                          NA   
## 10 NA        5 "Disaggregating China, Inc : State Strategies in the Liber… NA

This worked, but we see that the table consists mainly of empty rows and cells. These are invisible on the website, but are used to format the display. Instead of repairing the table afterwards, it makes more sense to extract only the cells that contain the information we are looking for. These are the <td> tags with class="hit" and the attribute-value combination align="left". On this basis, we can construct a unique CSS selector.

hits <- website %>% 
  html_elements(css = "td.hit[align='left']") %>% 
  html_text(trim = TRUE)

hits %>% head(n = 5)
## [1] "Estimating House Prices in Emerging Markets and Developing Economies : A Big Data Approach/ Behr, Daniela M.. - Washington, D.C : The World Bank, 2023"                        
## [2] "Elgar encyclopedia of technology and politics/ Ceron, Andrea. - Cheltenham, UK : Edward Elgar Publishing, 2022"                                                                
## [3] "The Beginner's Guide to Data Science/ Ball, Robert. - 1st ed. 2022. - Cham : Springer International Publishing, 2022"                                                          
## [4] "From #Hashtags to Legislation : Engagement and Support for Economic Reforms in the Gulf Cooperation Council Countries/ Arezki, Rabah. - Washington, D.C : The World Bank, 2022"
## [5] "Disaggregating China, Inc : State Strategies in the Liberal Economic Order/ Tan, Yeling. -  [Online-Ausgabe]. - Ithaca, NY : Cornell University Press, [2022]"
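
As an aside, newer versions of rvest (1.0 and higher) can also fill in and submit HTML forms for us, so that we do not have to assemble the query string ourselves. A sketch of this alternative route, which should lead to the same hit list:

# Parse the search form, set the text field TRM and submit the form.
# html_form_submit() performs the GET request and returns the response,
# which read_html() can parse as usual.
response <- "https://opac.ub.uni-potsdam.de/" %>% 
  read_html() %>% 
  html_element(css = "form") %>% 
  html_form() %>% 
  html_form_set(TRM = "web scraping") %>% 
  html_form_submit()

website <- read_html(response)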

6.2.5 Additional resources

In order to process this information further and, for example, separate it into data on author, title, year, etc., advanced knowledge in dealing with strings is necessary, which unfortunately goes beyond the scope of this introduction. A good first overview can be found in the chapter “Strings” from “R for Data Science” by Wickham and Grolemund: https://r4ds.had.co.nz/strings.html

The appropriate “cheat sheet” is also recommended: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
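
As a small teaser of what such string processing can look like: under the assumption that the title part and the remaining bibliographic information are separated by “/ ”, a first separation step could be sketched like this:

# Split each hit into the parts before and after the first "/ "
hits %>% 
  str_split_fixed("/ ", n = 2) %>% 
  head(n = 2)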