6 Scraping of tables & dynamic websites
6.1 Scraping of tables
In web scraping, we will often pursue the goal of transferring the extracted data into a tibble or data frame in order to be able to analyse it further. It is particularly helpful if the data we are interested in is already stored in an HTML table, because rvest allows us to read out complete tables quickly and easily with the function html_table().
As a reminder, the basic structure of the HTML code for tables is as follows:
<table>
<tr> <th>#</th> <th>Tag</th> <th>Effect</th> </tr>
<tr> <td>1</td> <td>"b"</td> <td>bold</td> </tr>
<tr> <td>2</td> <td>"i"</td> <td>italics</td> </tr>
</table>
The <table> tag covers the entire table. Rows are defined by <tr>, column headings with <th> and cells with <td>.
Before we start scraping, we load the necessary packages as usual:
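As in the previous chapters, these are presumably the tidyverse – which also covers stringr, used later in this chapter – and rvest:

library(tidyverse)
library(rvest)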
6.1.1 Table with CSS selectors from Wikipedia
On the Wikipedia page on “CSS”, there is also a table with CSS selectors. This is our scraping target.
First we parse the website:
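A minimal sketch, assuming the English Wikipedia article https://en.wikipedia.org/wiki/CSS as the target; the parsed page is stored in the website object that is used below:

website <- "https://en.wikipedia.org/wiki/CSS" %>%
  read_html()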
If we look at the source code and search – CTRL+F – for “<table”, we see that this page contains a large number of HTML tables. These include not only the elements that are recognisable at first glance as “classic” tables, but also, among other things, the “info boxes” at the top right edge of the article or the fold-out lists of further links at the bottom. If you want to look at this more closely, the Web Developer Tools can be very helpful here.
Instead of simply selecting all <table> elements on the page, one strategy might be to use the WDTs to create a CSS selector for that specific table: "table.wikitable:nth-child(42)". We thus select the table of class "wikitable" that is the 42nd child of its parent element, <div class="mw-parser-output">.
If we only want to select a single HTML element, it can be helpful to use the function html_element() instead of html_elements().
elements <- website %>%
html_elements(css = "table.wikitable:nth-child(42)")
elements
## {xml_nodeset (1)}
## [1] <table class="wikitable"><tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th ...
element <- website %>%
html_element(css = "table.wikitable:nth-child(42)")
element
## {html_node}
## <table class="wikitable">
## [1] <tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th>\n<th>First defined<br>i ...
The difference is mainly in the output of the function. This is recognisable by the entry inside the { } in the output. In the first case, we get a list of HTML elements – an “xml_nodeset” – even if this list, as here, consists of only one entry. html_element(), in contrast, returns the HTML element itself – an “html_node”. Why is this relevant? In many cases it can be easier to work directly with the HTML element instead of a list of HTML elements, for example when transferring tables into data frames and tibbles, but more on that later.
To read out the table selected in this way, we only need to apply the function html_table() to the HTML element.
css_table <- element %>%
html_table()
css_table %>%
head(n = 4)
## # A tibble: 4 × 3
## Pattern Matches First definedin CSS …¹
## <chr> <chr> <int>
## 1 E an element of type E 1
## 2 E:link an E element that is the source anchor o… 1
## 3 E:active an E element during certain user actions 1
## 4 E::first-line the first formatted line of an E element 1
## # ℹ abbreviated name: ¹`First definedin CSS level`
The result is a tibble that contains the scraped contents of the HTML table and adopts the column names stored in the <th> tags.
6.1.2 Scraping multiple tables
Our scraping goal could also be to scrape not only the first, but all four content tables of the Wikipedia article. If we look at the four tables in the source code and/or the WDTs, we see that they all carry the class "wikitable". This allows us to select them easily. Please note that the function html_elements() must be used again, as we no longer need just one element, but a list of several selected elements.
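The corresponding code could look like this – a sketch that stores the resulting list in the tables object used in the examples below:

tables <- website %>%
  html_elements(css = "table.wikitable") %>%
  html_table()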
The result is a list of four tibbles, each of which contains one of the four tables. If we want to select an individual tibble from the list, for example, to transfer it into a new object, we have to rely on subsetting.
We have learned about basic subsetting for vectors using [#] in chapter 1. For lists, things can get a little more complicated.
There are basically two ways of subsetting lists in R: list_name[#] and list_name[[#]]. The most relevant difference for us is what kind of object R returns to us. In the first case, the returned object is always a list, even if it may only consist of one element. Using double square brackets, on the other hand, returns a single element directly. So the difference is not dissimilar to that between html_elements() and html_element().
For example, if our goal is to select the third tibble from the list of four data frames, which subsetting should we use?
tables[3] %>%
str()
## List of 1
## $ : tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Selectors : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
tables[[3]] %>%
str()
## tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
## $ Selectors : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## $ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
In the first case, we see that we have a list of length 1, which contains a tibble with 7 rows and 2 variables, as well as further information about these variables. In the second case, we get the tibble directly, i.e. no longer as an element of a list. So we have to use list_name[[#]] to directly select a single tibble from a list of tibbles.
If we are interested in selecting several elements from a list instead, this is only possible with list_name[#]. Instead of selecting an element with a single number, we can select several with a vector of numbers in one step.
tables[c(1, 3)] %>%
str()
## List of 2
## $ : tibble [42 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ Pattern : chr [1:42] "E" "E:link" "E:active" "E::first-line" ...
## ..$ Matches : chr [1:42] "an element of type E" "an E element that is the source anchor of a hyperlink whose target is either not yet visited (:link) or already"| __truncated__ "an E element during certain user actions" "the first formatted line of an E element" ...
## ..$ First definedin CSS level: int [1:42] 1 1 1 1 1 1 1 1 1 1 ...
## $ : tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Selectors : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
As a result, we get a list again that contains the two tibbles selected here.
6.1.3 Tables with NAs
What happens when we try to read a table with missing values? Consider the following example: https://jakobtures.github.io/web-scraping/table_na.html
At first glance, it is already obvious that several cells of the table are empty. Values are missing. Let’s try to read in the table anyway.
table_na <- "https://jakobtures.github.io/web-scraping/table_na.html" %>%
  read_html() %>%
  html_element(css = "table")
turnout <- table_na %>%
html_table()
turnout
## # A tibble: 16 × 3
## Bundesland Wahljahr Wahlbeteiligung
## <chr> <dbl> <dbl>
## 1 Baden-Württemberg 2016 NA
## 2 Bayern 2018 NA
## 3 Berlin NA 66.9
## 4 Brandenburg 61.3 NA
## 5 Bremen 2019 64.1
## 6 Hamurg 2020 63.2
## 7 Hessen 2018 67.3
## 8 Mecklenburg-Vorpommern 2016 61.6
## 9 Niedersachsen 2017 63.1
## 10 Nordrhein-Westfalen 2017 65.2
## 11 Rheinland-Pfalz 2016 70.4
## 12 Saarland 2017 69.7
## 13 Sachsen 2019 66.6
## 14 Sachsen-Anhalt 2016 61.1
## 15 Schleswig-Holstein 2017 64.2
## 16 Thüringen 2019 64.9
As we can see, html_table() filled four cells with NA. This stands for “Not Available” and represents missing values in R. However, there are different types of missing values in the HTML source code, which the automatic repair implemented in html_table() handles differently. Let’s first look at the source code of the first two rows:
<tr>
<td>Baden-Württemberg</td>
<td>2016</td>
<td></td>
</tr>
<tr>
<td>Bayern</td>
<td>2018</td>
</tr>
In both cases the value for turnout is missing.
For “Baden-Württemberg”, we see that the third column is created in the HTML code, but there is no content in this cell. html_table() knows that this empty cell has to be filled with an NA. In contrast, for “Bayern” the cell is completely missing. This means that the second row of the table consists of only two columns, while the rest of the table has three columns. In this case too, html_table() was able to draw the correct conclusion and filled the missing third column with an NA.
But let’s also look at the third and fourth rows in the source code:
<tr>
<td>Berlin</td>
<td></td>
<td>66.9</td>
</tr>
<tr>
<td>Brandenburg</td>
<td>61.3</td>
</tr>
The second column is missing in both cases. In the first case it is created but empty, in the second it does not exist. In the first case, html_table() can again handle it without any problems. For “Brandenburg”, however, the function reaches its limits. We, as human observers, quickly realise that the last state election in Brandenburg did not take place in the year 61.3 and that this number must therefore be the turnout. R cannot make this distinction so easily: it takes 61.3 as the value for the column “Wahljahr” (election year) and inserts an NA in the third column.
What to do? First of all, we should be aware that such problems exist. We should therefore check whether a table suffers from this problem and whether the automatic repair actually produces a correct result. If it does not, we can at least correct the problems after extraction.
Our problem lies exclusively in row four. The value in its second column must be moved to the third column, and the second must then itself be set to NA. For this we need subsetting again. In the case of a tibble, we select a cell by specifying the row and column in the form tbl[row, column]. So we can tell R: “Write into the third cell the content of the second cell, and then write NA into the second cell”.
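A minimal sketch of this correction, applied to the turnout object from above; NA_real_ is used here because the column holds doubles:

# move the misplaced turnout value from the second to the third column of row 4
turnout[4, 3] <- turnout[4, 2]
# then mark the election year as missing
turnout[4, 2] <- NA_real_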
6.2 Dynamic Websites
In the “reality” of the modern internet, we will increasingly encounter websites that are no longer based exclusively on static HTML files, but generate content dynamically. You know this, for example, in the form of timelines in social media offerings that are generated dynamically based on your user profile. Other websites may generate the displayed content with JavaScript functions or in response to input in HTML forms.
In many of these cases, it is no longer sufficient from a web scraping perspective to parse an HTML page and extract the data you are looking for, as this is often not contained in the HTML source code but is loaded dynamically in the background. The good news is that there are usually ways of scraping the information anyway.
Perhaps the operator of a page or service offers an API (Application Programming Interface). In this case, we can register for access to this interface and then get access to the data of interest. This is possible with Twitter, for example. In other cases, we may be able to identify in the embedded scripts how and from which database the information is loaded and access it directly. Or we use the Selenium WebDriver to “remotely control” a browser window and scrape what the browser “sees”.
However, all of these approaches are advanced methods that are beyond the scope of this introduction.
But in cases where an HTML file is dynamically generated based on input into an HTML form, we can often – though not always – read it using the methods we already know.
6.2.1 HTML forms and HTML queries
As an example, let’s first look at the OPAC catalogue of the Potsdam University Library https://opac.ub.uni-potsdam.de/ in the browser.
If we enter the term “test” in the search field and click on Search, the browser window will show us the results of the search query. But what actually interests us here is the browser’s address bar. Instead of the URL “https://opac.ub.uni-potsdam.de/”, there is now a much longer URL. Note that the exact URL may very well differ for you, but the basic form should be similar to: “https://opac.ub.uni-potsdam.de/DB=1/LNG=DU/SID=3f0e2b15-1/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test”.
The first part is obviously still the URL of the website called up: “https://opac.ub.uni-potsdam.de/”. Let’s call this the base URL. However, the part “CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” was added to the end of the URL. This is the HTML query we are interested in here. Between the base URL and the query there are one or more components, which in this case may also differ depending on your browser. However, these are also irrelevant for the actual search query. We can shorten the URL to “https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” and receive the same result.
A query is a request in which data from an HTML form is sent to the server. In response, the server generates a new website, which is sent back to the user and displayed in the browser. In this case, the query was triggered by clicking on the “Search” button. If we understand what the components of the query do, we could manipulate it and use it specifically to have a website of interest created and parsed.
6.2.2 HTML forms
To do this, we first need to take a look at the HTML code of the search form. To understand this, you should display the source code of the page and search for “<form” or use the WDTs to look at the form and its components.
<form action="CMD"
class="form"
name="SearchForm"
method="GET">
...
</form>
HTML forms are encompassed by the <form> tag. Within the tag, one or more form elements such as text entry fields, drop-down option lists, buttons, etc. can be placed.
<form> itself carries a number of attributes in this example. The first attribute of interest to us is method="GET". This specifies the method of data transfer between client and server. It is important to note that the method “GET” transmits the form data as a query in the URL, while the method “POST” does not. We can therefore only manipulate queries in this way if the “GET” method is used. If no method is specified in the <form> tag, “GET” is used as the default.
The second attribute of interest to us is action="CMD". This specifies which action should be triggered after the form has been submitted. Often the value of action= is the name of a file on the server to which the data is sent and which then returns a dynamically generated HTML page to the user.
Let us now look at the elements of the form. For this, the rvest function html_form() can be helpful.
"https://opac.ub.uni-potsdam.de/" %>%
read_html() %>%
html_element(css = "form") %>%
html_form()
## <form> 'SearchForm' (GET https://opac.ub.uni-potsdam.de/CMD)
## <field> (select) ACT: SRCHA
## <field> (select) IKT:
## <field> (select) SRT:
## <field> (checkbox) FUZZY: Y
## <field> (text) TRM:
## <field> (submit) : Suchen
The output shows us in the first line the name of the form and the action that is performed on submit: “GET https://opac.ub.uni-potsdam.de/CMD”. The other six lines show the form components:
- The three drop-down selections for:
- type of search
- which fields should be searched
- how results should be ordered
- The checkbox for “unscharfe Suche” (fuzzy search)
- The text field where we enter terms to be searched
- The search button itself
We also see the names of these components as well as in some cases the default value that is sent when the form is submitted, as long as no other value is selected or entered.
Let’s look at some of these elements. <select> elements are drop-down lists of options that can be selected. This is the source code for the first <select> element in our example:
<select name="ACT">
<OPTION VALUE="SRCH">suchen [oder]
<OPTION VALUE="SRCHA" SELECTED>suchen [und]
<OPTION value="AND">eingrenzen
<OPTION value="OR">erweitern
<OPTION value="NOT">ausgenommen
<OPTION value="RLV">neu ordnen
<OPTION value="BRWS">Index blättern
</select>
The attribute name="ACT" defines the element’s name, which is used when transmitting the data from the form via the query. The <option> tags define the selectable options, i.e. the drop-down menu. The value="" attribute holds the value transmitted by the form, while the user is shown the text that follows the tag. The default selection is either the first value in the list or – as in this case – the option that explicitly carries the attribute selected.
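If we would rather check the available options programmatically than read the source code, something along these lines should work; the selector assumes the drop-down’s name attribute, as shown above:

# extract the value attributes of all options of the "ACT" drop-down
"https://opac.ub.uni-potsdam.de/" %>%
  read_html() %>%
  html_elements(css = "select[name='ACT'] option") %>%
  html_attr("value")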
The three other elements are <input> tags: input fields whose specific type is specified via the attribute type="". These can be, for example, text boxes (type="text") or checkboxes (type="checkbox"), but there are many more options available. A comprehensive list can be found at: https://www.w3schools.com/html/html_form_input_types.asp.
Here is the source code for two of the three <input> elements on the example page:
<input type="text" name="TRM" value="" size="50">
...
<input type="submit" class="button" value=" Suchen ">
The first tag is of the type “text”, i.e. a text field – in this case the text field into which the search term is entered. In addition to the name of the element, a default value of the field is specified via value="". In this case, the default value is an empty field. The second tag is of the type “submit”. This is the “Search” button, which triggers the transmission of the form data via the query when clicked.
6.2.3 The query
But what exactly is being transmitted? Let’s look again at the example query from above:
CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test
The value of the action="" attribute forms the first part of the query and is appended after the base URL. The value of the attribute tells the server what to do with the other transmitted data. This is followed by a ?, which introduces the data to be transmitted as several pairs of the name="" and value="" attributes of the individual elements. The pairs are connected with &.
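To make this structure tangible, we can take the query apart at the & characters – a small sketch, assuming a current stringr version that provides str_split_1():

query <- "ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test"
# split the query into its name-value pairs
str_split_1(query, "&")

The result is a character vector containing the four name-value pairs.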
ACT=SRCHA thus stands for the fact that the (default) value “SRCHA” has been selected in the element with the name “ACT”. What the values of the two other <select> elements “IKT” and “SRT” stand for, you can work out yourself with a look into the source code or the WDTs. They are not important for our endeavour.
The text entered in the field is transmitted as the value of the <input type="text"> tag with the name “TRM”. Here the value was “test”.
The server receives the form data in this way, can then decide on the basis of the action="" attribute – here “CMD” – how the data is to be processed, and constructs the website accordingly, which it sends back to us and which is displayed in our browser.
6.2.4 Manipulating the query and scraping the result
Now that we know what the components of the query mean, we can manipulate them. Instead of writing queries by hand, we should use R to combine them for us. We will also encounter the technique of manipulating URLs directly in the R code more often. So we should learn it early.
The function str_c() from stringr (core tidyverse) combines the strings listed as arguments into a single string. Strings stored in other R objects can also be included. If our goal is to manipulate both the search method and the search term, we could achieve this as follows:
base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "test"
url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)
url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test"
If we now change the strings stored in the method and term objects and generate the complete URL again, these components of the query are manipulated accordingly.
method <- "SRCH"
term <- "web+scraping"
url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)
url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCH&IKT=1016&SRT=YOP&TRM=web+scraping"
The search method was set to the value “SRCH”, i.e. an “OR” search, the search term to “web scraping”. It is important to note that no spaces may appear in the query and that these are replaced by “+” when the form is submitted. So instead of “web scraping” we have to use the string “web+scraping”.
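If we do not want to do this replacement by hand, stringr can take care of it – a small sketch:

# replace every space in the search term with a "+"
term <- str_replace_all("web scraping", " ", "+")
term

## [1] "web+scraping"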
As an example application, we can now have the server perform an “AND” search for the term “web scraping”, read out the HTML page generated by the server and extract the 10 titles displayed.
base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "web+scraping"
url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)
url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=web+scraping"
website <- url %>%
read_html()
The search results are displayed as tables in the generated HTML file. The <table> tag has the attribute-value combination summary="hitlist", which we can use for our CSS selector:
hits <- website %>%
html_element(css = "table[summary='hitlist']") %>%
html_table() %>%
as_tibble()
hits %>% head(n=10)
## # A tibble: 10 × 4
## X1 X2 X3 X4
## <lgl> <dbl> <chr> <lgl>
## 1 NA NA "" NA
## 2 NA 1 "Estimating House Prices in Emerging Markets and Developin… NA
## 3 NA NA "" NA
## 4 NA 2 "Elgar encyclopedia of technology and politics/ Ceron, And… NA
## 5 NA NA "" NA
## 6 NA 3 "The Beginner's Guide to Data Science/ Ball, Robert. - 1st… NA
## 7 NA NA "" NA
## 8 NA 4 "From #Hashtags to Legislation : Engagement and Support fo… NA
## 9 NA NA "" NA
## 10 NA 5 "Disaggregating China, Inc : State Strategies in the Liber… NA
This worked, but we see that the table consists mainly of empty rows and cells.
These are invisible on the website, but are used to format the display.
Instead of repairing the table afterwards, it makes more sense to extract only the cells that contain the information we are looking for. These are the <td> tags with class="hit" and the attribute-value combination align="left".
On this basis, we can construct a unique CSS selector.
hits <- website %>%
html_elements(css = "td.hit[align='left']") %>%
html_text(trim = TRUE)
hits %>% head(n = 5)
## [1] "Estimating House Prices in Emerging Markets and Developing Economies : A Big Data Approach/ Behr, Daniela M.. - Washington, D.C : The World Bank, 2023"
## [2] "Elgar encyclopedia of technology and politics/ Ceron, Andrea. - Cheltenham, UK : Edward Elgar Publishing, 2022"
## [3] "The Beginner's Guide to Data Science/ Ball, Robert. - 1st ed. 2022. - Cham : Springer International Publishing, 2022"
## [4] "From #Hashtags to Legislation : Engagement and Support for Economic Reforms in the Gulf Cooperation Council Countries/ Arezki, Rabah. - Washington, D.C : The World Bank, 2022"
## [5] "Disaggregating China, Inc : State Strategies in the Liberal Economic Order/ Tan, Yeling. - [Online-Ausgabe]. - Ithaca, NY : Cornell University Press, [2022]"
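As an aside, newer versions of rvest can also fill in and submit such forms for us, so that we do not have to construct the query by hand. A sketch, assuming rvest’s session interface with session(), html_form_set() and session_submit():

# start a session on the OPAC start page and extract the first form
sess <- session("https://opac.ub.uni-potsdam.de/")
form <- html_form(sess)[[1]]
# fill in the search term and submit the form
form_filled <- html_form_set(form, TRM = "web scraping")
results <- session_submit(sess, form_filled)

The object returned by session_submit() can then be handed to the familiar extraction functions, just like a parsed page.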
6.2.5 Additional resources
In order to process this information further and, for example, separate it into data on author, title, year, etc., advanced knowledge in dealing with strings is necessary, which unfortunately goes beyond the scope of this introduction. A good first overview can be found in the chapter “Strings” from “R for Data Science” by Wickham and Grolemund: https://r4ds.had.co.nz/strings.html
The appropriate “cheat sheet” is also recommended: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf