4 First scraping with rvest
With the knowledge of how an HTML file is constructed and how R and RStudio work in basic terms, we are equipped with the necessary tools to take our first steps in web scraping. In this session we will learn how to use the R package rvest to read HTML source code into RStudio, extract targeted content we are interested in, and transfer the collected data into an R object for further analysis in the future.
4.1 The rvest package
Part of the tidyverse is a package called rvest, which provides us with all
the basic functions for a variety of typical web scraping tasks. This package
was included in the installation of the tidyverse package, but it is not part of
the core tidyverse and thus is not loaded into the current R session with
library(tidyverse). Therefore, we have to load it explicitly:
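The code for this step does not appear above. Since the text says that rvest has to be loaded explicitly, it is presumably the following call (a minimal sketch, assuming the package is already installed):

```r
# Load rvest explicitly; it is installed together with the tidyverse,
# but library(tidyverse) does not attach it
library(rvest)
```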
4.2 hello_world.html
As a first exercise, it is a good idea to scrape the Hello World example already described in chapter 3. As a reminder, here is the HTML source code:
<!DOCTYPE html>
<html>
<head>
<title>Hello World!</title>
</head>
<body>
<b>Hello World!</b>
</body>
</html>
4.2.1 read_html()
The first step in web scraping is to convert the page we are interested in into
an R object. This is made possible by the function read_html() from the rvest
package. read_html() “parses” the website, i.e. it reads the HTML, understands
its source code and transforms it into a representation R can understand. This
function needs the URL, i.e. the address of the website we want to read in, as
its first argument. The URL must be given as a string, so we have to enclose it
in quotation marks ("). The function also allows you to specify other options.
In most cases, however, the default settings are sufficient. So we read in the
hello_world.html file, assign it to a new R object at the same time, and print
this object in the next step:
hello_world <- read_html("https://jakobtures.github.io/web-scraping/hello_world.html")
hello_world
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <b>Hello World!</b>\n </body>
As we can see in the output, the R object hello_world is displayed with two
entries. The first entry contains everything enclosed by the <head> tag, the
second entry everything enclosed by the <body> tag. The <html> tag itself is
shown above the entries as the root of the document rather than as an entry of
its own. Remembering that HTML code is hierarchically structured, the entries
are thus organised based on the highest levels below <html>: <head> and <body>.
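To check this structure without relying on the printed output, we can ask for the names of the elements directly below the root. The following is a minimal sketch that parses the Hello World source code from a string instead of the URL, so it also runs offline; the object name hello_html is only chosen for this illustration.

```r
library(rvest)

# Hello World source code from chapter 3, stored as a string (illustrative name)
hello_html <- '<!DOCTYPE html>
<html>
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <b>Hello World!</b>
  </body>
</html>'

# read_html() also accepts literal HTML instead of a URL
hello_world <- read_html(hello_html)

# Names of the elements one level below <html>
child_names <- html_name(html_children(hello_world))
child_names
```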
We have thus successfully created a representation of the website in an R object. But what do we do with it now? In the case of this simple example, we might be interested in extracting the title of the website or the text displayed on the page.
4.2.2 html_elements()
The function html_elements() from rvest allows us to extract individual
elements of the HTML code. To do this, it needs the object to extract from as
its first argument, as well as a selector. In this introduction, we will
concentrate exclusively on the so-called CSS selectors. The alternative, XPath,
is a bit more flexible, but CSS selectors are sufficient in most cases and have
a shorter and more intuitive syntax, which makes them the tool of choice here.
We will discuss the possibilities offered by CSS selectors in more detail in
chapter 5.1 and will limit ourselves to the basics for now. A selector of the
form "tag" selects all HTML tags of the specified name. If we want to extract
the <title> tag, we can do so in this way:
element_title <- html_elements(hello_world, css = "title")
element_title
## {xml_nodeset (1)}
## [1] <title>Hello World!</title>
If we want to extract the text Hello World! shown on the website, one
possibility would be to select the complete <body> tag, since in this case no
other text is displayed on the page.
element_body <- html_elements(hello_world, css = "body")
element_body
## {xml_nodeset (1)}
## [1] <body>\n <b>Hello World!</b>\n </body>
This works in principle, but we also extracted the <b> tags as well as
multiple line breaks (\n), neither of which we need.
It would be more efficient to directly select the <b> tag enclosing the text.
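Selecting the <b> tag directly could look like the following sketch. It parses a shortened copy of the Hello World source code from a string so that it runs offline; the object name element_b is an assumption made for this illustration.

```r
library(rvest)

# Shortened Hello World page, parsed from a string instead of the URL
hello_world <- read_html(
  "<html><head><title>Hello World!</title></head>
   <body>
     <b>Hello World!</b>
   </body></html>"
)

# Select only the <b> element that encloses the displayed text
element_b <- html_elements(hello_world, css = "b")
element_b
```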
4.2.3 html_text()
In this case, we are interested in the text in the title and on the website,
i.e. the content of the tags. We can extract this from the selected HTML
elements in an additional step. This is made possible by the rvest function
html_text(), which requires the previously extracted HTML element as its only
mandatory argument.
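The code for this step does not appear above. A minimal sketch of what it might look like, again using a shortened copy of the page parsed from a string; the object names text_title and text_b are assumptions.

```r
library(rvest)

# Shortened Hello World page, parsed from a string instead of the URL
hello_world <- read_html(
  "<html><head><title>Hello World!</title></head>
   <body><b>Hello World!</b></body></html>"
)

# Step 1: select the elements
element_title <- html_elements(hello_world, css = "title")
element_b <- html_elements(hello_world, css = "b")

# Step 2: extract the text they enclose
text_title <- html_text(element_title)
text_b <- html_text(element_b)

text_title
text_b
```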
With this, we have successfully completed our first web scraping goal, the extraction of the title and the text displayed on the page.
One more thing about the application of html_text() to elements that
themselves contain further tags: further above we extracted the object
element_body, which contains the <b> tags as well as several line breaks in
addition to the displayed text. Here, too, we can extract the pure text.
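The corresponding call is not shown above; presumably html_text() is simply applied to element_body. A sketch using an offline copy of the page, where the exact whitespace in the result depends on how the source is indented:

```r
library(rvest)

# Hello World body with line breaks and indentation around the <b> tag
hello_world <- read_html(
  "<html><head><title>Hello World!</title></head>
   <body>
     <b>Hello World!</b>
   </body></html>"
)

element_body <- html_elements(hello_world, css = "body")

# Text of <body> including the text of nested tags; the tags themselves
# are dropped, surrounding whitespace is kept as parsed
text_body <- html_text(element_body)
text_body
```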
We see that the function has conveniently removed the <b> tags we were not
interested in. However, the line breaks and several spaces, so-called
whitespace, remain. Both can be removed with the additional argument
trim = TRUE.
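Applied to the same element, the call with trim = TRUE might look like this sketch (offline copy of the page; text_body_trimmed is an illustrative name):

```r
library(rvest)

hello_world <- read_html(
  "<html><head><title>Hello World!</title></head>
   <body>
     <b>Hello World!</b>
   </body></html>"
)

element_body <- html_elements(hello_world, css = "body")

# trim = TRUE removes leading and trailing whitespace from the result
text_body_trimmed <- html_text(element_body, trim = TRUE)
text_body_trimmed
```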
4.3 Countries of the World
Let us now look at a somewhat more realistic application. The website https://scrapethissite.com/pages/simple/ lists the names of 250 countries, as well as their flag, capital, population and size in square kilometres. Our goal could be to read this information into R for each country so that we can potentially analyse it further.
Before we start, we should load the required packages (we will also need the
tidyverse package this time), read the website with the function read_html()
and assign it to an R object.
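The code for this step does not appear above. Judging from the object name website used in the following examples and the URL in the complete example at the end of this chapter, it presumably looks like this sketch. It needs an internet connection and an installed tidyverse.

```r
library(tidyverse)
library(rvest)

# Download and parse the page once; the object is reused in all later steps
website <- read_html("https://scrapethissite.com/pages/simple/")
```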
To understand the structure of the HTML file, the first step is to look at the source code. As always, we can open it by right-clicking in the browser window and then clicking on “View Page Source”. The first 100 or so lines of HTML code mainly contain information on the design of the website, which need not concern us at this point. We are purely interested in the data of the countries. The first country listed on the website is Andorra. It therefore makes sense to search the source code specifically for “Andorra”. The key combination CTRL+F opens the search field in your browser. We find what we are looking for in line 128. Since this source code, designed for practice purposes, is formatted in a very structured way, we quickly realise that lines 125-135 form the code block related to Andorra. Let’s look at these more closely:
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
<strong>Population:</strong> <span class="country-population">84000</span><br>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
</div>
</div><!--.col-->
All the information about Andorra is enclosed in a <div> tag. As a reminder, a
<div> defines a grouping of code across multiple lines. In web design
practice, these groupings are mainly used to assign a certain CSS style to the
enclosed code via the attribute class=, for example to define the typeface.
From a web scraping perspective, we generally don’t care how the styles are
defined. We just need to know that we can exploit these CSS class assignments
for our purposes. At the next level down, we find two blocks, one containing,
among other things, the name of the country and another containing information
about that country. Let’s begin by examining the first block.
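The two blocks can also be made visible in R. The following sketch parses the Andorra excerpt from a string, so it runs offline, and lists the elements directly below the outer <div>. The selector "div.country" already uses the class syntax that is only introduced later in this chapter, and the object names are chosen for this illustration.

```r
library(rvest)

# Andorra block copied from the page source, parsed from a string
andorra <- read_html('
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ad"></i>
    Andorra
  </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
    <strong>Population:</strong> <span class="country-population">84000</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
  </div>
</div>')

# Elements one level below the outer <div>
blocks <- andorra %>%
  html_elements(css = "div.country") %>%
  html_children()

block_names <- html_name(blocks)            # tag names of the two blocks
block_classes <- html_attr(blocks, "class") # their CSS classes

block_names
block_classes
```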
4.3.1 Country names
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>
The name “Andorra” is enclosed in an <h3> tag, i.e. a third-level heading. In
addition to the name, the <h3> tag also contains an <i> tag that displays the
image of the flag. Since we are not interested in the graphics here, we can
ignore this.
On this website, all <h3> tags are used exclusively to display the names of
the countries. Thus, we can use the <h3> tag as a CSS selector to read out the
enclosed text analogously to the first example.
element_country <- html_elements(website, css = "h3")
text_country <- html_text(element_country, trim = TRUE)
head(text_country, n = 10)
## [1] "Andorra" "United Arab Emirates" "Afghanistan"
## [4] "Antigua and Barbuda" "Anguilla" "Albania"
## [7] "Armenia" "Angola" "Antarctica"
## [10] "Argentina"
The result looks promising. Since the structure of the code block is the same
for each country, the vector text_country created in this way has 250 entries,
exactly the number of countries listed on the website. For reasons of clarity,
it often makes sense not to print complete and often very long vectors, data
frames or tibbles, but to use the function head() to list only the number of
entries specified by the argument n, starting with the first.
4.3.1.1 The pipe %>%
At this point, we should think again about the readability and structure of our R code. Let us consider the preceding code block:
element_country <- html_elements(website, css = "h3")
text_country <- html_text(element_country, trim = TRUE)
As we have seen, this achieves our goal. However, we have also created the
element_country object to temporarily save the result of the first step,
reading the <h3> tags. We will never need this object again. If we use the pipe
%>% from the tidyverse instead, the need to store partial results is eliminated
and we write code that is more intuitive and easier to understand at the same
time.
country <- website %>%
  html_elements(css = "h3") %>%
  html_text(trim = TRUE)
head(country, n = 10)
## [1] "Andorra" "United Arab Emirates" "Afghanistan"
## [4] "Antigua and Barbuda" "Anguilla" "Albania"
## [7] "Armenia" "Angola" "Antarctica"
## [10] "Argentina"
The pipe passes the result of one step along to the next function. In the
tidyverse, as well as in many other R functions (but not all!), the data is
taken as the first argument, so we do not have to specify it explicitly. For a
better understanding, let’s look at the above example in detail. The first line
passes the object website along to the function html_elements(). So we don’t
have to tell html_elements() which object to apply it to, because we already
passed it along to the function with the pipe. The function is applied to the
object website with all other defined arguments, here css, and the result is
passed along again to the next line, where the html_text() function is applied
to it. Here the pipe ends, and the final result is assigned to the object
country.
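That the pipe only changes how the code is written, not what it does, can be checked with a small sketch. The two-heading page below is a hypothetical stand-in for the real website so that the example runs offline; rvest makes %>% available on its own.

```r
library(rvest)

# Hypothetical miniature page with two country headings
page <- read_html("<h3> Andorra </h3><h3> Angola </h3>")

# Nested calls: read from the inside out
nested <- html_text(html_elements(page, css = "h3"), trim = TRUE)

# Pipe: read from top to bottom, page becomes the first argument
piped <- page %>%
  html_elements(css = "h3") %>%
  html_text(trim = TRUE)

identical(nested, piped)
```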
We now need three instead of two lines to get the same result, but the actual typing work has been reduced, especially if you create the pipe with the key combination CTRL+Shift+M, and we have created code that can be read and understood more intuitively with a little practice. We also avoid cluttering our environment with unneeded objects.
So should we always connect all steps with the pipe? No. In many cases it makes sense to save intermediate results in an object, namely whenever we will access it multiple times. In our example, we could also integrate the import of the website into the pipe:
country <- read_html("https://scrapethissite.com/pages/simple/") %>%
  html_elements(css = "h3") %>%
  html_text(trim = TRUE)
Overall, this saves us even more typing. However, since we still have to access
the selected website multiple times later on, this would also mean that the
parsing process has to be repeated each time. On the one hand, this can have a
noticeable impact on the computing time for larger amounts of data. On the other
hand, it also means accessing the website’s servers and downloading the data
again each time. As part of good web scraping practice, however, we should
avoid generating data traffic without good reason (see chapter 9).
So it makes perfect sense to save the result of the read_html() function in an
R object so that it can be reused multiple times.
We will see the pipe in action many more times over the course of this seminar.
4.3.2 Capitals, population and area
Let us now turn to the further information for each country. These are located in the second block of the HTML code considered above:
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
<strong>Population:</strong> <span class="country-population">84000</span><br>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
</div>
As we can see, the name of the capital, the population of the country, and
its size in square kilometers are each enclosed by a <span> tag in lines 2–4.
Like <div>, <span> defines groupings, not across multiple lines but for one
line or, as here, part of a line. So let’s try to read the names of the
capitals, using the <span> tag as a selector.
website %>%
  html_elements(css = "span") %>%
  html_text() %>%
  head(n = 10)
## [1] "Andorra la Vella" "84000" "468.0" "Abu Dhabi"
## [5] "4975593" "82880.0" "Kabul" "29121286"
## [9] "647500.0" "St. John's"
So we get the names of the capitals, but also the population and the size of the
country. span was too unspecific as a selector. Since all three types of
country data are enclosed by <span> tags, all three are selected. So we have to
tell html_elements() more precisely which <span> we are interested in.
This is where the CSS classes we mentioned earlier come into play. These differ
between the three types of country information. For example, the <span> that
encloses the name of the capital city is assigned the class "country-capital".
We can target this class with our CSS selector. To select a class, we use the
syntax .class-name. So, to select all <span> tags that have the class
"country-capital", we can do as follows:
capital <- website %>%
  html_elements(css = "span.country-capital") %>%
  html_text()
head(capital, n = 10)
## [1] "Andorra la Vella" "Abu Dhabi" "Kabul" "St. John's"
## [5] "The Valley" "Tirana" "Yerevan" "Luanda"
## [9] "None" "Buenos Aires"
We can repeat this analogously for the number of inhabitants with the class
"country-population".
population <- website %>%
  html_elements(css = "span.country-population") %>%
  html_text()
head(population, n = 10)
## [1] "84000" "4975593" "29121286" "86754" "13254" "2986952"
## [7] "2968000" "13068161" "0" "41343201"
If we take a closer look at the vector created in this way, we see that it is a
character vector. For inspection we can use the function str(), which gives us
the structure of an R object, including the data type used.
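The call to str() is not shown above. A sketch of the inspection, using the first five population values from the output above as a stand-in for the full vector so that it runs without the website:

```r
# First five scraped values, typed in by hand as a stand-in for the
# full vector of 250 entries
population <- c("84000", "4975593", "29121286", "86754", "13254")

# str() reports the data type (chr for character) and the length
str(population)

# class() returns only the data type
class(population)
```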
So the numbers were not read out as numbers but as strings. Among other things,
this does not allow for calculation with the numbers. Reminder: population[1]
selects the first element of the vector.
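The failing calculation is not shown above. The following sketch reproduces the problem with a short stand-in vector; tryCatch() is only used here so that the error message can be captured and printed without stopping the script.

```r
# Stand-in for the scraped character vector
population <- c("84000", "4975593", "29121286")

# Dividing a string by a number raises an error; tryCatch() captures
# the message instead of aborting
outcome <- tryCatch(
  population[1] / 2,
  error = function(e) conditionMessage(e)
)
outcome
```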
As you remember, we can tell R to interpret the “text” read from the HTML
code as numbers using the function as.numeric().
population <- website %>%
  html_elements(css = "span.country-population") %>%
  html_text() %>%
  as.numeric()
str(population)
## num [1:250] 84000 4975593 29121286 86754 13254 ...
population[1] / 2
## [1] 42000
In the same way, the size in square kilometers can be read with the class
"country-area".
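The code for the area is not shown here; it presumably mirrors the population example with the class swapped. The sketch below runs on a small offline excerpt with two countries, stored under the illustrative name website_excerpt. With the full page, the same pipe would start from website.

```r
library(rvest)

# Excerpt with the data of the first two countries, parsed from a string
website_excerpt <- read_html('
<div class="country-info">
  <span class="country-capital">Andorra la Vella</span>
  <span class="country-population">84000</span>
  <span class="country-area">468.0</span>
</div>
<div class="country-info">
  <span class="country-capital">Abu Dhabi</span>
  <span class="country-population">4975593</span>
  <span class="country-area">82880.0</span>
</div>')

# Select the area spans, extract their text and convert it to numbers
area <- website_excerpt %>%
  html_elements(css = "span.country-area") %>%
  html_text() %>%
  as.numeric()
area
```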
4.3.3 Merge into one tibble
We have now created four vectors, which contain the name of the country, the associated capital, the population and the size of the country, respectively. For Andorra:
country[1]
## [1] "Andorra"
capital[1]
## [1] "Andorra la Vella"
population[1]
## [1] 84000
area[1]
## [1] 468
We could already continue working with this, but for many applications it is more practical if we combine the data in tabular form. In the tidyverse, the form of the tibble is suitable for this purpose.
countries <- tibble(
  Land = country,
  Hauptstadt = capital,
  Bevoelkerung = population,
  Flaeche = area
)
countries
## # A tibble: 250 × 4
##    Land                 Hauptstadt       Bevoelkerung  Flaeche
##    <chr>                <chr>                   <dbl>    <dbl>
##  1 Andorra              Andorra la Vella        84000      468
##  2 United Arab Emirates Abu Dhabi             4975593    82880
##  3 Afghanistan          Kabul                29121286   647500
##  4 Antigua and Barbuda  St. John's              86754      443
##  5 Anguilla             The Valley              13254      102
##  6 Albania              Tirana                2986952    28748
##  7 Armenia              Yerevan               2968000    29800
##  8 Angola               Luanda               13068161  1246700
##  9 Antarctica           None                        0 14000000
## 10 Argentina            Buenos Aires         41343201  2766890
## # ℹ 240 more rows
This is not only more readable but also facilitates all further potential analysis steps.
If we are sure that we do not need the individual vectors, we can also perform the reading of the data and the creation of the tibble in a single step. Below you can see how the complete scraping process can be completed in relatively few lines.
website <- "https://scrapethissite.com/pages/simple/" %>%
  read_html()
countries_2 <- tibble(
  Land = website %>%
    html_elements(css = "h3") %>%
    html_text(trim = TRUE),
  Hauptstadt = website %>%
    html_elements(css = "span.country-capital") %>%
    html_text(),
  Bevoelkerung = website %>%
    html_elements(css = "span.country-population") %>%
    html_text() %>%
    as.numeric(),
  Flaeche = website %>%
    html_elements(css = "span.country-area") %>%
    html_text() %>%
    as.numeric()
)
countries_2
## # A tibble: 250 × 4
##    Land                 Hauptstadt       Bevoelkerung  Flaeche
##    <chr>                <chr>                   <dbl>    <dbl>
##  1 Andorra              Andorra la Vella        84000      468
##  2 United Arab Emirates Abu Dhabi             4975593    82880
##  3 Afghanistan          Kabul                29121286   647500
##  4 Antigua and Barbuda  St. John's              86754      443
##  5 Anguilla             The Valley              13254      102
##  6 Albania              Tirana                2986952    28748
##  7 Armenia              Yerevan               2968000    29800
##  8 Angola               Luanda               13068161  1246700
##  9 Antarctica           None                        0 14000000
## 10 Argentina            Buenos Aires         41343201  2766890
## # ℹ 240 more rows