4 First scraping with rvest

With the knowledge of how an HTML file is constructed and how R and RStudio work in basic terms, we are equipped with the necessary tools to take our first steps in web scraping. In this session we will learn how to use the R package rvest to read HTML source code into RStudio, extract targeted content we are interested in, and transfer the collected data into an R object for further analysis in the future.

4.1 The rvest package

Part of the tidyverse is a package called rvest, which provides us with all the basic functions for a variety of typical web scraping tasks. This package was included in the installation of the tidyverse package, but it is not part of the core tidyverse and thus is not loaded into the current R session with library(tidyverse). Therefore, we have to do this explicitly:

library(rvest)
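If you are unsure whether rvest is installed, or whether the installed version provides the functions used in this chapter, a quick check along the following lines may help. This is only a sketch: it assumes that html_elements() is available from rvest 1.0.0 onwards (older versions used html_nodes() instead), and the object names are purely illustrative.

```r
# Check whether rvest is installed, without attaching it
rvest_installed <- requireNamespace("rvest", quietly = TRUE)

if (rvest_installed) {
  # Assumption: html_elements() requires rvest 1.0.0 or newer
  rvest_is_recent <- packageVersion("rvest") >= "1.0.0"
} else {
  rvest_is_recent <- FALSE
  message("rvest is not installed; install.packages(\"tidyverse\") would include it.")
}

rvest_is_recent
```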

4.2 hello_world.html

As a first exercise, it is a good idea to scrape the Hello World example already described in chapter 3. As a reminder, here is the HTML source code:

<!DOCTYPE html>

<html>
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <b>Hello World!</b>
  </body>
</html>

4.2.1 read_html()

The first step in web scraping is to convert the page we are interested in into an R object. This is made possible by the function read_html() from the rvest package. read_html() “parses” the website, i.e. it reads the HTML source code and transforms it into a representation R can work with. The function needs the URL, i.e. the address of the website we want to read in, as its first argument. The URL must be given as a string, so we have to enclose it in quotation marks. The function also accepts further options, but in most cases the default settings are sufficient. So we read in the hello_world.html file, assign it to a new R object and print this object in the next step:

hello_world <- read_html("https://jakobtures.github.io/web-scraping/hello_world.html")
hello_world
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <b>Hello World!</b>\n  </body>

As we can see in the output, the R object hello_world is an HTML document whose root node <html> has two children, which are printed like the entries of a list. The first entry contains everything enclosed by the <head> tag, the second everything enclosed by the <body> tag. Remembering that HTML code is hierarchically structured, the output is thus organised by the highest levels below <html>: <head> and <body>.
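The hierarchy can also be inspected directly. The following is a minimal sketch, assuming that we pass the Hello World source code to read_html() as a string instead of a URL, so that no internet connection is needed. The object names hello_html, hello_local and top_level are illustrative only.

```r
library(rvest)

# The Hello World page as a string instead of a URL
hello_html <- "<!DOCTYPE html>
<html>
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <b>Hello World!</b>
  </body>
</html>"

hello_local <- read_html(hello_html)

# Direct children of the root <html> node
top_level <- html_children(hello_local)

# Tag names of these children
html_name(top_level)
## [1] "head" "body"
```

Parsing from a string is also handy for trying out selectors on small snippets before applying them to a real website.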

We have thus successfully created a representation of the website in an R object. But what do we do with it now? In the case of this simple example, we might be interested in extracting the title of the website or the text displayed on the page.

4.2.2 html_elements()

The function html_elements() from rvest allows us to extract individual elements of the HTML code. It needs the object to extract from as its first argument, followed by a selector. In this introduction, we will concentrate exclusively on the so-called CSS selectors. The alternative, XPath, is a bit more flexible, but CSS selectors are sufficient in most cases and have a shorter and more intuitive syntax, which makes them the tool of choice here.

We will discuss the possibilities offered by CSS selectors in more detail in chapter 5.1 and will limit ourselves to the basics for now. A selector of the form "tag" selects all HTML tags of the specified name. If we want to extract the <title> tag, we can do so in this way:

element_title <- html_elements(hello_world, css = "title")
element_title
## {xml_nodeset (1)}
## [1] <title>Hello World!</title>

If we want to extract the text Hello World! shown on the website, one possibility would be to select the complete <body> tag, since in this case no other text is displayed on the page.

element_body <- html_elements(hello_world, css = "body")
element_body
## {xml_nodeset (1)}
## [1] <body>\n    <b>Hello World!</b>\n  </body>

This works in principle, but we also extracted the <b> tags as well as several line breaks (\n), neither of which we need. It would be more precise to select the <b> tag enclosing the text directly.

element_b <- html_elements(hello_world, css = "b")
element_b
## {xml_nodeset (1)}
## [1] <b>Hello World!</b>
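In the Hello World example every selector matches exactly one element. The following sketch, based on a made-up HTML snippet parsed from a string, illustrates that html_elements() returns all matching elements, and an empty node set if nothing matches. The object names are hypothetical.

```r
library(rvest)

# Made-up snippet with two <b> tags and no <i> tag
snippet <- read_html("<p>Some <b>bold</b> and some <b>very bold</b> text.</p>")

# Two <b> tags are present, so the node set has two entries
bold_elements <- html_elements(snippet, css = "b")
length(bold_elements)
## [1] 2

# No <i> tag is present, so the node set is empty
italic_elements <- html_elements(snippet, css = "i")
length(italic_elements)
## [1] 0
```

Checking the length of the result is therefore a simple way to see whether a selector found what we expected.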

4.2.3 html_text()

In this case, we are interested in the text in the title and on the website, i.e. the content of the tags. We can extract this from the selected HTML elements in an additional step. This is made possible by the rvest function html_text(). This requires the previously extracted HTML element as the only argument.

html_text(element_title)
## [1] "Hello World!"
html_text(element_b)
## [1] "Hello World!"

With this, we have successfully completed our first web scraping goal, the extraction of the title and the text displayed on the page.

One more thing about the application of html_text() to elements that themselves contain further tags: Further above we extracted the object element_body, which contains the <b> tags as well as several line breaks in addition to the displayed text. Here, too, we can extract the pure text.

html_text(element_body)
## [1] "\n    Hello World!\n  "

We see that the function has conveniently removed the <b> tags, which we were not interested in. However, the line breaks and several spaces, so-called whitespace, remain. Leading and trailing whitespace can be removed with the additional argument trim = TRUE.

html_text(element_body, trim = TRUE)
## [1] "Hello World!"
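Note that trim = TRUE only affects whitespace at the beginning and end of the text. The following sketch, using a made-up snippet, shows one possible way to also collapse whitespace inside the text with the base R function gsub(). The snippet and object names are assumptions for illustration.

```r
library(rvest)

# Made-up snippet with whitespace at both ends and inside the text
messy <- read_html("<p>  Hello \n   World!  </p>")
paragraph <- html_elements(messy, css = "p")

# trim = TRUE strips whitespace at both ends only
trimmed <- html_text(paragraph, trim = TRUE)

# Collapse every remaining run of whitespace into a single space
collapsed <- gsub("\\s+", " ", trimmed)
collapsed
## [1] "Hello World!"
```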

4.3 Countries of the World

Let us now look at a somewhat more realistic application. The website https://scrapethissite.com/pages/simple/ lists the names of 250 countries, as well as their flag, capital, population and size in square kilometres. Our goal could be to read this information into R for each country so that we can potentially analyse it further.

Before we start, we should load the required packages (we will also need the tidyverse package this time) and read the website with the function read_html() and assign it to an R object.

library(tidyverse)
library(rvest)

website <- read_html("https://scrapethissite.com/pages/simple/")

To understand the structure of the HTML file, the first step is to look at the source code. As always, we can open it by right-clicking in the browser window and then clicking on “View Page Source”. The first 100 or so lines of HTML code mainly contain information on the design of the website, which need not concern us at this point. We are purely interested in the data on the countries. The first country listed on the website is Andorra. It therefore makes sense to search the source code specifically for “Andorra”. The key combination CTRL+F opens the search bar in your browser. We find what we are looking for in line 128. Since this source code, designed for practice purposes, is formatted in a very structured way, we quickly realise that lines 125-135 form the code block for Andorra. Let’s look at it more closely:

<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ad"></i>
    Andorra
  </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
    <strong>Population:</strong> <span class="country-population">84000</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
  </div>
</div><!--.col-->

All the information about Andorra is enclosed in a <div> tag. As a reminder, a <div> groups code across multiple lines. In web design practice, these groupings are mainly used to assign a certain CSS style to the enclosed code via the attribute class=, for example to define the typeface. From a web scraping perspective, we generally don’t care how the styles are defined. We just need to know that we can exploit these class assignments for our purposes. At the next level down, we find two blocks, one containing, among other things, the name of the country and another containing information about that country. Let’s begin by examining the first block.

4.3.1 Country names

<h3 class="country-name">
  <i class="flag-icon flag-icon-ad"></i>
  Andorra
</h3>

The name “Andorra” is enclosed in an <h3> tag, i.e. a third-level heading. In addition to the name, the <h3> tag also contains an <i> tag, which displays the image of the flag. Since we are not interested in the graphics here, we can ignore it.

On this website, all <h3> tags are used exclusively to display the names of the countries. Thus, we can use the <h3> tag as a CSS selector to read out the enclosed text analogous to the first example.

element_country <- html_elements(website, css = "h3")
text_country <- html_text(element_country, trim = TRUE)

head(text_country, n = 10)
##  [1] "Andorra"              "United Arab Emirates" "Afghanistan"         
##  [4] "Antigua and Barbuda"  "Anguilla"             "Albania"             
##  [7] "Armenia"              "Angola"               "Antarctica"          
## [10] "Argentina"

The result looks promising. Since the structure of the code block is the same for each country, the vector text_country created in this way has 250 entries, exactly the number of countries listed on the website. For reasons of clarity, it often makes sense not to print complete and often very long vectors, data frames or tibbles, but to use the function head() to list only the first entries, with the number specified by the argument n.
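Selecting by tag name alone works here because the <h3> tag is used for nothing else. On a page where this is not the case, the class of the tag could be added to the selector, using the class syntax explained in section 4.3.2. The following is a sketch on a made-up snippet, in which the extra <h1> and the unrelated <h3> are assumptions that do not appear on the real website.

```r
library(rvest)

# Made-up snippet: two country headings plus one unrelated <h3>
two_countries <- read_html('
  <h1>Countries of the World</h1>
  <h3 class="country-name"><i class="flag-icon flag-icon-ad"></i> Andorra</h3>
  <h3 class="country-name"><i class="flag-icon flag-icon-ae"></i> United Arab Emirates</h3>
  <h3>Some unrelated heading</h3>
')

# Tag selector: matches every <h3>, including the unrelated one
all_h3 <- html_text(html_elements(two_countries, css = "h3"), trim = TRUE)
length(all_h3)
## [1] 3

# Tag plus class: matches only <h3> tags with the class "country-name"
names_only <- html_text(
  html_elements(two_countries, css = "h3.country-name"),
  trim = TRUE
)
names_only
## [1] "Andorra"              "United Arab Emirates"
```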

4.3.1.1 The pipe %>%

At this point, we should think again about the readability and structure of our R code. Let us consider the preceding code block:

element_country <- html_elements(website, css = "h3")
text_country <- html_text(element_country, trim = TRUE)

As we have seen, this achieves our goal. However, we have also created the element_country object to temporarily save the result of the first step, reading the <h3> tags. We will never need this object again. If we use the pipe %>% from the tidyverse instead, there is no need to store intermediate results, and we write code that is more intuitive and easier to understand at the same time.

country <- website %>% 
  html_elements(css = "h3") %>% 
  html_text(trim = TRUE)

head(country, n = 10)
##  [1] "Andorra"              "United Arab Emirates" "Afghanistan"         
##  [4] "Antigua and Barbuda"  "Anguilla"             "Albania"             
##  [7] "Armenia"              "Angola"               "Antarctica"          
## [10] "Argentina"

The pipe passes the result of one step along to the next function. In the tidyverse, as well as in many (but not all!) other R functions, the data is taken as the first argument, which we then do not have to specify explicitly. For a better understanding, let’s look at the above example in detail. The first line passes the object website along to the function html_elements(). So we don’t have to tell html_elements() which object to apply it to, because we already passed it along with the pipe. The function is applied to the object website with all other specified arguments, here css, and the result is passed along again to the next line, where the html_text() function is applied to it. Here the pipe ends, and the final result is assigned to the object country.

We now need three lines instead of two to get the same result, but the actual typing work has been reduced, especially if you create the pipe with the key combination CTRL+Shift+M, and we have created code that can be read and understood more intuitively with a little practice. We also avoid cluttering our environment with unneeded objects.
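That the pipe only changes how the code is written, not what it does, can be checked on a small made-up snippet. The sketch below compares nested function calls, the pipe %>% and, assuming R 4.1.0 or newer, the native pipe |> that is built into R itself. The object names are illustrative.

```r
library(rvest)  # also makes the pipe %>% available

# Made-up snippet with two headings
page <- read_html("<h3> Andorra </h3><h3> Albania </h3>")

# Nested calls: have to be read from the inside out
nested <- html_text(html_elements(page, css = "h3"), trim = TRUE)

# Pipe %>%: can be read from top to bottom
piped <- page %>%
  html_elements(css = "h3") %>%
  html_text(trim = TRUE)

# Native pipe |>, assuming R 4.1.0 or newer
native <- page |>
  html_elements(css = "h3") |>
  html_text(trim = TRUE)

identical(nested, piped) && identical(piped, native)
## [1] TRUE
```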

So should we always connect all steps with the pipe? No. In many cases it makes sense to save intermediate results in an object, namely whenever we will access it multiple times. In our example, we could also integrate the import of the website into the pipe:

country <- read_html("https://scrapethissite.com/pages/simple/") %>% 
  html_elements(css = "h3") %>% 
  html_text(trim = TRUE)

Overall, this saves us even more typing. However, since we still have to access the selected website multiple times later on, the download and the parsing process would have to be repeated each time. On the one hand, this can have a noticeable impact on computing time for larger amounts of data. On the other hand, it also means accessing the website’s servers and downloading the data again each time. As part of good web scraping practice, we should avoid generating data traffic without good reason (see chapter 9). So it makes perfect sense to save the result of the read_html() function in an R object so that it can be reused multiple times.
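If the same page is needed again in a later R session, one option is to keep a local copy of the HTML file and parse that instead. The following is only a sketch under stated assumptions: the temporary file stands in for a file that was downloaded once, for example with download.file(), and all object names are hypothetical.

```r
library(rvest)

# Stand-in for a page that was downloaded once and stored on disk
local_copy <- tempfile(fileext = ".html")
writeLines("<html><body><h3>Andorra</h3></body></html>", local_copy)

# read_html() also accepts the path to a local file
cached_page <- read_html(local_copy)

# The parsed object can be queried repeatedly without new requests
first_query <- html_text(html_elements(cached_page, css = "h3"))
second_query <- html_text(html_elements(cached_page, css = "h3"))

identical(first_query, second_query)
## [1] TRUE
```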

We will see the pipe in action many more times over the course of this seminar.

4.3.2 Capitals, population and area

Let us now turn to the further information for each country. These are located in the second block of the HTML code considered above:

<div class="country-info">
  <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
  <strong>Population:</strong> <span class="country-population">84000</span><br>
  <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
</div>

As we can see, the name of the capital, the population of the country and its size in square kilometres are each enclosed by a <span> tag, in lines 2–4 respectively. Like <div>, <span> defines groupings, though not across multiple lines but within one line or, as here, part of a line. So let’s try to read the names of the capitals, using the <span> tag as a selector.

website %>% 
  html_elements(css = "span") %>% 
  html_text() %>% 
  head(n = 10)
##  [1] "Andorra la Vella" "84000"            "468.0"            "Abu Dhabi"       
##  [5] "4975593"          "82880.0"          "Kabul"            "29121286"        
##  [9] "647500.0"         "St. John's"

So we get the names of the capitals, but also the population and the size of each country. span was too unspecific as a selector. Since all three types of country data are enclosed in <span> tags, all three are selected. So we have to tell html_elements() more precisely which <span> we are interested in. This is where the CSS classes we mentioned earlier come into play. They differ between the three types of information. For example, the <span> that contains the name of the capital is assigned the class "country-capital". We can target this class with our CSS selector. To select a class, we use the syntax .class-name, and placing a tag name directly in front of it, as in tag.class-name, restricts the selection to tags of that name. So, to select all <span> tags that have the class "country-capital", we can proceed as follows:

capital <- website %>% 
  html_elements(css = "span.country-capital") %>% 
  html_text()

head(capital, n = 10)
##  [1] "Andorra la Vella" "Abu Dhabi"        "Kabul"            "St. John's"      
##  [5] "The Valley"       "Tirana"           "Yerevan"          "Luanda"          
##  [9] "None"             "Buenos Aires"
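The difference between the selector variants can be tried out on the Andorra block alone. The sketch below parses the block from a string, so no internet connection is needed. The object names are illustrative only.

```r
library(rvest)

# The country-info block for Andorra, parsed from a string
andorra <- read_html('
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
    <strong>Population:</strong> <span class="country-population">84000</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
  </div>
')

# Tag only: all three <span> tags are selected
by_tag <- html_text(html_elements(andorra, css = "span"))
length(by_tag)
## [1] 3

# Class only: every tag with this class, whatever its tag name
by_class <- html_text(html_elements(andorra, css = ".country-capital"))
by_class
## [1] "Andorra la Vella"

# Tag and class combined: only <span> tags with this class
by_tag_and_class <- html_text(html_elements(andorra, css = "span.country-capital"))
by_tag_and_class
## [1] "Andorra la Vella"
```

On this page the class alone would already be sufficient. Adding the tag name mainly guards against other tags that happen to share the same class.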

The same approach works for the number of inhabitants, using the class "country-population".

population <- website %>% 
  html_elements(css = "span.country-population") %>% 
  html_text()

head(population, n = 10)
##  [1] "84000"    "4975593"  "29121286" "86754"    "13254"    "2986952" 
##  [7] "2968000"  "13068161" "0"        "41343201"

If we take a closer look at the vector created in this way, we see that it is a character vector. For inspection we can use the function str(), which gives us the structure of an R object, including the data type used.

str(population)
##  chr [1:250] "84000" "4975593" "29121286" "86754" "13254" "2986952" ...

So the numbers were not read out as numbers but as strings. Among other things, this does not allow for calculation with the numbers. Reminder: population[1] selects the first element of the vector.

population[1] / 2
## Error in population[1]/2: non-numeric argument to binary operator

As you remember, we can tell R to interpret the “text” read from the HTML code as numbers using the function as.numeric().

population <- website %>% 
  html_elements(css = "span.country-population") %>% 
  html_text() %>% 
  as.numeric()

str(population)
##  num [1:250] 84000 4975593 29121286 86754 13254 ...

population[1] / 2
## [1] 42000

In the same way, the size in square kilometres can be read with the class "country-area".

area <- website %>% 
  html_elements(css = "span.country-area") %>% 
  html_text() %>% 
  as.numeric()

str(area)
##  num [1:250] 468 82880 647500 443 102 ...
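The conversion works smoothly here because the website provides the numbers in a form that as.numeric() understands. The following sketch uses hypothetical values, which do not occur on this website, to illustrate what happens otherwise and how a simple clean-up with gsub() might look.

```r
# Hypothetical values as they might appear on a differently formatted page
raw_population <- c("84000", "4,975,593", "None")

# as.numeric() returns NA (with a warning) for anything it cannot convert;
# here the second and third value become NA
naive <- suppressWarnings(as.numeric(raw_population))

# Removing the thousands separators first recovers the second value
cleaned <- suppressWarnings(as.numeric(gsub(",", "", raw_population)))

# Counting NAs is a quick way to spot values that need attention
sum(is.na(naive))
## [1] 2
sum(is.na(cleaned))
## [1] 1
```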

4.3.3 Merge into one tibble

We have now created four vectors, which contain the name of the country, the associated capital, the population and the size of the country, respectively. For Andorra:

country[1]
## [1] "Andorra"
capital[1]
## [1] "Andorra la Vella"
population[1]
## [1] 84000
area[1]
## [1] 468

We could already continue working with this, but for many applications it is more practical if we combine the data in tabular form. In the tidyverse, the form of the tibble is suitable for this purpose.

countries <- tibble(
  Country = country,
  Capital = capital,
  Population = population,
  Area_sqkm = area
)

countries
## # A tibble: 250 × 4
##    Country              Capital          Population Area_sqkm
##    <chr>                <chr>                 <dbl>     <dbl>
##  1 Andorra              Andorra la Vella      84000       468
##  2 United Arab Emirates Abu Dhabi           4975593     82880
##  3 Afghanistan          Kabul              29121286    647500
##  4 Antigua and Barbuda  St. John's            86754       443
##  5 Anguilla             The Valley            13254       102
##  6 Albania              Tirana              2986952     28748
##  7 Armenia              Yerevan             2968000     29800
##  8 Angola               Luanda             13068161   1246700
##  9 Antarctica           None                      0  14000000
## 10 Argentina            Buenos Aires       41343201   2766890
## # ℹ 240 more rows

This is not only more readable but also facilitates all further potential analysis steps.

If we are sure that we do not need the individual vectors, we can also perform the reading of the data and the creation of the tibble in a single step. Below you can see how the complete scraping process can be completed in relatively few lines.

website <- "https://scrapethissite.com/pages/simple/" %>%
  read_html()

countries_2 <- tibble(
  Country = website %>%
    html_elements(css = "h3") %>% 
    html_text(trim = TRUE),
  Capital = website %>% 
    html_elements(css = "span.country-capital") %>% 
    html_text(),
  Population = website %>% 
    html_elements(css = "span.country-population") %>% 
    html_text() %>% 
    as.numeric(),
  Area_sqkm = website %>% 
    html_elements(css = "span.country-area") %>% 
    html_text() %>% 
    as.numeric()
)

countries_2
## # A tibble: 250 × 4
##    Country              Capital          Population Area_sqkm
##    <chr>                <chr>                 <dbl>     <dbl>
##  1 Andorra              Andorra la Vella      84000       468
##  2 United Arab Emirates Abu Dhabi           4975593     82880
##  3 Afghanistan          Kabul              29121286    647500
##  4 Antigua and Barbuda  St. John's            86754       443
##  5 Anguilla             The Valley            13254       102
##  6 Albania              Tirana              2986952     28748
##  7 Armenia              Yerevan             2968000     29800
##  8 Angola               Luanda             13068161   1246700
##  9 Antarctica           None                      0  14000000
## 10 Argentina            Buenos Aires       41343201   2766890
## # ℹ 240 more rows
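Combining separately extracted vectors relies on every country having exactly one element of each kind. If one were missing, the vectors would differ in length and tibble() would stop with an error. A possible alternative, sketched here on a made-up snippet in which the second country deliberately lacks a population, is to select the block of each country first and then use html_element() (singular), which returns one result per block. The snippet, the selector "div.country" and all object names are assumptions for illustration.

```r
library(rvest)
library(tibble)

# Made-up snippet: the second country lacks a population <span>
snippet <- read_html('
  <div class="country">
    <h3 class="country-name">Andorra</h3>
    <span class="country-population">84000</span>
  </div>
  <div class="country">
    <h3 class="country-name">Antarctica</h3>
  </div>
')

# One node per country block
blocks <- html_elements(snippet, css = "div.country")

# html_element() returns exactly one result per block, with a
# missing value where nothing matches, so the rows stay aligned
countries_sketch <- tibble(
  Country = blocks %>%
    html_element(css = "h3") %>%
    html_text(trim = TRUE),
  Population = blocks %>%
    html_element(css = "span.country-population") %>%
    html_text() %>%
    as.numeric()
)

countries_sketch
```

The missing population appears as NA in the second row instead of shifting the values of subsequent countries upwards.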