8 Downloading and saving files
In this section we will learn to download files from within our R scripts, and save the scraping results locally to be reused without the need to scrape the data again. Most data files you will deal with online are CSV files. To begin we will talk briefly about what CSV files are and how we can use them in R.
8.1 CSV files
CSV stands for “comma-separated values”. While CSV files are limited to representing two-dimensional tables, their simplicity and portability make them one of the most widely used formats for this type of data. You can open and write CSV files with statistical software, spreadsheet software like Excel, and even with the most basic text editor.
A simple CSV file could look like this:
column1, column2, column3
data1_1, data1_2, data1_3
data2_1, data2_2, data2_3
data3_1, data3_2, data3_3
In essence, CSV files are tables. The rows of the table are separated by line breaks; the columns are delimited by commas.
The first row usually represents the column names, just as in the example above (column1, column2, column3). Be aware that this is not necessarily always the case. A CSV file may not contain any column names at all and start with a first row of data cells. This is one of the vital differences when it comes to parsing the file.
8.1.1 Parsing a CSV file
To convert the data contained in a CSV file into a format we can use for data analysis, we have to parse it. The function used for parsing “understands” the representation of a table in the CSV syntax: it can discern column names from data cells, assigns data cells to the correct rows and columns, and returns an R object representing the table. read_csv(), from the readr package that we will use here, returns a tibble, while read.csv() from base R returns a data frame.
Since readr is part of the core tidyverse, we can just load the tidyverse package. Soon we will also need rvest, so let us begin with loading both libraries.
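Loading both packages then looks like this:

```r
library(tidyverse)
library(rvest)
```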
read_csv() takes a CSV file as its first argument. To test it, we can also use literal data for the file argument. So we can pass the example defined above as a string – contained in quotation marks – as the file argument. We will not be using this much in practice, but it serves well as a first example.
read_csv(
"column1, column2, column3
data1_1, data1_2, data1_3
data2_1, data2_2, data2_3
data3_1, data3_2, data3_3"
)
## Rows: 3 Columns: 3
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (3): column1, column2, column3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 data1_1 data1_2 data1_3
## 2 data2_1 data2_2 data2_3
## 3 data3_1 data3_2 data3_3
read_csv() guesses the column types from the data entered. All three columns were automatically defined as character vectors here. Automatic guessing works reliably in most situations. Let’s try a CSV with multiple different column types:
read_csv(
"name, age, size, retired
Peter, 42, 1.68, FALSE
Paul, 84, 1.82, TRUE
Mary, 24, 1.74, FALSE"
)
## Rows: 3 Columns: 4
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (2): age, size
## lgl (1): retired
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 4
## name age size retired
## <chr> <dbl> <dbl> <lgl>
## 1 Peter 42 1.68 FALSE
## 2 Paul 84 1.82 TRUE
## 3 Mary 24 1.74 FALSE
read_csv() correctly guessed that the first column is of type character, the third of type double, and the fourth a logical vector. It also guessed that the second column should be of type double, i.e. floating point numbers. This will work, but we could redefine this vector as integer manually by using the col_types argument. This will not be necessary in most situations, but it could increase computational speed with very large datasets, as integers can be stored more efficiently. More on the definition of column types can be found in the help file.
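For illustration – assuming we want age as an integer – we could set col_types using readr’s compact string specification, where each letter stands for one column type (c = character, i = integer, d = double, l = logical):

```r
# Force the second column (age) to integer instead of double
read_csv(
  "name, age, size, retired
  Peter, 42, 1.68, FALSE
  Paul, 84, 1.82, TRUE
  Mary, 24, 1.74, FALSE",
  col_types = "cidl"
)
```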
What happens when we have CSV data without column names defined in the first line?
read_csv(
"Peter, 42, 1.68, FALSE
Paul, 84, 1.82, TRUE
Mary, 24, 1.74, FALSE"
)
## Rows: 2 Columns: 4
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (1): Peter
## dbl (2): 42, 1.68
## lgl (1): FALSE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 4
## Peter `42` `1.68` `FALSE`
## <chr> <dbl> <dbl> <lgl>
## 1 Paul 84 1.82 TRUE
## 2 Mary 24 1.74 FALSE
read_csv() by itself cannot discern whether the authors of the CSV file intended the first line to be used for names or not. If there are no column names in the CSV data, as in this case, we have to explicitly tell read_csv() not to use the first line for column names. We can do this by setting the col_names argument to FALSE; in this case, read_csv() chooses default names for the columns, which can be changed later on.
read_csv(
"Peter, 42, 1.68, FALSE
Paul, 84, 1.82, TRUE
Mary, 24, 1.74, FALSE",
col_names = FALSE)
## Rows: 3 Columns: 4
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (1): X1
## dbl (2): X2, X3
## lgl (1): X4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 4
## X1 X2 X3 X4
## <chr> <dbl> <dbl> <lgl>
## 1 Peter 42 1.68 FALSE
## 2 Paul 84 1.82 TRUE
## 3 Mary 24 1.74 FALSE
8.2 Downloading files
Let us look at some real CSV files to download and parse.
The website https://www.bundeswahlleiter.de/bundestagswahlen/2017/ergebnisse/repraesentative-wahlstatistik.html holds multiple CSV files containing inferential statistics on the German federal election held in 2017. Among these files are two files with statistics on the number of people eligible to vote and on the number that actually voted, both by binary gender, year of birth, and additional indicators.
Your first impulse would be to download the files manually and then parse them. This will work just fine, but we could also download the files directly from our code. Due to the small number of files in this example, this may not even be the most efficient way, but when you start handling larger amounts of files or want to regularly update files, downloading them from within your code is a safer and more efficient option.
To start, we will download the HTML file containing the links to the CSV files. Using the tools at our disposal, we will extract the first two links following the header “Tabellen zur Weiterverwendung”. To understand the construction of the CSS selector, read chapter 5.1. Note that this is only one of many possible selectors that can select both links of interest.
website <- "https://www.bundeswahlleiter.de/bundestagswahlen/2017/ergebnisse/repraesentative-wahlstatistik.html" %>%
read_html()
a_elements <- website %>%
html_elements(css = "ul[style='list-style-type:none'] > li:first-child > a,
ul[style='list-style-type:none'] > li:nth-child(2) > a")
a_elements
## {xml_nodeset (2)}
## [1] <a data-file-extension="CSV" data-file-size="40,53 kB" title="Link zur CS ...
## [2] <a data-file-extension="CSV" data-file-size="68,82 kB" title="Link zur CS ...
links <- a_elements %>%
html_attr(name = "href")
links
## [1] "../../../dam/jcr/38972966-dc3d-40fa-91d7-6599d913f5e9/btw17_rws_bw2.csv"
## [2] "../../../dam/jcr/a67208c0-2a2b-41aa-abb8-204b09e73b6b/btw17_rws_bst2.csv"
We succeeded in extracting the links, but we can also see that they are not complete, since they are missing their base URL. This works on the website, as the links are relative. To access the files directly, though, we need an absolute link – the full path. As before (subsection 7.2.2), we can use the function str_c() to construct the complete URLs. If we give str_c() a vector as one of its input arguments, the function repeats the string concatenation for each element in the vector. For this example we can ignore the “../../../” part of the link, as it will work regardless; to remove this part we could apply some of the functions from the stringr package. For this we would need some basic knowledge of string manipulation and regular expressions, which we will gain in chapter 13. So for now, we will just use the links as they come.
links <- str_c("https://www.bundeswahlleiter.de/", links)
links
## [1] "https://www.bundeswahlleiter.de/../../../dam/jcr/38972966-dc3d-40fa-91d7-6599d913f5e9/btw17_rws_bw2.csv"
## [2] "https://www.bundeswahlleiter.de/../../../dam/jcr/a67208c0-2a2b-41aa-abb8-204b09e73b6b/btw17_rws_bst2.csv"
Now that we have complete absolute links, we can download the files to our hard drive using the base R function download.file(). We specify a URL as its first argument and a path and file name for the destfile argument. When no path is given, the file is saved in the current working directory.
download.file(links[1], destfile = "eligible.csv")
download.file(links[2], destfile = "turnout.csv")
We could also have used read_csv() to parse the CSV files directly from the web. However, it may still be a good idea to download the files to our hard drive, since we only have to do this once; this decreases traffic for the server and increases the efficiency of our code. It is only a benefit, though, if we download once and not on every re-run of our script, so we have to comment out the lines where the download occurs after it has been successful.
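As an alternative to commenting out the lines, we could guard each download with file.exists(), so the download only runs when the local copy is missing – a sketch:

```r
# Download only if a local copy does not exist yet
if (!file.exists("eligible.csv")) {
  download.file(links[1], destfile = "eligible.csv")
}
if (!file.exists("turnout.csv")) {
  download.file(links[2], destfile = "turnout.csv")
}
```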
8.2.1 Parsing the CSV files
Now we proceed to parse the downloaded CSV files into R objects. But first let’s have a look at the CSV files. You can open them in a text editor of your choosing or directly in RStudio. Importing them into Excel or a similar spreadsheet software is not a valid option here, as we want to see the raw contents of the file and not a representation of the data.
Looking at eligible.csv, we notice at least two things. First, the delimiter used here is not a comma but a semicolon. The default CSV style discussed above is common in countries where a “.” is used as the decimal point, so the “,” is free to be used as a delimiter. In countries where the “,” is used as the decimal point, it is not available as a delimiter, and therefore the “;” is used. The function read_csv2() can be used in this case. A more general function is read_delim(), where the delimiting character can be freely defined as an argument. Second, we see that the first nine lines are neither column names nor data, but contain comments on the CSV file itself. We have to tell read_csv2() to ignore these lines. We could use the argument skip to tell R to ignore the first nine lines; however, these comments are neatly introduced with a #, so we can use the comment argument instead and specify the # as the identifier for comments, which will then be ignored.
eligible_tbl <- read_csv2("eligible.csv", comment = "#")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 627 Columns: 10
## ── Column specification ────────────────────────
## Delimiter: ";"
## chr (3): Land, Geschlecht, Geburtsjahresgruppe
## dbl (7): Wahlberechtigte, Wahlberechtigte ohne Wahlscheinvermerk, Wahlberech...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eligible_tbl
## # A tibble: 627 × 10
## Land Geschlecht Geburtsjahresgruppe Wahlberechtigte Wahlberechtigte ohne W…¹
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Bund Summe Summe 61688485 47595800
## 2 Bund Summe 1997 - 1999 2045501 1709333
## 3 Bund Summe 1993 - 1996 2902998 2295723
## 4 Bund Summe 1988 - 1992 4206014 3311417
## 5 Bund Summe 1983 - 1987 4302277 3455795
## 6 Bund Summe 1978 - 1982 4264347 3490155
## 7 Bund Summe 1973 - 1977 3953481 3273575
## 8 Bund Summe 1968 - 1972 5230865 4232089
## 9 Bund Summe 1958 - 1967 12396299 9581842
## 10 Bund Summe 1948 - 1957 9487267 6950185
## # ℹ 617 more rows
## # ℹ abbreviated name: ¹`Wahlberechtigte ohne Wahlscheinvermerk`
## # ℹ 5 more variables: `Wahlberechtigte mit Wahlscheinvermerk` <dbl>,
## # `Wähler/-innen` <dbl>, `Wähler/-innen ohne Wahlschein` <dbl>,
## # `Wähler/-innen mit Wahlschein` <dbl>, Wahlbeteiligung <dbl>
turnout_tbl <- read_csv2("turnout.csv", comment = "#")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 798 Columns: 16
## ── Column specification ────────────────────────
## Delimiter: ";"
## chr (3): Land, Geschlecht, Geburtsjahresgruppe
## dbl (13): Erst-/Zweitstimme, Summe, Ungültig, CDU, SPD, DIE LINKE, GRÜNE, CS...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
turnout_tbl
## # A tibble: 798 × 16
## Land `Erst-/Zweitstimme` Geschlecht Geburtsjahresgruppe Summe Ungültig
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 Bund 1 Summe Summe 46976341 586726
## 2 Bund 1 Summe 1993 - 1999 3369640 37704
## 3 Bund 1 Summe 1983 - 1992 6039663 58687
## 4 Bund 1 Summe 1973 - 1982 6227268 62068
## 5 Bund 1 Summe 1958 - 1972 13918027 146749
## 6 Bund 1 Summe 1948 - 1957 7693131 103820
## 7 Bund 1 Summe 1947 und früher 9728612 177697
## 8 Bund 1 m Summe 22757980 274037
## 9 Bund 1 m 1993 - 1999 1694779 20330
## 10 Bund 1 m 1983 - 1992 3001418 29800
## # ℹ 788 more rows
## # ℹ 10 more variables: CDU <dbl>, SPD <dbl>, `DIE LINKE` <dbl>, GRÜNE <dbl>,
## # CSU <dbl>, FDP <dbl>, AfD <dbl>, Sonstige <dbl>, `dar. NPD` <dbl>,
## # `dar. FREIE WÄHLER` <dbl>
8.3 Saving data files
In my opinion, the best approach to data analysis in terms of reproducibility is to have one script in which you load your raw data, do all your data cleaning and transformation, run all statistical and graphical analyses, and export all tables and graphics. This ensures that anyone who has access to your script and the raw data can reproduce every step of your analysis, which is one of the cornerstones of transparency and reproducibility in scientific work. When we take this approach, we seldom need to save the results of our data analysis in a file. The script – the .R file – is enough.
Yet, there are situations where saving results becomes reasonable or even necessary. One of these is web scraping, for at least two reasons. To begin with, we want to minimize the traffic we create on the websites we scrape (see @ref(good_practice)). If we re-run the read_html() function every time we run our script, we also re-download all of the HTML data on each re-run. This puts unnecessary load on the servers and slows down our script, which becomes more relevant the more data you are scraping from the web.
Another point to consider is that websites change all the time. The data you scraped today may be changed or even gone tomorrow. You should also keep in mind that if the structure of the site changes, your CSS selectors and URLs may no longer work. The website you are viewing right now is an updated version of a site that was originally written for a seminar in the summer of 2021. In the course of updating it for 2022, virtually every example had to be rewritten at least in part, because the scraped websites had changed, often only in small details, rendering most of the original CSS selectors useless.
So for both of those reasons, it may be appropriate to save all of our downloaded HTML data locally, so that we can re-load it for re-runs of our script without having to worry about changes to the source.
The most straightforward approach is to use the download.file() function, apply it directly to the URL, and thus copy the original HTML data exactly. We run this line once and afterwards just use read_html() on the local file.
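A sketch, using a small example page from this book’s website:

```r
# Run this line once to store a local copy of the HTML file
download.file("https://webscraping-tures.github.io/hello_world.html",
              destfile = "hello_world.html")

# From then on, parse the local copy instead of the live website
hello_world <- read_html("hello_world.html")
```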
In most situations we do not need to save the results of our transformations and analyses. Re-running the script on the original data – which we have now saved locally – is enough to ensure that the work can be reproduced. But if we are still interested in saving the results of our analysis to a local file, we could use write_csv() from readr, or save the data in R’s native format, .RData.
Let’s create an example object we want to save first:
data <- tibble(
name = c("Peter", "Paul", "Mary"),
age = c(42, 84, 24),
size = c(1.68, 1.82, 1.74),
retired = c(FALSE, TRUE, FALSE)
)
write_csv() is an appropriate function for saving tibbles or data frames, and on top of that the CSV format has the advantage of being readable by most software that deals with data, not just by R. The function takes the object to be saved as its first argument and the path and file name used for saving as its second argument.
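Saving our example tibble could then look like this (the file name data.csv is our choice):

```r
write_csv(data, "data.csv")
```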
The resulting file looks like this:
name,age,size,retired
Peter,42,1.68,FALSE
Paul,84,1.82,TRUE
Mary,24,1.74,FALSE
If we want to use semicolons as the delimiter, we could also use write_csv2(); to load the resulting file we would just use read_csv() or read_csv2() respectively.
The downside to using CSV files is that the definition of column types – e.g. character, numeric, or integer – is lost and may have to be redefined when reloading the data. Also, more complex objects, like lists, cannot be saved as a CSV, so we may use R’s native data format, .RData. These files also have the advantage of being neatly compressed, thus taking up less space on your drive, and R can read them in faster than CSV files.
The save() function takes the name of the object – or multiple objects – to be saved as its first argument. We also have to specify a file name (and possibly a path) in the file argument.
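For our example tibble, a call could look like this (again, the file name is our choice):

```r
save(data, file = "data.RData")
```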
We can load objects stored as .RData files by using the load() function on the file.
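For example:

```r
load("data.RData")
```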
Please note that save() does not work when used on an object to which the result of read_html() is assigned; it will save an object, but if you load it, the object will be empty. The reason is rather technical – the object holds an external pointer to parsed HTML in memory, which cannot be serialised – but you should know that you will lose some of your work if you try something like this:
hello_world <- read_html("https://webscraping-tures.github.io/hello_world.html")
hello_world
save(hello_world, file = "hello_world.RData")
rm(hello_world)
load("hello_world.RData")
hello_world
As you can see, the object hello_world does not work after loading.
Sidenote: the function rm() removes the object(s) given between the parentheses from the environment.