8 Downloading and saving files
In this section we will learn to download files from within our R scripts, and save the scraping results locally to be reused without the need to scrape the data again. Most data files you will deal with online are CSV files. To begin we will talk briefly about what CSV files are and how we can use them in R.
8.1 CSV files
CSV stands for “comma-separated values”. While CSV files are limited to representing two-dimensional tables, their simplicity and portability make them one of the most widely used formats for this type of data. You can open and write CSV files with statistical software, spreadsheet software like Excel, and even with the most basic text editor.
A simple CSV file could look like this:
column1, column2, column3
data1_1, data1_2, data1_3
data2_1, data2_2, data2_3
data3_1, data3_2, data3_3
In essence, CSV files are tables. The rows of the table are separated by line breaks; the columns are delimited by commas.
The first row usually represents the column names, just as in the example above (column1, column2, column3). Be aware that this is not necessarily always the case. A CSV file may not contain any column names at all and start with a first row of data cells. This is one of the vital differences when it comes to parsing the file.
8.1.1 Parsing a CSV file
To convert the data contained in a CSV file into a format we can use for data analysis, we have to parse it. The function used for parsing “understands” the representation of a table in the CSV syntax: it can discern column names from data cells, assigns data cells to the correct rows and columns, and returns an R object representing the table. read_csv(), from the readr package that we will use here, returns a tibble, while read.csv() from base R returns a data frame.
Since readr is part of the core tidyverse, we can just load the tidyverse package. Soon we will also need rvest, so let us begin with loading both libraries.
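Loading both packages then looks like this:

```r
library(tidyverse)
library(rvest)
```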
read_csv() takes a CSV file as its first argument. To test it, we can also use literal data for the file argument. So we can pass the example defined above as a string – contained in quotation marks – as the file argument. We will not be using this much in practice, but it serves well as a first example.
read_csv(
"column1, column2, column3
data1_1, data1_2, data1_3
data2_1, data2_2, data2_3
data3_1, data3_2, data3_3"
)
## Rows: 3 Columns: 3
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (3): column1, column2, column3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 data1_1 data1_2 data1_3
## 2 data2_1 data2_2 data2_3
## 3 data3_1 data3_2 data3_3
read_csv() guesses the column types from the data entered. All three columns were automatically defined as character vectors here. Automatic guessing works reliably in most situations. Let’s try a CSV with multiple different column types:
read_csv(
"name, age, size, retired
Peter, 42, 1.68, FALSE
Paul, 84, 1.82, TRUE
Mary, 24, 1.74, FALSE"
)
## Rows: 3 Columns: 4
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (2): age, size
## lgl (1): retired
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 4
## name age size retired
## <chr> <dbl> <dbl> <lgl>
## 1 Peter 42 1.68 FALSE
## 2 Paul 84 1.82 TRUE
## 3 Mary 24 1.74 FALSE
read_csv() correctly guessed that the first column is of type character, the third of type double, and the fourth a logical vector. It also guessed that the second column should be of type double, i.e. floating point numbers. This will work, but we could redefine this vector as integer manually by using the col_types argument. This will not be necessary in most situations, but it could increase computational speed with very large datasets, as integers can be stored more efficiently. More on the definition of column types can be found in the help file.
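For illustration – assuming we want age as an integer – we could set col_types using readr’s compact string specification, where each letter stands for one column type (c = character, i = integer, d = double, l = logical):

```r
# Force the second column (age) to integer instead of double
read_csv(
  "name, age, size, retired
  Peter, 42, 1.68, FALSE
  Paul, 84, 1.82, TRUE
  Mary, 24, 1.74, FALSE",
  col_types = "cidl"
)
```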
What happens when we have CSV data without column names defined in the first line?
read_csv(
"Peter, 42, 1.68, FALSE
Paul, 84, 1.82, TRUE
Mary, 24, 1.74, FALSE"
)
## Rows: 2 Columns: 4
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (1): Peter
## dbl (2): 42, 1.68
## lgl (1): FALSE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 4
## Peter `42` `1.68` `FALSE`
## <chr> <dbl> <dbl> <lgl>
## 1 Paul 84 1.82 TRUE
## 2 Mary 24 1.74 FALSE
read_csv() by itself cannot discern whether the authors of the CSV file intended the first line to be used for names or not. If there are no column names in the CSV data, as in this case, we have to explicitly tell read_csv() not to use the first line for column names. We can do this by setting the col_names argument to FALSE; in this case, read_csv() chooses default names for the columns, which can be changed later on.
read_csv(
"Peter, 42, 1.68, FALSE
Paul, 84, 1.82, TRUE
Mary, 24, 1.74, FALSE",
col_names = FALSE)
## Rows: 3 Columns: 4
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (1): X1
## dbl (2): X2, X3
## lgl (1): X4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 4
## X1 X2 X3 X4
## <chr> <dbl> <dbl> <lgl>
## 1 Peter 42 1.68 FALSE
## 2 Paul 84 1.82 TRUE
## 3 Mary 24 1.74 FALSE
8.2 Downloading files
Let us look at some real CSV files to download and parse.
The website https://www.bundeswahlleiter.de/bundestagswahlen/2017/ergebnisse/repraesentative-wahlstatistik.html holds multiple CSV files containing inferential statistics on the German federal election held in 2017. Among these files are two files with statistics on the number of people eligible to vote and on the number that actually voted, both by binary gender, year of birth, and additional indicators.
Your first impulse would be to download the files manually and then parse them. This will work just fine, but we could also download the files directly from our code. Due to the small number of files in this example, this may not even be the most efficient way, but when you start handling larger amounts of files or want to regularly update files, downloading them from within your code is a safer and more efficient option.
To start, we will download the HTML file containing the links to the CSV files. Using the tools at our disposal, we will extract the first two links following the header “Tabellen zur Weiterverwendung”. To understand the construction of the CSS selector, read chapter 5.1. Note that this is only one of many possible selectors that can select both links of interest.
website <- "https://www.bundeswahlleiter.de/bundestagswahlen/2017/ergebnisse/repraesentative-wahlstatistik.html" %>%
read_html()
a_elements <- website %>%
html_elements(css = "ul[style='list-style-type:none'] > li:first-child > a,
ul[style='list-style-type:none'] > li:nth-child(2) > a")
a_elements
## {xml_nodeset (2)}
## [1] <a data-file-extension="CSV" data-file-size="40,53 kB" title="Link zur CS ...
## [2] <a data-file-extension="CSV" data-file-size="68,82 kB" title="Link zur CS ...
links <- a_elements %>%
html_attr(name = "href")
links
## [1] "../../../dam/jcr/38972966-dc3d-40fa-91d7-6599d913f5e9/btw17_rws_bw2.csv"
## [2] "../../../dam/jcr/a67208c0-2a2b-41aa-abb8-204b09e73b6b/btw17_rws_bst2.csv"
We succeeded in extracting the links, but we can also see that they are not complete, since they are missing their base URL. This works on the website, as the links are relative. To access the files directly, though, we need an absolute link – the full path. As before (subsection 7.2.2), we can use the function str_c() to construct the complete URLs. If we give str_c() a vector as one of its input arguments, the function repeats the string concatenation for each element in the vector. For this example we can ignore the “../../../” part of the link, as it will work regardless; to remove this part we could apply some of the functions from the stringr package. For this we would need some basic knowledge of string manipulation and regular expressions, which we will gain in chapter 13. So for now, we will just use the links as they come.
links <- str_c("https://www.bundeswahlleiter.de/", links)
links
## [1] "https://www.bundeswahlleiter.de/../../../dam/jcr/38972966-dc3d-40fa-91d7-6599d913f5e9/btw17_rws_bw2.csv"
## [2] "https://www.bundeswahlleiter.de/../../../dam/jcr/a67208c0-2a2b-41aa-abb8-204b09e73b6b/btw17_rws_bst2.csv"
Now that we have complete absolute links, we can download the files to our hard drive using the base R function download.file(). We specify a URL as its first argument and a path and file name for the destfile argument. When no path is given, the file is saved in the current working directory.
download.file(links[1], destfile = "eligible.csv")
download.file(links[2], destfile = "turnout.csv")
We could also have used read_csv() to parse the CSV files directly from the web. However, it may still be a good idea to download the files to our hard drive, since we only have to do this once; this decreases traffic for the server and increases the efficiency of our code. It is only a benefit, though, if we download once and not on every re-run of our script, so we have to comment out the lines where the download occurs after it has been successful.
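As an alternative to commenting out the lines, we could guard each download with file.exists(), so the download only runs when the local copy is missing – a sketch:

```r
# Download only if a local copy does not exist yet
if (!file.exists("eligible.csv")) {
  download.file(links[1], destfile = "eligible.csv")
}
if (!file.exists("turnout.csv")) {
  download.file(links[2], destfile = "turnout.csv")
}
```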
8.2.1 Parsing the CSV files
Now we proceed to parse the downloaded CSV files into R objects. But first let’s have a look at the CSV files. You can open them in a text editor of your choosing or directly in RStudio. Importing them into Excel or a similar spreadsheet software is not a valid option here, as we want to see the raw contents of the file and not a representation of the data.
Looking at eligible.csv, we notice at least two things. First, the delimiter used here is not a comma but a semicolon. The default CSV style discussed above is common in countries where a “.” is used as the decimal point, so the “,” is free to be used as a delimiter. In countries where the “,” is used as the decimal point, it is not available as a delimiter, and therefore the “;” is used. The function read_csv2() can be used in this case. A more general function is read_delim(), where the delimiting character can be freely defined as an argument. Second, we see that the first nine lines are neither column names nor data, but contain comments on the CSV file itself. We have to tell read_csv2() to ignore these lines. We could use the argument skip to tell R to ignore the first nine lines; however, these comments are neatly introduced with a #, so we can use the comment argument instead and specify the # as the identifier for comments, which will then be ignored.
eligible_tbl <- read_csv2("eligible.csv", comment = "#")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 627 Columns: 10
## ── Column specification ────────────────────────
## Delimiter: ";"
## chr (3): Land, Geschlecht, Geburtsjahresgruppe
## dbl (7): Wahlberechtigte, Wahlberechtigte ohne Wahlscheinvermerk, Wahlberech...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eligible_tbl
## # A tibble: 627 × 10
## Land Geschlecht Geburtsjahresgruppe Wahlberechtigte Wahlberechtigte ohne W…¹
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Bund Summe Summe 61688485 47595800
## 2 Bund Summe 1997 - 1999 2045501 1709333
## 3 Bund Summe 1993 - 1996 2902998 2295723
## 4 Bund Summe 1988 - 1992 4206014 3311417
## 5 Bund Summe 1983 - 1987 4302277 3455795
## 6 Bund Summe 1978 - 1982 4264347 3490155
## 7 Bund Summe 1973 - 1977 3953481 3273575
## 8 Bund Summe 1968 - 1972 5230865 4232089
## 9 Bund Summe 1958 - 1967 12396299 9581842
## 10 Bund Summe 1948 - 1957 9487267 6950185
## # ℹ 617 more rows
## # ℹ abbreviated name: ¹`Wahlberechtigte ohne Wahlscheinvermerk`
## # ℹ 5 more variables: `Wahlberechtigte mit Wahlscheinvermerk` <dbl>,
## # `Wähler/-innen` <dbl>, `Wähler/-innen ohne Wahlschein` <dbl>,
## # `Wähler/-innen mit Wahlschein` <dbl>, Wahlbeteiligung <dbl>
turnout_tbl <- read_csv2("turnout.csv", comment = "#")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 798 Columns: 16
## ── Column specification ────────────────────────
## Delimiter: ";"
## chr (3): Land, Geschlecht, Geburtsjahresgruppe
## dbl (13): Erst-/Zweitstimme, Summe, Ungültig, CDU, SPD, DIE LINKE, GRÜNE, CS...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
turnout_tbl
## # A tibble: 798 × 16
## Land `Erst-/Zweitstimme` Geschlecht Geburtsjahresgruppe Summe Ungültig
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 Bund 1 Summe Summe 46976341 586726
## 2 Bund 1 Summe 1993 - 1999 3369640 37704
## 3 Bund 1 Summe 1983 - 1992 6039663 58687
## 4 Bund 1 Summe 1973 - 1982 6227268 62068
## 5 Bund 1 Summe 1958 - 1972 13918027 146749
## 6 Bund 1 Summe 1948 - 1957 7693131 103820
## 7 Bund 1 Summe 1947 und früher 9728612 177697
## 8 Bund 1 m Summe 22757980 274037
## 9 Bund 1 m 1993 - 1999 1694779 20330
## 10 Bund 1 m 1983 - 1992 3001418 29800
## # ℹ 788 more rows
## # ℹ 10 more variables: CDU <dbl>, SPD <dbl>, `DIE LINKE` <dbl>, GRÜNE <dbl>,
## # CSU <dbl>, FDP <dbl>, AfD <dbl>, Sonstige <dbl>, `dar. NPD` <dbl>,
## # `dar. FREIE WÄHLER` <dbl>
8.3 Saving data files
In my opinion, the best approach to data analysis in terms of reproducibility is to have one script in which you load your raw data, do all your data cleaning and transformation, run all statistical and graphical analyses, and export all tables and graphics. This ensures that anyone who has access to your script and the raw data can reproduce every step of your analysis, which is one of the cornerstones of transparency and reproducibility in scientific work. When we take this approach, we seldom need to save the results of our data analysis in a file. The script – the .R file – is enough.
Yet, there are situations where saving results becomes reasonable or even necessary. One of these is web scraping, for at least two reasons. To begin with, we want to minimize the traffic we create on the websites we scrape (see @ref(good_practice)). If we re-run the read_html() function every time we run our script, we also re-download all of the HTML data on each re-run. This puts unnecessary load on the servers and slows down our script, which becomes more relevant the more data you are scraping from the web.
Another point to consider is that websites change all the time. The data you scraped today may be changed or even gone tomorrow. You should also keep in mind that if the structure of the site changes, your CSS selectors and URLs may no longer work. The website you are viewing right now is an updated version of a site that was originally written for a seminar in the summer of 2021. In the course of updating it for 2022, virtually every example had to be rewritten at least in part, because the scraped websites had changed, often only in small details, rendering most of the original CSS selectors useless.
So for both of those reasons, it may be appropriate to save all of our downloaded HTML data locally, so that we can re-load it for re-runs of our script without having to worry about changes to the source.
The most straightforward approach is to use the download.file() function, apply it directly to the URL, and thus copy the original HTML data exactly. We run this line once and afterwards just use read_html() on the local file.
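A sketch, using a small example page from this book’s website:

```r
# Run this line once to store a local copy of the HTML file
download.file("https://webscraping-tures.github.io/hello_world.html",
              destfile = "hello_world.html")

# From then on, parse the local copy instead of the live website
hello_world <- read_html("hello_world.html")
```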
In most situations we do not need to save the results of our transformations and analyses. Re-running the script on the original data – which we have now saved locally – is enough to ensure that the work can be reproduced. But if we are still interested in saving the results of our analysis to a local file, we could use write_csv() from readr, or save the data in R’s native format, .RData.
Let’s create an example object we want to save first:
data <- tibble(
name = c("Peter", "Paul", "Mary"),
age = c(42, 84, 24),
size = c(1.68, 1.82, 1.74),
retired = c(FALSE, TRUE, FALSE)
)
write_csv() is an appropriate function for saving tibbles or data frames, and on top of that the CSV format has the advantage of being readable by most software that deals with data, not just by R. The function takes the object to be saved as its first argument and the path and file name used for saving as its second argument.
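Saving our example tibble could then look like this (the file name data.csv is our choice):

```r
write_csv(data, "data.csv")
```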
The resulting file looks like this:
name,age,size,retired
Peter,42,1.68,FALSE
Paul,84,1.82,TRUE
Mary,24,1.74,FALSE
If we want to use semicolons as the delimiter, we could also use write_csv2(); to load the resulting file we would just use read_csv() or read_csv2() respectively.
The downside to using CSV files is that the definition of column types – e.g. character, numeric, or integer – is lost and may have to be redefined when reloading the data. Also, more complex objects, like lists, cannot be saved as a CSV, so we may use R’s native data format, .RData. These files also have the advantage of being neatly compressed, thus taking up less space on your drive, and R can read them in faster than CSV files.
The save() function takes the name of the object – or multiple objects – to be saved as its first argument. We also have to specify a file name (and possibly a path) in the file argument.
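For our example tibble, a call could look like this (again, the file name is our choice):

```r
save(data, file = "data.RData")
```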
We can load objects stored as .RData files by using the load() function on the file.
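For example:

```r
load("data.RData")
```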
Please note that save() does not work when used on an object to which the result of read_html() is assigned; it will save an object, but if you load it, the object will be empty. The reason is rather technical – the object holds an external pointer to parsed HTML in memory, which cannot be serialised – but you should know that you will lose some of your work if you try something like this:
hello_world <- read_html("https://webscraping-tures.github.io/hello_world.html")
hello_world
save(hello_world, file = "hello_world.RData")
rm(hello_world)
load("hello_world.RData")
hello_world
As you can see, the object hello_world does not work after loading.
Sidenote: the function rm() removes the object(s) given between the parentheses from the environment.