2 Packages, the tidyverse and R Markdown

2.1 R packages

The R world is open and collaborative by nature. Besides the packages that come with your R installation – base R – an ever growing number of additional packages, written by professionals and users, is available for download by anyone. Every package is focussed on a specific use case and brings with it a number of functions that enable R to be used for tasks that the original software designers did not have in mind or at the very least provide a smoother user experience in cases where the original base R solutions are more complicated.

The packages, their documentation and various other related information are hosted at CRAN – the “Comprehensive R Archive Network” – which you already got to know during the installation of R. If you install a package directly from RStudio, it uses CRAN to find and download the package and the associated files.

2.1.1 Installing and loading packages

To install a package we can use the R function install.packages(), where the name of the package to be installed is written between the parentheses, enclosed by ". Normally we do this using the console. Installing packages from an R script works as well, but as we only need to perform the installation once, there is no benefit in it. It actually slows things down if we repeat the installation every time we run a script. At the same time, if we share our script, it is impolite to force a (re)installation on somebody else.

For this introduction we will focus on the packages of the tidyverse – more on them below. To install the core tidyverse package, you should type:

install.packages("tidyverse")

R will output a lot of information concerning the installation process, and close with a satisfying DONE (tidyverse) if everything went according to plan.
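Because installation is a one-off step, a script that is shared with others can check whether the required packages are present and simply report what is missing, instead of installing anything. The following is a minimal sketch of that idea; the object names required_pkgs, is_installed and not_installed are made up for this illustration.

```r
# Packages this (hypothetical) script depends on
required_pkgs <- c("tidyverse")

# requireNamespace() returns TRUE if a package is installed and can be loaded;
# it loads the package's namespace but does not attach it like library() does
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required_pkgs[!is_installed]

# Tell the user what is missing instead of installing it for them
if (length(not_installed) > 0) {
  message("Please install first: ", paste(not_installed, collapse = ", "))
}
```

Whoever runs the script can then decide when and how to install the missing packages.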

Please note that R is case sensitive. This means that “Tidyverse” is not the same as “tidyverse”. To R, these are two different words. When installing packages, their names have to be written exactly as they are named or R will not find the package. The same principle applies to object names, strings, functions, arguments and work with R in general.
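Case sensitivity is easy to try out in the console. The short sketch below uses only base R; the object name my_value is an arbitrary choice for this illustration.

```r
my_value <- 42

exists("my_value")    # TRUE: an object with exactly this name was created
exists("My_Value")    # FALSE: to R this is a different name that was never assigned

"tidyverse" == "Tidyverse"    # FALSE: strings are compared case sensitively as well
```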

Now that the installation is complete, we can load the package. This should normally be done in the first lines of a script. This way all necessary packages are loaded at the beginning of running a script and other users that see your code also immediately see which packages are required. Loading a package is done with library() with the name of the package in the parentheses, this time without the need for enclosing it in ".

library(tidyverse)

Loading the tidyverse package returns a lot of information to us, some of which we will look at in more detail during the course of this chapter. Please note that not all packages are that verbose in their loading process. Often you will get no output at all which is a good sign, as this also means that the package loaded correctly. If anything goes wrong, R will return an error message.

2.1.2 Namespaces

Looking at the last lines of the returned message when loading the tidyverse package, we’re informed that there are two conflicts. These arise when two or more loaded packages include functions with the same name. Here we can see that the tidyverse package dplyr masks the functions filter() and lag() from the base R package stats. If we had used filter() without loading dplyr, the function from the stats package would have been used. After loading it, the function from dplyr masks the function from stats and is used instead.

If we had a case where we want to load dplyr, but still use filter() from stats, we can still do this by explicitly declaring the namespace which we are referring to. The namespace basically is a reference for R where to look up the function we have called. If we just write the function’s name, R looks for it in the list of loaded packages, which would result in applying filter() from dplyr here. But we can tell R to look up the function in another namespace, by using the notation namespace::function. So to call filter() from stats while the function is masked by the similarly named function from dplyr, we could write stats::filter(). As the function will not work without further arguments, we can’t try this out directly, but the same principle applies to loading the help files:

?dplyr::filter()
?stats::filter()
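To see that the two functions really do different things, both can be called with explicit namespaces and some arguments. This is only a sketch: the input vector and the small data frame scores are made up for illustration, and the second call assumes that dplyr is installed.

```r
# filter() from stats applies a linear filter to a series of numbers;
# here: a moving average over three neighbouring values
stats::filter(1:5, rep(1 / 3, 3))

# filter() from dplyr keeps only the rows of a table that meet a condition
scores <- data.frame(id = 1:5, score = c(10, 25, 15, 30, 5))   # made-up data
dplyr::filter(scores, score > 12)
```

Because the namespace is stated explicitly, both calls work regardless of which packages are currently loaded and which function is masked.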

2.2 Tidyverse

While we will use some base R functions throughout this course, our main focus will lie on the tidyverse packages.

The tidyverse is a collection of R packages, all following a shared philosophy concerning the syntax of their functions and the way in which data is represented. We will see how the philosophy underlying the tidyverse can lead to more intuitive R code, especially when using the pipe (%>%), in the next chapter. If you want to learn more about the concept of tidy data, the structure of data representation underlying the tidyverse, a read of the chapter on this concept from “R for Data Science” by Wickham & Grolemund is highly recommended: https://r4ds.had.co.nz/tidy-data.html.

Right now, the core tidyverse consists of eight packages. These are the packages that are loaded when we type library(tidyverse) and that are listed in the corresponding output under “Attaching packages”. As the name suggests, the packages comprise the core functionalities that define the tidyverse. This includes reading, cleaning and transforming data, handling certain data types, plotting graphs and more. Over the course of this introduction to web scraping, we will make use of several of these packages, so in most chapters we will begin our scripts with loading the tidyverse package.

Besides the core tidyverse, a number of additional and more specialised packages are part of the tidyverse and were already installed when you ran install.packages("tidyverse") above. Among them, the package rvest is of special importance to us, as it will be our main tool for web scraping throughout the course.
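If you would rather check this from within R, the tidyverse package offers the helper function tidyverse_packages(), which returns the names of all packages belonging to the tidyverse as a character vector. A short sketch, assuming the tidyverse has been installed as described above:

```r
library(tidyverse)

# Names of all packages that belong to the tidyverse, core or not
tidyverse_packages()

# Check whether a specific package is among them
"rvest" %in% tidyverse_packages()
```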

For a full list of tidyverse packages and the corresponding descriptions of their functionality, you can visit: https://www.tidyverse.org/packages/

2.2.1 Tibbles

The tibble package is part of the core tidyverse and offers an alternative to data frames that are used in base R to represent data in tabular form. The differences between data frames and tibbles are relatively minor. If you are interested in the details, you can read up on them in this section from “R for Data Science” and the chapter on tibbles in general: https://r4ds.had.co.nz/tibbles.html#tibbles-vs.-data.frame. For now, it will suffice to know that tibbles are used throughout this introduction, but that all examples will also work with the classic data frames.

The syntax to create a tibble is simple. Every column represents a variable, every row an observation. You should think of the columns as vectors, where the first position in each vector corresponds to the first observation (row), the second position in each vector to the second observation, and so on. In this way, we can create tibbles vector by vector or variable by variable, using the function tibble(). We assign a name to the variable followed by = and the data to be assigned to the variable. The variable-data pairs are separated by ,:

tibble(numbers = c(0, 1, 2), strings = c("zero", "one", "two"), logicals = c(FALSE, TRUE, TRUE))
## # A tibble: 3 × 3
##   numbers strings logicals
##     <dbl> <chr>   <lgl>   
## 1       0 zero    FALSE   
## 2       1 one     TRUE    
## 3       2 two     TRUE

For longer code like this, it is advisable to use multiple lines and clearer formatting to create code that is readable and intuitive:

tibble(
  numbers = c(0, 1, 2),
  strings = c("zero", "one", "two"),
  logicals = c(FALSE, TRUE, TRUE)
)
## # A tibble: 3 × 3
##   numbers strings logicals
##     <dbl> <chr>   <lgl>   
## 1       0 zero    FALSE   
## 2       1 one     TRUE    
## 3       2 two     TRUE

R understands that all five lines are part of one command as it evaluates everything between the opening and closing bracket of the tibble() function together. We just have to make sure that we don’t miss the closing bracket or a , that separates the variable-data pairs. This actually is a main source of errors and will be high on your list of things to check if something does not work as planned.

We can also use calculations and functions directly in tibble creation, circumventing the need to assign the results to an object first:

tibble(
  numbers = c(1, 2, 3),
  roots = sqrt(numbers),
  rounded = round(roots)
)
## # A tibble: 3 × 3
##   numbers roots rounded
##     <dbl> <dbl>   <dbl>
## 1       1  1          1
## 2       2  1.41       1
## 3       3  1.73       2

2.2.1.1 Subsetting tibbles

When subsetting two dimensional objects like data frames and tibbles, we have to supply an index for the row(s) as well as for the column(s) we want to subset. Those are written in the form object[row_index, column_index]. Let us first assign a sample tibble to an object.

exmpl_tbl <- tibble(
  numbers = c(0, 1, 2),
  strings = c("zero", "one", "two"),
  logicals = c(FALSE, TRUE, TRUE)
)

We can now subset this tibble. Note that if we want to subset a complete row or column, we can leave the place before or after the , empty to indicate that we want to see all rows or columns.

exmpl_tbl[1, 2]   # first row, second column
## # A tibble: 1 × 1
##   strings
##   <chr>  
## 1 zero
exmpl_tbl[1, ]    # first row, all columns
## # A tibble: 1 × 3
##   numbers strings logicals
##     <dbl> <chr>   <lgl>   
## 1       0 zero    FALSE
exmpl_tbl[, 2]   # all rows, second column
## # A tibble: 3 × 1
##   strings
##   <chr>  
## 1 zero   
## 2 one    
## 3 two

You may note that in each of those cases, R returns a tibble to us. Even when there is only one value, as in the first example, we get a 1x1 tibble. This is because subsetting tibbles with [] always returns a tibble. If we are interested in extracting an actual value from a cell in a tibble, we have to use [[]] subsetting instead.

exmpl_tbl[[1, 2]]   # first row, second column
## [1] "zero"
exmpl_tbl[[3, 1]]   # third row, first column
## [1] 2

If our goal is to extract a column as a vector, we have several options. We can write object[[column_index]]. In this case, R knows that we are only interested in columns as we have only provided one index to a two dimensional object. We could also use the column’s name instead of the index, enclosed in "". Another popular option is to use the $-notation. Here we also use the column’s name but write it like this: object$column_name. Which option to use is a matter of taste, as the results are identical:

exmpl_tbl[[2]]
## [1] "zero" "one"  "two"
exmpl_tbl[["strings"]]
## [1] "zero" "one"  "two"
exmpl_tbl$strings
## [1] "zero" "one"  "two"
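The fact that [] always returns a tibble is one of the minor differences to classic data frames mentioned above. The following sketch puts the two side by side; the objects cmp_df and cmp_tbl are made up for this comparison, and stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
library(tibble)

# The same made-up data, once as a base R data frame and once as a tibble
cmp_df <- data.frame(
  numbers = c(0, 1, 2),
  strings = c("zero", "one", "two"),
  stringsAsFactors = FALSE
)
cmp_tbl <- as_tibble(cmp_df)

class(cmp_df[, 2])     # "character": the single column is simplified to a vector
class(cmp_tbl[, 2])    # includes "tbl_df": the result is still a tibble

# With [[ ]] or $ both objects return the same vector
identical(cmp_df[[2]], cmp_tbl[[2]])
identical(cmp_df$strings, cmp_tbl$strings)
```

Using [[ ]] or $ whenever a vector is needed therefore gives code that behaves the same for tibbles and data frames.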

2.3 Additional R resources

When learning R and when using functions and packages that are new to you, you will regularly run into situations where you need help in understanding what is happening and what you can do. Luckily, there are a lot of resources that will help you on your R journey.

You have already learned about the built-in help functionalities of R. Many packages also come with so-called vignettes which offer more in-depth introductions to the packages. Let’s see if the tibble package comes with vignettes. To do this we can write:

vignette(package = "tibble")

We get a list of all vignettes available for the specific package. To access a specific vignette, we also use the vignette() function, this time with the specific name of the vignette as the function’s argument:

vignette("types")

You can also always check the CRAN page for the package in question. Here you can access the documentation as well as available vignettes, e.g.: https://cran.r-project.org/web/packages/tibble/index.html.

Another highly recommended resource is the collection of RStudio cheatsheets found at: https://www.rstudio.com/resources/cheatsheets/. These are available for many popular packages and present a comprehensive list of the functions offered by the packages.

The RStudio homepage also offers many more resources for learning R and specific packages, including a number of webinars and tutorial videos available under the menu “Resources”: https://www.rstudio.com/

In general, the internet offers a lot of resources that you can access. One of the most important skills you have to develop as an aspiring R user is to understand the problem you are facing to the best of your abilities and formulate a short but precise Google search. In most cases you can assume that you are not the first or last person to have a specific problem. Someone will have written a blogpost, asked a question on https://stackoverflow.com/, made a video tutorial, and so on. If you can find these resources, you are already halfway there.

There are also a lot of books available on R and RStudio in general, as well as on more specific applications in R. I want to recommend three of them in particular, all available as paperback or online:

  • Intro to R for Social Scientists by Jasper Tjaden. An accessible introduction to R that expands on the concepts only touched here. Written for a seminar at the University of Potsdam in summer 2021. Available under: https://jaspertjaden.github.io/course-intro2r/

  • R Cookbook, 2nd Edition by J.D. Long & Paul Teetor. The book consists of recipes for specific tasks you might want to perform. It is not designed as a course but rather as a reference for concrete questions. Available under: https://rc2e.com/

  • R for Data Science by Hadley Wickham and Garrett Grolemund. An introduction to data science using (almost) exclusively the tidyverse packages. Available under: https://r4ds.had.co.nz/

2.4 Intro to R Markdown

The following is an excerpt from another seminar that I have written with Prof. Jasper Tjaden and Niaz Morshed, Data Analysis with R for Social Scientists, available under: https://jaspertjaden.github.io/DataAnalysisR/

It is a short introduction to using R Markdown for documenting your work and writing reports and papers with it.

2.4.1 What is R Markdown?

R Markdown allows you to combine written text that is easy to format with R code that is executed when knitting or compiling the document into the chosen output format. This allows us to describe our research, analyse our data, display results as tables or plots and interpret these, all in one file. In this way we can not only create reports on seminar exercises but also write websites - like the one you are looking at in this very moment -, seminar papers, articles or create presentations.

It is also a great notebook for projects you are working on. More often than not, our work on a specific analysis will span multiple days, weeks or even months and it is often hard to remember what we were thinking the last time we worked on our code.

“I am sure I had my reasons for writing this piece of code, but I can not for the life of me remember any of them…”

— Anonymous Coder 2023

If we use R Markdown to document our work we can add text that explains our reasons, thoughts, ideas and plans at that very moment and pick up our work from there the next time we open the file.

R Markdown allows output to different file formats, including html, docx, pptx and pdf. Note that you need a LaTeX installation to knit to pdf. LaTeX is a typesetting language and used for producing high quality pdf documents. For simple pdf reports or presentations - sidenote: if you bring a pptx to a talk something will most probably go wrong or stop working… - you do not really need to know how LaTeX works, you just need an installed distribution. For these purposes the package tinytex gives you all you need and is easy to install from within R. This site explains how to install it. You can also get an overview of all possible output formats here.
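As a rough sketch, the installation consists of two steps: installing the R package tinytex and then letting its function install_tinytex() download the LaTeX distribution. The second step needs an internet connection and can take a few minutes. The interactive() guard is an addition of this example, not something tinytex requires; it makes sure nothing is installed if the lines end up in a script that is run non-interactively.

```r
# Only run when typed in an interactive session, never as part of a shared script
if (interactive()) {
  install.packages("tinytex")    # the R package
  tinytex::install_tinytex()     # the LaTeX distribution itself
}
```

As with any other installation, this only has to be done once per computer.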

2.4.2 Creating an R Markdown file

Before you can create and compile R Markdown documents, you first have to install the package by writing install.packages("rmarkdown") in your console.

Creating a new R Markdown file is as straightforward as it can be. In RStudio you can click on File > New File > R Markdown.... In the new window you can set up some basic information on the document - which will be displayed in the output - and choose your desired format. You can basically write R Markdown files in any text editor, just make sure that the file is saved with the extension .Rmd. We still recommend using RStudio because it gives you some convenient options that a text editor will not, e.g. displaying a preview of your document and easy knitting of the final file.

2.4.3 Writing in R Markdown

2.4.3.1 Document components

If you followed the steps above, a new R Markdown file will have been created. It basically consists of two main parts:

  • A YAML header - surrounded by three dashes --- - where options for the document can be set. The good news is that you do not have to do anything here until you get more proficient with using R Markdown. For now it is enough that all the options you set when creating the new file - the title, author, date and format of the output - are present and will be included in your output file.

  • A body that contains the actual content of your document. Text is directly written in the body at the location where it is to be displayed in the output. We can use the simple Markdown syntax for formatting using a set of symbols, some of which we will explain below. We can also include R code in so-called chunks, specifying if we also want it and/or the results to be displayed or “just” to be executed in the background. The code chunks will be executed when we compile the final document and everything that we want to include in the output - e.g. tables, plots or code examples - will be displayed where it occurs.
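To make the two parts more tangible, the following sketch writes a very small R Markdown file from within R and knits it. The file name, title and content are made up for this illustration, and the file is placed in a temporary folder so that it does not clutter your project. Knitting is only attempted if the rmarkdown package and pandoc, which ships with RStudio, are available.

```r
# Hypothetical file in a temporary folder
rmd_path <- file.path(tempdir(), "minimal_example.Rmd")

rmd_lines <- c(
  "---",                                   # start of the YAML header
  'title: "A minimal example"',
  "output: html_document",
  "---",                                   # end of the YAML header
  "",
  "# Section 1",                           # the body starts here
  "",
  "Some *text* describing the analysis.",
  "",
  "```{r}",                                # a code chunk
  "sqrt(c(1, 2, 3))",
  "```"
)

writeLines(rmd_lines, rmd_path)

# Knit the file to html; the output is written next to the .Rmd file
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

In day-to-day work you would of course write the file in the RStudio editor and use the Knit button; the sketch only shows that an .Rmd file is nothing more than plain text with a header and a body.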

2.4.3.2 Formatting

Here are some of the more common formatting elements you will need when starting out using R Markdown:

2.4.3.2.1 Headers

To include sections in a document we use # followed by the header we want to be displayed. We can define levels for sections by using multiple # in this way:

# Section 1
## Section 1.1
## Section 1.2
### Section 1.2.1
### Section 1.2.2
## Section 1.3
# Section 2

2.4.3.2.2 Text

We write the text between the section headers at the place where it should be displayed in the final document. We can insert line breaks at any point but these will not be rendered in the output. To start a new paragraph, we have to include a blank line between two blocks of text. To emphasize certain words or phrases, we can wrap them in * for italics or ** for bold face.

Consider this Markdown code:

This is the first paragraph.
This still is the first paragraph.

Here begins the second paragraph.
It includes emphasis, by using *italics* and also **bold face** words.

It is rendered as:

This is the first paragraph. This still is the first paragraph.

Here begins the second paragraph. It includes emphasis, by using italics and also bold face words.

2.4.3.2.3 Lists

Unordered lists or bullet points can be inserted by adding a -, * or + at the beginning of a line. To create levels, we have to indent lines using tab stops.

* Level 1
* Level 1
  * Level 2
    * Level 3
  * Level 2
    * Level 3
* Level 1

The above will be rendered as:

  • Level 1
  • Level 1
    • Level 2
      • Level 3
    • Level 2
      • Level 3
  • Level 1

We can also create ordered lists by using numbers followed by a . instead of the * etc.

1.  Bulletpoint 1
2.  Bulletpoint 2
3.  Bulletpoint 3
  1. Bulletpoint 1
  2. Bulletpoint 2
  3. Bulletpoint 3

2.4.3.3 Code chunks

Code chunks have to be started and ended with three backticks ```. After the first set of backticks we also have to include {r} to let Markdown know that we want to run the code as R code. The code that is written after this and up to the second set of backticks will be executed when knitting the file.

You can see some examples of this in the newly created R Markdown file if you followed the steps above.

We can also always run the code in a chunk before knitting by clicking on the green arrow in the upper right corner of the chunk. We can also execute individual lines of code by placing our keyboard cursor in the line and pressing Ctrl + Enter (Cmd + Enter on macOS).

2.4.3.3.1 Chunk options

We can change the way code chunks are handled when knitting by adding one or multiple chunk options between the curly brackets like this: {r option=value}. If we want to use multiple options they have to be written like this: {r option1=value1, option2=value2}.

There are many options available but most are not needed when starting out. The ones that may be of interest to you are:

  • {r echo=FALSE}: This prevents the code from being displayed in the output while the results will be included. This is useful if you want to show the results of a computation or a plot but do not want the document to be cluttered with the underlying code.
  • {r include=FALSE}: This prevents the code as well as the output from being displayed. The code is still run in the background.
  • {r eval=FALSE}: This prevents the code from being run but displays it. This can be useful if you want to show code examples for illustrative purposes.
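Options that should apply to every chunk do not have to be repeated in each chunk header. A common pattern, sketched here under the assumption that the knitr package is available (it is installed together with rmarkdown), is to set defaults once with knitr::opts_chunk$set() in a first chunk. Which options to set is a choice made for this example.

```r
# Typically placed in the first chunk of the document, which is itself
# given the option include=FALSE so that it leaves no trace in the output
knitr::opts_chunk$set(
  echo = FALSE,      # hide the code of all chunks by default
  message = FALSE    # hide messages, e.g. those printed by library(tidyverse)
)

# Query a default to check that it was set
knitr::opts_chunk$get("echo")
```

Options written in the header of an individual chunk still take precedence over these defaults, so a single chunk can show its code again with {r echo=TRUE}.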