2 Packages, the tidyverse and R Markdown
2.1 R packages
The R world is open and collaborative by nature. Besides the packages that come with your R installation – base R – an ever growing number of additional packages, written by professionals and users, is available for download by anyone. Every package is focussed on a specific use case and brings with it a number of functions that enable R to be used for tasks that the original software designers did not have in mind or at the very least provide a smoother user experience in cases where the original base R solutions are more complicated.
The packages, their documentation and various other related information are hosted on CRAN – the “Comprehensive R Archive Network” – which you already got to know during the installation of R. If you install a package directly from RStudio, it uses CRAN to find and download the package and the associated files.
2.1.1 Installing and loading packages
To install a package we can use the R function install.packages(), where the name of the package to be installed is written between the parentheses, enclosed in ". Normally we do this using the console. Installing packages from an R script works as well, but as we only need to perform the installation once, there is no benefit in it. It actually slows things down if we repeat the installation every time we run a script. Moreover, if we share our script, it is impolite to force a (re)installation on somebody else.
For this introduction we will focus on the packages of the tidyverse – more on them below. To install the core tidyverse package, you should type:
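The call described above would look like this. It is meant to be typed once into the console rather than kept in a script:

```r
# Run once in the console; downloads the tidyverse and its dependencies from CRAN
install.packages("tidyverse")
```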
R will output a lot of information concerning the installation process, and close with a satisfying DONE (tidyverse) if everything went according to plan.
Please note that R is case sensitive. This means that “Tidyverse” is not the same as “tidyverse”. To R, these are two different words. When installing packages, their names have to be written exactly as they are named or R will not find the package. The same principle applies to object names, strings, functions, arguments and working with R in general.
Now that the installation is complete, we can load the package. This should normally be done in the first lines of a script. This way all necessary packages are loaded at the beginning of running a script and other users that see your code also immediately see which packages are required. Loading a package is done with library(), with the name of the package in the parentheses, this time without the need for enclosing it in ".
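As a minimal sketch, assuming the tidyverse was installed as described above, the first line of a script would then be:

```r
# Load the core tidyverse packages at the top of the script
library(tidyverse)
```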
Loading the tidyverse package returns a lot of information to us, some of which we will look at in more detail during the course of this chapter. Please note that not all packages are that verbose in their loading process. Often you will get no output at all which is a good sign, as this also means that the package loaded correctly. If anything goes wrong, R will return an error message.
2.1.2 Namespaces
Looking at the last lines of the returned message when loading the tidyverse package, we’re informed that there are two conflicts. These arise when two or more loaded packages include functions with the same name. Here we can see that the tidyverse package dplyr masks the functions filter() and lag() from the base R package stats. If we had used filter() without loading dplyr, the function from the stats package would have been used. After loading it, the function from dplyr masks the function from stats and is used instead.
If we had a case where we want to load dplyr, but still use filter() from stats, we can still do this by explicitly declaring the namespace which we are referring to. The namespace basically is a reference for R where to look up the function we have called. If we just write the function’s name, R looks for it in the list of loaded packages, which would result in applying filter() from dplyr here. But we can tell R to look up the function in another namespace, by using the notation namespace::function. So to call filter() from stats while the function is masked by the identically named function from dplyr, we could write stats::filter(). As the function will not work without further arguments, we can’t try this out directly, but the same principle applies to loading the help files:
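A small sketch of the namespace notation follows. The help call mirrors what the text describes. The moving-average call is an added illustration, and its input values are chosen arbitrarily:

```r
# Open the help page for filter() from the stats package,
# even if dplyr is loaded and masks the function
?stats::filter

# The same notation works for calling the function itself.
# Illustrative assumption: a centred three-point moving average over 1 to 5.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

Writing ?dplyr::filter in the same way would open the help page of the dplyr version instead.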
2.2 Tidyverse
While we will use some base R functions throughout this course, our main focus will lie on the tidyverse packages.
The tidyverse is a collection of R packages, all following a shared philosophy concerning the syntax of their functions and the way in which data is represented. We will see how the philosophy underlying the tidyverse can lead to more intuitive R code, especially when using the pipe (%>%), in the next chapter. If you want to learn more about the concept of tidy data, the structure of data representation underlying the tidyverse, a read of the chapter on this concept from “R for Data Science” by Wickham & Grolemund is highly recommended: https://r4ds.had.co.nz/tidy-data.html.
At the time of writing, the core tidyverse consisted of eight packages; since tidyverse 2.0.0, lubridate has been added as a ninth. These are the packages that are loaded when we type library(tidyverse) and that are listed in the corresponding output under “Attaching packages”. As the name suggests, the packages comprise the core functionalities that define the tidyverse. This includes reading, cleaning and transforming data, handling certain data types, plotting graphs and more. Over the course of this introduction to web scraping, we will make use of several of these packages, so in most chapters we will begin our scripts by loading the tidyverse package.
Besides the core tidyverse, a number of additional and more specialised packages are part of the tidyverse and were already installed when you ran install.packages("tidyverse") above. Among them, the package rvest is of special importance to us, as it will be our main tool for web scraping throughout the course.
For a full list of tidyverse packages and the corresponding descriptions of their functionality, you can visit: https://www.tidyverse.org/packages/
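The list can also be inspected from within R. As a short sketch, assuming the tidyverse is installed, the helper tidyverse_packages() from the tidyverse package returns the names of all packages belonging to it:

```r
library(tidyverse)

# Names of all packages that are part of the tidyverse (core and non-core)
tidyverse_pkgs <- tidyverse_packages()
tidyverse_pkgs

# Check whether a specific package, e.g. rvest, is among them
"rvest" %in% tidyverse_pkgs
```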
2.2.1 Tibbles
The tibble package is part of the core tidyverse and offers an alternative to data frames that are used in base R to represent data in tabular form. The differences between data frames and tibbles are relatively minor. If you are interested in the details, you can read up on them in this section from “R for Data Science” and the chapter on tibbles in general: https://r4ds.had.co.nz/tibbles.html#tibbles-vs.-data.frame. For now, it will suffice to know that tibbles are used throughout this introduction, but that all examples will also work with the classic data frames.
The syntax to create a tibble is simple. Every column represents a variable, every row an observation. You should think of the columns as vectors, where the first position in each vector corresponds to the first observation (row), the second position in each vector to the second observation, and so on. In this way, we can create tibbles vector by vector or variable by variable, using the function tibble(). We assign a name to the variable followed by = and the data to be assigned to the variable. The variable-data pairs are separated by a comma:
tibble(numbers = c(0, 1, 2), strings = c("zero", "one", "two"), logicals = c(FALSE, TRUE, TRUE))
## # A tibble: 3 × 3
##   numbers strings logicals
##     <dbl> <chr>   <lgl>
## 1       0 zero    FALSE
## 2       1 one     TRUE
## 3       2 two     TRUE
For longer code like this, it is advisable to use multiple lines and clearer formatting to create code that is readable and intuitive:
tibble(
  numbers = c(0, 1, 2),
  strings = c("zero", "one", "two"),
  logicals = c(FALSE, TRUE, TRUE)
)
## # A tibble: 3 × 3
##   numbers strings logicals
##     <dbl> <chr>   <lgl>
## 1       0 zero    FALSE
## 2       1 one     TRUE
## 3       2 two     TRUE
R understands that all five lines are part of one command as it evaluates everything between the opening and closing bracket of the tibble() function together. We just have to make sure that we don’t miss the closing bracket or a comma that separates the variable-data pairs. This actually is a main source of errors and will be high on your list of things to check if something does not work as planned.
We can also use calculations and functions directly in tibble creation, circumventing the need to assign the results to an object first:
tibble(
  numbers = c(1, 2, 3),
  roots = sqrt(numbers),
  rounded = round(roots)
)
## # A tibble: 3 × 3
##   numbers roots rounded
##     <dbl> <dbl>   <dbl>
## 1       1  1          1
## 2       2  1.41       1
## 3       3  1.73       2
2.2.1.1 Subsetting tibbles
When subsetting two-dimensional objects like data frames and tibbles, we have to supply an index for the row(s) as well as for the column(s) we want to subset. Those are written in the form object[row_index, column_index]. Let us first assign a sample tibble to an object.
exmpl_tbl <- tibble(
  numbers = c(0, 1, 2),
  strings = c("zero", "one", "two"),
  logicals = c(FALSE, TRUE, TRUE)
)
We can now subset this tibble. Note that if we want to subset a complete row or column, we can leave the place before or after the comma empty to indicate that we want to see all rows or columns.
exmpl_tbl[1, 2] # first row, second column
## # A tibble: 1 × 1
##   strings
##   <chr>
## 1 zero
exmpl_tbl[1, ] # first row, all columns
## # A tibble: 1 × 3
##   numbers strings logicals
##     <dbl> <chr>   <lgl>
## 1       0 zero    FALSE
exmpl_tbl[, 2] # all rows, second column
## # A tibble: 3 × 1
##   strings
##   <chr>
## 1 zero
## 2 one
## 3 two
You may note that in each of those cases, R returns a tibble to us. Even when there is only one value, as in the first example, we get a 1x1 tibble. This is because subsetting tibbles with [] always returns a tibble. If we are interested in extracting an actual value from a cell in a tibble, we have to use [[]] subsetting instead.
exmpl_tbl[[1, 2]] # first row, second column
## [1] "zero"
exmpl_tbl[[3, 1]] # third row, first column
## [1] 2
If our goal is to extract a column as a vector, we have several options. We can write object[[column_index]]. In this case, R knows that we are only interested in columns as we have only provided one index to a two-dimensional object. We could also use the column’s name instead of the index, enclosed in "". Another popular option is to use the $-notation. Here we also use the column’s name but write it like this: object$column_name. Which option to use is a matter of taste, as the results are identical:
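The three options described above can be sketched as follows, reusing the sample tibble from this section (it is recreated here so that the example is self-contained):

```r
library(tibble)

exmpl_tbl <- tibble(
  numbers = c(0, 1, 2),
  strings = c("zero", "one", "two"),
  logicals = c(FALSE, TRUE, TRUE)
)

exmpl_tbl[[2]]          # by column index
exmpl_tbl[["strings"]]  # by column name
exmpl_tbl$strings       # $-notation

# All three return the same character vector
identical(exmpl_tbl[[2]], exmpl_tbl$strings)
```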
2.3 Additional R resources
When learning R and when using functions and packages that are new to you, you will regularly run into situations where you need help in understanding what is happening and what you can do. Luckily, there are a lot of resources that will help you on your R journey.
You have already learned about the built-in help functionalities of R. Many packages also come with so called vignettes which offer more in-depth introductions to the packages. Let’s see if the tibble package comes with vignettes. To do this we can write:
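A sketch of the call, assuming the tibble package is installed. The result is stored in an object here only so that it can be printed explicitly:

```r
# List all vignettes that ship with the tibble package
tibble_vignettes <- vignette(package = "tibble")
tibble_vignettes
```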
We get a list of all vignettes available for the specific package. To access a specific vignette, we also use the vignette() function, this time with the specific name of the vignette as the function’s argument:
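As a sketch, assuming the introductory vignette of the tibble package is named "tibble" (the listing from vignette(package = "tibble") shows the names actually available on your system):

```r
# Look up the vignette called "tibble" in the tibble package
tibble_intro <- vignette("tibble", package = "tibble")

# Printing the object, or calling vignette("tibble") directly in the console,
# opens the vignette in the viewer or browser
class(tibble_intro)
```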
You can also always check the CRAN page for the package in question. Here you can access the documentation as well as available vignettes, e.g.: https://cran.r-project.org/web/packages/tibble/index.html.
Another highly recommended resource is the collection of RStudio cheatsheets found at: https://www.rstudio.com/resources/cheatsheets/. These are available for many popular packages and present a comprehensive list of the functions offered by the packages.
The RStudio homepage also offers many more resources for learning R and specific packages, including a number of webinars and tutorial videos available under the menu “Resources”: https://www.rstudio.com/
In general, the internet offers a lot of resources that you can access. One of the most important skills you have to develop as an aspiring R user is to understand the problem you are facing to the best of your abilities and formulate a short but precise Google search. In most cases you can assume that you are not the first or last person to have a specific problem. Someone will have written a blog post, asked a question on https://stackoverflow.com/, made a video tutorial, and so on. If you can find these resources, you are already halfway there.
There are also a lot of books available on R and RStudio in general, as well as on more specific applications in R. I want to recommend three of them in particular, all available as paperback or online:
Intro to R for Social Scientists by Jasper Tjaden. An accessible introduction to R that expands on the concepts only touched here. Written for a seminar at the University of Potsdam in summer 2021. Available under: https://jaspertjaden.github.io/course-intro2r/
R Cookbook, 2nd Edition by J.D. Long & Paul Teetor. The book is comprised of recipes for specific tasks you might want to perform. It is not designed as a course but rather as reference for concrete questions. Available under: https://rc2e.com/
R for Data Science by Hadley Wickham and Garrett Grolemund. An introduction to data science using (almost) exclusively the tidyverse packages. Available under: https://r4ds.had.co.nz/
2.4 Intro to R Markdown
The following is an excerpt from another seminar that I have written with Prof. Jasper Tjaden and Niaz Morshed, Data Analysis with R for Social Scientists, available under: https://jaspertjaden.github.io/DataAnalysisR/
It is a short introduction to using R Markdown for documenting your work and writing reports and papers with it.
2.4.1 What is R Markdown?
R Markdown allows you to combine written text that is easy to format with R code that is executed when knitting or compiling the document into the chosen output format. This allows us to describe our research, analyse our data, display results as tables or plots and interpret these, all in one file. In this way we can not only create reports on seminar exercises but also write websites - like the one you are looking at in this very moment -, seminar papers, articles or create presentations.
It is also a great notebook for projects you are working on. More often than not, our work on a specific analysis will span multiple days, weeks or even months and it is often hard to remember what we were thinking the last time we worked on our code.
“I am sure I had my reasons for writing this piece of code, but I can not for the life of me remember any of them…”
— Anonymous Coder 2023
If we use R Markdown to document our work we can add text that explains our reasons, thoughts, ideas and plans at that very moment and pick up our work from there the next time we open the file.
R Markdown allows output to different file formats, including html, docx, pptx and pdf. Note that you need a LaTeX installation to knit to pdf. LaTeX is a typesetting language used for producing high quality pdf documents. For simple pdf reports or presentations - sidenote: if you bring a pptx to a talk something will most probably go wrong or stop working… - you do not really need to know how LaTeX works, you just need an installed distribution. For these purposes the package tinytex gives you all you need and is easy to install from within R. This site explains how to install it. You can also get an overview of all possible output formats here.
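A minimal sketch of the installation from within R, to be run once in the console. It uses the tinytex functions install_tinytex() and is_tinytex(); the site mentioned above remains the authoritative guide:

```r
# Install the R package that manages the LaTeX distribution
install.packages("tinytex")

# Download and install the lightweight TinyTeX LaTeX distribution itself
tinytex::install_tinytex()

# Afterwards, this returns TRUE if TinyTeX is the LaTeX distribution in use
tinytex::is_tinytex()
```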
2.4.2 Creating an R Markdown file
Before you can create and compile R Markdown documents, you first have to install the package by writing install.packages("rmarkdown") in your console.
Creating a new R Markdown file is as straightforward as it can be. In RStudio you can click on File > New File > R Markdown... In the new window you can set up some basic information on the document - which will be displayed in the output - and choose your desired format. You can basically write R Markdown files in any text editor, just make sure that the file is saved with the extension .Rmd. We still recommend using RStudio because it gives you some convenient options that a text editor will not, e.g. displaying a preview of your document and easy knitting of the final file.
2.4.3 Writing in R Markdown
2.4.3.1 Document components
If you followed the steps above, a new R Markdown file will have been created. It basically consists of two main parts:
- A YAML header, surrounded by three dashes (---), where options for the document can be set. The good news is that you do not have to do anything here until you become more proficient with R Markdown. For now it is enough that all the options you set when creating the new file - the title, author, date and format of the output - are present and will be included in your output file.
- A body that contains the actual content of your document. Text is written directly in the body at the location where it is to be displayed in the output. We can use the simple Markdown syntax for formatting using a set of symbols, some of which we will explain below. We can also include R code in so-called chunks, specifying if we also want it and/or the results to be displayed or “just” to be executed in the background. The code chunks will be executed when we compile the final document and everything that we want to include in the output - e.g. tables, plots or code examples - will be displayed where it occurs.
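To make the two parts concrete, here is a minimal sketch that writes a small R Markdown file from R. The title, the file name and the text are hypothetical placeholders. The chunk delimiter is built with strrep() only so that this example does not itself contain a run of three backticks:

```r
# Three backticks, built programmatically (the delimiter of a code chunk)
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                        # start of the YAML header
  'title: "My first report"',   # hypothetical title
  "output: html_document",
  "---",                        # end of the YAML header
  "",
  "# Introduction",             # the body starts here
  "",
  "Some text describing the analysis.",
  "",
  paste0(fence, "{r}"),         # start of an R code chunk
  "summary(c(1, 2, 3))",
  fence                         # end of the chunk
)

# Write the file to a temporary directory (hypothetical file name)
rmd_path <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Knitting requires the rmarkdown package and Pandoc (bundled with RStudio):
# rmarkdown::render(rmd_path)
```

Opening the resulting file in RStudio and clicking the Knit button has the same effect as the commented-out render() call.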
2.4.3.2 Formatting
Here are some of the more common formatting elements you will need when starting out using R Markdown:
2.4.3.2.1 Headers
To include sections in a document we use # followed by the header we want to be displayed. We can define levels for sections by using multiple # in this way:
# Section 1
## Section 1.1
## Section 1.2
### Section 1.2.1
### Section 1.2.2
## Section 1.3
# Section 2
2.4.3.2.2 Text
We write the text between the section headers at the place where it should be displayed in the final document. We can insert line breaks at any point but these will not be rendered in the output. To start an actual new paragraph we have to include a blank line between two blocks of text. To emphasize certain words or phrases, we can wrap them in * for italics or ** for bold face.
Consider this Markdown code:
This is the first paragraph.
This still is the first paragraph.

Here begins the second paragraph.
It includes emphasis, by using *italics* and also **bold face** words.
It is rendered as:
This is the first paragraph. This still is the first paragraph.
Here begins the second paragraph. It includes emphasis, by using italics and also bold face words.
2.4.3.2.3 Lists
Unordered lists or bullet points can be inserted by adding a -, * or + at the beginning of a line. To create levels, we have to indent lines using tab stops.
* Level 1
* Level 1
    * Level 2
        * Level 3
    * Level 2
        * Level 3
* Level 1
The above will be rendered as:
- Level 1
- Level 1
    - Level 2
        - Level 3
    - Level 2
        - Level 3
- Level 1
We can also create ordered lists by using numbers followed by a . instead of the * etc.
1. Bulletpoint 1
2. Bulletpoint 2
3. Bulletpoint 3
The above will be rendered as:
1. Bulletpoint 1
2. Bulletpoint 2
3. Bulletpoint 3
2.4.3.2.4 Hyperlinks
Hyperlinks can be included as <url> or [text](url).
To include a plain url we can use <https://jaspertjaden.github.io/DataAnalysisR/>.
We can also [link](https://jaspertjaden.github.io/DataAnalysisR/) in this way.
To include a plain url we can use https://jaspertjaden.github.io/DataAnalysisR/. We can also link in this way.
2.4.3.3 Code chunks
Code chunks have to be started and ended with three backticks (```). After the first set of backticks we also have to include {r} to let Markdown know that we want to run the code as R code. The code that is written after this and up to the second set of backticks will be executed when knitting the file.
You can see some examples of this in the newly created R Markdown file if you followed the steps above.
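What knitting does with a chunk can also be tried out directly from the console. The following sketch assumes the knitr package is installed (it comes with rmarkdown) and uses the text argument of knit() to process a single chunk without creating a file. The chunk content 1 + 1 is just an illustration:

```r
library(knitr)

# Three backticks, built programmatically to delimit the chunk
fence <- strrep("`", 3)

# A chunk consisting of the opening delimiter with {r}, one line of R code
# and the closing delimiter
chunk <- c(paste0(fence, "{r}"), "1 + 1", fence)

# Knit the chunk; the result is Markdown containing the code and its output
knitted <- knit(text = chunk, quiet = TRUE)
cat(knitted, sep = "\n")
```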
We can also always run the code in a chunk before knitting by clicking on the green arrow in the upper right corner of the chunk. We can also execute individual lines of code by placing our keyboard cursor in the line and pressing Ctrl + Enter (Cmd + Return on macOS).
2.4.3.3.1 Chunk options
We can change the way code chunks are handled when knitting by adding one or multiple chunk options between the curly brackets like this: {r option=value}. If we want to use multiple options they have to be written like this: {r option1=value1, option2=value2}.
There are many options available but most are not needed when starting out. The ones that may be of interest to you are:
- {r echo=FALSE}: This prevents the code from being displayed in the output while the results will be included. This is useful if you want to show the results of a computation or a plot but do not want the document to be cluttered with the underlying code.
- {r include=FALSE}: This prevents the code as well as the output from being displayed. The code is still run in the background.
- {r eval=FALSE}: This prevents the code from being run but displays it. This can be useful if you want to show code examples for illustrative purposes.
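The effect of the three options can be checked with a small experiment. This is a sketch, assuming the knitr package is installed. The helper knit_chunk() is a hypothetical function written for this illustration and not part of knitr. It knits a chunk containing 1 + 1 with the given option and returns the resulting Markdown as one string:

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built programmatically

# Hypothetical helper: knit a one-line chunk with the given chunk option
knit_chunk <- function(option) {
  header <- paste0(fence, "{r ", option, "}")
  out <- knit(text = c(header, "1 + 1", fence), quiet = TRUE)
  paste(out, collapse = "\n")
}

echo_false    <- knit_chunk("echo=FALSE")
include_false <- knit_chunk("include=FALSE")
eval_false    <- knit_chunk("eval=FALSE")

# Is the code ("1 + 1") or the result ("[1] 2") part of the knitted output?
data.frame(
  option       = c("echo=FALSE", "include=FALSE", "eval=FALSE"),
  code_shown   = grepl("1 + 1", c(echo_false, include_false, eval_false), fixed = TRUE),
  result_shown = grepl("[1] 2", c(echo_false, include_false, eval_false), fixed = TRUE)
)
```

The resulting table should mirror the descriptions above: echo=FALSE keeps only the result, include=FALSE keeps neither, and eval=FALSE keeps only the code.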