1 R & RStudio

1.1 Installing R

R is a freely available programming language used predominantly for data science and statistical computations. For more information on the language and to access the documentation, visit: https://www.r-project.org/. From there you can also follow the link to CRAN, the Comprehensive R Archive Network, or access it directly by visiting https://cran.r-project.org/.

The latest versions of R will always be hosted at CRAN. At the top of the landing page, you will find links to the installers for each operating system. If you are using Windows, please choose “base” after following the link and then download the offered file. If you are using macOS, download the first file listed under “Latest release”. In both cases, execute the file and install R to a directory of your choice. If you are using Linux, the link on CRAN will offer installation advice for some of the more popular distributions. In any case, you can check the package manager of your choice for the latest available release for your system.

All examples on this website were written and tested with R version 4.1.3 “One Push-Up”. Although this is not expected, they might nonetheless return errors in newer or older versions of R.
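
Once R is installed, you can check which version you are running. The following is a minimal sketch using R.version.string and getRversion(), both part of base R; the comparison against 4.1.3 is only an illustration, and the output depends on your installation, so none is shown here.

```r
# print the version string of the running R installation
R.version.string

# check whether the installed version is at least the one used on this website
getRversion() >= "4.1.3"
```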

1.2 RStudio

The basic R installation provides a terminal that could in principle be used to follow the contents of this introduction to web scraping. The far more common approach is to use an external IDE (Integrated Development Environment), the most popular being RStudio. Using an IDE will dramatically improve your workflow, and I strongly recommend using RStudio for this purpose.

RStudio is also freely available and can be found at https://www.rstudio.com/. Following the “Download” link and scrolling down (ignoring the different versions offered for professional usage), you will find the latest installers for several operating systems offered as downloads. In most cases, simply installing RStudio after R has been installed will work “out of the box”.

1.2.1 Overview

The RStudio interface consists of four sub-areas. The bottom-left shows the “Console” as well as additional tabs which you will rarely need in the beginning. The console can be used to evaluate R code live. We will begin working with the console soon, so this will make more sense to you in a bit. The top-left shows opened files, e.g. R scripts. This is where you will actually spend most of your time. This introduction proceeds from using one-time commands in the console to writing your code in scripts that can be re-opened, re-run and shared. The top-right has several tabs of which “Environment” should be our main concern at this point. Here you will see all data objects created in your RStudio session. More on this later. Finally, the bottom-right shows us, amongst other things, the “Files” in a selected folder, graphical output under “Plots” and requested “Help” on packages and functions.

1.3 Hello World!

So, let’s begin with putting R & RStudio to use. For now, we will write our commands directly into the console. You will notice a > sign at the beginning of the last line in the console. This is a prompt, as in “write commands here”. Try writing this command and executing it with the “Enter” key:

print("Hello World!")
## [1] "Hello World!"

You just entered your first R command, received your first output and also used your first function. We will address functions in more detail later. For now, it is enough to know that the command print() prints everything that is enclosed in its parentheses to the output. The output begins with [1], indicating that this is the first, and in this case the only, element of the output generated by the executed command. Please note that RStudio will not print ## before the output. In the code segments shown on this website, ## is inserted before the output to allow copying the code directly to RStudio: one or multiple # indicate that a line is a comment, which R does not evaluate as a command.

1.3.1 Calculating with R

R understands the basic arithmetic symbols + - * / and thus the console can be used as a calculator. Many functions for more involved calculations are available, e.g. sqrt() for taking the square root of the content enclosed in the parentheses. x^y can be used to write \(x^y\). For now, you should write the code below line by line into the R console and execute each line with the “Enter” key.

17 + 25
## [1] 42
99 - 57
## [1] 42
4 * 10.5
## [1] 42
84 / 2
## [1] 42
sqrt(1764)
## [1] 42
6.480741 ^ 2
## [1] 42

1.3.2 Comparison operators

We can use comparison operators to compare two values and receive the test result as output. To test if two values are equal, we write ==. To test if they are not equal, we can use !=.

42 == 42
## [1] TRUE
42 != 42
## [1] FALSE

We can also test whether the first value is less than (<), less than or equal to (<=), greater than (>) or greater than or equal to (>=) the second value.

10 < 42
## [1] TRUE
42 <= 42
## [1] TRUE
10 > 42
## [1] FALSE
90 >= 42
## [1] TRUE

1.3.3 Logical operators

We can also combine tests by applying logical operators to create more complex conditions. & (AND) checks if both tests combined by it are TRUE and only returns TRUE if both are TRUE at the same time. | (OR) checks if at least one of the tests is TRUE and returns TRUE if one or both are. When combining two tests, we thus have these possibilities:

TRUE & TRUE returns TRUE
TRUE & FALSE, FALSE & TRUE and FALSE & FALSE all return FALSE
TRUE | TRUE, TRUE | FALSE and FALSE | TRUE all return TRUE
FALSE | FALSE returns FALSE

1 < 2 & 3 != 4  # TRUE & TRUE
## [1] TRUE
1 > 2 & 3 != 4  # FALSE & TRUE
## [1] FALSE
1 > 2 | 3 != 4  # FALSE | TRUE
## [1] TRUE
1 > 2 | 3 == 4  # FALSE | FALSE
## [1] FALSE

1.4 Objects

Some of the power of using a language like R for computation comes from the ability to store data or results for later use and further analysis. In R, all types of data are stored in objects. On a basic level, an object is a name that we define that has some form of data assigned to it. To assign data to a name, we use the assignment operator <-.

the_answer <- 42

A handy keyboard shortcut for writing the assignment operator in RStudio is pressing the “Alt” and “-” keys simultaneously. Learning this shortcut early will save you a lot of typing and keyboard gymnastics.

After we have assigned a value to an object, we can recall that value by writing the object’s name.

the_answer
## [1] 42

We can also use defined objects in calculations and function calls (more on those later). Note that if we assign a value to an already defined object, the stored value is overwritten by the new one.

the_answer <- the_answer / 2
the_answer
## [1] 21

a <- 17
b <- 4
the_answer <- (a + b) * 2
the_answer
## [1] 42

All objects we define are listed in the “Environment” tab, seen in the upper right of RStudio. If we ever want to remove objects from the environment, we can use the rm() function. In general, this is not necessary, but it can help with keeping the list from getting cluttered.

rm(the_answer)
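
As a small sketch of how this looks in practice, ls() lists the names of all objects currently defined and exists() checks for a single name. Both are base R functions that have not been introduced above, and the object name tmp_value is only an illustration. The output of ls() depends on what you have defined so far, so it is not shown.

```r
# define an object and confirm that it is known to R
tmp_value <- 42
exists("tmp_value")
## [1] TRUE

# list the names of all objects in the environment
ls()

# remove the object again and confirm that it is gone
rm(tmp_value)
exists("tmp_value")
## [1] FALSE
```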

1.5 Vectors

When we assigned a number to an object, we actually created a vector. A vector is a one-dimensional data structure that can contain multiple elements. The number of elements determines the length of the vector. So a vector with only one element is still a vector, but with a length of 1. To assign multiple elements to a vector, we use the combine function c(). All values inside the parentheses, separated by ,, are combined as elements to form the vector.

v <- c(7, 8, 9)

v
## [1] 7 8 9
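
The point about length can be checked directly. This is a minimal sketch using the base R function length(), which has not been introduced above:

```r
v <- c(7, 8, 9)

# number of elements in the vector
length(v)
## [1] 3

# a single number is still a vector, with a length of 1
length(42)
## [1] 1
```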

1.5.1 Subsetting

If we want to access certain elements of a vector, we have to use subsetting. This is achieved by adding square brackets to the object’s name, containing the position of the element in its vector. In order to access the first or third element, we can write:

v[1]
## [1] 7
v[3]
## [1] 9

We can also access multiple elements at once, using c() inside the brackets or by defining a range of positions using :.

v[c(1, 3)]
## [1] 7 9
v[2:3]
## [1] 8 9

1.5.2 Types of vectors

Observing the vector v we created in the environment, we notice that RStudio writes num [1:3] before listing the values of the elements. The second part indicates the length of 3, while the first part shows the type of the vector we created. In this case the type is numeric. Numeric vectors, as you might have guessed, contain numbers. We can also use str() to receive info on type, length and content of a vector.

str(v)
##  num [1:3] 7 8 9

There are a number of other types of vectors, the two most important – besides numeric vectors – being logical and character vectors.

Logical vectors can only contain the values TRUE and FALSE. Strictly speaking, like the other types of vectors, they can also contain NA, indicating a missing value. We will talk more about NAs later on. Logical vectors are often created when we test for something. For example, we can test whether the elements in a numeric vector are greater than or equal to 5 and receive a logical vector containing the test results.

x <- c(1, 7, 3, 5)
x >= 5
## [1] FALSE  TRUE FALSE  TRUE

Character vectors contain strings of characters. When assigning strings, they have to be enclosed in quotation marks.

char_v <- c("This", "is", "a", "character", "vector!")
char_v
## [1] "This"      "is"        "a"         "character" "vector!"

char_v_2 <- c("This is also", "a character vector!")
char_v_2
## [1] "This is also"        "a character vector!"

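For character vectors, the comparisons we will need are tests for (non-)equality. R also accepts < and > for strings, but these compare alphabetical sort order rather than any numeric size, so they are rarely what we want here.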

"same" == "same"
## [1] TRUE
"same" == "not the same"
## [1] FALSE
"same" != "not the same"
## [1] TRUE

Character vectors also cannot be used to calculate. This can get problematic if numbers are stored as characters, which arises frequently when web scraping.

a <- c(1, 2, 3)
b <- c("7", "8", "9")

str(a)
##  num [1:3] 1 2 3
str(b)
##  chr [1:3] "7" "8" "9"

a + b
## Error in a + b: non-numeric argument to binary operator

As we enclosed the elements of vector b in quotation marks, R interprets the data as characters instead of numbers. Since characters cannot be used for calculations, we received an error message. But we can make R interpret the characters as numbers by using as.numeric().

a + as.numeric(b)
## [1]  8 10 12
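
The conversion only works for strings that actually look like numbers. As a hedged sketch with made-up values (the vector scraped is hypothetical), a string that R cannot read as a number becomes NA, and R shows a warning that NAs were introduced by coercion. Here the warning is hidden with the base R function suppressWarnings():

```r
# made-up values, as they might arrive from a web page
scraped <- c("7", "8.5", "n/a")

# "n/a" cannot be read as a number and is turned into NA;
# suppressWarnings() hides the coercion warning R would otherwise print
as_numbers <- suppressWarnings(as.numeric(scraped))
as_numbers
## [1] 7.0 8.5  NA
```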

1.5.3 A brief look at lists

Note that a vector of a certain type can only contain elements of that type. So we cannot mix data types in the same vector. If we want to mix data types, we can use lists instead of vectors.

l <- list(1, TRUE, "Hello World!")
str(l)
## List of 3
##  $ : num 1
##  $ : logi TRUE
##  $ : chr "Hello World!"

Lists can also contain other lists to represent hierarchical data structures. We will see lists “in action” later on in this course.
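
A minimal sketch of such a nested list, with made-up content, could look like this. Running str() on it shows the inner list indented beneath the outer one.

```r
# the second element of the outer list is itself a list
nested <- list("outer element", list(1, TRUE))

# the outer list has two elements, regardless of what the inner list contains
length(nested)
## [1] 2

# inspect the hierarchical structure
str(nested)
```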

1.6 Functions

Functions provide an easy and concise way of performing more or less complex tasks using predefined bits of R code that are provided in “base R” – i.e. that come with the basic R installation – or in the various additional packages that are available for installation. We have already used a number of functions up to this point, e.g. print(). To “call” a function, we write its name, followed by parentheses. Inside the parentheses additional arguments are provided to R. In most cases, some data has to be entered as the first argument. For example, print() writes the text provided as argument to the output. More complex functions often allow for more than one argument. Sometimes these are required, but more often these additional arguments are optional and can be used to change some options from the default value to the one desired.

1.6.1 Help

But how do we know which arguments can or have to be provided to use a function and what their effects are? We can check the documentation on CRAN or use Google to find additional information. Another, often more convenient, way is to use the help functionality built into R. By writing ? in front of the function name in the console and executing the line by pressing “Enter”, the help file is opened in the lower right of the RStudio window. Let’s try this for the function rnorm().

?rnorm

The help file tells us several things. rnorm() is part of a family of functions that are related to the normal distribution, each providing a distinct functionality. The functionality of rnorm() is the generation of random numbers drawn from the normal distribution. We also learn that three arguments can be provided: n, the number of observations to be generated, as well as mean and sd, the mean and the standard deviation of the normal distribution to be drawn from. We also see that mean and sd come with the default values 0 and 1 respectively, indicated by the =, while n has no default value. So we have to provide a value for n, but not for mean and sd. Just writing rnorm() will result in an error.

rnorm()
## Error in rnorm(): argument "n" is missing, with no default

To provide an argument to a function, we write the name of the argument, followed by = and the value to be provided. Note that, since rnorm() draws random numbers, your output will differ from the output presented here.

rnorm(n = 10)
##  [1] -0.6947138 -0.6549466 -0.7869072  0.3663827 -0.9294131  0.8603928
##  [7]  0.2958026  0.5531536  0.5040406 -0.8553844

In the same vein, additional arguments that are allowed by the function can be defined, instead of using their default values.

rnorm(n = 10, mean = 10, sd = 0.5)
##  [1]  9.605022 10.635442  9.549972 10.480408 10.858938 10.854275 10.473092
##  [8]  9.712174 10.010402 10.113529

We can also skip writing the names of arguments in many cases. As the n argument is the first listed in the function’s parentheses, R also understands the call, if we just provide the value to be used as the first argument. You will often encounter the convention that the first argument is written without its name and any further arguments are written in full.

rnorm(10, mean = 10, sd = 0.5)
##  [1]  9.236410  9.721322  9.554567  9.815022  9.812235 10.357355  9.928118
##  [8] 10.745933  9.512564  9.373942
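
To illustrate how arguments are matched, here is a small sketch. When all arguments are named, their order does not matter. The base R function set.seed() has not been introduced above; it is used here only to make both calls draw the same random numbers, and the seed value 42 is arbitrary.

```r
# fix the random number generator so both calls draw the same numbers
set.seed(42)
by_position <- rnorm(3, mean = 10, sd = 0.5)

# same call, but with all arguments named and in a different order
set.seed(42)
by_name <- rnorm(sd = 0.5, mean = 10, n = 3)

# both calls produced exactly the same values
identical(by_position, by_name)
## [1] TRUE
```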

1.6.2 Examples: Basic statistical functions

Base R provides us with some basic statistical functions that are used for data analysis.

We should start with defining a numerical vector that contains some data to be analysed.

data <- c(4, 8, 15, 16, 23, 42)

We could be interested in describing this data by its arithmetic mean, median and standard deviation. For this purpose we can use the functions mean(), median(), and sd() provided by base R. None of the three requires additional arguments besides the data to be analysed, which we can provide using the object data we created beforehand.

mean(data)
## [1] 18
median(data)
## [1] 15.5
sd(data)
## [1] 13.49074
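
To see what these functions do, we can reproduce two of the results by hand. This is a sketch using sum() and length() together with the arithmetic operators; sd() computes the sample standard deviation, which divides by the number of values minus 1.

```r
data <- c(4, 8, 15, 16, 23, 42)

# arithmetic mean: sum of the values divided by their number
sum(data) / length(data)
## [1] 18

# sample standard deviation: square root of the summed squared
# deviations from the mean, divided by the number of values minus 1
sqrt(sum((data - mean(data))^2) / (length(data) - 1))
## [1] 13.49074
```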

1.7 RStudio Workflow

Up until now, we wrote our code directly into the RStudio console, pressed “Enter” and received the desired output. This works but will not satisfy our needs in the long run. The main problem is that the code we wrote essentially disappears after running it. Imagine that you want to rerun your code a week from now or even tomorrow. Maybe you took notes and can recreate it, but that means a lot of unsatisfying and error-prone work. Also, maybe at some point you want to share code with colleagues, fellow students, or the R community in general. At the same time, as our code gets more complex, spans multiple lines and consists of many interdependent blocks of code, you will inevitably run into the situation where you realise you made a mistake or have to change some code at the very beginning of your R session. This would mean recreating and rerunning most or all of the code you have already written.

These are some of the reasons why we should start writing our code into so-called R Scripts.

1.7.1 R Scripts

To create a new R Script, you can click on “File” > “New File” > “R Script”, or more conveniently press “CTRL” + “Shift” + “N” simultaneously. This creates an untitled script that we can write our code into.

Let’s start with something simple by recreating some of the code from earlier in this chapter.

a <- 17
b <- 4

the_answer <- (a + b) * 2

the_answer
## [1] 42

We assign two numerical values to the objects a and b, assign a calculation based on these objects to the new object the_answer and prompt R to return its value to us. Instead of writing the code line by line into the console, now we write the whole block into the newly created script. We can now run the complete script by clicking on “Source” or, to also get output, “Source with Echo” in the upper toolbar attached to the script’s tab. In most cases I prefer running the script line by line though. This allows full control of the process and enables you to stop at certain lines, e.g. to contemplate what the code is doing, check for errors or change details of the code before moving on to the next line. You can do this either by clicking on “Run” in the toolbar or pressing “CTRL” + “Enter” simultaneously. In both cases, RStudio copies the line of code where your text cursor is currently residing into the console and runs it for you. The text cursor then conveniently jumps to the next line in the script. In this way you can quickly run your script line by line, while having full control over when to stop.

You can decide for yourself what the right approach to running your code is, based on any given situation. But remember that R always assumes that you know what you’re doing. There will be no warning prompts if you are about to overwrite work you have previously done.

When you are done writing your script, you might want to save it to the hard drive, preserving your work for later re-runs or for sharing. By clicking on “File” > “Save” or pressing “CTRL” + “S” you can save the file with a name of your choosing. The file extension for R Scripts is always “.R”.

One problem that you will run into sooner or later is trying to run incomplete code from a script, most commonly code with a missing closing bracket. In this case, RStudio puts the code to be run into the console and begins a new line, starting with +, and then nothing happens. R assumes that your code will continue in a further line and waits for you to enter it after the +. In most cases the right approach is to cancel the entered command, fix your code and re-run it afterwards. To cancel an already entered command, you have to click into the “Console” tab and press “Esc” on your keyboard. The > prompt will reappear in the console and you can continue with your work.

1.7.2 Projects

In many cases, your work will consist of multiple scripts, data files, graphics saved to the disk or additional output. So it makes sense to assign your files to a place on your hard drive. You can do this “by hand” but a convenient approach might be to use RStudio’s project functionality.

By clicking on “File” > “New Project”, you can start the project creation wizard. If you have already created a folder on your hard drive that shall contain the project, you can click on “Existing Directory”, select the folder and click on “Create Project”. You can also create the folder on the fly by clicking on “New Directory” > “New Project” and then choosing a folder name and the sub-folder where it should be placed, before creating the project.

RStudio will now close all files currently open and switch to your newly created project. The name you chose for the project’s folder will also be its name, seen in RStudio’s title bar. When you look at the “Files” tab (lower right), you will also see that you are now in the project’s folder. This is your current working directory, a concept we will talk about momentarily. All scripts you create while working in your project will become a part of it. So when you want to return to your work, you can click on “File” > “Open Project”. All files opened the last time you worked on the project will be reopened and you will again be in the project’s working directory. This is an easy and convenient way to keep your work tidy.

At this point, I would advise you to create a project for this introduction to web scraping and create R scripts for each chapter as parts of the project. The name and sub-folder you choose are not important from the point of view of functionality, but they should make sense to you.

We should now briefly talk about the working directory. If you try to open or save a file directly from an R script without specifying a complete path, R will always assume you refer to your working directory. If you created a project, this automatically sets the project’s folder as the working directory. You can always check your current working directory by entering getwd() into the console. You can change your current working directory by clicking on “Session” > “Set Working Directory” > “Choose Directory…” or by using the function setwd() with the desired path enclosed by " as the function’s argument.
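
As a minimal sketch of how getwd() and setwd() work together, the following code switches to another folder and back again. The base R function tempdir(), which returns R’s temporary folder, is used here only as a harmless example path; in your own work you would provide a path of your choosing. The printed paths depend on your system and are therefore not shown.

```r
# remember the current working directory
old_wd <- getwd()

# switch to R's temporary folder (only an example path)
setwd(tempdir())
getwd()

# switch back to where we started
setwd(old_wd)
getwd()
```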

1.7.3 Comments

You should get into the habit of commenting your code as early as possible. Comments are started with one or multiple #. All code following the # will not be evaluated by R and thus serves as the perfect place to comment on what you were doing and thinking while writing the code. Why do this? When you reopen a script that you have not been working on in a while, it can be hard to understand what you tried to do in the first place. Commented code makes this much easier. This is even more true if you share your code with other people. They may have very different approaches to certain R problems and clearly commented code will help them to quickly understand it. You should see this as a sign of respect towards the time your peers may invest in helping you with your coding problems.

# assigning objects
a <- 17
b <- 4

# calculating the answer
the_answer <- (a + b) * 2

the_answer
## [1] 42
# but what is the question?

If you plan on using setwd() in your script, it is a good idea to comment out this line before sharing your script. Other people will have different folder structures and will want to decide for themselves. The same goes for all lines that will save something to the hard drive, e.g. data sets or exported graphics. The R and RStudio communities are very welcoming and you will always find people that are willing to lend you their help, so you should return the favour and be polite in your code. This includes writing clear comments and not cluttering anyone’s hard drive with files they may not want to have.
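
A script prepared for sharing in this way might look like the following sketch. The path and the file name are placeholders, and write.csv() and data.frame() are functions that have not been introduced above; they appear only to show a line that would write to the hard drive.

```r
# setwd("path/to/your/project")  # commented out: set your own folder if needed

# calculating the answer
the_answer <- (17 + 4) * 2

# saving is commented out, so nothing is written to disk unless you opt in
# write.csv(data.frame(the_answer), "the_answer.csv")
```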