1 R & RStudio
1.1 Installing R
R is a freely available programming language used predominantly for data science and statistical computations. For more information on the language and to access the documentation, visit: https://www.r-project.org/. From there you can also follow the link to CRAN, the Comprehensive R Archive Network, or access it directly by visiting https://cran.r-project.org/.
The latest versions of R will always be hosted at CRAN. At the top of the landing page, you will find links to the installers for each operating system. If you are using Windows, choose “base” after following the link and then download the offered file. On macOS, download the first file listed under “Latest release”. In both cases, execute the file and install R to a directory of your choice. If you are using Linux, the link on CRAN offers installation advice for some of the more popular distributions. Alternatively, you can check your package manager for the latest release available for your system.
All examples used on this website were written and tested with R version 4.1.3 “One Push-Up”. Although this is not expected, they might return errors in newer or older versions of R.
1.2 RStudio
The basic R installation provides a terminal that could in principle be used to follow the contents of this introduction to web scraping. The far more common approach is to use an external IDE (Integrated Development Environment), the most popular being RStudio. Using an IDE will dramatically improve your workflow and I would strongly recommend using RStudio for this purpose.
RStudio is also freely available and can be found at https://www.rstudio.com/. Following the “Download” link and scrolling down (ignoring the different versions offered for professional usage), you will find the latest installers for several operating systems offered as downloads. In most cases, simply installing RStudio after R has been installed will work “out of the box”.
1.2.1 Overview
The RStudio interface consists of four sub-areas. The bottom-left shows the “Console” as well as additional tabs which you will rarely need in the beginning. The console can be used to evaluate R code live. We will begin working with the console soon, so this will make more sense to you in a bit. The top-left shows opened files, e.g. R scripts. This is where you will actually spend most of your time. This introduction proceeds from using one-time commands in the console to writing your code in scripts that can be re-opened, re-run and shared. The top-right has several tabs of which “Environment” should be our main concern at this point. Here you will see all data objects created in your RStudio session. More on this later. Finally, the bottom-right shows us, amongst other things, the “Files” in a selected folder, graphical output under “Plots” and requested “Help” on packages and functions.
1.3 Hello World!
So, let’s begin with putting R & RStudio to use. For now, we will write our
commands directly into the console.
You will notice a > sign at the beginning of the last line in the console. This is a prompt, as in “write commands here”. Try writing a command such as print("Hello World!") and executing it with the “Enter” key.
You just entered your first R command, received your first output and also used your first function. We will address functions in more detail later. For now, it is enough to know that the command print() prints everything that is enclosed in its parentheses to the output. The output begins with [1], indicating that this is the first, and in this case the only, element of the output generated by the executed command. Please note that RStudio will not print ## before the output. In the code segments shown on this website, ## is inserted before the output to allow copying the code directly to RStudio, as one or multiple # indicate that a line is a comment and thus is not evaluated as a command by R.
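The comment convention can be sketched with a first command. The greeting used here is an assumed example string, not necessarily the one from the original page:

```r
# This line is a comment and is not evaluated by R
print("Hello World!")  # the string is an assumed example
## [1] "Hello World!"
```

Copying all three lines into the console runs only the print() call, because the other two lines start with #.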
1.3.1 Calculating with R
R understands the basic arithmetic symbols + - * / and thus the console can be used as a calculator. Many functions for more involved calculations are available, e.g. sqrt() for taking the square root of the content enclosed in the parentheses. x^y can be used to write \(x^y\). For now, you should write each calculation on its own line in the R console and execute it with the “Enter” key.
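A few calculations of this kind might look as follows. The numbers are arbitrary illustrations:

```r
17 + 4    # addition
## [1] 21
17 - 4    # subtraction
## [1] 13
17 * 4    # multiplication
## [1] 68
17 / 4    # division
## [1] 4.25
sqrt(16)  # square root
## [1] 4
2^10      # 2 to the power of 10
## [1] 1024
```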
1.3.2 Comparison operators
We can use comparison operators to compare two values and receive the test result as output. To test if two values are equal, we write ==. To test if they are not equal, we can use !=. We can also test whether the first value is less than (<), less than or equal to (<=), greater than (>) or greater than or equal to (>=) the second value.
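As a short sketch, each operator can be tried with arbitrary example values:

```r
42 == 42  # equal?
## [1] TRUE
42 != 42  # not equal?
## [1] FALSE
3 < 5     # less than?
## [1] TRUE
5 <= 5    # less than or equal to?
## [1] TRUE
3 > 5     # greater than?
## [1] FALSE
5 >= 6    # greater than or equal to?
## [1] FALSE
```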
1.3.3 Logical operators
We can also combine tests by applying logical operators to create more complex conditions. & (AND) checks if both tests combined by it are TRUE and only returns TRUE if both are at the same time. | (OR) checks if at least one of the tests is TRUE and returns TRUE if one or both are. When combining two tests, we thus have these possibilities:
TRUE & TRUE returns TRUE
TRUE & FALSE, FALSE & TRUE and FALSE & FALSE all return FALSE
TRUE | TRUE, TRUE | FALSE and FALSE | TRUE all return TRUE
FALSE | FALSE returns FALSE
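In practice the operators combine actual tests rather than literal TRUE and FALSE values. A minimal sketch, using the arbitrary number 7:

```r
(7 > 5) & (7 < 10)  # both tests are TRUE
## [1] TRUE
(7 > 5) & (7 > 10)  # the second test is FALSE
## [1] FALSE
(7 < 5) | (7 < 10)  # at least one test is TRUE
## [1] TRUE
(7 < 5) | (7 > 10)  # both tests are FALSE
## [1] FALSE
```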
1.4 Objects
Some of the power of using a language like R for computation comes from the ability to store data or results for later use and further analysis. In R, all types of data are stored in objects. On a basic level, an object is a name that we define that has some form of data assigned to it. To assign data to a name, we use the assignment operator <-.
A handy keyboard shortcut for writing the assignment operator is pressing the “Alt” and “-” keys simultaneously. Learning this shortcut early will save you a lot of typing and keyboard gymnastics.
After we have assigned a value to an object, we can recall that value by writing the object’s name.
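A minimal sketch of assigning and recalling a value. The object name and the value 42 are assumptions chosen to fit the examples that follow:

```r
# Assign the value 42 to an object named the_answer (assumed example)
the_answer <- 42

# Writing the object's name recalls the stored value
the_answer
## [1] 42
```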
We can also use defined objects in calculations and function calls (more on those later). Note that if we assign a value to an already defined object, the stored value is overwritten by the new one.
a <- 17
b <- 4
the_answer <- (a + b) * 2
the_answer
## [1] 42
the_answer <- the_answer / 2
the_answer
## [1] 21
All objects we define are listed in the “Environment” tab, seen in the upper right of RStudio. If we ever want to remove objects from the environment, we can use the rm() function. In general, this is not necessary, but it can help with keeping the list from getting cluttered.
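Removing an object could look like this. The object name scratch is hypothetical, and exists() is used here only to check whether an object of that name is defined:

```r
scratch <- 1       # hypothetical object we no longer need
exists("scratch")  # it is currently defined
## [1] TRUE
rm(scratch)        # remove it from the environment
exists("scratch")  # now it is gone
## [1] FALSE
```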
1.5 Vectors
When we assigned a number to an object, we actually created a vector. A vector is a one-dimensional data structure that can contain multiple elements. The number of elements determines the length of the vector. So a vector with only one element is still a vector, but with a length of 1.
To assign multiple elements to a vector, we use the combine function c(). All values inside the parentheses, separated by commas, are combined as elements to form the vector.
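A minimal sketch of creating a vector with c(). The name v matches the vector referred to later in the text, while the three values are assumptions:

```r
# Combine three numbers into one vector (values are assumed examples)
v <- c(7, 8, 9)
v
## [1] 7 8 9

# length() returns the number of elements
length(v)
## [1] 3
```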
1.5.1 Subsetting
If we want to access certain elements of a vector, we have to use subsetting. This is achieved by adding square brackets to the object’s name, containing the position of the element in its vector. In order to access the first or third element of a vector v, we can write v[1] or v[3].
We can also access multiple elements at once, using c() inside the brackets or by defining a range of positions using :.
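The subsetting variants described above can be sketched as follows, assuming a vector v with three example values:

```r
v <- c(7, 8, 9)  # assumed example values

v[1]             # first element
## [1] 7
v[3]             # third element
## [1] 9
v[c(1, 3)]       # first and third element
## [1] 7 9
v[2:3]           # elements at positions 2 to 3
## [1] 8 9
```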
1.5.2 Types of vectors
Observing the vector v we created in the environment, we notice that RStudio writes num [1:3] before listing the values of the elements. The second part indicates the length of 3, while the first part shows the type of the vector we created. In this case the type is numeric. Numeric vectors, as you might have guessed, contain numbers. We can also use str() to receive info on type, length and content of a vector.
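Calling str() on a numeric vector might look like this. The values of v are again assumed for illustration:

```r
v <- c(7, 8, 9)  # assumed example values
str(v)           # type, length and content in one line
##  num [1:3] 7 8 9
```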
There are a number of other types of vectors, the two most important – besides numeric vectors – being logical and character vectors.
Logical vectors can only contain the values TRUE and FALSE. Strictly speaking, they (like the other types of vectors) can also contain NA, indicating a missing value. We will talk more about NAs later on. Logical vectors are often created when we test for something. For example, we can test if the elements in a numeric vector are greater than or equal to 5 and receive a logical vector containing the test results.
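Such a test could be sketched as follows. The vector name and its values are hypothetical:

```r
numbers <- c(2, 5, 8)  # hypothetical example data

# The comparison is applied to every element and returns a logical vector
numbers >= 5
## [1] FALSE  TRUE  TRUE
```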
Character vectors contain strings of characters. When assigning strings, they have to be enclosed in quotation marks.
char_v <- c("This", "is", "a", "character", "vector!")
char_v
## [1] "This" "is" "a" "character" "vector!"
char_v_2 <- c("This is also", "a character vector!")
char_v_2
## [1] "This is also" "a character vector!"
In practice, we compare character vectors mainly for (non-)equality. Operators such as < and > also work on strings, but they compare alphabetical order, not numerical size.
"same" == "same"
## [1] TRUE
"same" == "not the same"
## [1] FALSE
"same" != "not the same"
## [1] TRUE
Character vectors also cannot be used in calculations. This becomes a problem when numbers are stored as characters, which happens frequently in web scraping.
a <- c(1, 2, 3)
b <- c("7", "8", "9")
str(a)
## num [1:3] 1 2 3
str(b)
## chr [1:3] "7" "8" "9"
a + b
## Error in a + b: non-numeric argument to binary operator
As we enclosed the elements of vector b in quotation marks, R interprets the data as characters instead of numbers. Since characters cannot be used for calculations, we received an error message. But we can make R interpret the characters as numbers by using as.numeric().
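Applied to the two vectors from the example above, the conversion can be sketched like this:

```r
a <- c(1, 2, 3)
b <- c("7", "8", "9")

# Convert the character vector to numbers before calculating
a + as.numeric(b)
## [1]  8 10 12
```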
1.5.3 A brief look at lists
Note that a vector of a certain type can only contain elements of that type, so we cannot mix data types in the same vector. If we want to mix data types, we can use lists instead of vectors.
l <- list(1, TRUE, "Hello World!")
str(l)
## List of 3
## $ : num 1
## $ : logi TRUE
## $ : chr "Hello World!"
Lists can also contain other lists to represent hierarchical data structures. We will see lists “in action” later on in this course.
1.6 Functions
Functions provide an easy and concise way of performing more or less complex tasks using predefined bits of R code that are provided in “base R” (i.e. that come with the basic R installation) or in the various additional packages that are available for installation. We have already used a number of functions up to this point, e.g. print(). To “call” a function, we write its name, followed by parentheses. Inside the parentheses additional arguments are provided to R. In most cases, some data has to be entered as the first argument. For example, print() writes the text provided as argument to the output. More complex functions often allow for more than one argument. Sometimes these are required, but more often these additional arguments are optional and can be used to change some options from the default value to the one desired.
1.6.1 Help
But how do we know which arguments can or have to be provided to use a function and what their effects are? We can check the documentation on CRAN or use Google to find additional information. Another, often more convenient, way is to use the help functionality built into R. By writing ? in front of the function name into the console and executing the line by pressing “Enter”, the help file is opened in the lower right of the RStudio window. Let’s try this for the function rnorm() by entering ?rnorm.
The help file tells us several things. rnorm() is part of a family of functions that are related to the normal distribution, each providing a distinct functionality. The functionality of rnorm() is the generation of random numbers drawn from the normal distribution. We also learn that three arguments can be provided: n, the number of observations to be generated, as well as mean and sd, the mean and the standard deviation of the normal distribution to be drawn from. We also see that mean and sd are provided with the default values 0 and 1 respectively, indicated by the =, while n has no default value. So we have to provide a value for n, but not for mean and sd. Just writing rnorm() will result in an error.
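The defaults can also be checked from the console. The sketch below uses formals() and tryCatch(), two base R functions not introduced in this chapter, purely as an illustration:

```r
# ?rnorm   # opens the help file when run interactively

# Look up the default values of mean and sd
formals(rnorm)$mean
## [1] 0
formals(rnorm)$sd
## [1] 1

# rnorm() without n fails; tryCatch() turns the error into a value
result <- tryCatch(rnorm(), error = function(e) "error")
result
## [1] "error"
```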
To provide an argument to a function, we write the name of the argument, followed by = and the value to be provided. Note that, since rnorm() draws random numbers, your output will differ from the output presented here.
rnorm(n = 10)
## [1] -0.4825981 -0.1270875 -1.3090852 0.7795997 0.2651731 0.8898682
## [7] 0.4773540 0.1411927 0.3071022 -0.8846552
In the same vein, additional arguments that are allowed by the function can be defined, instead of using their default values.
rnorm(n = 10, mean = 10, sd = 0.5)
## [1] 9.403351 10.580990 10.447558 10.223672 9.916762 9.357506 10.537004
## [8] 8.619759 10.252021 9.831583
We can also skip writing the names of arguments in many cases. As the n argument is the first listed in the function’s parentheses, R also understands the call if we just provide the value to be used as the first argument. You will often encounter the convention that the first argument is written without its name and any further arguments are written in full.
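A sketch of this convention follows. set.seed() is an addition not covered in the text; it fixes the random number generator so that both calls draw the same numbers and can be compared:

```r
# Named first argument
set.seed(42)
draws_named <- rnorm(n = 5, mean = 10, sd = 0.5)

# Same call with the first argument given by position only
set.seed(42)
draws_positional <- rnorm(5, mean = 10, sd = 0.5)

# Both calls are understood identically by R
identical(draws_named, draws_positional)
## [1] TRUE
```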
1.6.2 Examples: Basic statistical functions
Base R provides us with some basic statistical functions that are used for data analysis.
We should start with defining a numerical vector that contains some data to be analysed.
We could be interested in describing this data by its arithmetic mean, median and standard deviation. For this purpose we can use the functions mean(), median(), and sd() provided by base R. None of the three requires additional arguments besides the data to be analysed, which we can provide using the object data we created beforehand.
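A minimal sketch of the three functions. The values stored in data are assumed, chosen so that the results are easy to verify by hand:

```r
# Assumed example data
data <- c(7, 9, 9, 10, 15)

mean(data)    # arithmetic mean: 50 / 5
## [1] 10
median(data)  # middle value of the sorted data
## [1] 9
sd(data)      # sample standard deviation: sqrt(36 / 4)
## [1] 3
```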
1.7 RStudio Workflow
Up until now, we wrote our code directly into the RStudio console, pressed “Enter” and received the desired output. This works but will not satisfy our needs in the long run. The main problem is that the code we wrote essentially disappears after running it. Imagine that you want to rerun your code a week from now or even tomorrow. Maybe you took notes and can recreate it, but that means a lot of unsatisfying and error-prone work. Also, at some point you may want to share code with colleagues, fellow students, or the R community in general. At the same time, as your code gets more complex, spans multiple lines and consists of many interdependent blocks, you will inevitably realise that you made a mistake or have to change some code at the very beginning of your R session. This would mean recreating and rerunning most or all of the code you have already written.
These are some of the reasons why we should start writing our code into so-called R Scripts.
1.7.1 R Scripts
To create a new R Script, you can click on “File” > “New File” > “R Script”, or more conveniently press “CTRL” + “Shift” + “N” simultaneously. This creates an untitled script that we can write our code into.
Let’s start with something simple by recreating some of the code we wrote earlier. We assign two numerical values to the objects a and b, assign a calculation based on these objects to the new object the_answer and prompt R to return its value to us. Instead of writing the code line by line into the console, we now write the whole block into the newly created script. We can now run the complete script by clicking on “Source” or, to also get output, “Source with Echo” in the upper toolbar attached to the script’s tab.
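The contents of such a first script could look like this, reusing the values from the objects example earlier in this chapter:

```r
# Assign two numerical values
a <- 17
b <- 4

# Store the result of a calculation based on both objects
the_answer <- (a + b) * 2

# Return the stored value
the_answer
## [1] 42
```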
In most cases I prefer running the script line by line though. This allows
full control of the process and enables you to stop in certain lines to
e.g. contemplate what the code is doing, check for errors or change details of
the code before moving on to the next line. You can do this either by clicking
on “Run” in the toolbar or pressing “CTRL” + “Enter” simultaneously. In both
cases, RStudio copies the line of code where your text cursor is currently
residing into the console and runs it for you. The text cursor then conveniently
jumps to the next line in the script. In this way you can quickly run your
script line by line, while having full control over when to stop.
You can decide for yourself what the right approach to running your code is, based on any given situation. But remember that R always assumes that you know what you’re doing. There will be no warning prompts if you are about to overwrite work you have previously done.
When you are done writing your script, you might want to save it to the hard drive, preserving your work for later re-runs or for sharing. By clicking on “File” > “Save” or pressing “CTRL” + “S” you can save the file with a name of your choosing. The file extension for R Scripts is always “.R”.
One problem that you will run into sooner or later is trying to run incomplete code from a script, most commonly code with a missing closing bracket. In this case, RStudio puts the code to be run into the console and begins a new line, starting with +, and then nothing happens. R assumes that your code will continue on a further line and waits for you to enter it after the +. In most cases the right approach is to cancel the entered command, fix your code and re-run it afterwards. To cancel an already entered command, you have to click into the “Console” tab and press “Esc” on your keyboard. The > prompt will reappear in the console and you can continue with your work.
1.7.2 Projects
In many cases, your work will consist of multiple scripts, data files, graphics saved to the disk or additional output. So it makes sense to assign your files to a place on your hard drive. You can do this “by hand” but a convenient approach might be to use RStudio’s project functionality.
By clicking on “File” > “New Project”, you can start the project creation wizard. If you have already created a folder on your hard drive that shall contain the project, you can click on “Existing Directory”, select the folder and click on “Create Project”. You can also create the folder on the fly by clicking on “New Directory” > “New Project” and then choosing a folder name and the sub-folder where it should be placed, before creating the project.
RStudio will now close all files currently open and switch to your newly created project. The name you chose for the project’s folder will also be its name, seen in RStudio’s title bar. When you look at the “Files” tab (lower right), you will also see that you are now in the project’s folder. This is your current working directory, a concept we will talk about momentarily. All scripts you create while working in your project will become a part of it. So when you want to continue your work, you can click on “File” > “Open Project”. All files opened the last time you worked on the project will be reopened and you will again be in the project’s working directory. This is an easy and convenient way to keep your work tidy.
At this point, I would advise you to create a project for this introduction to web scraping and create R scripts for each chapter as parts of the project. The name and sub-folder you choose are not important from the point of view of functionality, but they should make sense to you.
We should now briefly talk about the working directory. If you try to open or save a file directly from an R script without specifying a complete path, R will always assume you refer to your working directory. If you created a project, this automatically sets the project’s folder as the working directory. You can always check your current working directory by entering getwd() into the console. You can change your current working directory by clicking on “Session” > “Set Working Directory” > “Choose Directory…” or by using the function setwd() with the desired path enclosed by " as the function’s argument.
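Checking and changing the working directory can be sketched as follows. The temporary folder returned by tempdir() stands in for a real project folder and is only an assumption for this illustration; your output will show your own paths:

```r
# Remember the current working directory
old_wd <- getwd()
old_wd

# Switch to another folder; tempdir() is a stand-in for a real path
setwd(tempdir())
getwd()

# Switch back to where we started
setwd(old_wd)
```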
1.7.3 Comments
You should get into the habit of commenting your code as early as possible. Comments are started with one or multiple #. All code following the # will not be evaluated by R and thus serves as the perfect place to comment on what you were doing and thinking while writing the code. Why do this? When you reopen a script that you have not been working on in a while, it can be hard to understand what you tried to do in the first place. Commented code makes this much easier. This is even more true if you share your code with other people. They may have very different approaches to certain R problems and clearly commented code will help them to quickly understand it. You should see this as a sign of respect towards the time your peers may invest in helping you with your coding problems.
If you plan on using setwd() in your script, it is a good idea to comment out this line before sharing your script. Other people will have different folder structures and will want to decide for themselves. The same goes for all lines that will save something to the hard drive, e.g. data sets or exported graphics. The R and RStudio communities are very welcoming and you will always find people that are willing to lend you their help, so you should return the favour and be polite in your code. This includes writing clear comments and not cluttering anyone’s hard drive with files they may not want to have.
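A commented script following this advice might look like the sketch below. The path, the file name and the data are hypothetical:

```r
# setwd("C:/my_project")  # hypothetical path, commented out before sharing

# Store three measurements and compute their average
measurements <- c(2, 4, 6)
average <- mean(measurements)  # arithmetic mean of the three values
average
## [1] 4

# Saving to disk is commented out so nobody's hard drive gets cluttered
# write.csv(data.frame(average), "average.csv")  # hypothetical file name
```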