9 Good practice
In the following section, we will concern ourselves with questions regarding the good practice of web scraping and how to scrape responsibly. There are no right or wrong answers to these questions. But in my view, it is a privilege that we are able to access this wealth of data, and we should treat it as such. Why? We have a responsibility towards the people, institutions and companies whose data we collect to respect certain boundaries and guidelines of data collection. These boundaries might have been set by others or might be constraints we impose upon ourselves.
I will not be talking about the legal situation concerning web scraping. First and foremost, I am not an expert on the legalities and do not feel equipped to give any responsible advice. Additionally, this is a relatively new field and legal rulings might change. On top of that, the specific rules you have to adhere to may very well depend on the context of your scraping project. It’s possible that it is allowed to collect certain data for use in a scientific project, but not for use in a commercial one. Certain types of data are not allowed to be collected at all. When in doubt, try to get advice from your superiors or experts on the legal questions surrounding web scraping and privacy laws.
Nonetheless, if we follow some basic principles of good practice, we can design our web scraping projects in a way that is respectful to the owners of the data as well as true to our own standards of responsible data collection. So to boil it down:
Rule #1 of Scrape Club: Scrape responsibly and with respect to the data sources.
Rule #2 of Scrape Club: Scrape responsibly and with respect to the data sources!
9.1 We ain’t no web crawlers
Web crawlers – also known as web spiders or robots – are pieces of software that aim to collect and index all content on the internet. Web crawlers are, for example, operated by search engines whose aim is to index the content of the web and make it accessible via their search interface.
It is not our goal to indiscriminately and systematically collect every piece of data on the web. Web scraping is aimed at extracting some specific data as precisely as possible and analysing the collected data in a specific usage context, in our case scientific research. We are not web crawlers.
What does this imply for a good practice of web scraping?
Collect only what you need
This means that you have to think hard about which pieces of data you actually have to collect to meet your goals, before you start collecting. Rarely are all subpages of a website needed, so you should only download those that contain data of interest.
Use an API if there is one
Some websites may give you access to an API, an application programming interface. APIs in the web context are mostly used to allow third-party sites and applications to access the services a website provides. For example, YouTube’s IFrame player API is used when a YouTube video is embedded in a third-party site. The API gives access to the video and much of YouTube’s functionality. Another example would be Twitter’s API, which can be used by tweet reader apps to access your tweets.
In many cases, you can use those APIs to directly access the data you are interested in instead of scraping the raw HTML code for it. Sometimes the APIs are openly accessible; sometimes you have to apply for access first. To use the latter as an example: if you want to use Twitter’s API to collect tweets for scientific analysis, you have to fill out a form describing your project and planned data usage, and then await approval. This extra work can give you more direct access to cleaner data. Additionally, if you use an API, you automatically play by the rules of the provider.
So, if there is an API you can access and it gives you access to the data you are interested in, it is a good idea to use the API instead of “traditional” scraping techniques. The technical details of accessing APIs are beyond the scope of this introduction. Also, APIs can differ tremendously in their functionality. The first step should always be to read the API’s documentation. You can also check whether there already is an R package that simplifies access to the API, e.g. TwitteR or Rfacebook, before you start writing your own scripts.
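To give a rough idea of what accessing an API from R can look like: it usually boils down to sending an HTTP request and parsing a structured response, typically JSON. The sketch below uses the httr and jsonlite packages; the endpoint URL and query parameters are purely hypothetical placeholders, since any real API’s documentation defines its own endpoints, parameters and authentication.
library(httr)      # sending HTTP requests
library(jsonlite)  # parsing JSON responses

# hypothetical endpoint and query -- a real API's documentation defines these
response <- GET(
  url = "https://api.example.com/v1/comments",
  query = list(topic = "elections", per_page = 100)
)

# convert the JSON body of the response into an R data structure
comments <- fromJSON(content(response, as = "text"))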
Anonymisation of personal data
Whenever we as web scrapers are collecting personal data that could be used to identify people, we ought to anonymise any information that could lead to identification; this primarily concerns real names in customer reviews, forums and so on. I would argue that this should also extend to user names in general. Most of the time, we do not need the actual names and any simple numeric identifier constructed by us would suffice to discern users during the analysis. If we cannot directly anonymise the data during collection, we should do this after collection. Especially if results of the analysis or a constructed data set are to be publicly released: anonymisation is paramount!
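As a minimal sketch of such an anonymisation step, assuming the collected reviews are already stored in a tibble with a column of user names (the data below is made up for illustration), we can replace the names with simple numeric identifiers and drop the originals:
library(tidyverse)

# made-up example data standing in for scraped reviews
reviews <- tibble(
  user = c("alice83", "bob_k", "alice83", "carola"),
  text = c("Great product!", "Did not work for me.", "Still happy.", "Average.")
)

# assign each distinct user name a numeric identifier, then drop the names
reviews_anon <- reviews %>%
  mutate(user_id = match(user, unique(user))) %>%
  select(-user)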
robots.txt
Many websites have set up a file called robots.txt on their servers which is predominantly aimed at informing automatic web crawlers which parts of the website they are allowed to collect. While the robots.txt files may ask the crawlers to stay out of certain subfolders of a website, they have no power in actually stopping them. If the crawler is polite, it will follow the guidelines of the robots.txt.
It may be up for debate whether web scrapers should follow the robots.txt guidelines, as we have already established that we are not web crawlers. In my opinion, we should at least take a look at these files when planning a project. If the data we plan to collect is excluded by the guidelines, we still have to decide whether we want to go forward with the project: Is the collection of this specific piece of data really necessary? Do I have the support of my superiors or my employer? Can we contact the website operator, describe our project and kindly ask for permission?
9.1.1 On robots.txt files
The robots.txt file is usually accessible by appending robots.txt to the URL of the main page of a website; e.g. to access the robots.txt file for the website https://www.wahlrecht.de/ we can use the URL https://www.wahlrecht.de/robots.txt. If you do not find a robots.txt in this manner, there most probably is none.
Let us briefly look at the structure of this file. Its first block looks like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /ueberhang/wpb/
Disallow: /ueberhang/chronic/
The field “User-agent” defines which crawlers the rules apply to; “*” is short for “everyone”. The “Disallow” fields define which subdirectories of the website may not be accessed by that particular User-agent. So in this case, no User-agent may access the subdirectories “/cgi-bin/”, “/temp/”, and so on.
Let’s have a look at the second block:
User-Agent: Pixray-Seeker
Disallow: /
Here a specific User-agent is defined, so the rules that follow only apply to the crawler “Pixray-Seeker”. There is only one Disallow rule, namely “/”, which is short for “everything”: the Pixray-Seeker crawler is therefore not allowed to access any content on the website.
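If we prefer to inspect a robots.txt file from within R rather than in the browser, one simple way is to read it line by line with base R, as sketched below. (The dedicated robotstxt package offers more convenient helpers, but is not covered in this introduction.)
# read the robots.txt of wahlrecht.de directly from within R
robots <- readLines("https://www.wahlrecht.de/robots.txt")

# print the first few lines, i.e. the block discussed above
head(robots, n = 5)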
9.2 Reduce traffic
Every time a website is accessed, traffic is generated. Traffic here refers to the transfer of the HTML data from the website’s server to our computer, which produces monetary costs on the side of the providers of the website. In addition, many websites are financed through advertisements. When downloading a website directly from R using read_html(), we circumvent the ad display in the browser and thus the provider gets no ad revenue. Both may seem negligible in many cases, but if we aim to act respectfully towards the website providers, we should reduce unnecessary traffic where we can.
Collect only once (if possible)
The simplest step we can take is not to download the HTML code every time we re-run our script. This reduces traffic and also makes our code faster to execute. See chapter 8 on how to save the downloaded HTML files.
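As a brief sketch of the approach covered in chapter 8: we can download a page once, write it to disk, and parse the local copy in all later runs. The file name used below is just an illustrative choice.
library(rvest)
library(xml2)

# download the page once and save a local copy
page <- read_html("https://www.tidyverse.org/packages/")
write_html(page, "tidyverse_packages.html")

# in later runs, parse the saved file instead of requesting the page again
page <- read_html("tidyverse_packages.html")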
Test, test, test
We should also test our code on a small scale before finally running it on large amounts of data. If our aim is to download a large number of websites and/or files, we ought to make sure that our code actually works before starting the full download. As long as there are problems in our code – and there will be – we may create a lot of unnecessary traffic during bugfixing. If we test our code on a single subpage or a small set of subpages until we are sure it runs without errors, instead of testing on all of them, we reduce traffic and, again, the time we have to wait for the downloads to finish. A small sketch of this idea follows below.
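Assuming we have a character vector of URLs called links, as constructed in the example in section 9.2.1, we could first try the download and extraction on a single element and only loop over the full vector once this works:
library(tidyverse)
library(rvest)

# try the download and extraction on a single subpage first
test_page <- read_html(links[1])
test_page %>%
  html_elements(css = "h1") %>%   # check that the expected elements are found
  html_text()

# only once this works as intended, download all subpages
# pages <- map(links, read_html)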
Set waiting times
If we are downloading many pages or files, it may also be a good idea to slow down the process on our end by setting a waiting time between each download; this spreads our traffic over time. If a server detects unusually large numbers of requests, it may even block our IP address, which would make us unable to continue our downloads. Waiting a little while between requests is a way to avoid this. A waiting time of 2-5 seconds should be more than enough in most cases. Sometimes the robots.txt also specifies a desired waiting time for a specific site.
9.2.1 Waiting times in action
Let us have another look at the first example from chapter 7 to see how to set waiting times in practice when downloading multiple pages with read_html().
First, let us generate the list of links we want to download like we did in chapter 7.
library(tidyverse)
library(rvest)
links <- "https://www.tidyverse.org/packages/" %>%
read_html() %>%
html_elements(css = "div.package > a") %>%
html_attr(name = "href")
To let R wait for several seconds, we can use the base R function Sys.sleep(), which takes the number of seconds to wait as its only argument. Since we want the map() function to add the waiting time before each read_html() iteration, we have to include Sys.sleep() in the iteration process. We can achieve this by defining a multi-line formula using the ~ {...} notation. Every line enclosed by the {} is run once for each iteration over the object we pass on to map(). So for each element of the links object, R first waits for two seconds and then applies read_html() to that element. Note that in this formula notation we have to refer to the current element of links by writing . between the parentheses of read_html(). On the first iteration the . stands for the first element of links, on the second for the second element, and so on.
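Putting this together, the download step could look like the following sketch, with the parsed pages collected in a list called pages:
pages <- map(links, ~ {
  Sys.sleep(2)   # wait for two seconds before each download
  read_html(.)   # then download and parse the current element of links
})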
We have successfully added a waiting time of two seconds between each iteration. Note that read_html() has to come last if we want this short notation to work, because only the return value of the last function within the braces is assigned to pages.
9.3 Citation and an additional resource
Some of the ideas expressed above are in part based on the subchapter “9.3 Web scraping: Good practice” from:
Munzert et al. (2015). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester: Wiley.
The authors also describe guidelines for a good practice of web scraping that in part overlap with those presented here, but go beyond them by describing the technical implementation in R and, by way of some examples, the legal situation. A read is highly recommended.