13 Regular expressions
In web scraping, we are often dealing with text data that has been collected from a website and may include information that is not easily accessible and usable in data analysis. We are interested either in detecting the occurrence of certain words or in extracting some information from the strings in our data. For these purposes we will use some functions from the core tidyverse package stringr. We will also use regular expressions, a powerful and flexible formal representation of characters that we can use to select substrings by their content.
This chapter will make use of some practical examples, introducing the basics of regular expressions and stringr. This can only be a starting point; if you want to follow the road of web scraping, you most probably will have to further your knowledge on regular expressions at some point. I would recommended reading the chapter on strings from “R for Data Science” by Wickham and Grolemund as a next step: https://r4ds.had.co.nz/strings.html, as well as the RStudio Cheat Sheet on stringr: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf.
13.1 Cleaning the district
column
We can use the data on Berlin police reports scraped earlier (10) for our examples.
In our example the data on police reports contain the district a report refers to as a column which needs some cleaning to be usable. We have already done this in subchapter 11.1.2, but we will use some alternative and more advanced approaches here.
reports %>%
select(District) %>%
head(n = 10)
## # A tibble: 10 × 1
## District
## <chr>
## 1 Ereignisort: Treptow-Köpenick
## 2 Ereignisort: Charlottenburg-Wilmersdorf
## 3 Ereignisort: berlinweit
## 4 Ereignisort: Friedrichshain-Kreuzberg
## 5 Ereignisort: Marzahn-Hellersdorf
## 6 Ereignisort: Friedrichshain-Kreuzberg
## 7 Ereignisort: Mitte
## 8 Ereignisort: Charlottenburg-Wilmersdorf
## 9 Ereignisort: Neukölln
## 10 Ereignisort: berlinweit
One problem we encounter here is that each string begins with “Ereignisort:”.
We want to get rid of this substring, as it does not contain any useful
information. With str_remove()
we can remove substrings from a string. Like
all stringr functions we will use here, str_remove()
takes the string(s) to be
applied to, as its first argument. The argument pattern =
specifies a regular
expression for the pattern that is to be matched and removed. In the code block
below, the substring "Ereignisort:
is removed for all values in the “District”
column. We are using the most basic form of a regular
expression here which is the exact matching of characters. In general, the
complete regular expression is enclosed by "
. Please note, that the whitespace
following :
is part of the regular expression.
reports %>%
mutate(District = str_remove(District, pattern = "Ereignisort: ")) %>%
select(District) %>%
head(n = 10)
## # A tibble: 10 × 1
## District
## <chr>
## 1 Treptow-Köpenick
## 2 Charlottenburg-Wilmersdorf
## 3 berlinweit
## 4 Friedrichshain-Kreuzberg
## 5 Marzahn-Hellersdorf
## 6 Friedrichshain-Kreuzberg
## 7 Mitte
## 8 Charlottenburg-Wilmersdorf
## 9 Neukölln
## 10 berlinweit
We could also keep the substring but translate it to English. For this, we use
str_replace()
which also matches a regular expression, but replaces it with a
string specified in replacement =
. Here, the pattern is shortened to
"Ereignisort"
, as we want to keep the ": "
.
reports %>%
mutate(District = str_replace(District, pattern = "Ereignisort", replacement = "District")) %>%
select(District) %>%
head(n = 10)
## # A tibble: 10 × 1
## District
## <chr>
## 1 District: Treptow-Köpenick
## 2 District: Charlottenburg-Wilmersdorf
## 3 District: berlinweit
## 4 District: Friedrichshain-Kreuzberg
## 5 District: Marzahn-Hellersdorf
## 6 District: Friedrichshain-Kreuzberg
## 7 District: Mitte
## 8 District: Charlottenburg-Wilmersdorf
## 9 District: Neukölln
## 10 District: berlinweit
In the same way, we can replace the ": "
substring with a dash.
reports %>%
mutate(District = str_replace(District, pattern = ": ", replacement = "-")) %>%
select(District) %>%
head(n = 10)
## # A tibble: 10 × 1
## District
## <chr>
## 1 Ereignisort-Treptow-Köpenick
## 2 Ereignisort-Charlottenburg-Wilmersdorf
## 3 Ereignisort-berlinweit
## 4 Ereignisort-Friedrichshain-Kreuzberg
## 5 Ereignisort-Marzahn-Hellersdorf
## 6 Ereignisort-Friedrichshain-Kreuzberg
## 7 Ereignisort-Mitte
## 8 Ereignisort-Charlottenburg-Wilmersdorf
## 9 Ereignisort-Neukölln
## 10 Ereignisort-berlinweit
In chapter 11 we saw district names with two parts that are sometimes written with and sometimes without whitespace around the dash, as seen here:
reports %>%
group_by(District) %>%
summarize(reports = n())
## # A tibble: 22 × 2
## District reports
## <chr> <int>
## 1 Ereignisort: Charlottenburg - Wilmersdorf 579
## 2 Ereignisort: Charlottenburg-Wilmersdorf 1427
## 3 Ereignisort: Friedrichshain - Kreuzberg 661
## 4 Ereignisort: Friedrichshain-Kreuzberg 1701
## 5 Ereignisort: Lichtenberg 1232
## 6 Ereignisort: Marzahn - Hellersdorf 253
## 7 Ereignisort: Marzahn-Hellersdorf 881
## 8 Ereignisort: Mitte 3314
## 9 Ereignisort: Neukölln 1952
## 10 Ereignisort: Pankow 1629
## # ℹ 12 more rows
We can easily unify those names using str_replace
and regular expressions.
reports %>%
mutate(District = str_replace(District, pattern = " - ", replacement = "-")) %>%
group_by(District) %>%
summarize(reports = n())
## # A tibble: 16 × 2
## District reports
## <chr> <int>
## 1 Ereignisort: Charlottenburg-Wilmersdorf 2006
## 2 Ereignisort: Friedrichshain-Kreuzberg 2362
## 3 Ereignisort: Lichtenberg 1232
## 4 Ereignisort: Marzahn-Hellersdorf 1134
## 5 Ereignisort: Mitte 3314
## 6 Ereignisort: Neukölln 1952
## 7 Ereignisort: Pankow 1629
## 8 Ereignisort: Reinickendorf 1193
## 9 Ereignisort: Spandau 1170
## 10 Ereignisort: Steglitz-Zehlendorf 1045
## 11 Ereignisort: Tempelhof-Schöneberg 1737
## 12 Ereignisort: Treptow-Köpenick 1346
## 13 Ereignisort: berlinweit 389
## 14 Ereignisort: bezirksübergreifend 953
## 15 Ereignisort: bundesweit 63
## 16 <NA> 427
If we want to remove the whitespace and unify the names in the process, we use
str_remove()
and the regular expression "\\s"
, being a shortcut for matching
all kinds of whitespace.
At this point we have to talk about escape characters. Some characters used in
regular expressions have a special meaning, e.g. "."
, "?"
or "*"
. If we
just write one of those characters in a regular expression, R will not interpret
it as a literal character but by its special meaning. "."
used in a regular
expression for example, stands for any character. If we want to match a
literal .
in a string though, we have to escape the dot by using the escape
character \
or actually two, telling R to take the next character
literally. So to match a dot, we write "\\."
.
If this seems confusing and needlessly complicated, save your rage for when you
find out what you have to write to match one literal backslash in a string:
\\\\
. Some commentary on this: https://xkcd.com/1638/
Escape characters are hard to understand when starting out using regular expressions, but for now it is enough to know that some characters have to be escaped and that you find a full list of those in the cheat sheet linked above.
In a similar way, we have to use two escape characters in "\\s"
to make R
interpret it not as an “s” but as the shortcut for whitespace characters we want
to use:
reports %>%
mutate(District = str_remove(District, pattern = "\\s")) %>%
group_by(District) %>%
summarize(reports = n())
## # A tibble: 22 × 2
## District reports
## <chr> <int>
## 1 Ereignisort:Charlottenburg - Wilmersdorf 579
## 2 Ereignisort:Charlottenburg-Wilmersdorf 1427
## 3 Ereignisort:Friedrichshain - Kreuzberg 661
## 4 Ereignisort:Friedrichshain-Kreuzberg 1701
## 5 Ereignisort:Lichtenberg 1232
## 6 Ereignisort:Marzahn - Hellersdorf 253
## 7 Ereignisort:Marzahn-Hellersdorf 881
## 8 Ereignisort:Mitte 3314
## 9 Ereignisort:Neukölln 1952
## 10 Ereignisort:Pankow 1629
## # ℹ 12 more rows
This did not work as expected. The whitespace following the ":"
was removed,
but not the whitespace around the dashes. This is the case because most of the
stringr functions stop after they found the first match for the pattern in a
string. But there are versions of the functions that do not stop after their
first match, ending in _all
: e.g. str_replace_all()
. So if we use
str_remove_all()
in our code, all occurrences of whitespace will be removed.
reports %>%
mutate(District = str_remove_all(District, pattern = "\\s")) %>%
group_by(District) %>%
summarize(reports = n())
## # A tibble: 16 × 2
## District reports
## <chr> <int>
## 1 Ereignisort:Charlottenburg-Wilmersdorf 2006
## 2 Ereignisort:Friedrichshain-Kreuzberg 2362
## 3 Ereignisort:Lichtenberg 1232
## 4 Ereignisort:Marzahn-Hellersdorf 1134
## 5 Ereignisort:Mitte 3314
## 6 Ereignisort:Neukölln 1952
## 7 Ereignisort:Pankow 1629
## 8 Ereignisort:Reinickendorf 1193
## 9 Ereignisort:Spandau 1170
## 10 Ereignisort:Steglitz-Zehlendorf 1045
## 11 Ereignisort:Tempelhof-Schöneberg 1737
## 12 Ereignisort:Treptow-Köpenick 1346
## 13 Ereignisort:berlinweit 389
## 14 Ereignisort:bezirksübergreifend 953
## 15 Ereignisort:bundesweit 63
## 16 <NA> 427
13.2 Detecting robberies in police reports
When collecting the data on police reports, we also scraped the short description of the report’s contents, but we did no analysis on these in chapters 11 & 12. We could try to categorize the texts into types of crimes using regular expressions. Doing this properly, would require a lot of planning. We would have to decide which categories of crimes we want to extract and construct flexible regular expressions for each of these. Please understand the following merely as a first outline for how to approach such a project.
Let us focus solely on robberies. To begin with, we want to detect all strings
that contain the term “raub” in some form. str_detect()
determines whether a
pattern was found in a string, returning TRUE
or FALSE
. We can then count
the number of strings for which the result is TRUE
– remember TRUE
is equal
to \(1\), FALSE
equal to \(0\) – using sum()
.
But what if a text begins with “Raub”. In this case the “R” will be upper case.
Pattern matching is case sensitive and that means that “Raub” is not equal to
“raub”. If we want both ways of writing the term to be counted as equal, we have
to write a regular expression that allows for either an upper or lower case “r”
as the first character. For this we can define a class which contains both
characters enclosed by []
. "[Rr]aub"
means that we want to find either an
“R” or an “r” as the first character and “aub” after this; only one of the
characters from the class is matched.
If we also want to detect terms like “Räuber” or “räuberische”, we have to allow for an “ä” instead of the “a” as the second character by defining a class for the second position in the term:
Let us have a look at some of the strings which returned a TRUE
match, using
str_detect()
in filter()
: telling R we want to see only lines from reports
for which the pattern could be found, and only the first \(10\) of those.
reports %>%
filter(str_detect(Report, pattern = "[Rr][aä]ub")) %>%
select(Report) %>%
head(n=10)
## # A tibble: 10 × 1
## Report
## <chr>
## 1 Festnahme nach räuberischem Diebstahl
## 2 Raub im Juweliergeschäft
## 3 Räuber festgenommen
## 4 Seniorin in ihrer Wohnung beraubt und schwer verletzt zurückgelassen
## 5 Tankstelle beraubt
## 6 Gemeinschaftlicher Raub in Wohnung – Wer kennt diesen Mann?
## 7 Raub auf Tankstelle - Wer kennt diesen Mann?
## 8 Nachtportier beraubt
## 9 Ausgeraubt und mit Pistole bedroht
## 10 Raub auf Hotel - Rezeptionist wird bedroht und verletzt
This looks good at a first glance, but consider this result:
reports %>%
filter(str_detect(Report, pattern = "[Rr][aä]ub")) %>%
filter(Date == "18.06.2021 10:20 Uhr") %>%
select(Report)
## # A tibble: 1 × 1
## Report
## <chr>
## 1 "Mit dem Rettungshubschrauber ins Krankenhaus – Motorradfahrer schwer verletz…
Why was a report on a motorcycle accident included? Because the word “Rettungshubschrauber” also has “raub” in it. You see, how reliably detecting a simple string such as “raub” gets complicated very quickly.
At this point, we would have to put in the hours and define a list of phrases
referring to robberies that we actually want to select while constructing
regular expressions that select those phrases and only those. This is beyond the
scope of this introductory example, but as a first approximation we could say
that we want to find all instances in which the words begin with a capital “R”,
as in “Raub” or “Räuber”, or where “aub” and “äub” are prefixed with
“ge”, “be” or “zu”, as in “geraubt”, “beraubt”, or “auszurauben”. To
achieve this, the regular expression has to allow for four optional substrings,
using grouping and the “OR” operator |
. Groups are defined by enclosing
several expressions in parentheses. While groups have advanced functionality,
for now you can interpret them like parentheses in mathematics. The enclosed
expression is evaluated first and as a whole. If one of the prefixes, connected
by |
, is found in a string and is then followed up by either “a” or “ä” and
“ub” after this, the pattern is matched.
There are other words in the German language that are used to describe the
concept of robberies. Adding all of them and the several individual pre- and/or
suffixes that have to be taken into account to reliably select the terms we want
to select and none of those we do not want to select, is beyond the scope of
this example. But we can at least add one further word to the regular
expression. Using the |
operator we extend the regular expression to also
detect “dieb” with an upper or lower case “d”. The expression now selects all
ways of writing “raub” and “dieb” we allowed for.
Now that we have constructed a provisional regular expression that can detect
reports that are related to robberies, we can also compute their relative
proportion. Remember, we are dealing with a logical vector of ones and zeros:
sum()
counts all ones and thus returns the absolute number of reports and
mean()
divides the number of ones by the number of all reports, thus returning
the relative proportion.
reports$Report %>%
str_detect(pattern = "(R|ger|ber|zur)[aä]ub|[Dd]ieb") %>%
mean()
## [1] 0.09388666
So about \(9.4\)% of police reports deal with robberies if we apply the regular expression constructed above. Remember, that we might still match some words that we do not want to match, while missing others that also relate to robberies. Thus the actual percentage may differ.
13.3 Extracting details from a list of names
The website https://www.bundeswahlleiter.de/bundestagswahlen/2021/gewaehlte/bund-99/ lists all German members of parliament in tables, with the first column holding the full names and academic degrees in one character string. A common application of regular expressions in the context of web scraping is to extract substrings of interest from such text.
We will first scrape the data. The members of parliament are divided into 26
subpages, one for each first letter of their last names, in the form “a.html”,
“b.html” and so on. We construct links to all these pages using the base R
object letters
, a character vector conveniently containing all 26 letters of
the alphabet in lower case. We then read the links and extract the table cells
containing the names into a character vector, which is then transformed into a
tibble.
library(rvest)
website <- "https://www.bundeswahlleiter.de/bundestagswahlen/2021/gewaehlte/bund-99/"
links <- str_c(website, letters, ".html")
pages <- links %>%
map(~ {
Sys.sleep(2)
read_html(.x)
})
names <- pages %>%
map(html_elements, css = "th[scope='row']") %>%
map(html_text, trim = TRUE) %>%
unlist()
names_tbl <- tibble(
full_name = names
)
Our goal is to extract the academic degree from strings of full names. We first
need to understand the pattern of how degrees are listed in those strings. Let
us look at some members of parliament with and without degrees. To filter a
tibble by row numbers, we can conveniently use slice()
from dplyr:
names_tbl %>%
slice(186:189)
## # A tibble: 4 × 1
## full_name
## <chr>
## 1 Gramling, Fabian Benedikt Meinrad
## 2 Dr. Gräßle, Ingeborg Gabriele
## 3 Prof. Dr. Grau, Armin Jürgen
## 4 Gremmels, Timon
If the members of parliament carry an academic degree, the string begins with it and the abbreviated titles always end in a dot. Based on this, we can start constructing a regular expression that will be used to detect and extract academic degrees.
We know that if there is a degree, it is listed at the beginning of the string.
We can refer to the beginning and ending of a string by using an anchor.
Writing ^
in a regular expression refers to the beginning of a string, $
to
the ending. Being a predefined class that includes all upper and lower case
letters of the alphabet, we will use [:alpha:]
to refer to letters in general.
"^[:alpha:]"
will select one upper or lower case letter that is the first
character in a string.
We do not want to select only one character but complete titles. For this we
need to use quantifiers. These are used to define how often elements of an
expression are allowed or required to be repeated. If we know the exact number
of repetitions, we can set this number contained in {}
after an element.
"^[:alpha:]{2}"
will select the first two letters in each string; always
exactly two. The actual titles vary in the number of letters they contain. To
allow for variation in the number of repetitions we use the quantifiers ?
, *
and +
. ?
stands for zero or one, *
for zero or multiple and +
for one or
multiple repetitions, multiple meaning \(n >=1\).
"^[:alpha:]+"
will select the word a string begins with. [:alpha:]
does not
contain any whitespace or punctuation characters. Therefore, the regular
expression will begin selecting letters at the beginning of a string until it
“hits” a whitespace or punctuation character. For members of parliament without
a degree, this would select their last name. To make sure we only select titles,
we can make use of the fact that all titles in this data end with a “.”. As the
dot is a special character – standing for any character – we have to escape
it by using two backslashes.
The regular expression "^[:alpha:]+\\."
selects all words at the beginning of
a string that end with a dot. We use it in the function str_extract()
which
will extract the substring that is specified by the regular expression if it
could be matched in the string. We use the function inside of mutate()
to
assign the extracted degrees into a new column. To get an overview of the
result, we group the data by this new column and count the occurrences per
group.
names_tbl %>%
mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.")) %>%
group_by(academic) %>%
summarise(n())
## # A tibble: 3 × 2
## academic `n()`
## <chr> <int>
## 1 Dr. 111
## 2 Prof. 5
## 3 <NA> 620
We successfully identified a high percentage of doctors, as well as some professors. Let us also check, how it worked for the four members of parliament we examined above.
names_tbl %>%
mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.")) %>%
slice(186:189)
## # A tibble: 4 × 2
## full_name academic
## <chr> <chr>
## 1 Gramling, Fabian Benedikt Meinrad <NA>
## 2 Dr. Gräßle, Ingeborg Gabriele Dr.
## 3 Prof. Dr. Grau, Armin Jürgen Prof.
## 4 Gremmels, Timon <NA>
Members of parliament without a degree received an NA
, which is correct. For
observations with one title, it also worked, but for cases with multiple titles
– i.e. “Prof. Dr.”, it did not. But we can extend the expression to allow for
multiple titles as well.
If there are two or more components to a title, they are separated by
whitespace; which we can refer to using \\s
. After the whitespace there will
be another word ending in a dot. Thus we can reuse the first part of the
expression, without the anchor: "\\s[:alpha:]+\\."
. Some members of parliament
have one title, others have multiple. We have to allow both. A quantifier that
follows a part of an expression grouped by enclosure in parentheses will refer
to the complete group. The regular expression "(\\s[:alpha:]+\\.)*"
allows for
no additional title or for multiple ones. Appending this to the regular
expression used in the code above allows us to select and extract one or more
titles.
names_tbl %>%
mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.(\\s[:alpha:]+\\.)*")) %>%
group_by(academic) %>%
summarise(n())
## # A tibble: 3 × 2
## academic `n()`
## <chr> <int>
## 1 Dr. 111
## 2 Prof. Dr. 5
## 3 <NA> 620
Let us confirm that it worked correctly:
names_tbl %>%
mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.(\\s[:alpha:]+\\.)*")) %>%
slice(186:189)
## # A tibble: 4 × 2
## full_name academic
## <chr> <chr>
## 1 Gramling, Fabian Benedikt Meinrad <NA>
## 2 Dr. Gräßle, Ingeborg Gabriele Dr.
## 3 Prof. Dr. Grau, Armin Jürgen Prof. Dr.
## 4 Gremmels, Timon <NA>