13 Regular expressions

In web scraping, we are often dealing with text data that has been collected from a website and may include information that is not easily accessible and usable in data analysis. We are interested either in detecting the occurrence of certain words or in extracting some information from the strings in our data. For these purposes we will use some functions from the core tidyverse package stringr. We will also use regular expressions, a powerful and flexible formal representation of characters that we can use to select substrings by their content.

This chapter will make use of some practical examples, introducing the basics of regular expressions and stringr. This can only be a starting point; if you want to follow the road of web scraping, you will most probably have to deepen your knowledge of regular expressions at some point. I would recommend reading the chapter on strings from “R for Data Science” by Wickham and Grolemund as a next step: https://r4ds.had.co.nz/strings.html, as well as the RStudio Cheat Sheet on stringr: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf.

13.1 Cleaning the district column

We can use the data on Berlin police reports scraped earlier (chapter 10) for our examples.

library(tidyverse)

load("reports.RData")

The data on police reports contain a column with the district each report refers to, which needs some cleaning to be usable. We have already done this in subchapter 11.1.2, but we will use some alternative and more advanced approaches here.

reports %>%
  select(District) %>%
  head(n = 10)
## # A tibble: 10 × 1
##    District                               
##    <chr>                                  
##  1 Ereignisort: Treptow-Köpenick          
##  2 Ereignisort: Charlottenburg-Wilmersdorf
##  3 Ereignisort: berlinweit                
##  4 Ereignisort: Friedrichshain-Kreuzberg  
##  5 Ereignisort: Marzahn-Hellersdorf       
##  6 Ereignisort: Friedrichshain-Kreuzberg  
##  7 Ereignisort: Mitte                     
##  8 Ereignisort: Charlottenburg-Wilmersdorf
##  9 Ereignisort: Neukölln                  
## 10 Ereignisort: berlinweit

One problem we encounter here is that each string begins with “Ereignisort:”. We want to get rid of this substring, as it does not contain any useful information. With str_remove() we can remove substrings from a string. Like all stringr functions we will use here, str_remove() takes the string(s) it is applied to as its first argument. The argument pattern = specifies a regular expression for the pattern that is to be matched and removed. In the code block below, the substring "Ereignisort: " is removed from all values in the “District” column. We are using the most basic form of a regular expression here, which is the exact matching of characters. In general, the complete regular expression is enclosed by ". Please note that the whitespace following : is part of the regular expression.

reports %>%
  mutate(District = str_remove(District, pattern = "Ereignisort: ")) %>%
  select(District) %>% 
  head(n = 10)
## # A tibble: 10 × 1
##    District                  
##    <chr>                     
##  1 Treptow-Köpenick          
##  2 Charlottenburg-Wilmersdorf
##  3 berlinweit                
##  4 Friedrichshain-Kreuzberg  
##  5 Marzahn-Hellersdorf       
##  6 Friedrichshain-Kreuzberg  
##  7 Mitte                     
##  8 Charlottenburg-Wilmersdorf
##  9 Neukölln                  
## 10 berlinweit
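
The role of the trailing whitespace in the pattern can be checked on a single value. The following is a minimal sketch; the vector `district` is a made-up stand-in for one entry of the “District” column.

```r
library(stringr)

# Made-up sample value mimicking one entry of the District column.
district <- "Ereignisort: Mitte"

# With the trailing space in the pattern, only the district name remains.
str_remove(district, pattern = "Ereignisort: ")
# returns "Mitte"

# Without it, the space stays behind as a leading character.
str_remove(district, pattern = "Ereignisort:")
# returns " Mitte"
```

If the pattern does not occur in a string, str_remove() returns the string unchanged.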

We could also keep the substring but translate it to English. For this, we use str_replace() which also matches a regular expression, but replaces it with a string specified in replacement =. Here, the pattern is shortened to "Ereignisort", as we want to keep the ": ".

reports %>%
  mutate(District = str_replace(District, pattern = "Ereignisort", replacement = "District")) %>%
  select(District) %>% 
  head(n = 10)
## # A tibble: 10 × 1
##    District                            
##    <chr>                               
##  1 District: Treptow-Köpenick          
##  2 District: Charlottenburg-Wilmersdorf
##  3 District: berlinweit                
##  4 District: Friedrichshain-Kreuzberg  
##  5 District: Marzahn-Hellersdorf       
##  6 District: Friedrichshain-Kreuzberg  
##  7 District: Mitte                     
##  8 District: Charlottenburg-Wilmersdorf
##  9 District: Neukölln                  
## 10 District: berlinweit

In the same way, we can replace the ": " substring with a dash.

reports %>%
  mutate(District = str_replace(District, pattern = ": ", replacement = "-")) %>%
  select(District) %>% 
  head(n = 10)
## # A tibble: 10 × 1
##    District                              
##    <chr>                                 
##  1 Ereignisort-Treptow-Köpenick          
##  2 Ereignisort-Charlottenburg-Wilmersdorf
##  3 Ereignisort-berlinweit                
##  4 Ereignisort-Friedrichshain-Kreuzberg  
##  5 Ereignisort-Marzahn-Hellersdorf       
##  6 Ereignisort-Friedrichshain-Kreuzberg  
##  7 Ereignisort-Mitte                     
##  8 Ereignisort-Charlottenburg-Wilmersdorf
##  9 Ereignisort-Neukölln                  
## 10 Ereignisort-berlinweit

In chapter 11 we saw district names with two parts that are sometimes written with and sometimes without whitespace around the dash, as seen here:

reports %>%
  group_by(District) %>%
  summarize(reports = n())
## # A tibble: 22 × 2
##    District                                  reports
##    <chr>                                       <int>
##  1 Ereignisort: Charlottenburg - Wilmersdorf     579
##  2 Ereignisort: Charlottenburg-Wilmersdorf      1427
##  3 Ereignisort: Friedrichshain - Kreuzberg       661
##  4 Ereignisort: Friedrichshain-Kreuzberg        1701
##  5 Ereignisort: Lichtenberg                     1232
##  6 Ereignisort: Marzahn - Hellersdorf            253
##  7 Ereignisort: Marzahn-Hellersdorf              881
##  8 Ereignisort: Mitte                           3314
##  9 Ereignisort: Neukölln                        1952
## 10 Ereignisort: Pankow                          1629
## # ℹ 12 more rows

We can easily unify those names using str_replace() and a regular expression that matches the dash together with the whitespace surrounding it.

reports %>%
  mutate(District = str_replace(District, pattern = " - ", replacement = "-")) %>%
  group_by(District) %>%
  summarize(reports = n())
## # A tibble: 16 × 2
##    District                                reports
##    <chr>                                     <int>
##  1 Ereignisort: Charlottenburg-Wilmersdorf    2006
##  2 Ereignisort: Friedrichshain-Kreuzberg      2362
##  3 Ereignisort: Lichtenberg                   1232
##  4 Ereignisort: Marzahn-Hellersdorf           1134
##  5 Ereignisort: Mitte                         3314
##  6 Ereignisort: Neukölln                      1952
##  7 Ereignisort: Pankow                        1629
##  8 Ereignisort: Reinickendorf                 1193
##  9 Ereignisort: Spandau                       1170
## 10 Ereignisort: Steglitz-Zehlendorf           1045
## 11 Ereignisort: Tempelhof-Schöneberg          1737
## 12 Ereignisort: Treptow-Köpenick              1346
## 13 Ereignisort: berlinweit                     389
## 14 Ereignisort: bezirksübergreifend            953
## 15 Ereignisort: bundesweit                      63
## 16 <NA>                                        427
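
The cleaning steps shown so far can also be chained. The sketch below assumes a small made-up vector `district` in place of the real column and simply applies the two calls one after the other; it is one possible way to combine them, not the only one.

```r
library(stringr)

# Made-up sample values mimicking the District column, including a missing one.
district <- c(
  "Ereignisort: Marzahn - Hellersdorf",
  "Ereignisort: Marzahn-Hellersdorf",
  "Ereignisort: Mitte",
  NA
)

# Step 1: drop the constant prefix, including the trailing space.
cleaned <- str_remove(district, pattern = "Ereignisort: ")
# Step 2: unify the two spellings of the dash.
cleaned <- str_replace(cleaned, pattern = " - ", replacement = "-")

# Both spellings now collapse into one value; NA stays NA.
unique(cleaned)
# returns "Marzahn-Hellersdorf" "Mitte" NA
```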

If we instead want to remove the whitespace altogether and unify the names in the process, we can use str_remove() and the regular expression "\\s", a shortcut that matches all kinds of whitespace.

At this point we have to talk about escape characters. Some characters used in regular expressions have a special meaning, e.g. ".", "?" or "*". If we just write one of those characters in a regular expression, it will not be interpreted as a literal character but by its special meaning. "." used in a regular expression, for example, stands for any character. If we want to match a literal . in a string, we have to escape the dot with the escape character \, which tells the regular expression engine to take the next character literally. Because the backslash is also the escape character in R strings, we actually have to write two of them so that one backslash reaches the regular expression. So to match a dot, we write "\\.".

If this seems confusing and needlessly complicated, save your rage for when you find out what you have to write to match one literal backslash in a string: "\\\\". Some commentary on this: https://xkcd.com/1638/

Escape characters are hard to understand when starting out using regular expressions, but for now it is enough to know that some characters have to be escaped and that you find a full list of those in the cheat sheet linked above.
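
A small sketch may help to see the escaping rules in action. The vector `x` is made up for illustration; the third element contains one literal backslash, which has to be written as \\ in R source code.

```r
library(stringr)

# Made-up strings; "C:\\temp" contains exactly one backslash character.
x <- c("file.txt", "fileXtxt", "C:\\temp")

# An unescaped dot matches any character, so both file names match.
str_detect(x, pattern = "file.txt")
# returns TRUE TRUE FALSE

# Escaping the dot restricts the match to a literal ".".
str_detect(x, pattern = "file\\.txt")
# returns TRUE FALSE FALSE

# Four backslashes in R source code match one literal backslash.
str_detect(x, pattern = "\\\\")
# returns FALSE FALSE TRUE

# The R string "\\." consists of only two characters: a backslash and a dot.
nchar("\\.")
# returns 2
```

The last call shows why two backslashes are needed: R itself consumes one of them when reading the string, and the remaining one is passed on to the regular expression.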

In a similar way, we have to use two escape characters in "\\s" to make R interpret it not as an “s” but as the shortcut for whitespace characters we want to use:

reports %>%
  mutate(District = str_remove(District, pattern = "\\s")) %>%
  group_by(District) %>%
  summarize(reports = n())
## # A tibble: 22 × 2
##    District                                 reports
##    <chr>                                      <int>
##  1 Ereignisort:Charlottenburg - Wilmersdorf     579
##  2 Ereignisort:Charlottenburg-Wilmersdorf      1427
##  3 Ereignisort:Friedrichshain - Kreuzberg       661
##  4 Ereignisort:Friedrichshain-Kreuzberg        1701
##  5 Ereignisort:Lichtenberg                     1232
##  6 Ereignisort:Marzahn - Hellersdorf            253
##  7 Ereignisort:Marzahn-Hellersdorf              881
##  8 Ereignisort:Mitte                           3314
##  9 Ereignisort:Neukölln                        1952
## 10 Ereignisort:Pankow                          1629
## # ℹ 12 more rows

This did not work as expected. The whitespace following the ":" was removed, but not the whitespace around the dashes. This is because most stringr functions stop after they have found the first match for the pattern in a string. There are, however, versions of these functions that do not stop after the first match; their names end in _all, e.g. str_replace_all(). So if we use str_remove_all() in our code, all occurrences of whitespace will be removed.

reports %>%
  mutate(District = str_remove_all(District, pattern = "\\s")) %>%
  group_by(District) %>%
  summarize(reports = n())
## # A tibble: 16 × 2
##    District                               reports
##    <chr>                                    <int>
##  1 Ereignisort:Charlottenburg-Wilmersdorf    2006
##  2 Ereignisort:Friedrichshain-Kreuzberg      2362
##  3 Ereignisort:Lichtenberg                   1232
##  4 Ereignisort:Marzahn-Hellersdorf           1134
##  5 Ereignisort:Mitte                         3314
##  6 Ereignisort:Neukölln                      1952
##  7 Ereignisort:Pankow                        1629
##  8 Ereignisort:Reinickendorf                 1193
##  9 Ereignisort:Spandau                       1170
## 10 Ereignisort:Steglitz-Zehlendorf           1045
## 11 Ereignisort:Tempelhof-Schöneberg          1737
## 12 Ereignisort:Treptow-Köpenick              1346
## 13 Ereignisort:berlinweit                     389
## 14 Ereignisort:bezirksübergreifend            953
## 15 Ereignisort:bundesweit                      63
## 16 <NA>                                       427
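
The difference between the two variants is easiest to see on a single string. This is a minimal sketch using one made-up value in the object `district`.

```r
library(stringr)

# Made-up sample value with three whitespace characters.
district <- "Ereignisort: Marzahn - Hellersdorf"

# Stops after the first match: only the space after the colon is removed.
str_remove(district, pattern = "\\s")
# returns "Ereignisort:Marzahn - Hellersdorf"

# Keeps going: every whitespace character is removed.
str_remove_all(district, pattern = "\\s")
# returns "Ereignisort:Marzahn-Hellersdorf"

# str_count() reports how many matches there are in total.
str_count(district, pattern = "\\s")
# returns 3
```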

13.2 Detecting robberies in police reports

When collecting the data on police reports, we also scraped the short description of the report’s contents, but we did no analysis on these in chapters 11 & 12. We could try to categorize the texts into types of crimes using regular expressions. Doing this properly would require a lot of planning. We would have to decide which categories of crimes we want to extract and construct flexible regular expressions for each of these. Please understand the following merely as a first outline for how to approach such a project.

Let us focus solely on robberies. To begin with, we want to detect all strings that contain the term “raub” in some form. str_detect() determines whether a pattern was found in a string, returning TRUE or FALSE. We can then count the number of strings for which the result is TRUE using sum(); remember that TRUE is equal to \(1\) and FALSE equal to \(0\).

reports$Report %>%
  str_detect(pattern = "raub") %>%
  sum()
## [1] 522

But what if a text begins with “Raub”? In this case the “R” will be upper case. Pattern matching is case sensitive, which means that “Raub” is not equal to “raub”. If we want both ways of writing the term to be counted as equal, we have to write a regular expression that allows for either an upper or lower case “r” as the first character. For this we can define a class which contains both characters, enclosed by []. "[Rr]aub" means that we want to find either an “R” or an “r” as the first character and “aub” after this; only one of the characters from the class is matched.

reports$Report %>%
  str_detect(pattern = "[Rr]aub") %>%
  sum()
## [1] 1166

If we also want to detect terms like “Räuber” or “räuberische”, we have to allow for an “ä” instead of the “a” as the second character by defining a class for the second position in the term:

reports$Report %>%
  str_detect(pattern = "[rR][aä]ub") %>%
  sum()
## [1] 1498
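
To see which strings each of the three patterns picks up, we can apply them to a few sample headlines. This is a sketch under assumptions: the vector `headlines` is made up for illustration, loosely modelled on the report titles. The last call uses regex(ignore_case = TRUE) from stringr as an alternative to writing out a class for upper and lower case; it is not the approach used in this chapter.

```r
library(stringr)

# Made-up headlines, loosely modelled on the scraped report titles.
headlines <- c(
  "Raub im Juweliergeschäft",
  "Tankstelle beraubt",
  "Räuber festgenommen",
  "Fahrraddiebstahl"
)

# Case sensitive: only the lower case "raub" in "beraubt" is found.
str_detect(headlines, pattern = "raub")
# returns FALSE TRUE FALSE FALSE

# A class for the first character also finds "Raub".
str_detect(headlines, pattern = "[Rr]aub")
# returns TRUE TRUE FALSE FALSE

# A second class for the vowel also finds "Räuber".
str_detect(headlines, pattern = "[Rr][aä]ub")
# returns TRUE TRUE TRUE FALSE

# Alternative: ignore case for the whole pattern instead of using a class.
str_detect(headlines, pattern = regex("raub", ignore_case = TRUE))
# returns TRUE TRUE FALSE FALSE
```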

Let us have a look at some of the strings which returned a TRUE match. We use str_detect() inside filter(), telling R that we want to see only those rows from reports for which the pattern could be found, and only the first \(10\) of those.

reports %>%
  filter(str_detect(Report, pattern = "[Rr][aä]ub")) %>% 
  select(Report) %>% 
  head(n=10)
## # A tibble: 10 × 1
##    Report                                                              
##    <chr>                                                               
##  1 Festnahme nach räuberischem Diebstahl                               
##  2 Raub im Juweliergeschäft                                            
##  3 Räuber festgenommen                                                 
##  4 Seniorin in ihrer Wohnung beraubt und schwer verletzt zurückgelassen
##  5 Tankstelle beraubt                                                  
##  6 Gemeinschaftlicher Raub in Wohnung – Wer kennt diesen Mann?         
##  7 Raub auf Tankstelle - Wer kennt diesen Mann?                        
##  8 Nachtportier beraubt                                                
##  9 Ausgeraubt und mit Pistole bedroht                                  
## 10 Raub auf Hotel - Rezeptionist wird bedroht und verletzt

This looks good at a first glance, but consider this result:

reports %>%
  filter(str_detect(Report, pattern = "[Rr][aä]ub")) %>% 
  filter(Date == "18.06.2021 10:20 Uhr") %>% 
  select(Report)
## # A tibble: 1 × 1
##   Report                                                                        
##   <chr>                                                                         
## 1 "Mit dem Rettungshubschrauber ins Krankenhaus – Motorradfahrer schwer verletz…

Why was a report on a motorcycle accident included? Because the word “Rettungshubschrauber” also has “raub” in it. You see how reliably detecting a simple string such as “raub” gets complicated very quickly.
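
One way to track down such false positives is to extract the complete word that contains the match. The sketch below assumes a shortened version of the report title in the object `report` and surrounds the pattern with [:alpha:]*, i.e. any number of letters on either side; this is a diagnostic aid rather than part of the final expression.

```r
library(stringr)

# Shortened version of the report title shown above.
report <- "Mit dem Rettungshubschrauber ins Krankenhaus"

# The pattern is found somewhere in the string ...
str_detect(report, pattern = "[Rr][aä]ub")
# returns TRUE

# ... and extending it by optional letters on both sides reveals the word.
str_extract(report, pattern = "[:alpha:]*[Rr][aä]ub[:alpha:]*")
# returns "Rettungshubschrauber"
```

Applied to all matching reports, the same extraction would give a list of matched words that can be scanned for unwanted hits.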

At this point, we would have to put in the hours and define a list of phrases referring to robberies that we actually want to select while constructing regular expressions that select those phrases and only those. This is beyond the scope of this introductory example, but as a first approximation we could say that we want to find all instances in which a capital “R” is directly followed by “aub” or “äub”, as in “Raub” or “Räuber”, or where “raub” or “räub” is prefixed with “ge”, “be” or “zu”, as in “geraubt”, “beraubt”, or “auszurauben”. To achieve this, the regular expression has to allow for four alternative substrings, using grouping and the “OR” operator |. Groups are defined by enclosing several expressions in parentheses. While groups have advanced functionality, for now you can interpret them like parentheses in mathematics. The enclosed expression is evaluated first and as a whole. If one of the four alternatives “R”, “ger”, “ber” or “zur”, connected by |, is found in a string and is then followed by either “a” or “ä” and “ub” after this, the pattern is matched.

reports$Report %>%
  str_detect(pattern = "(R|ger|ber|zur)[aä]ub") %>%
  sum()
## [1] 1294
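
The effect of the grouping can be checked on a few sample strings. The following is a sketch; the vector `headlines` is made up for illustration, with some entries taken from the output above. The last call shows what would happen without the parentheses: the | operator would then separate four complete alternatives, and the single letter “R” alone would already count as a match.

```r
library(stringr)

# Made-up headlines; the last one is the false positive from above.
headlines <- c(
  "Raub auf Tankstelle",
  "Räuber festgenommen",
  "Ausgeraubt und mit Pistole bedroht",
  "Nachtportier beraubt",
  "Versuch, Kiosk auszurauben",
  "Mit dem Rettungshubschrauber ins Krankenhaus"
)

# Grouped alternatives: one of the four prefixes, then "aub" or "äub".
str_detect(headlines, pattern = "(R|ger|ber|zur)[aä]ub")
# returns TRUE TRUE TRUE TRUE TRUE FALSE

# Without the group, "R" on its own is one of the alternatives,
# so the helicopter report is matched again.
str_detect(headlines[6], pattern = "R|ger|ber|zur[aä]ub")
# returns TRUE
```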

There are other words in the German language that are used to describe the concept of robberies. Adding all of them and the several individual pre- and/or suffixes that have to be taken into account to reliably select the terms we want to select and none of those we do not want to select, is beyond the scope of this example. But we can at least add one further word to the regular expression. Using the | operator we extend the regular expression to also detect “dieb” with an upper or lower case “d”. The expression now selects all ways of writing “raub” and “dieb” we allowed for.

reports$Report %>%
  str_detect(pattern = "(R|ger|ber|zur)[aä]ub|[Dd]ieb") %>%
  sum()
## [1] 2061

Now that we have constructed a provisional regular expression that can detect reports that are related to robberies, we can also compute their relative proportion. Remember, we are dealing with a logical vector of ones and zeros: sum() counts all ones and thus returns the absolute number of matching reports, while mean() divides the number of ones by the number of all reports, thus returning the relative proportion.

reports$Report %>%
  str_detect(pattern = "(R|ger|ber|zur)[aä]ub|[Dd]ieb") %>%
  mean()
## [1] 0.09388666

So about \(9.4\)% of police reports deal with robberies if we apply the regular expression constructed above. Remember that we might still match some words that we do not want to match, while missing others that also relate to robberies. Thus the actual percentage may differ.
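
The relationship between sum() and mean() on a logical vector can be verified with a tiny example. This is a sketch with a made-up vector `hits`, standing in for the result of str_detect().

```r
# Made-up result of a detection step: two matches among four strings.
hits <- c(TRUE, FALSE, FALSE, TRUE)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of matches.
sum(hits)
# returns 2

# mean() divides the number of matches by the length of the vector.
mean(hits)
# returns 0.5

# Both views are consistent with each other.
sum(hits) / length(hits) == mean(hits)
# returns TRUE
```

Note that str_detect() returns NA for missing strings; if a column contains missing values, sum() and mean() only return a number when called with na.rm = TRUE.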

13.3 Extracting details from a list of names

The website https://www.bundeswahlleiter.de/bundestagswahlen/2021/gewaehlte/bund-99/ lists all German members of parliament in tables, with the first column holding the full names and academic degrees in one character string. A common application of regular expressions in the context of web scraping is to extract substrings of interest from such text.

We will first scrape the data. The members of parliament are divided into 26 subpages, one for each first letter of their last names, in the form “a.html”, “b.html” and so on. We construct links to all these pages using the base R object letters, a character vector conveniently containing all 26 letters of the alphabet in lower case. We then read the links and extract the table cells containing the names into a character vector, which is then transformed into a tibble.

library(rvest)

website <- "https://www.bundeswahlleiter.de/bundestagswahlen/2021/gewaehlte/bund-99/"
links <- str_c(website, letters, ".html")

pages <- links %>%
  map(~ {
    Sys.sleep(2)
    read_html(.x)
  })

names <- pages %>%
  map(html_elements, css = "th[scope='row']") %>%
  map(html_text, trim = TRUE) %>%
  unlist()

names_tbl <- tibble(
  full_name = names
)

Our goal is to extract the academic degree from strings of full names. We first need to understand the pattern of how degrees are listed in those strings. Let us look at some members of parliament with and without degrees. To filter a tibble by row numbers, we can conveniently use slice() from dplyr:

names_tbl %>% 
  slice(186:189)
## # A tibble: 4 × 1
##   full_name                        
##   <chr>                            
## 1 Gramling, Fabian Benedikt Meinrad
## 2 Dr. Gräßle, Ingeborg Gabriele    
## 3 Prof. Dr. Grau, Armin Jürgen     
## 4 Gremmels, Timon

If the members of parliament carry an academic degree, the string begins with it and the abbreviated titles always end in a dot. Based on this, we can start constructing a regular expression that will be used to detect and extract academic degrees.

We know that if there is a degree, it is listed at the beginning of the string. We can refer to the beginning and ending of a string by using an anchor. Writing ^ in a regular expression refers to the beginning of a string, $ to the ending. To refer to letters in general, we will use [:alpha:], a predefined class that includes all upper and lower case letters of the alphabet. "^[:alpha:]" will select one upper or lower case letter that is the first character in a string.
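
As a quick illustration of the two anchors, the sketch below applies them to two of the names shown above, stored in the hypothetical vector `full_name`.

```r
library(stringr)

# Two names taken from the rows shown above.
full_name <- c("Dr. Gräßle, Ingeborg Gabriele", "Gremmels, Timon")

# ^ anchors the match to the beginning: the first letter of each string.
str_extract(full_name, pattern = "^[:alpha:]")
# returns "D" "G"

# $ anchors the match to the ending: the last letter of each string.
str_extract(full_name, pattern = "[:alpha:]$")
# returns "e" "n"
```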

We do not want to select only one character but complete titles. For this we need to use quantifiers. These are used to define how often elements of an expression are allowed or required to be repeated. If we know the exact number of repetitions, we can set this number, enclosed in {}, after an element. "^[:alpha:]{2}" will select the first two letters in each string; always exactly two. The actual titles vary in the number of letters they contain. To allow for variation in the number of repetitions we use the quantifiers ?, * and +. ? stands for zero or one, * for zero or more and + for one or more repetitions, so + requires \(n \geq 1\) repetitions while * also allows \(n = 0\).
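
The quantifiers can be compared side by side on a deliberately artificial example. In this sketch the made-up vector `x` contains an “a” followed by zero to three “b”s, and each pattern puts a different quantifier on the “b”.

```r
library(stringr)

# Artificial strings: "a" followed by zero, one, two or three "b"s.
x <- c("a", "ab", "abb", "abbb")

# ? allows zero or one "b".
str_extract(x, pattern = "ab?")
# returns "a" "ab" "ab" "ab"

# * allows zero or more "b"s.
str_extract(x, pattern = "ab*")
# returns "a" "ab" "abb" "abbb"

# + requires at least one "b", so the first string has no match.
str_extract(x, pattern = "ab+")
# returns NA "ab" "abb" "abbb"

# {2} requires exactly two "b"s after the "a".
str_extract(x, pattern = "ab{2}")
# returns NA NA "abb" "abb"
```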

"^[:alpha:]+" will select the word a string begins with. [:alpha:] does not contain any whitespace or punctuation characters. Therefore, the regular expression will begin selecting letters at the beginning of a string until it “hits” a whitespace or punctuation character. For members of parliament without a degree, this would select their last name. To make sure we only select titles, we can make use of the fact that all titles in this data end with a “.”. As the dot is a special character – standing for any character – we have to escape it by using two backslashes.

The regular expression "^[:alpha:]+\\." selects all words at the beginning of a string that end with a dot. We use it in the function str_extract() which will extract the substring that is specified by the regular expression if it could be matched in the string. We use the function inside of mutate() to assign the extracted degrees into a new column. To get an overview of the result, we group the data by this new column and count the occurrences per group.

names_tbl %>%
  mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.")) %>%
  group_by(academic) %>%
  summarise(n())
## # A tibble: 3 × 2
##   academic `n()`
##   <chr>    <int>
## 1 Dr.        111
## 2 Prof.        5
## 3 <NA>       620

We successfully identified 111 members of parliament whose names begin with “Dr.”, as well as 5 whose names begin with “Prof.”. Let us also check how it worked for the four members of parliament we examined above.

names_tbl %>%
  mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.")) %>%
  slice(186:189)
## # A tibble: 4 × 2
##   full_name                         academic
##   <chr>                             <chr>   
## 1 Gramling, Fabian Benedikt Meinrad <NA>    
## 2 Dr. Gräßle, Ingeborg Gabriele     Dr.     
## 3 Prof. Dr. Grau, Armin Jürgen      Prof.   
## 4 Gremmels, Timon                   <NA>

Members of parliament without a degree received an NA, which is correct. For observations with one title, it also worked, but for cases with multiple titles, e.g. “Prof. Dr.”, only the first title was extracted. We can extend the expression to allow for multiple titles as well.

If there are two or more components to a title, they are separated by whitespace, which we can refer to using \\s. After the whitespace there will be another word ending in a dot. Thus we can reuse the first part of the expression, without the anchor: "\\s[:alpha:]+\\.". Some members of parliament have one title, others have multiple. We have to allow for both. A quantifier that follows a part of an expression grouped by enclosure in parentheses will refer to the complete group. The regular expression "(\\s[:alpha:]+\\.)*" allows for no additional title or for multiple ones. Appending this to the regular expression used in the code above allows us to select and extract one or more titles.

names_tbl %>%
  mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.(\\s[:alpha:]+\\.)*")) %>%
  group_by(academic) %>%
  summarise(n())
## # A tibble: 3 × 2
##   academic  `n()`
##   <chr>     <int>
## 1 Dr.         111
## 2 Prof. Dr.     5
## 3 <NA>        620

Let us confirm that it worked correctly:

names_tbl %>%
  mutate(academic = str_extract(full_name, pattern = "^[:alpha:]+\\.(\\s[:alpha:]+\\.)*")) %>%
  slice(186:189)
## # A tibble: 4 × 2
##   full_name                         academic 
##   <chr>                             <chr>    
## 1 Gramling, Fabian Benedikt Meinrad <NA>     
## 2 Dr. Gräßle, Ingeborg Gabriele     Dr.      
## 3 Prof. Dr. Grau, Armin Jürgen      Prof. Dr.
## 4 Gremmels, Timon                   <NA>
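
As a possible follow-up, the same expression could be reused to separate the titles from the names. The sketch below is an assumption about how one might proceed, not part of the analysis above: it works on three of the names shown earlier, stores the expression in the hypothetical object `title_pattern`, and appends "\\s" with str_c() so that the whitespace after the last title is removed as well.

```r
library(stringr)

# Three of the names shown above.
full_name <- c(
  "Gramling, Fabian Benedikt Meinrad",
  "Dr. Gräßle, Ingeborg Gabriele",
  "Prof. Dr. Grau, Armin Jürgen"
)

# The expression constructed in this section, stored for reuse.
title_pattern <- "^[:alpha:]+\\.(\\s[:alpha:]+\\.)*"

# Extract one or more titles; names without a title yield NA.
str_extract(full_name, pattern = title_pattern)
# returns NA "Dr." "Prof. Dr."

# Remove the titles and the whitespace that follows them.
str_remove(full_name, pattern = str_c(title_pattern, "\\s"))
# returns "Gramling, Fabian Benedikt Meinrad" "Gräßle, Ingeborg Gabriele" "Grau, Armin Jürgen"
```

Names without a title are left unchanged by str_remove(), because the pattern cannot be matched in them.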