11 Transformation with dplyr

In this chapter, we will use several functions from the tidyverse package dplyr to clean the data we scraped in chapter 10. We will then select the columns and filter the observations (rows) we need, as well as compute some grouped summaries of the data.

First, let us load the required packages. dplyr is part of the core tidyverse; the same is true for the package lubridate, which will enable us to deal with dates and times.

library(tidyverse)

11.1 `mutate()`

11.1.1 Computing new variables from existing ones

Before we continue working on our scraped reports data, we will look at some simple examples that will show how new variables can be computed in a tibble from existing ones, using mutate() from dplyr.

For this purpose, let us create a new simple tibble on voter turnout. Note that the entered data is purely illustrational and has no meaning.

exmpl_tbl <- tibble(
  ybirth = c("1993 - 2002", "1983 - 1992", "1973 - 1982", "1963 - 1972"),
  eligible = c(100000, 100000, 100000, 100000),
  turnout_person = c(40000, 50000, 60000, 70000),
  turnout_mail = c (35000, 30000, 25000, 20000)
)

exmpl_tbl
## # A tibble: 4 × 4
##   ybirth      eligible turnout_person turnout_mail
##   <chr>          <dbl>          <dbl>        <dbl>
## 1 1993 - 2002   100000          40000        35000
## 2 1983 - 1992   100000          50000        30000
## 3 1973 - 1982   100000          60000        25000
## 4 1963 - 1972   100000          70000        20000

Here we have different columns for people who were eligible to vote, who voted in person and who voted by mail, by their year of birth. We do not actually care about the difference in the voting method in this example and want one column that combines both. We can get there by using mutate(). The function takes the data to be manipulated as its first argument, followed by one or multiple arguments defining the new columns to be created. We can create this new columns as computations from the columns already present.

To calculate the total turnout, we write:

exmpl_tbl %>% 
  mutate(turnout = turnout_person + turnout_mail)
## # A tibble: 4 × 5
##   ybirth      eligible turnout_person turnout_mail turnout
##   <chr>          <dbl>          <dbl>        <dbl>   <dbl>
## 1 1993 - 2002   100000          40000        35000   75000
## 2 1983 - 1992   100000          50000        30000   80000
## 3 1973 - 1982   100000          60000        25000   85000
## 4 1963 - 1972   100000          70000        20000   90000

We can also immediately start calculating with new columns in the same mutate() command. For this, we chain multiple operations in mutate() separated with ,. To calculate the turnout percentage in a second step, we write the following:

exmpl_tbl %>% 
  mutate(
    turnout = turnout_person + turnout_mail,
    turnout_pct = turnout / eligible
  )
## # A tibble: 4 × 6
##   ybirth      eligible turnout_person turnout_mail turnout turnout_pct
##   <chr>          <dbl>          <dbl>        <dbl>   <dbl>       <dbl>
## 1 1993 - 2002   100000          40000        35000   75000        0.75
## 2 1983 - 1992   100000          50000        30000   80000        0.8 
## 3 1973 - 1982   100000          60000        25000   85000        0.85
## 4 1963 - 1972   100000          70000        20000   90000        0.9

Note that mutate() is not limited to basic arithmetic operations. Many functions can be applied within mutate(): sum() or mean() are two examples we already know.

11.1.2 Cleaning the data on police reports

Now knowing the basic principle of working with mutate(), we can apply this to the data on police reports we scraped in chapter 10.

First we have to load the data and examine what we are working with.

load("reports.RData")

reports
## # A tibble: 12,412 × 3
##    Date                 Report                                          District
##    <chr>                <chr>                                           <chr>   
##  1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichverletzungen  Ereigni…
##  2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen                    Ereigni…
##  3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht                   Ereigni…
##  4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt             Ereigni…
##  5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Ladendetektiv b… Ereigni…
##  6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgänger im Kra… Ereigni…
##  7 06.03.2025 10:43 Uhr Auseinandersetzung im Park                      Ereigni…
##  8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt              Ereigni…
##  9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunfall verletzt Ereigni…
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert sich volksver… Ereigni…
## # ℹ 12,402 more rows

We can leave the column containing the text of the police report as it is, but the columns containing the districts, as well as the date and time, could use some work.

The data in the column “District” includes the string “Ereignisort:” before listing the actual name of the district the report refers to. This unnecessarily clutters this column: we can use substr() to remove these leading characters within mutate(). substr() extracts a part of a string we want to keep, taking the data to be applied to, as its first argument – here the name of the column –, the starting character position where the extraction shall begin as its second, and the character position where it shall stop, as its third argument. Character position here refers to the numeric position of a character in a string: to begin extraction after “Ereignisort:” we have to count the length of this substring including the whitespace after “:”, which is \(13\) and thus use \(14\) as the starting position. For the position to stop extraction, we could either use an unrealistically high number or use str_length(District) to determine the exact length of each string.

reports %>% 
  mutate(District = substr(District, 14, 99))
## # A tibble: 12,412 × 3
##    Date                 Report                                          District
##    <chr>                <chr>                                           <chr>   
##  1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichverletzungen  Mitte   
##  2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen                    Reinick…
##  3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht                   Lichten…
##  4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt             Mitte   
##  5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Ladendetektiv b… Pankow  
##  6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgänger im Kra… Friedri…
##  7 06.03.2025 10:43 Uhr Auseinandersetzung im Park                      Pankow  
##  8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt              Charlot…
##  9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunfall verletzt Mitte   
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert sich volksver… Charlot…
## # ℹ 12,402 more rows

reports %>% 
  mutate(District = substr(District, 14, str_length(District)))
## # A tibble: 12,412 × 3
##    Date                 Report                                          District
##    <chr>                <chr>                                           <chr>   
##  1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichverletzungen  Mitte   
##  2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen                    Reinick…
##  3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht                   Lichten…
##  4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt             Mitte   
##  5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Ladendetektiv b… Pankow  
##  6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgänger im Kra… Friedri…
##  7 06.03.2025 10:43 Uhr Auseinandersetzung im Park                      Pankow  
##  8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt              Charlot…
##  9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunfall verletzt Mitte   
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert sich volksver… Charlot…
## # ℹ 12,402 more rows

The column “Date” includes the date and the time of each report as a character string. To be able to use this data in analysis, we have to extract the date and time in a format R can understand. This is easily achieved using the parsing functions from lubridate. As the character data are written in the format “day.month.year hour:minute”, we have to use the function dmy_hm() on the column.

reports %>% 
  mutate(date_cmpl = dmy_hm(Date))
## # A tibble: 12,412 × 4
##    Date                 Report                      District date_cmpl          
##    <chr>                <chr>                       <chr>    <dttm>             
##  1 06.03.2025 15:01 Uhr Heranwachsender mit mehrer… Ereigni… 2025-03-06 15:01:00
##  2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenomm… Ereigni… 2025-03-06 14:57:00
##  3 06.03.2025 14:49 Uhr Mehrere Brände in einer Na… Ereigni… 2025-03-06 14:49:00
##  4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall… Ereigni… 2025-03-06 11:40:00
##  5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen… Ereigni… 2025-03-06 10:49:00
##  6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzun… Ereigni… 2025-03-06 10:46:00
##  7 06.03.2025 10:43 Uhr Auseinandersetzung im Park  Ereigni… 2025-03-06 10:43:00
##  8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb … Ereigni… 2025-03-06 09:52:00
##  9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Ve… Ereigni… 2025-03-06 09:21:00
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und … Ereigni… 2025-03-05 16:53:00
## # ℹ 12,402 more rows

At a later point we will do some analysis by year, month, weekday and time of day. lubridate includes functions that extract the subparts of date and time data.

year() extracts the year, month() the month and wday() the day of the week. The argument label = TRUE tells the function to use the names of months and days instead of a numerical value ranging from \(1\) to \(12\) and \(1\) to \(7\) respectively.

reports %>% 
  mutate(date_cmpl = dmy_hm(Date),
         year = year(date_cmpl),
         month = month(date_cmpl, label = TRUE),
         day = wday(date_cmpl, label = TRUE)
         )
## # A tibble: 12,412 × 7
##    Date                 Report    District date_cmpl            year month day  
##    <chr>                <chr>     <chr>    <dttm>              <dbl> <ord> <ord>
##  1 06.03.2025 15:01 Uhr Heranwac… Ereigni… 2025-03-06 15:01:00  2025 Mär   Do   
##  2 06.03.2025 14:57 Uhr Kokstaxi… Ereigni… 2025-03-06 14:57:00  2025 Mär   Do   
##  3 06.03.2025 14:49 Uhr Mehrere … Ereigni… 2025-03-06 14:49:00  2025 Mär   Do   
##  4 06.03.2025 11:40 Uhr Tourist … Ereigni… 2025-03-06 11:40:00  2025 Mär   Do   
##  5 06.03.2025 10:49 Uhr Hausfrie… Ereigni… 2025-03-06 10:49:00  2025 Mär   Do   
##  6 06.03.2025 10:46 Uhr Verkehrs… Ereigni… 2025-03-06 10:46:00  2025 Mär   Do   
##  7 06.03.2025 10:43 Uhr Auseinan… Ereigni… 2025-03-06 10:43:00  2025 Mär   Do   
##  8 06.03.2025 09:52 Uhr Mitarbei… Ereigni… 2025-03-06 09:52:00  2025 Mär   Do   
##  9 06.03.2025 09:21 Uhr Motorrol… Ereigni… 2025-03-06 09:21:00  2025 Mär   Do   
## 10 05.03.2025 16:53 Uhr Mann zei… Ereigni… 2025-03-05 16:53:00  2025 Mär   Mi   
## # ℹ 12,402 more rows

To extract the time of day, we first have to extract the substring of “Date” that contains the time and then apply the lubridate function hm() to it.

reports %>% 
  mutate(time = 
           substr(Date, 12, str_length(Date)) %>% 
           hm()
         )
## # A tibble: 12,412 × 4
##    Date                 Report                               District time      
##    <chr>                <chr>                                <chr>    <Period>  
##  1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichv… Ereigni… 15H 1M 0S 
##  2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen         Ereigni… 14H 57M 0S
##  3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht        Ereigni… 14H 49M 0S
##  4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt  Ereigni… 11H 40M 0S
##  5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Lade… Ereigni… 10H 49M 0S
##  6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgä… Ereigni… 10H 46M 0S
##  7 06.03.2025 10:43 Uhr Auseinandersetzung im Park           Ereigni… 10H 43M 0S
##  8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt   Ereigni… 9H 52M 0S 
##  9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunf… Ereigni… 9H 21M 0S 
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert si… Ereigni… 16H 53M 0S
## # ℹ 12,402 more rows

We can combine all these individual steps in one mutate() command and apply them to the object reports.

reports <- reports %>% 
  mutate(District = substr(District, 14, str_length(District)),
         date_cmpl = dmy_hm(Date),
         year = year(date_cmpl),
         month = month(date_cmpl, label = TRUE),
         day = wday(date_cmpl, label = TRUE),
         time = 
           substr(Date, 12, str_length(Date)) %>% 
           hm()
         )

reports
## # A tibble: 12,412 × 8
##    Date         Report District date_cmpl            year month day   time      
##    <chr>        <chr>  <chr>    <dttm>              <dbl> <ord> <ord> <Period>  
##  1 06.03.2025 … Heran… Mitte    2025-03-06 15:01:00  2025 Mär   Do    15H 1M 0S 
##  2 06.03.2025 … Kokst… Reinick… 2025-03-06 14:57:00  2025 Mär   Do    14H 57M 0S
##  3 06.03.2025 … Mehre… Lichten… 2025-03-06 14:49:00  2025 Mär   Do    14H 49M 0S
##  4 06.03.2025 … Touri… Mitte    2025-03-06 11:40:00  2025 Mär   Do    11H 40M 0S
##  5 06.03.2025 … Hausf… Pankow   2025-03-06 10:49:00  2025 Mär   Do    10H 49M 0S
##  6 06.03.2025 … Verke… Friedri… 2025-03-06 10:46:00  2025 Mär   Do    10H 46M 0S
##  7 06.03.2025 … Ausei… Pankow   2025-03-06 10:43:00  2025 Mär   Do    10H 43M 0S
##  8 06.03.2025 … Mitar… Charlot… 2025-03-06 09:52:00  2025 Mär   Do    9H 52M 0S 
##  9 06.03.2025 … Motor… Mitte    2025-03-06 09:21:00  2025 Mär   Do    9H 21M 0S 
## 10 05.03.2025 … Mann … Charlot… 2025-03-05 16:53:00  2025 Mär   Mi    16H 53M 0S
## # ℹ 12,402 more rows

11.2 `select()`

As we have already extracted all we need from the “Date” and “date_cmpl” columns, and we will not work with the “Report” column in this chapter, to keep the tibble neat and free of clutter, we remove the three columns by using the function select(). select() is used for selecting columns from a tibbble. The function takes the data as its first argument – here provided by the pipe – and one or several names of columns that should be kept. If columns follow after each other in order, we can also use a “from:to” notation.

reports %>% 
  select(District, year, month, day, time)
## # A tibble: 12,412 × 5
##    District                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Mitte                       2025 Mär   Do    15H 1M 0S 
##  2 Reinickendorf               2025 Mär   Do    14H 57M 0S
##  3 Lichtenberg                 2025 Mär   Do    14H 49M 0S
##  4 Mitte                       2025 Mär   Do    11H 40M 0S
##  5 Pankow                      2025 Mär   Do    10H 49M 0S
##  6 Friedrichshain-Kreuzberg    2025 Mär   Do    10H 46M 0S
##  7 Pankow                      2025 Mär   Do    10H 43M 0S
##  8 Charlottenburg-Wilmersdorf  2025 Mär   Do    9H 52M 0S 
##  9 Mitte                       2025 Mär   Do    9H 21M 0S 
## 10 Charlottenburg-Wilmersdorf  2025 Mär   Mi    16H 53M 0S
## # ℹ 12,402 more rows

reports %>% 
  select(District, year:time)
## # A tibble: 12,412 × 5
##    District                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Mitte                       2025 Mär   Do    15H 1M 0S 
##  2 Reinickendorf               2025 Mär   Do    14H 57M 0S
##  3 Lichtenberg                 2025 Mär   Do    14H 49M 0S
##  4 Mitte                       2025 Mär   Do    11H 40M 0S
##  5 Pankow                      2025 Mär   Do    10H 49M 0S
##  6 Friedrichshain-Kreuzberg    2025 Mär   Do    10H 46M 0S
##  7 Pankow                      2025 Mär   Do    10H 43M 0S
##  8 Charlottenburg-Wilmersdorf  2025 Mär   Do    9H 52M 0S 
##  9 Mitte                       2025 Mär   Do    9H 21M 0S 
## 10 Charlottenburg-Wilmersdorf  2025 Mär   Mi    16H 53M 0S
## # ℹ 12,402 more rows

Instead of telling select() which columns we want to keep, we can also tell it which ones not to keep by adding a - before the column names. If we want to exclude multiple columns in one step, we combine them with the c() function and write the - directly before it. In this last example we will assign the result to reports.

reports <- reports %>% 
  select(-c(Date, Report, date_cmpl))

reports
## # A tibble: 12,412 × 5
##    District                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Mitte                       2025 Mär   Do    15H 1M 0S 
##  2 Reinickendorf               2025 Mär   Do    14H 57M 0S
##  3 Lichtenberg                 2025 Mär   Do    14H 49M 0S
##  4 Mitte                       2025 Mär   Do    11H 40M 0S
##  5 Pankow                      2025 Mär   Do    10H 49M 0S
##  6 Friedrichshain-Kreuzberg    2025 Mär   Do    10H 46M 0S
##  7 Pankow                      2025 Mär   Do    10H 43M 0S
##  8 Charlottenburg-Wilmersdorf  2025 Mär   Do    9H 52M 0S 
##  9 Mitte                       2025 Mär   Do    9H 21M 0S 
## 10 Charlottenburg-Wilmersdorf  2025 Mär   Mi    16H 53M 0S
## # ℹ 12,402 more rows

11.3 `rename()`

As you may have noticed, all newly created columns are written in lower case, while “District” begins with an upper case letter. You may want to rename this column to a lower case name – or all the others to upper case names, depending on your preference.

One approach to renaming is using the function rename() from dplyr: the function takes the data to be applied to as its first argument – again passed on by the pipe here – followed by one or more arguments in the form new_name = old_name.

reports <- reports %>% 
  rename(district = District)

reports
## # A tibble: 12,412 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Mitte                       2025 Mär   Do    15H 1M 0S 
##  2 Reinickendorf               2025 Mär   Do    14H 57M 0S
##  3 Lichtenberg                 2025 Mär   Do    14H 49M 0S
##  4 Mitte                       2025 Mär   Do    11H 40M 0S
##  5 Pankow                      2025 Mär   Do    10H 49M 0S
##  6 Friedrichshain-Kreuzberg    2025 Mär   Do    10H 46M 0S
##  7 Pankow                      2025 Mär   Do    10H 43M 0S
##  8 Charlottenburg-Wilmersdorf  2025 Mär   Do    9H 52M 0S 
##  9 Mitte                       2025 Mär   Do    9H 21M 0S 
## 10 Charlottenburg-Wilmersdorf  2025 Mär   Mi    16H 53M 0S
## # ℹ 12,402 more rows

11.4 `filter()`

Some of our analysis requires us to have complete years in our data. If we want to compare the police reports by year, it makes no sense to compare the numbers for an ongoing year with those that are complete, since the former will obviously have fewer reports. Incomplete years would also impact analysis by month, as some months will have more observations than others; therefore, we should filter our data for complete years.

To filter the observations, we can use the function filter(). As always in tidyverse functions, filter() takes the data to be filtered as its first argument and one or multiple expressions that specify the rules by which to filter. To write these expressions, we can make use of the comparison operators discussed in subchapter 1.3.2.

To filter for all years that are not 2025, we can write:

reports %>% 
  filter(year != 2025)
## # A tibble: 11,912 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Marzahn-Hellersdorf         2024 Okt   Fr    11H 25M 0S
##  2 Mitte                       2024 Sep   Do    11H 30M 0S
##  3 Lichtenberg                 2024 Mai   Mi    14H 7M 0S 
##  4 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 59M 0S
##  5 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 7M 0S 
##  6 Charlottenburg-Wilmersdorf  2024 Dez   Di    13H 46M 0S
##  7 Treptow-Köpenick            2024 Dez   Di    12H 18M 0S
##  8 Lichtenberg                 2024 Dez   Di    11H 56M 0S
##  9 Charlottenburg-Wilmersdorf  2024 Dez   Di    11H 5M 0S 
## 10 Steglitz-Zehlendorf         2024 Dez   Mo    16H 58M 0S
## # ℹ 11,902 more rows

Thus, only the observations where “year” does not equal “2025” remain in the tibble; more accurately, those observations for which the expression year != 2025 is returned as TRUE.

This is all the filtering we will have to do here, but we should still look on how to filter by multiple values at once, which is an important data wrangling still to have.

We can chain multiple expressions when using filter() separated by commas. o for example to reduce the data set to observations from 2024, 2023 and 2022 we could write:

reports %>% 
  filter(year != 2025, year != 2021, year != 2020)
## # A tibble: 6,832 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Marzahn-Hellersdorf         2024 Okt   Fr    11H 25M 0S
##  2 Mitte                       2024 Sep   Do    11H 30M 0S
##  3 Lichtenberg                 2024 Mai   Mi    14H 7M 0S 
##  4 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 59M 0S
##  5 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 7M 0S 
##  6 Charlottenburg-Wilmersdorf  2024 Dez   Di    13H 46M 0S
##  7 Treptow-Köpenick            2024 Dez   Di    12H 18M 0S
##  8 Lichtenberg                 2024 Dez   Di    11H 56M 0S
##  9 Charlottenburg-Wilmersdorf  2024 Dez   Di    11H 5M 0S 
## 10 Steglitz-Zehlendorf         2024 Dez   Mo    16H 58M 0S
## # ℹ 6,822 more rows

Here we are excluding every observation that does not lie between 2022–2024.

Often there are multiple ways to formulate the filtering expressions. Here, we could tell R which values for “year” we want to keep, instead of which we do not want to keep. Instead of listing all those individual years, we can use %in% to define a range of numerical values which should be kept when filtering. We assign the result to a new object for later use.

reports %>% 
  filter(year %in% 2022:2024)
## # A tibble: 6,832 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Marzahn-Hellersdorf         2024 Okt   Fr    11H 25M 0S
##  2 Mitte                       2024 Sep   Do    11H 30M 0S
##  3 Lichtenberg                 2024 Mai   Mi    14H 7M 0S 
##  4 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 59M 0S
##  5 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 7M 0S 
##  6 Charlottenburg-Wilmersdorf  2024 Dez   Di    13H 46M 0S
##  7 Treptow-Köpenick            2024 Dez   Di    12H 18M 0S
##  8 Lichtenberg                 2024 Dez   Di    11H 56M 0S
##  9 Charlottenburg-Wilmersdorf  2024 Dez   Di    11H 5M 0S 
## 10 Steglitz-Zehlendorf         2024 Dez   Mo    16H 58M 0S
## # ℹ 6,822 more rows

While the , behaves like the logical operator &, we can also use | for “OR” when combining expressions to be filtered upon.

For example we could use | to select the same years in this way:

reports %>% 
  filter(year == 2022 | year == 2023 | year == 2024)
## # A tibble: 6,832 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Marzahn-Hellersdorf         2024 Okt   Fr    11H 25M 0S
##  2 Mitte                       2024 Sep   Do    11H 30M 0S
##  3 Lichtenberg                 2024 Mai   Mi    14H 7M 0S 
##  4 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 59M 0S
##  5 Charlottenburg-Wilmersdorf  2024 Dez   Di    16H 7M 0S 
##  6 Charlottenburg-Wilmersdorf  2024 Dez   Di    13H 46M 0S
##  7 Treptow-Köpenick            2024 Dez   Di    12H 18M 0S
##  8 Lichtenberg                 2024 Dez   Di    11H 56M 0S
##  9 Charlottenburg-Wilmersdorf  2024 Dez   Di    11H 5M 0S 
## 10 Steglitz-Zehlendorf         2024 Dez   Mo    16H 58M 0S
## # ℹ 6,822 more rows

It is important to keep in mind the differences between AND and OR here. If we would have used & or , to chain these expressions instead, the following would have happened:

reports %>% 
  filter(year == 2022, year == 2023, year == 2024)
## # A tibble: 0 × 5
## # ℹ 5 variables: district <chr>, year <dbl>, month <ord>, day <ord>,
## #   time <Period>

We are left with nothing in our data set. This happens because the expression in filter() is now searching for observations who are at the same time from 2022, 2023 and 2024. This is obviously impossible.

Let us now actually remove the observations for 2025 from the saved object.

reports_fyears <- reports %>% 
  filter(year != 2025)

11.5 `summarise()` & `group_by()`

We can now begin computing some basic summary statistics for our data. You should note though, that we do not know how the data we scraped is actually created. Do the dates and times on the website refer to the instance in time when a crime occurred – there are indications that this is not the case –, when a report was filed, or even when the PR team of the Berlin police uploaded the report to the website – which might even be the most plausible scenario? Also, are all crimes reported on this website – the daily numbers are too low for this to be the case – or is there an internal selection process? And if so, on what criteria does the selection occur? If this was a real research project, we would absolutely need to gain clarity on these and many other questions, before we even begin with collecting the data. We do not have clarity here, but we have to keep in mind that we may not analyse the statistical distribution of crimes in Berlin, but rather the working practice of the PR team.

11.5.1 Number of reports by groups

The dplyr function summarise() can be used to calculate summary statistics for a whole tibble; the syntax being similar to mutate(). The result is a new tibble, containing only the summary statistics we requested. The function we use for the summary statistic here is n(), which returns the number of observations. As each observation represents one police report, this equals the overall number of reports.

reports_fyears %>% 
  summarise(reports = n())
## # A tibble: 1 × 1
##   reports
##     <int>
## 1   11912

This worked, but is not very interesting, as the result is the same as the length of the reports tibble. We are often more interested in summaries grouped by the value of one or more other variables. We might ask ourselves, if there are differences in the number of released police reports by year; for this purpose, we can group the data by the values of the “year” column using group_by() and then compute the summary statistic separately for each group. Here, n() returns the number of observations for each group, i.e. the number of reports per year. As always with dplyr functions, group_by() needs the data which will be grouped as the first argument, followed by one or multiple columns to group by.

reports_fyears %>% 
  group_by(year) %>% 
  summarise(reports = n())
## # A tibble: 5 × 2
##    year reports
##   <dbl>   <int>
## 1  2020    2581
## 2  2021    2499
## 3  2022    2235
## 4  2023    2196
## 5  2024    2401

While there are differences between the years, there does not seem to be a clear and systematic pattern to it. We can do the same kind of analysis grouped by months and weekdays.

reports_fyears %>% 
  group_by(month) %>% 
  summarise(reports = n())
## # A tibble: 12 × 2
##    month reports
##    <ord>   <int>
##  1 Jan      1013
##  2 Feb       859
##  3 Mär      1020
##  4 Apr      1092
##  5 Mai      1089
##  6 Jun      1015
##  7 Jul      1006
##  8 Aug      1009
##  9 Sep       975
## 10 Okt       964
## 11 Nov       957
## 12 Dez       913

reports_fyears %>% 
  group_by(day) %>% 
  summarise(reports = n())
## # A tibble: 7 × 2
##   day   reports
##   <ord>   <int>
## 1 So       1568
## 2 Mo       1608
## 3 Di       1772
## 4 Mi       1834
## 5 Do       1870
## 6 Fr       1754
## 7 Sa       1506

The analysis by month again shows no systematic variation (that February has less reports will moist probably come down to the shorter length of the month). Looking at the reports by day of the week on the other hand, shows a slight increase in reports over the week, culminating on Thursday As stated above, we do not really know how the data is created, so the main point to take away from this analysis might be, that the PR team of the Berlin police also works on the weekend, but with fewer working hours.

In the same manner, we can analyse the number of released police reports by district.

reports_fyears %>% 
  group_by(district) %>% 
  summarise(reports = n()) %>% 
  arrange(desc(reports))
## # A tibble: 16 × 2
##    district                   reports
##    <chr>                        <int>
##  1 Mitte                         1723
##  2 Friedrichshain-Kreuzberg      1244
##  3 Neukölln                      1064
##  4 Charlottenburg-Wilmersdorf    1059
##  5 Tempelhof-Schöneberg           872
##  6 Pankow                         869
##  7 Lichtenberg                    757
##  8 Treptow-Köpenick               737
##  9 Spandau                        718
## 10 Marzahn-Hellersdorf            677
## 11 Reinickendorf                  650
## 12 Steglitz-Zehlendorf            566
## 13 bezirksübergreifend            463
## 14 berlinweit                     379
## 15 <NA>                            79
## 16 bundesweit                      55

For the grouped summary we also included arrange(desc()) which orders the values from highest to lowest. If you omit desc() the results are ordered ascending.

We might also be interested in the relative share of reports by district. To compute those, one option is to add another column to the tibble resulting from summarise(), in which we calculate the relative share by dividing the “reports” columns values by the total of this column.

reports_fyears %>% 
  group_by(district) %>% 
  summarise(reports = n()) %>% 
  arrange(desc(reports)) %>% 
  mutate(reports_rel = reports / sum(reports))
## # A tibble: 16 × 3
##    district                   reports reports_rel
##    <chr>                        <int>       <dbl>
##  1 Mitte                         1723     0.145  
##  2 Friedrichshain-Kreuzberg      1244     0.104  
##  3 Neukölln                      1064     0.0893 
##  4 Charlottenburg-Wilmersdorf    1059     0.0889 
##  5 Tempelhof-Schöneberg           872     0.0732 
##  6 Pankow                         869     0.0730 
##  7 Lichtenberg                    757     0.0635 
##  8 Treptow-Köpenick               737     0.0619 
##  9 Spandau                        718     0.0603 
## 10 Marzahn-Hellersdorf            677     0.0568 
## 11 Reinickendorf                  650     0.0546 
## 12 Steglitz-Zehlendorf            566     0.0475 
## 13 bezirksübergreifend            463     0.0389 
## 14 berlinweit                     379     0.0318 
## 15 <NA>                            79     0.00663
## 16 bundesweit                      55     0.00462

The results show clear differences in the number of reports by district, with “Mitte” leading by a substantial margin. Other districts with high counts seem to be those that at least in part lie within the inner ring. The outer districts all show lower numbers. Again, we do not know how the data is created, so maybe it is just the case that the crimes in the inner ring are more interesting than in other districts and that the data does not necessarily point to a higher number of crimes in central Berlin. To get a more complete picture, we would also have to relate the number of reports to the citizens or even tourists per district. For this we would need additional external data though.

Using group_by() we can also group by multiple columns. “Mitte” and “Friedrichshain-Kreuzberg” showed the highest numbers among all districts. Let us analyse if these numbers changed over the years for these two districts. First we have to use filter() to use only the observations referring to those districts and then group the data by the remaining districts and year before we count the the number of released reports.

reports_fyears %>% 
  filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>% 
  group_by(district, year) %>% 
  summarise(reports = n())
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups:   district [2]
##    district                  year reports
##    <chr>                    <dbl>   <int>
##  1 Friedrichshain-Kreuzberg  2020     301
##  2 Friedrichshain-Kreuzberg  2021     312
##  3 Friedrichshain-Kreuzberg  2022     190
##  4 Friedrichshain-Kreuzberg  2023     210
##  5 Friedrichshain-Kreuzberg  2024     231
##  6 Mitte                     2020     374
##  7 Mitte                     2021     328
##  8 Mitte                     2022     318
##  9 Mitte                     2023     333
## 10 Mitte                     2024     370

Looking at “Friedrichshain-Kreuzberg” we can observe a substantial decrease in reports starting from 2022. The same is not true for “Mitte” which despite of a moderate temporal decrease seems to be relatively stable in the number of reports. To analyse this further, we had to go deeper into the data and look at the actual texts of the police reports.

But we should briefly talk about a peculiarity to the way group_by() works with summarise() that can cause headaches, if we are not aware of it. In general, any summarise() function following a group_by() will calculate the summary statistic and then remove one level of grouping. In the examples where we only had one level of grouping, this essentially meant, that the data was ungrouped after summarise(). In the last example we had two levels of grouping. So, after the computation of the number of reports by “district” and “year” the grouping by “year” was removed, but the grouping by “district” remained in effect. We can see this in the output, where R informs us about the column by which the data are grouped and the number of groups in the output: # Groups: district [2]. Another summarise() function would compute the statistic by “district” and then remove this level as well. We can also use ungroup() to remove all grouping from a tibble. In the case of this example, it does not make a practical difference as we only compute the summary and then move on. But, if we assign the results of a summary to an object for later use, we have to decide if we want to remove the grouping or keep it in effect, depending on the goals of the analysis. In general, I would always ungroup the data and group it again if the need arises, as this is less error prone and just a minor increase in the amount of typing needed.

11.5.2 Summary statistics on the time

The column “time” holds the time of day for each report in hours, minutes and seconds as a representation of the actual values, which are seconds from midnight.

reports_fyears[[1, 'time']]
## [1] "11H 25M 0S"

reports_fyears[[1, 'time']] %>% 
  as.numeric()
## [1] 41100

To calculate with the “time” column, we first have to tell R that we explicitly want to use the data as seconds, then compute the summary statistic and after this, transform the display of seconds back into the time of day. For the transformations we can use the functions period_to_seconds() and seconds_to_period(). We will calculate the mean and the median for the time of day over all police reports in one call of summarise().

reports %>% 
  summarise(mean = period_to_seconds(time) %>% 
              mean() %>% 
              seconds_to_period(),
            median = period_to_seconds(time) %>% 
              median() %>% 
              seconds_to_period()
  )
## # A tibble: 1 × 2
##   mean                      median   
##   <Period>                  <Period> 
## 1 12H 31M 10.2336448598144S 12H 4M 0S

The mean time of day over all police reports is about 12:28 while the median is 12:01. This may indicate that the mean is biased towards a later time by a high number of very late reports. We will explore this further graphically in the next chapter. Again, the results may represent the time when crimes occur, but it is more likely, that we are actually analysing the time when the PR team posts their police reports.

11.6 Exporting tables

Exporting the results of our analysis – e.g. summary statistics computed with summarise() – sadly is not as straightforward as one might think or hope. We can always copy and paste values from the output into Word or other pieces of software we might use for writing a report. But this is cumbersome and error prone.

One way of directly exporting tables we already know about is writing them to a .csv file. For more on this please review section 8.1. These can be imported into Excel and also directly into Word. This may be a quick way of exporting the values of a table and importing them into Office, but we have to do all the formatting in the external software and may have to repeat the formatting procedure every time the values change. So we should make sure that all errors have been fixed and the analysis is final before exporting to .csv.

In this example we save the number of police reports by year for the districts “Mitte” and “Friedrichshain-Kreuzberg” we computed earlier into a .csv file:

reports_fyears %>% 
  filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>% 
  group_by(district, year) %>% 
  summarise(reports = n()) %>% 
  write_csv("reports_year_M-FK.csv")
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.

For reasons unknown to anyone but Microsoft, importing a .csv file that uses commas as the delimiter can actually be quite tricky on a German Windows installation if you are using Microsoft Office. Using Libre Office fixes this, but if you want to use Microsoft Office on a German system you might have to use semicolons as the delimiter by saving the file with write_csv2().

11.6.1 Further resources

A multitude of packages that provide functions for formatting and writing tables to different formats exist in the R world. To go into details is beyond the scope of this introduction, but I want to give you some pointers at least.

If you want to skip importing a .csv into Microsoft Office, you may be interested in packages that allow writing directly to .xlsx files. Two options are:

writexl: https://cran.r-project.org/web/packages/writexl/index.html
openxlsx: https://cran.r-project.org/web/packages/openxlsx/index.html

A way to directly produce tables in the .docx format is provided by the flextable package: https://davidgohel.github.io/flextable/

If your are working with LaTeX, the huxtable package can directly output .tex files, as well as HTML, Word, Excel and Powerpoint: https://hughjonesd.github.io/huxtable/

If you want to circumvent using external software for producing reports completely, R Markdown may be of interest to you. This allows you to write your report and your R code in one document and write it to .pdf or a range of other formats. The produced file includes all the output from your code that you want to report, i.e. code, plots and tables. Some resources on R Markdown can be found here:

Introduction on the RStudio website: https://rmarkdown.rstudio.com/lesson-1.html{target_“blank”}
The R Markdown cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf
Chapter on R Markdown in “R for Data Science” by Hadley Wickham and Garrett Grolemund: https://r4ds.had.co.nz/r-markdown.html
“R Markdown Cookbook” by Yihui Xie, Christophe Dervieux, Emily Riederer: https://bookdown.org/yihui/rmarkdown-cookbook/

The website your are viewing right now, as well as many of the linked resources, are built with the bookdown package, which extends the functionality of R Markdown and is suited for more large scale projects.

“bookdown: Authoring Books and Technical Documents with R Markdown” by Yihui Xie: https://bookdown.org/yihui/bookdown/