11 Transformation with dplyr

In this chapter, we will use several functions from the tidyverse package dplyr to clean the data we scraped in chapter 10. We will then select the columns and filter the observations (rows) we need, as well as compute some grouped summaries of the data.

First, let us load the required packages. dplyr is part of the core tidyverse; we will need the package lubridate as well, to be able to deal with dates and times.

library(tidyverse)
library(lubridate)

11.1 mutate()

11.1.1 Computing new variables from existing ones

Before we continue working on our scraped reports data, we will look at some simple examples that will show how new variables can be computed in a tibble from existing ones, using mutate() from dplyr.

For this purpose, let us create a new simple tibble on voter turnout. Note that the entered data is purely illustrational and has no meaning.

exmpl_tbl <- tibble(
  ybirth = c("1993 - 2002", "1983 - 1992", "1973 - 1982", "1963 - 1972"),
  eligible = c(100000, 100000, 100000, 100000),
  turnout_person = c(40000, 50000, 60000, 70000),
  turnout_mail = c (35000, 30000, 25000, 20000)
)

exmpl_tbl
## # A tibble: 4 × 4
##   ybirth      eligible turnout_person turnout_mail
##   <chr>          <dbl>          <dbl>        <dbl>
## 1 1993 - 2002   100000          40000        35000
## 2 1983 - 1992   100000          50000        30000
## 3 1973 - 1982   100000          60000        25000
## 4 1963 - 1972   100000          70000        20000

Here we have different columns for people who were eligible to vote, who voted in person and who voted by mail, by their year of birth. We do not actually care about the difference in the voting method in this example and want one column that combines both. We can get there by using mutate(). The function takes the data to be manipulated as its first argument, followed by one or multiple arguments defining the new columns to be created. We can create this new columns as computations from the columns already present.

To calculate the total turnout, we write:

exmpl_tbl %>% 
  mutate(turnout = turnout_person + turnout_mail)
## # A tibble: 4 × 5
##   ybirth      eligible turnout_person turnout_mail turnout
##   <chr>          <dbl>          <dbl>        <dbl>   <dbl>
## 1 1993 - 2002   100000          40000        35000   75000
## 2 1983 - 1992   100000          50000        30000   80000
## 3 1973 - 1982   100000          60000        25000   85000
## 4 1963 - 1972   100000          70000        20000   90000

We can also immediately start calculating with new columns in the same mutate() command. For this, we chain multiple operations in mutate() separated with ,. To calculate the turnout percentage in a second step, we write the following:

exmpl_tbl %>% 
  mutate(
    turnout = turnout_person + turnout_mail,
    turnout_pct = turnout / eligible
  )
## # A tibble: 4 × 6
##   ybirth      eligible turnout_person turnout_mail turnout turnout_pct
##   <chr>          <dbl>          <dbl>        <dbl>   <dbl>       <dbl>
## 1 1993 - 2002   100000          40000        35000   75000        0.75
## 2 1983 - 1992   100000          50000        30000   80000        0.8 
## 3 1973 - 1982   100000          60000        25000   85000        0.85
## 4 1963 - 1972   100000          70000        20000   90000        0.9

Note that mutate() is not limited to basic arithmetic operations. Many functions can be applied within mutate(): sum() or mean() are two examples we already know.

11.1.2 Cleaning the data on police reports

Now knowing the basic principle of working with mutate(), we can apply this to the data on police reports we scraped in chapter 10.

First we have to load the data and examine what we are working with.

load("reports.RData")

reports
## # A tibble: 21,952 × 3
##    Date                 Report                                          District
##    <chr>                <chr>                                           <chr>   
##  1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mordkommissio… Ereigni…
##  2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch Schuss - Du… Ereigni…
##  3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit dem Nahost-Ko… Ereigni…
##  4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebstahl           Ereigni…
##  5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhaus verstorben Ereigni…
##  6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen            Ereigni…
##  7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wegerechten      Ereigni…
##  8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft                        Ereigni…
##  9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen - Mordkomm… Ereigni…
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit den weltweite… Ereigni…
## # ℹ 21,942 more rows

We can leave the column containing the text of the police report as it is, but the columns containing the districts, as well as the date and time, could use some work.

The data in the column “District” includes the string “Ereignisort:” before listing the actual name of the district the report refers to. This unnecessarily clutters this column: we can use substr() to remove these leading characters within mutate(). substr() extracts a part of a string we want to keep, taking the data to be applied to, as its first argument – here the name of the column –, the starting character position where the extraction shall begin as its second, and the character position where it shall stop, as its third argument. Character position here refers to the numeric position of a character in a string: to begin extraction after “Ereignisort:” we have to count the length of this substring including the whitespace after “:”, which is \(13\) and thus use \(14\) as the starting position. For the position to stop extraction, we could either use an unrealistically high number or use length(District) to determine the exact length of each string.

reports %>% 
  mutate(District = substr(District, 14, 99))
## # A tibble: 21,952 × 3
##    Date                 Report                                          District
##    <chr>                <chr>                                           <chr>   
##  1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mordkommissio… Treptow…
##  2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch Schuss - Du… Charlot…
##  3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit dem Nahost-Ko… berlinw…
##  4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebstahl           Friedri…
##  5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhaus verstorben Marzahn…
##  6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen            Friedri…
##  7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wegerechten      Mitte   
##  8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft                        Charlot…
##  9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen - Mordkomm… Neukölln
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit den weltweite… berlinw…
## # ℹ 21,942 more rows

reports %>% 
  mutate(District = substr(District, 14, length(District)))
## # A tibble: 21,952 × 3
##    Date                 Report                                          District
##    <chr>                <chr>                                           <chr>   
##  1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mordkommissio… Treptow…
##  2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch Schuss - Du… Charlot…
##  3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit dem Nahost-Ko… berlinw…
##  4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebstahl           Friedri…
##  5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhaus verstorben Marzahn…
##  6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen            Friedri…
##  7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wegerechten      Mitte   
##  8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft                        Charlot…
##  9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen - Mordkomm… Neukölln
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit den weltweite… berlinw…
## # ℹ 21,942 more rows

The column “Date” includes the date and the time of each report as a character string. To be able to use this data in analysis, we have to extract the date and time in a format R can understand. This is easily achieved using the parsing functions from lubridate. As the character data are written in the format “day.month.year hour:minute”, we have to use the function dmy_hm() on the column.

reports %>% 
  mutate(date_cmpl = dmy_hm(Date))
## # A tibble: 21,952 × 4
##    Date                 Report                      District date_cmpl          
##    <chr>                <chr>                       <chr>    <dttm>             
##  1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeli… Ereigni… 2023-10-16 17:55:00
##  2 16.10.2023 17:49 Uhr Gefährliche Körperverletzu… Ereigni… 2023-10-16 17:49:00
##  3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenha… Ereigni… 2023-10-16 14:43:00
##  4 16.10.2023 11:26 Uhr Festnahme nach räuberische… Ereigni… 2023-10-16 11:26:00
##  5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in K… Ereigni… 2023-10-16 10:07:00
##  6 16.10.2023 09:53 Uhr Brandstiftung an mehreren … Ereigni… 2023-10-16 09:53:00
##  7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder-… Ereigni… 2023-10-15 21:07:00
##  8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft    Ereigni… 2023-10-15 15:46:00
##  9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus… Ereigni… 2023-10-15 14:59:00
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenha… Ereigni… 2023-10-15 14:47:00
## # ℹ 21,942 more rows

At a later point we will do some analysis by year, month, weekday and time of day. lubridate includes functions that extract the subparts of date and time data.

year() extracts the year, month() the month and wday() the day of the week. The argument label = TRUE tells the function to use the names of months and days instead of a numerical value ranging from \(1\) to \(12\) and \(1\) to \(7\) respectively.

reports %>% 
  mutate(date_cmpl = dmy_hm(Date),
         year = year(date_cmpl),
         month = month(date_cmpl, label = TRUE),
         day = wday(date_cmpl, label = TRUE)
         )
## # A tibble: 21,952 × 7
##    Date                 Report    District date_cmpl            year month day  
##    <chr>                <chr>     <chr>    <dttm>              <dbl> <ord> <ord>
##  1 16.10.2023 17:55 Uhr Verdacht… Ereigni… 2023-10-16 17:55:00  2023 Okt   Mo   
##  2 16.10.2023 17:49 Uhr Gefährli… Ereigni… 2023-10-16 17:49:00  2023 Okt   Mo   
##  3 16.10.2023 14:43 Uhr Vorkommn… Ereigni… 2023-10-16 14:43:00  2023 Okt   Mo   
##  4 16.10.2023 11:26 Uhr Festnahm… Ereigni… 2023-10-16 11:26:00  2023 Okt   Mo   
##  5 16.10.2023 10:07 Uhr Verletzt… Ereigni… 2023-10-16 10:07:00  2023 Okt   Mo   
##  6 16.10.2023 09:53 Uhr Brandsti… Ereigni… 2023-10-16 09:53:00  2023 Okt   Mo   
##  7 15.10.2023 21:07 Uhr Verkehrs… Ereigni… 2023-10-15 21:07:00  2023 Okt   So   
##  8 15.10.2023 15:46 Uhr Raub im … Ereigni… 2023-10-15 15:46:00  2023 Okt   So   
##  9 15.10.2023 14:59 Uhr Einkaufs… Ereigni… 2023-10-15 14:59:00  2023 Okt   So   
## 10 15.10.2023 14:47 Uhr Vorkommn… Ereigni… 2023-10-15 14:47:00  2023 Okt   So   
## # ℹ 21,942 more rows

To extract the time of day, we first have to extract the substring of “Date” that contains the time and then apply the lubridate function hm() to it.

reports %>% 
  mutate(time = 
           substr(Date, 12, length(Date)) %>% 
           hm()
         )
## # A tibble: 21,952 × 4
##    Date                 Report                               District time      
##    <chr>                <chr>                                <chr>    <Period>  
##  1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mo… Ereigni… 17H 55M 0S
##  2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch … Ereigni… 17H 49M 0S
##  3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit de… Ereigni… 14H 43M 0S
##  4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebsta… Ereigni… 11H 26M 0S
##  5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhau… Ereigni… 10H 7M 0S 
##  6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen Ereigni… 9H 53M 0S 
##  7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wege… Ereigni… 21H 7M 0S 
##  8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft             Ereigni… 15H 46M 0S
##  9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen… Ereigni… 14H 59M 0S
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit de… Ereigni… 14H 47M 0S
## # ℹ 21,942 more rows

We can combine all these individual steps in one mutate() command and apply them to the object reports.

reports <- reports %>% 
  mutate(District = substr(District, 14, length(District)),
         date_cmpl = dmy_hm(Date),
         year = year(date_cmpl),
         month = month(date_cmpl, label = TRUE),
         day = wday(date_cmpl, label = TRUE),
         time = 
           substr(Date, 12, length(Date)) %>% 
           hm()
         )

reports
## # A tibble: 21,952 × 8
##    Date         Report District date_cmpl            year month day   time      
##    <chr>        <chr>  <chr>    <dttm>              <dbl> <ord> <ord> <Period>  
##  1 16.10.2023 … Verda… Treptow… 2023-10-16 17:55:00  2023 Okt   Mo    17H 55M 0S
##  2 16.10.2023 … Gefäh… Charlot… 2023-10-16 17:49:00  2023 Okt   Mo    17H 49M 0S
##  3 16.10.2023 … Vorko… berlinw… 2023-10-16 14:43:00  2023 Okt   Mo    14H 43M 0S
##  4 16.10.2023 … Festn… Friedri… 2023-10-16 11:26:00  2023 Okt   Mo    11H 26M 0S
##  5 16.10.2023 … Verle… Marzahn… 2023-10-16 10:07:00  2023 Okt   Mo    10H 7M 0S 
##  6 16.10.2023 … Brand… Friedri… 2023-10-16 09:53:00  2023 Okt   Mo    9H 53M 0S 
##  7 15.10.2023 … Verke… Mitte    2023-10-15 21:07:00  2023 Okt   So    21H 7M 0S 
##  8 15.10.2023 … Raub … Charlot… 2023-10-15 15:46:00  2023 Okt   So    15H 46M 0S
##  9 15.10.2023 … Einka… Neukölln 2023-10-15 14:59:00  2023 Okt   So    14H 59M 0S
## 10 15.10.2023 … Vorko… berlinw… 2023-10-15 14:47:00  2023 Okt   So    14H 47M 0S
## # ℹ 21,942 more rows

11.2 select()

As we have already extracted all we need from the “Date” and “date_cmpl” columns, and we will not work with the “Report” column in this chapter, to keep the tibble neat and free of clutter, we remove the three columns by using the function select(). select() is used for selecting columns from a tibbble. The function takes the data as its first argument – here provided by the pipe – and one or several names of columns that should be kept. If columns follow after each other in order, we can also use a “from:to” notation.

reports %>% 
  select(District, year, month, day, time)
## # A tibble: 21,952 × 5
##    District                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Treptow-Köpenick            2023 Okt   Mo    17H 55M 0S
##  2 Charlottenburg-Wilmersdorf  2023 Okt   Mo    17H 49M 0S
##  3 berlinweit                  2023 Okt   Mo    14H 43M 0S
##  4 Friedrichshain-Kreuzberg    2023 Okt   Mo    11H 26M 0S
##  5 Marzahn-Hellersdorf         2023 Okt   Mo    10H 7M 0S 
##  6 Friedrichshain-Kreuzberg    2023 Okt   Mo    9H 53M 0S 
##  7 Mitte                       2023 Okt   So    21H 7M 0S 
##  8 Charlottenburg-Wilmersdorf  2023 Okt   So    15H 46M 0S
##  9 Neukölln                    2023 Okt   So    14H 59M 0S
## 10 berlinweit                  2023 Okt   So    14H 47M 0S
## # ℹ 21,942 more rows

reports %>% 
  select(District, year:time)
## # A tibble: 21,952 × 5
##    District                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Treptow-Köpenick            2023 Okt   Mo    17H 55M 0S
##  2 Charlottenburg-Wilmersdorf  2023 Okt   Mo    17H 49M 0S
##  3 berlinweit                  2023 Okt   Mo    14H 43M 0S
##  4 Friedrichshain-Kreuzberg    2023 Okt   Mo    11H 26M 0S
##  5 Marzahn-Hellersdorf         2023 Okt   Mo    10H 7M 0S 
##  6 Friedrichshain-Kreuzberg    2023 Okt   Mo    9H 53M 0S 
##  7 Mitte                       2023 Okt   So    21H 7M 0S 
##  8 Charlottenburg-Wilmersdorf  2023 Okt   So    15H 46M 0S
##  9 Neukölln                    2023 Okt   So    14H 59M 0S
## 10 berlinweit                  2023 Okt   So    14H 47M 0S
## # ℹ 21,942 more rows

Instead of telling select() which columns we want to keep, we can also tell it which ones not to keep by adding a - before the column names. If we want to exclude multiple columns in one step, we combine them with the c() function and write the - directly before it. In this last example we will assign the result to reports.

reports <- reports %>% 
  select(-c(Date, Report, date_cmpl))

reports
## # A tibble: 21,952 × 5
##    District                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Treptow-Köpenick            2023 Okt   Mo    17H 55M 0S
##  2 Charlottenburg-Wilmersdorf  2023 Okt   Mo    17H 49M 0S
##  3 berlinweit                  2023 Okt   Mo    14H 43M 0S
##  4 Friedrichshain-Kreuzberg    2023 Okt   Mo    11H 26M 0S
##  5 Marzahn-Hellersdorf         2023 Okt   Mo    10H 7M 0S 
##  6 Friedrichshain-Kreuzberg    2023 Okt   Mo    9H 53M 0S 
##  7 Mitte                       2023 Okt   So    21H 7M 0S 
##  8 Charlottenburg-Wilmersdorf  2023 Okt   So    15H 46M 0S
##  9 Neukölln                    2023 Okt   So    14H 59M 0S
## 10 berlinweit                  2023 Okt   So    14H 47M 0S
## # ℹ 21,942 more rows

11.3 rename()

As you may have noticed, all newly created columns are written in lower case, while “District” begins with an upper case letter. You may want to rename this column to a lower case name – or all the others to upper case names, depending on your preference.

One approach to renaming is using the function rename() from dplyr: the function takes the data to be applied to as its first argument – again passed on by the pipe here – followed by one or more arguments in the form new_name = old_name.

reports <- reports %>% 
  rename(district = District)

reports
## # A tibble: 21,952 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Treptow-Köpenick            2023 Okt   Mo    17H 55M 0S
##  2 Charlottenburg-Wilmersdorf  2023 Okt   Mo    17H 49M 0S
##  3 berlinweit                  2023 Okt   Mo    14H 43M 0S
##  4 Friedrichshain-Kreuzberg    2023 Okt   Mo    11H 26M 0S
##  5 Marzahn-Hellersdorf         2023 Okt   Mo    10H 7M 0S 
##  6 Friedrichshain-Kreuzberg    2023 Okt   Mo    9H 53M 0S 
##  7 Mitte                       2023 Okt   So    21H 7M 0S 
##  8 Charlottenburg-Wilmersdorf  2023 Okt   So    15H 46M 0S
##  9 Neukölln                    2023 Okt   So    14H 59M 0S
## 10 berlinweit                  2023 Okt   So    14H 47M 0S
## # ℹ 21,942 more rows

11.4 filter()

Some of our analysis requires us to have complete years in our data. If we want to compare the police reports by year, it makes no sense to compare the numbers for an ongoing year with those that are complete, since the former will obviously have fewer reports. Incomplete years would also impact analysis by month, as some months will have more observations than others; therefore, we should filter our data for complete years.

To filter the observations, we can use the function filter(). As always in tidyverse functions, filter() takes the data to be filtered as its first argument and one or multiple expressions that specify the rules by which to filter. To write these expressions, we can make use of the comparison operators discussed in subchapter 1.3.2.

To filter for all years that are not 2022, we can write:

reports %>% 
  filter(year != 2022)
## # A tibble: 19,716 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Treptow-Köpenick            2023 Okt   Mo    17H 55M 0S
##  2 Charlottenburg-Wilmersdorf  2023 Okt   Mo    17H 49M 0S
##  3 berlinweit                  2023 Okt   Mo    14H 43M 0S
##  4 Friedrichshain-Kreuzberg    2023 Okt   Mo    11H 26M 0S
##  5 Marzahn-Hellersdorf         2023 Okt   Mo    10H 7M 0S 
##  6 Friedrichshain-Kreuzberg    2023 Okt   Mo    9H 53M 0S 
##  7 Mitte                       2023 Okt   So    21H 7M 0S 
##  8 Charlottenburg-Wilmersdorf  2023 Okt   So    15H 46M 0S
##  9 Neukölln                    2023 Okt   So    14H 59M 0S
## 10 berlinweit                  2023 Okt   So    14H 47M 0S
## # ℹ 19,706 more rows

Thus, only the observations where “year” does not equal “2022” remain in the tibble; more accurately, those observations for which the expression year != 2022 is returned as TRUE.

Closer inspection of the data reveals, that 2014 (the first year that is available on the website) is not complete either. There also seems to be one single report for 2013. We should also exclude these years for analysis. We can chain multiple expressions when using filter() separated by commas.

reports %>% 
  filter(year != 2022, year != 2014, year != 2013)
## # A tibble: 19,012 × 5
##    district                    year month day   time      
##    <chr>                      <dbl> <ord> <ord> <Period>  
##  1 Treptow-Köpenick            2023 Okt   Mo    17H 55M 0S
##  2 Charlottenburg-Wilmersdorf  2023 Okt   Mo    17H 49M 0S
##  3 berlinweit                  2023 Okt   Mo    14H 43M 0S
##  4 Friedrichshain-Kreuzberg    2023 Okt   Mo    11H 26M 0S
##  5 Marzahn-Hellersdorf         2023 Okt   Mo    10H 7M 0S 
##  6 Friedrichshain-Kreuzberg    2023 Okt   Mo    9H 53M 0S 
##  7 Mitte                       2023 Okt   So    21H 7M 0S 
##  8 Charlottenburg-Wilmersdorf  2023 Okt   So    15H 46M 0S
##  9 Neukölln                    2023 Okt   So    14H 59M 0S
## 10 berlinweit                  2023 Okt   So    14H 47M 0S
## # ℹ 19,002 more rows

Often there are multiple ways to formulate the filtering expressions. Here, we could tell R which values for “year” we want to keep, instead of which we do not want to keep. Instead of listing all those individual years, we can use %in% to define a range of numerical values which should be kept when filtering. We assign the result to a new object for later use.

reports_fyears <- reports %>% 
  filter(year %in% 2015:2022)

While the , behaves like the logical operator &, we can also use | for “or” when combining expressions to be filtered upon.

11.5 summarise() & group_by()

We can now begin computing some basic summary statistics for our data. You should note though, that we do not know how the data we scraped is actually created. Do the dates and times on the website refer to the instance in time when a crime occurred – there are indications that this is not the case –, when a report was filed, or even when the PR team of the Berlin police uploaded the report to the website – which might even be the most plausible scenario? Also, are all crimes reported on this website – the daily numbers are too low for this to be the case – or is there an internal selection process? And if so, on what criteria does the selection occur? If this was a real research project, we would absolutely need to gain clarity on these and many other questions, before we even begin with collecting the data. We do not have clarity here, but we have to keep in mind that we may not analyse the statistical distribution of crimes in Berlin, but rather the working practice of the PR team.

11.5.1 Number of reports by groups

The dplyr function summarise() can be used to calculate summary statistics for a whole tibble; the syntax being similar to mutate(). The result is a new tibble, containing only the summary statistics we requested. The function we use for the summary statistic here is n(), which returns the number of observations. As each observation represents one police report, this equals the overall number of reports.

reports_fyears %>% 
  summarise(reports = n())
## # A tibble: 1 × 1
##   reports
##     <int>
## 1   19508

This worked, but is not very interesting, as the result is the same as the length of the reports tibble. We are often more interested in summaries grouped by the value of one or more other variables. We might ask ourselves, if there are differences in the number of released police reports by year; for this purpose, we can group the data by the values of the “year” column using group_by() and then compute the summary statistic separately for each group. Here, n() returns the number of observations for each group, i.e. the number of reports per year. As always with dplyr functions, group_by() needs the data which will be grouped as the first argument, followed by one or multiple columns to group by.

reports_fyears %>% 
  group_by(year) %>% 
  summarise(reports = n())
## # A tibble: 8 × 2
##    year reports
##   <dbl>   <int>
## 1  2015    2369
## 2  2016    2626
## 3  2017    2233
## 4  2018    2342
## 5  2019    2619
## 6  2020    2583
## 7  2021    2500
## 8  2022    2236

While there are differences between the years, there does not seem to be a systematic pattern to it. We can do the same kind of analysis grouped by months and weekdays.

reports_fyears %>% 
  group_by(month) %>% 
  summarise(reports = n())
## # A tibble: 12 × 2
##    month reports
##    <ord>   <int>
##  1 Jan      1735
##  2 Feb      1530
##  3 Mär      1702
##  4 Apr      1657
##  5 Mai      1717
##  6 Jun      1579
##  7 Jul      1645
##  8 Aug      1648
##  9 Sep      1593
## 10 Okt      1627
## 11 Nov      1550
## 12 Dez      1525

reports_fyears %>% 
  group_by(day) %>% 
  summarise(reports = n())
## # A tibble: 7 × 2
##   day   reports
##   <ord>   <int>
## 1 So       2688
## 2 Mo       2663
## 3 Di       2790
## 4 Mi       2909
## 5 Do       2916
## 6 Fr       2903
## 7 Sa       2639

The analysis by month again shows no systematic variation. Looking at the reports by day of the week on the other hand, shows a slight increase in reports over the week, culminating on Friday. As stated above, we do not really know how the data is created, so the main point to take away from this analysis might be, that the PR team of the Berlin police also works on the weekend.

In the same manner, we can analyse the number of released police reports by district.

reports_fyears %>% 
  group_by(district) %>% 
  summarise(reports = n())
## # A tibble: 22 × 2
##    district                     reports
##    <chr>                          <int>
##  1 Charlottenburg - Wilmersdorf     513
##  2 Charlottenburg-Wilmersdorf      1279
##  3 Friedrichshain - Kreuzberg       577
##  4 Friedrichshain-Kreuzberg        1528
##  5 Lichtenberg                     1075
##  6 Marzahn - Hellersdorf            225
##  7 Marzahn-Hellersdorf              758
##  8 Mitte                           2936
##  9 Neukölln                        1734
## 10 Pankow                          1438
## # ℹ 12 more rows

Before we interpret the result, let us examine the district names; districts with a dash are written in two ways: with whitespace around the dash and without one. There may have been a change in the way these names are recorded in the data over the years and we have to deal with it. We can use the function str_remove_all() from stringr to remove every occurrence of a pattern we specify as its argument. If we remove the pattern ” ” we basically delete all whitespace. This can again be combined with mutate() to transform the data in our tibble.

reports_fyears <- reports_fyears %>% 
  mutate(district = str_remove_all(district, pattern = " "))

reports_fyears %>% 
  group_by(district) %>% 
  summarise(reports = n()) %>% 
  arrange(desc(reports))
## # A tibble: 16 × 2
##    district                   reports
##    <chr>                        <int>
##  1 Mitte                         2936
##  2 Friedrichshain-Kreuzberg      2105
##  3 Charlottenburg-Wilmersdorf    1792
##  4 Neukölln                      1734
##  5 Tempelhof-Schöneberg          1542
##  6 Pankow                        1438
##  7 Treptow-Köpenick              1197
##  8 Lichtenberg                   1075
##  9 Reinickendorf                 1038
## 10 Spandau                       1024
## 11 Marzahn-Hellersdorf            983
## 12 Steglitz-Zehlendorf            946
## 13 bezirksübergreifend            921
## 14 <NA>                           382
## 15 berlinweit                     340
## 16 bundesweit                      55

For the grouped summary we also included arrange(desc()) which orders the values from highest to lowest. If you omit desc() the results are ordered ascending.

We might also be interested in the relative share of reports by district. To compute those, one option is to add another column to the tibble resulting from summarise(), in which we calculate the relative share by dividing the “reports” columns values by the total of this column.

reports_fyears %>% 
  group_by(district) %>% 
  summarise(reports = n()) %>% 
  arrange(desc(reports)) %>% 
  mutate(reports_rel = reports / sum(reports))
## # A tibble: 16 × 3
##    district                   reports reports_rel
##    <chr>                        <int>       <dbl>
##  1 Mitte                         2936     0.151  
##  2 Friedrichshain-Kreuzberg      2105     0.108  
##  3 Charlottenburg-Wilmersdorf    1792     0.0919 
##  4 Neukölln                      1734     0.0889 
##  5 Tempelhof-Schöneberg          1542     0.0790 
##  6 Pankow                        1438     0.0737 
##  7 Treptow-Köpenick              1197     0.0614 
##  8 Lichtenberg                   1075     0.0551 
##  9 Reinickendorf                 1038     0.0532 
## 10 Spandau                       1024     0.0525 
## 11 Marzahn-Hellersdorf            983     0.0504 
## 12 Steglitz-Zehlendorf            946     0.0485 
## 13 bezirksübergreifend            921     0.0472 
## 14 <NA>                           382     0.0196 
## 15 berlinweit                     340     0.0174 
## 16 bundesweit                      55     0.00282

The results show clear differences in the number of reports by district, with “Mitte” leading by a substantial margin. Other districts with high counts seem to be those that at least in part lie within the inner ring. The outer districts all show lower numbers. Again, we do not know how the data is created, so maybe it is just the case that the crimes in the inner ring are more interesting than in other districts and that the data does not necessarily point to a higher number of crimes in central Berlin. To get a more complete picture, we would also have to relate the number of reports to the citizens or even tourists per district. For this we would need additional external data though.

Using group_by() we can also group by multiple columns. “Mitte” and “Friedrichshain-Kreuzberg” showed the highest numbers among all districts. Let us analyse if these numbers changed over the years for these two districts. First we have to use filter() to use only the observations referring to those districts and then group the data by the remaining districts and year before we count the the number of released reports.

reports_fyears %>% 
  filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>% 
  group_by(district, year) %>% 
  summarise(reports = n())
## `summarise()` has grouped output by 'district'.
## You can override using the `.groups` argument.
## # A tibble: 16 × 3
## # Groups:   district [2]
##    district                  year reports
##    <chr>                    <dbl>   <int>
##  1 Friedrichshain-Kreuzberg  2015     277
##  2 Friedrichshain-Kreuzberg  2016     269
##  3 Friedrichshain-Kreuzberg  2017     234
##  4 Friedrichshain-Kreuzberg  2018     237
##  5 Friedrichshain-Kreuzberg  2019     285
##  6 Friedrichshain-Kreuzberg  2020     301
##  7 Friedrichshain-Kreuzberg  2021     312
##  8 Friedrichshain-Kreuzberg  2022     190
##  9 Mitte                     2015     367
## 10 Mitte                     2016     448
## 11 Mitte                     2017     347
## 12 Mitte                     2018     363
## 13 Mitte                     2019     391
## 14 Mitte                     2020     374
## 15 Mitte                     2021     328
## 16 Mitte                     2022     318

There does not seem to be a clear pattern for the number of reports over time in both districts. For “Mitte” there is a considerable spike in 2016. To analyse this further, we had to go deeper into the data and look at the actual texts of the police reports.

But we should briefly talk about a peculiarity to the way group_by() works with summarise() that can cause headaches, if we are not aware of it. In general, any summarise() function following a group_by() will calculate the summary statistic and then remove one level of grouping. In the examples where we only had one level of grouping, this essentially meant, that the data was ungrouped after summarise(). In the last example we had two levels of grouping. So, after the computation of the number of reports by “district” and “year” the grouping by “year” was removed, but the grouping by “district” remained in effect. We can see this in the output, where R informs us about the column by which the data are grouped and the number of groups in the output: # Groups: district [2]. Another summarise() function would compute the statistic by “district” and then remove this level as well. We can also use ungroup() to remove all grouping from a tibble. In the case of this example, it does not make a practical difference as we only compute the summary and then move on. But, if we assign the results of a summary to an object for later use, we have to decide if we want to remove the grouping or keep it in effect, depending on the goals of the analysis. In general, I would always ungroup the data and group it again if the need arises, as this is less error prone and just a minor increase in the amount of typing needed.

11.5.2 Summary statistics on the time

The column “time” holds the time of day for each report in hours, minutes and seconds as a representation of the actual values, which are seconds from midnight.

reports_fyears[[1, 'time']]
## [1] "13H 31M 0S"

reports_fyears[[1, 'time']] %>% 
  as.numeric()
## [1] 48660

To calculate with the “time” column, we first have to tell R that we explicitly want to use the data as seconds, then compute the summary statistic and after this, transform the display of seconds back into the time of day. For the transformations we can use the functions period_to_seconds() and seconds_to_period(). We will calculate the mean and the median for the time of day over all police reports in one call of summarise().

reports %>% 
  summarise(mean = period_to_seconds(time) %>% 
              mean() %>% 
              seconds_to_period(),
            median = period_to_seconds(time) %>% 
              median() %>% 
              seconds_to_period()
  )
## # A tibble: 1 × 2
##   mean                      median    
##   <Period>                  <Period>  
## 1 11H 59M 37.1911443148711S 11H 13M 0S

The mean time of day over all police reports is about 11:59 while the median is 11:13. This may indicate that the mean is biased towards a later time by a high number of very late reports. We will explore this further graphically in the next chapter. Again, the results may represent the time when crimes occur, but it is more likely, that we are actually analysing the time when the PR team posts their police reports.

11.6 Exporting tables

Exporting the results of our analysis – e.g. summary statistics computed with summarise() – sadly is not as straightforward as one might think or hope. We can always copy and paste values from the output into Word or other pieces of software we might use for writing a report. But this is cumbersome and error prone.

One way of directly exporting tables we already know about is writing them to a .csv file. For more on this please review section 8.1. These can be imported into Excel and also directly into Word. This may be a quick way of exporting the values of a table and importing them into Office, but we have to do all the formatting in the external software and may have to repeat the formatting procedure every time the values change. So we should make sure that all errors have been fixed and the analysis is final before exporting to .csv.

In this example we save the number of police reports by year for the districts “Mitte” and “Friedrichshain-Kreuzberg” we computed earlier into a .csv file:

reports_fyears %>% 
  filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>% 
  group_by(district, year) %>% 
  summarise(reports = n()) %>% 
  write_csv("reports_year_M-FK.csv")
## `summarise()` has grouped output by 'district'.
## You can override using the `.groups` argument.

For reasons unknown to anyone but Microsoft, importing a .csv file that uses commas as the delimiter can actually be quite tricky on a German Windows installation if you are using Microsoft Office. Using Libre Office fixes this, but if you want to use Microsoft Office on a German system you might have to use semicolons as the delimiter by saving the file with write_csv2().

11.6.1 Further resources

A multitude of packages that provide functions for formatting and writing tables to different formats exist in the R world. To go into details is beyond the scope of this introduction, but I want to give you some pointers at least.

If you want to skip importing a .csv into Microsoft Office, you may be interested in packages that allow writing directly to .xlsx files. Two options are:

A way to directly produce tables in the .docx format is provided by the flextable package: https://davidgohel.github.io/flextable/

If your are working with LaTeX, the huxtable package can directly output .tex files, as well as HTML, Word, Excel and Powerpoint: https://hughjonesd.github.io/huxtable/

If you want to circumvent using external software for producing reports completely, R Markdown may be of interest to you. This allows you to write your report and your R code in one document and write it to .pdf or a range of other formats. The produced file includes all the output from your code that you want to report, i.e. code, plots and tables. Some resources on R Markdown can be found here:

The website your are viewing right now, as well as many of the linked resources, are built with the bookdown package, which extends the functionality of R Markdown and is suited for more large scale projects.