11 Transformation with dplyr
In this chapter, we will use several functions from the tidyverse package dplyr to clean the data we scraped in chapter 10. We will then select the columns and filter the observations (rows) we need, as well as compute some grouped summaries of the data.
First, let us load the required packages. dplyr is part of the core tidyverse; the same is true for the package lubridate, which will enable us to deal with dates and times.
11.1 mutate()
11.1.1 Computing new variables from existing ones
Before we continue working on our scraped reports data, we will look at some simple
examples that will show how new variables can be computed in a tibble from
existing ones, using mutate()
from dplyr.
For this purpose, let us create a new simple tibble on voter turnout. Note that the entered data is purely illustrational and has no meaning.
exmpl_tbl <- tibble(
ybirth = c("1993 - 2002", "1983 - 1992", "1973 - 1982", "1963 - 1972"),
eligible = c(100000, 100000, 100000, 100000),
turnout_person = c(40000, 50000, 60000, 70000),
turnout_mail = c (35000, 30000, 25000, 20000)
)
exmpl_tbl
## # A tibble: 4 × 4
## ybirth eligible turnout_person turnout_mail
## <chr> <dbl> <dbl> <dbl>
## 1 1993 - 2002 100000 40000 35000
## 2 1983 - 1992 100000 50000 30000
## 3 1973 - 1982 100000 60000 25000
## 4 1963 - 1972 100000 70000 20000
Here we have different columns for people who were eligible to vote, who voted in person and who voted by
mail, by their year of birth. We do not actually care about the difference in
the voting method in this example and want one column that combines both. We can
get there by using mutate()
. The function takes the data to be manipulated as
its first argument, followed by one or multiple arguments defining the new
columns to be created. We can create this new columns as computations from the
columns already present.
To calculate the total turnout, we write:
exmpl_tbl %>%
mutate(turnout = turnout_person + turnout_mail)
## # A tibble: 4 × 5
## ybirth eligible turnout_person turnout_mail turnout
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1993 - 2002 100000 40000 35000 75000
## 2 1983 - 1992 100000 50000 30000 80000
## 3 1973 - 1982 100000 60000 25000 85000
## 4 1963 - 1972 100000 70000 20000 90000
We can also immediately start calculating with new columns in the same mutate()
command. For this, we chain multiple operations in mutate()
separated with ,
.
To calculate the turnout percentage in a second step, we write the following:
exmpl_tbl %>%
mutate(
turnout = turnout_person + turnout_mail,
turnout_pct = turnout / eligible
)
## # A tibble: 4 × 6
## ybirth eligible turnout_person turnout_mail turnout turnout_pct
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1993 - 2002 100000 40000 35000 75000 0.75
## 2 1983 - 1992 100000 50000 30000 80000 0.8
## 3 1973 - 1982 100000 60000 25000 85000 0.85
## 4 1963 - 1972 100000 70000 20000 90000 0.9
Note that mutate()
is not limited to basic arithmetic operations. Many
functions can be applied within mutate()
: sum()
or mean()
are two examples
we already know.
11.1.2 Cleaning the data on police reports
Now knowing the basic principle of working with mutate()
, we can apply this to
the data on police reports we scraped in chapter 10.
First we have to load the data and examine what we are working with.
load("reports.RData")
reports
## # A tibble: 12,412 × 3
## Date Report District
## <chr> <chr> <chr>
## 1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichverletzungen Ereigni…
## 2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen Ereigni…
## 3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht Ereigni…
## 4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt Ereigni…
## 5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Ladendetektiv b… Ereigni…
## 6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgänger im Kra… Ereigni…
## 7 06.03.2025 10:43 Uhr Auseinandersetzung im Park Ereigni…
## 8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt Ereigni…
## 9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunfall verletzt Ereigni…
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert sich volksver… Ereigni…
## # ℹ 12,402 more rows
We can leave the column containing the text of the police report as it is, but the columns containing the districts, as well as the date and time, could use some work.
The data in the column “District” includes the string “Ereignisort:” before
listing the actual name of the district the report refers to. This unnecessarily
clutters this column: we can use substr()
to remove these leading characters
within mutate()
. substr()
extracts a part of a string we want
to keep, taking the data to be applied to, as its first argument – here the
name of the column –, the starting character position where the extraction
shall begin as its second, and the character position where it shall stop, as
its third argument. Character position here refers to the numeric position of a
character in a string: to begin extraction after “Ereignisort:” we have to
count the length of this substring including the whitespace after “:”, which is
\(13\) and thus use \(14\) as the starting position. For the position to stop
extraction, we could either use an unrealistically high number or use
str_length(District)
to determine the exact length of each string.
reports %>%
mutate(District = substr(District, 14, 99))
## # A tibble: 12,412 × 3
## Date Report District
## <chr> <chr> <chr>
## 1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichverletzungen Mitte
## 2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen Reinick…
## 3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht Lichten…
## 4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt Mitte
## 5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Ladendetektiv b… Pankow
## 6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgänger im Kra… Friedri…
## 7 06.03.2025 10:43 Uhr Auseinandersetzung im Park Pankow
## 8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt Charlot…
## 9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunfall verletzt Mitte
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert sich volksver… Charlot…
## # ℹ 12,402 more rows
reports %>%
mutate(District = substr(District, 14, str_length(District)))
## # A tibble: 12,412 × 3
## Date Report District
## <chr> <chr> <chr>
## 1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichverletzungen Mitte
## 2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen Reinick…
## 3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht Lichten…
## 4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt Mitte
## 5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Ladendetektiv b… Pankow
## 6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgänger im Kra… Friedri…
## 7 06.03.2025 10:43 Uhr Auseinandersetzung im Park Pankow
## 8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt Charlot…
## 9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunfall verletzt Mitte
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert sich volksver… Charlot…
## # ℹ 12,402 more rows
The column “Date” includes the date and the time of each report as a character
string. To be able to use this data in analysis, we have to extract the date and
time in a format R can understand. This is easily achieved using the parsing
functions from lubridate. As the character data are written in the format
“day.month.year hour:minute”, we have to use the function dmy_hm()
on the
column.
reports %>%
mutate(date_cmpl = dmy_hm(Date))
## # A tibble: 12,412 × 4
## Date Report District date_cmpl
## <chr> <chr> <chr> <dttm>
## 1 06.03.2025 15:01 Uhr Heranwachsender mit mehrer… Ereigni… 2025-03-06 15:01:00
## 2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenomm… Ereigni… 2025-03-06 14:57:00
## 3 06.03.2025 14:49 Uhr Mehrere Brände in einer Na… Ereigni… 2025-03-06 14:49:00
## 4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall… Ereigni… 2025-03-06 11:40:00
## 5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen… Ereigni… 2025-03-06 10:49:00
## 6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzun… Ereigni… 2025-03-06 10:46:00
## 7 06.03.2025 10:43 Uhr Auseinandersetzung im Park Ereigni… 2025-03-06 10:43:00
## 8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb … Ereigni… 2025-03-06 09:52:00
## 9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Ve… Ereigni… 2025-03-06 09:21:00
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und … Ereigni… 2025-03-05 16:53:00
## # ℹ 12,402 more rows
At a later point we will do some analysis by year, month, weekday and time of day. lubridate includes functions that extract the subparts of date and time data.
year()
extracts the year, month()
the month and wday()
the day of the
week. The argument label = TRUE
tells the function to use the names of months
and days instead of a numerical value ranging from \(1\) to \(12\) and \(1\) to \(7\)
respectively.
reports %>%
mutate(date_cmpl = dmy_hm(Date),
year = year(date_cmpl),
month = month(date_cmpl, label = TRUE),
day = wday(date_cmpl, label = TRUE)
)
## # A tibble: 12,412 × 7
## Date Report District date_cmpl year month day
## <chr> <chr> <chr> <dttm> <dbl> <ord> <ord>
## 1 06.03.2025 15:01 Uhr Heranwac… Ereigni… 2025-03-06 15:01:00 2025 Mär Do
## 2 06.03.2025 14:57 Uhr Kokstaxi… Ereigni… 2025-03-06 14:57:00 2025 Mär Do
## 3 06.03.2025 14:49 Uhr Mehrere … Ereigni… 2025-03-06 14:49:00 2025 Mär Do
## 4 06.03.2025 11:40 Uhr Tourist … Ereigni… 2025-03-06 11:40:00 2025 Mär Do
## 5 06.03.2025 10:49 Uhr Hausfrie… Ereigni… 2025-03-06 10:49:00 2025 Mär Do
## 6 06.03.2025 10:46 Uhr Verkehrs… Ereigni… 2025-03-06 10:46:00 2025 Mär Do
## 7 06.03.2025 10:43 Uhr Auseinan… Ereigni… 2025-03-06 10:43:00 2025 Mär Do
## 8 06.03.2025 09:52 Uhr Mitarbei… Ereigni… 2025-03-06 09:52:00 2025 Mär Do
## 9 06.03.2025 09:21 Uhr Motorrol… Ereigni… 2025-03-06 09:21:00 2025 Mär Do
## 10 05.03.2025 16:53 Uhr Mann zei… Ereigni… 2025-03-05 16:53:00 2025 Mär Mi
## # ℹ 12,402 more rows
To extract the time of day, we first have to extract the substring of “Date”
that contains the time and then apply the lubridate function hm()
to it.
reports %>%
mutate(time =
substr(Date, 12, str_length(Date)) %>%
hm()
)
## # A tibble: 12,412 × 4
## Date Report District time
## <chr> <chr> <chr> <Period>
## 1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichv… Ereigni… 15H 1M 0S
## 2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen Ereigni… 14H 57M 0S
## 3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht Ereigni… 14H 49M 0S
## 4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt Ereigni… 11H 40M 0S
## 5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Lade… Ereigni… 10H 49M 0S
## 6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgä… Ereigni… 10H 46M 0S
## 7 06.03.2025 10:43 Uhr Auseinandersetzung im Park Ereigni… 10H 43M 0S
## 8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt Ereigni… 9H 52M 0S
## 9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunf… Ereigni… 9H 21M 0S
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert si… Ereigni… 16H 53M 0S
## # ℹ 12,402 more rows
We can combine all these individual steps in one mutate()
command and apply
them to the object reports
.
reports <- reports %>%
mutate(District = substr(District, 14, str_length(District)),
date_cmpl = dmy_hm(Date),
year = year(date_cmpl),
month = month(date_cmpl, label = TRUE),
day = wday(date_cmpl, label = TRUE),
time =
substr(Date, 12, str_length(Date)) %>%
hm()
)
reports
## # A tibble: 12,412 × 8
## Date Report District date_cmpl year month day time
## <chr> <chr> <chr> <dttm> <dbl> <ord> <ord> <Period>
## 1 06.03.2025 … Heran… Mitte 2025-03-06 15:01:00 2025 Mär Do 15H 1M 0S
## 2 06.03.2025 … Kokst… Reinick… 2025-03-06 14:57:00 2025 Mär Do 14H 57M 0S
## 3 06.03.2025 … Mehre… Lichten… 2025-03-06 14:49:00 2025 Mär Do 14H 49M 0S
## 4 06.03.2025 … Touri… Mitte 2025-03-06 11:40:00 2025 Mär Do 11H 40M 0S
## 5 06.03.2025 … Hausf… Pankow 2025-03-06 10:49:00 2025 Mär Do 10H 49M 0S
## 6 06.03.2025 … Verke… Friedri… 2025-03-06 10:46:00 2025 Mär Do 10H 46M 0S
## 7 06.03.2025 … Ausei… Pankow 2025-03-06 10:43:00 2025 Mär Do 10H 43M 0S
## 8 06.03.2025 … Mitar… Charlot… 2025-03-06 09:52:00 2025 Mär Do 9H 52M 0S
## 9 06.03.2025 … Motor… Mitte 2025-03-06 09:21:00 2025 Mär Do 9H 21M 0S
## 10 05.03.2025 … Mann … Charlot… 2025-03-05 16:53:00 2025 Mär Mi 16H 53M 0S
## # ℹ 12,402 more rows
11.2 select()
As we have already extracted all we need from the “Date” and “date_cmpl” columns,
and we will not work with the “Report” column in this chapter, to keep the
tibble neat and free of clutter, we remove the three columns by using the
function select()
. select()
is used for selecting columns from a tibbble. The
function takes the data as its first argument – here provided by the pipe –
and one or several names of columns that should be kept. If columns follow after
each other in order, we can also use a “from:to” notation.
reports %>%
select(District, year, month, day, time)
## # A tibble: 12,412 × 5
## District year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Mitte 2025 Mär Do 15H 1M 0S
## 2 Reinickendorf 2025 Mär Do 14H 57M 0S
## 3 Lichtenberg 2025 Mär Do 14H 49M 0S
## 4 Mitte 2025 Mär Do 11H 40M 0S
## 5 Pankow 2025 Mär Do 10H 49M 0S
## 6 Friedrichshain-Kreuzberg 2025 Mär Do 10H 46M 0S
## 7 Pankow 2025 Mär Do 10H 43M 0S
## 8 Charlottenburg-Wilmersdorf 2025 Mär Do 9H 52M 0S
## 9 Mitte 2025 Mär Do 9H 21M 0S
## 10 Charlottenburg-Wilmersdorf 2025 Mär Mi 16H 53M 0S
## # ℹ 12,402 more rows
reports %>%
select(District, year:time)
## # A tibble: 12,412 × 5
## District year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Mitte 2025 Mär Do 15H 1M 0S
## 2 Reinickendorf 2025 Mär Do 14H 57M 0S
## 3 Lichtenberg 2025 Mär Do 14H 49M 0S
## 4 Mitte 2025 Mär Do 11H 40M 0S
## 5 Pankow 2025 Mär Do 10H 49M 0S
## 6 Friedrichshain-Kreuzberg 2025 Mär Do 10H 46M 0S
## 7 Pankow 2025 Mär Do 10H 43M 0S
## 8 Charlottenburg-Wilmersdorf 2025 Mär Do 9H 52M 0S
## 9 Mitte 2025 Mär Do 9H 21M 0S
## 10 Charlottenburg-Wilmersdorf 2025 Mär Mi 16H 53M 0S
## # ℹ 12,402 more rows
Instead of telling select()
which columns we want to keep, we can also tell it
which ones not to keep by adding a -
before the column names. If we want to
exclude multiple columns in one step, we combine them with the c()
function
and write the -
directly before it. In this last example we will assign the
result to reports
.
reports <- reports %>%
select(-c(Date, Report, date_cmpl))
reports
## # A tibble: 12,412 × 5
## District year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Mitte 2025 Mär Do 15H 1M 0S
## 2 Reinickendorf 2025 Mär Do 14H 57M 0S
## 3 Lichtenberg 2025 Mär Do 14H 49M 0S
## 4 Mitte 2025 Mär Do 11H 40M 0S
## 5 Pankow 2025 Mär Do 10H 49M 0S
## 6 Friedrichshain-Kreuzberg 2025 Mär Do 10H 46M 0S
## 7 Pankow 2025 Mär Do 10H 43M 0S
## 8 Charlottenburg-Wilmersdorf 2025 Mär Do 9H 52M 0S
## 9 Mitte 2025 Mär Do 9H 21M 0S
## 10 Charlottenburg-Wilmersdorf 2025 Mär Mi 16H 53M 0S
## # ℹ 12,402 more rows
11.3 rename()
As you may have noticed, all newly created columns are written in lower case, while “District” begins with an upper case letter. You may want to rename this column to a lower case name – or all the others to upper case names, depending on your preference.
One approach to renaming is using the function rename()
from dplyr: the
function takes the data to be applied to as its first argument – again passed
on by the pipe here – followed by one or more arguments in the form
new_name = old_name
.
reports <- reports %>%
rename(district = District)
reports
## # A tibble: 12,412 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Mitte 2025 Mär Do 15H 1M 0S
## 2 Reinickendorf 2025 Mär Do 14H 57M 0S
## 3 Lichtenberg 2025 Mär Do 14H 49M 0S
## 4 Mitte 2025 Mär Do 11H 40M 0S
## 5 Pankow 2025 Mär Do 10H 49M 0S
## 6 Friedrichshain-Kreuzberg 2025 Mär Do 10H 46M 0S
## 7 Pankow 2025 Mär Do 10H 43M 0S
## 8 Charlottenburg-Wilmersdorf 2025 Mär Do 9H 52M 0S
## 9 Mitte 2025 Mär Do 9H 21M 0S
## 10 Charlottenburg-Wilmersdorf 2025 Mär Mi 16H 53M 0S
## # ℹ 12,402 more rows
11.4 filter()
Some of our analysis requires us to have complete years in our data. If we want to compare the police reports by year, it makes no sense to compare the numbers for an ongoing year with those that are complete, since the former will obviously have fewer reports. Incomplete years would also impact analysis by month, as some months will have more observations than others; therefore, we should filter our data for complete years.
To filter the observations, we can use the function filter()
. As always in
tidyverse functions, filter()
takes the data to be filtered as its first
argument and one or multiple expressions that specify the rules by which to
filter. To write these expressions, we can make use of the comparison operators
discussed in subchapter 1.3.2.
To filter for all years that are not 2025, we can write:
reports %>%
filter(year != 2025)
## # A tibble: 11,912 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Marzahn-Hellersdorf 2024 Okt Fr 11H 25M 0S
## 2 Mitte 2024 Sep Do 11H 30M 0S
## 3 Lichtenberg 2024 Mai Mi 14H 7M 0S
## 4 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 59M 0S
## 5 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 7M 0S
## 6 Charlottenburg-Wilmersdorf 2024 Dez Di 13H 46M 0S
## 7 Treptow-Köpenick 2024 Dez Di 12H 18M 0S
## 8 Lichtenberg 2024 Dez Di 11H 56M 0S
## 9 Charlottenburg-Wilmersdorf 2024 Dez Di 11H 5M 0S
## 10 Steglitz-Zehlendorf 2024 Dez Mo 16H 58M 0S
## # ℹ 11,902 more rows
Thus, only the observations where “year” does not equal “2025” remain in the
tibble; more accurately, those observations for which the expression
year != 2025
is returned as TRUE
.
This is all the filtering we will have to do here, but we should still look on how to filter by multiple values at once, which is an important data wrangling still to have.
We can chain multiple expressions when using filter()
separated by commas.
o for example to reduce the data set to observations from 2024, 2023 and 2022 we
could write:
reports %>%
filter(year != 2025, year != 2021, year != 2020)
## # A tibble: 6,832 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Marzahn-Hellersdorf 2024 Okt Fr 11H 25M 0S
## 2 Mitte 2024 Sep Do 11H 30M 0S
## 3 Lichtenberg 2024 Mai Mi 14H 7M 0S
## 4 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 59M 0S
## 5 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 7M 0S
## 6 Charlottenburg-Wilmersdorf 2024 Dez Di 13H 46M 0S
## 7 Treptow-Köpenick 2024 Dez Di 12H 18M 0S
## 8 Lichtenberg 2024 Dez Di 11H 56M 0S
## 9 Charlottenburg-Wilmersdorf 2024 Dez Di 11H 5M 0S
## 10 Steglitz-Zehlendorf 2024 Dez Mo 16H 58M 0S
## # ℹ 6,822 more rows
Here we are excluding every observation that does not lie between 2022–2024.
Often there are multiple ways to formulate the filtering expressions. Here, we
could tell R which values for “year” we want to keep, instead of which we do not
want to keep. Instead of listing all those individual years, we can use %in%
to define a range of numerical values which should be kept when filtering. We
assign the result to a new object for later use.
reports %>%
filter(year %in% 2022:2024)
## # A tibble: 6,832 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Marzahn-Hellersdorf 2024 Okt Fr 11H 25M 0S
## 2 Mitte 2024 Sep Do 11H 30M 0S
## 3 Lichtenberg 2024 Mai Mi 14H 7M 0S
## 4 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 59M 0S
## 5 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 7M 0S
## 6 Charlottenburg-Wilmersdorf 2024 Dez Di 13H 46M 0S
## 7 Treptow-Köpenick 2024 Dez Di 12H 18M 0S
## 8 Lichtenberg 2024 Dez Di 11H 56M 0S
## 9 Charlottenburg-Wilmersdorf 2024 Dez Di 11H 5M 0S
## 10 Steglitz-Zehlendorf 2024 Dez Mo 16H 58M 0S
## # ℹ 6,822 more rows
While the ,
behaves like the logical operator &
, we can also use |
for
“OR” when combining expressions to be filtered upon.
For example we could use |
to select the same years in this way:
reports %>%
filter(year == 2022 | year == 2023 | year == 2024)
## # A tibble: 6,832 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Marzahn-Hellersdorf 2024 Okt Fr 11H 25M 0S
## 2 Mitte 2024 Sep Do 11H 30M 0S
## 3 Lichtenberg 2024 Mai Mi 14H 7M 0S
## 4 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 59M 0S
## 5 Charlottenburg-Wilmersdorf 2024 Dez Di 16H 7M 0S
## 6 Charlottenburg-Wilmersdorf 2024 Dez Di 13H 46M 0S
## 7 Treptow-Köpenick 2024 Dez Di 12H 18M 0S
## 8 Lichtenberg 2024 Dez Di 11H 56M 0S
## 9 Charlottenburg-Wilmersdorf 2024 Dez Di 11H 5M 0S
## 10 Steglitz-Zehlendorf 2024 Dez Mo 16H 58M 0S
## # ℹ 6,822 more rows
It is important to keep in mind the differences between AND and OR here.
If we would have used &
or ,
to chain these expressions instead, the
following would have happened:
reports %>%
filter(year == 2022, year == 2023, year == 2024)
## # A tibble: 0 × 5
## # ℹ 5 variables: district <chr>, year <dbl>, month <ord>, day <ord>,
## # time <Period>
We are left with nothing in our data set. This happens because the expression in
filter()
is now searching for observations who are at the same time from
2022, 2023 and 2024. This is obviously impossible.
Let us now actually remove the observations for 2025 from the saved object.
11.5 summarise()
& group_by()
We can now begin computing some basic summary statistics for our data. You should note though, that we do not know how the data we scraped is actually created. Do the dates and times on the website refer to the instance in time when a crime occurred – there are indications that this is not the case –, when a report was filed, or even when the PR team of the Berlin police uploaded the report to the website – which might even be the most plausible scenario? Also, are all crimes reported on this website – the daily numbers are too low for this to be the case – or is there an internal selection process? And if so, on what criteria does the selection occur? If this was a real research project, we would absolutely need to gain clarity on these and many other questions, before we even begin with collecting the data. We do not have clarity here, but we have to keep in mind that we may not analyse the statistical distribution of crimes in Berlin, but rather the working practice of the PR team.
11.5.1 Number of reports by groups
The dplyr function summarise()
can be used to calculate summary statistics for
a whole tibble; the syntax being similar to mutate()
. The result is a new
tibble, containing only the summary statistics we requested. The function we use
for the summary statistic here is n()
, which returns the number of
observations. As each observation represents one police report, this equals the
overall number of reports.
This worked, but is not very interesting, as the result is the same as the length
of the reports tibble. We are often more interested in summaries grouped by the
value of one or more other variables. We might ask ourselves, if there are
differences in the number of released police reports by year; for this purpose,
we can group the data by the values of the “year” column using group_by()
and
then compute the summary statistic separately for each group. Here, n()
returns the number of observations for each group, i.e. the number of reports
per year. As always with dplyr functions, group_by()
needs the data which
will be grouped as the first argument, followed by one or multiple columns to
group by.
reports_fyears %>%
group_by(year) %>%
summarise(reports = n())
## # A tibble: 5 × 2
## year reports
## <dbl> <int>
## 1 2020 2581
## 2 2021 2499
## 3 2022 2235
## 4 2023 2196
## 5 2024 2401
While there are differences between the years, there does not seem to be a clear and systematic pattern to it. We can do the same kind of analysis grouped by months and weekdays.
reports_fyears %>%
group_by(month) %>%
summarise(reports = n())
## # A tibble: 12 × 2
## month reports
## <ord> <int>
## 1 Jan 1013
## 2 Feb 859
## 3 Mär 1020
## 4 Apr 1092
## 5 Mai 1089
## 6 Jun 1015
## 7 Jul 1006
## 8 Aug 1009
## 9 Sep 975
## 10 Okt 964
## 11 Nov 957
## 12 Dez 913
reports_fyears %>%
group_by(day) %>%
summarise(reports = n())
## # A tibble: 7 × 2
## day reports
## <ord> <int>
## 1 So 1568
## 2 Mo 1608
## 3 Di 1772
## 4 Mi 1834
## 5 Do 1870
## 6 Fr 1754
## 7 Sa 1506
The analysis by month again shows no systematic variation (that February has less reports will moist probably come down to the shorter length of the month). Looking at the reports by day of the week on the other hand, shows a slight increase in reports over the week, culminating on Thursday As stated above, we do not really know how the data is created, so the main point to take away from this analysis might be, that the PR team of the Berlin police also works on the weekend, but with fewer working hours.
In the same manner, we can analyse the number of released police reports by district.
reports_fyears %>%
group_by(district) %>%
summarise(reports = n()) %>%
arrange(desc(reports))
## # A tibble: 16 × 2
## district reports
## <chr> <int>
## 1 Mitte 1723
## 2 Friedrichshain-Kreuzberg 1244
## 3 Neukölln 1064
## 4 Charlottenburg-Wilmersdorf 1059
## 5 Tempelhof-Schöneberg 872
## 6 Pankow 869
## 7 Lichtenberg 757
## 8 Treptow-Köpenick 737
## 9 Spandau 718
## 10 Marzahn-Hellersdorf 677
## 11 Reinickendorf 650
## 12 Steglitz-Zehlendorf 566
## 13 bezirksübergreifend 463
## 14 berlinweit 379
## 15 <NA> 79
## 16 bundesweit 55
For the grouped summary we also included arrange(desc())
which orders the values
from highest to lowest. If you omit desc()
the results are ordered
ascending.
We might also be interested in the relative share of reports by district. To
compute those, one option is to add another column to the tibble resulting from
summarise()
, in which we calculate the relative share by dividing the
“reports” columns values by the total of this column.
reports_fyears %>%
group_by(district) %>%
summarise(reports = n()) %>%
arrange(desc(reports)) %>%
mutate(reports_rel = reports / sum(reports))
## # A tibble: 16 × 3
## district reports reports_rel
## <chr> <int> <dbl>
## 1 Mitte 1723 0.145
## 2 Friedrichshain-Kreuzberg 1244 0.104
## 3 Neukölln 1064 0.0893
## 4 Charlottenburg-Wilmersdorf 1059 0.0889
## 5 Tempelhof-Schöneberg 872 0.0732
## 6 Pankow 869 0.0730
## 7 Lichtenberg 757 0.0635
## 8 Treptow-Köpenick 737 0.0619
## 9 Spandau 718 0.0603
## 10 Marzahn-Hellersdorf 677 0.0568
## 11 Reinickendorf 650 0.0546
## 12 Steglitz-Zehlendorf 566 0.0475
## 13 bezirksübergreifend 463 0.0389
## 14 berlinweit 379 0.0318
## 15 <NA> 79 0.00663
## 16 bundesweit 55 0.00462
The results show clear differences in the number of reports by district, with “Mitte” leading by a substantial margin. Other districts with high counts seem to be those that at least in part lie within the inner ring. The outer districts all show lower numbers. Again, we do not know how the data is created, so maybe it is just the case that the crimes in the inner ring are more interesting than in other districts and that the data does not necessarily point to a higher number of crimes in central Berlin. To get a more complete picture, we would also have to relate the number of reports to the citizens or even tourists per district. For this we would need additional external data though.
Using group_by()
we can also group by multiple columns. “Mitte” and
“Friedrichshain-Kreuzberg” showed the highest numbers among all districts. Let
us analyse if these numbers changed over the years for these two districts.
First we have to use filter()
to use only the observations referring to those
districts and then group the data by the remaining districts and year before we
count the the number of released reports.
reports_fyears %>%
filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>%
group_by(district, year) %>%
summarise(reports = n())
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups: district [2]
## district year reports
## <chr> <dbl> <int>
## 1 Friedrichshain-Kreuzberg 2020 301
## 2 Friedrichshain-Kreuzberg 2021 312
## 3 Friedrichshain-Kreuzberg 2022 190
## 4 Friedrichshain-Kreuzberg 2023 210
## 5 Friedrichshain-Kreuzberg 2024 231
## 6 Mitte 2020 374
## 7 Mitte 2021 328
## 8 Mitte 2022 318
## 9 Mitte 2023 333
## 10 Mitte 2024 370
Looking at “Friedrichshain-Kreuzberg” we can observe a substantial decrease in reports starting from 2022. The same is not true for “Mitte” which despite of a moderate temporal decrease seems to be relatively stable in the number of reports. To analyse this further, we had to go deeper into the data and look at the actual texts of the police reports.
But we should briefly talk about a peculiarity to the way group_by()
works
with summarise()
that can cause headaches, if we are not aware of it. In
general, any summarise()
function following a group_by()
will calculate the
summary statistic and then remove one level of grouping. In the examples where
we only had one level of grouping, this essentially meant, that the data was
ungrouped after summarise()
. In the last example we had two levels of
grouping. So, after the computation of the number of reports by “district” and
“year” the grouping by “year” was removed, but the grouping by “district”
remained in effect. We can see this in the output, where R informs us about the
column by which the data are grouped and the number of groups in the output:
# Groups: district [2]
. Another summarise()
function would compute the
statistic by “district” and then remove this level as well. We can also use
ungroup()
to remove all grouping from a tibble. In the case of this example,
it does not make a practical difference as we only compute the summary and then
move on. But, if we assign the results of a summary to an object for later use,
we have to decide if we want to remove the grouping or keep it in effect,
depending on the goals of the analysis. In general, I would always ungroup the
data and group it again if the need arises, as this is less error prone and just
a minor increase in the amount of typing needed.
11.5.2 Summary statistics on the time
The column “time” holds the time of day for each report in hours, minutes and seconds as a representation of the actual values, which are seconds from midnight.
reports_fyears[[1, 'time']]
## [1] "11H 25M 0S"
reports_fyears[[1, 'time']] %>%
as.numeric()
## [1] 41100
To calculate with the “time” column, we first have to tell R that we explicitly
want to use the data as seconds, then compute the summary statistic and after
this, transform the display of seconds back into the time of day. For the
transformations we can use the functions period_to_seconds()
and
seconds_to_period()
. We will calculate the mean and the median for the time of
day over all police reports in one call of summarise()
.
reports %>%
summarise(mean = period_to_seconds(time) %>%
mean() %>%
seconds_to_period(),
median = period_to_seconds(time) %>%
median() %>%
seconds_to_period()
)
## # A tibble: 1 × 2
## mean median
## <Period> <Period>
## 1 12H 31M 10.2336448598144S 12H 4M 0S
The mean time of day over all police reports is about 12:28 while the median is 12:01. This may indicate that the mean is biased towards a later time by a high number of very late reports. We will explore this further graphically in the next chapter. Again, the results may represent the time when crimes occur, but it is more likely, that we are actually analysing the time when the PR team posts their police reports.
11.6 Exporting tables
Exporting the results of our analysis – e.g. summary statistics computed with
summarise()
– sadly is not as straightforward as one might think or hope. We
can always copy and paste values from the output into Word or other pieces of
software we might use for writing a report. But this is cumbersome and error
prone.
One way of directly exporting tables we already know about is writing them to a .csv file. For more on this please review section 8.1. These can be imported into Excel and also directly into Word. This may be a quick way of exporting the values of a table and importing them into Office, but we have to do all the formatting in the external software and may have to repeat the formatting procedure every time the values change. So we should make sure that all errors have been fixed and the analysis is final before exporting to .csv.
In this example we save the number of police reports by year for the districts “Mitte” and “Friedrichshain-Kreuzberg” we computed earlier into a .csv file:
reports_fyears %>%
filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>%
group_by(district, year) %>%
summarise(reports = n()) %>%
write_csv("reports_year_M-FK.csv")
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.
For reasons unknown to anyone but Microsoft, importing a .csv file that uses
commas as the delimiter can actually be quite tricky on a German Windows
installation if you are using Microsoft Office. Using Libre Office fixes this,
but if you want to use Microsoft Office on a German system you might have to use
semicolons as the delimiter by saving the file with write_csv2()
.
11.6.1 Further resources
A multitude of packages that provide functions for formatting and writing tables to different formats exist in the R world. To go into details is beyond the scope of this introduction, but I want to give you some pointers at least.
If you want to skip importing a .csv into Microsoft Office, you may be interested in packages that allow writing directly to .xlsx files. Two options are:
- writexl: https://cran.r-project.org/web/packages/writexl/index.html
- openxlsx: https://cran.r-project.org/web/packages/openxlsx/index.html
A way to directly produce tables in the .docx format is provided by the flextable package: https://davidgohel.github.io/flextable/
If your are working with LaTeX, the huxtable package can directly output .tex files, as well as HTML, Word, Excel and Powerpoint: https://hughjonesd.github.io/huxtable/
If you want to circumvent using external software for producing reports completely, R Markdown may be of interest to you. This allows you to write your report and your R code in one document and write it to .pdf or a range of other formats. The produced file includes all the output from your code that you want to report, i.e. code, plots and tables. Some resources on R Markdown can be found here:
- Introduction on the RStudio website: https://rmarkdown.rstudio.com/lesson-1.html{target_“blank”}
- The R Markdown cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf
- Chapter on R Markdown in “R for Data Science” by Hadley Wickham and Garrett Grolemund: https://r4ds.had.co.nz/r-markdown.html
- “R Markdown Cookbook” by Yihui Xie, Christophe Dervieux, Emily Riederer: https://bookdown.org/yihui/rmarkdown-cookbook/
The website your are viewing right now, as well as many of the linked resources, are built with the bookdown package, which extends the functionality of R Markdown and is suited for more large scale projects.
- “bookdown: Authoring Books and Technical Documents with R Markdown” by Yihui Xie: https://bookdown.org/yihui/bookdown/