11 Transformation with dplyr
In this chapter, we will use several functions from the tidyverse package dplyr to clean the data we scraped in chapter 10. We will then select the columns and filter the observations (rows) we need, as well as compute some grouped summaries of the data.
First, let us load the required packages. dplyr is part of the core tidyverse; we will need the package lubridate as well, to be able to deal with dates and times.
11.1 mutate()
11.1.1 Computing new variables from existing ones
Before we continue working on our scraped reports data, we will look at some simple
examples that will show how new variables can be computed in a tibble from
existing ones, using mutate()
from dplyr.
For this purpose, let us create a new simple tibble on voter turnout. Note that the entered data is purely illustrational and has no meaning.
exmpl_tbl <- tibble(
ybirth = c("1993 - 2002", "1983 - 1992", "1973 - 1982", "1963 - 1972"),
eligible = c(100000, 100000, 100000, 100000),
turnout_person = c(40000, 50000, 60000, 70000),
turnout_mail = c (35000, 30000, 25000, 20000)
)
exmpl_tbl
## # A tibble: 4 × 4
## ybirth eligible turnout_person turnout_mail
## <chr> <dbl> <dbl> <dbl>
## 1 1993 - 2002 100000 40000 35000
## 2 1983 - 1992 100000 50000 30000
## 3 1973 - 1982 100000 60000 25000
## 4 1963 - 1972 100000 70000 20000
Here we have different columns for people who were eligible to vote, who voted in person and who voted by
mail, by their year of birth. We do not actually care about the difference in
the voting method in this example and want one column that combines both. We can
get there by using mutate()
. The function takes the data to be manipulated as
its first argument, followed by one or multiple arguments defining the new
columns to be created. We can create this new columns as computations from the
columns already present.
To calculate the total turnout, we write:
exmpl_tbl %>%
mutate(turnout = turnout_person + turnout_mail)
## # A tibble: 4 × 5
## ybirth eligible turnout_person turnout_mail turnout
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1993 - 2002 100000 40000 35000 75000
## 2 1983 - 1992 100000 50000 30000 80000
## 3 1973 - 1982 100000 60000 25000 85000
## 4 1963 - 1972 100000 70000 20000 90000
We can also immediately start calculating with new columns in the same mutate()
command. For this, we chain multiple operations in mutate()
separated with ,
.
To calculate the turnout percentage in a second step, we write the following:
exmpl_tbl %>%
mutate(
turnout = turnout_person + turnout_mail,
turnout_pct = turnout / eligible
)
## # A tibble: 4 × 6
## ybirth eligible turnout_person turnout_mail turnout turnout_pct
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1993 - 2002 100000 40000 35000 75000 0.75
## 2 1983 - 1992 100000 50000 30000 80000 0.8
## 3 1973 - 1982 100000 60000 25000 85000 0.85
## 4 1963 - 1972 100000 70000 20000 90000 0.9
Note that mutate()
is not limited to basic arithmetic operations. Many
functions can be applied within mutate()
: sum()
or mean()
are two examples
we already know.
11.1.2 Cleaning the data on police reports
Now knowing the basic principle of working with mutate()
, we can apply this to
the data on police reports we scraped in chapter 10.
First we have to load the data and examine what we are working with.
load("reports.RData")
reports
## # A tibble: 21,952 × 3
## Date Report District
## <chr> <chr> <chr>
## 1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mordkommissio… Ereigni…
## 2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch Schuss - Du… Ereigni…
## 3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit dem Nahost-Ko… Ereigni…
## 4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebstahl Ereigni…
## 5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhaus verstorben Ereigni…
## 6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen Ereigni…
## 7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wegerechten Ereigni…
## 8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft Ereigni…
## 9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen - Mordkomm… Ereigni…
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit den weltweite… Ereigni…
## # ℹ 21,942 more rows
We can leave the column containing the text of the police report as it is, but the columns containing the districts, as well as the date and time, could use some work.
The data in the column “District” includes the string “Ereignisort:” before
listing the actual name of the district the report refers to. This unnecessarily
clutters this column: we can use substr()
to remove these leading characters
within mutate()
. substr()
extracts a part of a string we want
to keep, taking the data to be applied to, as its first argument – here the
name of the column –, the starting character position where the extraction
shall begin as its second, and the character position where it shall stop, as
its third argument. Character position here refers to the numeric position of a
character in a string: to begin extraction after “Ereignisort:” we have to
count the length of this substring including the whitespace after “:”, which is
\(13\) and thus use \(14\) as the starting position. For the position to stop
extraction, we could either use an unrealistically high number or use
length(District)
to determine the exact length of each string.
reports %>%
mutate(District = substr(District, 14, 99))
## # A tibble: 21,952 × 3
## Date Report District
## <chr> <chr> <chr>
## 1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mordkommissio… Treptow…
## 2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch Schuss - Du… Charlot…
## 3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit dem Nahost-Ko… berlinw…
## 4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebstahl Friedri…
## 5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhaus verstorben Marzahn…
## 6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen Friedri…
## 7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wegerechten Mitte
## 8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft Charlot…
## 9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen - Mordkomm… Neukölln
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit den weltweite… berlinw…
## # ℹ 21,942 more rows
reports %>%
mutate(District = substr(District, 14, length(District)))
## # A tibble: 21,952 × 3
## Date Report District
## <chr> <chr> <chr>
## 1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mordkommissio… Treptow…
## 2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch Schuss - Du… Charlot…
## 3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit dem Nahost-Ko… berlinw…
## 4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebstahl Friedri…
## 5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhaus verstorben Marzahn…
## 6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen Friedri…
## 7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wegerechten Mitte
## 8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft Charlot…
## 9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen - Mordkomm… Neukölln
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit den weltweite… berlinw…
## # ℹ 21,942 more rows
The column “Date” includes the date and the time of each report as a character
string. To be able to use this data in analysis, we have to extract the date and
time in a format R can understand. This is easily achieved using the parsing
functions from lubridate. As the character data are written in the format
“day.month.year hour:minute”, we have to use the function dmy_hm()
on the
column.
reports %>%
mutate(date_cmpl = dmy_hm(Date))
## # A tibble: 21,952 × 4
## Date Report District date_cmpl
## <chr> <chr> <chr> <dttm>
## 1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeli… Ereigni… 2023-10-16 17:55:00
## 2 16.10.2023 17:49 Uhr Gefährliche Körperverletzu… Ereigni… 2023-10-16 17:49:00
## 3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenha… Ereigni… 2023-10-16 14:43:00
## 4 16.10.2023 11:26 Uhr Festnahme nach räuberische… Ereigni… 2023-10-16 11:26:00
## 5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in K… Ereigni… 2023-10-16 10:07:00
## 6 16.10.2023 09:53 Uhr Brandstiftung an mehreren … Ereigni… 2023-10-16 09:53:00
## 7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder-… Ereigni… 2023-10-15 21:07:00
## 8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft Ereigni… 2023-10-15 15:46:00
## 9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus… Ereigni… 2023-10-15 14:59:00
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenha… Ereigni… 2023-10-15 14:47:00
## # ℹ 21,942 more rows
At a later point we will do some analysis by year, month, weekday and time of day. lubridate includes functions that extract the subparts of date and time data.
year()
extracts the year, month()
the month and wday()
the day of the
week. The argument label = TRUE
tells the function to use the names of months
and days instead of a numerical value ranging from \(1\) to \(12\) and \(1\) to \(7\)
respectively.
reports %>%
mutate(date_cmpl = dmy_hm(Date),
year = year(date_cmpl),
month = month(date_cmpl, label = TRUE),
day = wday(date_cmpl, label = TRUE)
)
## # A tibble: 21,952 × 7
## Date Report District date_cmpl year month day
## <chr> <chr> <chr> <dttm> <dbl> <ord> <ord>
## 1 16.10.2023 17:55 Uhr Verdacht… Ereigni… 2023-10-16 17:55:00 2023 Okt Mo
## 2 16.10.2023 17:49 Uhr Gefährli… Ereigni… 2023-10-16 17:49:00 2023 Okt Mo
## 3 16.10.2023 14:43 Uhr Vorkommn… Ereigni… 2023-10-16 14:43:00 2023 Okt Mo
## 4 16.10.2023 11:26 Uhr Festnahm… Ereigni… 2023-10-16 11:26:00 2023 Okt Mo
## 5 16.10.2023 10:07 Uhr Verletzt… Ereigni… 2023-10-16 10:07:00 2023 Okt Mo
## 6 16.10.2023 09:53 Uhr Brandsti… Ereigni… 2023-10-16 09:53:00 2023 Okt Mo
## 7 15.10.2023 21:07 Uhr Verkehrs… Ereigni… 2023-10-15 21:07:00 2023 Okt So
## 8 15.10.2023 15:46 Uhr Raub im … Ereigni… 2023-10-15 15:46:00 2023 Okt So
## 9 15.10.2023 14:59 Uhr Einkaufs… Ereigni… 2023-10-15 14:59:00 2023 Okt So
## 10 15.10.2023 14:47 Uhr Vorkommn… Ereigni… 2023-10-15 14:47:00 2023 Okt So
## # ℹ 21,942 more rows
To extract the time of day, we first have to extract the substring of “Date”
that contains the time and then apply the lubridate function hm()
to it.
reports %>%
mutate(time =
substr(Date, 12, length(Date)) %>%
hm()
)
## # A tibble: 21,952 × 4
## Date Report District time
## <chr> <chr> <chr> <Period>
## 1 16.10.2023 17:55 Uhr Verdacht eines Tötungsdeliktes - Mo… Ereigni… 17H 55M 0S
## 2 16.10.2023 17:49 Uhr Gefährliche Körperverletzung durch … Ereigni… 17H 49M 0S
## 3 16.10.2023 14:43 Uhr Vorkommnisse im Zusammenhang mit de… Ereigni… 14H 43M 0S
## 4 16.10.2023 11:26 Uhr Festnahme nach räuberischem Diebsta… Ereigni… 11H 26M 0S
## 5 16.10.2023 10:07 Uhr Verletzte Radfahrerin in Krankenhau… Ereigni… 10H 7M 0S
## 6 16.10.2023 09:53 Uhr Brandstiftung an mehreren Fahrzeugen Ereigni… 9H 53M 0S
## 7 15.10.2023 21:07 Uhr Verkehrsunfall mit Sonder- und Wege… Ereigni… 21H 7M 0S
## 8 15.10.2023 15:46 Uhr Raub im Juweliergeschäft Ereigni… 15H 46M 0S
## 9 15.10.2023 14:59 Uhr Einkaufswagen aus Hochhaus geworfen… Ereigni… 14H 59M 0S
## 10 15.10.2023 14:47 Uhr Vorkommnisse im Zusammenhang mit de… Ereigni… 14H 47M 0S
## # ℹ 21,942 more rows
We can combine all these individual steps in one mutate()
command and apply
them to the object reports
.
reports <- reports %>%
mutate(District = substr(District, 14, length(District)),
date_cmpl = dmy_hm(Date),
year = year(date_cmpl),
month = month(date_cmpl, label = TRUE),
day = wday(date_cmpl, label = TRUE),
time =
substr(Date, 12, length(Date)) %>%
hm()
)
reports
## # A tibble: 21,952 × 8
## Date Report District date_cmpl year month day time
## <chr> <chr> <chr> <dttm> <dbl> <ord> <ord> <Period>
## 1 16.10.2023 … Verda… Treptow… 2023-10-16 17:55:00 2023 Okt Mo 17H 55M 0S
## 2 16.10.2023 … Gefäh… Charlot… 2023-10-16 17:49:00 2023 Okt Mo 17H 49M 0S
## 3 16.10.2023 … Vorko… berlinw… 2023-10-16 14:43:00 2023 Okt Mo 14H 43M 0S
## 4 16.10.2023 … Festn… Friedri… 2023-10-16 11:26:00 2023 Okt Mo 11H 26M 0S
## 5 16.10.2023 … Verle… Marzahn… 2023-10-16 10:07:00 2023 Okt Mo 10H 7M 0S
## 6 16.10.2023 … Brand… Friedri… 2023-10-16 09:53:00 2023 Okt Mo 9H 53M 0S
## 7 15.10.2023 … Verke… Mitte 2023-10-15 21:07:00 2023 Okt So 21H 7M 0S
## 8 15.10.2023 … Raub … Charlot… 2023-10-15 15:46:00 2023 Okt So 15H 46M 0S
## 9 15.10.2023 … Einka… Neukölln 2023-10-15 14:59:00 2023 Okt So 14H 59M 0S
## 10 15.10.2023 … Vorko… berlinw… 2023-10-15 14:47:00 2023 Okt So 14H 47M 0S
## # ℹ 21,942 more rows
11.2 select()
As we have already extracted all we need from the “Date” and “date_cmpl” columns,
and we will not work with the “Report” column in this chapter, to keep the
tibble neat and free of clutter, we remove the three columns by using the
function select()
. select()
is used for selecting columns from a tibbble. The
function takes the data as its first argument – here provided by the pipe –
and one or several names of columns that should be kept. If columns follow after
each other in order, we can also use a “from:to” notation.
reports %>%
select(District, year, month, day, time)
## # A tibble: 21,952 × 5
## District year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Treptow-Köpenick 2023 Okt Mo 17H 55M 0S
## 2 Charlottenburg-Wilmersdorf 2023 Okt Mo 17H 49M 0S
## 3 berlinweit 2023 Okt Mo 14H 43M 0S
## 4 Friedrichshain-Kreuzberg 2023 Okt Mo 11H 26M 0S
## 5 Marzahn-Hellersdorf 2023 Okt Mo 10H 7M 0S
## 6 Friedrichshain-Kreuzberg 2023 Okt Mo 9H 53M 0S
## 7 Mitte 2023 Okt So 21H 7M 0S
## 8 Charlottenburg-Wilmersdorf 2023 Okt So 15H 46M 0S
## 9 Neukölln 2023 Okt So 14H 59M 0S
## 10 berlinweit 2023 Okt So 14H 47M 0S
## # ℹ 21,942 more rows
reports %>%
select(District, year:time)
## # A tibble: 21,952 × 5
## District year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Treptow-Köpenick 2023 Okt Mo 17H 55M 0S
## 2 Charlottenburg-Wilmersdorf 2023 Okt Mo 17H 49M 0S
## 3 berlinweit 2023 Okt Mo 14H 43M 0S
## 4 Friedrichshain-Kreuzberg 2023 Okt Mo 11H 26M 0S
## 5 Marzahn-Hellersdorf 2023 Okt Mo 10H 7M 0S
## 6 Friedrichshain-Kreuzberg 2023 Okt Mo 9H 53M 0S
## 7 Mitte 2023 Okt So 21H 7M 0S
## 8 Charlottenburg-Wilmersdorf 2023 Okt So 15H 46M 0S
## 9 Neukölln 2023 Okt So 14H 59M 0S
## 10 berlinweit 2023 Okt So 14H 47M 0S
## # ℹ 21,942 more rows
Instead of telling select()
which columns we want to keep, we can also tell it
which ones not to keep by adding a -
before the column names. If we want to
exclude multiple columns in one step, we combine them with the c()
function
and write the -
directly before it. In this last example we will assign the
result to reports
.
reports <- reports %>%
select(-c(Date, Report, date_cmpl))
reports
## # A tibble: 21,952 × 5
## District year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Treptow-Köpenick 2023 Okt Mo 17H 55M 0S
## 2 Charlottenburg-Wilmersdorf 2023 Okt Mo 17H 49M 0S
## 3 berlinweit 2023 Okt Mo 14H 43M 0S
## 4 Friedrichshain-Kreuzberg 2023 Okt Mo 11H 26M 0S
## 5 Marzahn-Hellersdorf 2023 Okt Mo 10H 7M 0S
## 6 Friedrichshain-Kreuzberg 2023 Okt Mo 9H 53M 0S
## 7 Mitte 2023 Okt So 21H 7M 0S
## 8 Charlottenburg-Wilmersdorf 2023 Okt So 15H 46M 0S
## 9 Neukölln 2023 Okt So 14H 59M 0S
## 10 berlinweit 2023 Okt So 14H 47M 0S
## # ℹ 21,942 more rows
11.3 rename()
As you may have noticed, all newly created columns are written in lower case, while “District” begins with an upper case letter. You may want to rename this column to a lower case name – or all the others to upper case names, depending on your preference.
One approach to renaming is using the function rename()
from dplyr: the
function takes the data to be applied to as its first argument – again passed
on by the pipe here – followed by one or more arguments in the form
new_name = old_name
.
reports <- reports %>%
rename(district = District)
reports
## # A tibble: 21,952 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Treptow-Köpenick 2023 Okt Mo 17H 55M 0S
## 2 Charlottenburg-Wilmersdorf 2023 Okt Mo 17H 49M 0S
## 3 berlinweit 2023 Okt Mo 14H 43M 0S
## 4 Friedrichshain-Kreuzberg 2023 Okt Mo 11H 26M 0S
## 5 Marzahn-Hellersdorf 2023 Okt Mo 10H 7M 0S
## 6 Friedrichshain-Kreuzberg 2023 Okt Mo 9H 53M 0S
## 7 Mitte 2023 Okt So 21H 7M 0S
## 8 Charlottenburg-Wilmersdorf 2023 Okt So 15H 46M 0S
## 9 Neukölln 2023 Okt So 14H 59M 0S
## 10 berlinweit 2023 Okt So 14H 47M 0S
## # ℹ 21,942 more rows
11.4 filter()
Some of our analysis requires us to have complete years in our data. If we want to compare the police reports by year, it makes no sense to compare the numbers for an ongoing year with those that are complete, since the former will obviously have fewer reports. Incomplete years would also impact analysis by month, as some months will have more observations than others; therefore, we should filter our data for complete years.
To filter the observations, we can use the function filter()
. As always in
tidyverse functions, filter()
takes the data to be filtered as its first
argument and one or multiple expressions that specify the rules by which to
filter. To write these expressions, we can make use of the comparison operators
discussed in subchapter 1.3.2.
To filter for all years that are not 2022, we can write:
reports %>%
filter(year != 2022)
## # A tibble: 19,716 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Treptow-Köpenick 2023 Okt Mo 17H 55M 0S
## 2 Charlottenburg-Wilmersdorf 2023 Okt Mo 17H 49M 0S
## 3 berlinweit 2023 Okt Mo 14H 43M 0S
## 4 Friedrichshain-Kreuzberg 2023 Okt Mo 11H 26M 0S
## 5 Marzahn-Hellersdorf 2023 Okt Mo 10H 7M 0S
## 6 Friedrichshain-Kreuzberg 2023 Okt Mo 9H 53M 0S
## 7 Mitte 2023 Okt So 21H 7M 0S
## 8 Charlottenburg-Wilmersdorf 2023 Okt So 15H 46M 0S
## 9 Neukölln 2023 Okt So 14H 59M 0S
## 10 berlinweit 2023 Okt So 14H 47M 0S
## # ℹ 19,706 more rows
Thus, only the observations where “year” does not equal “2022” remain in the
tibble; more accurately, those observations for which the expression
year != 2022
is returned as TRUE
.
Closer inspection of the data reveals, that 2014 (the first year that is
available on the website) is not complete either. There also seems to be one
single report for 2013. We should also exclude these years for analysis. We can
chain multiple expressions when using filter()
separated by commas.
reports %>%
filter(year != 2022, year != 2014, year != 2013)
## # A tibble: 19,012 × 5
## district year month day time
## <chr> <dbl> <ord> <ord> <Period>
## 1 Treptow-Köpenick 2023 Okt Mo 17H 55M 0S
## 2 Charlottenburg-Wilmersdorf 2023 Okt Mo 17H 49M 0S
## 3 berlinweit 2023 Okt Mo 14H 43M 0S
## 4 Friedrichshain-Kreuzberg 2023 Okt Mo 11H 26M 0S
## 5 Marzahn-Hellersdorf 2023 Okt Mo 10H 7M 0S
## 6 Friedrichshain-Kreuzberg 2023 Okt Mo 9H 53M 0S
## 7 Mitte 2023 Okt So 21H 7M 0S
## 8 Charlottenburg-Wilmersdorf 2023 Okt So 15H 46M 0S
## 9 Neukölln 2023 Okt So 14H 59M 0S
## 10 berlinweit 2023 Okt So 14H 47M 0S
## # ℹ 19,002 more rows
Often there are multiple ways to formulate the filtering expressions. Here, we
could tell R which values for “year” we want to keep, instead of which we do not
want to keep. Instead of listing all those individual years, we can use %in%
to define a range of numerical values which should be kept when filtering. We
assign the result to a new object for later use.
While the ,
behaves like the logical operator &
, we can also use |
for
“or” when combining expressions to be filtered upon.
11.5 summarise()
& group_by()
We can now begin computing some basic summary statistics for our data. You should note though, that we do not know how the data we scraped is actually created. Do the dates and times on the website refer to the instance in time when a crime occurred – there are indications that this is not the case –, when a report was filed, or even when the PR team of the Berlin police uploaded the report to the website – which might even be the most plausible scenario? Also, are all crimes reported on this website – the daily numbers are too low for this to be the case – or is there an internal selection process? And if so, on what criteria does the selection occur? If this was a real research project, we would absolutely need to gain clarity on these and many other questions, before we even begin with collecting the data. We do not have clarity here, but we have to keep in mind that we may not analyse the statistical distribution of crimes in Berlin, but rather the working practice of the PR team.
11.5.1 Number of reports by groups
The dplyr function summarise()
can be used to calculate summary statistics for
a whole tibble; the syntax being similar to mutate()
. The result is a new
tibble, containing only the summary statistics we requested. The function we use
for the summary statistic here is n()
, which returns the number of
observations. As each observation represents one police report, this equals the
overall number of reports.
This worked, but is not very interesting, as the result is the same as the length
of the reports tibble. We are often more interested in summaries grouped by the
value of one or more other variables. We might ask ourselves, if there are
differences in the number of released police reports by year; for this purpose,
we can group the data by the values of the “year” column using group_by()
and
then compute the summary statistic separately for each group. Here, n()
returns the number of observations for each group, i.e. the number of reports
per year. As always with dplyr functions, group_by()
needs the data which
will be grouped as the first argument, followed by one or multiple columns to
group by.
reports_fyears %>%
group_by(year) %>%
summarise(reports = n())
## # A tibble: 8 × 2
## year reports
## <dbl> <int>
## 1 2015 2369
## 2 2016 2626
## 3 2017 2233
## 4 2018 2342
## 5 2019 2619
## 6 2020 2583
## 7 2021 2500
## 8 2022 2236
While there are differences between the years, there does not seem to be a systematic pattern to it. We can do the same kind of analysis grouped by months and weekdays.
reports_fyears %>%
group_by(month) %>%
summarise(reports = n())
## # A tibble: 12 × 2
## month reports
## <ord> <int>
## 1 Jan 1735
## 2 Feb 1530
## 3 Mär 1702
## 4 Apr 1657
## 5 Mai 1717
## 6 Jun 1579
## 7 Jul 1645
## 8 Aug 1648
## 9 Sep 1593
## 10 Okt 1627
## 11 Nov 1550
## 12 Dez 1525
reports_fyears %>%
group_by(day) %>%
summarise(reports = n())
## # A tibble: 7 × 2
## day reports
## <ord> <int>
## 1 So 2688
## 2 Mo 2663
## 3 Di 2790
## 4 Mi 2909
## 5 Do 2916
## 6 Fr 2903
## 7 Sa 2639
The analysis by month again shows no systematic variation. Looking at the reports by day of the week on the other hand, shows a slight increase in reports over the week, culminating on Friday. As stated above, we do not really know how the data is created, so the main point to take away from this analysis might be, that the PR team of the Berlin police also works on the weekend.
In the same manner, we can analyse the number of released police reports by district.
reports_fyears %>%
group_by(district) %>%
summarise(reports = n())
## # A tibble: 22 × 2
## district reports
## <chr> <int>
## 1 Charlottenburg - Wilmersdorf 513
## 2 Charlottenburg-Wilmersdorf 1279
## 3 Friedrichshain - Kreuzberg 577
## 4 Friedrichshain-Kreuzberg 1528
## 5 Lichtenberg 1075
## 6 Marzahn - Hellersdorf 225
## 7 Marzahn-Hellersdorf 758
## 8 Mitte 2936
## 9 Neukölln 1734
## 10 Pankow 1438
## # ℹ 12 more rows
Before we interpret the result, let us examine the district names; districts
with a dash are written in two ways: with whitespace around the dash and without
one. There may have been a change in the way these names are recorded in the data
over the years and we have to deal with it. We can use the function
str_remove_all()
from stringr to remove every occurrence of a pattern we
specify as its argument. If we remove the pattern ” ” we basically delete all
whitespace. This can again be combined with mutate()
to transform the data in
our tibble.
reports_fyears <- reports_fyears %>%
mutate(district = str_remove_all(district, pattern = " "))
reports_fyears %>%
group_by(district) %>%
summarise(reports = n()) %>%
arrange(desc(reports))
## # A tibble: 16 × 2
## district reports
## <chr> <int>
## 1 Mitte 2936
## 2 Friedrichshain-Kreuzberg 2105
## 3 Charlottenburg-Wilmersdorf 1792
## 4 Neukölln 1734
## 5 Tempelhof-Schöneberg 1542
## 6 Pankow 1438
## 7 Treptow-Köpenick 1197
## 8 Lichtenberg 1075
## 9 Reinickendorf 1038
## 10 Spandau 1024
## 11 Marzahn-Hellersdorf 983
## 12 Steglitz-Zehlendorf 946
## 13 bezirksübergreifend 921
## 14 <NA> 382
## 15 berlinweit 340
## 16 bundesweit 55
For the grouped summary we also included arrange(desc())
which orders the values
from highest to lowest. If you omit desc()
the results are ordered
ascending.
We might also be interested in the relative share of reports by district. To
compute those, one option is to add another column to the tibble resulting from
summarise()
, in which we calculate the relative share by dividing the
“reports” columns values by the total of this column.
reports_fyears %>%
group_by(district) %>%
summarise(reports = n()) %>%
arrange(desc(reports)) %>%
mutate(reports_rel = reports / sum(reports))
## # A tibble: 16 × 3
## district reports reports_rel
## <chr> <int> <dbl>
## 1 Mitte 2936 0.151
## 2 Friedrichshain-Kreuzberg 2105 0.108
## 3 Charlottenburg-Wilmersdorf 1792 0.0919
## 4 Neukölln 1734 0.0889
## 5 Tempelhof-Schöneberg 1542 0.0790
## 6 Pankow 1438 0.0737
## 7 Treptow-Köpenick 1197 0.0614
## 8 Lichtenberg 1075 0.0551
## 9 Reinickendorf 1038 0.0532
## 10 Spandau 1024 0.0525
## 11 Marzahn-Hellersdorf 983 0.0504
## 12 Steglitz-Zehlendorf 946 0.0485
## 13 bezirksübergreifend 921 0.0472
## 14 <NA> 382 0.0196
## 15 berlinweit 340 0.0174
## 16 bundesweit 55 0.00282
The results show clear differences in the number of reports by district, with “Mitte” leading by a substantial margin. Other districts with high counts seem to be those that at least in part lie within the inner ring. The outer districts all show lower numbers. Again, we do not know how the data is created, so maybe it is just the case that the crimes in the inner ring are more interesting than in other districts and that the data does not necessarily point to a higher number of crimes in central Berlin. To get a more complete picture, we would also have to relate the number of reports to the citizens or even tourists per district. For this we would need additional external data though.
Using group_by()
we can also group by multiple columns. “Mitte” and
“Friedrichshain-Kreuzberg” showed the highest numbers among all districts. Let
us analyse if these numbers changed over the years for these two districts.
First we have to use filter()
to use only the observations referring to those
districts and then group the data by the remaining districts and year before we
count the the number of released reports.
reports_fyears %>%
filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>%
group_by(district, year) %>%
summarise(reports = n())
## `summarise()` has grouped output by 'district'.
## You can override using the `.groups` argument.
## # A tibble: 16 × 3
## # Groups: district [2]
## district year reports
## <chr> <dbl> <int>
## 1 Friedrichshain-Kreuzberg 2015 277
## 2 Friedrichshain-Kreuzberg 2016 269
## 3 Friedrichshain-Kreuzberg 2017 234
## 4 Friedrichshain-Kreuzberg 2018 237
## 5 Friedrichshain-Kreuzberg 2019 285
## 6 Friedrichshain-Kreuzberg 2020 301
## 7 Friedrichshain-Kreuzberg 2021 312
## 8 Friedrichshain-Kreuzberg 2022 190
## 9 Mitte 2015 367
## 10 Mitte 2016 448
## 11 Mitte 2017 347
## 12 Mitte 2018 363
## 13 Mitte 2019 391
## 14 Mitte 2020 374
## 15 Mitte 2021 328
## 16 Mitte 2022 318
There does not seem to be a clear pattern for the number of reports over time in both districts. For “Mitte” there is a considerable spike in 2016. To analyse this further, we had to go deeper into the data and look at the actual texts of the police reports.
But we should briefly talk about a peculiarity to the way group_by()
works
with summarise()
that can cause headaches, if we are not aware of it. In
general, any summarise()
function following a group_by()
will calculate the
summary statistic and then remove one level of grouping. In the examples where
we only had one level of grouping, this essentially meant, that the data was
ungrouped after summarise()
. In the last example we had two levels of
grouping. So, after the computation of the number of reports by “district” and
“year” the grouping by “year” was removed, but the grouping by “district”
remained in effect. We can see this in the output, where R informs us about the
column by which the data are grouped and the number of groups in the output:
# Groups: district [2]
. Another summarise()
function would compute the
statistic by “district” and then remove this level as well. We can also use
ungroup()
to remove all grouping from a tibble. In the case of this example,
it does not make a practical difference as we only compute the summary and then
move on. But, if we assign the results of a summary to an object for later use,
we have to decide if we want to remove the grouping or keep it in effect,
depending on the goals of the analysis. In general, I would always ungroup the
data and group it again if the need arises, as this is less error prone and just
a minor increase in the amount of typing needed.
11.5.2 Summary statistics on the time
The column “time” holds the time of day for each report in hours, minutes and seconds as a representation of the actual values, which are seconds from midnight.
reports_fyears[[1, 'time']]
## [1] "13H 31M 0S"
reports_fyears[[1, 'time']] %>%
as.numeric()
## [1] 48660
To calculate with the “time” column, we first have to tell R that we explicitly
want to use the data as seconds, then compute the summary statistic and after
this, transform the display of seconds back into the time of day. For the
transformations we can use the functions period_to_seconds()
and
seconds_to_period()
. We will calculate the mean and the median for the time of
day over all police reports in one call of summarise()
.
reports %>%
summarise(mean = period_to_seconds(time) %>%
mean() %>%
seconds_to_period(),
median = period_to_seconds(time) %>%
median() %>%
seconds_to_period()
)
## # A tibble: 1 × 2
## mean median
## <Period> <Period>
## 1 11H 59M 37.1911443148711S 11H 13M 0S
The mean time of day over all police reports is about 11:59 while the median is 11:13. This may indicate that the mean is biased towards a later time by a high number of very late reports. We will explore this further graphically in the next chapter. Again, the results may represent the time when crimes occur, but it is more likely, that we are actually analysing the time when the PR team posts their police reports.
11.6 Exporting tables
Exporting the results of our analysis – e.g. summary statistics computed with
summarise()
– sadly is not as straightforward as one might think or hope. We
can always copy and paste values from the output into Word or other pieces of
software we might use for writing a report. But this is cumbersome and error
prone.
One way of directly exporting tables we already know about is writing them to a .csv file. For more on this please review section 8.1. These can be imported into Excel and also directly into Word. This may be a quick way of exporting the values of a table and importing them into Office, but we have to do all the formatting in the external software and may have to repeat the formatting procedure every time the values change. So we should make sure that all errors have been fixed and the analysis is final before exporting to .csv.
In this example we save the number of police reports by year for the districts “Mitte” and “Friedrichshain-Kreuzberg” we computed earlier into a .csv file:
reports_fyears %>%
filter(district == "Mitte" | district == "Friedrichshain-Kreuzberg") %>%
group_by(district, year) %>%
summarise(reports = n()) %>%
write_csv("reports_year_M-FK.csv")
## `summarise()` has grouped output by 'district'.
## You can override using the `.groups` argument.
For reasons unknown to anyone but Microsoft, importing a .csv file that uses
commas as the delimiter can actually be quite tricky on a German Windows
installation if you are using Microsoft Office. Using Libre Office fixes this,
but if you want to use Microsoft Office on a German system you might have to use
semicolons as the delimiter by saving the file with write_csv2()
.
11.6.1 Further resources
A multitude of packages that provide functions for formatting and writing tables to different formats exist in the R world. To go into details is beyond the scope of this introduction, but I want to give you some pointers at least.
If you want to skip importing a .csv into Microsoft Office, you may be interested in packages that allow writing directly to .xlsx files. Two options are:
- writexl: https://cran.r-project.org/web/packages/writexl/index.html
- openxlsx: https://cran.r-project.org/web/packages/openxlsx/index.html
A way to directly produce tables in the .docx format is provided by the flextable package: https://davidgohel.github.io/flextable/
If your are working with LaTeX, the huxtable package can directly output .tex files, as well as HTML, Word, Excel and Powerpoint: https://hughjonesd.github.io/huxtable/
If you want to circumvent using external software for producing reports completely, R Markdown may be of interest to you. This allows you to write your report and your R code in one document and write it to .pdf or a range of other formats. The produced file includes all the output from your code that you want to report, i.e. code, plots and tables. Some resources on R Markdown can be found here:
- Introduction on the RStudio website: https://rmarkdown.rstudio.com/lesson-1.html{target_“blank”}
- The R Markdown cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf
- Chapter on R Markdown in “R for Data Science” by Hadley Wickham and Garrett Grolemund: https://r4ds.had.co.nz/r-markdown.html
- “R Markdown Cookbook” by Yihui Xie, Christophe Dervieux, Emily Riederer: https://bookdown.org/yihui/rmarkdown-cookbook/
The website your are viewing right now, as well as many of the linked resources, are built with the bookdown package, which extends the functionality of R Markdown and is suited for more large scale projects.
- “bookdown: Authoring Books and Technical Documents with R Markdown” by Yihui Xie: https://bookdown.org/yihui/bookdown/