10 Scraping Berlin police reports

Over the next three chapters we will follow a small sample project. In this chapter we will briefly outline the research topic and scrape the data. In the next two chapters we will concern ourselfes with cleaning, transforming and analysing the data. First statistically, then grapically.

10.1 Topic and data

This sample project aims to analyse police reports in Berlin. Specifically, we will try to answer two questions:

Does the number of reports differ by district?
Does the number of reports differ over time?
- over years
- over months
- over days of the week
- over time of day

Note that these are ad hoc questions constructed for this sample project. In a real research project, we would have to motivate the research question more clearly and develop hyotheses based on theoretical considerations and existing research. These steps are skipped here to keep the focus on scraping the data and basic methods of data analysis.

Now that topic and questions are defined, we need some data to answer them. The website https://www.berlin.de/polizei/polizeimeldungen/archiv/ contains reports by the Berlin police that are open to the public, beginning with the year 2014. We will gather the links to all subpages, download them and extract the text of the report, as well as the date, time and district where it occurred.

10.2 Scraping the data

10.2.1 Gathering the links

library(tidyverse)
library(rvest)

We begin by downloading the mainpage.

website <- "https://www.berlin.de/polizei/polizeimeldungen/archiv/" %>% 
  read_html()

The next step is to extract the URLs for the yearly archives. All the <a> tags that contain those links, have a title= attribute whose value begins with “Link”. Also they are contained in a <div> tag with the class textile. We can use this in selector construction, extract the value of the href= attribute and, as they are incomplete URLs, append them to the base URL.

links <- website %>% 
  html_elements(css = "div.textile a[title^='Link']") %>% 
  html_attr(name = "href") %>% 
  str_c("https://www.berlin.de", .)

Before we continue with gathering the links to all the subpages, we have some additional cleaning up to do. Clicking through the yearly archive pages reveals that the presented data is incomplete for years before 2020. The archives for all earlier years only contain a few entries, some of them even being obvious errors. This also serves as an example for the realities of web scraping. A website can change all the time and these changes can lead to a scraping script that worked a year, a month or even a day before breaking down. This is the case here. The last time I scraped the website (2023-10-16), the data for years before 2020 was present in the archives. Now it is largely gone. This is a further argument for saving the results of your scraping projects because you may not be able to reproduce the data gathering process at a later date.

We do not know why the data is gone. Maybe it is a technical error on their side or they are deliberately removing older posts. In any case, the data before 2019 is incomplete and does not really lend itself to any analysis. So we do not have any reason to scrape it. We should clean up our vector of links before moving on, by selecting the first 6 elements from it, the years that seem to be complete in the archive.

links <- links[1:6]

links
## [1] "https://www.berlin.de/polizei/polizeimeldungen/archiv/2025/"
## [2] "https://www.berlin.de/polizei/polizeimeldungen/archiv/2024/"
## [3] "https://www.berlin.de/polizei/polizeimeldungen/archiv/2023/"
## [4] "https://www.berlin.de/polizei/polizeimeldungen/archiv/2022/"
## [5] "https://www.berlin.de/polizei/polizeimeldungen/archiv/2021/"
## [6] "https://www.berlin.de/polizei/polizeimeldungen/archiv/2020/"

We can now move on and extract the links to all of the subpages. Each yearly archive page contains several subpages for the reports in that year, divided by pagination; to gather the links to all the subpages, we have to understand how the pagination works. In this case, the URL for the yearly archive is simply appended with the query “?page_at_1_0=” and a value indicating the number of the subpage we want to access. So “https://www.berlin.de/polizei/polizeimeldungen/archiv/2022/?page_at_1_0=5” for example gives access to the fifth page for 2022.

The number of police reports is not constant over the years; the number of subpages per year will also differ. We do not know in advance, how many subpages we will have to download, but we can scrape this information from the last pagination link for each year. We are lucky, as the <a> tag for the last subpage per year is a list item that can be identified by its two classes pager-item and last. We parse the first page for each year, select the <a> tag for the last subpage, extract the text from it and save it as a number. Note that I also added a waiting time of two seconds between each download, as shown in chapter 9.

max_pages <- links %>% 
  map(~ {
    Sys.sleep(2)
    read_html(.x)
  }) %>%
  map(html_element, css = "li.pager-item.last > a") %>% 
  map(html_text) %>% 
  as.numeric()

Now we can construct the links to all subpages; done by nesting two for loops in the script. For this to be understandable, we should have a brief look on for loops.

Like map(), for loops are used for iteration. We can use them to repeat one or multiple commands for a fixed number of times. The following non-runable code block, shows the basic for loop syntax.

for (variable in vector) {
  Code
}

We will replace variable with a name for our counter. Per convention lower case letters are used, typically i. In vector we will define the values the counter will take. If we define the vector 1:5, the counter will take on the value 1 in the first loop, 2 in the second and so on. After the loop runs with the value 5, it ends. The values in vector do not have to be ascending numbers though. We could also use a numeric vector like c(4, 8, 15, 16, 23, 42), a character vector like c("a", "s", "d", "f), or any other vector.

Between the {} brackets, we write the commands that shall be executed in each iteration. The important thing is, that we can use the value the counter takes in each iteration. For example, the following code prints the value of i for each iteration into the console:

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

For loops can also be nested. Then, the inner loop iterates completely for each iteration of the outer loop. In the following example, the inner loop iterates three times for each of the two iterations of the outer loop. So we get six iterations in total. Note, that inner and outer loop have to have different variable names for this to work. The paste() function in print() is used to combine the values of i and j seperated by a - and print this combination into the console, so we can see exactly in which order the iterations occur.

for (i in 1:2) {
  for (j in 1:3) {
    print(paste(i, j, sep = "-"))
  }
}
## [1] "1-1"
## [1] "1-2"
## [1] "1-3"
## [1] "2-1"
## [1] "2-2"
## [1] "2-3"

With this knowledge, we can construct a nested for loop that builds a vector of the links to all subpages for all years.

pag_links <- character()

for (i in 1:length(links)) {
  for (j in 1:max_pages[i]) {
    pag_links <- append(pag_links, values = str_c(links[i], "?page_at_1_0=", j))
  }
}

The outer loop is a counter for the elements in the objects links; in this case the loop counts from 1 to 9, one step for each yearly archive page. For each of these iterations, the inner loop will count from 1 to the number of the last subpage for this year, which we assigned to the vector max_pages. Using subsetting with max_pages[i], we select the correct number of subpages for each year, i.e. each iteration of the outer loop. Using both counters together, we can construct the links to the subpages using str_c(). The function takes the link to a yearly archive indicated by the counter in the outer loop i, appends “?page_at_1_0=” and after this the value of the inner loop counter j, i.e. the subpages (1, 2, …, j) for this year. The resulting link is then added to the vector pag_links using the function append() which adds the data indicated by the argument values to the object specified in the first argument. For this to work, we first have to initialise the object as an empty object outside of the loop. Here I created it as an empty character vector, using pag_links <- character(). If you want to create an empty object without any data type assigned, you can also use empty_vector <- NULL.

10.2.2 Downloading the subpages

Now that the links to all subpages are gathered, we can finally download them. Please note that the download will take upwards of 10 minutes due to the number of subpages (251 at the time of writing), and the additional waiting time of 2 seconds between each iteration.

pages <- pag_links %>% 
  map(~ {
    Sys.sleep(2)
    read_html(.x)
  })

10.2.3 Extracting the data of interest

The goal is to extract the text of the report, the date/time and the district the report refers to. Looking into the source code, we find that all reports are list items in an unordered list. Conveniently for us, all data fields we are interested in have distinct classes we can use in scraping. Date and time are enclosed by a <div> tag with the classes cell, nowrap and date. The report headline is also part of a <div>. Here the classes are cell and text. The same <div> also includes the district’s name, but we can discern between the two. The text is included in a <a> tag, that is a child of the <div> and the district is part of a <span> with the class category, which is also a child of the <div>. We can use this information to construct appropriate CSS selectors like this:

reports <- tibble(
  Date = pages %>% 
    map(html_elements, css = "div.cell.nowrap.date") %>% 
    map(html_text) %>% 
    unlist(),
  Report = pages %>% 
    map(html_elements, css = "div.cell.text > a") %>% 
    map(html_text) %>% 
    unlist(),
  District = pages %>% 
    map(html_elements, css = "div.cell.text > span.category") %>% 
    map(html_text) %>% 
    unlist()
)
## Error in `tibble()`:
## ! Tibble columns must have compatible sizes.
## • Size 12412: Existing data.
## • Size 12331: Column `District`.
## ℹ Only values of size one are recycled.

That did not work. But why? Let us look at the error message we received. It informs us that the column “District” we tried to create, is shorter in length compared to the other two. Since we can not create tibbles out of columns with different lengths, we get this error.

The fact that the “District” column is shorter than the other two, must mean that there are police reports where the information on the district is not listed on the website, which we can confirm by browsing some of the subpages, e.g. https://www.berlin.de/polizei/polizeimeldungen/archiv/2020/.

10.2.3.1 Dealing with missing data in lists

Some of the <span> tags that contain the district are missing. Using the approach presented above, html_elements() just extracts the <span> tags that are present. We tell R that we do want all <span> tags of the class “category”, and this is what R returns to us. For list items where the tag is missing, nothing is returned. But this is not what we want; what we actually want is, that R looks at every single police report and saves its text, date and time, as well as the district, if it is not missing. If it is missing, we want R to save a NA in the cell of the tibble, the representation of missing values in R.

The <div> and <span> tags that contain the data of interest are nested in <li> tags in this case. The <li> tags thus contain the whole set of data, the text, the date/time and the district. The approach here is to make R examine every single <li> tag and extract the data that is present, as well as save an NA for every piece of data that is missing.

To start, we have to extract all the <li> tags and their content from the subpages. Right now, the subpages are saved in the object pages. We use a for loop that takes every element of pages, extracts all the <li> tags from it and adds them to a new list using append(). Note, that we have to use pages[[]] to subset the list of subpages, as we want to access the actual list elements, i.e. the node sets for the sub pages. As with tibbles, pages[] would always return another list. The <li> tags are all children of an <ul> with the class list--tablelist, which we can use in our selector. append() takes an existing object as its first argument and adds the data indicated in the values = argument to it. For this to work, we have to initiate the new list as an empty object before the loop starts.

list_items <- NULL

for (i in 1:length(pages)) {
  list_items <- append(list_items, values = html_elements(pages[[i]], css = "ul.list--tablelist > li"))
}

The newly created list list_items contains a node set for each <li> tag from all subpages. Again, we have to use double brackets to access the node set. With single brackets a new list containing the node set as its first element is returned, as illustrated here:

list_items[[1]]
## {html_node}
## <li>
## [1] <div class="cell nowrap date">06.03.2025 15:01 Uhr</div>\n
## [2] <div class="cell text">\n<a href="/polizei/polizeimeldungen/2025/pressemi ...

list_items[1]
## [[1]]
## {html_node}
## <li>
## [1] <div class="cell nowrap date">06.03.2025 15:01 Uhr</div>\n
## [2] <div class="cell text">\n<a href="/polizei/polizeimeldungen/2025/pressemi ...

We can now make R examine every element of this list one after the other and extract all the data they contain. But what happens when we try to extract a element that is not present in one of the elements? R returns an NA:

html_element(list_items[[1]], css = "span.notthere")
## {xml_missing}
## <NA>

Using a for loop that iterates over the elements in list_items, we write a tibble row by row, filling the data cells with the extracted information and with NA if a element could not be found for this element of list_items. We have to initiate the tibble before the for loop starts: we define the column names, the type of data to be saved and also the length of the columns. The latter is not strictly necessary as we could also have created a tibble with a column length of $0$, but pre-defining their length increases computational efficiency. Still, the for loop has to iterate over several thousand elements and extract the data contained, which will take several minutes to complete.

reports <- tibble(
  Date = character()[1:length(list_items)],
  Report = character()[1:length(list_items)],
  District = character()[1:length(list_items)]
)

for (i in 1:length(list_items)) {
  reports[i, "Date"] <- html_element(list_items[[i]], css = "div.cell.nowrap.date") %>% 
    html_text()
  reports[i, "Report"] <- html_element(list_items[[i]], css = "div.cell.text > a") %>% 
    html_text()
  reports[i, "District"] <- html_element(list_items[[i]], css = "div.cell.text > span.category") %>% 
    html_text()
}

Let’s look at the tibble we just constructed:

reports
## # A tibble: 12,412 × 3
##    Date                 Report                                          District
##    <chr>                <chr>                                           <chr>   
##  1 06.03.2025 15:01 Uhr Heranwachsender mit mehreren Stichverletzungen  Ereigni…
##  2 06.03.2025 14:57 Uhr Kokstaxi-Fahrer festgenommen                    Ereigni…
##  3 06.03.2025 14:49 Uhr Mehrere Brände in einer Nacht                   Ereigni…
##  4 06.03.2025 11:40 Uhr Tourist bei Verkehrsunfall verletzt             Ereigni…
##  5 06.03.2025 10:49 Uhr Hausfriedensbruch begangen und Ladendetektiv b… Ereigni…
##  6 06.03.2025 10:46 Uhr Verkehrsunfall auf Kreuzung – Fußgänger im Kra… Ereigni…
##  7 06.03.2025 10:43 Uhr Auseinandersetzung im Park                      Ereigni…
##  8 06.03.2025 09:52 Uhr Mitarbeiter von Ladendieb verletzt              Ereigni…
##  9 06.03.2025 09:21 Uhr Motorrollerfahrerin bei Verkehrsunfall verletzt Ereigni…
## 10 05.03.2025 16:53 Uhr Mann zeigt Hitlergruß und äußert sich volksver… Ereigni…
## # ℹ 12,402 more rows

This looks good, but we should also confirm, that NAs were handled correctly. We can examine the entry for “30.12.2020 11:27 Uhr” that we saw on https://www.berlin.de/polizei/polizeimeldungen/archiv/2020/, and for which the district was missing. We can use subsetting to just look at this one observation in our tibble. Remembet that when subsetting two dimensional objects like tibbles, we have to supply an index for the row(s) as well as for the column(s) we want to subset. Our goal is, to subset the row for which the column “Date” holds the value “30.12.2020 11:27 Uhr”. We can thus write our row index as reports$Date == "30.12.2020 11:27 Uhr", which reads as: the row(s) for which the value of the column “Date” in the object “reports” is equal to “30.12.2020 11:27 Uhr”. As we want to see all columns for this observation, we do not need to supply a column index. By writing nothing after the , we instruct R to show us all columns.

reports[reports$Date == "30.12.2020 11:27 Uhr", ]
## # A tibble: 1 × 3
##   Date                 Report                     District
##   <chr>                <chr>                      <chr>   
## 1 30.12.2020 11:27 Uhr Fahrzeuge in Brand gesetzt <NA>

This also looks good. We now have extracted the data we need to answer our questions.

10.2.4 Saving the data

As discussed in chapter 8, we save the scraped data at this point. You have seen that we downloaded a lot of subpages, which took a considerable amount of time; if we repeat this for every instance of further data analysis, we create a lot of unnecessary traffic and waste a lot of our own time.

save(reports, file = "reports.RData")

In the next chapter we will continue with cleaning the data, transforming it and calculating some descriptive statistics on it.