Objective

Identify and examine crime trends from Vancouver’s data. Identify how has crime changed over the years, if certain neighbourhoods more prone to crimes, and what type of crime happens the most. In addition, this data will also give me some practice with these packages: dplyr, ggplot2, lubridate, and data.table.

Summary

Given this data: Theft of Vehicle is the most common crime to happen in Vancouver. In general 17:00 seems to be the peak time for a crime to be committed, although certain crimes favour different hours. The Central Business District seems to have the most crime committed over the years. It had been going down over the years until the 2011 riot, since then crime has steadily increased in Central Business District and surrounding neighbourhoods.

Preparations

Packages used:

library(data.table)
library(ggplot2)
library(dplyr)
library(lubridate)

Analysis

Data

Source: The data comes from the Vancouver Open Data Catalogue (all years). It was extracted on 2019-06-11 and contains 603,986 records from 2003-01-01 to 2019-06-7. By David Rucinski

crime_all_years <- fread("crime_csv_all_years.csv")

# Check start and end dates
# crime_all_years %>%
#   arrange(desc(YEAR),desc(MONTH),desc(DAY)) %>%
#   head()
# 
# crime_all_years %>%
#   arrange(desc(YEAR),desc(MONTH),desc(DAY)) %>%
#   tail()
Let’s take a peek at the data
head(crime_all_years, 20)
TYPE YEAR MONTH DAY HOUR MINUTE HUNDRED_BLOCK NEIGHBOURHOOD X Y
Break and Enter Commercial 2003 6 21 10 0 3XX TERMINAL AVE Strathcona 493000.2 5457728
Break and Enter Residential/Other 2003 5 22 11 0 13XX BUTE ST West End 490144.6 5458627
Theft of Vehicle 2003 6 29 22 45 66XX MAIN ST Sunset 492603.6 5452436
Offence Against a Person 2003 8 26 NA NA OFFSET TO PROTECT PRIVACY 0.0 0
Break and Enter Residential/Other 2003 6 8 5 0 38XX W 11TH AVE West Point Grey 486285.1 5456660
Break and Enter Residential/Other 2003 12 6 0 0 13XX BUTE ST West End 490144.6 5458627
Theft from Vehicle 2003 2 6 17 0 40XX MACDONALD ST Arbutus Ridge 487791.5 5455365
Break and Enter Commercial 2003 6 7 5 5 E 38TH AVE / MAIN ST Riley Park 492599.2 5453734
Theft from Vehicle 2003 7 3 13 35 BUTE ST / ROBSON ST West End 490767.2 5459255
Theft of Vehicle 2003 10 2 20 0 66XX KERR ST Killarney 497013.7 5452379
Theft of Vehicle 2003 5 30 19 0 25XX BALACLAVA ST Kitsilano 487345.6 5456833
Offence Against a Person 2003 9 23 NA NA OFFSET TO PROTECT PRIVACY 0.0 0
Theft of Vehicle 2003 4 7 18 30 66XX FRASER ST Sunset 493392.6 5452413
Break and Enter Commercial 2003 6 28 5 40 48XX VICTORIA DR Kensington-Cedar Cottage 495241.9 5454303
Theft from Vehicle 2003 3 7 15 0 14XX W 12TH AVE Fairview 490073.8 5456457
Break and Enter Commercial 2003 6 29 0 5 48XX VICTORIA DR Kensington-Cedar Cottage 495241.9 5454303
Theft from Vehicle 2003 3 26 14 0 2XX GRANVILLE ST Central Business District 491833.6 5459287
Theft of Vehicle 2003 3 16 14 20 66XX FRASER ST Sunset 493392.6 5452413
Break and Enter Commercial 2003 9 26 2 30 10XX ALBERNI ST West End 491067.7 5459114
Break and Enter Commercial 2003 6 2 0 1 48XX VICTORIA DR Kensington-Cedar Cottage 495246.7 5454158

Preprocessing

To do:

-Combine date into single record : filter data

-Change TYPE, NEIGHBOURHOOD as factor : clean data

-Check hundred_block == “OFFSET TO PROTECT PRIVACY” : explore data

-Drop hundred_block : filter data

-Make plots : visual data

Breakdown of crime

# Dplyr way
 count_of_crimes <- crime_all_years %>%
   count(TYPE) %>%
   rename(total_of_crime = n)


# Data table way
crime_all_years[, .N, by = .(TYPE)]
TYPE N
Break and Enter Commercial 37745
Break and Enter Residential/Other 65236
Theft of Vehicle 40802
Offence Against a Person 60272
Theft from Vehicle 201531
Mischief 81284
Vehicle Collision or Pedestrian Struck (with Injury) 24650
Theft of Bicycle 29687
Other Theft 62247
Vehicle Collision or Pedestrian Struck (with Fatality) 285
Homicide 247
OFFSET TO PROTECT PRIVACY

Neighbourhood & Time removed

# Through the glimpse() I saw OFFSET TO PROTECT PRIVACY, just wondering what this is and why.
offset <- crime_all_years %>%
  filter(HUNDRED_BLOCK == "OFFSET TO PROTECT PRIVACY") %>%
  group_by(TYPE)

# Offset to protect privacy:
# What kind of a offensives would that need that?
offset <- as.data.table(offset)
offset[, .N, by = .(TYPE)]
TYPE N
Offence Against a Person 60272
Homicide 247

Information removed to protect privacy on some crimes.

Homicides make sense why you would want privacy, also for any unusual charges that would be easy to identify the persons involved through cross referencing news articles.

Clean date-time

crime_all_years$DATE <- with(crime_all_years, ymd(sprintf('%04d%02d%02d', YEAR, MONTH, DAY)))

# Gather up all the time data for crimes and we can look at a time-series analysis, see if there
# is any increase/decrease in crime of this data set. Also check to see time of day the crimes
# are usually committed.


# crime_all_years$TIME <- with(crime_all_years, hms(sprintf('%02d%02d', HOUR, MINUTE)))
#
# Find where na is at
# hour_no_na <- crime_all_years %>%
#   filter( !is.na(HOUR) )
#
# crime_all_years[!is.na(HOUR)]$TIME <- with(crime_all_years[!is.na(HOUR)], hms(sprintf('%02d%02d', HOUR, MINUTE)))
#
#
# Cannot create an hour-minute time with NA's, nor just the subset and let their time be NA.
# When checking time of crimes I will only use "HOUR" and exclude NA's

remove_list <- c("YEAR", "MONTH", "DAY", "MINUTE","HUNDRED_BLOCK")

# Removing some variables we won't need

crime_reduced <- crime_all_years %>%
  select( -remove_list ) %>%
  filter(!is.na(HOUR))

Column names:

colnames(crime_reduced)
[1] "TYPE"          "HOUR"          "NEIGHBOURHOOD" "X"            
[5] "Y"             "DATE"         

Reordering column names

crime_tidy <- crime_reduced[, c(1,6,2,3,4,5)]

Let’s check how our new data looks;

glimpse(crime_tidy)
Observations: 543,467
Variables: 6
$ TYPE          <chr> "Break and Enter Commercial", "Break and Enter R...
$ DATE          <date> 2003-06-21, 2003-05-22, 2003-06-29, 2003-06-08,...
$ HOUR          <int> 10, 11, 22, 5, 0, 17, 5, 13, 20, 19, 18, 5, 15, ...
$ NEIGHBOURHOOD <chr> "Strathcona", "West End", "Sunset", "West Point ...
$ X             <dbl> 493000.2, 490144.6, 492603.6, 486285.1, 490144.6...
$ Y             <dbl> 5457728, 5458627, 5452436, 5456660, 5458627, 545...
class(crime_tidy)
[1] "data.frame"
# TYPE and NEIGHBOURHOOD are characters, and crime_tidy is a dataframe, lets change those to factors so we can plot them and change the df to a data.table. DT will be faster, save some memory, and easy to manipulate.

TYPE and NEIGHBOURHOOD are characters, and crime_tidy is a dataframe, lets change those to factors so we can plot them and change the df to a data.table. DT will be faster, save some memory, and easy to manipulate.

crime_tidy <- as.data.table(crime_tidy)

crime_tidy[, TYPE := as.factor(TYPE)]
crime_tidy[, NEIGHBOURHOOD := as.factor(NEIGHBOURHOOD)]

glimpse(crime_tidy)
Observations: 543,467
Variables: 6
$ TYPE          <fct> Break and Enter Commercial, Break and Enter Resi...
$ DATE          <date> 2003-06-21, 2003-05-22, 2003-06-29, 2003-06-08,...
$ HOUR          <int> 10, 11, 22, 5, 0, 17, 5, 13, 20, 19, 18, 5, 15, ...
$ NEIGHBOURHOOD <fct> Strathcona, West End, Sunset, West Point Grey, W...
$ X             <dbl> 493000.2, 490144.6, 492603.6, 486285.1, 490144.6...
$ Y             <dbl> 5457728, 5458627, 5452436, 5456660, 5458627, 545...
class(crime_tidy)
[1] "data.table" "data.frame"

Results

Crime Distribution

ggplot(count_of_crimes, aes(x = reorder(TYPE,total_of_crime), y = sqrt(total_of_crime) )) + 
  geom_col(fill = "steelblue") +
  coord_flip() + 
  scale_y_continuous(breaks=seq(0,550,50)) +
  labs( y = "Square Root of Total Crime", x = "Type of Crime Committed", title = "Crime Distribution", caption = "Vancouver 2003-2019", subtitle = "By Type of Crime") +
  geom_text(aes(label = round(sqrt(total_of_crime)), y = sqrt(total_of_crime) + 15)) + #adds num to bar
  theme_simple()

Here the count of total crimes is skewed, Vehicle Collision or Pedestrian Struck (with Fatality) & Homicide have such low counts relatively that they barely show on the plot. So finding a good transformation to visualize this I thought of taking the square root. It definitely holds its shape, unlike the log-transform that makes everything look near equal with the exceptation of 2 previous crimes.

Types of Crime at Hour

crime_tidy %>%
ggplot( aes(x = HOUR, fill = TYPE)) +
  geom_bar(position = "stack") +
  labs(title = "Crime Distribution", subtitle = "By hour of day", caption = "Vancouver 2003-2019") +
  scale_fill_brewer() +
  theme_simple()

Most crime happens at the same time of day, but we need to look at this separately to see if all crimes fall under this. Expecting that some crimes should happen more during the day when people would not be home, i.e. break and entering a residential home.

crime_all_years %>%
ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Time of crime by year", subtitle = "By hour of day", caption = "Vancouver 2003-2019") +
  facet_wrap(~YEAR) +
  theme_simple()

Pretty steady trend throughout the years, the hour of when a crime is committed seems constant. The chance of a crime happening is more likely later in the day, little surprise is the peak at 18:00 (6 pm). We should look into if crime has been steady over the years or if there’s any significant increase/decrease.

# >>> HTML how to center graph ??

crime_tidy %>%
ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "What Time are Certain Crimes Committed?", subtitle = "By Type of Crime"  , caption = "Vancouver 2003-2019") +
  facet_wrap(~TYPE) 

Theft from Vehicle & of Vehicle spike in the late evening and drops just after midnight. I want to look into the break and enter categories, from here it looks like what would be expected. Which is break and entering into a commercial building when work hours are usually finished, break and entering into residential when they are at usual working hours.

Taking a look at Break and Enter

crime_tidy[TYPE  %like% "Break"] %>%
  ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Breaking and Entering", subtitle = "Count of Crimes at Hour", caption = "Vancouver 2003-2019") +
  facet_wrap(~TYPE) +
  theme_simple()

Although there are still counts of crime being committed in the off-hours, it is just as expected. These crimes committed are at usual times they would be empty, i.e. commercial buildings after working hours and residential buildings during working hours.

Since ‘other’ is not explicitly said we cannot assume these are homes. There is huge spike at noon for residential/other buildings, the multimodal shape suggests there are key times for breaking and entering these places. In the morning right as people are leaving for work, at noon when most people would have lunch, 18:00 when some people would go out for dinner (or non-residential), and at midnight when most people would be asleep.

crime_tidy[TYPE  %like% "Other Theft"] %>%
  ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Other Theft", subtitle = "Count of Crimes at Hour", caption = "Vancouver 2003-2019") +
  theme_simple()

Looks normally distributed about 15.5, interesting.

Crime by Neighbourhood

crime_tidy[, .N, by = .(NEIGHBOURHOOD)] %>%
  arrange(desc(N))
NEIGHBOURHOOD N
Central Business District 131780
West End 47142
Fairview 35751
Mount Pleasant 35074
Grandview-Woodland 30692
Renfrew-Collingwood 30309
Kitsilano 29864
Kensington-Cedar Cottage 27649
Strathcona 24645
Hastings-Sunrise 20592
Sunset 19248
Marpole 14731
Riley Park 14206
Victoria-Fraserview 12045
Killarney 11550
Oakridge 9023
Dunbar-Southlands 8588
Kerrisdale 8247
Arbutus Ridge 6665
West Point Grey 6557
Shaughnessy 6182
South Cambie 5830
Stanley Park 4093
2437
Musqueam 567

Information removed to protect privacy on some crimes.

From the table above it clearly shows that Central Business District has had the most crime, more than double the amount West End has had. Also there is count of crimes that have missing values for NEIGHBOURHOOD, which is offset for privacy reasons, that may be a bit confusing. Thus, to compare all other neighbourhoods and not skew the plot I have removed those 2 factors.

# SQL in R: dyplr
crime_tidy %>%
  filter(NEIGHBOURHOOD != "Central Business District", NEIGHBOURHOOD != "") %>%
  count(NEIGHBOURHOOD) %>%
  group_by(NEIGHBOURHOOD) %>%
  rename(crimes_committed = n) %>%
ggplot( aes(x = reorder(NEIGHBOURHOOD,crimes_committed), y = sqrt(crimes_committed) ) ) +
  geom_col(fill = "steelblue") +
  labs(title = "Crime Count by Neighbourhood", subtitle = "Excluded: Central Business District & Missing Values due to Privacy" , caption = "Vancouver 2003-2019", x = "Neighbourhood", y = "Square Root of Crimes Committed") +
  coord_flip() +
  geom_text(aes(label = round(sqrt(crimes_committed)), y = sqrt(crimes_committed) + 5)) +
  theme_simple()

rogers <- data.frame( X = 491853,
                      Y = 5458227)

crime_tidy %>%
  ggplot( aes(x = X,y = Y, color = NEIGHBOURHOOD)) +
  guides(color = guide_legend(override.aes = list(size=8))) + # resize the legend
  geom_point(alpha = 0.8, data = crime_tidy[crime_tidy$NEIGHBOURHOOD != "",] ) + 
  coord_cartesian(xlim = c(484000,498000), ylim = c(5450000, 5462500) ) +
  labs(title = "UTM Coordinates of Crimes Committed", subtitle = "By Neighbourhood", caption = "Vancouver 2003-2019", y ="North-South position", x = "East-West position ") +
  theme_simple() +
  theme(legend.text = element_text(colour="gray15", size = 10),
        legend.title = element_text(colour="black", size = 12),
        plot.title = element_text(lineheight=3, face="bold", 
               color="black", size=14),
        plot.caption = element_text(color = "gray15")
        ) +
  geom_point(data = rogers, aes(x= X, y = Y), color = "black", size = 50, shape = 1) +
  geom_point(data = rogers, aes(x= X, y = Y), color = "black", size = 3) + 
  annotate(geom="text", x=491853, y=5458600, label="Rogers Arena",
              color="black")

# Can I make this interactive? drag bar for year/month? Taking up too much memory, 
# will crash for some people, not worth it

With all the coordinates from crimes it almost makes a full map of Vancouver. There is certain sections on that map void of points that have a distinct shape. The bottom right for example, Killarney, has 2 void shapes. They happen to be a golf course and a wooded park that has an off-leash area.

Rogers Arena hosted the 2011 Stanley Cup Finals, on June 15 Vancouver Canucks lost to Boston Bruins 4:0 in game 7, which was followed by a riot.

Neighbourhood Crime: Time Series

# ts_test <- crime_tidy %>%
#   filter(NEIGHBOURHOOD != "",NEIGHBOURHOOD != "Central Business District") %>%
#   group_by(NEIGHBOURHOOD) %>%
#   count(DATE)
# 
# ts_test %>% 
#    group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
#    summarize(amount=sum(n)) %>%
#    filter(amount > 200) %>%
#    summarize(sum = sum(amount) )
 

high_crime <- list("Fairview" , "Grandview-Woodland" , "Hastings-Sunrise","Kensington-Cedar Cottage","Kitsilano", "Mount Pleasant"  ,"Renfrew-Collingwood " ,"West End" )
  
  
low_crime <- list("Arbutus Ridge" ,"Musqueam", "Oakridge", "Shaughnessy", "South Cambie","Killarney", "Dunbar-Southlands", "Kerrisdale", "Stanley Park", "Victoria-Fraserview", "West Point Grey"  )

Here I split up the the neighbourhoods to compare relative changes, need to change the scale to compare.

High-Range Crime Neighbourhoods
# High-Range Crime Neighbourhoods
  crime_tidy %>%
  filter(NEIGHBOURHOOD %in% high_crime) %>% 
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>% 
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  #summarize(sum = sum(amount) )
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
  facet_wrap(~NEIGHBOURHOOD) +
    theme_bw(base_size = 22) +
  labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "High-Range Crime Neighbourhoods" ) 

Medium-Range Crime Neighbourhoods
# Medium Range Crime Neighbourhoods
  crime_tidy %>%
  filter(!NEIGHBOURHOOD %in% high_crime, NEIGHBOURHOOD != "", NEIGHBOURHOOD != "Central Business District",            !NEIGHBOURHOOD %in% low_crime) %>% 
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>% 
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
  facet_wrap(~NEIGHBOURHOOD) +
    theme_bw(base_size = 22) +
    labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Medium-Range Crime Neighbourhoods" )

Low-Range Crime Neighbourhoods
# Low Range Crime Neighbourhoods
  
    crime_tidy %>%
  filter(NEIGHBOURHOOD %in% low_crime) %>% 
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>% 
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
  facet_wrap(~NEIGHBOURHOOD) +
      theme_bw(base_size = 22) +
      labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Low-Range Crime Neighbourhoods" )

# Select NEIGHBOURHOOD where amount > 200
# make a list then can split the neighbourhood and make multiple graphs: central will get its own
# and then can compare the ones with high vs high counts and low vs low counts
Central Business District
    crime_tidy %>%
  filter(NEIGHBOURHOOD == "Central Business District") %>%
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>%
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
      theme_gray() +
        labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Central Business District" )

# 2011-06-01 had a huge spike in crime,  2011 Stanley Cup Finals, West End had a small spike. 
    # Since the 2011 Stanley Cup Riot in Vancouver crime in the surrounding areas have steadily increased.
    # The Rogers Arena that hosts the Vancouver Canucks is located between Central Business District and
    # West End. Both of Which have seen an increase in crime since the riot. This could be the long 
    # lasting effects from the riot. Though damages may have recovered since then maybe the attitude
    # of not caring still carries through.
    
    
# 2019-06-06 end date
    
#how have neighbourhoods changed over time

Counts are for the whole month

In June 2011 Central Business District had a huge spike in crime, 2011-06-15 was the Stanley Cup Finals at Rogers Arena, and West End also a small spike. Since the 2011 Stanley Cup Riot in Vancouver crime in the surrounding areas have steadily increased. The Rogers Arena that hosts the Vancouver Canucks is located in Central Business District near West End. Both of Which have seen an increase in crime since the riot. This could be the long lasting effects from the riot. Though damages may have recovered since then maybe the attitude of not caring still carries through. There may also be other factors contributing to the increase in crime in these areas.

Homicides Counts
offset[TYPE == "Homicide" , .N, by = .(TYPE, YEAR)] %>%
  ggplot( aes(x = YEAR, y = N)) +
  geom_line( size = 1, color = "steelblue") +
  theme_gray() +
          labs(caption = "Vancouver 2003-2019", title = "Homicides in Vancouver", x = "Date", y = "Count", subtitle = "Count of Homicides per Year" )

offset[TYPE == "Homicide" , .N, by = .(TYPE, YEAR)][, .(YEAR, N)]
YEAR N
2003 18
2004 22
2005 22
2006 17
2007 19
2008 18
2009 18
2010 10
2011 15
2012 8
2013 7
2014 9
2015 15
2016 11
2017 17
2018 15
2019 6

As of 2019-06-7, the homicide count for 2019 is lower than any previous year, hopefully it stays that way.