Crime in Vancouver

Analysis

Data

Source: The data comes from the Vancouver Open Data Catalogue (all years). It was extracted on 2019-06-11 and contains 603,986 records from 2003-01-01 to 2019-06-7. By David Rucinski

crime_all_years <- fread("crime_csv_all_years.csv")

# Check start and end dates
# crime_all_years %>%
#   arrange(desc(YEAR),desc(MONTH),desc(DAY)) %>%
#   head()
# 
# crime_all_years %>%
#   arrange(desc(YEAR),desc(MONTH),desc(DAY)) %>%
#   tail()

Let’s take a peek at the data

head(crime_all_years, 20)

TYPE	YEAR	MONTH	DAY	HOUR	MINUTE	HUNDRED_BLOCK	NEIGHBOURHOOD	X	Y
Break and Enter Commercial	2003	6	21	10	0	3XX TERMINAL AVE	Strathcona	493000.2	5457728
Break and Enter Residential/Other	2003	5	22	11	0	13XX BUTE ST	West End	490144.6	5458627
Theft of Vehicle	2003	6	29	22	45	66XX MAIN ST	Sunset	492603.6	5452436
Offence Against a Person	2003	8	26	NA	NA	OFFSET TO PROTECT PRIVACY		0.0	0
Break and Enter Residential/Other	2003	6	8	5	0	38XX W 11TH AVE	West Point Grey	486285.1	5456660
Break and Enter Residential/Other	2003	12	6	0	0	13XX BUTE ST	West End	490144.6	5458627
Theft from Vehicle	2003	2	6	17	0	40XX MACDONALD ST	Arbutus Ridge	487791.5	5455365
Break and Enter Commercial	2003	6	7	5	5	E 38TH AVE / MAIN ST	Riley Park	492599.2	5453734
Theft from Vehicle	2003	7	3	13	35	BUTE ST / ROBSON ST	West End	490767.2	5459255
Theft of Vehicle	2003	10	2	20	0	66XX KERR ST	Killarney	497013.7	5452379
Theft of Vehicle	2003	5	30	19	0	25XX BALACLAVA ST	Kitsilano	487345.6	5456833
Offence Against a Person	2003	9	23	NA	NA	OFFSET TO PROTECT PRIVACY		0.0	0
Theft of Vehicle	2003	4	7	18	30	66XX FRASER ST	Sunset	493392.6	5452413
Break and Enter Commercial	2003	6	28	5	40	48XX VICTORIA DR	Kensington-Cedar Cottage	495241.9	5454303
Theft from Vehicle	2003	3	7	15	0	14XX W 12TH AVE	Fairview	490073.8	5456457
Break and Enter Commercial	2003	6	29	0	5	48XX VICTORIA DR	Kensington-Cedar Cottage	495241.9	5454303
Theft from Vehicle	2003	3	26	14	0	2XX GRANVILLE ST	Central Business District	491833.6	5459287
Theft of Vehicle	2003	3	16	14	20	66XX FRASER ST	Sunset	493392.6	5452413
Break and Enter Commercial	2003	9	26	2	30	10XX ALBERNI ST	West End	491067.7	5459114
Break and Enter Commercial	2003	6	2	0	1	48XX VICTORIA DR	Kensington-Cedar Cottage	495246.7	5454158

Preprocessing

To do:

-Combine date into single record : filter data

-Change TYPE, NEIGHBOURHOOD as factor : clean data

-Check hundred_block == “OFFSET TO PROTECT PRIVACY” : explore data

-Drop hundred_block : filter data

-Make plots : visual data

Breakdown of crime

# Dplyr way
 count_of_crimes <- crime_all_years %>%
   count(TYPE) %>%
   rename(total_of_crime = n)


# Data table way
crime_all_years[, .N, by = .(TYPE)]

TYPE	N
Break and Enter Commercial	37745
Break and Enter Residential/Other	65236
Theft of Vehicle	40802
Offence Against a Person	60272
Theft from Vehicle	201531
Mischief	81284
Vehicle Collision or Pedestrian Struck (with Injury)	24650
Theft of Bicycle	29687
Other Theft	62247
Vehicle Collision or Pedestrian Struck (with Fatality)	285
Homicide	247

OFFSET TO PROTECT PRIVACY

Neighbourhood & Time removed

# Through the glimpse() I saw OFFSET TO PROTECT PRIVACY, just wondering what this is and why.
offset <- crime_all_years %>%
  filter(HUNDRED_BLOCK == "OFFSET TO PROTECT PRIVACY") %>%
  group_by(TYPE)

# Offset to protect privacy:
# What kind of a offensives would that need that?
offset <- as.data.table(offset)
offset[, .N, by = .(TYPE)]

TYPE	N
Offence Against a Person	60272
Homicide	247

Information removed to protect privacy on some crimes.

Homicides make sense why you would want privacy, also for any unusual charges that would be easy to identify the persons involved through cross referencing news articles.

Clean date-time

crime_all_years$DATE <- with(crime_all_years, ymd(sprintf('%04d%02d%02d', YEAR, MONTH, DAY)))

# Gather up all the time data for crimes and we can look at a time-series analysis, see if there
# is any increase/decrease in crime of this data set. Also check to see time of day the crimes
# are usually committed.


# crime_all_years$TIME <- with(crime_all_years, hms(sprintf('%02d%02d', HOUR, MINUTE)))
#
# Find where na is at
# hour_no_na <- crime_all_years %>%
#   filter( !is.na(HOUR) )
#
# crime_all_years[!is.na(HOUR)]$TIME <- with(crime_all_years[!is.na(HOUR)], hms(sprintf('%02d%02d', HOUR, MINUTE)))
#
#
# Cannot create an hour-minute time with NA's, nor just the subset and let their time be NA.
# When checking time of crimes I will only use "HOUR" and exclude NA's

remove_list <- c("YEAR", "MONTH", "DAY", "MINUTE","HUNDRED_BLOCK")

# Removing some variables we won't need

crime_reduced <- crime_all_years %>%
  select( -remove_list ) %>%
  filter(!is.na(HOUR))

Column names:

colnames(crime_reduced)

[1] "TYPE"          "HOUR"          "NEIGHBOURHOOD" "X"            
[5] "Y"             "DATE"

Reordering column names

crime_tidy <- crime_reduced[, c(1,6,2,3,4,5)]

Let’s check how our new data looks;

glimpse(crime_tidy)

Observations: 543,467
Variables: 6
$ TYPE          <chr> "Break and Enter Commercial", "Break and Enter R...
$ DATE          <date> 2003-06-21, 2003-05-22, 2003-06-29, 2003-06-08,...
$ HOUR          <int> 10, 11, 22, 5, 0, 17, 5, 13, 20, 19, 18, 5, 15, ...
$ NEIGHBOURHOOD <chr> "Strathcona", "West End", "Sunset", "West Point ...
$ X             <dbl> 493000.2, 490144.6, 492603.6, 486285.1, 490144.6...
$ Y             <dbl> 5457728, 5458627, 5452436, 5456660, 5458627, 545...

class(crime_tidy)

[1] "data.frame"

# TYPE and NEIGHBOURHOOD are characters, and crime_tidy is a dataframe, lets change those to factors so we can plot them and change the df to a data.table. DT will be faster, save some memory, and easy to manipulate.

TYPE and NEIGHBOURHOOD are characters, and crime_tidy is a dataframe, lets change those to factors so we can plot them and change the df to a data.table. DT will be faster, save some memory, and easy to manipulate.

crime_tidy <- as.data.table(crime_tidy)

crime_tidy[, TYPE := as.factor(TYPE)]
crime_tidy[, NEIGHBOURHOOD := as.factor(NEIGHBOURHOOD)]

glimpse(crime_tidy)

Observations: 543,467
Variables: 6
$ TYPE          <fct> Break and Enter Commercial, Break and Enter Resi...
$ DATE          <date> 2003-06-21, 2003-05-22, 2003-06-29, 2003-06-08,...
$ HOUR          <int> 10, 11, 22, 5, 0, 17, 5, 13, 20, 19, 18, 5, 15, ...
$ NEIGHBOURHOOD <fct> Strathcona, West End, Sunset, West Point Grey, W...
$ X             <dbl> 493000.2, 490144.6, 492603.6, 486285.1, 490144.6...
$ Y             <dbl> 5457728, 5458627, 5452436, 5456660, 5458627, 545...

class(crime_tidy)

[1] "data.table" "data.frame"

Results

Crime Distribution

ggplot(count_of_crimes, aes(x = reorder(TYPE,total_of_crime), y = sqrt(total_of_crime) )) + 
  geom_col(fill = "steelblue") +
  coord_flip() + 
  scale_y_continuous(breaks=seq(0,550,50)) +
  labs( y = "Square Root of Total Crime", x = "Type of Crime Committed", title = "Crime Distribution", caption = "Vancouver 2003-2019", subtitle = "By Type of Crime") +
  geom_text(aes(label = round(sqrt(total_of_crime)), y = sqrt(total_of_crime) + 15)) + #adds num to bar
  theme_simple()

Here the count of total crimes is skewed, Vehicle Collision or Pedestrian Struck (with Fatality) & Homicide have such low counts relatively that they barely show on the plot. So finding a good transformation to visualize this I thought of taking the square root. It definitely holds its shape, unlike the log-transform that makes everything look near equal with the exceptation of 2 previous crimes.

Types of Crime at Hour

crime_tidy %>%
ggplot( aes(x = HOUR, fill = TYPE)) +
  geom_bar(position = "stack") +
  labs(title = "Crime Distribution", subtitle = "By hour of day", caption = "Vancouver 2003-2019") +
  scale_fill_brewer() +
  theme_simple()

Most crime happens at the same time of day, but we need to look at this separately to see if all crimes fall under this. Expecting that some crimes should happen more during the day when people would not be home, i.e. break and entering a residential home.

crime_all_years %>%
ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Time of crime by year", subtitle = "By hour of day", caption = "Vancouver 2003-2019") +
  facet_wrap(~YEAR) +
  theme_simple()

Pretty steady trend throughout the years, the hour of when a crime is committed seems constant. The chance of a crime happening is more likely later in the day, little surprise is the peak at 18:00 (6 pm). We should look into if crime has been steady over the years or if there’s any significant increase/decrease.

# >>> HTML how to center graph ??

crime_tidy %>%
ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "What Time are Certain Crimes Committed?", subtitle = "By Type of Crime"  , caption = "Vancouver 2003-2019") +
  facet_wrap(~TYPE)

Theft from Vehicle & of Vehicle spike in the late evening and drops just after midnight. I want to look into the break and enter categories, from here it looks like what would be expected. Which is break and entering into a commercial building when work hours are usually finished, break and entering into residential when they are at usual working hours.

Taking a look at Break and Enter

crime_tidy[TYPE  %like% "Break"] %>%
  ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Breaking and Entering", subtitle = "Count of Crimes at Hour", caption = "Vancouver 2003-2019") +
  facet_wrap(~TYPE) +
  theme_simple()

Although there are still counts of crime being committed in the off-hours, it is just as expected. These crimes committed are at usual times they would be empty, i.e. commercial buildings after working hours and residential buildings during working hours.

Since ‘other’ is not explicitly said we cannot assume these are homes. There is huge spike at noon for residential/other buildings, the multimodal shape suggests there are key times for breaking and entering these places. In the morning right as people are leaving for work, at noon when most people would have lunch, 18:00 when some people would go out for dinner (or non-residential), and at midnight when most people would be asleep.

crime_tidy[TYPE  %like% "Other Theft"] %>%
  ggplot( aes(x = HOUR)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Other Theft", subtitle = "Count of Crimes at Hour", caption = "Vancouver 2003-2019") +
  theme_simple()

Looks normally distributed about 15.5, interesting.

Crime by Neighbourhood

crime_tidy[, .N, by = .(NEIGHBOURHOOD)] %>%
  arrange(desc(N))

NEIGHBOURHOOD	N
Central Business District	131780
West End	47142
Fairview	35751
Mount Pleasant	35074
Grandview-Woodland	30692
Renfrew-Collingwood	30309
Kitsilano	29864
Kensington-Cedar Cottage	27649
Strathcona	24645
Hastings-Sunrise	20592
Sunset	19248
Marpole	14731
Riley Park	14206
Victoria-Fraserview	12045
Killarney	11550
Oakridge	9023
Dunbar-Southlands	8588
Kerrisdale	8247
Arbutus Ridge	6665
West Point Grey	6557
Shaughnessy	6182
South Cambie	5830
Stanley Park	4093
	2437
Musqueam	567

Information removed to protect privacy on some crimes.

From the table above it clearly shows that Central Business District has had the most crime, more than double the amount West End has had. Also there is count of crimes that have missing values for NEIGHBOURHOOD, which is offset for privacy reasons, that may be a bit confusing. Thus, to compare all other neighbourhoods and not skew the plot I have removed those 2 factors.

# SQL in R: dyplr
crime_tidy %>%
  filter(NEIGHBOURHOOD != "Central Business District", NEIGHBOURHOOD != "") %>%
  count(NEIGHBOURHOOD) %>%
  group_by(NEIGHBOURHOOD) %>%
  rename(crimes_committed = n) %>%
ggplot( aes(x = reorder(NEIGHBOURHOOD,crimes_committed), y = sqrt(crimes_committed) ) ) +
  geom_col(fill = "steelblue") +
  labs(title = "Crime Count by Neighbourhood", subtitle = "Excluded: Central Business District & Missing Values due to Privacy" , caption = "Vancouver 2003-2019", x = "Neighbourhood", y = "Square Root of Crimes Committed") +
  coord_flip() +
  geom_text(aes(label = round(sqrt(crimes_committed)), y = sqrt(crimes_committed) + 5)) +
  theme_simple()

rogers <- data.frame( X = 491853,
                      Y = 5458227)

crime_tidy %>%
  ggplot( aes(x = X,y = Y, color = NEIGHBOURHOOD)) +
  guides(color = guide_legend(override.aes = list(size=8))) + # resize the legend
  geom_point(alpha = 0.8, data = crime_tidy[crime_tidy$NEIGHBOURHOOD != "",] ) + 
  coord_cartesian(xlim = c(484000,498000), ylim = c(5450000, 5462500) ) +
  labs(title = "UTM Coordinates of Crimes Committed", subtitle = "By Neighbourhood", caption = "Vancouver 2003-2019", y ="North-South position", x = "East-West position ") +
  theme_simple() +
  theme(legend.text = element_text(colour="gray15", size = 10),
        legend.title = element_text(colour="black", size = 12),
        plot.title = element_text(lineheight=3, face="bold", 
               color="black", size=14),
        plot.caption = element_text(color = "gray15")
        ) +
  geom_point(data = rogers, aes(x= X, y = Y), color = "black", size = 50, shape = 1) +
  geom_point(data = rogers, aes(x= X, y = Y), color = "black", size = 3) + 
  annotate(geom="text", x=491853, y=5458600, label="Rogers Arena",
              color="black")

# Can I make this interactive? drag bar for year/month? Taking up too much memory, 
# will crash for some people, not worth it

With all the coordinates from crimes it almost makes a full map of Vancouver. There is certain sections on that map void of points that have a distinct shape. The bottom right for example, Killarney, has 2 void shapes. They happen to be a golf course and a wooded park that has an off-leash area.

Rogers Arena hosted the 2011 Stanley Cup Finals, on June 15 Vancouver Canucks lost to Boston Bruins 4:0 in game 7, which was followed by a riot.

Neighbourhood Crime: Time Series

# ts_test <- crime_tidy %>%
#   filter(NEIGHBOURHOOD != "",NEIGHBOURHOOD != "Central Business District") %>%
#   group_by(NEIGHBOURHOOD) %>%
#   count(DATE)
# 
# ts_test %>% 
#    group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
#    summarize(amount=sum(n)) %>%
#    filter(amount > 200) %>%
#    summarize(sum = sum(amount) )
 

high_crime <- list("Fairview" , "Grandview-Woodland" , "Hastings-Sunrise","Kensington-Cedar Cottage","Kitsilano", "Mount Pleasant"  ,"Renfrew-Collingwood " ,"West End" )
  
  
low_crime <- list("Arbutus Ridge" ,"Musqueam", "Oakridge", "Shaughnessy", "South Cambie","Killarney", "Dunbar-Southlands", "Kerrisdale", "Stanley Park", "Victoria-Fraserview", "West Point Grey"  )

Here I split up the the neighbourhoods to compare relative changes, need to change the scale to compare.

High-Range Crime Neighbourhoods

# High-Range Crime Neighbourhoods
  crime_tidy %>%
  filter(NEIGHBOURHOOD %in% high_crime) %>% 
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>% 
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  #summarize(sum = sum(amount) )
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
  facet_wrap(~NEIGHBOURHOOD) +
    theme_bw(base_size = 22) +
  labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "High-Range Crime Neighbourhoods" )

Medium-Range Crime Neighbourhoods

# Medium Range Crime Neighbourhoods
  crime_tidy %>%
  filter(!NEIGHBOURHOOD %in% high_crime, NEIGHBOURHOOD != "", NEIGHBOURHOOD != "Central Business District",            !NEIGHBOURHOOD %in% low_crime) %>% 
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>% 
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
  facet_wrap(~NEIGHBOURHOOD) +
    theme_bw(base_size = 22) +
    labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Medium-Range Crime Neighbourhoods" )

Low-Range Crime Neighbourhoods

# Low Range Crime Neighbourhoods
  
    crime_tidy %>%
  filter(NEIGHBOURHOOD %in% low_crime) %>% 
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>% 
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
  facet_wrap(~NEIGHBOURHOOD) +
      theme_bw(base_size = 22) +
      labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Low-Range Crime Neighbourhoods" )

# Select NEIGHBOURHOOD where amount > 200
# make a list then can split the neighbourhood and make multiple graphs: central will get its own
# and then can compare the ones with high vs high counts and low vs low counts

Central Business District

    crime_tidy %>%
  filter(NEIGHBOURHOOD == "Central Business District") %>%
  group_by(NEIGHBOURHOOD) %>%
  count(DATE) %>%
  group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
  summarize(amount=sum(n)) %>%
  ggplot( aes(x = month, y = amount) ) +
  geom_line(color = "steelblue") +
      theme_gray() +
        labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Central Business District" )

# 2011-06-01 had a huge spike in crime,  2011 Stanley Cup Finals, West End had a small spike. 
    # Since the 2011 Stanley Cup Riot in Vancouver crime in the surrounding areas have steadily increased.
    # The Rogers Arena that hosts the Vancouver Canucks is located between Central Business District and
    # West End. Both of Which have seen an increase in crime since the riot. This could be the long 
    # lasting effects from the riot. Though damages may have recovered since then maybe the attitude
    # of not caring still carries through.
    
    
# 2019-06-06 end date
    
#how have neighbourhoods changed over time

Counts are for the whole month

In June 2011 Central Business District had a huge spike in crime, 2011-06-15 was the Stanley Cup Finals at Rogers Arena, and West End also a small spike. Since the 2011 Stanley Cup Riot in Vancouver crime in the surrounding areas have steadily increased. The Rogers Arena that hosts the Vancouver Canucks is located in Central Business District near West End. Both of Which have seen an increase in crime since the riot. This could be the long lasting effects from the riot. Though damages may have recovered since then maybe the attitude of not caring still carries through. There may also be other factors contributing to the increase in crime in these areas.

Homicides Counts

offset[TYPE == "Homicide" , .N, by = .(TYPE, YEAR)] %>%
  ggplot( aes(x = YEAR, y = N)) +
  geom_line( size = 1, color = "steelblue") +
  theme_gray() +
          labs(caption = "Vancouver 2003-2019", title = "Homicides in Vancouver", x = "Date", y = "Count", subtitle = "Count of Homicides per Year" )

offset[TYPE == "Homicide" , .N, by = .(TYPE, YEAR)][, .(YEAR, N)]

YEAR	N
2003	18
2004	22
2005	22
2006	17
2007	19
2008	18
2009	18
2010	10
2011	15
2012	8
2013	7
2014	9
2015	15
2016	11
2017	17
2018	15
2019	6

As of 2019-06-7, the homicide count for 2019 is lower than any previous year, hopefully it stays that way.