Identify and examine crime trends from Vancouver’s data. Identify how has crime changed over the years, if certain neighbourhoods more prone to crimes, and what type of crime happens the most. In addition, this data will also give me some practice with these packages: dplyr, ggplot2, lubridate, and data.table.
Given this data: Theft of Vehicle is the most common crime to happen in Vancouver. In general 17:00 seems to be the peak time for a crime to be committed, although certain crimes favour different hours. The Central Business District seems to have the most crime committed over the years. It had been going down over the years until the 2011 riot, since then crime has steadily increased in Central Business District and surrounding neighbourhoods.
Packages used:
library(data.table)
library(ggplot2)
library(dplyr)
library(lubridate)
Source: The data comes from the Vancouver Open Data Catalogue (all years). It was extracted on 2019-06-11 and contains 603,986 records from 2003-01-01 to 2019-06-7. By David Rucinski
crime_all_years <- fread("crime_csv_all_years.csv")
# Check start and end dates
# crime_all_years %>%
# arrange(desc(YEAR),desc(MONTH),desc(DAY)) %>%
# head()
#
# crime_all_years %>%
# arrange(desc(YEAR),desc(MONTH),desc(DAY)) %>%
# tail()
head(crime_all_years, 20)
TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y |
---|---|---|---|---|---|---|---|---|---|
Break and Enter Commercial | 2003 | 6 | 21 | 10 | 0 | 3XX TERMINAL AVE | Strathcona | 493000.2 | 5457728 |
Break and Enter Residential/Other | 2003 | 5 | 22 | 11 | 0 | 13XX BUTE ST | West End | 490144.6 | 5458627 |
Theft of Vehicle | 2003 | 6 | 29 | 22 | 45 | 66XX MAIN ST | Sunset | 492603.6 | 5452436 |
Offence Against a Person | 2003 | 8 | 26 | NA | NA | OFFSET TO PROTECT PRIVACY | 0.0 | 0 | |
Break and Enter Residential/Other | 2003 | 6 | 8 | 5 | 0 | 38XX W 11TH AVE | West Point Grey | 486285.1 | 5456660 |
Break and Enter Residential/Other | 2003 | 12 | 6 | 0 | 0 | 13XX BUTE ST | West End | 490144.6 | 5458627 |
Theft from Vehicle | 2003 | 2 | 6 | 17 | 0 | 40XX MACDONALD ST | Arbutus Ridge | 487791.5 | 5455365 |
Break and Enter Commercial | 2003 | 6 | 7 | 5 | 5 | E 38TH AVE / MAIN ST | Riley Park | 492599.2 | 5453734 |
Theft from Vehicle | 2003 | 7 | 3 | 13 | 35 | BUTE ST / ROBSON ST | West End | 490767.2 | 5459255 |
Theft of Vehicle | 2003 | 10 | 2 | 20 | 0 | 66XX KERR ST | Killarney | 497013.7 | 5452379 |
Theft of Vehicle | 2003 | 5 | 30 | 19 | 0 | 25XX BALACLAVA ST | Kitsilano | 487345.6 | 5456833 |
Offence Against a Person | 2003 | 9 | 23 | NA | NA | OFFSET TO PROTECT PRIVACY | 0.0 | 0 | |
Theft of Vehicle | 2003 | 4 | 7 | 18 | 30 | 66XX FRASER ST | Sunset | 493392.6 | 5452413 |
Break and Enter Commercial | 2003 | 6 | 28 | 5 | 40 | 48XX VICTORIA DR | Kensington-Cedar Cottage | 495241.9 | 5454303 |
Theft from Vehicle | 2003 | 3 | 7 | 15 | 0 | 14XX W 12TH AVE | Fairview | 490073.8 | 5456457 |
Break and Enter Commercial | 2003 | 6 | 29 | 0 | 5 | 48XX VICTORIA DR | Kensington-Cedar Cottage | 495241.9 | 5454303 |
Theft from Vehicle | 2003 | 3 | 26 | 14 | 0 | 2XX GRANVILLE ST | Central Business District | 491833.6 | 5459287 |
Theft of Vehicle | 2003 | 3 | 16 | 14 | 20 | 66XX FRASER ST | Sunset | 493392.6 | 5452413 |
Break and Enter Commercial | 2003 | 9 | 26 | 2 | 30 | 10XX ALBERNI ST | West End | 491067.7 | 5459114 |
Break and Enter Commercial | 2003 | 6 | 2 | 0 | 1 | 48XX VICTORIA DR | Kensington-Cedar Cottage | 495246.7 | 5454158 |
To do:
-Combine date into single record : filter data
-Change TYPE, NEIGHBOURHOOD as factor : clean data
-Check hundred_block == “OFFSET TO PROTECT PRIVACY” : explore data
-Drop hundred_block : filter data
-Make plots : visual data
# Dplyr way
count_of_crimes <- crime_all_years %>%
count(TYPE) %>%
rename(total_of_crime = n)
# Data table way
crime_all_years[, .N, by = .(TYPE)]
TYPE | N |
---|---|
Break and Enter Commercial | 37745 |
Break and Enter Residential/Other | 65236 |
Theft of Vehicle | 40802 |
Offence Against a Person | 60272 |
Theft from Vehicle | 201531 |
Mischief | 81284 |
Vehicle Collision or Pedestrian Struck (with Injury) | 24650 |
Theft of Bicycle | 29687 |
Other Theft | 62247 |
Vehicle Collision or Pedestrian Struck (with Fatality) | 285 |
Homicide | 247 |
Neighbourhood & Time removed
# Through the glimpse() I saw OFFSET TO PROTECT PRIVACY, just wondering what this is and why.
offset <- crime_all_years %>%
filter(HUNDRED_BLOCK == "OFFSET TO PROTECT PRIVACY") %>%
group_by(TYPE)
# Offset to protect privacy:
# What kind of a offensives would that need that?
offset <- as.data.table(offset)
offset[, .N, by = .(TYPE)]
TYPE | N |
---|---|
Offence Against a Person | 60272 |
Homicide | 247 |
Information removed to protect privacy on some crimes.
Homicides make sense why you would want privacy, also for any unusual charges that would be easy to identify the persons involved through cross referencing news articles.
crime_all_years$DATE <- with(crime_all_years, ymd(sprintf('%04d%02d%02d', YEAR, MONTH, DAY)))
# Gather up all the time data for crimes and we can look at a time-series analysis, see if there
# is any increase/decrease in crime of this data set. Also check to see time of day the crimes
# are usually committed.
# crime_all_years$TIME <- with(crime_all_years, hms(sprintf('%02d%02d', HOUR, MINUTE)))
#
# Find where na is at
# hour_no_na <- crime_all_years %>%
# filter( !is.na(HOUR) )
#
# crime_all_years[!is.na(HOUR)]$TIME <- with(crime_all_years[!is.na(HOUR)], hms(sprintf('%02d%02d', HOUR, MINUTE)))
#
#
# Cannot create an hour-minute time with NA's, nor just the subset and let their time be NA.
# When checking time of crimes I will only use "HOUR" and exclude NA's
remove_list <- c("YEAR", "MONTH", "DAY", "MINUTE","HUNDRED_BLOCK")
# Removing some variables we won't need
crime_reduced <- crime_all_years %>%
select( -remove_list ) %>%
filter(!is.na(HOUR))
Column names:
colnames(crime_reduced)
[1] "TYPE" "HOUR" "NEIGHBOURHOOD" "X"
[5] "Y" "DATE"
Reordering column names
crime_tidy <- crime_reduced[, c(1,6,2,3,4,5)]
Let’s check how our new data looks;
glimpse(crime_tidy)
Observations: 543,467
Variables: 6
$ TYPE <chr> "Break and Enter Commercial", "Break and Enter R...
$ DATE <date> 2003-06-21, 2003-05-22, 2003-06-29, 2003-06-08,...
$ HOUR <int> 10, 11, 22, 5, 0, 17, 5, 13, 20, 19, 18, 5, 15, ...
$ NEIGHBOURHOOD <chr> "Strathcona", "West End", "Sunset", "West Point ...
$ X <dbl> 493000.2, 490144.6, 492603.6, 486285.1, 490144.6...
$ Y <dbl> 5457728, 5458627, 5452436, 5456660, 5458627, 545...
class(crime_tidy)
[1] "data.frame"
# TYPE and NEIGHBOURHOOD are characters, and crime_tidy is a dataframe, lets change those to factors so we can plot them and change the df to a data.table. DT will be faster, save some memory, and easy to manipulate.
TYPE and NEIGHBOURHOOD are characters, and crime_tidy is a dataframe, lets change those to factors so we can plot them and change the df to a data.table. DT will be faster, save some memory, and easy to manipulate.
crime_tidy <- as.data.table(crime_tidy)
crime_tidy[, TYPE := as.factor(TYPE)]
crime_tidy[, NEIGHBOURHOOD := as.factor(NEIGHBOURHOOD)]
glimpse(crime_tidy)
Observations: 543,467
Variables: 6
$ TYPE <fct> Break and Enter Commercial, Break and Enter Resi...
$ DATE <date> 2003-06-21, 2003-05-22, 2003-06-29, 2003-06-08,...
$ HOUR <int> 10, 11, 22, 5, 0, 17, 5, 13, 20, 19, 18, 5, 15, ...
$ NEIGHBOURHOOD <fct> Strathcona, West End, Sunset, West Point Grey, W...
$ X <dbl> 493000.2, 490144.6, 492603.6, 486285.1, 490144.6...
$ Y <dbl> 5457728, 5458627, 5452436, 5456660, 5458627, 545...
class(crime_tidy)
[1] "data.table" "data.frame"
ggplot(count_of_crimes, aes(x = reorder(TYPE,total_of_crime), y = sqrt(total_of_crime) )) +
geom_col(fill = "steelblue") +
coord_flip() +
scale_y_continuous(breaks=seq(0,550,50)) +
labs( y = "Square Root of Total Crime", x = "Type of Crime Committed", title = "Crime Distribution", caption = "Vancouver 2003-2019", subtitle = "By Type of Crime") +
geom_text(aes(label = round(sqrt(total_of_crime)), y = sqrt(total_of_crime) + 15)) + #adds num to bar
theme_simple()
Here the count of total crimes is skewed, Vehicle Collision or Pedestrian Struck (with Fatality) & Homicide have such low counts relatively that they barely show on the plot. So finding a good transformation to visualize this I thought of taking the square root. It definitely holds its shape, unlike the log-transform that makes everything look near equal with the exceptation of 2 previous crimes.
crime_tidy %>%
ggplot( aes(x = HOUR, fill = TYPE)) +
geom_bar(position = "stack") +
labs(title = "Crime Distribution", subtitle = "By hour of day", caption = "Vancouver 2003-2019") +
scale_fill_brewer() +
theme_simple()
Most crime happens at the same time of day, but we need to look at this separately to see if all crimes fall under this. Expecting that some crimes should happen more during the day when people would not be home, i.e. break and entering a residential home.
crime_all_years %>%
ggplot( aes(x = HOUR)) +
geom_bar(fill = "steelblue") +
labs(title = "Time of crime by year", subtitle = "By hour of day", caption = "Vancouver 2003-2019") +
facet_wrap(~YEAR) +
theme_simple()
Pretty steady trend throughout the years, the hour of when a crime is committed seems constant. The chance of a crime happening is more likely later in the day, little surprise is the peak at 18:00 (6 pm). We should look into if crime has been steady over the years or if there’s any significant increase/decrease.
# >>> HTML how to center graph ??
crime_tidy %>%
ggplot( aes(x = HOUR)) +
geom_bar(fill = "steelblue") +
labs(title = "What Time are Certain Crimes Committed?", subtitle = "By Type of Crime" , caption = "Vancouver 2003-2019") +
facet_wrap(~TYPE)
Theft from Vehicle & of Vehicle spike in the late evening and drops just after midnight. I want to look into the break and enter categories, from here it looks like what would be expected. Which is break and entering into a commercial building when work hours are usually finished, break and entering into residential when they are at usual working hours.
Taking a look at Break and Enter
crime_tidy[TYPE %like% "Break"] %>%
ggplot( aes(x = HOUR)) +
geom_bar(fill = "steelblue") +
labs(title = "Breaking and Entering", subtitle = "Count of Crimes at Hour", caption = "Vancouver 2003-2019") +
facet_wrap(~TYPE) +
theme_simple()
Although there are still counts of crime being committed in the off-hours, it is just as expected. These crimes committed are at usual times they would be empty, i.e. commercial buildings after working hours and residential buildings during working hours.
Since ‘other’ is not explicitly said we cannot assume these are homes. There is huge spike at noon for residential/other buildings, the multimodal shape suggests there are key times for breaking and entering these places. In the morning right as people are leaving for work, at noon when most people would have lunch, 18:00 when some people would go out for dinner (or non-residential), and at midnight when most people would be asleep.
crime_tidy[TYPE %like% "Other Theft"] %>%
ggplot( aes(x = HOUR)) +
geom_bar(fill = "steelblue") +
labs(title = "Other Theft", subtitle = "Count of Crimes at Hour", caption = "Vancouver 2003-2019") +
theme_simple()
Looks normally distributed about 15.5, interesting.
crime_tidy[, .N, by = .(NEIGHBOURHOOD)] %>%
arrange(desc(N))
NEIGHBOURHOOD | N |
---|---|
Central Business District | 131780 |
West End | 47142 |
Fairview | 35751 |
Mount Pleasant | 35074 |
Grandview-Woodland | 30692 |
Renfrew-Collingwood | 30309 |
Kitsilano | 29864 |
Kensington-Cedar Cottage | 27649 |
Strathcona | 24645 |
Hastings-Sunrise | 20592 |
Sunset | 19248 |
Marpole | 14731 |
Riley Park | 14206 |
Victoria-Fraserview | 12045 |
Killarney | 11550 |
Oakridge | 9023 |
Dunbar-Southlands | 8588 |
Kerrisdale | 8247 |
Arbutus Ridge | 6665 |
West Point Grey | 6557 |
Shaughnessy | 6182 |
South Cambie | 5830 |
Stanley Park | 4093 |
2437 | |
Musqueam | 567 |
Information removed to protect privacy on some crimes.
From the table above it clearly shows that Central Business District has had the most crime, more than double the amount West End has had. Also there is count of crimes that have missing values for NEIGHBOURHOOD, which is offset for privacy reasons, that may be a bit confusing. Thus, to compare all other neighbourhoods and not skew the plot I have removed those 2 factors.
# SQL in R: dyplr
crime_tidy %>%
filter(NEIGHBOURHOOD != "Central Business District", NEIGHBOURHOOD != "") %>%
count(NEIGHBOURHOOD) %>%
group_by(NEIGHBOURHOOD) %>%
rename(crimes_committed = n) %>%
ggplot( aes(x = reorder(NEIGHBOURHOOD,crimes_committed), y = sqrt(crimes_committed) ) ) +
geom_col(fill = "steelblue") +
labs(title = "Crime Count by Neighbourhood", subtitle = "Excluded: Central Business District & Missing Values due to Privacy" , caption = "Vancouver 2003-2019", x = "Neighbourhood", y = "Square Root of Crimes Committed") +
coord_flip() +
geom_text(aes(label = round(sqrt(crimes_committed)), y = sqrt(crimes_committed) + 5)) +
theme_simple()
rogers <- data.frame( X = 491853,
Y = 5458227)
crime_tidy %>%
ggplot( aes(x = X,y = Y, color = NEIGHBOURHOOD)) +
guides(color = guide_legend(override.aes = list(size=8))) + # resize the legend
geom_point(alpha = 0.8, data = crime_tidy[crime_tidy$NEIGHBOURHOOD != "",] ) +
coord_cartesian(xlim = c(484000,498000), ylim = c(5450000, 5462500) ) +
labs(title = "UTM Coordinates of Crimes Committed", subtitle = "By Neighbourhood", caption = "Vancouver 2003-2019", y ="North-South position", x = "East-West position ") +
theme_simple() +
theme(legend.text = element_text(colour="gray15", size = 10),
legend.title = element_text(colour="black", size = 12),
plot.title = element_text(lineheight=3, face="bold",
color="black", size=14),
plot.caption = element_text(color = "gray15")
) +
geom_point(data = rogers, aes(x= X, y = Y), color = "black", size = 50, shape = 1) +
geom_point(data = rogers, aes(x= X, y = Y), color = "black", size = 3) +
annotate(geom="text", x=491853, y=5458600, label="Rogers Arena",
color="black")
# Can I make this interactive? drag bar for year/month? Taking up too much memory,
# will crash for some people, not worth it
With all the coordinates from crimes it almost makes a full map of Vancouver. There is certain sections on that map void of points that have a distinct shape. The bottom right for example, Killarney, has 2 void shapes. They happen to be a golf course and a wooded park that has an off-leash area.
Rogers Arena hosted the 2011 Stanley Cup Finals, on June 15 Vancouver Canucks lost to Boston Bruins 4:0 in game 7, which was followed by a riot.
# ts_test <- crime_tidy %>%
# filter(NEIGHBOURHOOD != "",NEIGHBOURHOOD != "Central Business District") %>%
# group_by(NEIGHBOURHOOD) %>%
# count(DATE)
#
# ts_test %>%
# group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
# summarize(amount=sum(n)) %>%
# filter(amount > 200) %>%
# summarize(sum = sum(amount) )
high_crime <- list("Fairview" , "Grandview-Woodland" , "Hastings-Sunrise","Kensington-Cedar Cottage","Kitsilano", "Mount Pleasant" ,"Renfrew-Collingwood " ,"West End" )
low_crime <- list("Arbutus Ridge" ,"Musqueam", "Oakridge", "Shaughnessy", "South Cambie","Killarney", "Dunbar-Southlands", "Kerrisdale", "Stanley Park", "Victoria-Fraserview", "West Point Grey" )
Here I split up the the neighbourhoods to compare relative changes, need to change the scale to compare.
# High-Range Crime Neighbourhoods
crime_tidy %>%
filter(NEIGHBOURHOOD %in% high_crime) %>%
group_by(NEIGHBOURHOOD) %>%
count(DATE) %>%
group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
summarize(amount=sum(n)) %>%
#summarize(sum = sum(amount) )
ggplot( aes(x = month, y = amount) ) +
geom_line(color = "steelblue") +
facet_wrap(~NEIGHBOURHOOD) +
theme_bw(base_size = 22) +
labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "High-Range Crime Neighbourhoods" )
# Medium Range Crime Neighbourhoods
crime_tidy %>%
filter(!NEIGHBOURHOOD %in% high_crime, NEIGHBOURHOOD != "", NEIGHBOURHOOD != "Central Business District", !NEIGHBOURHOOD %in% low_crime) %>%
group_by(NEIGHBOURHOOD) %>%
count(DATE) %>%
group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
summarize(amount=sum(n)) %>%
ggplot( aes(x = month, y = amount) ) +
geom_line(color = "steelblue") +
facet_wrap(~NEIGHBOURHOOD) +
theme_bw(base_size = 22) +
labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Medium-Range Crime Neighbourhoods" )
# Low Range Crime Neighbourhoods
crime_tidy %>%
filter(NEIGHBOURHOOD %in% low_crime) %>%
group_by(NEIGHBOURHOOD) %>%
count(DATE) %>%
group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
summarize(amount=sum(n)) %>%
ggplot( aes(x = month, y = amount) ) +
geom_line(color = "steelblue") +
facet_wrap(~NEIGHBOURHOOD) +
theme_bw(base_size = 22) +
labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Low-Range Crime Neighbourhoods" )
# Select NEIGHBOURHOOD where amount > 200
# make a list then can split the neighbourhood and make multiple graphs: central will get its own
# and then can compare the ones with high vs high counts and low vs low counts
crime_tidy %>%
filter(NEIGHBOURHOOD == "Central Business District") %>%
group_by(NEIGHBOURHOOD) %>%
count(DATE) %>%
group_by(NEIGHBOURHOOD, month=floor_date(DATE, "month")) %>%
summarize(amount=sum(n)) %>%
ggplot( aes(x = month, y = amount) ) +
geom_line(color = "steelblue") +
theme_gray() +
labs(caption = "Vancouver 2003-2019", title = "Count of Crime Over Time: By Month", x = "Date", y = "Count", subtitle = "Central Business District" )
# 2011-06-01 had a huge spike in crime, 2011 Stanley Cup Finals, West End had a small spike.
# Since the 2011 Stanley Cup Riot in Vancouver crime in the surrounding areas have steadily increased.
# The Rogers Arena that hosts the Vancouver Canucks is located between Central Business District and
# West End. Both of Which have seen an increase in crime since the riot. This could be the long
# lasting effects from the riot. Though damages may have recovered since then maybe the attitude
# of not caring still carries through.
# 2019-06-06 end date
#how have neighbourhoods changed over time
Counts are for the whole month
In June 2011 Central Business District had a huge spike in crime, 2011-06-15 was the Stanley Cup Finals at Rogers Arena, and West End also a small spike. Since the 2011 Stanley Cup Riot in Vancouver crime in the surrounding areas have steadily increased. The Rogers Arena that hosts the Vancouver Canucks is located in Central Business District near West End. Both of Which have seen an increase in crime since the riot. This could be the long lasting effects from the riot. Though damages may have recovered since then maybe the attitude of not caring still carries through. There may also be other factors contributing to the increase in crime in these areas.
offset[TYPE == "Homicide" , .N, by = .(TYPE, YEAR)] %>%
ggplot( aes(x = YEAR, y = N)) +
geom_line( size = 1, color = "steelblue") +
theme_gray() +
labs(caption = "Vancouver 2003-2019", title = "Homicides in Vancouver", x = "Date", y = "Count", subtitle = "Count of Homicides per Year" )
offset[TYPE == "Homicide" , .N, by = .(TYPE, YEAR)][, .(YEAR, N)]
YEAR | N |
---|---|
2003 | 18 |
2004 | 22 |
2005 | 22 |
2006 | 17 |
2007 | 19 |
2008 | 18 |
2009 | 18 |
2010 | 10 |
2011 | 15 |
2012 | 8 |
2013 | 7 |
2014 | 9 |
2015 | 15 |
2016 | 11 |
2017 | 17 |
2018 | 15 |
2019 | 6 |
As of 2019-06-7, the homicide count for 2019 is lower than any previous year, hopefully it stays that way.