This will serve as both the first blog post on my site, and a way for me to explore how I feel about personal websites (and personal brands) for academics in general. For this, and all other posts, these views are of course entirely my own opinion and are not to be attributed to anyone I work for or with unless it makes them look better in some way. With that out of the way, let’s begin.
Gone are the days when you could march out of your PhD into a cushy assistant professorship, confident in your inevitable rise to the holy grail of tenured full professor. Now that there are way, way more new PhDs than there are faculty positions, we all must either head for the fabled greener grass of industry or fight over the scraps available to us in academia. For pretty much all fields, this means publish like mad. But in statistics, and specifically machine learning, I’ve noticed an additional requirement that has cropped up.
First, without ruffling too many feathers, I think it’s pretty safe to say that most of the basic ideas of machine learning were developed in statistics long before the computing power existed to implement them. On the other hand, a lot of the cool, practical breakthroughs today are by computer scientists. This has led to the field being almost entirely adopted by computer science, and so statisticians who want to work in this field have to market themselves the same way software developers do, since that’s who hiring companies are used to dealing with. Consequently, we find ourselves in the purgatory of personal projects.
It’s not that I don’t recognize the benefit of messing around with these tools in order to learn, but as someone who is interested in statistical theory, I wish there were more options for quick projects besides analyzing a dataset. As you can see from my projects page, I do have an example of such a project, but I was required to do it for a class. On my own time, I am much more interested in reading and developing theory, which is quite a bit slower than a quick data analysis. All this is to say that I think companies need to realize it’s possible to be a self-motivated learner without having an app startup, and they could improve their hiring diversity by not placing so much emphasis on personal projects. However, I’ve yet to come up with a measure of “mathematical curiosity” that is as easy to quantify as the number of projects is for “coding curiosity”.
For my first undergraduate student research award, I built a simulation of patient flow through the internal medicine ward at London’s university hospital. Here is the abstract from our submission to INFOR, which has just been resubmitted following review and a request for minor changes.
Effective communication between nurses and physician teams on the internal medicine unit is crucial for high quality, safe, and efficient patient care. In our hospital of interest, a large academic health sciences centre, the physical layout of the unit, current admission process, and the presence of three separate physician teams contribute to uneven workload and communication barriers. We aimed to address this by physically co-locating each physician team’s patients so as to facilitate physician-nurse collaboration, and more evenly distribute workload across all three teams. Based upon one year of real-world data, we developed a simulation model of inpatient flow through the internal medicine unit and determined the impact of two proposed changes: co-locating each team’s patients, and new admission rules for how patients are assigned to those teams. Under the new arrangement, each physician team would interact with roughly half the number of nurses, and nurses in turn would have fewer individual team members with whom to communicate, thereby improving effective communication and increasing time for direct patient care for both physicians and nurses.
I presented this work at ORAHS in Oslo, Norway and CanQueue in Edmonton, Canada this past summer. I hope to add more detail on our methodology and results to this page once the paper has been accepted.
This work is adapted from my final project for STA2101 (Applied Statistics I), completed in Fall 2018. All the code and data are available on my GitHub.
For this project I chose the option of analyzing some interesting data with Python. Using three datasets available from Kaggle, I investigated crime in Chicago from 2012 to 2016. The main goal of the project is to explore data analysis in Python, specifically spatial plotting and multinomial logistic regression, to determine relationships between crimes and the explanatory variables. Some questions I address are where crime occurs, what type of crime occurs, how crime near a school relates to how safe the school is perceived to be, and how police station locations relate to crime hotspots.
The primary dataset used is crimes, which contains all reported incidents of crime in Chicago from 2012 to 2016. To supplement this, there is schools, which contains the progress report card of each public Chicago high school from the 2013-14 school year, and police, which contains location info for all Chicago police stations. These datasets have many fields, so I will only mention the critical ones that are used in a model.
First, I import the libraries I will require.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import statsmodels.api as sm
from scipy.stats import chi2
from scipy.stats import norm
from scipy.special import logsumexp
from colour import Color
from random import choice
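Before any cleaning, the three Kaggle files are read in with pandas. The file names below are assumptions based on the Kaggle downloads rather than the exact names used in my repository.
crimes = pd.read_csv('Chicago_Crimes_2012_to_2017.csv') # hypothetical file name
schools = pd.read_csv('Chicago_Public_Schools_Progress_Report_2013-2014.csv') # hypothetical file name
police = pd.read_csv('Police_Stations.csv') # hypothetical file name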
Crime Severity
The initial crimes file contains 1,456,714 rows, where each row corresponds to a crime being reported, but I removed 37,174 of these for missing location data. I also excluded crimes from 2017 since there were only 30, so it was logical to just consider years which were fully reported. The field crimes[Primary Type] details the main reason for a crime being reported, and I cleaned this up by removing some white space and grouping types such as “NARCOTICS” and “OTHER NARCOTIC VIOLATION”. I split crimes[Date] into three fields for day, month, and year, and then dropped some redundant fields. A field that may have been useful is crimes[Description], which contains a sentence describing the crime in more depth, but I decided there was too much variability in them to be practical for this analysis.
## Crimes Data Exploration
# We have 1 456 714 rows total
crimes = crimes[pd.notnull(crimes['X Coordinate'])] # 37 083 have NaN for location
crimes = crimes[crimes['X Coordinate'] > 0] # 77 have 0 for location
crimes = crimes[pd.notnull(crimes.Ward)] # 14 have NaN for ward
crimes = crimes[pd.notnull(crimes.District)] # 1 has NaN for district
crimes = crimes[pd.notnull(crimes['Community Area'])] # 22 have NaN for community area
crimes = crimes[crimes.Year < 2017] # Only 30 crimes in 2017
# Clean up primary types
crimes = crimes.replace({'Primary Type':['NON - CRIMINAL', 'NON-CRIMINAL (SUBJECT SPECIFIED)']}, {'Primary Type':'NON-CRIMINAL'})
crimes = crimes.replace({'Primary Type':'OTHER NARCOTIC VIOLATION'}, {'Primary Type':'NARCOTICS'})
crimes = crimes.replace({'Primary Type':'CONCEALED CARRY LICENSE VIOLATION'}, {'Primary Type':'WEAPONS VIOLATION'})
# Time of event fields
crimes['Month'] = [int(date[0:2]) for date in crimes.Date]
crimes['Day'] = [int(date[3:5]) for date in crimes.Date]
crimes['Year'] = [int(date[6:10]) for date in crimes.Date]
# Convert boolean to integer
crimes.Arrest = crimes.Arrest.astype(int)
crimes.Domestic = crimes.Domestic.astype(int)
# Lots of columns we don't care about
crimes.drop(['Unnamed: 0', 'Date', 'ID', 'Case Number', 'IUCR', 'Description', 'Location', 'Location Description', 'Updated On', 'Block', 'FBI Code'], axis=1, inplace=True)
For this part of the analysis, I grouped the crime data by the field crimes[Primary Type] using my own subjective scale of severity. Some of these were easy to rank (homicide most severe, non-criminal least severe), while others reveal my personal biases (some may rank prostitution offenses as more severe than I do). Nonetheless, I believe it worthwhile to determine how the covariates impact the severity of the crime committed.
## Crime Severity
# 1 - HOMICIDE, CRIM SEXUAL ASSAULT, KIDNAPPING, HUMAN TRAFFICKING, OFFENSE INVOLVING CHILDREN
# 2 - BURGLARY, THEFT, MOTOR VEHICLE THEFT, ROBBERY, ASSAULT, ARSON, SEX OFFENSE, BATTERY
# 3 - NARCOTICS, STALKING, WEAPONS VIOLATION, CRIMINAL DAMAGE, CRIMINAL TRESPASS
# 4 - GAMBLING, PROSTITUTION, OBSCENITY, LIQUOR LAW VIOLATION, PUBLIC PEACE VIOLATION, PUBLIC INDECENCY
# 5 - INTIMIDATION, INTERFERENCE WITH PUBLIC OFFICER, DECEPTIVE PRACTICE
# 6 - NON-CRIMINAL, OTHER OFFENSE
crime_severity = {'HOMICIDE':1, 'CRIM SEXUAL ASSAULT':1, 'KIDNAPPING':1, 'HUMAN TRAFFICKING':1,
'OFFENSE INVOLVING CHILDREN':1, 'BURGLARY':2, 'THEFT':2, 'MOTOR VEHICLE THEFT':2, 'ROBBERY':2,
'ASSAULT':2, 'ARSON':2, 'SEX OFFENSE':2, 'BATTERY':2, 'NARCOTICS':3, 'STALKING':3, 'WEAPONS VIOLATION':3,
'CRIMINAL DAMAGE':3, 'CRIMINAL TRESPASS':3, 'GAMBLING':4, 'PROSTITUTION':4, 'OBSCENITY':4,
'LIQUOR LAW VIOLATION':4, 'PUBLIC PEACE VIOLATION':4, 'PUBLIC INDECENCY':4, 'INTIMIDATION':5,
'INTERFERENCE WITH PUBLIC OFFICER':5, 'DECEPTIVE PRACTICE':5, 'NON-CRIMINAL':6, 'OTHER OFFENSE':6}
crimes['Crime Severity'] = [crime_severity[crime] for crime in crimes['Primary Type']]
print(crimes.groupby('Crime Severity').agg('count').Latitude)
crime_severity_year = crimes.groupby(['Year', 'Crime Severity']).agg('count').Latitude
crime_year = crimes.groupby('Year').agg('count').Latitude
print(crime_severity_year.div(crime_year, level='Year')*100) # percentage of crimes by severity for each year
This produces a table of the percentage of crimes at each severity level for each year. As expected, very severe crimes are rare (yay!), and the least severe crimes are likely under-reported.
I then fit a multinomial logistic model to determine how severe a crime will be given covariates such as whether an arrest occurred, whether the crime was domestic-related, where the crime occurred, and the time of year it occurred. With “most severe” as the reference category, the results show that every covariate is significant for at least one of the possible outcomes.
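The exact design matrix isn’t reproduced here, but a minimal sketch of fitting such a model with statsmodels, assuming Arrest, Domestic, Latitude, Longitude, and Month as the covariates, looks like the following. Severity 1 (the most severe) is the lowest category, so it serves as MNLogit’s reference, matching the “most severe” reference mentioned above.
# Sketch only -- the covariate list is an assumption, not the exact model specification
severity_X = sm.add_constant(crimes[['Arrest', 'Domestic', 'Latitude', 'Longitude', 'Month']])
severity_model = sm.MNLogit(crimes['Crime Severity'], severity_X).fit(disp=0)
print(severity_model.summary()) # coefficients for severities 2-6 relative to severity 1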
Interestingly, we see evidence that an arrest increases the probability of a lower severity, which I would guess actually reflects a reverse relationship: low-severity crimes may only be reported when there is an easy arrest to be made. There is also a relationship between crime location and severity, but I have yet to explore it further.
Most surprising is that the date of the crime has an effect on severity. The sign of the coefficients suggests that a crime committed later in the year is more likely to be severe. Of course, one might expect this relationship to be non-linear, with a peak in severity occurring in the summer (see Toronto’s “Summer of Murder” in 2018), and this cannot be captured by the current form of the model.
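One possible extension, not part of the model I fit, would be to encode the month with harmonic terms so that severity can peak mid-year rather than trend monotonically:
# Possible extension (not in the fitted model): encode month cyclically
crimes['Month Sin'] = np.sin(2 * np.pi * crimes.Month / 12)
crimes['Month Cos'] = np.cos(2 * np.pi * crimes.Month / 12)
# Refitting with these two columns in place of the linear Month term allows a
# seasonal peak (e.g. in summer) instead of a strictly increasing effect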
School Safety
Now, consider the schools dataset. Each row of this dataset corresponds to one of the 188 public high schools in Chicago. There are multiple fields that give a qualitative review of the school based on student and teacher surveys, as well as quantitative fields for test scores, attendance levels, and dropout rates. Most of the quantitative fields had very little data and so were not considered for this analysis. The fields schools[Latitude] and schools[Longitude] are labelled the other way around in the original file, so I swapped those back.
The only survey question with enough data from a significant proportion of schools is schools[Safe], which rates how safe the school is on a scale of VERY WEAK, WEAK, NEUTRAL, STRONG, or VERY STRONG. After removing the schools with minimal data in the data cleaning process, I was left with 136. However, 18 of these still have NOT ENOUGH DATA in schools[Safe]. The question I am interested in is whether the number of crimes committed “close” to the school impacts how students and teachers rate the safety of the school. Since I have already removed a significant number of schools for having many missing fields, rather than drop these 18 as well I will fill in their Safe ratings using multiple imputation.
## Compare crime to school quality
# We have 188 schools total
schools_cols = ['School ID', 'Student Response Rate', 'Teacher Response Rate', 'Involved Family',
'Supportive Environment', 'Ambitious Instruction', 'Effective Leaders', 'Collaborative Teachers',
'Safe', 'School Community', 'Parent-Teacher Partnership', 'Quality of Facilities',
'Healthy Schools Certification', 'Creative Schools Certification', 'EPAS Growth Percentile',
'EPAS Attainment Percentile', 'Grade ACT Attainment Percentile Grade 11', 'ACT Spring 2013 Average Grade 11',
'Student Attendance Percentage 2013', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude']
schools = schools[schools_cols] # Only want to keep the columns that have a reasonable amount of data
schools = schools[list(~(np.array(np.isnan(schools['Student Attendance Percentage 2013']) & np.isnan(schools['One-Year DropOut Rate Percentage 2013']) & np.isnan(schools['ACT Spring 2013 Average Grade 11']))))] # remove 19 that have nearly no data
schools = schools[~np.isnan(schools['ACT Spring 2013 Average Grade 11']) & ~np.isnan(schools['EPAS Growth Percentile']) & ~np.isnan(schools['Grade ACT Attainment Percentile Grade 11']) & np.array(schools['One-Year DropOut Rate Percentage 2013'] > 0)] # drop 33 more schools with no ACT data
# Have to swap Longitude and Latitude
schools.rename(columns = {'Latitude': 'temp'}, inplace=True)
schools.rename(columns = {'Longitude':'Latitude', 'temp':'Longitude'}, inplace=True)
# View breakdown of categorical data
print(schools.groupby('Collaborative Teachers').agg('count')['School ID'])
print(schools.groupby('Safe').agg('count')['School ID']) # only useful one that has a reasonable percentage of missing values
print(schools.groupby('School Community').agg('count')['School ID'])
print(schools.groupby('Parent-Teacher Partnership').agg('count')['School ID'])
print(schools.groupby('Quality of Facilities').agg('count')['School ID'])
crimes['Latitude Radians'] = np.radians(crimes.Latitude)
crimes['Longitude Radians'] = np.radians(crimes.Longitude)
schools['Latitude Radians'] = np.radians(schools.Latitude)
schools['Longitude Radians'] = np.radians(schools.Longitude)
In order to determine the proximity of a crime to a school, I use an approximation of the great-circle distance between two points given by latitude and longitude (the haversine formula), as described at https://andrew.hedges.name/experiments/haversine/.
Earth_Radius = 6373 # approximate in km
# algorithm from https://andrew.hedges.name/experiments/haversine/ under MIT license
# returns array of great-circle distances between the school and each crime in crimes
def GeoDistSchool(school_lat, school_lon):
lat_diff = crimes['Latitude Radians'] - school_lat
lon_diff = crimes['Longitude Radians'] - school_lon
a = np.square(np.sin(lat_diff / 2)) + np.cos(school_lat) * np.cos(crimes['Latitude Radians']) * np.square(np.sin(lon_diff / 2))
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
return Earth_Radius * c
for school in schools.index:
[school_lat, school_lon] = schools.loc[school, ['Latitude Radians', 'Longitude Radians']]
distances = GeoDistSchool(school_lat, school_lon)
schools.loc[school, 'Crimes Committed 0.25km'] = np.sum(distances <= 0.25)
schools.loc[school, 'Crimes Committed 0.5km'] = np.sum(distances <= 0.5)
schools.loc[school, 'Crimes Committed 0.75km'] = np.sum(distances <= 0.75)
schools.loc[school, 'Crimes Committed 1km'] = np.sum(distances <= 1)
schools.loc[school, 'Crimes Committed 1.5km'] = np.sum(distances <= 1.5)
schools.loc[school, 'Crimes Committed 3km'] = np.sum(distances <= 3)
schools.loc[school, 'Crimes Committed 5km'] = np.sum(distances <= 5)
Then, for each school I counted how many crimes were committed from 2012 to 2016 within a 0.5 km and a 1 km radius of the school, and used these counts along with the other covariates to fit a multinomial logistic regression model with schools[Safe] as the response variable. Many of the covariates turn out not to be significant, including the number of crimes committed within a given radius.
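The fitting code for these two models isn’t shown here, but based on the column lists used in the imputation step further down, a sketch of how the full model (schools_exists_model) and the reduced model (schools_exists_model_reduced, keeping only the covariates that appeared significant) could be fit is:
# Sketch of the full and reduced models used in the likelihood ratio test below;
# the exact column lists are assumptions mirroring schools_imputed_X and schools_impute_X
schools_exists = schools[schools.Safe != 'NOT ENOUGH DATA'] # the 118 schools with a Safe rating
full_cols = ['Student Response Rate', 'Teacher Response Rate', 'EPAS Growth Percentile', 'EPAS Attainment Percentile', 'Grade ACT Attainment Percentile Grade 11', 'ACT Spring 2013 Average Grade 11', 'Student Attendance Percentage 2013', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude', 'Crimes Committed 0.5km', 'Crimes Committed 1km']
reduced_cols = ['Student Response Rate', 'EPAS Attainment Percentile', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude']
schools_exists_model = sm.MNLogit(schools_exists.Safe, schools_exists[full_cols]).fit(disp=0)
schools_exists_model_reduced = sm.MNLogit(schools_exists.Safe, schools_exists[reduced_cols]).fit(disp=0)
print(schools_exists_model.summary())
Dropping 7 covariates across the 4 non-reference outcomes is consistent with the 28 degrees of freedom in the test below.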
I performed a likelihood ratio test comparing the reduced model to the full model, obtaining a test statistic of G² = 40.21, which on 28 degrees of freedom corresponds to a p-value of 0.0633. Thus I do not have evidence that the full model fits significantly better than the reduced model, so I will use the reduced model.
# Likelihood ratio test is 2(LL_full - LL_restricted)
G2 = 2 * (schools_exists_model.llf - schools_exists_model_reduced.llf)
df = schools_exists_model_reduced.df_resid - schools_exists_model.df_resid
pval = 1 - chi2.cdf(G2,df)
With this, I impute the value of schools[Safe] for the 18 schools which didn’t have enough data by simulating a multinomial random variable with the respective predicted probabilities. In order to account for the additional error in the imputation, I repeat this process 1000 times, each time recalculating the full model on all 136 datapoints. From each iteration, I obtain confidence intervals for each coefficient, and then I use the bootstrap average lower and upper bound to get my final confidence intervals for the coefficients.
# Impute values of schools.Safe using reduced model
# Rows whose Safe value needs to be imputed
schools_impute = schools[schools.Safe=='NOT ENOUGH DATA']
# Columns used to impute the data
schools_impute_X = schools_impute[['Student Response Rate', 'EPAS Attainment Percentile', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude']]
# Columns used to fit model on imputed data
schools_imputed_X = schools[['Student Response Rate', 'Teacher Response Rate', 'EPAS Growth Percentile', 'EPAS Attainment Percentile', 'Grade ACT Attainment Percentile Grade 11', 'ACT Spring 2013 Average Grade 11', 'Student Attendance Percentage 2013', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude', 'Crimes Committed 0.5km', 'Crimes Committed 1km']]
schools_imputed_conf_ints = []
# order of params output by model
# 0 - NEUTRAL
# 1 - STRONG
# 2 - VERY STRONG
# 3 - VERY WEAK
# 4 - WEAK
params = {0:'NEUTRAL', 1:'STRONG', 2:'VERY STRONG', 3:'VERY WEAK', 4:'WEAK'}
# Extracts parameters for distribution of coefficient estimates
normal_95 = norm.ppf(0.975) # more decimals than 1.96
schools_exists_model_reduced_coefs = np.transpose(np.array(schools_exists_model_reduced.params)) # mean values for betas
schools_exists_model_reduced_se = np.zeros((4,5)) # standard deviations for betas
for i in range(4):
for j in range(5):
schools_exists_model_reduced_se[i,j] = (schools_exists_model_reduced._results.conf_int()[i][j][1] - schools_exists_model_reduced_coefs[i][j]) / normal_95
np.random.seed(123456) # set seed for reproducibility
# Multiple imputation ran 1000 times
num_iters = 1000
for iter in range(num_iters):
# Matrix of sampled coefficient estimates
# Each row corresponds to an outcome (response)
# Each column corresponds to a covariate
schools_exists_model_reduced_coefs_sample = np.zeros((4,5))
for i in range(4):
for j in range(5):
# Sample each beta from the Normal distribution
schools_exists_model_reduced_coefs_sample[i,j] = np.random.normal(schools_exists_model_reduced_coefs[i,j],schools_exists_model_reduced_se[i,j],1)
# Compute L vector for each row to impute
schools_imputed_temp = schools.copy()
for school in schools_impute.index:
L = np.dot(schools_exists_model_reduced_coefs_sample, schools_impute_X.loc[school]) # linear predictors for each non-reference outcome
L_one = np.append(np.array([0]), L) # L for the reference value is 0, since need exp(L)=1
log_denom = logsumexp(L_one) # computationally better than computing exponentials first and dividing
probs = np.array([np.exp(L_one[i] - log_denom) for i in range(5)])
schools_imputed_temp.loc[school, 'Safe'] = params[np.where(np.random.multinomial(1,probs) == 1)[0][0]]
schools_imputed_temp_y = schools_imputed_temp.Safe
schools_imputed_temp_model = sm.MNLogit(schools_imputed_temp_y, schools_imputed_X).fit(disp=0)
schools_imputed_conf_ints.append(schools_imputed_temp_model._results.conf_int())
# Now access the confidence intervals for each param and find average lower and upper bounds
schools_imputed_bootstrap_conf_ints = np.zeros((4,12,2))
for response in range(4):
for coef in range(12):
conf_lower = np.array([val[response][coef][0] for val in schools_imputed_conf_ints])
conf_upper = np.array([val[response][coef][1] for val in schools_imputed_conf_ints])
schools_imputed_bootstrap_conf_ints[response][coef][0] = np.mean(conf_lower)
schools_imputed_bootstrap_conf_ints[response][coef][1] = np.mean(conf_upper)
The main result from this process is that the bootstrapped confidence intervals for the coefficients on the number of crimes committed within 0.5 km and 1 km still contain zero after imputation. Thus, I do not have evidence that these counts are significant in the model, which suggests they do not impact how students and teachers rate the safety of their school in a survey. statsmodels’ MNLogit mysteriously did not like it when I added columns for additional radius sizes, but I ran the model a few times swapping them in for one another and got the same results.
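I haven’t pinned down the cause, but one quick check, offered here only as a suggestion rather than something from the original analysis, is the correlation between the crime counts at the different radii, which are nested and therefore likely to be nearly collinear:
# Possible diagnostic (not from the original analysis): counts at nested radii
# are likely highly correlated, which can destabilize the multinomial fit
radius_cols = ['Crimes Committed 0.25km', 'Crimes Committed 0.5km', 'Crimes Committed 0.75km', 'Crimes Committed 1km', 'Crimes Committed 1.5km', 'Crimes Committed 3km', 'Crimes Committed 5km']
print(schools[radius_cols].corr())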
Finally, consider the police dataset. This is the smallest dataset, with only 23 rows – one for each police station in Chicago. The only fields I care about from this dataset are police[Latitude] and police[Longitude]. The null hypothesis I wish to test here is that crime location is not influenced by where police stations are. This is a little nebulous, so first I compare these visually.
ward_crime_percentages = crimes.groupby('Ward').agg('count').Latitude / len(crimes) * 100
colours = [str(col) for col in list(Color("lightsalmon").range_to(Color("darkred"),len(ward_crime_percentages)))]
sorted_crime_percentages = np.sort(ward_crime_percentages)
ward_colours = dict()
for index in range(len(colours)):
ward_colours[ward_crime_percentages.index[np.where(ward_crime_percentages == sorted_crime_percentages[index])[0][0]]] = colours[index]
crimes['Ward Colour'] = [ward_colours[ward] for ward in crimes.Ward]
plt.figure('ward')
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
plt.title('Crimes Heat Map by Ward + Police Stations', fontsize=12)
legend_handles = []
for index in range(len(colours)):
colour = colours[index]
plt.scatter(crimes[crimes['Ward Colour']==colour]['Latitude'], crimes[crimes['Ward Colour']==colour]['Longitude'], c=colour, s=2)
legend_handles = [mpatches.Patch(color=colours[i], label=str(round(sorted_crime_percentages[i],3)) + '%') for i in [0,9,19,29,39,49]]
plt.legend(handles = legend_handles)
plt.scatter(police.LATITUDE, police.LONGITUDE, c='navy', s=24)
plt.show()
It’s clear that more crimes are committed in the very center of the city as well as to the West. The police stations (blue dots) also seem to be slightly more concentrated in the center of the city than at the edges.
Explicitly, I calculate the number of crimes committed within 1 km (inner radius) of each police station, and the number of crimes committed between 1 km and √2 km (outer radius) from each police station. The √2 is chosen so that this strip has the same area as the inner 1 km circle, namely π((√2)² - 1²) = π km². I repeat this over 100 bootstrap resamples of the crime location data, each time computing the mean paired difference of the two counts.
## Crimes in radius around police stations
police['Latitude Radians'] = np.radians(police.LATITUDE)
police['Longitude Radians'] = np.radians(police.LONGITUDE)
# Bootstrap locations
crimes_locations = list(zip(crimes['Latitude Radians'], crimes['Longitude Radians'])) # collection of all crime locations
num_crimes = len(crimes_locations)
num_bootstraps = 100
paired_mean_differences = np.zeros(num_bootstraps)
sample_size = len(crimes_locations) # experimented with not sampling the same number of crimes
sample_proportion = sample_size / len(crimes_locations)
np.random.seed(654321)
# Each bootstrap samples with replacement from the crime location data
for bootstrap in range(num_bootstraps):
sample_locations = [choice(crimes_locations) for sample in range(sample_size)]
[sample_lat, sample_lon] = [np.array(tup) for tup in zip(*sample_locations)]
# algorithm from https://andrew.hedges.name/experiments/haversine/ under MIT license
# returns array of great-circle distances between a police station and each crime in the sample
# redefined for each bootstrap since the sampled crime location data is an implicit argument
def GeoDistPolice(police_lat, police_lon):
lat_diff = sample_lat - police_lat
lon_diff = sample_lon - police_lon
a = np.square(np.sin(lat_diff / 2)) + np.cos(police_lat) * np.cos(sample_lat) * np.square(np.sin(lon_diff / 2))
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
return Earth_Radius * c
inner_radius = np.zeros(len(police.index))
outer_radius = np.zeros(len(police.index))
# Count crimes in both radii for each station
for station in police.index:
[police_lat, police_lon] = police.loc[station, ['Latitude Radians', 'Longitude Radians']]
distances = GeoDistPolice(police_lat, police_lon)
inner_radius[station] = np.sum(distances <= 1) / sample_proportion # to get them on the same scale as if all crimes were used
outer_radius[station] = np.sum((distances > 1) & (distances <= np.sqrt(2))) / sample_proportion
paired_mean_differences[bootstrap] = np.mean(inner_radius - outer_radius)
# Bootstrap confidence interval
print(np.percentile(paired_mean_differences, 2.5)) # np.percentile expects percentages, not fractions
print(np.percentile(paired_mean_differences, 97.5))
The value computed is the inner-radius count minus the outer-radius count, which gives a bootstrap 95% confidence interval of (1586, 1631). Since this interval is well above zero, I reject the null hypothesis and conclude that more crimes are reported close to a police station than just outside that radius. This is not unreasonable, since presumably there are more police present in the inner radius to bear witness to a crime than in the outer radius.
Conclusions
Ultimately, I answered three questions from the available data, all at the 5% significance level. The first is which covariates influence the severity of crimes committed; I determined these to be whether the crime led to an arrest, where the crime occurred, and the time of year when it occurred. Next, I counted the number of crimes committed in different radii around each public high school. For the schools with missing data, I used multiple imputation to fill in the survey rating of school safety. The imputed data led me to conclude that there is no evidence that the number of crimes committed near a school impacts how students and teachers rate the school’s safety. I also computed how many crimes were committed in a radius directly around each police station and in the doughnut-shaped strip just outside this inner radius, and concluded that significantly more crimes are committed close to a police station than just outside the inner radius.
To me, the most interesting results of this analysis involved the location data. Future analysis might involve more visualizations, and perhaps fitting something like a Poisson point process to measure the dispersion of crimes. In addition to the statistical analysis, I learned how to compute these models in Python, which, with enough experimentation, proved just as versatile as R.