*Blair Bilodeau*

### Introduction

This work is adapted from my final project for STA2101 (Applied Statistics I) completed in Fall 2018. All the code and data are available on my GitHub.

For this project I chose the option of analyzing some interesting data with Python. Using three datasets available from Kaggle, I investigated crime in Chicago from 2012 to 2016. The main goal is to explore data analysis in Python, specifically spatial plotting and multinomial logistic regression, to determine relationships between crimes and explanatory variables. Some questions I address are where crime occurs, what type of crime occurs, how crime near a school relates to its perceived safety, and how police station locations are related to crime hotspots.

The primary dataset used is crimes, which contains all reported incidents of crime in Chicago from 2012 to 2016. To supplement this, there is schools, which contains the progress report card of each public Chicago high school from the 2013-14 school year, and police, which contains location info for all Chicago police stations. These datasets have many fields, so I will only mention the critical ones that are used in a model.

First, I import the libraries I will require.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import statsmodels.api as sm
from scipy.stats import chi2
from scipy.stats import norm
from scipy.special import logsumexp
from colour import Color
from random import choice
```

### Crime Severity

The initial crimes file contains 1,456,714 rows, where each row corresponds to a reported crime, but I removed 37,197 of these for missing or invalid location information (coordinates, ward, district, or community area). I also excluded crimes from 2017, since there were only 30, so it was logical to only consider years which were fully reported. The field crimes[Primary Type] details the main reason for a crime being reported, and I cleaned this up by removing some white space and grouping types such as “NARCOTICS” and “OTHER NARCOTIC VIOLATION”. I split crimes[Date] into three fields for day, month, and year, and then dropped some redundant fields. A field that may have been useful is crimes[Description], which contains a sentence describing the crime in more depth, but I decided there was too much variability in it to be practical for this analysis.

```
## Crimes Data Exploration
# We have 1 456 714 rows total
crimes = crimes[pd.notnull(crimes['X Coordinate'])] # 37 083 have NaN for location
crimes = crimes[crimes['X Coordinate'] > 0] # 77 have 0 for location
crimes = crimes[pd.notnull(crimes.Ward)] # 14 have NaN for ward
crimes = crimes[pd.notnull(crimes.District)] # 1 has NaN for district
crimes = crimes[pd.notnull(crimes['Community Area'])] # 22 have NaN for community area
crimes = crimes[crimes.Year < 2017] # Only 30 crimes in 2017
# Clean up primary types
crimes = crimes.replace({'Primary Type':['NON - CRIMINAL', 'NON-CRIMINAL (SUBJECT SPECIFIED)']}, {'Primary Type':'NON-CRIMINAL'})
crimes = crimes.replace({'Primary Type':'OTHER NARCOTIC VIOLATION'}, {'Primary Type':'NARCOTICS'})
crimes = crimes.replace({'Primary Type':'CONCEALED CARRY LICENSE VIOLATION'}, {'Primary Type':'WEAPONS VIOLATION'})
# Time of event fields (Date strings start with MM/DD/YYYY)
crimes['Month'] = [int(date[0:2]) for date in crimes.Date]
crimes['Day'] = [int(date[3:5]) for date in crimes.Date]
crimes['Year'] = [int(date[6:10]) for date in crimes.Date]
# Convert boolean to integer
crimes.Arrest = crimes.Arrest.astype(int)
crimes.Domestic = crimes.Domestic.astype(int)
# Lots of columns we don't care about
crimes.drop(['Unnamed: 0', 'Date', 'ID', 'Case Number', 'IUCR', 'Description', 'Location', 'Location Description', 'Updated On', 'Block', 'FBI Code'], axis=1, inplace=True)
```

For this part of the analysis, I grouped the crime data by the field crimes[Primary Type] using my own subjective scale of severity. Some of these were easy to rank (homicide most severe, non-criminal least severe), while others reveal my personal biases (some may rank prostitution offenses as more severe than I do). Nonetheless, I believe it worthwhile to determine how the covariates impact the severity of the crime committed.

```
## Crime Severity
# 1 - HOMICIDE, CRIM SEXUAL ASSAULT, KIDNAPPING, HUMAN TRAFFICKING, OFFENSE INVOLVING CHILDREN
# 2 - BURGLARY, THEFT, MOTOR VEHICLE THEFT, ROBBERY, ASSAULT, ARSON, SEX OFFENSE, BATTERY
# 3 - NARCOTICS, STALKING, WEAPONS VIOLATION, CRIMINAL DAMAGE, CRIMINAL TRESPASS
# 4 - GAMBLING, PROSTITUTION, OBSCENITY, LIQUOR LAW VIOLATION, PUBLIC PEACE VIOLATION, PUBLIC INDECENCY
# 5 - INTIMIDATION, INTERFERENCE WITH PUBLIC OFFICER, DECEPTIVE PRACTICE
# 6 - NON-CRIMINAL, OTHER OFFENSE
crime_severity = {'HOMICIDE':1, 'CRIM SEXUAL ASSAULT':1, 'KIDNAPPING':1, 'HUMAN TRAFFICKING':1,
                  'OFFENSE INVOLVING CHILDREN':1, 'BURGLARY':2, 'THEFT':2, 'MOTOR VEHICLE THEFT':2, 'ROBBERY':2,
                  'ASSAULT':2, 'ARSON':2, 'SEX OFFENSE':2, 'BATTERY':2, 'NARCOTICS':3, 'STALKING':3, 'WEAPONS VIOLATION':3,
                  'CRIMINAL DAMAGE':3, 'CRIMINAL TRESPASS':3, 'GAMBLING':4, 'PROSTITUTION':4, 'OBSCENITY':4,
                  'LIQUOR LAW VIOLATION':4, 'PUBLIC PEACE VIOLATION':4, 'PUBLIC INDECENCY':4, 'INTIMIDATION':5,
                  'INTERFERENCE WITH PUBLIC OFFICER':5, 'DECEPTIVE PRACTICE':5, 'NON-CRIMINAL':6, 'OTHER OFFENSE':6}
crimes['Crime Severity'] = [crime_severity[crime] for crime in crimes['Primary Type']]
print(crimes.groupby('Crime Severity').agg('count').Latitude)
crime_severity_year = crimes.groupby(['Year', 'Crime Severity']).agg('count').Latitude
crime_year = crimes.groupby('Year').agg('count').Latitude
print(crime_severity_year.div(crime_year, level='Year')*100) # percentage of crimes by severity for each year
```

This results in the following table of the percentage of crimes at each severity level per year. As expected, very severe crimes are rare (yay!), and the least severe crimes are likely under-reported.

```
Year  Crime Severity
2012  1                  1.254848
      2                 62.336945
      3                 24.987003
      4                  1.956367
      5                  4.248257
      6                  5.216581
2013  1                  1.319106
      2                 61.259230
      3                 25.107707
      4                  1.931008
      5                  4.516969
      6                  5.865980
2014  1                  1.500249
      2                 59.783908
      3                 24.615288
      4                  1.978309
      5                  5.897440
      6                  6.224807
2015  1                  1.584723
      2                 60.200236
      3                 23.586856
      4                  1.686728
      5                  6.228805
      6                  6.712652
2016  1                  1.659152
      2                 64.709698
      3                 20.153777
      4                  1.119500
      5                  6.007012
      6                  6.350861
```

I then fit a multinomial logistic model to determine how severe a crime will be given covariates for whether an arrest occurred, whether the crime was domestic related, where the crime occurred, and when during the year it occurred. With “most severe” (level 1) as the reference category, the results reveal that every covariate is significant for at least one of the possible outcomes.

```
MNLogit Regression Results
==============================================================================
Dep. Variable: Crime Severity No. Observations: 1419517
Model: MNLogit Df Residuals: 1419482
Method: MLE Df Model: 30
Date: Sat, 24 Nov 2018 Pseudo R-squ.: 0.09873
Time: 12:08:38 Log-Likelihood: -1.4036e+06
converged: True LL-Null: -1.5574e+06
LLR p-value: 0.000
====================================================================================
Crime Severity=2 coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Arrest -0.1151 0.019 -6.093 0.000 -0.152 -0.078
Domestic -0.8151 0.015 -54.260 0.000 -0.845 -0.786
Year -0.0070 0.004 -1.929 0.054 -0.014 0.000
Latitude 1.6651 0.099 16.817 0.000 1.471 1.859
Longitude 0.5903 0.104 5.683 0.000 0.387 0.794
Month 0.0131 0.002 6.180 0.000 0.009 0.017
Day 0.0064 0.001 7.977 0.000 0.005 0.008
------------------------------------------------------------------------------------
Crime Severity=3 coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Arrest 1.7122 0.019 89.942 0.000 1.675 1.749
Domestic -2.2771 0.017 -135.238 0.000 -2.310 -2.244
Year -0.0682 0.004 -18.253 0.000 -0.076 -0.061
Latitude -0.3517 0.102 -3.460 0.001 -0.551 -0.152
Longitude -1.7625 0.107 -16.519 0.000 -1.972 -1.553
Month 0.0013 0.002 0.579 0.562 -0.003 0.006
Day 0.0058 0.001 7.094 0.000 0.004 0.007
------------------------------------------------------------------------------------
Crime Severity=4 coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Arrest 3.4872 0.027 131.002 0.000 3.435 3.539
Domestic -3.3843 0.050 -67.178 0.000 -3.483 -3.286
Year -0.1063 0.005 -21.222 0.000 -0.116 -0.097
Latitude -0.6723 0.137 -4.900 0.000 -0.941 -0.403
Longitude -2.7475 0.143 -19.169 0.000 -3.028 -2.467
Month 0.0078 0.003 2.698 0.007 0.002 0.013
Day 0.0063 0.001 5.812 0.000 0.004 0.008
------------------------------------------------------------------------------------
Crime Severity=5 coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Arrest 0.2259 0.021 10.807 0.000 0.185 0.267
Domestic -3.9168 0.040 -98.234 0.000 -3.995 -3.839
Year 0.0702 0.004 17.190 0.000 0.062 0.078
Latitude 3.5473 0.110 32.253 0.000 3.332 3.763
Longitude 3.2869 0.116 28.340 0.000 3.060 3.514
Month 0.0003 0.002 0.122 0.903 -0.004 0.005
Day 0.0009 0.001 0.964 0.335 -0.001 0.003
------------------------------------------------------------------------------------
Crime Severity=6 coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Arrest 0.2904 0.020 14.228 0.000 0.250 0.330
Domestic -0.1294 0.017 -7.831 0.000 -0.162 -0.097
Year -0.0338 0.004 -8.467 0.000 -0.042 -0.026
Latitude 0.5739 0.109 5.272 0.000 0.361 0.787
Longitude -0.5190 0.114 -4.546 0.000 -0.743 -0.295
Month -0.0056 0.002 -2.382 0.017 -0.010 -0.001
Day 0.0023 0.001 2.574 0.010 0.001 0.004
====================================================================================
```

Interestingly, we see evidence that an arrest increases the probability of a lower severity. I would guess this is actually a reverse relationship: low-severity crimes may only be reported when there is an easy arrest to be made. There is also a relationship between crime location and severity, but I have yet to explore it further.

Most surprising is that the date a crime occurs has an effect on its severity. The signs of the coefficients suggest that a crime committed later in the year is more likely to be severe. Of course, one might expect this relationship to be non-linear, with a peak in severity occurring in the summer (see Toronto’s “Summer of Murder” in 2018), and this cannot be captured by the current form of the model.
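One way to capture such a seasonal peak, which I have not applied here, would be to replace the linear Month term with a sine/cosine pair so the model can fit a smooth within-year cycle. A minimal sketch of the encoding:

```python
import numpy as np
import pandas as pd

# Map Month onto the unit circle; a linear model in these two columns can
# then represent a smooth peak at any point in the year (e.g. midsummer).
months = pd.Series(range(1, 13))
seasonal = pd.DataFrame({
    'Month': months,
    'Month Sin': np.sin(2 * np.pi * months / 12),
    'Month Cos': np.cos(2 * np.pi * months / 12),
})
print(seasonal)
```

These two columns would replace crimes['Month'] in the covariate matrix.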

### School Safety

Now, consider the schools dataset. Each row of this dataset corresponds to one of the 188 public high schools in Chicago. There are multiple fields that give a qualitative review of the school based on student and teacher surveys, as well as quantitative fields for test scores, attendance levels, and dropout rates. Most of the quantitative fields had very little data and so were not considered for this analysis. The fields schools[Latitude] and schools[Longitude] are labelled the other way around in the original file, so I swapped those back.

The only survey question answered by a significant proportion of schools is schools[Safe], which rates how safe the school is on a scale of VERY WEAK, WEAK, NEUTRAL, STRONG, or VERY STRONG. After removing schools with minimal data in the cleaning process, I was left with 136; however, 18 of these still have NOT ENOUGH DATA in schools[Safe]. The question I am interested in is whether the number of crimes committed “close” to a school impacts how students and teachers rate its safety. Since I have already removed a significant number of schools for having many missing fields, I will use multiple imputation for these 18 schools.

```
## Compare crime to school quality
# We have 188 schools total
schools_cols = ['School ID', 'Student Response Rate', 'Teacher Response Rate', 'Involved Family',
                'Supportive Environment', 'Ambitious Instruction', 'Effective Leaders', 'Collaborative Teachers',
                'Safe', 'School Community', 'Parent-Teacher Partnership', 'Quality of Facilities',
                'Healthy Schools Certification', 'Creative Schools Certification', 'EPAS Growth Percentile',
                'EPAS Attainment Percentile', 'Grade ACT Attainment Percentile Grade 11', 'ACT Spring 2013 Average Grade 11',
                'Student Attendance Percentage 2013', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude']
schools = schools[schools_cols] # Only want to keep the columns that have a reasonable amount of data
# Remove 19 schools that have nearly no data
schools = schools[~(np.isnan(schools['Student Attendance Percentage 2013'])
                    & np.isnan(schools['One-Year DropOut Rate Percentage 2013'])
                    & np.isnan(schools['ACT Spring 2013 Average Grade 11']))]
# Drop 33 more schools with no ACT data
schools = schools[~np.isnan(schools['ACT Spring 2013 Average Grade 11'])
                  & ~np.isnan(schools['EPAS Growth Percentile'])
                  & ~np.isnan(schools['Grade ACT Attainment Percentile Grade 11'])
                  & (schools['One-Year DropOut Rate Percentage 2013'] > 0)]
# Have to swap Longitude and Latitude
schools.rename(columns = {'Latitude': 'temp'}, inplace=True)
schools.rename(columns = {'Longitude':'Latitude', 'temp':'Longitude'}, inplace=True)
# View breakdown of categorical data
print(schools.groupby('Collaborative Teachers').agg('count')['School ID'])
print(schools.groupby('Safe').agg('count')['School ID']) # only useful one that has a reasonable percentage of missing values
print(schools.groupby('School Community').agg('count')['School ID'])
print(schools.groupby('Parent-Teacher Partnership').agg('count')['School ID'])
print(schools.groupby('Quality of Facilities').agg('count')['School ID'])
crimes['Latitude Radians'] = np.radians(crimes.Latitude)
crimes['Longitude Radians'] = np.radians(crimes.Longitude)
schools['Latitude Radians'] = np.radians(schools.Latitude)
schools['Longitude Radians'] = np.radians(schools.Longitude)
```

To determine the proximity of a crime to a school, I approximate the great-circle distance between two points given by latitude and longitude using the haversine formula, as described at https://andrew.hedges.name/experiments/haversine/.

```
Earth_Radius = 6373 # approximate Earth radius in km
# algorithm from https://andrew.hedges.name/experiments/haversine/ under MIT license
# returns array of great-circle distances between a school and each crime in crimes
def GeoDistSchool(school_lat, school_lon):
    lat_diff = crimes['Latitude Radians'] - school_lat
    lon_diff = crimes['Longitude Radians'] - school_lon
    a = np.square(np.sin(lat_diff / 2)) + np.cos(school_lat) * np.cos(crimes['Latitude Radians']) * np.square(np.sin(lon_diff / 2))
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    return Earth_Radius * c

# Count crimes within several radii of each school
for school in schools.index:
    [school_lat, school_lon] = schools.loc[school, ['Latitude Radians', 'Longitude Radians']]
    distances = GeoDistSchool(school_lat, school_lon)
    schools.loc[school, 'Crimes Committed 0.25km'] = np.sum(distances <= 0.25)
    schools.loc[school, 'Crimes Committed 0.5km'] = np.sum(distances <= 0.5)
    schools.loc[school, 'Crimes Committed 0.75km'] = np.sum(distances <= 0.75)
    schools.loc[school, 'Crimes Committed 1km'] = np.sum(distances <= 1)
    schools.loc[school, 'Crimes Committed 1.5km'] = np.sum(distances <= 1.5)
    schools.loc[school, 'Crimes Committed 3km'] = np.sum(distances <= 3)
    schools.loc[school, 'Crimes Committed 5km'] = np.sum(distances <= 5)
```

Then, for each school I counted how many crimes were committed from 2012 to 2016 within a 0.5km and a 1km radius of the school, and used these counts along with the other covariates to fit a multinomial logistic regression model with schools[Safe] as the response variable. Many of the covariates are not significant, including the number of crimes committed within each radius.

```
# Fit multinomial logit model to predict 'Safe' categorization
schools_exists = schools[~(schools.Safe=='NOT ENOUGH DATA')] # 118 schools to build initial model
schools_exists_y = schools_exists.Safe
schools_exists_X = schools_exists[['Student Response Rate', 'Teacher Response Rate', 'EPAS Growth Percentile',
                                   'EPAS Attainment Percentile', 'Grade ACT Attainment Percentile Grade 11',
                                   'ACT Spring 2013 Average Grade 11', 'Student Attendance Percentage 2013',
                                   'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude',
                                   'Crimes Committed 0.5km', 'Crimes Committed 1km']]
schools_exists_model = sm.MNLogit(schools_exists_y, schools_exists_X).fit()
print(schools_exists_model.summary())
```

```
MNLogit Regression Results
==============================================================================
Dep. Variable: Safe No. Observations: 118
Model: MNLogit Df Residuals: 70
Method: MLE Df Model: 44
Date: Sat, 24 Nov 2018 Pseudo R-squ.: 0.4531
Time: 21:26:23 Log-Likelihood: -84.131
converged: True LL-Null: -153.83
LLR p-value: 7.608e-12
============================================================================================================
Safe=STRONG coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------------------
Student Response Rate -0.0038 0.027 -0.142 0.887 -0.056 0.048
Teacher Response Rate -0.0130 0.022 -0.604 0.546 -0.055 0.029
EPAS Growth Percentile -0.0061 0.013 -0.466 0.642 -0.032 0.020
EPAS Attainment Percentile 0.2829 0.110 2.571 0.010 0.067 0.499
Grade ACT Attainment Percentile Grade 11 -0.1077 0.145 -0.743 0.458 -0.392 0.176
ACT Spring 2013 Average Grade 11 -1.1330 1.057 -1.072 0.284 -3.204 0.938
Student Attendance Percentage 2013 0.0440 0.071 0.623 0.534 -0.094 0.182
One-Year DropOut Rate Percentage 2013 0.0667 0.038 1.755 0.079 -0.008 0.141
Latitude 12.5131 4.802 2.606 0.009 3.101 21.925
Longitude 5.8556 2.299 2.547 0.011 1.350 10.361
Crimes Committed 0.5km -3.49e-05 0.000 -0.128 0.898 -0.001 0.000
Crimes Committed 1km -2.876e-05 0.000 -0.221 0.825 -0.000 0.000
------------------------------------------------------------------------------------------------------------
Safe=VERY STRONG coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------------------
Student Response Rate -0.0117 0.086 -0.136 0.892 -0.181 0.157
Teacher Response Rate 0.0591 0.108 0.547 0.584 -0.153 0.271
EPAS Growth Percentile -0.0287 0.033 -0.870 0.384 -0.093 0.036
EPAS Attainment Percentile -0.0470 0.331 -0.142 0.887 -0.696 0.602
Grade ACT Attainment Percentile Grade 11 -0.2014 0.353 -0.570 0.569 -0.894 0.491
ACT Spring 2013 Average Grade 11 3.3370 2.203 1.515 0.130 -0.981 7.655
Student Attendance Percentage 2013 0.1301 0.260 0.499 0.618 -0.381 0.641
One-Year DropOut Rate Percentage 2013 0.2328 0.132 1.757 0.079 -0.027 0.492
Latitude -1.3604 10.328 -0.132 0.895 -21.603 18.882
Longitude 0.1280 4.949 0.026 0.979 -9.572 9.828
Crimes Committed 0.5km -0.0008 0.001 -0.618 0.537 -0.003 0.002
Crimes Committed 1km -0.0001 0.000 -0.297 0.766 -0.001 0.001
------------------------------------------------------------------------------------------------------------
Safe=VERY WEAK coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------------------
Student Response Rate -0.1712 0.148 -1.158 0.247 -0.461 0.119
Teacher Response Rate -0.0121 0.199 -0.061 0.951 -0.402 0.378
EPAS Growth Percentile 0.0730 0.071 1.022 0.307 -0.067 0.213
EPAS Attainment Percentile -2.8732 2.257 -1.273 0.203 -7.297 1.551
Grade ACT Attainment Percentile Grade 11 2.7547 2.388 1.153 0.249 -1.926 7.436
ACT Spring 2013 Average Grade 11 0.0001 4.945 3e-05 1.000 -9.692 9.692
Student Attendance Percentage 2013 -0.3628 0.450 -0.806 0.420 -1.245 0.519
One-Year DropOut Rate Percentage 2013 -0.1222 0.279 -0.439 0.661 -0.669 0.424
Latitude -58.9888 39.143 -1.507 0.132 -135.707 17.729
Longitude -28.5170 18.862 -1.512 0.131 -65.485 8.451
Crimes Committed 0.5km 0.0018 0.002 1.079 0.281 -0.002 0.005
Crimes Committed 1km -0.0004 0.001 -0.749 0.454 -0.001 0.001
------------------------------------------------------------------------------------------------------------
Safe=WEAK coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------------------
Student Response Rate -0.0385 0.026 -1.469 0.142 -0.090 0.013
Teacher Response Rate -0.0082 0.025 -0.323 0.746 -0.058 0.042
EPAS Growth Percentile 0.0102 0.015 0.695 0.487 -0.019 0.039
EPAS Attainment Percentile -0.1710 0.137 -1.244 0.214 -0.440 0.099
Grade ACT Attainment Percentile Grade 11 0.0803 0.221 0.363 0.716 -0.353 0.513
ACT Spring 2013 Average Grade 11 -0.2075 1.239 -0.167 0.867 -2.636 2.221
Student Attendance Percentage 2013 -0.0524 0.052 -1.002 0.316 -0.155 0.050
One-Year DropOut Rate Percentage 2013 -0.0986 0.040 -2.443 0.015 -0.178 -0.019
Latitude -11.3823 5.074 -2.243 0.025 -21.327 -1.438
Longitude -5.5696 2.442 -2.280 0.023 -10.356 -0.783
Crimes Committed 0.5km 5.189e-05 0.000 0.121 0.903 -0.001 0.001
Crimes Committed 1km -3.292e-06 0.000 -0.028 0.977 -0.000 0.000
============================================================================================================
```

So, I fit a reduced model, in which every remaining covariate is significant for at least one of the outcome categories.

```
# Get rid of extraneous covariates and fit reduced model
schools_exists_X_reduced = schools_exists[['Student Response Rate', 'EPAS Attainment Percentile', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude']]
schools_exists_model_reduced = sm.MNLogit(schools_exists_y, schools_exists_X_reduced).fit()
print(schools_exists_model_reduced.summary())
```

```
MNLogit Regression Results
==============================================================================
Dep. Variable: Safe No. Observations: 118
Model: MNLogit Df Residuals: 98
Method: MLE Df Model: 16
Date: Sat, 24 Nov 2018 Pseudo R-squ.: 0.3224
Time: 21:43:25 Log-Likelihood: -104.24
converged: True LL-Null: -153.83
LLR p-value: 4.914e-14
=========================================================================================================
Safe=STRONG coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
Student Response Rate -0.0092 0.022 -0.415 0.678 -0.053 0.034
EPAS Attainment Percentile 0.0467 0.015 3.116 0.002 0.017 0.076
One-Year DropOut Rate Percentage 2013 0.0551 0.029 1.919 0.055 -0.001 0.111
Latitude 7.6611 3.974 1.928 0.054 -0.128 15.450
Longitude 3.6813 1.901 1.937 0.053 -0.044 7.407
---------------------------------------------------------------------------------------------------------
Safe=VERY STRONG coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
Student Response Rate 0.0268 0.048 0.561 0.575 -0.067 0.120
EPAS Attainment Percentile 0.1235 0.032 3.881 0.000 0.061 0.186
One-Year DropOut Rate Percentage 2013 0.1854 0.066 2.810 0.005 0.056 0.315
Latitude -3.0168 6.593 -0.458 0.647 -15.939 9.906
Longitude -1.3134 3.140 -0.418 0.676 -7.469 4.842
---------------------------------------------------------------------------------------------------------
Safe=VERY WEAK coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
Student Response Rate -0.0025 0.056 -0.045 0.964 -0.112 0.107
EPAS Attainment Percentile -0.2092 0.136 -1.534 0.125 -0.476 0.058
One-Year DropOut Rate Percentage 2013 -0.0458 0.073 -0.625 0.532 -0.189 0.098
Latitude -22.0762 11.050 -1.998 0.046 -43.734 -0.419
Longitude -10.5304 5.270 -1.998 0.046 -20.860 -0.201
---------------------------------------------------------------------------------------------------------
Safe=WEAK coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
Student Response Rate -0.0412 0.021 -1.994 0.046 -0.082 -0.001
EPAS Attainment Percentile -0.1356 0.039 -3.509 0.000 -0.211 -0.060
One-Year DropOut Rate Percentage 2013 -0.0828 0.035 -2.341 0.019 -0.152 -0.013
Latitude -10.6719 4.637 -2.301 0.021 -19.761 -1.583
Longitude -5.1489 2.217 -2.322 0.020 -9.494 -0.804
=========================================================================================================
```

I performed a likelihood ratio test comparing the reduced model to the full model, obtaining a test statistic of G² = 40.21, which on 28 degrees of freedom corresponds to a p-value of 0.0633. Thus, I do not have evidence that the reduced model fits worse than the full model, so I will use the reduced model.

```
# Likelihood ratio test is 2(LL_full - LL_restricted)
G2 = 2 * (schools_exists_model.llf - schools_exists_model_reduced.llf)
df = schools_exists_model_reduced.df_resid - schools_exists_model.df_resid
pval = 1 - chi2.cdf(G2,df)
```

With this, I impute the value of schools[Safe] for the 18 schools which didn’t have enough data by simulating a multinomial random variable with each school’s predicted probabilities. To account for the additional error introduced by imputation, I repeat this process 1000 times, each time refitting the full model on all 136 schools. From each iteration I obtain confidence intervals for each coefficient, and then use the average lower and upper bounds across iterations as my final confidence intervals.

```
# Impute values of schools.Safe using reduced model
# Rows used to build the imputation model
schools_impute = schools[schools.Safe=='NOT ENOUGH DATA']
# Columns used to impute the data
schools_impute_X = schools_impute[['Student Response Rate', 'EPAS Attainment Percentile', 'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude']]
# Columns used to fit model on imputed data
schools_imputed_X = schools[['Student Response Rate', 'Teacher Response Rate', 'EPAS Growth Percentile',
                             'EPAS Attainment Percentile', 'Grade ACT Attainment Percentile Grade 11',
                             'ACT Spring 2013 Average Grade 11', 'Student Attendance Percentage 2013',
                             'One-Year DropOut Rate Percentage 2013', 'Latitude', 'Longitude',
                             'Crimes Committed 0.5km', 'Crimes Committed 1km']]
schools_imputed_conf_ints = []
# order of params output by model
# 0 - NEUTRAL
# 1 - STRONG
# 2 - VERY STRONG
# 3 - VERY WEAK
# 4 - WEAK
params = {0:'NEUTRAL', 1:'STRONG', 2:'VERY STRONG', 3:'VERY WEAK', 4:'WEAK'}
# Extract parameters for the distribution of the coefficient estimates
normal_95 = norm.ppf(0.975) # more decimals than 1.96
schools_exists_model_reduced_coefs = np.transpose(np.array(schools_exists_model_reduced.params)) # mean values for betas
schools_exists_model_reduced_se = np.zeros((4,5)) # standard deviations for betas
for i in range(4):
    for j in range(5):
        schools_exists_model_reduced_se[i,j] = (schools_exists_model_reduced._results.conf_int()[i][j][1] - schools_exists_model_reduced_coefs[i][j]) / normal_95
np.random.seed(123456) # set seed for reproducibility
# Multiple imputation run 1000 times
num_iters = 1000
for iter in range(num_iters):
    # Matrix of sampled coefficient estimates
    # Each row corresponds to an outcome (response)
    # Each column corresponds to a covariate
    schools_exists_model_reduced_coefs_sample = np.zeros((4,5))
    for i in range(4):
        for j in range(5):
            # Sample each beta from the Normal distribution
            schools_exists_model_reduced_coefs_sample[i,j] = np.random.normal(schools_exists_model_reduced_coefs[i,j],schools_exists_model_reduced_se[i,j],1)
    # Compute L vector for each row to impute
    schools_imputed_temp = schools.copy()
    for school in schools_impute.index:
        L = np.dot(schools_exists_model_reduced_coefs_sample, schools_impute_X.loc[school]) # L1, L2, etc from lecture notes
        L_one = np.append(np.array([0]), L) # L for the reference value is 0, since need exp(L)=1
        log_denom = logsumexp(L_one) # computationally better than computing exponentials first and dividing
        probs = np.array([np.exp(L_one[i] - log_denom) for i in range(5)])
        schools_imputed_temp.loc[school, 'Safe'] = params[np.where(np.random.multinomial(1,probs) == 1)[0][0]]
    schools_imputed_temp_y = schools_imputed_temp.Safe
    schools_imputed_temp_model = sm.MNLogit(schools_imputed_temp_y, schools_imputed_X).fit(disp=0)
    schools_imputed_conf_ints.append(schools_imputed_temp_model._results.conf_int())
# Now access the confidence intervals for each param and find average lower and upper bounds
schools_imputed_bootstrap_conf_ints = np.zeros((4,12,2))
for response in range(4):
    for coef in range(12):
        conf_lower = np.array([val[response][coef][0] for val in schools_imputed_conf_ints])
        conf_upper = np.array([val[response][coef][1] for val in schools_imputed_conf_ints])
        schools_imputed_bootstrap_conf_ints[response][coef][0] = np.mean(conf_lower)
        schools_imputed_bootstrap_conf_ints[response][coef][1] = np.mean(conf_upper)
```

The main result from this process is that the bootstrapped confidence intervals for the number of crimes committed within 0.5km and within 1km still contain zero after imputation. Thus, I do not have evidence that these counts are significant in the model, which suggests they do not impact how students and teachers rate the safety of their school in a survey. Python’s statsmodels MNLogit function mysteriously failed when I added more columns at different radius sizes at once, but I ran it a few times swapping them out for each other and got the same results.
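The zero check on the averaged bounds is mechanical; here is a sketch using a small hypothetical stand-in array in place of the schools_imputed_bootstrap_conf_ints computed above:

```python
import numpy as np

# Hypothetical stand-in for schools_imputed_bootstrap_conf_ints,
# shaped (responses, coefficients, 2) with averaged lower/upper bounds
conf_ints = np.array([[[-0.1, 0.2], [0.3, 0.9]],
                      [[-0.5, -0.1], [-0.2, 0.4]]])
# A coefficient is significant at the 5% level when its interval excludes zero
excludes_zero = (conf_ints[..., 0] > 0) | (conf_ints[..., 1] < 0)
print(excludes_zero)  # rows are responses, columns are coefficients
```

Applied to the real array, the rows for the two crime-count columns come back False everywhere.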

```
==================================================================
Safe=STRONG [0.025 0.975]
------------------------------------------------------------------
Student Response Rate -0.0553 0.0361
Teacher Response Rate -0.0505 0.0303
EPAS Growth Percentile -0.0306 0.0182
EPAS Attainment Percentile 0.0536 0.4682
Grade ACT Attainment Percentile Grade 11 -0.3319 0.1973
ACT Spring 2013 Average Grade 11 -3.2238 0.5732
Student Attendance Percentage 2013 -0.0928 0.1792
One-Year DropOut Rate Percentage 2013 -0.0127 0.1295
Latitude 2.2649 20.2106
Longitude 0.928 9.4907
Crimes Committed 0.5km -0.0005 0.0005
Crimes Committed 1km -0.0003 0.0002
------------------------------------------------------------------
Safe=VERY STRONG [0.025 0.975]
------------------------------------------------------------------
Student Response Rate -0.1785 0.0885
Teacher Response Rate -0.1156 0.2305
EPAS Growth Percentile -0.0738 0.0334
EPAS Attainment Percentile -0.688 0.4663
Grade ACT Attainment Percentile Grade 11 -0.6902 0.5267
ACT Spring 2013 Average Grade 11 -0.8881 6.2761
Student Attendance Percentage 2013 -0.3395 0.5248
One-Year DropOut Rate Percentage 2013 -0.0371 0.403
Latitude -22.5398 14.6637
Longitude -10.1751 7.6211
Crimes Committed 0.5km -0.0029 0.0015
Crimes Committed 1km -0.0007 0.0006
------------------------------------------------------------------
Safe=VERY WEAK [0.025 0.975]
------------------------------------------------------------------
Student Response Rate -0.182 0.0589
Teacher Response Rate -0.1802 0.1666
EPAS Growth Percentile -0.031 0.0819
EPAS Attainment Percentile -1.8504 0.0805
Grade ACT Attainment Percentile Grade 11 -0.7686 2.0723
ACT Spring 2013 Average Grade 11 -5.7878 8.1617
Student Attendance Percentage 2013 -0.573 0.1335
One-Year DropOut Rate Percentage 2013 -0.393 0.1686
Latitude -71.4799 5.0597
Longitude -34.1409 2.3914
Crimes Committed 0.5km -0.0015 0.0037
Crimes Committed 1km -0.001 0.0005
------------------------------------------------------------------
Safe=WEAK [0.025 0.975]
------------------------------------------------------------------
Student Response Rate -0.0817 0.0081
Teacher Response Rate -0.0553 0.0368
EPAS Growth Percentile -0.0175 0.0352
EPAS Attainment Percentile -0.3826 0.1029
Grade ACT Attainment Percentile Grade 11 -0.3588 0.4125
ACT Spring 2013 Average Grade 11 -2.1312 2.2394
Student Attendance Percentage 2013 -0.1619 0.0387
One-Year DropOut Rate Percentage 2013 -0.179 -0.0231
Latitude -20.1468 -0.9197
Longitude -9.7418 -0.5183
Crimes Committed 0.5km -0.0007 0.0008
Crimes Committed 1km -0.0002 0.0002
==================================================================
```

### Police Stations

Finally, consider the police dataset. This is the smallest dataset, with only 23 rows – one for each police station in Chicago. The only fields I care about from this dataset are police[LATITUDE] and police[LONGITUDE]. The null hypothesis I wish to test here is that crime location is not influenced by where police stations are. This is a little nebulous, so I first compare the two visually.

```
ward_crime_percentages = crimes.groupby('Ward').agg('count').Latitude / len(crimes) * 100
colours = [str(col) for col in Color("lightsalmon").range_to(Color("darkred"), len(ward_crime_percentages))]
sorted_crime_percentages = np.sort(ward_crime_percentages)
# Map each ward to a colour according to its rank in crime percentage
ward_colours = dict()
for index in range(len(colours)):
    ward = ward_crime_percentages.index[np.where(ward_crime_percentages == sorted_crime_percentages[index])[0][0]]
    ward_colours[ward] = colours[index]
crimes['Ward Colour'] = [ward_colours[ward] for ward in crimes.Ward]
plt.figure('ward')
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
plt.title('Crimes Heat Map by Ward + Police Stations', fontsize=12)
for colour in colours:
    plt.scatter(crimes[crimes['Ward Colour'] == colour]['Latitude'], crimes[crimes['Ward Colour'] == colour]['Longitude'], c=colour, s=2)
legend_handles = [mpatches.Patch(color=colours[i], label=str(round(sorted_crime_percentages[i], 3)) + '%') for i in [0, 9, 19, 29, 39, 49]]
plt.legend(handles=legend_handles)
plt.scatter(police.LATITUDE, police.LONGITUDE, c='navy', s=24)
plt.show()
```

It’s clear that more crimes are committed in the very centre of the city as well as to the west. The police stations (blue dots) also seem to be slightly more concentrated in the centre of the city than at the edges.

Explicitly, I calculate the number of crimes committed within 1 km (inner radius) of each police station and the number of crimes between 1 km and √2 km (outer radius) from each police station. The √2 is chosen so that this annulus has the same area as the inner 1 km circle, namely π km². I repeat this for 100 bootstrap resamples of the crime location data, each time computing the mean paired difference of counts.
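As a quick sanity check on the geometry (a minimal sketch, not part of the original analysis): the annulus from 1 km to √2 km has area π((√2)² − 1²) = π km², matching the inner circle.

```python
import math

inner_circle = math.pi * 1**2                 # area of the 1 km inner circle
annulus = math.pi * (math.sqrt(2)**2 - 1**2)  # area between 1 km and sqrt(2) km
print(inner_circle, annulus)                  # both equal pi (up to rounding)
```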

```
## Crimes in radius around police stations
police['Latitude Radians'] = np.radians(police.LATITUDE)
police['Longitude Radians'] = np.radians(police.LONGITUDE)
# Bootstrap locations
crimes_locations = list(zip(crimes['Latitude Radians'], crimes['Longitude Radians'])) # collection of all crime locations
num_crimes = len(crimes_locations)
num_bootstraps = 100
paired_mean_differences = np.zeros(num_bootstraps)
sample_size = len(crimes_locations) # experimented with not sampling the same number of crimes
sample_proportion = sample_size / len(crimes_locations)
np.random.seed(654321)
# Each bootstrap samples with replacement from the crime location data
for bootstrap in range(num_bootstraps):
    sample_locations = [choice(crimes_locations) for sample in range(sample_size)]
    [sample_lat, sample_lon] = [np.array(tup) for tup in zip(*sample_locations)]
    # haversine formula from https://andrew.hedges.name/experiments/haversine/ under MIT license
    # returns array of great-circle distances (in km) between a police station and each sampled crime
    # redefined for each bootstrap since the sampled crime location data is an implicit argument
    def GeoDistPolice(police_lat, police_lon):
        lat_diff = sample_lat - police_lat
        lon_diff = sample_lon - police_lon
        a = np.square(np.sin(lat_diff / 2)) + np.cos(police_lat) * np.cos(sample_lat) * np.square(np.sin(lon_diff / 2))
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
        return Earth_Radius * c  # Earth_Radius (in km) is defined earlier
    inner_radius = np.zeros(len(police.index))
    outer_radius = np.zeros(len(police.index))
    # Count crimes in both radii for each station
    for station in police.index:
        [police_lat, police_lon] = police.loc[station, ['Latitude Radians', 'Longitude Radians']]
        distances = GeoDistPolice(police_lat, police_lon)
        inner_radius[station] = np.sum(distances <= 1) / sample_proportion  # rescale as if all crimes were used
        outer_radius[station] = np.sum((distances > 1) & (distances <= np.sqrt(2))) / sample_proportion
    paired_mean_differences[bootstrap] = np.mean(inner_radius - outer_radius)
# Bootstrap percentile confidence interval (np.percentile takes percentages on a 0-100 scale)
print(np.percentile(paired_mean_differences, 2.5))
print(np.percentile(paired_mean_differences, 97.5))
```

The value computed is inner radius − outer radius, which gives a bootstrap 95% confidence interval of (1586, 1631). Since this interval excludes zero, I reject the null hypothesis and conclude that more crimes are reported close to a police station than farther away. This is not unreasonable, since presumably there are more police present within the inner radius to witness a crime than in the outer annulus.
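For reference, the percentile interval is read directly off the empirical distribution of the bootstrap statistics. A minimal sketch on synthetic data (the normal draws here are just a stand-in for the 100 bootstrap mean paired differences, not the actual output):

```python
import numpy as np

rng = np.random.default_rng(654321)
# stand-in for the 100 bootstrap mean paired differences
boot_stats = rng.normal(loc=1600, scale=12, size=100)
# np.percentile takes percentages on a 0-100 scale,
# so 2.5 and 97.5 bracket the central 95% of the bootstrap distribution
lower, upper = np.percentile(boot_stats, [2.5, 97.5])
print(lower, upper)
```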

### Conclusions

Ultimately, I answered three questions from the data available, all at the 5% significance level. The first is which covariates influence the severity of crimes committed, which I determined to be whether the crime led to an arrest, where the crime occurred, and the time of year when it occurred. Next, I counted the number of crimes committed within different radii around each public high school. For the schools with missing data, I used multiple imputation to fill in the survey ranking of school safety. The imputed data led me to conclude that there is no evidence that the number of crimes committed near a school affects how students and teachers rate the school’s safety. I also computed how many crimes were committed within a radius directly around each police station and within the annulus just outside this inner radius, and concluded that significantly more crimes are committed close to a police station than just outside the inner radius.

To me, the most interesting results of this analysis involved the location data. Future analysis might involve more visualizations, and perhaps fitting something like a Poisson point process to measure the dispersion of crimes. In addition to the statistical analysis, I learned how to fit these models in Python, which, with enough experimentation, proved just as versatile as R.