The objective of this article is to explore and model the dataset of homicides in the state of Santa Catarina. After request, the state’s security bureau of Santa Catarina gave us the data. Ian Fukushima, my colleague, inspired me to make the maps and analysis of this data, I’m grateful for his support.
Let’s start by loading packages:
#preliminaries -----------------------------------------
library(knitr) #for kable, stylized table in r markdown
library(tidyverse) #for pipe
library(dplyr) #select() function
library(sp)
library(spdep)
library(rgdal)
library(Amelia) #missmap()
library(ggmap)
library(rgeos) #gUnaryUnion
library(xts) #extended time series
library(magick) # to make gifs
library(randomForest)
library(gstat) #idw function
library(tmap) #to use tmap function
library(randomForest)
library(grid) #to plot two tmaps together
library(raster) #raster()
Include some paths to save images. You must change those paths for your machine
path1 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/"
path_root = "/home/william/Dropbox/working/projects_git/Crime-SC/"
path_data = "/home/william/Dropbox/working/projects_git/Crime-SC/Dados"
pathExport = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/export"
load.csv()
will load the data. Ian used Google Maps’ API to find the geographic coordinates. The raw data that came only with the address of each homicide incident.
#prepare base----------------------------------------
#load base
crime = read.csv(file=paste0(path_data,"/df_sc_crime_geocoded.csv"), header=TRUE, sep=",", fileEncoding="latin1")
crime %>% class
## [1] "data.frame"
The data have 10745 incidents of homicide.
crime %>% dim()
## [1] 10745 25
The following are the variables. We are going to use only "Data"
, "lat"
,"long"
.
crime %>% names()
## [1] "X" "Crime"
## [3] "Data" "Hora"
## [5] "ID" "Endereço"
## [7] "Município" "Bairro"
## [9] "Tipo.Logradouro" "Logradouro"
## [11] "Número" "Tipo.de.local"
## [13] "Localidade" "PosiTime"
## [15] "lat" "long"
## [17] "accuracy" "formatted_address"
## [19] "address_type" "status"
## [21] "index" "MOTIVAÇÃO"
## [23] "VIOLÊNCIA.DOMÉSTICA" "MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER"
## [25] "ESTUPRO..ANTECEDENTE"
Here is the summary of our data. We are going to use "Data"
, "lat"
,"long"
. But variables as "Municipio"
,"Número"
,"Endereço"
,"Bairro"
are important when using Google’s API for geocoding.
summary(crime)
## X Crime
## Min. : 1 HOMICÍDIO :9162
## 1st Qu.: 2687 LATROCÍNIO : 665
## Median : 5373 PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO : 619
## Mean : 5373 LESÃO CORPORAL SEGUIDA DE MORTE : 170
## 3rd Qu.: 8059 PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO : 63
## Max. :10745 PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO: 50
## (Other) : 16
## Data Hora ID
## 6/12/2015: 12 21:00 : 638 Min. : 1
## 13/3/2010: 11 23:00 : 604 1st Qu.: 2687
## 13/5/2018: 11 22:00 : 579 Median : 5373
## 31/3/2018: 11 20:00 : 530 Mean : 5373
## 1/1/2013 : 10 0:00 : 512 3rd Qu.: 8059
## 11/7/2010: 10 19:00 : 438 Max. :10745
## (Other) :10680 (Other):7444
## Endereço
## , JOINVILLE, Santa Catarina, BRASIL : 50
## , FLORIANÓPOLIS, Santa Catarina, BRASIL : 22
## RUA SEIS DE JANEIRO , JOINVILLE, Santa Catarina, BRASIL: 21
## , SÃO JOSÉ, Santa Catarina, BRASIL : 15
## , ITAJAÍ, Santa Catarina, BRASIL : 13
## , CRICIÚMA, Santa Catarina, BRASIL : 12
## (Other) :10612
## Município Bairro Tipo.Logradouro
## JOINVILLE :1133 CENTRO : 990 RUA :7234
## FLORIANÓPOLIS:1063 INTERIOR : 321 ESTRADA :1105
## ITAJAÍ : 561 : 216 RODOVIA : 701
## CHAPECÓ : 487 RURAL : 193 AVENIDA : 642
## SÃO JOSÉ : 467 MONTE ALEGRE : 161 : 311
## CRICIÚMA : 380 PARANAGUAMIRIM: 142 SERVIDÃO: 282
## (Other) :6654 (Other) :8722 (Other) : 470
## Logradouro Número Tipo.de.local
## : 322 :7686 VIA PÚBLICA :5728
## GERAL : 185 20 : 21 RESIDÊNCIA :2956
## BR 101 : 71 200 : 20 OUTRO : 848
## BR 470 : 35 30 : 15 ESTABELECIMENTO COMERCIAL: 489
## SANTA CATARINA: 30 55 : 14 BARES E SIMILARES : 452
## GETÚLIO VARGAS: 28 12 : 13 ESTABELECIMENTO PENAL : 122
## (Other) :10074 (Other):2976 (Other) : 150
## Localidade PosiTime
## ZONA URBANA :9851 2014-08-31 03:00:00: 6
## ZONA RURAL : 882 2014-09-06 22:00:00: 6
## NÃO INFORMADA : 4 ZONA URBANA : 6
## \\"CANTINA PER TUTI\\",","VIA PÚBLICA: 1 2014-03-20 19:00:00: 5
## \\"E\\",","VIA PÚBLICA : 1 2015-02-26 04:00:00: 5
## 406, \\"G\\",","VIA PÚBLICA : 1 2016-06-12 06:00:00: 5
## (Other) : 5 (Other) :10712
## lat long accuracy
## -28.6010292: 91 -49.151029 : 91 route :5971
## -26.2834213: 73 -48.8452269: 73 street_address:2136
## -26.6140628: 41 -53.7265491: 41 locality : 790
## -27.6140791: 27 -48.6370861: 27 establishment : 781
## -26.0713285: 25 -48.6177384: 25 premise : 371
## (Other) :10471 (Other) :10471 (Other) : 679
## NA's : 17 NA's : 17 NA's : 17
## formatted_address
## Estrada Geral, Santa Catarina, Brazil : 91
## Joinville - State of Santa Catarina, Brazil : 74
## BR-101, Santa Catarina, Brazil : 41
## Estrada de Linha, Paraíso - SC, 89906-000, Brazil : 41
## São José, Santa Catarina - Barreiros, São José - State of Santa Catarina, Brazil: 27
## (Other) :10454
## NA's : 17
## address_type
## route :5971
## street_address :2136
## locality,political : 790
## premise : 371
## administrative_area_level_2,political: 321
## (Other) :1139
## NA's : 17
## status
## OK :10720
## Error - NA - Verifify : 17
## route : 3
## Chapecó, State of Santa Catarina, Brazil : 1
## clothing_store,establishment,home_goods_store,point_of_interest,store: 1
## establishment,point_of_interest : 1
## (Other) : 2
## index MOTIVAÇÃO VIOLÊNCIA.DOMÉSTICA
## OK : 5 NÃO INFORMADA :3480 NÃO :10178
## 1 : 1 DESAVENÇA :2853 NÃO INFORMADO: 1
## 10 : 1 TRÁFICO DE DROGAS:1726 SIM : 566
## 100 : 1 PASSIONAL : 846
## 1000 : 1 AÇÃO POLICIAL : 745
## 10000 : 1 ROUBO : 655
## (Other):10735 (Other) : 440
## MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER ESTUPRO..ANTECEDENTE
## NÃO :4489 NÃO :10559
## NÃO INFORMADO:6232 NÃO INFORMADO: 160
## SIM : 24 SIM : 26
##
##
##
##
kable(crime[1:3,], caption="original data of homicides") #shows stylized table of data frame
X | Crime | Data | Hora | ID | Endereço | Município | Bairro | Tipo.Logradouro | Logradouro | Número | Tipo.de.local | Localidade | PosiTime | lat | long | accuracy | formatted_address | address_type | status | index | MOTIVAÇÃO | VIOLÊNCIA.DOMÉSTICA | MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER | ESTUPRO..ANTECEDENTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | HOMICÍDIO | 1/1/2008 | 0:00 | 1 | RUA MONTEVIDÉU , CHAPECÓ, Santa Catarina, BRASIL | CHAPECÓ | FORTES | RUA | MONTEVIDÉU | VIA PÚBLICA | ZONA URBANA | 2008-01-01 00:00:00 | -27.0996274 | -52.5998438 | route | R. Montevidéu, Chapecó - SC, Brazil | route | OK | 1 | TRÁFICO DE DROGAS | NÃO | NÃO INFORMADO | NÃO | |
2 | HOMICÍDIO | 1/1/2008 | 0:00 | 2 | RUA PEGASUS 187, JOINVILLE, Santa Catarina, BRASIL | JOINVILLE | JARDIM PARAÍSO | RUA | PEGASUS | 187 | OUTRO | ZONA URBANA | 2008-01-01 00:00:00 | -26.2144603 | -48.8141942 | route | R. Pegasus - Jardim Paraiso, Joinville - SC, 89240-000, Brazil | route | OK | 2 | TRÁFICO DE DROGAS | NÃO | NÃO INFORMADO | NÃO |
3 | HOMICÍDIO | 1/1/2008 | 6:00 | 3 | AVENIDA AMÂNDIO CABRAL , BALNEÁRIO BARRA DO SUL, Santa Catarina, BRASIL | BALNEÁRIO BARRA DO SUL | CENTRO | AVENIDA | AMÂNDIO CABRAL | VIA PÚBLICA | ZONA URBANA | 2008-01-01 06:00:00 | -26.4594161 | -48.6030489 | route | Av. Amândio Cabral, Balneário Barra do Sul - SC, 89247-000, Brazil | route | OK | 3 | TRÁFICO DE DROGAS | NÃO | NÃO INFORMADO | NÃO |
Many variables came in factor
format. We transform all variables that are factor
into character
.
#transform all variables that are factor into character
crime2 = crime
len1 = length(names(crime2))
ii = 1
for (ii in 1:len1) {
vv = crime[,ii]
if (is.factor(vv)) {
vv1 <- as.character(vv)
crime2[,ii] = vv1
}
}
#X and ID show the same thing
identical(crime2$X,crime2$ID)
## [1] TRUE
#years and days, and hours of crime
crime2$Data %>% class
## [1] "character"
#we can aggregate the data into months or trimesters
Take a look at the types of homicide. We’ll consider all of them as just homicide.
kable(crime2$Crime %>% table)
. | Freq |
---|---|
HOMICÍDIO | 9162 |
INFANTICÍDIO | 8 |
LATROCÍNIO | 665 |
LESÃO CORPORAL SEGUIDA DE MORTE | 170 |
PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO | 63 |
PESSOA MORTA POR POLICIAL CIVIL FORA DE SERVIÇO | 8 |
PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO | 619 |
PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO | 50 |
There are only some lines that are missing, but for variables that we don’t use.
#map of missing in crime2 from Amelia
missmap(crime2)
#transform lat and long into numeric
crime2$lat=as.numeric(crime2$lat)
crime2$long=as.numeric(crime2$long)
#there are a few missings, 25 and 19
is.na(crime2$lat) %>% sum
## [1] 25
is.na(crime2$long) %>% sum
## [1] 19
#drop places with missing lat and long
mis_lat = is.na(crime2$lat)
mis_lon = is.na(crime2$lat)
crime3 = crime2[!(mis_lat|mis_lon),]
crime3 %>% dim()
## [1] 10720 25
#we drop 25 observations
There are some outliers for latitude, for example, -27027.
summary(crime3$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -27027.00 -27.59 -27.05 -34.48 -26.74 45.14
#there are some outliers in lat and long
#proportion of points that does not make sense
summary(crime3$long)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -119.70 -49.58 -48.83 -49.40 -48.65 35.37
len2 = dim(crime3)[1]
len2
## [1] 10720
#10720 number of lines in data
sum(crime3$long < -54)
## [1] 8
#8 number of outliers
sum(crime3$long < -54)/len2
## [1] 0.0007462687
#0.0007462687
#-54 long is more than the most western frontiers of Santa Catarina
#-53.8376609 most west of SC
#-47.2330741 most east of SC
sum(crime3$long > -47)
## [1] 63
#latitude explore outliers
summary(crime3$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -27027.00 -27.59 -27.05 -34.48 -26.74 45.14
#look at google maps to see which values makes sense
# -25.3345494 most north of SC
# -29.5 most south of SC
We use function get_map()
from ggmap
to get a “screen shot” from Google maps.
# maps ------------------------------------
#google maps
map = get_map(location=c(-53.9, -29.5 ,-47,-25.33),maptype = 'roadmap')
plot(map)
#transform crime3 into %>% points data.frame
coordinates(crime3) = ~long+lat
plot(crime3$long,crime3$lat )
#plot of coordinates doesn't make sense because of outliers
plot(crime3)
Those strange results are important to be shown. Those are some examples of problems that we face using real world data.
crime3 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
The projection of a spatial object is important. Below we see that crime3
doesn’t have a projection.
crime3@proj4string #there is no projection
## CRS arguments: NA
Here we give a projection for crime3
crime3@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(crime3) #there are some outliers in the west, it doesn't make sense
points_occur = data.frame(lon=coordinates(crime3)[,1],
lat=coordinates(crime3)[,2])
(crime_sc1=ggmap(map, ylab = "Latitude", xlab = "Longitude") +
#geom_polygon(aes(x=long, y=lat, group=group),
# fill='grey', size=.2, color='blue', data=, alpha=0)+
geom_point(data=points_occur, aes(x=lon, y=lat), size=.5, color='red', alpha=.5)
)
#save image to disk
ggsave("crime_sc1.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#import data of municipalities:
munic = readOGR("~/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer="municip07")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/william/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer: "municip07"
## with 5561 features
## It has 3 fields
#pick up only municipalities that are in the state of Santa Catarina
state = substr(munic$CODIGOIB1,1,2)
munic=munic[state=='42',]
munic@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(munic)
#plot map with municipaility borders and points of crime and google maps
munic_fortify = fortify(munic, region="CODIGOIB1")
munic_fortify %>% class
## [1] "data.frame"
#we first transform spatial polygons into data frame
(crime_sc2=ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=munic_fortify, color='black', alpha=0,
aes(y=lat, x=long, group=group, fill='grey'))+
geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)
#save image
ggsave("crime_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#we need to exclude points outside santa catarina's borders
#get outer border of Santa Catarina
munic_dissolve = gUnaryUnion(munic)
plot(munic_dissolve)
Note that munic_dissolve
is a spatial polygon, then we use it to select values of crime3
which is spatial points. So this is a way to crop values in a spatial points.
#exclude outliers, we solve the problem
crime4 = crime3[munic_dissolve,]
plot(crime4)
Now that we excluded all outliers we have a recognizable map.
Our raw data is a spatial point process. But to model, we analyze a discrete number of occurrences, which is how many homicides happened in a given region or area.
We use a hexagon area to count the number of homicides that happened inside of it. There are many options in this step. We could use the municipalities area or a squared grid. Just for fun, I used a hexagon gird. But in hindsight, a better option would be the municipal borders. Because it makes much more sense to predict the number of homicides in each municipality and not a general fixed area of a hexagon. And another problem with a fixed geometric grid is that some points at the borders of the state are lost because the polygons can’t reach all corners of the map of Santa Catarina.
#create hexagon polygons --------------------------------------------------
#code from:
#http://strimas.com/spatial/hexagonal-grids/#f1
size = 0.3
set.seed(123)
hex_points = spsample(munic_dissolve, type = "hexagonal", cellsize = size)
hex_points %>% class #spatial points
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
hex_points %>% plot #just make points in a hexagon fashion
hex_points %>% length #there are 109 points
## [1] 109
hex_grid = HexPoints2SpatialPolygons(hex_points, dx = size)
hex_grid %>% class #from spatial points to spatial polygons
## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"
plot(hex_grid)
#number of hexagons in map
len_hex = length(hex_grid) ; len_hex
## [1] 109
plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)
#to save image
jpeg("hexagon_sc.jpg")
plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)
dev.off()
## png
## 2
hex_fortify = fortify(hex_grid)
(hex_sc2=ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=hex_fortify, color='black', alpha=0,
aes(y=lat, x=long, group=group, fill='grey'))+
geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)
Note that there are some points that are not inside any hexagon. Those points will not be accounted in the final data frame. This is a drawback in using hexagon polygons.
ggsave("hex_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#another view of the data
plot(hex_grid)
plot(crime4, col="red" , add=TRUE)
#prepare spatial polygons data frame
crime4$count = 1
names(crime4)[24] #"count"
## [1] "count"
sp1 = crime4[,24] #take just the "count" variable
sp1 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
res = over(hex_grid, sp1 , fn=sum) #count nº of homicides in each hexagon
res %>% dim
## [1] 109 1
is.na(res$count) %>% sum
## [1] 5
res$count[is.na(res$count)] = 0 #hexagons with no crime have 0 crimes
hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:length(hex_grid)), match.ID = FALSE)
hex_sp$crime = res$count
#plot of crime for all periods
spplot(hex_sp,c("crime"))
This map shows for each hexagon how many homicides happened. The numbers shown are for all periods.
#separate for each month ------------------------------------
#the previous maps was built with all periods
#now we need to separate the base for different months
crime5 = crime4 #rename to manipulate
crime5$Data = as.Date(crime5$Data,format="%d/%m/%Y")
summary(crime5$Data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2008-01-01" "2011-02-09" "2014-03-16" "2014-01-01" "2016-12-06" "2019-08-22"
#data max 22/08/2019 almost at the end of the period
#but I expect that this month the prediction will be bad
crime5$Data %>% class
## [1] "Date"
crime5$Data[sample(1:10000,5)]
## [1] "2017-08-21" "2011-07-06" "2010-03-14" "2018-03-03" "2011-12-27"
#create only month data
crime5$Date2 = paste0(substr(as.character(crime5$Data),1,7),"-01")
crime5$Date2[sample(1:10000,5)] %>% kable
x |
---|
2013-08-01 |
2015-10-01 |
2018-09-01 |
2011-04-01 |
2014-01-01 |
crime5$Date2 %>% class
## [1] "character"
crime5$Date2 = as.Date(crime5$Date2,"%Y-%m-%d")
crime5$Date2 %>% class
## [1] "Date"
kable(crime5$Date2[sample(1:10000,5)])
x |
---|
2018-01-01 |
2018-01-01 |
2011-05-01 |
2015-02-01 |
2011-01-01 |
#there are specially few homicides in august 2019, because it is not the full month
crime_c1 = as.numeric(table(crime5$Date2))
timeid = unique(crime5$Date2) #take only unique dates
timeid[1:5]
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
length(timeid) #there are 140 months
## [1] 140
crime_ts = xts(crime_c1, order.by=timeid)
jpeg("time_crime.jpg", width = 800, height = 300)
plot(crime_ts)
dev.off()
## png
## 2
#idea: build a spatial VAR model with time series of crimes
This is the time series for number of homicides in Santa Catarina since January 2008 till August 2019. There seems to be a seasonal behaviour in the series. In our dataset, August ends in the day 22, so the number of homicides is subreported.
#need to build a data frame with each line a hexagon-region for each month
#create neighboors link
nb1 = poly2nb(hex_sp, queen=T)
#connections between hexagon-regions
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)
This map shows the hexagon polygons and the links. Those links are used to determine the neighborhood and spatially lagged values.
jpeg("links.jpg", width = 1000, height = 1000)
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)
dev.off()
## png
## 2
W = nb2listw(nb1, glist=NULL, style ="W") #normlize weights to 1
#create list with spatial data frames, with each month
df_crime = data.frame() #general data frame for models
list_poly = list() #list to input spatial polygon data frame
len_hex = length(hex_grid) #variable that contain length of spatial dataframe
ii = 1 #just a test
for (ii in 1:length(timeid)) {
var1 = crime5$Date2 == timeid[ii] #obs of interest
crime6 = crime5[var1,]
res = over(hex_grid, crime6[,24] , fn=sum)
res$count[is.na(res$count)] = 0
hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:len_hex), match.ID = FALSE)
hex_sp$crime = res$count
hex_sp$period = ii
hex_sp$date = timeid[ii]
hex_sp$lag_crime = lag.listw(W,hex_sp$crime)
hex_sp$long = coordinates(hex_sp)[,1]
hex_sp$lat = coordinates(hex_sp)[,2]
list_poly[[ii]] = hex_sp #input the spatial polygons in the list
names(list_poly)[ii] = ii #input name in the list for the spatial polygon
df_crime = rbind(df_crime,hex_sp@data)
#print(ii) #just to see progress
}
total1=length(timeid)*len_hex; total1
## [1] 15260
df_crime %>% dim
## [1] 15260 7
#they should match
#df_crime %>% View()
df_crime[sample(1:total1,5),] %>% kable
ID | crime | period | date | lag_crime | long | lat | |
---|---|---|---|---|---|---|---|
9642 | 50 | 0 | 89 | 2015-05-01 | 0.5000000 | -53.63526 | -27.05875 |
13961 | 9 | 0 | 129 | 2018-09-01 | 0.0000000 | -49.88526 | -28.35779 |
9982 | 63 | 0 | 92 | 2015-08-01 | 0.1666667 | -49.73526 | -27.05875 |
12499 | 73 | 0 | 115 | 2017-07-01 | 0.5000000 | -51.68526 | -26.79894 |
2980 | 37 | 8 | 28 | 2010-04-01 | 1.3333333 | -48.53526 | -27.57836 |
missmap(df_crime)
#moran's I
#there is spatial correlation in homicide
var2 = list_poly[[1]]$crime
moran.test(var2,W)
##
## Moran I test under randomisation
##
## data: var2
## weights: W
##
## Moran I statistic standard deviate = 2.4378, p-value = 0.007388
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.118433931 -0.009259259 0.002743691
This is a Moran’s I test for homicides for one month. Note the p-value, there is spatial correlation in homicide.
The following function creates a ggplot for nº of homicides and the spatial lag of homicides.
#plot graph with vec1 and spatial lag of vec1
scatLag = function(vec1,W){
vec1lag = lag.listw(W,vec1)
h2.df = as.data.frame(cbind(vec1,vec1lag))
plot = #scatterplot x=vec, y=lagged vec1
ggplot(data=h2.df, aes(y=vec1lag, x=vec1)) +
geom_point(size=.5,alpha=0.5)+
coord_equal()+
ggtitle("scatterplot x=homicide,y=lagged homicide")+
xlab("homicide") + ylab("lagged homicide")
plot
}
scatLag(var2,W)
The graph is not very informative. But it is a visual way to identify what the Moran’s I test does. We note a positive relationship between homicides and homicides in neighboring regions.
Here we plot how much crime there is and plot them in a GIF.
# maps ang gifs -----------------------------------------
#idea: build maps of how much crime is there each regions
#build maps for each month then plot them in a gif
hex_fortify = fortify(hex_grid) #transforms spatial polygons into data frame,
# fortigy is important for ggplot
path_images = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images/"
The following loop creates all images for GIF.
ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
spdf = list_poly[[ii]][,"crime"]
spdf$piece = 1:len_hex
crime_points = data.frame(coordinates(spdf), as.data.frame(spdf$crime))
names(crime_points) = c('lon', 'lat', 'crime')
crime_hex = ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=hex_fortify, color='black', alpha=0, size=.1,
aes(y=lat, x=long, group=group, fill=''))+
geom_point(data=crime_points, aes(x=lon, y=lat, color=crime), size=6, alpha=.9)+
scale_colour_distiller(palette = "Spectral", limits=c(0,20))+
ggtitle(paste('crime', timeid[ii]))
ggsave(paste0(path_images,"crime_",1000+ii,".jpg"), width = 20, height = 20, units = "cm",dpi = 70)
#print(ii)
}
This code chunck creates the GIF.
getwd() #where gif will be saved
## [1] "/home/william/Dropbox/working/projects_git/Crime-SC/tests william"
list.files(path = path_images, pattern = "*.jpg", full.names = T) %>%
map(image_read) %>% # reads each path file
image_join() %>% # joins image
image_animate(fps=1) %>% # animates, can opt for number of loops
image_write("crime_gif.gif") # write to current dir
This GIF show how number of homicides change in each hexagon, it shows how much homicide changed in each hexagon. The periods are changing in 6 month intervals.
#GIF with raster ----------------------------------------------------
#build another type of GIF now with IDW function and raster format
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 90000)
gridded(grid) = TRUE
plot(grid)
We can’t see in this map. But there are 900000 squares in it. This is used in making a IDW (inverse distance weight) interpolation to create continuous maps.
grid %>% class
## [1] "SpatialPixels"
## attr(,"package")
## [1] "sp"
#this procedure is important to make the raster map
#idea: make all the analysis here using a square grid instead hexagonal
path_images3 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images3"
ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
spdf = list_poly[[ii]][,"crime"]
#build raster with idw
v = as.numeric(spdf$crime)
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid)
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title=timeid[ii],
title.size=.8,
title.position = c(0.25,.3))
tmap_save(tm = tm1,
filename = paste0(path_images3,"/crime_",1000+ii,".jpg"),
width = 3, height = 3)
#print(ii) #just to follow progress
}
#to build GIF for munic raster
getwd()
list.files(path = path_images3, pattern = "*.jpg", full.names = T) %>%
map(image_read) %>% # reads each path file
image_join() %>% # joins image
image_animate(fps=1) %>% # animates, can opt for number of loops
image_write("crime_gif_raster.gif") # write to current dir
This GIF is another way to see how homicides change in time. The intervals are of 6 months. Using raster and a continuous representation is better than using a discrete point representation, as in the previous GIF.
#prepare data frame for modelling --------------------------------
#build lag variables and prepare data frame for modelling
library(dplyr)
df_temp1 = df_crime[,c("period", "crime", "lag_crime")] #select variables of interest
df_temp1[sample(1:total1,10),] %>% kable()
period | crime | lag_crime | |
---|---|---|---|
555 | 6 | 0 | 0.2000000 |
14183 | 131 | 0 | 0.2500000 |
4469 | 41 | 0 | 2.3333333 |
15180 | 140 | 0 | 0.0000000 |
9359 | 86 | 0 | 0.0000000 |
10784 | 99 | 0 | 0.2500000 |
10730 | 99 | 0 | 0.8333333 |
7789 | 72 | 0 | 0.5000000 |
9991 | 92 | 0 | 0.0000000 |
9097 | 84 | 0 | 0.5000000 |
#the following loop will make temporal lags as new features
jj = 1
df_crime2 = df_crime
for (jj in 1:12) {
#create columns with spatial and temporal lags of rows
#just a filler with NAs for rbind later
toPut = data.frame(matrix(NA,nrow=len_hex*jj,ncol=dim(df_temp1)[2]))
names(toPut) = names(df_temp1)
df_temp2 = df_temp1[1:(total1-len_hex*jj),]
df_temp3 = rbind(toPut,df_temp2)
df_temp4 = df_temp3
names(df_temp4) = c(paste0("period_",jj), paste0("crime_",jj),
paste0("lag_crime_",jj))
df_crime2 = cbind(df_crime2, df_temp4)
#df_crime2 %>% dim
#print(jj)
}
df_crime2[sample(1:total1,3),] %>% kable
ID | crime | period | date | lag_crime | long | lat | period_1 | crime_1 | lag_crime_1 | period_2 | crime_2 | lag_crime_2 | period_3 | crime_3 | lag_crime_3 | period_4 | crime_4 | lag_crime_4 | period_5 | crime_5 | lag_crime_5 | period_6 | crime_6 | lag_crime_6 | period_7 | crime_7 | lag_crime_7 | period_8 | crime_8 | lag_crime_8 | period_9 | crime_9 | lag_crime_9 | period_10 | crime_10 | lag_crime_10 | period_11 | crime_11 | lag_crime_11 | period_12 | crime_12 | lag_crime_12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14154 | 93 | 0 | 130 | 2018-10-01 | 0.3333333 | -50.33526 | -26.53913 | 129 | 0 | 0.1666667 | 128 | 0 | 0.0000000 | 127 | 0 | 0.0 | 126 | 0 | 0.1666667 | 125 | 0 | 0.0 | 124 | 0 | 0.0 | 123 | 0 | 0.5000000 | 122 | 0 | 0.1666667 | 121 | 0 | 0.0000000 | 120 | 0 | 0.0000000 | 119 | 0 | 0.5 | 118 | 1 | 0.1666667 |
1047 | 66 | 1 | 10 | 2008-10-01 | 1.4000000 | -48.83526 | -27.05875 | 9 | 0 | 2.2000000 | 8 | 1 | 3.8000000 | 7 | 1 | 1.4 | 6 | 3 | 1.8000000 | 5 | 1 | 1.6 | 4 | 1 | 1.4 | 3 | 2 | 4.6000000 | 2 | 1 | 3.6000000 | 1 | 0 | 2.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
7067 | 91 | 0 | 65 | 2013-05-01 | 0.3333333 | -50.93526 | -26.53913 | 64 | 0 | 0.3333333 | 63 | 0 | 0.1666667 | 62 | 0 | 0.0 | 61 | 0 | 0.5000000 | 60 | 0 | 0.5 | 59 | 0 | 0.0 | 58 | 0 | 0.1666667 | 57 | 0 | 0.0000000 | 56 | 0 | 0.6666667 | 55 | 0 | 0.3333333 | 54 | 0 | 0.0 | 53 | 0 | 0.8333333 |
Note that there are some missing values in the beginning of the data frame. This happens because we created columns that are lagged values of the table.
missmap(df_crime2)
#drop missing lines
mis1 = apply(df_crime2,1,function(x){anyNA(x)})
#they should match
12*len_hex; sum(mis1)
## [1] 1308
## [1] 1308
We exclude all lines that have missing. Note that the missmap doesn’t have any missing.
df_crime3 = df_crime2[!mis1,]
df_crime3 %>% dim
## [1] 13952 43
missmap(df_crime3) #there are no more missing
anyNA(df_crime3)
## [1] FALSE
df_crime3 %>% names
## [1] "ID" "crime" "period" "date" "lag_crime"
## [6] "long" "lat" "period_1" "crime_1" "lag_crime_1"
## [11] "period_2" "crime_2" "lag_crime_2" "period_3" "crime_3"
## [16] "lag_crime_3" "period_4" "crime_4" "lag_crime_4" "period_5"
## [21] "crime_5" "lag_crime_5" "period_6" "crime_6" "lag_crime_6"
## [26] "period_7" "crime_7" "lag_crime_7" "period_8" "crime_8"
## [31] "lag_crime_8" "period_9" "crime_9" "lag_crime_9" "period_10"
## [36] "crime_10" "lag_crime_10" "period_11" "crime_11" "lag_crime_11"
## [41] "period_12" "crime_12" "lag_crime_12"
#tree models --------------------------------------------
#ideas
#one of the advantages of using a random forest
#algorithm is that we do not assume any form of
#distribution compared for example to the log gaussian
#cox process which assumes a Poisson distribution for
#the point process.
#for our dataset we have 14000 observations
#we have around 100 regions and time is in months
#for future work, we could change
#more, or less, regions. a larger period.
#trimester, or semester, for example.
#instead of using hexagons, we could just use the municipality
#borders as polygons. It makes much more sense.
#there are 293 municipalities in Santa Catarina.
#there will be many more rows in the table.
df_crime3[sample(1:10000,3),] %>% kable
ID | crime | period | date | lag_crime | long | lat | period_1 | crime_1 | lag_crime_1 | period_2 | crime_2 | lag_crime_2 | period_3 | crime_3 | lag_crime_3 | period_4 | crime_4 | lag_crime_4 | period_5 | crime_5 | lag_crime_5 | period_6 | crime_6 | lag_crime_6 | period_7 | crime_7 | lag_crime_7 | period_8 | crime_8 | lag_crime_8 | period_9 | crime_9 | lag_crime_9 | period_10 | crime_10 | lag_crime_10 | period_11 | crime_11 | lag_crime_11 | period_12 | crime_12 | lag_crime_12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4312 | 61 | 0 | 40 | 2011-04-01 | 0.3333333 | -50.33526 | -27.05875 | 39 | 0 | 0.1666667 | 38 | 0 | 0.1666667 | 37 | 0 | 0.0000000 | 36 | 0 | 0.0000000 | 35 | 0 | 0.3333333 | 34 | 2 | 0.1666667 | 33 | 0 | 0.0000000 | 32 | 0 | 0.0000000 | 31 | 0 | 0.1666667 | 30 | 1 | 0.0000000 | 29 | 0 | 0.1666667 | 28 | 0 | 0.1666667 |
4515 | 46 | 0 | 42 | 2011-06-01 | 0.0000000 | -49.58526 | -27.31856 | 41 | 0 | 0.6666667 | 40 | 2 | 0.1666667 | 39 | 1 | 0.0000000 | 38 | 0 | 0.0000000 | 37 | 0 | 0.0000000 | 36 | 0 | 0.0000000 | 35 | 0 | 0.1666667 | 34 | 1 | 0.0000000 | 33 | 0 | 0.3333333 | 32 | 0 | 0.5000000 | 31 | 0 | 0.0000000 | 30 | 0 | 0.1666667 |
9297 | 32 | 1 | 86 | 2015-02-01 | 0.1666667 | -50.03526 | -27.57836 | 85 | 0 | 0.5000000 | 84 | 0 | 0.6666667 | 83 | 0 | 0.1666667 | 82 | 0 | 0.6666667 | 81 | 0 | 0.0000000 | 80 | 0 | 0.1666667 | 79 | 0 | 0.0000000 | 78 | 0 | 0.3333333 | 77 | 1 | 0.5000000 | 76 | 0 | 0.1666667 | 75 | 0 | 1.0000000 | 74 | 0 | 0.3333333 |
#for a first analysis we'll use only a subset of variables
data = df_crime3[, c("crime",
"crime_1","lag_crime_1",
"crime_2","lag_crime_2",
"crime_3",
"crime_4",
"crime_5",
"crime_6",
"crime_7",
"crime_8",
"crime_9",
"crime_10",
"crime_11",
"crime_12",
"long", "lat")]
The variable crime
is what we predict. crime_1
is the temporal lag for crime. lag_crime_1
is the spatial lag for crime_1
. So lag_crime_1
is the mean of the neighbors for a region but one period prior. crime_2
and lag_crime_2
are two temporal lags for crime and spatial lag of crime.
hist(data$crime, breaks=50)
This number is important for the division of training and test datasets.
12*109 #one year of forecasting
## [1] 1308
len_hex #number of hexagons in SC
## [1] 109
len_hex*12
## [1] 1308
set.seed(1)
nrow = nrow(data)
train = 1:(nrow-len_hex*12)
data %>% names %>% kable
x |
---|
crime |
crime_1 |
lag_crime_1 |
crime_2 |
lag_crime_2 |
crime_3 |
crime_4 |
crime_5 |
crime_6 |
crime_7 |
crime_8 |
crime_9 |
crime_10 |
crime_11 |
crime_12 |
long |
lat |
data %>% names %>% length #27 variables in data frame
## [1] 17
data %>% names %>% length %>% sqrt() #input in mtry
## [1] 4.123106
Create a dataset for training and test. We use a spatiotemporal model, so it is not a cross-section. Hence the division between training and test datasets has a temporal aspect.
trainX = data[train,]
trainY = data$crime[train]
testX = data[-train,]
testY = data$crime[-train]
We run the random forest model. Note that we use 100 trees in boostraping.
m1 = randomForest(crime~.,trainX,
mtry=5,importance=TRUE, ntree=300)
m1
##
## Call:
## randomForest(formula = crime ~ ., data = trainX, mtry = 5, importance = TRUE, ntree = 300)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.8378361
## % Var explained: 72.91
Export R objects to be used in another article.
saveRDS(trainX, file = paste0(pathExport,"/trainX.rds"))
saveRDS(trainY, file = paste0(pathExport,"/trainY.rds"))
saveRDS(testX, file = paste0(pathExport,"/testX.rds"))
saveRDS(testY, file = paste0(pathExport,"/testY.rds"))
saveRDS(m1, file = paste0(pathExport,"/modelRF.rds"))
This is the MSE for training set.
yh1 = predict(m1,trainX)
mean((trainY - yh1)^2)
## [1] 0.1965913
This is the MSE for test set.
yh = predict(m1,testX)
mean((testY - yh)^2)
## [1] 0.7008546
plot(yh, testY)
abline(0,1)
#importance(m1)
varImpPlot(m1)
This is an importance plot. This plot shows which variables are more important. lag_crime_1
,long
,crime_11
,crime_9
are the most relevant. That is the spatially lagged values on with one temporal lag is the most important variable to explain crime one period ahead. Longitude is important because the biggest cities of Santa Catarina are on the coast in the west. Number of homicides 11 and 9 periods prior is imporant maybe this is a cyclical effect. For it to make sense it should be 12 lagged values so that it would indicate some yearly cycle.
Here we plot maps showing the prediction and the ture values.
#plot of predict for 180901 -------------------------
#prepare a data frame for ploting of results
#predict Data Frame
yh[1:4] #vector of predictions
## 13953 13954 13955 13956
## 0.4598918 0.1214090 0.5837738 0.4330119
Here we plot residuals for some predictions.
12 #number of intervals to choose
## [1] 12
len_hex #number of hexagons in SC
## [1] 109
We plot the prediction for the first month of the test set.
period1 = 1
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1)) #interval to extract from yh
preDf = yh[inter1]
#they should be the same as
preDf %>% length; len_hex
## [1] 109
## [1] 109
#plot prediction and residuals
#present prediction and true values
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions
v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("predict ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
Here we prepare data for plot of actual value of homicides.
#plot of actual for 180901
period1
## [1] 1
#build map
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
tm2 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("true ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tm3
tmap_save(tm = tm3,
filename = paste0("crime_","comp1",".jpg"),
width = 3, height = 3)
#plot of predict for 190601 -------------------------
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1]
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
tm11 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("predict ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
#build map
period1
## [1] 10
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm21 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("true ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm11,tm21)
tmap_mode(current.mode)
tmap_save(tm = tm3,
filename = paste0("crime_","comp2",".jpg"),
width = 3, height = 3)
tm3
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
saveRDS(pol, file = paste0(pathExport,"/pol.rds")) #export to use in another article
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
period1 = 1
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1]
period1
## [1] 1
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
tm_raster("crime", style='quantile',
n=7 ,palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .3,
legend.position = c("left","bottom"),
title= paste0("residuals ",timeid[period1+(140-12)]),
title.size=.6,
title.position = c(0.25,.3))
moran.test(v,W)
##
## Moran I test under randomisation
##
## data: v
## weights: W
##
## Moran I statistic standard deviate = -1.4624, p-value = 0.9282
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.095835052 -0.009259259 0.003504995
#for period 1 of prediction, there is no spatial correlation in the residuals
#now, for period 10
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1]
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm2 = tm_shape(raster_idw) +
tm_raster("crime", style='quantile',
n=7 ,palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .3,
legend.position = c("left","bottom"),
title= paste0("residuals ",timeid[period1+(140-12)]),
title.size=.6,
title.position = c(0.25,.3))
moran.test(v,W)
##
## Moran I test under randomisation
##
## data: v
## weights: W
##
## Moran I statistic standard deviate = -1.4845, p-value = 0.9312
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.092495433 -0.009259259 0.003143964
#for period 10 of prediction, there is no spatial correlation in the residuals
current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tmap_save(tm = tm3,
filename = paste0("crime_","res",".jpg"),
width = 3, height = 3)
tm3
# end -------------------------------------
Here we try to predict number of homicides in each polygon. For another article we can predict the number of homicide in each municipality. We can incorporate data about population, GDP per capita and inequality. We could find other crime variables. We can study larger periods, for example, trimester or semester.
Given that the data have coordinates we can study inside city occurances. Model phenomena in the street level.
It is relevant to find a way to interpret how variables (features) affect our variable of interest, \(y\). In a future article I’ll build a Shap like method to intrepret outcomes of black-box like models.