by William Suzuki and Ian Fukushima

The objective of this article is to explore and model the dataset of homicides in the state of Santa Catarina. After request, the state’s security bureau of Santa Catarina gave us the data. Ian Fukushima, my colleague, inspired me to make the maps and analysis of this data, I’m grateful for his support.

Load, Explore and Transform

Let’s start by loading packages:

#preliminaries -----------------------------------------
library(knitr) #for kable, stylized table in r markdown
library(tidyverse) #for pipe
library(dplyr) #select() function
library(sp)
library(spdep)
library(rgdal) 
library(Amelia) #missmap()
library(ggmap) 
library(rgeos) #gUnaryUnion
library(xts) #extended time series
library(magick) # to make gifs
library(randomForest)
library(gstat) #idw function
library(tmap) #to use tmap function
library(randomForest)
library(grid) #to plot two tmaps together
library(raster) #raster()

Include some paths to save images. You must change those paths for your machine

path1 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/"
path_root = "/home/william/Dropbox/working/projects_git/Crime-SC/"
path_data = "/home/william/Dropbox/working/projects_git/Crime-SC/Dados"
pathExport = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/export"

load.csv() will load the data. Ian used Google Maps’ API to find the geographic coordinates. The raw data that came only with the address of each homicide incident.

#prepare base----------------------------------------
#load base
crime = read.csv(file=paste0(path_data,"/df_sc_crime_geocoded.csv"), header=TRUE, sep=",", fileEncoding="latin1")
crime %>% class
## [1] "data.frame"

The data have 10745 incidents of homicide.

crime %>% dim()
## [1] 10745    25

The following are the variables. We are going to use only "Data", "lat","long".

crime %>% names()
##  [1] "X"                                 "Crime"                            
##  [3] "Data"                              "Hora"                             
##  [5] "ID"                                "Endereço"                         
##  [7] "Município"                         "Bairro"                           
##  [9] "Tipo.Logradouro"                   "Logradouro"                       
## [11] "Número"                            "Tipo.de.local"                    
## [13] "Localidade"                        "PosiTime"                         
## [15] "lat"                               "long"                             
## [17] "accuracy"                          "formatted_address"                
## [19] "address_type"                      "status"                           
## [21] "index"                             "MOTIVAÇÃO"                        
## [23] "VIOLÊNCIA.DOMÉSTICA"               "MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER"
## [25] "ESTUPRO..ANTECEDENTE"

Here is the summary of our data. We are going to use "Data", "lat","long". But variables as "Municipio","Número","Endereço","Bairro" are important when using Google’s API for geocoding.

summary(crime)
##        X                                                       Crime     
##  Min.   :    1   HOMICÍDIO                                        :9162  
##  1st Qu.: 2687   LATROCÍNIO                                       : 665  
##  Median : 5373   PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO     : 619  
##  Mean   : 5373   LESÃO CORPORAL SEGUIDA DE MORTE                  : 170  
##  3rd Qu.: 8059   PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO       :  63  
##  Max.   :10745   PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO:  50  
##                  (Other)                                          :  16  
##         Data            Hora            ID       
##  6/12/2015:   12   21:00  : 638   Min.   :    1  
##  13/3/2010:   11   23:00  : 604   1st Qu.: 2687  
##  13/5/2018:   11   22:00  : 579   Median : 5373  
##  31/3/2018:   11   20:00  : 530   Mean   : 5373  
##  1/1/2013 :   10   0:00   : 512   3rd Qu.: 8059  
##  11/7/2010:   10   19:00  : 438   Max.   :10745  
##  (Other)  :10680   (Other):7444                  
##                                                     Endereço    
##    , JOINVILLE, Santa Catarina, BRASIL                  :   50  
##    , FLORIANÓPOLIS, Santa Catarina, BRASIL              :   22  
##  RUA SEIS DE JANEIRO , JOINVILLE, Santa Catarina, BRASIL:   21  
##    , SÃO JOSÉ, Santa Catarina, BRASIL                   :   15  
##    , ITAJAÍ, Santa Catarina, BRASIL                     :   13  
##    , CRICIÚMA, Santa Catarina, BRASIL                   :   12  
##  (Other)                                                :10612  
##          Município               Bairro     Tipo.Logradouro
##  JOINVILLE    :1133   CENTRO        : 990   RUA     :7234  
##  FLORIANÓPOLIS:1063   INTERIOR      : 321   ESTRADA :1105  
##  ITAJAÍ       : 561                 : 216   RODOVIA : 701  
##  CHAPECÓ      : 487   RURAL         : 193   AVENIDA : 642  
##  SÃO JOSÉ     : 467   MONTE ALEGRE  : 161           : 311  
##  CRICIÚMA     : 380   PARANAGUAMIRIM: 142   SERVIDÃO: 282  
##  (Other)      :6654   (Other)       :8722   (Other) : 470  
##           Logradouro        Número                       Tipo.de.local 
##                :  322          :7686   VIA PÚBLICA              :5728  
##  GERAL         :  185   20     :  21   RESIDÊNCIA               :2956  
##  BR 101        :   71   200    :  20   OUTRO                    : 848  
##  BR 470        :   35   30     :  15   ESTABELECIMENTO COMERCIAL: 489  
##  SANTA CATARINA:   30   55     :  14   BARES E SIMILARES        : 452  
##  GETÚLIO VARGAS:   28   12     :  13   ESTABELECIMENTO PENAL    : 122  
##  (Other)       :10074   (Other):2976   (Other)                  : 150  
##                                  Localidade                  PosiTime    
##  ZONA URBANA                          :9851   2014-08-31 03:00:00:    6  
##  ZONA RURAL                           : 882   2014-09-06 22:00:00:    6  
##  NÃO INFORMADA                        :   4   ZONA URBANA        :    6  
##  \\"CANTINA PER TUTI\\",","VIA PÚBLICA:   1   2014-03-20 19:00:00:    5  
##  \\"E\\",","VIA PÚBLICA               :   1   2015-02-26 04:00:00:    5  
##  406, \\"G\\",","VIA PÚBLICA          :   1   2016-06-12 06:00:00:    5  
##  (Other)                              :   5   (Other)            :10712  
##           lat                 long                 accuracy   
##  -28.6010292:   91   -49.151029 :   91   route         :5971  
##  -26.2834213:   73   -48.8452269:   73   street_address:2136  
##  -26.6140628:   41   -53.7265491:   41   locality      : 790  
##  -27.6140791:   27   -48.6370861:   27   establishment : 781  
##  -26.0713285:   25   -48.6177384:   25   premise       : 371  
##  (Other)    :10471   (Other)    :10471   (Other)       : 679  
##  NA's       :   17   NA's       :   17   NA's          :  17  
##                                                                         formatted_address
##  Estrada Geral, Santa Catarina, Brazil                                           :   91  
##  Joinville - State of Santa Catarina, Brazil                                     :   74  
##  BR-101, Santa Catarina, Brazil                                                  :   41  
##  Estrada de Linha, Paraíso - SC, 89906-000, Brazil                               :   41  
##  São José, Santa Catarina - Barreiros, São José - State of Santa Catarina, Brazil:   27  
##  (Other)                                                                         :10454  
##  NA's                                                                            :   17  
##                                 address_type 
##  route                                :5971  
##  street_address                       :2136  
##  locality,political                   : 790  
##  premise                              : 371  
##  administrative_area_level_2,political: 321  
##  (Other)                              :1139  
##  NA's                                 :  17  
##                                                                    status     
##  OK                                                                   :10720  
##  Error - NA - Verifify                                                :   17  
##  route                                                                :    3  
##  Chapecó, State of Santa Catarina, Brazil                             :    1  
##  clothing_store,establishment,home_goods_store,point_of_interest,store:    1  
##  establishment,point_of_interest                                      :    1  
##  (Other)                                                              :    2  
##      index                   MOTIVAÇÃO       VIOLÊNCIA.DOMÉSTICA
##  OK     :    5   NÃO INFORMADA    :3480   NÃO          :10178   
##  1      :    1   DESAVENÇA        :2853   NÃO INFORMADO:    1   
##  10     :    1   TRÁFICO DE DROGAS:1726   SIM          :  566   
##  100    :    1   PASSIONAL        : 846                         
##  1000   :    1   AÇÃO POLICIAL    : 745                         
##  10000  :    1   ROUBO            : 655                         
##  (Other):10735   (Other)          : 440                         
##  MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER    ESTUPRO..ANTECEDENTE
##  NÃO          :4489                NÃO          :10559    
##  NÃO INFORMADO:6232                NÃO INFORMADO:  160    
##  SIM          :  24                SIM          :   26    
##                                                           
##                                                           
##                                                           
## 
kable(crime[1:3,], caption="original data of homicides") #shows stylized table of data frame
original data of homicides
X Crime Data Hora ID Endereço Município Bairro Tipo.Logradouro Logradouro Número Tipo.de.local Localidade PosiTime lat long accuracy formatted_address address_type status index MOTIVAÇÃO VIOLÊNCIA.DOMÉSTICA MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER ESTUPRO..ANTECEDENTE
1 HOMICÍDIO 1/1/2008 0:00 1 RUA MONTEVIDÉU , CHAPECÓ, Santa Catarina, BRASIL CHAPECÓ FORTES RUA MONTEVIDÉU VIA PÚBLICA ZONA URBANA 2008-01-01 00:00:00 -27.0996274 -52.5998438 route R. Montevidéu, Chapecó - SC, Brazil route OK 1 TRÁFICO DE DROGAS NÃO NÃO INFORMADO NÃO
2 HOMICÍDIO 1/1/2008 0:00 2 RUA PEGASUS 187, JOINVILLE, Santa Catarina, BRASIL JOINVILLE JARDIM PARAÍSO RUA PEGASUS 187 OUTRO ZONA URBANA 2008-01-01 00:00:00 -26.2144603 -48.8141942 route R. Pegasus - Jardim Paraiso, Joinville - SC, 89240-000, Brazil route OK 2 TRÁFICO DE DROGAS NÃO NÃO INFORMADO NÃO
3 HOMICÍDIO 1/1/2008 6:00 3 AVENIDA  AMÂNDIO CABRAL , BALNEÁRIO BARRA DO SUL, Santa Catarina, BRASIL BALNEÁRIO BARRA DO SUL CENTRO AVENIDA  AMÂNDIO CABRAL VIA PÚBLICA ZONA URBANA 2008-01-01 06:00:00 -26.4594161 -48.6030489 route Av. Amândio Cabral, Balneário Barra do Sul - SC, 89247-000, Brazil route OK 3 TRÁFICO DE DROGAS NÃO NÃO INFORMADO NÃO

Many variables came in factor format. We transform all variables that are factor into character.

#transform all variables that are factor into character
crime2 = crime
len1 = length(names(crime2)) 
ii = 1
for (ii in 1:len1) {
  vv = crime[,ii]
  if (is.factor(vv)) {
    vv1 <- as.character(vv)
    crime2[,ii] = vv1
  }
}
#X and ID show the same thing
identical(crime2$X,crime2$ID)
## [1] TRUE
#years and days, and hours of crime
crime2$Data %>% class
## [1] "character"
#we can aggregate the data into months or trimesters

Take a look at the types of homicide. We’ll consider all of them as just homicide.

kable(crime2$Crime %>% table)
. Freq
HOMICÍDIO 9162
INFANTICÍDIO 8
LATROCÍNIO 665
LESÃO CORPORAL SEGUIDA DE MORTE 170
PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO 63
PESSOA MORTA POR POLICIAL CIVIL FORA DE SERVIÇO 8
PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO 619
PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO 50

There are only some lines that are missing, but for variables that we don’t use.

#map of missing in crime2 from Amelia
missmap(crime2)

#transform lat and long into numeric
crime2$lat=as.numeric(crime2$lat) 
crime2$long=as.numeric(crime2$long)  
#there are a few missings, 25 and 19
is.na(crime2$lat) %>% sum
## [1] 25
is.na(crime2$long) %>% sum
## [1] 19
#drop places with missing lat and long
mis_lat =  is.na(crime2$lat)
mis_lon = is.na(crime2$lat)
crime3 = crime2[!(mis_lat|mis_lon),]
crime3 %>% dim()
## [1] 10720    25
#we drop 25 observations

There are some outliers for latitude, for example, -27027.

summary(crime3$lat)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -27027.00    -27.59    -27.05    -34.48    -26.74     45.14
#there are some outliers in lat and long
#proportion of points that does not make sense
summary(crime3$long)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -119.70  -49.58  -48.83  -49.40  -48.65   35.37
len2 = dim(crime3)[1]
len2
## [1] 10720
#10720 number of lines in data
sum(crime3$long < -54)
## [1] 8
#8 number of outliers
sum(crime3$long < -54)/len2
## [1] 0.0007462687
#0.0007462687
#-54 long is more than the most western frontiers of Santa Catarina
#-53.8376609 most west of SC
#-47.2330741 most east of SC
sum(crime3$long > -47)
## [1] 63
#latitude explore outliers
summary(crime3$lat)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -27027.00    -27.59    -27.05    -34.48    -26.74     45.14
#look at google maps to see which values makes sense
# -25.3345494 most north of SC
# -29.5 most south of SC

Maps

We use function get_map() from ggmap to get a “screen shot” from Google maps.

# maps ------------------------------------
#google maps
map = get_map(location=c(-53.9, -29.5 ,-47,-25.33),maptype = 'roadmap')
plot(map)

#transform crime3 into %>% points data.frame
coordinates(crime3) = ~long+lat
plot(crime3$long,crime3$lat )

#plot of coordinates doesn't make sense because of outliers
plot(crime3)

Those strange results are important to be shown. Those are some examples of problems that we face using real world data.

crime3 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

The projection of a spatial object is important. Below we see that crime3 doesn’t have a projection.

crime3@proj4string #there is no projection
## CRS arguments: NA

Here we give a projection for crime3

crime3@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(crime3) #there are some outliers in the west, it doesn't make sense

points_occur = data.frame(lon=coordinates(crime3)[,1],
                           lat=coordinates(crime3)[,2])
(crime_sc1=ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
  #geom_polygon(aes(x=long, y=lat, group=group), 
  #             fill='grey', size=.2, color='blue', data=, alpha=0)+
  geom_point(data=points_occur, aes(x=lon, y=lat), size=.5, color='red', alpha=.5)
)

#save image to disk
ggsave("crime_sc1.jpg", width = 20, height = 20, units = "cm",dpi = 600)

Municipal Borders

#import data of municipalities:
munic = readOGR("~/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer="municip07")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/william/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer: "municip07"
## with 5561 features
## It has 3 fields
#pick up only municipalities that are in the state of Santa Catarina
state = substr(munic$CODIGOIB1,1,2)
munic=munic[state=='42',] 
munic@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(munic)

#plot map with municipaility borders and points of crime and google maps
munic_fortify = fortify(munic, region="CODIGOIB1")
munic_fortify %>% class
## [1] "data.frame"
#we first transform spatial polygons into data frame
(crime_sc2=ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
  geom_polygon(data=munic_fortify, color='black', alpha=0,
               aes(y=lat, x=long, group=group, fill='grey'))+
  geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)

#save image
ggsave("crime_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#we need to exclude points outside santa catarina's borders
#get outer border of Santa Catarina
munic_dissolve = gUnaryUnion(munic)
plot(munic_dissolve)

Note that munic_dissolve is a spatial polygon, then we use it to select values of crime3 which is spatial points. So this is a way to crop values in a spatial points.

#exclude outliers, we solve the problem
crime4 =  crime3[munic_dissolve,]
plot(crime4)

Now that we excluded all outliers we have a recognizable map.

Hexagon Polygons

Our raw data is a spatial point process. But to model, we analyze a discrete number of occurrences, which is how many homicides happened in a given region or area.

We use a hexagon area to count the number of homicides that happened inside of it. There are many options in this step. We could use the municipalities area or a squared grid. Just for fun, I used a hexagon gird. But in hindsight, a better option would be the municipal borders. Because it makes much more sense to predict the number of homicides in each municipality and not a general fixed area of a hexagon. And another problem with a fixed geometric grid is that some points at the borders of the state are lost because the polygons can’t reach all corners of the map of Santa Catarina.

#create hexagon polygons --------------------------------------------------
#code from:
#http://strimas.com/spatial/hexagonal-grids/#f1
size = 0.3
set.seed(123)
hex_points = spsample(munic_dissolve, type = "hexagonal", cellsize = size)
hex_points %>% class #spatial points
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
hex_points %>% plot #just make points in a hexagon fashion

hex_points %>% length #there are 109 points 
## [1] 109
hex_grid = HexPoints2SpatialPolygons(hex_points, dx = size)
hex_grid %>% class #from spatial points to spatial polygons
## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"
plot(hex_grid)

#number of hexagons in map
len_hex = length(hex_grid) ; len_hex 
## [1] 109
plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)

#to save image
jpeg("hexagon_sc.jpg")
plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)
dev.off() 
## png 
##   2
hex_fortify = fortify(hex_grid)
(hex_sc2=ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
  geom_polygon(data=hex_fortify, color='black', alpha=0,
               aes(y=lat, x=long, group=group, fill='grey'))+
  geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)

Note that there are some points that are not inside any hexagon. Those points will not be accounted in the final data frame. This is a drawback in using hexagon polygons.

ggsave("hex_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#another view of the data
plot(hex_grid)
plot(crime4, col="red" , add=TRUE)

#prepare spatial polygons data frame
crime4$count = 1
names(crime4)[24] #"count"
## [1] "count"
sp1 = crime4[,24] #take just the "count" variable
sp1 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
res = over(hex_grid, sp1 , fn=sum) #count nº of homicides in each hexagon
res %>% dim
## [1] 109   1
is.na(res$count) %>% sum
## [1] 5
res$count[is.na(res$count)] = 0 #hexagons with no crime have 0 crimes
hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:length(hex_grid)), match.ID = FALSE)
hex_sp$crime = res$count
#plot of crime for all periods
spplot(hex_sp,c("crime"))

This map shows for each hexagon how many homicides happened. The numbers shown are for all periods.

Prepare Data Frame

#separate for each month ------------------------------------
#the previous maps was built with all periods
#now we need to separate the base for different months
crime5 = crime4 #rename to manipulate
crime5$Data = as.Date(crime5$Data,format="%d/%m/%Y")
summary(crime5$Data) 
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2008-01-01" "2011-02-09" "2014-03-16" "2014-01-01" "2016-12-06" "2019-08-22"
#data max 22/08/2019 almost at the end of the period
#but I expect that this month the prediction will be bad
crime5$Data %>% class
## [1] "Date"
crime5$Data[sample(1:10000,5)]
## [1] "2017-08-21" "2011-07-06" "2010-03-14" "2018-03-03" "2011-12-27"
#create only month data
crime5$Date2 =  paste0(substr(as.character(crime5$Data),1,7),"-01") 
crime5$Date2[sample(1:10000,5)] %>% kable
x
2013-08-01
2015-10-01
2018-09-01
2011-04-01
2014-01-01
crime5$Date2 %>% class
## [1] "character"
crime5$Date2 = as.Date(crime5$Date2,"%Y-%m-%d")
crime5$Date2 %>% class
## [1] "Date"
kable(crime5$Date2[sample(1:10000,5)])
x
2018-01-01
2018-01-01
2011-05-01
2015-02-01
2011-01-01
#there are specially few homicides in august 2019, because it is not the full month
crime_c1 = as.numeric(table(crime5$Date2)) 
timeid = unique(crime5$Date2) #take only unique dates
timeid[1:5] 
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
length(timeid) #there are 140 months
## [1] 140
crime_ts = xts(crime_c1, order.by=timeid)
jpeg("time_crime.jpg", width = 800, height = 300)
plot(crime_ts)
dev.off() 
## png 
##   2
#idea: build a spatial VAR model with time series of crimes

This is the time series for number of homicides in Santa Catarina since January 2008 till August 2019. There seems to be a seasonal behaviour in the series. In our dataset, August ends in the day 22, so the number of homicides is subreported.

#need to build a data frame with each line a hexagon-region for each month
#create neighboors link
nb1 = poly2nb(hex_sp, queen=T)
#connections between hexagon-regions
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)

This map shows the hexagon polygons and the links. Those links are used to determine the neighborhood and spatially lagged values.

jpeg("links.jpg", width = 1000, height = 1000)
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)
dev.off() 
## png 
##   2
W = nb2listw(nb1, glist=NULL, style ="W") #normlize weights to 1
#create list with spatial data frames, with each month
df_crime = data.frame() #general data frame for models
list_poly = list() #list to input spatial polygon data frame
len_hex = length(hex_grid) #variable that contain length of spatial dataframe
ii = 1 #just a test
for (ii in 1:length(timeid)) {
  var1 = crime5$Date2 == timeid[ii] #obs of interest
  crime6 = crime5[var1,]

  res = over(hex_grid, crime6[,24] , fn=sum)
  res$count[is.na(res$count)] = 0
  hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:len_hex), match.ID = FALSE)
  hex_sp$crime = res$count
  hex_sp$period = ii
  hex_sp$date = timeid[ii]
  hex_sp$lag_crime = lag.listw(W,hex_sp$crime)
  hex_sp$long =  coordinates(hex_sp)[,1]
  hex_sp$lat =  coordinates(hex_sp)[,2]
  list_poly[[ii]] = hex_sp #input the spatial polygons in the list
  names(list_poly)[ii] = ii #input name in the list for the spatial polygon 
  df_crime = rbind(df_crime,hex_sp@data)
  #print(ii) #just to see progress
}
total1=length(timeid)*len_hex; total1
## [1] 15260
df_crime %>% dim
## [1] 15260     7
#they should match
#df_crime %>% View()
df_crime[sample(1:total1,5),] %>% kable
ID crime period date lag_crime long lat
9642 50 0 89 2015-05-01 0.5000000 -53.63526 -27.05875
13961 9 0 129 2018-09-01 0.0000000 -49.88526 -28.35779
9982 63 0 92 2015-08-01 0.1666667 -49.73526 -27.05875
12499 73 0 115 2017-07-01 0.5000000 -51.68526 -26.79894
2980 37 8 28 2010-04-01 1.3333333 -48.53526 -27.57836
missmap(df_crime)

#moran's I
#there is spatial correlation in homicide
var2 = list_poly[[1]]$crime
moran.test(var2,W)
## 
##  Moran I test under randomisation
## 
## data:  var2  
## weights: W    
## 
## Moran I statistic standard deviate = 2.4378, p-value = 0.007388
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.118433931      -0.009259259       0.002743691

This is a Moran’s I test for homicides for one month. Note the p-value, there is spatial correlation in homicide.

The following function creates a ggplot for nº of homicides and the spatial lag of homicides.

#plot graph with vec1 and spatial lag of vec1
scatLag = function(vec1,W){
    vec1lag = lag.listw(W,vec1) 
    h2.df = as.data.frame(cbind(vec1,vec1lag))  
    plot = #scatterplot x=vec, y=lagged vec1
      ggplot(data=h2.df, aes(y=vec1lag, x=vec1)) + 
      geom_point(size=.5,alpha=0.5)+
      coord_equal()+ 
      ggtitle("scatterplot x=homicide,y=lagged homicide")+
      xlab("homicide") + ylab("lagged homicide")
    plot
  }
scatLag(var2,W)

The graph is not very informative. But it is a visual way to identify what the Moran’s I test does. We note a positive relationship between homicides and homicides in neighboring regions.

GIFs and more Maps

Here we plot how much crime there is and plot them in a GIF.

# maps ang gifs -----------------------------------------
#idea: build maps of how much crime is there each regions
#build maps for each month then plot them in a gif

hex_fortify = fortify(hex_grid) #transforms spatial polygons into data frame, 
# fortigy is important for ggplot
path_images = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images/"

The following loop creates all images for GIF.

ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
  spdf = list_poly[[ii]][,"crime"]
  spdf$piece = 1:len_hex
  crime_points = data.frame(coordinates(spdf), as.data.frame(spdf$crime))
  names(crime_points) = c('lon', 'lat', 'crime')
  crime_hex = ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
     geom_polygon(data=hex_fortify, color='black', alpha=0, size=.1,
                  aes(y=lat, x=long, group=group, fill=''))+
     geom_point(data=crime_points, aes(x=lon, y=lat, color=crime), size=6, alpha=.9)+
    scale_colour_distiller(palette = "Spectral", limits=c(0,20))+ 
    ggtitle(paste('crime', timeid[ii]))
  ggsave(paste0(path_images,"crime_",1000+ii,".jpg"), width = 20, height = 20, units = "cm",dpi = 70)
  #print(ii)
}

This code chunck creates the GIF.

getwd() #where gif will be saved
## [1] "/home/william/Dropbox/working/projects_git/Crime-SC/tests william"
list.files(path = path_images, pattern = "*.jpg", full.names = T) %>% 
  map(image_read) %>% # reads each path file
  image_join() %>% # joins image
  image_animate(fps=1) %>% # animates, can opt for number of loops
  image_write("crime_gif.gif") # write to current dir

This GIF show how number of homicides change in each hexagon, it shows how much homicide changed in each hexagon. The periods are changing in 6 month intervals.

#GIF with raster ----------------------------------------------------
#build another type of GIF now with IDW function and raster format

#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 90000)
gridded(grid) = TRUE
plot(grid)

We can’t see in this map. But there are 900000 squares in it. This is used in making a IDW (inverse distance weight) interpolation to create continuous maps.

grid %>% class
## [1] "SpatialPixels"
## attr(,"package")
## [1] "sp"
#this procedure is important to make the raster map
#idea:  make all the analysis here using a square grid instead hexagonal
path_images3 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images3"
ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
  spdf = list_poly[[ii]][,"crime"]
  #build raster with idw
  v = as.numeric(spdf$crime)
  df1 = data.frame(v) 
  coordinates(df1) = coordinates(spdf)
  proj4string(df1) = proj4string(spdf)
  names(df1@data) = c("var1")
  idw = idw(var1 ~ 1, df1, grid)
  idw.output = as.data.frame(idw)
  names(idw.output)[1:3] = c("long", "lat", "crime")
  spg = idw.output
  coordinates(spg) = ~ long + lat
  gridded(spg) = TRUE
  raster_idw = raster(spg)
  projection(raster_idw) = projection(spdf)
  
  # map with munic borders
  tm1 = tm_shape(raster_idw) + 
    tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
              n = 4, palette = "Spectral", legend.show = TRUE)+
    tm_shape(munic)+
    tm_borders(alpha=.2, col = "black", lwd=.1)+   
    tm_layout(legend.text.size = .5,
              legend.position = c("left","bottom"),
              title=timeid[ii],
              title.size=.8,
              title.position = c(0.25,.3))
  tmap_save(tm = tm1, 
            filename = paste0(path_images3,"/crime_",1000+ii,".jpg"),
            width = 3, height = 3)

  #print(ii)  #just to follow progress
}
#to build GIF for munic raster
getwd()
list.files(path = path_images3, pattern = "*.jpg", full.names = T) %>% 
  map(image_read) %>% # reads each path file
  image_join() %>% # joins image
  image_animate(fps=1) %>% # animates, can opt for number of loops
  image_write("crime_gif_raster.gif") # write to current dir

This GIF is another way to see how homicides change in time. The intervals are of 6 months. Using raster and a continuous representation is better than using a discrete point representation, as in the previous GIF.

#prepare data frame for modelling --------------------------------
#build lag variables and prepare data frame for modelling
library(dplyr)
df_temp1 = df_crime[,c("period", "crime", "lag_crime")] #select variables of interest
df_temp1[sample(1:total1,10),] %>% kable()
period crime lag_crime
555 6 0 0.2000000
14183 131 0 0.2500000
4469 41 0 2.3333333
15180 140 0 0.0000000
9359 86 0 0.0000000
10784 99 0 0.2500000
10730 99 0 0.8333333
7789 72 0 0.5000000
9991 92 0 0.0000000
9097 84 0 0.5000000
#the following loop will make temporal lags as new features
jj = 1
df_crime2 = df_crime
for (jj in 1:12) {
  #create columns with spatial and temporal lags of rows
  #just a filler with NAs for rbind later
  toPut = data.frame(matrix(NA,nrow=len_hex*jj,ncol=dim(df_temp1)[2]))
  names(toPut) = names(df_temp1)
  df_temp2 = df_temp1[1:(total1-len_hex*jj),]
  df_temp3 = rbind(toPut,df_temp2)
  df_temp4 = df_temp3 
  names(df_temp4) = c(paste0("period_",jj), paste0("crime_",jj), 
                      paste0("lag_crime_",jj))
  df_crime2 = cbind(df_crime2, df_temp4)
  #df_crime2 %>% dim
  #print(jj)
}
df_crime2[sample(1:total1,3),] %>% kable
ID crime period date lag_crime long lat period_1 crime_1 lag_crime_1 period_2 crime_2 lag_crime_2 period_3 crime_3 lag_crime_3 period_4 crime_4 lag_crime_4 period_5 crime_5 lag_crime_5 period_6 crime_6 lag_crime_6 period_7 crime_7 lag_crime_7 period_8 crime_8 lag_crime_8 period_9 crime_9 lag_crime_9 period_10 crime_10 lag_crime_10 period_11 crime_11 lag_crime_11 period_12 crime_12 lag_crime_12
14154 93 0 130 2018-10-01 0.3333333 -50.33526 -26.53913 129 0 0.1666667 128 0 0.0000000 127 0 0.0 126 0 0.1666667 125 0 0.0 124 0 0.0 123 0 0.5000000 122 0 0.1666667 121 0 0.0000000 120 0 0.0000000 119 0 0.5 118 1 0.1666667
1047 66 1 10 2008-10-01 1.4000000 -48.83526 -27.05875 9 0 2.2000000 8 1 3.8000000 7 1 1.4 6 3 1.8000000 5 1 1.6 4 1 1.4 3 2 4.6000000 2 1 3.6000000 1 0 2.0000000 NA NA NA NA NA NA NA NA NA
7067 91 0 65 2013-05-01 0.3333333 -50.93526 -26.53913 64 0 0.3333333 63 0 0.1666667 62 0 0.0 61 0 0.5000000 60 0 0.5 59 0 0.0 58 0 0.1666667 57 0 0.0000000 56 0 0.6666667 55 0 0.3333333 54 0 0.0 53 0 0.8333333

Note that there are some missing values in the beginning of the data frame. This happens because we created columns that are lagged values of the table.

missmap(df_crime2)

#drop missing lines
mis1 = apply(df_crime2,1,function(x){anyNA(x)})
#they should match
12*len_hex; sum(mis1)
## [1] 1308
## [1] 1308

We exclude all lines that have missing. Note that the missmap doesn’t have any missing.

df_crime3 = df_crime2[!mis1,]
df_crime3 %>% dim
## [1] 13952    43
missmap(df_crime3) #there are no more missing

anyNA(df_crime3)
## [1] FALSE
df_crime3 %>% names
##  [1] "ID"           "crime"        "period"       "date"         "lag_crime"   
##  [6] "long"         "lat"          "period_1"     "crime_1"      "lag_crime_1" 
## [11] "period_2"     "crime_2"      "lag_crime_2"  "period_3"     "crime_3"     
## [16] "lag_crime_3"  "period_4"     "crime_4"      "lag_crime_4"  "period_5"    
## [21] "crime_5"      "lag_crime_5"  "period_6"     "crime_6"      "lag_crime_6" 
## [26] "period_7"     "crime_7"      "lag_crime_7"  "period_8"     "crime_8"     
## [31] "lag_crime_8"  "period_9"     "crime_9"      "lag_crime_9"  "period_10"   
## [36] "crime_10"     "lag_crime_10" "period_11"    "crime_11"     "lag_crime_11"
## [41] "period_12"    "crime_12"     "lag_crime_12"

Random Forest

#tree models --------------------------------------------

#ideas
#one of the advantages of using a random forest 
#algorithm is that we do not assume any form of 
#distribution compared for example to the log gaussian 
#cox process which assumes a Poisson distribution for 
#the point process.

#for our dataset we have 14000 observations
#we have around 100 regions and time is in months
#for future work, we could change
#more, or less, regions. a larger period.
#trimester, or semester, for example. 

#instead of using hexagons, we could just use the municipality
#borders as polygons. It makes much more sense.
#there are 293 municipalities in Santa Catarina.
#there will be many more rows in the table.
df_crime3[sample(1:10000,3),] %>% kable
ID crime period date lag_crime long lat period_1 crime_1 lag_crime_1 period_2 crime_2 lag_crime_2 period_3 crime_3 lag_crime_3 period_4 crime_4 lag_crime_4 period_5 crime_5 lag_crime_5 period_6 crime_6 lag_crime_6 period_7 crime_7 lag_crime_7 period_8 crime_8 lag_crime_8 period_9 crime_9 lag_crime_9 period_10 crime_10 lag_crime_10 period_11 crime_11 lag_crime_11 period_12 crime_12 lag_crime_12
4312 61 0 40 2011-04-01 0.3333333 -50.33526 -27.05875 39 0 0.1666667 38 0 0.1666667 37 0 0.0000000 36 0 0.0000000 35 0 0.3333333 34 2 0.1666667 33 0 0.0000000 32 0 0.0000000 31 0 0.1666667 30 1 0.0000000 29 0 0.1666667 28 0 0.1666667
4515 46 0 42 2011-06-01 0.0000000 -49.58526 -27.31856 41 0 0.6666667 40 2 0.1666667 39 1 0.0000000 38 0 0.0000000 37 0 0.0000000 36 0 0.0000000 35 0 0.1666667 34 1 0.0000000 33 0 0.3333333 32 0 0.5000000 31 0 0.0000000 30 0 0.1666667
9297 32 1 86 2015-02-01 0.1666667 -50.03526 -27.57836 85 0 0.5000000 84 0 0.6666667 83 0 0.1666667 82 0 0.6666667 81 0 0.0000000 80 0 0.1666667 79 0 0.0000000 78 0 0.3333333 77 1 0.5000000 76 0 0.1666667 75 0 1.0000000 74 0 0.3333333
#for a first analysis we'll use only a subset of variables
data = df_crime3[, c("crime", 
                            "crime_1","lag_crime_1",
                            "crime_2","lag_crime_2",
                            "crime_3",
                            "crime_4",
                            "crime_5",
                            "crime_6",
                            "crime_7",
                            "crime_8",
                            "crime_9",
                            "crime_10",
                            "crime_11",
                            "crime_12",
                            "long", "lat")]

The variable crime is what we predict. crime_1 is the temporal lag for crime. lag_crime_1 is the spatial lag for crime_1. So lag_crime_1 is the mean of the neighbors for a region but one period prior. crime_2 and lag_crime_2 are two temporal lags for crime and spatial lag of crime.

hist(data$crime, breaks=50)

This number is important for the division of training and test datasets.

12*109 #one year of forecasting
## [1] 1308
len_hex #number of hexagons in SC
## [1] 109
len_hex*12
## [1] 1308
set.seed(1)
nrow = nrow(data)
train = 1:(nrow-len_hex*12)
data %>% names %>% kable
x
crime
crime_1
lag_crime_1
crime_2
lag_crime_2
crime_3
crime_4
crime_5
crime_6
crime_7
crime_8
crime_9
crime_10
crime_11
crime_12
long
lat
data %>% names %>% length #27 variables in data frame
## [1] 17
data %>% names %>% length %>% sqrt() #input in mtry
## [1] 4.123106

Create a dataset for training and test. We use a spatiotemporal model, so it is not a cross-section. Hence the division between training and test datasets has a temporal aspect.

trainX = data[train,]
trainY = data$crime[train]
testX = data[-train,]
testY = data$crime[-train]

We run the random forest model. Note that we use 100 trees in boostraping.

m1 = randomForest(crime~.,trainX,
                  mtry=5,importance=TRUE, ntree=300)
m1
## 
## Call:
##  randomForest(formula = crime ~ ., data = trainX, mtry = 5, importance = TRUE,      ntree = 300) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.8378361
##                     % Var explained: 72.91

Export R objects to be used in another article.

saveRDS(trainX, file = paste0(pathExport,"/trainX.rds")) 
saveRDS(trainY, file = paste0(pathExport,"/trainY.rds")) 
saveRDS(testX, file = paste0(pathExport,"/testX.rds")) 
saveRDS(testY, file = paste0(pathExport,"/testY.rds")) 
saveRDS(m1, file = paste0(pathExport,"/modelRF.rds")) 

This is the MSE for training set.

yh1 = predict(m1,trainX)
mean((trainY - yh1)^2)
## [1] 0.1965913

This is the MSE for test set.

yh = predict(m1,testX)
mean((testY - yh)^2)
## [1] 0.7008546
plot(yh, testY)
abline(0,1)

#importance(m1)
varImpPlot(m1)

This is an importance plot. This plot shows which variables are more important. lag_crime_1,long,crime_11,crime_9 are the most relevant. That is the spatially lagged values on with one temporal lag is the most important variable to explain crime one period ahead. Longitude is important because the biggest cities of Santa Catarina are on the coast in the west. Number of homicides 11 and 9 periods prior is imporant maybe this is a cyclical effect. For it to make sense it should be 12 lagged values so that it would indicate some yearly cycle.

Prediction

Here we plot maps showing the prediction and the ture values.

Prediction for 01/01/2018

#plot of predict for 180901 -------------------------
#prepare a data frame for ploting of results 
#predict Data Frame
yh[1:4] #vector of predictions
##     13953     13954     13955     13956 
## 0.4598918 0.1214090 0.5837738 0.4330119

Here we plot residuals for some predictions.

12 #number of intervals to choose
## [1] 12
len_hex #number of hexagons in SC
## [1] 109

We plot the prediction for the first month of the test set.

period1 = 1
init1 = len_hex*(period1-1)+1 
inter1 = (init1:(init1+len_hex-1)) #interval to extract from yh
preDf = yh[inter1] 
#they should be the same as 
preDf %>% length; len_hex
## [1] 109
## [1] 109
#plot prediction and residuals 
#present prediction and true values

#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions
v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("predict ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))

Here we prepare data for plot of actual value of homicides.

#plot of actual for 180901
period1
## [1] 1
#build map
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
tm2 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("true ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))
current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tm3

tmap_save(tm = tm3, 
          filename = paste0("crime_","comp1",".jpg"),
          width = 3, height = 3)

Prediction for 01/06/2019

#plot of predict for 190601 -------------------------
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1] 
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)

tm11 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("predict ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))

#build map
period1
## [1] 10
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm21 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("true ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))

current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm11,tm21)
tmap_mode(current.mode)

tmap_save(tm = tm3, 
          filename = paste0("crime_","comp2",".jpg"),
          width = 3, height = 3)

tm3

Residuals

#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
saveRDS(pol, file = paste0(pathExport,"/pol.rds")) #export to use in another article
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE

period1 = 1
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1] 

period1
## [1] 1
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime 
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) + 
  tm_raster("crime", style='quantile',
            n=7 ,palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .3,
            legend.position = c("left","bottom"),
            title= paste0("residuals ",timeid[period1+(140-12)]),
            title.size=.6,
            title.position = c(0.25,.3))


moran.test(v,W)
## 
##  Moran I test under randomisation
## 
## data:  v  
## weights: W    
## 
## Moran I statistic standard deviate = -1.4624, p-value = 0.9282
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.095835052      -0.009259259       0.003504995
#for period 1 of prediction, there is no spatial correlation in the residuals

#now, for period 10
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1] 

ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime 
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm2 = tm_shape(raster_idw) + 
  tm_raster("crime", style='quantile',
            n=7 ,palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .3,
            legend.position = c("left","bottom"),
            title= paste0("residuals ",timeid[period1+(140-12)]),
            title.size=.6,
            title.position = c(0.25,.3))


moran.test(v,W)
## 
##  Moran I test under randomisation
## 
## data:  v  
## weights: W    
## 
## Moran I statistic standard deviate = -1.4845, p-value = 0.9312
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.092495433      -0.009259259       0.003143964
#for period 10 of prediction, there is no spatial correlation in the residuals

current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tmap_save(tm = tm3, 
          filename = paste0("crime_","res",".jpg"),
          width = 3, height = 3)
tm3

# end -------------------------------------

Concluding Remarks

Here we try to predict number of homicides in each polygon. For another article we can predict the number of homicide in each municipality. We can incorporate data about population, GDP per capita and inequality. We could find other crime variables. We can study larger periods, for example, trimester or semester.

Given that the data have coordinates we can study inside city occurances. Model phenomena in the street level.

It is relevant to find a way to interpret how variables (features) affect our variable of interest, \(y\). In a future article I’ll build a Shap like method to intrepret outcomes of black-box like models.