by William Suzuki and Ian Fukushima

The objective of this article is to explore and model the dataset of homicides in the state of Santa Catarina. After request, the state’s security bureau of Santa Catarina gave us the data. Ian Fukushima, my colleague, inspired me to make the maps and analysis of this data, I’m grateful for his support.

Load, Explore and Transform

Let’s start by loading packages:

#preliminaries -----------------------------------------
library(knitr) #for kable, stylized table in r markdown
library(tidyverse) #for pipe
library(dplyr) #select() function
library(sp)
library(spdep)
library(rgdal) 
library(Amelia) #missmap()
library(ggmap) 
library(rgeos) #gUnaryUnion
library(xts) #extended time series
library(magick) # to make gifs
library(randomForest)
library(gstat) #idw function
library(tmap) #to use tmap function
library(randomForest)
library(grid) #to plot two tmaps together
library(raster) #raster()

Include some paths to save images. You must change those paths for your machine

path1 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/"
path_root = "/home/william/Dropbox/working/projects_git/Crime-SC/"
path_data = "/home/william/Dropbox/working/projects_git/Crime-SC/Dados"
pathExport = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/export"

load.csv() will load the data. Ian used Google Maps’ API to find the geographic coordinates. The raw data that came only with the address of each homicide incident.

#prepare base----------------------------------------
#load base
crime = read.csv(file=paste0(path_data,"/df_sc_crime_geocoded.csv"), header=TRUE, sep=",", fileEncoding="latin1")

crime %>% class

## [1] "data.frame"

The data have 10745 incidents of homicide.

crime %>% dim()

## [1] 10745    25

The following are the variables. We are going to use only "Data", "lat","long".

crime %>% names()

##  [1] "X"                                 "Crime"                            
##  [3] "Data"                              "Hora"                             
##  [5] "ID"                                "Endereço"                         
##  [7] "Município"                         "Bairro"                           
##  [9] "Tipo.Logradouro"                   "Logradouro"                       
## [11] "Número"                            "Tipo.de.local"                    
## [13] "Localidade"                        "PosiTime"                         
## [15] "lat"                               "long"                             
## [17] "accuracy"                          "formatted_address"                
## [19] "address_type"                      "status"                           
## [21] "index"                             "MOTIVAÇÃO"                        
## [23] "VIOLÊNCIA.DOMÉSTICA"               "MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER"
## [25] "ESTUPRO..ANTECEDENTE"

Here is the summary of our data. We are going to use "Data", "lat","long". But variables as "Municipio","Número","Endereço","Bairro" are important when using Google’s API for geocoding.

summary(crime)

##        X                                                       Crime     
##  Min.   :    1   HOMICÍDIO                                        :9162  
##  1st Qu.: 2687   LATROCÍNIO                                       : 665  
##  Median : 5373   PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO     : 619  
##  Mean   : 5373   LESÃO CORPORAL SEGUIDA DE MORTE                  : 170  
##  3rd Qu.: 8059   PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO       :  63  
##  Max.   :10745   PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO:  50  
##                  (Other)                                          :  16  
##         Data            Hora            ID       
##  6/12/2015:   12   21:00  : 638   Min.   :    1  
##  13/3/2010:   11   23:00  : 604   1st Qu.: 2687  
##  13/5/2018:   11   22:00  : 579   Median : 5373  
##  31/3/2018:   11   20:00  : 530   Mean   : 5373  
##  1/1/2013 :   10   0:00   : 512   3rd Qu.: 8059  
##  11/7/2010:   10   19:00  : 438   Max.   :10745  
##  (Other)  :10680   (Other):7444                  
##                                                     Endereço    
##    , JOINVILLE, Santa Catarina, BRASIL                  :   50  
##    , FLORIANÓPOLIS, Santa Catarina, BRASIL              :   22  
##  RUA SEIS DE JANEIRO , JOINVILLE, Santa Catarina, BRASIL:   21  
##    , SÃO JOSÉ, Santa Catarina, BRASIL                   :   15  
##    , ITAJAÍ, Santa Catarina, BRASIL                     :   13  
##    , CRICIÚMA, Santa Catarina, BRASIL                   :   12  
##  (Other)                                                :10612  
##          Município               Bairro     Tipo.Logradouro
##  JOINVILLE    :1133   CENTRO        : 990   RUA     :7234  
##  FLORIANÓPOLIS:1063   INTERIOR      : 321   ESTRADA :1105  
##  ITAJAÍ       : 561                 : 216   RODOVIA : 701  
##  CHAPECÓ      : 487   RURAL         : 193   AVENIDA : 642  
##  SÃO JOSÉ     : 467   MONTE ALEGRE  : 161           : 311  
##  CRICIÚMA     : 380   PARANAGUAMIRIM: 142   SERVIDÃO: 282  
##  (Other)      :6654   (Other)       :8722   (Other) : 470  
##           Logradouro        Número                       Tipo.de.local 
##                :  322          :7686   VIA PÚBLICA              :5728  
##  GERAL         :  185   20     :  21   RESIDÊNCIA               :2956  
##  BR 101        :   71   200    :  20   OUTRO                    : 848  
##  BR 470        :   35   30     :  15   ESTABELECIMENTO COMERCIAL: 489  
##  SANTA CATARINA:   30   55     :  14   BARES E SIMILARES        : 452  
##  GETÚLIO VARGAS:   28   12     :  13   ESTABELECIMENTO PENAL    : 122  
##  (Other)       :10074   (Other):2976   (Other)                  : 150  
##                                  Localidade                  PosiTime    
##  ZONA URBANA                          :9851   2014-08-31 03:00:00:    6  
##  ZONA RURAL                           : 882   2014-09-06 22:00:00:    6  
##  NÃO INFORMADA                        :   4   ZONA URBANA        :    6  
##  \\"CANTINA PER TUTI\\",","VIA PÚBLICA:   1   2014-03-20 19:00:00:    5  
##  \\"E\\",","VIA PÚBLICA               :   1   2015-02-26 04:00:00:    5  
##  406, \\"G\\",","VIA PÚBLICA          :   1   2016-06-12 06:00:00:    5  
##  (Other)                              :   5   (Other)            :10712  
##           lat                 long                 accuracy   
##  -28.6010292:   91   -49.151029 :   91   route         :5971  
##  -26.2834213:   73   -48.8452269:   73   street_address:2136  
##  -26.6140628:   41   -53.7265491:   41   locality      : 790  
##  -27.6140791:   27   -48.6370861:   27   establishment : 781  
##  -26.0713285:   25   -48.6177384:   25   premise       : 371  
##  (Other)    :10471   (Other)    :10471   (Other)       : 679  
##  NA's       :   17   NA's       :   17   NA's          :  17  
##                                                                         formatted_address
##  Estrada Geral, Santa Catarina, Brazil                                           :   91  
##  Joinville - State of Santa Catarina, Brazil                                     :   74  
##  BR-101, Santa Catarina, Brazil                                                  :   41  
##  Estrada de Linha, Paraíso - SC, 89906-000, Brazil                               :   41  
##  São José, Santa Catarina - Barreiros, São José - State of Santa Catarina, Brazil:   27  
##  (Other)                                                                         :10454  
##  NA's                                                                            :   17  
##                                 address_type 
##  route                                :5971  
##  street_address                       :2136  
##  locality,political                   : 790  
##  premise                              : 371  
##  administrative_area_level_2,political: 321  
##  (Other)                              :1139  
##  NA's                                 :  17  
##                                                                    status     
##  OK                                                                   :10720  
##  Error - NA - Verifify                                                :   17  
##  route                                                                :    3  
##  Chapecó, State of Santa Catarina, Brazil                             :    1  
##  clothing_store,establishment,home_goods_store,point_of_interest,store:    1  
##  establishment,point_of_interest                                      :    1  
##  (Other)                                                              :    2  
##      index                   MOTIVAÇÃO       VIOLÊNCIA.DOMÉSTICA
##  OK     :    5   NÃO INFORMADA    :3480   NÃO          :10178   
##  1      :    1   DESAVENÇA        :2853   NÃO INFORMADO:    1   
##  10     :    1   TRÁFICO DE DROGAS:1726   SIM          :  566   
##  100    :    1   PASSIONAL        : 846                         
##  1000   :    1   AÇÃO POLICIAL    : 745                         
##  10000  :    1   ROUBO            : 655                         
##  (Other):10735   (Other)          : 440                         
##  MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER    ESTUPRO..ANTECEDENTE
##  NÃO          :4489                NÃO          :10559    
##  NÃO INFORMADO:6232                NÃO INFORMADO:  160    
##  SIM          :  24                SIM          :   26    
##                                                           
##                                                           
##                                                           
##

kable(crime[1:3,], caption="original data of homicides") #shows stylized table of data frame

original data of homicides
X	Crime	Data	Hora	ID	Endereço	Município	Bairro	Tipo.Logradouro	Logradouro	Número	Tipo.de.local	Localidade	PosiTime	lat	long	accuracy	formatted_address	address_type	status	index	MOTIVAÇÃO	VIOLÊNCIA.DOMÉSTICA	MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER	ESTUPRO..ANTECEDENTE
1	HOMICÍDIO	1/1/2008	0:00	1	RUA MONTEVIDÉU , CHAPECÓ, Santa Catarina, BRASIL	CHAPECÓ	FORTES	RUA	MONTEVIDÉU		VIA PÚBLICA	ZONA URBANA	2008-01-01 00:00:00	-27.0996274	-52.5998438	route	R. Montevidéu, Chapecó - SC, Brazil	route	OK	1	TRÁFICO DE DROGAS	NÃO	NÃO INFORMADO	NÃO
2	HOMICÍDIO	1/1/2008	0:00	2	RUA PEGASUS 187, JOINVILLE, Santa Catarina, BRASIL	JOINVILLE	JARDIM PARAÍSO	RUA	PEGASUS	187	OUTRO	ZONA URBANA	2008-01-01 00:00:00	-26.2144603	-48.8141942	route	R. Pegasus - Jardim Paraiso, Joinville - SC, 89240-000, Brazil	route	OK	2	TRÁFICO DE DROGAS	NÃO	NÃO INFORMADO	NÃO
3	HOMICÍDIO	1/1/2008	6:00	3	AVENIDA AMÂNDIO CABRAL , BALNEÁRIO BARRA DO SUL, Santa Catarina, BRASIL	BALNEÁRIO BARRA DO SUL	CENTRO	AVENIDA	AMÂNDIO CABRAL		VIA PÚBLICA	ZONA URBANA	2008-01-01 06:00:00	-26.4594161	-48.6030489	route	Av. Amândio Cabral, Balneário Barra do Sul - SC, 89247-000, Brazil	route	OK	3	TRÁFICO DE DROGAS	NÃO	NÃO INFORMADO	NÃO

Many variables came in factor format. We transform all variables that are factor into character.

#transform all variables that are factor into character
crime2 = crime
len1 = length(names(crime2)) 
ii = 1
for (ii in 1:len1) {
  vv = crime[,ii]
  if (is.factor(vv)) {
    vv1 <- as.character(vv)
    crime2[,ii] = vv1
  }
}

#X and ID show the same thing
identical(crime2$X,crime2$ID)

## [1] TRUE

#years and days, and hours of crime
crime2$Data %>% class

## [1] "character"

#we can aggregate the data into months or trimesters

Take a look at the types of homicide. We’ll consider all of them as just homicide.

kable(crime2$Crime %>% table)

.	Freq
HOMICÍDIO	9162
INFANTICÍDIO	8
LATROCÍNIO	665
LESÃO CORPORAL SEGUIDA DE MORTE	170
PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO	63
PESSOA MORTA POR POLICIAL CIVIL FORA DE SERVIÇO	8
PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO	619
PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO	50

There are only some lines that are missing, but for variables that we don’t use.

#map of missing in crime2 from Amelia
missmap(crime2)

#transform lat and long into numeric
crime2$lat=as.numeric(crime2$lat) 
crime2$long=as.numeric(crime2$long)

#there are a few missings, 25 and 19
is.na(crime2$lat) %>% sum

## [1] 25

is.na(crime2$long) %>% sum

## [1] 19

#drop places with missing lat and long
mis_lat =  is.na(crime2$lat)
mis_lon = is.na(crime2$lat)
crime3 = crime2[!(mis_lat|mis_lon),]
crime3 %>% dim()

## [1] 10720    25

#we drop 25 observations

There are some outliers for latitude, for example, -27027.

summary(crime3$lat)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -27027.00    -27.59    -27.05    -34.48    -26.74     45.14

#there are some outliers in lat and long
#proportion of points that does not make sense
summary(crime3$long)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -119.70  -49.58  -48.83  -49.40  -48.65   35.37

len2 = dim(crime3)[1]
len2

## [1] 10720

#10720 number of lines in data

sum(crime3$long < -54)

## [1] 8

#8 number of outliers
sum(crime3$long < -54)/len2

## [1] 0.0007462687

#0.0007462687
#-54 long is more than the most western frontiers of Santa Catarina
#-53.8376609 most west of SC
#-47.2330741 most east of SC

sum(crime3$long > -47)

## [1] 63

#latitude explore outliers

summary(crime3$lat)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -27027.00    -27.59    -27.05    -34.48    -26.74     45.14

#look at google maps to see which values makes sense
# -25.3345494 most north of SC
# -29.5 most south of SC

Maps

We use function get_map() from ggmap to get a “screen shot” from Google maps.

# maps ------------------------------------
#google maps
map = get_map(location=c(-53.9, -29.5 ,-47,-25.33),maptype = 'roadmap')
plot(map)

#transform crime3 into %>% points data.frame
coordinates(crime3) = ~long+lat
plot(crime3$long,crime3$lat )

#plot of coordinates doesn't make sense because of outliers
plot(crime3)

Those strange results are important to be shown. Those are some examples of problems that we face using real world data.

crime3 %>% class

## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

The projection of a spatial object is important. Below we see that crime3 doesn’t have a projection.

crime3@proj4string #there is no projection

## CRS arguments: NA

Here we give a projection for crime3

crime3@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(crime3) #there are some outliers in the west, it doesn't make sense

points_occur = data.frame(lon=coordinates(crime3)[,1],
                           lat=coordinates(crime3)[,2])

(crime_sc1=ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
  #geom_polygon(aes(x=long, y=lat, group=group), 
  #             fill='grey', size=.2, color='blue', data=, alpha=0)+
  geom_point(data=points_occur, aes(x=lon, y=lat), size=.5, color='red', alpha=.5)
)

#save image to disk
ggsave("crime_sc1.jpg", width = 20, height = 20, units = "cm",dpi = 600)

Municipal Borders

#import data of municipalities:
munic = readOGR("~/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer="municip07")

## OGR data source with driver: ESRI Shapefile 
## Source: "/home/william/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer: "municip07"
## with 5561 features
## It has 3 fields

#pick up only municipalities that are in the state of Santa Catarina
state = substr(munic$CODIGOIB1,1,2)
munic=munic[state=='42',] 
munic@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(munic)

#plot map with municipaility borders and points of crime and google maps
munic_fortify = fortify(munic, region="CODIGOIB1")
munic_fortify %>% class

## [1] "data.frame"

#we first transform spatial polygons into data frame
(crime_sc2=ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
  geom_polygon(data=munic_fortify, color='black', alpha=0,
               aes(y=lat, x=long, group=group, fill='grey'))+
  geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)

#save image
ggsave("crime_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)

#we need to exclude points outside santa catarina's borders
#get outer border of Santa Catarina
munic_dissolve = gUnaryUnion(munic)
plot(munic_dissolve)

Note that munic_dissolve is a spatial polygon, then we use it to select values of crime3 which is spatial points. So this is a way to crop values in a spatial points.

#exclude outliers, we solve the problem
crime4 =  crime3[munic_dissolve,]
plot(crime4)

Now that we excluded all outliers we have a recognizable map.

Hexagon Polygons

Our raw data is a spatial point process. But to model, we analyze a discrete number of occurrences, which is how many homicides happened in a given region or area.

We use a hexagon area to count the number of homicides that happened inside of it. There are many options in this step. We could use the municipalities area or a squared grid. Just for fun, I used a hexagon gird. But in hindsight, a better option would be the municipal borders. Because it makes much more sense to predict the number of homicides in each municipality and not a general fixed area of a hexagon. And another problem with a fixed geometric grid is that some points at the borders of the state are lost because the polygons can’t reach all corners of the map of Santa Catarina.

#create hexagon polygons --------------------------------------------------
#code from:
#http://strimas.com/spatial/hexagonal-grids/#f1
size = 0.3
set.seed(123)
hex_points = spsample(munic_dissolve, type = "hexagonal", cellsize = size)
hex_points %>% class #spatial points

## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"

hex_points %>% plot #just make points in a hexagon fashion

hex_points %>% length #there are 109 points

## [1] 109

hex_grid = HexPoints2SpatialPolygons(hex_points, dx = size)
hex_grid %>% class #from spatial points to spatial polygons

## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"

plot(hex_grid)

#number of hexagons in map
len_hex = length(hex_grid) ; len_hex

## [1] 109

plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)

#to save image
jpeg("hexagon_sc.jpg")
plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)
dev.off()

## png 
##   2

hex_fortify = fortify(hex_grid)
(hex_sc2=ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
  geom_polygon(data=hex_fortify, color='black', alpha=0,
               aes(y=lat, x=long, group=group, fill='grey'))+
  geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)

Note that there are some points that are not inside any hexagon. Those points will not be accounted in the final data frame. This is a drawback in using hexagon polygons.

ggsave("hex_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)

#another view of the data
plot(hex_grid)
plot(crime4, col="red" , add=TRUE)

#prepare spatial polygons data frame
crime4$count = 1
names(crime4)[24] #"count"

## [1] "count"

sp1 = crime4[,24] #take just the "count" variable
sp1 %>% class

## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

res = over(hex_grid, sp1 , fn=sum) #count nº of homicides in each hexagon
res %>% dim

## [1] 109   1

is.na(res$count) %>% sum

## [1] 5

res$count[is.na(res$count)] = 0 #hexagons with no crime have 0 crimes

hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:length(hex_grid)), match.ID = FALSE)
hex_sp$crime = res$count
#plot of crime for all periods
spplot(hex_sp,c("crime"))

This map shows for each hexagon how many homicides happened. The numbers shown are for all periods.

Prepare Data Frame

#separate for each month ------------------------------------
#the previous maps was built with all periods
#now we need to separate the base for different months
crime5 = crime4 #rename to manipulate
crime5$Data = as.Date(crime5$Data,format="%d/%m/%Y")
summary(crime5$Data)

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2008-01-01" "2011-02-09" "2014-03-16" "2014-01-01" "2016-12-06" "2019-08-22"

#data max 22/08/2019 almost at the end of the period
#but I expect that this month the prediction will be bad
crime5$Data %>% class

## [1] "Date"

crime5$Data[sample(1:10000,5)]

## [1] "2017-08-21" "2011-07-06" "2010-03-14" "2018-03-03" "2011-12-27"

#create only month data
crime5$Date2 =  paste0(substr(as.character(crime5$Data),1,7),"-01") 
crime5$Date2[sample(1:10000,5)] %>% kable

x
2013-08-01
2015-10-01
2018-09-01
2011-04-01
2014-01-01

crime5$Date2 %>% class

## [1] "character"

crime5$Date2 = as.Date(crime5$Date2,"%Y-%m-%d")
crime5$Date2 %>% class

## [1] "Date"

kable(crime5$Date2[sample(1:10000,5)])

x
2018-01-01
2018-01-01
2011-05-01
2015-02-01
2011-01-01

#there are specially few homicides in august 2019, because it is not the full month
crime_c1 = as.numeric(table(crime5$Date2)) 
timeid = unique(crime5$Date2) #take only unique dates
timeid[1:5]

## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"

length(timeid) #there are 140 months

## [1] 140

crime_ts = xts(crime_c1, order.by=timeid)

jpeg("time_crime.jpg", width = 800, height = 300)
plot(crime_ts)
dev.off()

## png 
##   2

#idea: build a spatial VAR model with time series of crimes

This is the time series for number of homicides in Santa Catarina since January 2008 till August 2019. There seems to be a seasonal behaviour in the series. In our dataset, August ends in the day 22, so the number of homicides is subreported.

#need to build a data frame with each line a hexagon-region for each month
#create neighboors link
nb1 = poly2nb(hex_sp, queen=T)
#connections between hexagon-regions
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)

This map shows the hexagon polygons and the links. Those links are used to determine the neighborhood and spatially lagged values.

jpeg("links.jpg", width = 1000, height = 1000)
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)
dev.off()

## png 
##   2

W = nb2listw(nb1, glist=NULL, style ="W") #normlize weights to 1

#create list with spatial data frames, with each month
df_crime = data.frame() #general data frame for models
list_poly = list() #list to input spatial polygon data frame
len_hex = length(hex_grid) #variable that contain length of spatial dataframe

ii = 1 #just a test
for (ii in 1:length(timeid)) {
  var1 = crime5$Date2 == timeid[ii] #obs of interest
  crime6 = crime5[var1,]

  res = over(hex_grid, crime6[,24] , fn=sum)
  res$count[is.na(res$count)] = 0
  hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:len_hex), match.ID = FALSE)
  hex_sp$crime = res$count
  hex_sp$period = ii
  hex_sp$date = timeid[ii]
  hex_sp$lag_crime = lag.listw(W,hex_sp$crime)
  hex_sp$long =  coordinates(hex_sp)[,1]
  hex_sp$lat =  coordinates(hex_sp)[,2]
  list_poly[[ii]] = hex_sp #input the spatial polygons in the list
  names(list_poly)[ii] = ii #input name in the list for the spatial polygon 
  df_crime = rbind(df_crime,hex_sp@data)
  #print(ii) #just to see progress
}

total1=length(timeid)*len_hex; total1

## [1] 15260

df_crime %>% dim

## [1] 15260     7

#they should match
#df_crime %>% View()
df_crime[sample(1:total1,5),] %>% kable

	ID	crime	period	date	lag_crime	long	lat
9642	50	0	89	2015-05-01	0.5000000	-53.63526	-27.05875
13961	9	0	129	2018-09-01	0.0000000	-49.88526	-28.35779
9982	63	0	92	2015-08-01	0.1666667	-49.73526	-27.05875
12499	73	0	115	2017-07-01	0.5000000	-51.68526	-26.79894
2980	37	8	28	2010-04-01	1.3333333	-48.53526	-27.57836

missmap(df_crime)

#moran's I
#there is spatial correlation in homicide
var2 = list_poly[[1]]$crime
moran.test(var2,W)

## 
##  Moran I test under randomisation
## 
## data:  var2  
## weights: W    
## 
## Moran I statistic standard deviate = 2.4378, p-value = 0.007388
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.118433931      -0.009259259       0.002743691

This is a Moran’s I test for homicides for one month. Note the p-value, there is spatial correlation in homicide.

The following function creates a ggplot for nº of homicides and the spatial lag of homicides.

#plot graph with vec1 and spatial lag of vec1
scatLag = function(vec1,W){
    vec1lag = lag.listw(W,vec1) 
    h2.df = as.data.frame(cbind(vec1,vec1lag))  
    plot = #scatterplot x=vec, y=lagged vec1
      ggplot(data=h2.df, aes(y=vec1lag, x=vec1)) + 
      geom_point(size=.5,alpha=0.5)+
      coord_equal()+ 
      ggtitle("scatterplot x=homicide,y=lagged homicide")+
      xlab("homicide") + ylab("lagged homicide")
    plot
  }
scatLag(var2,W)

The graph is not very informative. But it is a visual way to identify what the Moran’s I test does. We note a positive relationship between homicides and homicides in neighboring regions.

GIFs and more Maps

Here we plot how much crime there is and plot them in a GIF.

# maps ang gifs -----------------------------------------
#idea: build maps of how much crime is there each regions
#build maps for each month then plot them in a gif

hex_fortify = fortify(hex_grid) #transforms spatial polygons into data frame, 
# fortigy is important for ggplot
path_images = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images/"

The following loop creates all images for GIF.

ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
  spdf = list_poly[[ii]][,"crime"]
  spdf$piece = 1:len_hex
  crime_points = data.frame(coordinates(spdf), as.data.frame(spdf$crime))
  names(crime_points) = c('lon', 'lat', 'crime')
  crime_hex = ggmap(map,  ylab = "Latitude", xlab = "Longitude") +
     geom_polygon(data=hex_fortify, color='black', alpha=0, size=.1,
                  aes(y=lat, x=long, group=group, fill=''))+
     geom_point(data=crime_points, aes(x=lon, y=lat, color=crime), size=6, alpha=.9)+
    scale_colour_distiller(palette = "Spectral", limits=c(0,20))+ 
    ggtitle(paste('crime', timeid[ii]))
  ggsave(paste0(path_images,"crime_",1000+ii,".jpg"), width = 20, height = 20, units = "cm",dpi = 70)
  #print(ii)
}

This code chunck creates the GIF.

getwd() #where gif will be saved

## [1] "/home/william/Dropbox/working/projects_git/Crime-SC/tests william"

list.files(path = path_images, pattern = "*.jpg", full.names = T) %>% 
  map(image_read) %>% # reads each path file
  image_join() %>% # joins image
  image_animate(fps=1) %>% # animates, can opt for number of loops
  image_write("crime_gif.gif") # write to current dir

This GIF show how number of homicides change in each hexagon, it shows how much homicide changed in each hexagon. The periods are changing in 6 month intervals.

#GIF with raster ----------------------------------------------------
#build another type of GIF now with IDW function and raster format

#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 90000)
gridded(grid) = TRUE
plot(grid)

We can’t see in this map. But there are 900000 squares in it. This is used in making a IDW (inverse distance weight) interpolation to create continuous maps.

grid %>% class

## [1] "SpatialPixels"
## attr(,"package")
## [1] "sp"

#this procedure is important to make the raster map
#idea:  make all the analysis here using a square grid instead hexagonal

path_images3 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images3"
ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
  spdf = list_poly[[ii]][,"crime"]
  #build raster with idw
  v = as.numeric(spdf$crime)
  df1 = data.frame(v) 
  coordinates(df1) = coordinates(spdf)
  proj4string(df1) = proj4string(spdf)
  names(df1@data) = c("var1")
  idw = idw(var1 ~ 1, df1, grid)
  idw.output = as.data.frame(idw)
  names(idw.output)[1:3] = c("long", "lat", "crime")
  spg = idw.output
  coordinates(spg) = ~ long + lat
  gridded(spg) = TRUE
  raster_idw = raster(spg)
  projection(raster_idw) = projection(spdf)
  
  # map with munic borders
  tm1 = tm_shape(raster_idw) + 
    tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
              n = 4, palette = "Spectral", legend.show = TRUE)+
    tm_shape(munic)+
    tm_borders(alpha=.2, col = "black", lwd=.1)+   
    tm_layout(legend.text.size = .5,
              legend.position = c("left","bottom"),
              title=timeid[ii],
              title.size=.8,
              title.position = c(0.25,.3))
  tmap_save(tm = tm1, 
            filename = paste0(path_images3,"/crime_",1000+ii,".jpg"),
            width = 3, height = 3)

  #print(ii)  #just to follow progress
}

#to build GIF for munic raster
getwd()
list.files(path = path_images3, pattern = "*.jpg", full.names = T) %>% 
  map(image_read) %>% # reads each path file
  image_join() %>% # joins image
  image_animate(fps=1) %>% # animates, can opt for number of loops
  image_write("crime_gif_raster.gif") # write to current dir

This GIF is another way to see how homicides change in time. The intervals are of 6 months. Using raster and a continuous representation is better than using a discrete point representation, as in the previous GIF.

#prepare data frame for modelling --------------------------------
#build lag variables and prepare data frame for modelling
library(dplyr)
df_temp1 = df_crime[,c("period", "crime", "lag_crime")] #select variables of interest
df_temp1[sample(1:total1,10),] %>% kable()

	period	lag_crime
555	6	0.2000000
14183	131	0.2500000
4469	41	2.3333333
15180	140	0.0000000
9359	86	0.0000000
10784	99	0.2500000
10730	99	0.8333333
7789	72	0.5000000
9991	92	0.0000000
9097	84	0.5000000

#the following loop will make temporal lags as new features
jj = 1
df_crime2 = df_crime
for (jj in 1:12) {
  #create columns with spatial and temporal lags of rows
  #just a filler with NAs for rbind later
  toPut = data.frame(matrix(NA,nrow=len_hex*jj,ncol=dim(df_temp1)[2]))
  names(toPut) = names(df_temp1)
  df_temp2 = df_temp1[1:(total1-len_hex*jj),]
  df_temp3 = rbind(toPut,df_temp2)
  df_temp4 = df_temp3 
  names(df_temp4) = c(paste0("period_",jj), paste0("crime_",jj), 
                      paste0("lag_crime_",jj))
  df_crime2 = cbind(df_crime2, df_temp4)
  #df_crime2 %>% dim
  #print(jj)
}

df_crime2[sample(1:total1,3),] %>% kable

	ID	crime	period	date	lag_crime	long	lat	period_1	lag_crime_1	period_2	crime_2	lag_crime_2	period_3	crime_3	lag_crime_3	period_4	crime_4	lag_crime_4	period_5	crime_5	lag_crime_5	period_6	crime_6	lag_crime_6	period_7	crime_7	lag_crime_7	period_8	crime_8	lag_crime_8	period_9	lag_crime_9	period_10	crime_10	lag_crime_10	period_11	crime_11	lag_crime_11	period_12	crime_12	lag_crime_12
14154	93	0	130	2018-10-01	0.3333333	-50.33526	-26.53913	129	0.1666667	128	0	0.0000000	127	0	0.0	126	0	0.1666667	125	0	0.0	124	0	0.0	123	0	0.5000000	122	0	0.1666667	121	0.0000000	120	0	0.0000000	119	0	0.5	118	1	0.1666667
1047	66	1	10	2008-10-01	1.4000000	-48.83526	-27.05875	9	2.2000000	8	1	3.8000000	7	1	1.4	6	3	1.8000000	5	1	1.6	4	1	1.4	3	2	4.6000000	2	1	3.6000000	1	2.0000000	NA	NA	NA	NA	NA	NA	NA	NA	NA
7067	91	0	65	2013-05-01	0.3333333	-50.93526	-26.53913	64	0.3333333	63	0	0.1666667	62	0	0.0	61	0	0.5000000	60	0	0.5	59	0	0.0	58	0	0.1666667	57	0	0.0000000	56	0.6666667	55	0	0.3333333	54	0	0.0	53	0	0.8333333

Note that there are some missing values in the beginning of the data frame. This happens because we created columns that are lagged values of the table.

missmap(df_crime2)

#drop missing lines

mis1 = apply(df_crime2,1,function(x){anyNA(x)})
#they should match
12*len_hex; sum(mis1)

## [1] 1308

## [1] 1308

We exclude all lines that have missing. Note that the missmap doesn’t have any missing.

df_crime3 = df_crime2[!mis1,]
df_crime3 %>% dim

## [1] 13952    43

missmap(df_crime3) #there are no more missing

anyNA(df_crime3)

## [1] FALSE

df_crime3 %>% names

##  [1] "ID"           "crime"        "period"       "date"         "lag_crime"   
##  [6] "long"         "lat"          "period_1"     "crime_1"      "lag_crime_1" 
## [11] "period_2"     "crime_2"      "lag_crime_2"  "period_3"     "crime_3"     
## [16] "lag_crime_3"  "period_4"     "crime_4"      "lag_crime_4"  "period_5"    
## [21] "crime_5"      "lag_crime_5"  "period_6"     "crime_6"      "lag_crime_6" 
## [26] "period_7"     "crime_7"      "lag_crime_7"  "period_8"     "crime_8"     
## [31] "lag_crime_8"  "period_9"     "crime_9"      "lag_crime_9"  "period_10"   
## [36] "crime_10"     "lag_crime_10" "period_11"    "crime_11"     "lag_crime_11"
## [41] "period_12"    "crime_12"     "lag_crime_12"

Random Forest

#tree models --------------------------------------------

#ideas
#one of the advantages of using a random forest 
#algorithm is that we do not assume any form of 
#distribution compared for example to the log gaussian 
#cox process which assumes a Poisson distribution for 
#the point process.

#for our dataset we have 14000 observations
#we have around 100 regions and time is in months
#for future work, we could change
#more, or less, regions. a larger period.
#trimester, or semester, for example. 

#instead of using hexagons, we could just use the municipality
#borders as polygons. It makes much more sense.
#there are 293 municipalities in Santa Catarina.
#there will be many more rows in the table.

df_crime3[sample(1:10000,3),] %>% kable

	ID	crime	period	date	lag_crime	long	lat	period_1	lag_crime_1	period_2	crime_2	lag_crime_2	period_3	crime_3	lag_crime_3	period_4	lag_crime_4	period_5	lag_crime_5	period_6	crime_6	lag_crime_6	period_7	lag_crime_7	period_8	crime_8	lag_crime_8	period_9	crime_9	lag_crime_9	period_10	crime_10	lag_crime_10	period_11	lag_crime_11	period_12	lag_crime_12
4312	61	0	40	2011-04-01	0.3333333	-50.33526	-27.05875	39	0.1666667	38	0	0.1666667	37	0	0.0000000	36	0.0000000	35	0.3333333	34	2	0.1666667	33	0.0000000	32	0	0.0000000	31	0	0.1666667	30	1	0.0000000	29	0.1666667	28	0.1666667
4515	46	0	42	2011-06-01	0.0000000	-49.58526	-27.31856	41	0.6666667	40	2	0.1666667	39	1	0.0000000	38	0.0000000	37	0.0000000	36	0	0.0000000	35	0.1666667	34	1	0.0000000	33	0	0.3333333	32	0	0.5000000	31	0.0000000	30	0.1666667
9297	32	1	86	2015-02-01	0.1666667	-50.03526	-27.57836	85	0.5000000	84	0	0.6666667	83	0	0.1666667	82	0.6666667	81	0.0000000	80	0	0.1666667	79	0.0000000	78	0	0.3333333	77	1	0.5000000	76	0	0.1666667	75	1.0000000	74	0.3333333

#for a first analysis we'll use only a subset of variables
data = df_crime3[, c("crime", 
                            "crime_1","lag_crime_1",
                            "crime_2","lag_crime_2",
                            "crime_3",
                            "crime_4",
                            "crime_5",
                            "crime_6",
                            "crime_7",
                            "crime_8",
                            "crime_9",
                            "crime_10",
                            "crime_11",
                            "crime_12",
                            "long", "lat")]

The variable crime is what we predict. crime_1 is the temporal lag for crime. lag_crime_1 is the spatial lag for crime_1. So lag_crime_1 is the mean of the neighbors for a region but one period prior. crime_2 and lag_crime_2 are two temporal lags for crime and spatial lag of crime.

hist(data$crime, breaks=50)

This number is important for the division of training and test datasets.

12*109 #one year of forecasting

## [1] 1308

len_hex #number of hexagons in SC

## [1] 109

len_hex*12

## [1] 1308

set.seed(1)
nrow = nrow(data)
train = 1:(nrow-len_hex*12)
data %>% names %>% kable

x
crime
crime_1
lag_crime_1
crime_2
lag_crime_2
crime_3
crime_4
crime_5
crime_6
crime_7
crime_8
crime_9
crime_10
crime_11
crime_12
long
lat

data %>% names %>% length #27 variables in data frame

## [1] 17

data %>% names %>% length %>% sqrt() #input in mtry

## [1] 4.123106

Create a dataset for training and test. We use a spatiotemporal model, so it is not a cross-section. Hence the division between training and test datasets has a temporal aspect.

trainX = data[train,]
trainY = data$crime[train]
testX = data[-train,]
testY = data$crime[-train]

We run the random forest model. Note that we use 100 trees in boostraping.

m1 = randomForest(crime~.,trainX,
                  mtry=5,importance=TRUE, ntree=300)
m1

## 
## Call:
##  randomForest(formula = crime ~ ., data = trainX, mtry = 5, importance = TRUE,      ntree = 300) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.8378361
##                     % Var explained: 72.91

Export R objects to be used in another article.

saveRDS(trainX, file = paste0(pathExport,"/trainX.rds")) 
saveRDS(trainY, file = paste0(pathExport,"/trainY.rds")) 
saveRDS(testX, file = paste0(pathExport,"/testX.rds")) 
saveRDS(testY, file = paste0(pathExport,"/testY.rds")) 
saveRDS(m1, file = paste0(pathExport,"/modelRF.rds"))

This is the MSE for training set.

yh1 = predict(m1,trainX)
mean((trainY - yh1)^2)

## [1] 0.1965913

This is the MSE for test set.

yh = predict(m1,testX)
mean((testY - yh)^2)

## [1] 0.7008546

plot(yh, testY)
abline(0,1)

#importance(m1)
varImpPlot(m1)

This is an importance plot. This plot shows which variables are more important. lag_crime_1,long,crime_11,crime_9 are the most relevant. That is the spatially lagged values on with one temporal lag is the most important variable to explain crime one period ahead. Longitude is important because the biggest cities of Santa Catarina are on the coast in the west. Number of homicides 11 and 9 periods prior is imporant maybe this is a cyclical effect. For it to make sense it should be 12 lagged values so that it would indicate some yearly cycle.

Prediction

Here we plot maps showing the prediction and the ture values.

Prediction for 01/01/2018

#plot of predict for 180901 -------------------------
#prepare a data frame for ploting of results 
#predict Data Frame
yh[1:4] #vector of predictions

##     13953     13954     13955     13956 
## 0.4598918 0.1214090 0.5837738 0.4330119

Here we plot residuals for some predictions.

12 #number of intervals to choose

## [1] 12

len_hex #number of hexagons in SC

## [1] 109

We plot the prediction for the first month of the test set.

period1 = 1
init1 = len_hex*(period1-1)+1 
inter1 = (init1:(init1+len_hex-1)) #interval to extract from yh
preDf = yh[inter1] 
#they should be the same as 
preDf %>% length; len_hex

## [1] 109

## [1] 109

#plot prediction and residuals 
#present prediction and true values

#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE

#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions
v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat

gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)

# map with munic borders
tm1 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("predict ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))

Here we prepare data for plot of actual value of homicides.

#plot of actual for 180901
period1

## [1] 1

#build map
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw

## [inverse distance weighted interpolation]

idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)

tm2 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("true ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))

current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tm3

tmap_save(tm = tm3, 
          filename = paste0("crime_","comp1",".jpg"),
          width = 3, height = 3)

Prediction for 01/06/2019

#plot of predict for 190601 -------------------------
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1] 
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE

#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw

## [inverse distance weighted interpolation]

idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)

tm11 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("predict ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))

#build map
period1

## [1] 10

ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw

## [inverse distance weighted interpolation]

idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm21 = tm_shape(raster_idw) + 
  tm_raster("crime",  style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
            n = 4, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title= paste0("true ",timeid[period1+(140-12)]),
            title.size=.8,
            title.position = c(0.25,.3))

current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm11,tm21)
tmap_mode(current.mode)

tmap_save(tm = tm3, 
          filename = paste0("crime_","comp2",".jpg"),
          width = 3, height = 3)

tm3

Residuals

#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
saveRDS(pol, file = paste0(pathExport,"/pol.rds")) #export to use in another article
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE

period1 = 1
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1] 

period1

## [1] 1

ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime 
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw

## [inverse distance weighted interpolation]

idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) + 
  tm_raster("crime", style='quantile',
            n=7 ,palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .3,
            legend.position = c("left","bottom"),
            title= paste0("residuals ",timeid[period1+(140-12)]),
            title.size=.6,
            title.position = c(0.25,.3))


moran.test(v,W)

## 
##  Moran I test under randomisation
## 
## data:  v  
## weights: W    
## 
## Moran I statistic standard deviate = -1.4624, p-value = 0.9282
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.095835052      -0.009259259       0.003504995

#for period 1 of prediction, there is no spatial correlation in the residuals

#now, for period 10
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1] 

ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime 
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw

## [inverse distance weighted interpolation]

idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm2 = tm_shape(raster_idw) + 
  tm_raster("crime", style='quantile',
            n=7 ,palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .3,
            legend.position = c("left","bottom"),
            title= paste0("residuals ",timeid[period1+(140-12)]),
            title.size=.6,
            title.position = c(0.25,.3))


moran.test(v,W)

## 
##  Moran I test under randomisation
## 
## data:  v  
## weights: W    
## 
## Moran I statistic standard deviate = -1.4845, p-value = 0.9312
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.092495433      -0.009259259       0.003143964

#for period 10 of prediction, there is no spatial correlation in the residuals

current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tmap_save(tm = tm3, 
          filename = paste0("crime_","res",".jpg"),
          width = 3, height = 3)
tm3

# end -------------------------------------

Concluding Remarks

Here we try to predict number of homicides in each polygon. For another article we can predict the number of homicide in each municipality. We can incorporate data about population, GDP per capita and inequality. We could find other crime variables. We can study larger periods, for example, trimester or semester.

Given that the data have coordinates we can study inside city occurances. Model phenomena in the street level.

It is relevant to find a way to interpret how variables (features) affect our variable of interest, \(y\). In a future article I’ll build a Shap like method to intrepret outcomes of black-box like models.

Homicides in Santa Catarina: a Random Forest Application