by William Suzuki

In this article we propose a method to summarize the prediction made by random forest and gradient boosting models. We do this for two applications made in previous articles.

Let’s start by loading the necessary packages:

#191224 blackbox
#interpreting blackbox models
#preliminaries ----------------------------------------
library(gbm)
library(tidyverse)
library(stats)
library(sp)
library(gstat) #idw
library(raster)
library(tmap)
library(randomForest)
library(xts)
library(tmap)
library(knitr)

Interpreting Grandient Boosting

In this session we interpret the results of a gradient boosting model applied to the problem of real estate prices from the metropolitan region of São Paulo. The article is in this link.

We import R objects that will be used.

#import objects -------------------
m1 = readRDS(file = "modelBoosting.rds")
trainX = readRDS(file = "trainX.rds")
trainY = readRDS(file = "trainY.rds")
testX = readRDS(file = "testX.rds")
testY = readRDS(file = "testY.rds")
trainSpreal = readRDS(file = "trainSpreal.rds")

We make the predictions in the training sample.

yh1 = predict(m1,trainX,n.trees=300)
yh1 %>% length
## [1] 2149
trainX %>% names
##  [1] "DORM_UNID"  "BANH_UNID"  "GAR_UNID"   "ELEV"       "BLOCOS"    
##  [6] "UNIDAND"    "ANDARES"    "AR_UT_UNID" "PC_TT_ATU"  "ANO_LAN"

Impact of unit’s area on price

We sum 1 to the unit’s area and see the prediction.

Let \(\hat{y} = f(X_1,X_2)\) where \(f(.)\) is the gradient boosting model, and \(X_1\) is the area of unit and \(X_2\) are the other covariates.

Note that \(X_1\) is a vector with length \(n\), where \(n\) is the sample size. And \(X_2\) is a matrix \(n \times k-1\) where \(k\) is the number of covariates (features).

In this problem we have \(n=2149\). And \(k=10\).

We calculate \(X_1' = X_1 + \vec{1}\). Where \(\vec{1}\) is a vector of 1’s of size \(n\).

Then we calculate \(\hat{y}' = f(X_1',X_2)\). Then \(\Delta y = \hat{y} ' - \hat{y}\).

Note that \(\Delta y\) is a vector of size \(n\). We calculate the mean of \(\Delta y\).

#see the effect of AR_UT_UNID --------------------------
trainX2 = trainX #repeat the training set
trainX2$AR_UT_UNID = trainX$AR_UT_UNID + 1 #give a delta shock with other values the same
yh2 = predict(m1,trainX2,n.trees=1000)
dy = yh2 - yh1
meanDy = mean(dy)
meanDy
## [1] 10768.38
hist(dy, breaks=300)

This histogram is the distribution of the elements in the vector \(\Delta y\).

10768.38 is the average impact of increasing \(1 m^3\) of area in unit’s price.

#bootstraping
dlen = length(dy)
bootsVec = c()
for (i in 1:1000){
  bootSample = sample(dy, dlen, replace=TRUE)  
  bootsMean = mean(bootSample)
  bootsVec = append(bootsVec,bootsMean)
}
hist(bootsVec, breaks=150)

summary(bootsVec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3606    9026   10854   10911   12790   19009

This histogram is a boostrap distribution of \(\text{avg}(\Delta y)\).

We can think of this distribution as a way to see if this variable is significant or not. In this case the distribution of \(\text{avg}(\Delta y)\) is far from zero.

Effect of number of bathrooms in price

We study how the number of bathrooms affect price, given that all other variables are fixed. In practice this makes little sense because usually the number of bathrooms increase with the size of the unit.

trainX2 = trainX
summary(trainX$BANH_UNID)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.798   2.000   5.000
trainX2$BANH_UNID = trainX$BANH_UNID + 1 #give a delta shock with other values the same
#add 1 in meters squared 
yh2 = predict(m1,trainX2,n.trees=1000)
dy = yh2 - yh1
hist(dy, breaks=120)

#there some outliers
meanDy = mean(dy)
meanDy
## [1] 16564.82

This histogram is the distribution of the vector \(\Delta y\).

And the average effect of one more bathroom on price is 16564.82.

dlen = length(dy)
bootsVec = c()
for (i in 1:300){
  bootSample = sample(dy, dlen, replace=TRUE)  
  bootsMean = mean(bootSample)
  bootsVec = append(bootsVec,bootsMean)
}
summary(bootsVec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9416   15177   16702   16726   18102   23392
hist(bootsVec, breaks=150)

This is the distribution of \(\text{avg}(\Delta y)\).

Effect of number of bedrooms on price

We see how an increase in one more room impact price. The effect shown here is the partial effect of number of bedrooms, maintaining fixed all other variables. But usually with an increase of one more bedroom the unit’s area also increase.

trainX2 = trainX
trainX2$DORM_UNID = trainX$DORM_UNID + 1 #give a delta shock with other values the same
yh2 = predict(m1,trainX2,n.trees=1000)
dy = yh2 - yh1
meanDy = mean(dy)
meanDy
## [1] -63750.27

The impact of one more room is negative, -63750.27.

Note that this effect may happens given that we are fixing all other variables.

dlen = length(dy)
bootsVec = c()
for (i in 1:300){
  bootSample = sample(dy, dlen, replace=TRUE)  
  bootsMean = mean(bootSample)
  bootsVec = append(bootsVec,bootsMean)
}
summary(bootsVec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -70912  -65392  -63743  -63669  -62021  -56084
hist(bootsVec, breaks=150)

Non-linear effect of Area on price

We saw a one number summary of impact of a covariate on price.

Now, we see a non-homogeneous impact of covariate \(X_1\) on \(y\). For this we build graphs showing how this impact \(\Delta y\) changes with \(y\) or \(X_1\).

trainX2 = trainX
trainX2$AR_UT_UNID = trainX$AR_UT_UNID + 1 #give a delta shock with other values the same
yh2 = predict(m1,trainX2,n.trees=1000)
dy = yh2 - yh1
mean(dy)
## [1] 10768.38

We apply moving averages to see how \(\Delta y\) behaves along \(y\).

#moving average --------------------------------------
#moving average function for smoothing data
y = trainY
yOrder = y[order(y)]
dyOrder = dy[order(y)]
len1 = length(dyOrder)
window =  101 #number of neighbors to use
mva = rep(NA,len1) #moving average
bootsMva = rep(NA,len1) #bootstrap mean moving average
boots05 = rep(NA,len1)
boots95 = rep(NA,len1)
halfSpan = (window-1)/2
endValue = len1-halfSpan
for (ii in 1:len1) {
  init = (ii-halfSpan)
  final = (ii+halfSpan)
  if (init<halfSpan) {init = 0}
  if (final>len1) {final=len1}
  vv1 = dyOrder[init:final]
  vv2 = mean(vv1)
  mva[ii] = vv2
  bootsVec = c()
  for (jj in 1:100){
    bootSample = sample(vv1, window, replace=TRUE)  
    bootsMean = mean(bootSample)
    bootsVec = append(bootsVec,bootsMean)
    meanBoots = mean(bootsVec) 
    quant05 = quantile(bootsVec, probs = .05)
    quant95 = quantile(bootsVec, probs = .95)
  }
  bootsMva[ii] = meanBoots
  boots05[ii] = quant05
  boots95[ii] = quant95
  #print(ii)
} #the loop takes around 1 min
xmin = 0
xmax = 2e6
ymin = -2e4
ymax = 4e4
plot(y, dy, xlim=c(xmin, xmax), ylim=c(ymin, ymax),
     ylab="Delta Price of Area", xlab="Price")
abline(h=0, col='blue')
lines(yOrder,bootsMva, col='blue')
lines(yOrder,boots05,col='red')
lines(yOrder,boots95,col='red')
abline(h=7274.792)

This graph shows how \(\Delta y\) changes with \(y\), which is the price. For lower values under R$ 200,000.00 the effect of area is negative. The red band is the 90% confidence interval from a boostrap procedure. Note that for larger valeus of \(y\) the band gets larger, due to small sample.

We have small sample of units that cost more than R$ 1,000,000.00 but the graph shows that one \(1 m^3\) makes no difference in price.

The next figure shows a smaller subplot for values that are more common.

#a closer view of more important regions
xmin = 1.3e5
xmax = 5e5
ymin = -8e3
ymax = 1.5e4
plot(y, dy, xlim=c(xmin, xmax), ylim=c(ymin, ymax),
     ylab="Delta Price of Area", xlab="Price")
abline(h=0, col='blue')
lines(yOrder,bootsMva, col='blue')
lines(yOrder,boots05,col='red')
lines(yOrder,boots95,col='red')
abline(h=7274.792)

The y-axis is the is \(\Delta y\). And x-axis is the level of price. This graph show how the impact of area changes with price.

This is a type of local statitics, on the hand taking the aggregate average effect is a type of global statistic.

The figure shows that more expensive units have a higher \(\Delta y\), so as the unit gets more expensive the impact of one more \(1m^3\) on prices is higher.

The blue line is the level zero. The black line is the global average effect.

Maps of Impact of Area on Price

#maps of delta on sp metropolitan region -------------------------
#make map showing how area of building changes price per region
y = trainY
coords = coordinates(trainSpreal)
dist1 = as.matrix(dist(coords,method="euclidean",diag = FALSE, upper = FALSE, p = 2)) 
window = 201
ii = 1
spatialMA = rep(NA,len1)
for (ii in 1:len1) {
  distances = dist1[ii,]  
  values = dy[order(distances)]  
  values2 = values[1:window]
  smav = mean(values2) #spatial moving average
  spatialMA[ii] = smav
}
#spatial delta
spatialD = trainSpreal
spatialD$spatialMA = spatialMA
#plot maps raster idw 
grid = readRDS(file = "grid.rds")
munic = readRDS(file = "munic.rds")
v = spatialMA
df1 = data.frame(v)
spdf = spatialD #spdf is the copula for spatial polygon or point
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) 
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "variable")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
  tm_raster("variable",  style = "fixed", breaks=c(-3000,0,2000,5000,6000,7000,9e3,1e4,1.5e4,30000), 
            n = 10, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.4, col = "black", lwd=.6)+
  tm_layout(legend.text.size = .5,
            legend.position = c("right","bottom"),
            title="Delta Price impact of Area",
            title.size=1,
            title.position = c(0.01,.9))
tm1

This maps shows the heterogeneity of \(\Delta y\).

We could make a map of the bootstrap interquantile range. This map would show which areas we are more certain about the statistics value.

Effect of area with different \(\delta\)

From the beginning all the analysis that we made till now were with with the adoption of \(X_1'= X_1 + \vec{1}\). But we could make \(X_1'= X_1 + \vec{\delta}\) where \(\delta\) is a scalar and \(\vec{\delta}\) is a vector containing the value \(\delta\).

It would be good if \[ \frac{\Delta y}{\delta}\] doesn’t change when we change \(\delta\). We used \(\delta=1\) until now, but in this next chunck we test different values of \(\delta\).

#effect of AR_UT_UNID with different delta --------------------------
trainX2 = trainX
delta = 0.1 #size of shock
trainX2$AR_UT_UNID = trainX$AR_UT_UNID + delta #shock of "delta" of area unit
yh2 = predict(m1,trainX2,n.trees=1000)
dy = (yh2 - yh1)/delta #divide by area unit
meanDy = mean(dy)
meanDy
## [1] 20731.05
#the impact of one more room in the housing price is negative
dlen = length(dy)
bootsVec = c()
#bootstrapping
for (i in 1:300){
  bootSample = sample(dy, dlen, replace=TRUE)  
  bootsMean = mean(bootSample)
  bootsVec = append(bootsVec,bootsMean)
}
hist(bootsVec, breaks=150)

summary(bootsVec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -32844    4979   21129   20847   36787   85823

With a \(\delta=0.1\) the effect \(\Delta y\) on average is twice as high, compared to when \(\delta=1\).

#loop,graph,confidence interval how delta impacts estimates -----------------
mean = c()
bootsMeanSeq = c()
bootsUpper = c()
bootsLower = c()
deltaSeq = c(-10:-3,seq(-2,2,by=0.03),3:10)
for (delta in deltaSeq) {
  #print(delta) #shock size, follow loop's progress
  trainX2 = trainX
  trainX2$AR_UT_UNID = trainX$AR_UT_UNID + delta #shock of "delta" of area unit
  yh2 = predict(m1,trainX2,n.trees=1000)
  dy = (yh2 - yh1)/delta #divide by area unit
  meanDy = mean(dy)
  mean = append(mean,meanDy)
  #the impact of one more room in the housing price is negative
  dlen = length(dy)
  bootsVec = c()
  #bootstrapping
  for (i in 1:300){
    bootSample = sample(dy, dlen, replace=TRUE)  
    bootsMean = mean(bootSample)
    bootsVec = append(bootsVec,bootsMean)
  }
  bootsMeanSeq = append(bootsMeanSeq, mean(bootsVec))
  quant05 = quantile(bootsVec, probs = .05)
  quant95 = quantile(bootsVec, probs = .95)
  bootsUpper = append(bootsUpper,quant05)
  bootsLower = append(bootsLower,quant95)
}
xmin = -10
xmax = 10
ymin = -1e5
ymax = 1e5
plot(deltaSeq,mean,type='l',xlim=c(xmin, xmax), ylim=c(ymin, ymax))
lines(deltaSeq,bootsMeanSeq, col='blue')
lines(deltaSeq,bootsUpper,col='red')
lines(deltaSeq,bootsLower,col='red')
abline(h=10768.38, col='darkred')

The x-axis is different values of \(\delta\) and y-axis show the average of \(\Delta y\). Note that we can have negative values of \(\delta\).

For \(\delta\) that are small \(\Delta y\) gets explosive. But for higher values the behavior of \(\Delta y\) is almost constant, which is a good sign that we can use \(\delta=1\)

xmin = -10
xmax = 10
ymin = -1e4
ymax = 6e4
plot(deltaSeq,mean,type='l',xlim=c(xmin, xmax), ylim=c(ymin, ymax))
lines(deltaSeq,bootsMeanSeq, col='blue')
lines(deltaSeq,bootsUpper,col='red')
lines(deltaSeq,bootsLower,col='red')
abline(h=10768.38, col='darkred')

From this zoomed figure we can see that there is a statistically significant difference when \(\delta\) is negative and positive.

xmin = 0
xmax = 2
ymin = 2.5e3
ymax = 2.5e4
plot(deltaSeq,mean,type='l',xlim=c(xmin, xmax), ylim=c(ymin, ymax))
lines(deltaSeq,bootsMeanSeq, col='blue')
lines(deltaSeq,bootsUpper,col='red')
lines(deltaSeq,bootsLower,col='red')
abline(h=10768.38, col='darkred')

The dark-red line shows the average of \(\Delta y\) when \(\delta=1\).

Explaining Random Forests

For the random forest model application we study number of homicides in the state of Santa Catarina, the article is in this link.

We call the R objects used.

m1 = readRDS(file = "sc_crime/modelRF.rds")
trainX = readRDS(file = "sc_crime/trainX.rds")
trainY = readRDS(file = "sc_crime/trainY.rds")
testX = readRDS(file = "sc_crime/testX.rds")
testY = readRDS(file = "sc_crime/testY.rds")
pol = readRDS(file = "sc_crime/pol.rds")
spdf = readRDS(file = "sc_crime/spdf.rds")

yh1 = predict(m1,trainX)
yh1 %>% length
## [1] 12644
trainX %>% names
##  [1] "crime"       "crime_1"     "lag_crime_1" "crime_2"     "lag_crime_2"
##  [6] "crime_3"     "crime_4"     "crime_5"     "crime_6"     "crime_7"    
## [11] "crime_8"     "crime_9"     "crime_10"    "crime_11"    "crime_12"   
## [16] "long"        "lat"

1 temporal lag to explain homicides

#see the effect of crime_1 --------------------------
#build function impact measure 
names(trainX)
##  [1] "crime"       "crime_1"     "lag_crime_1" "crime_2"     "lag_crime_2"
##  [6] "crime_3"     "crime_4"     "crime_5"     "crime_6"     "crime_7"    
## [11] "crime_8"     "crime_9"     "crime_10"    "crime_11"    "crime_12"   
## [16] "long"        "lat"
variable = 2
impact = function(trainX,delta=1,variable){
  list = list()
  yh1 = predict(m1,trainX)
  trainX2 = trainX
  variableName = names(trainX)[variable]
  trainX2[variableName] = trainX2[variableName] + delta
  #trainX2$crime_1 = trainX$crime_1 + delta 
  yh2 = predict(m1,trainX2)
  dy = (yh2 - yh1)/delta
  meanDy = mean(dy)
  list[[1]] = meanDy
  #the impact of one more room in the housing price is negative
  dlen = length(dy)
  bootsVec = c()
  #bootstrapping
  for (i in 1:300){
    bootSample = sample(dy, dlen, replace=TRUE)  
    bootsMean = mean(bootSample)
    bootsVec = append(bootsVec,bootsMean)
  }
  list[[2]] = bootsVec
  list[[3]] = dy
  names(list) = c('mean','boostVec','dy')
  list
}

This chunck creates a function impact() which returns a boostrap vector of means of \(\Delta y\).

listResult = impact(trainX,1,2)
bootsVec = listResult$boostVec
mean = listResult$mean
mean
## [1] 0.07006641
hist(bootsVec, breaks=150)

summary(bootsVec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06609 0.06935 0.07023 0.07022 0.07116 0.07321
dy = listResult$dy

The effect on homicides of 1 temporal lag is 0.070.

#test for spatial and temporal lag 1 -----------------
names(trainX)
##  [1] "crime"       "crime_1"     "lag_crime_1" "crime_2"     "lag_crime_2"
##  [6] "crime_3"     "crime_4"     "crime_5"     "crime_6"     "crime_7"    
## [11] "crime_8"     "crime_9"     "crime_10"    "crime_11"    "crime_12"   
## [16] "long"        "lat"
listResult = impact(trainX,delta=1,variable=3) #variable=3
boostVec = listResult$boostVec
mean = listResult$mean
mean
## [1] 0.06178288
hist(bootsVec, breaks=150)

summary(boostVec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05851 0.06096 0.06180 0.06180 0.06268 0.06558

This is the impact of homicides from neighboring regions 1 lag in time. The effect is 0.062. For just a time lag the effect is 0.70.

width = summary(bootsVec) %>% length 
depth = names(trainX) %>% length - 1

tableResult = data.frame(matrix(NA,nrow=depth,ncol=width))
names(tableResult) = names(summary(bootsVec))
rownames(tableResult) = names(testX)[-1]
for (ii in 2:(depth+1)) {
  if (ii %in% c(16,17)) { #variables of long and lat
    delta = 0.01}
  listResult = impact(trainX,delta=delta,variable=ii)
  boostVec = listResult$boostVec
  summary1 = summary(boostVec)
  tableResult[ii-1,] = round(as.numeric(summary1),5) 
  #print(ii) #show progress of loop
}
tableResult %>% kable
Min. 1st Qu. Median Mean 3rd Qu. Max.
crime_1 0.06626 0.06692 0.06715 0.06715 0.06737 0.06793
lag_crime_1 0.03433 0.03475 0.03490 0.03491 0.03507 0.03553
crime_2 0.07948 0.08011 0.08031 0.08031 0.08050 0.08110
lag_crime_2 0.01337 0.01380 0.01392 0.01392 0.01404 0.01453
crime_3 0.02564 0.02622 0.02636 0.02636 0.02651 0.02689
crime_4 0.02382 0.02422 0.02437 0.02436 0.02449 0.02489
crime_5 0.03134 0.03195 0.03208 0.03207 0.03220 0.03258
crime_6 0.03600 0.03640 0.03653 0.03654 0.03667 0.03716
crime_7 0.05453 0.05509 0.05523 0.05525 0.05541 0.05590
crime_8 0.02934 0.02992 0.03006 0.03006 0.03020 0.03081
crime_9 0.02607 0.02656 0.02673 0.02673 0.02688 0.02741
crime_10 0.05035 0.05075 0.05089 0.05089 0.05100 0.05150
crime_11 0.04716 0.04768 0.04785 0.04785 0.04801 0.04843
crime_12 0.06188 0.06243 0.06259 0.06259 0.06274 0.06311
long -0.00595 -0.00218 -0.00119 -0.00111 -0.00007 0.00370
lat -0.00415 -0.00017 0.00088 0.00087 0.00189 0.00524
#tableResult 

This table shows how each covariate impact homicides. crime_1 is 1 temporal lag of a region. lag_crime_1 is homicide in all neighboring regions in 1 time lag. If a region A had a high level crime in \(t-1\) we expect that in \(t\) the crim will be high too. If the neighboring regions of A had a higher level of crime in \(t-1\) then we expect that in \(t\) we observe a higher level of crime in region A.

The important column is Median, the values shown there are the impact on homicides.

We can use the 1st and 3rd quartiles to show a type of confidence interval. Note that long and lat are “not significant” in a linear fashion. But we’ll see that ploting values on a map show how geography is important to explain the random forest’s prediction.

excludeValues = c(-2,-4,-15,-16)
vecAuto = tableResult$Mean[excludeValues]
medianAuto = tableResult$Median[excludeValues]
upperAuto = tableResult$`3rd Qu.`[excludeValues]
lowerAuto = tableResult$`1st Qu.`[excludeValues]
plot(vecAuto,type='l',
     ylab="Delta price, change in area", xlab="temporal lag")
lines(medianAuto, col='blue')
lines(upperAuto, col='red')
lines(lowerAuto, col='red')

This graph show the temporal impact of lags from the same region. Lag 1, 2, 7 and 12 are large. That is homicide levels that happen in those periods have more impact on

xmin = 1
xmax = 4
ymin = .06
ymax = 0.08
plot(vecAuto,type='l',xlim=c(xmin, xmax), ylim=c(ymin, ymax), 
     ylab="Delta price, change in area", xlab="temporal lag")
lines(medianAuto, col='blue')
lines(upperAuto, col='red')
lines(lowerAuto, col='red')

Time series of \(\Delta y\)

Note that we can find how \(\Delta y\) behaves in each region for each month.

The following function creates a time series moving average (monthly) for each region.

number=2
listResult = impact(trainX,1,number)
dy = listResult$dy
#hist(dy,breaks=300)
len2 = dim(spdf)[1] #number of regions
len3 = length(dy)
len4 = len3/len2 #number of periods
tableDivide = data.frame(matrix(NA,nrow=len4,ncol=len2))
jj = 3
for (jj in 1:len4) {
  init1 =  jj*len2 - len2 + 1
  end1 = jj*len2
  timeS1 = dy[init1:end1]
  tableDivide[jj,] = timeS1
  #print(jj)
}
names(tableDivide) = paste0("region",as.character(1:len2))
seqDate = seq(as.Date('2009-1-1'),to=as.Date('2018-8-1'),by='months')
tsTable = xts(tableDivide,order.by=seqDate)
plot(tsTable[,c(1)]) #delta y chaning in time for region 1

plot(tsTable[,c(1)]["2016/"]) #delta y chaning in time for region 1, for 2016 onwards

plot(tsTable[,c(1,2)])#delta y chaning in time for region 1 and 2

plot(tsTable[,c(1,2)]["2016/"])#delta y chaning in time for region 1 and 2, for 2016 onwards

plot(tsTable[,c(1,2,3)])#delta y chaning in time for region 1, 2 and 3

plot(tsTable[,c(1,2,3)]["2016/"])#delta y chaning in time for region 1,2,3, for 2016 onwards

The graphs are very noisy. We’ll smooth them with a moving average.

#mean of dy for all regions
tsSeries1 = as.data.frame(apply(tsTable,1,mean)) 
tsSeries2 = xts(tsSeries1,order.by=seqDate)
plot(tsSeries2) #mean of all regions, all period

plot(tsSeries2["2016/"]) #mean of all regions, 2016 onwards

plot(tsSeries2["2017/"]) #mean of all regions, 2017 onwards

sizeW = 7
movingWin = rollmean(tsSeries2,k=sizeW)
movingWin2 = cbind(tsSeries2,movingWin)
plot(movingWin2) #mean series in black, moving average in red

plot(movingWin2[,2]) #just the mocing average

upperMov = rollapply(tsSeries2, sizeW,function(x) quantile(x,probs = .10))
lowerMov = rollapply(tsSeries2, sizeW,function(x) quantile(x,probs = .90))
movingWin3 = cbind(movingWin2,upperMov,lowerMov)
plot(movingWin3)

This graph shows that we can’t say that we had a significant change in \(\Delta y\) in time. The effect of 1 temporal lag is noisy in time, but there is no notable trend.

This would be interesting to see if, for example, there were a positive trend this would mean that the spillover of homicides intensified in time.

#function for moving average time series -----------------
movingAv = function(tsSeries2,sizeW = 7){
  movingWin = rollmean(tsSeries2,k=sizeW)
  upperMov = rollapply(tsSeries2, sizeW,function(x) quantile(x,probs = .10))
  lowerMov = rollapply(tsSeries2, sizeW,function(x) quantile(x,probs = .90))
  movingWin3 = cbind(tsSeries2,movingWin,upperMov,lowerMov)
  movingWin3
}
movingAv(tsTable[,1]) %>% plot

movingAv(tsTable[,2]) %>% plot

movingAv(tsTable[,3]) %>% plot

Those graphs show how for regions 1, 2 and 3 the effect \(\Delta y\) changed in time.

timeMoving = function(dd=2){
  movingWin = rollmean(tsSeries2,k=sizeW)
  tsMoving = movingWin
  for (ii in dd:(dd+1)) {
    tsMoving = cbind(tsMoving,movingAv(tsTable[,ii])[,2])
  }
  tsMoving %>% plot
}
timeMoving(dd=103)

timeMoving(dd=6)

timeMoving(dd=10)

timeMoving(dd=2)

timeMoving(dd=1)

The black line in those graphs is the average behaviour. All other series are different regions behaviours of \(\Delta y\).

Maps of Spatial Heterogeneity

#maps of lag 1 impact --------------------------------------
#this is a spatial temporal model, sao paulo real estate
#is a spatial model only
#for homicide in santa catarina we can show who the 
#parameters change in time

regionsMean = apply(tsTable,2,mean)
regionsMean %>% length
## [1] 109
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
v = regionsMean
df1 = data.frame(v) 
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) + 
  tm_raster("crime", style='quantile',
            n=15 ,palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.2, col = "black", lwd=.1)+   
  tm_layout(legend.text.size = .7,
            legend.position = c("left","bottom"),
            title= "dy",
            title.size=1,
            title.position = c(0.4,.3))
tm1

# 

There is a spatial heterogeneity in \(\Delta y\). On the coast the statistics are higher compared to the center of the state.

Non-linear relationship

#make non-linear interpreation -----------------------
y = trainY
listResult = impact(trainX,1,2)
dy = listResult$dy
seriesQuant = 0:21
len1 = length(seriesQuant)
mva = rep(NA,len1) #moving average
bootsMva = rep(NA,len1) #bootstrap mean moving average
boots05 = rep(NA,len1)
boots95 = rep(NA,len1)
ii=0
for (ii in seriesQuant) {
  postions1 = y==ii 
  vv1 = dy[postions1]
  vv2 = mean(vv1)
  mva[ii+1] = vv2
  bootsVec = c()
  for (jj in 1:100){
    bootSample = sample(vv1, length(vv1), replace=TRUE)  
    bootsMean = mean(bootSample)
    bootsVec = append(bootsVec,bootsMean)
  }
  meanBoots = mean(bootsVec) 
  quant05 = quantile(bootsVec, probs = .05)
  quant95 = quantile(bootsVec, probs = .95)
  bootsMva[ii+1] = meanBoots
  boots05[ii+1] = quant05
  boots95[ii+1] = quant95
  #print(ii)
} 
plot(y, dy)
abline(h=0, col='blue')
lines(seriesQuant,mva, col='darkred')
lines(seriesQuant,bootsMva, col='blue')
lines(seriesQuant,boots05, col='red')
lines(seriesQuant,boots95, col='red')
abline(h=listResult[[1]])

xmin = 0
xmax = 10
ymin = -.2
ymax = .3
plot(y, dy, xlim=c(xmin, xmax), ylim=c(ymin, ymax))
abline(h=0, col='blue')
lines(seriesQuant,mva, col='darkred')
lines(seriesQuant,bootsMva, col='blue')
lines(seriesQuant,boots05, col='red')
lines(seriesQuant,boots95, col='red')
abline(h=listResult[[1]])

This graph shows how the impact of 1 temporal lag changes with level of crime. If the level of crime is 0, then the lagged homicide have a positive impact in the next period. But for values of homicides of 1,2 3, 4 this effect is negative. This is strange. For higher levels of homicide the effect is positive. Which means that when the level of crime is high one an increase in crime in t-1 have a positive impact on t.

The blue horizontal line is the level zero. The black horizontal line shows the global average effect.

Concluding Remarks

The analysis made here are summary statitics on how a model behaves globally. Although we use local statistics to make conclusions. The LIME method does something distinct and more local, it is an analysis for particular observations.

In this article we propose methods to understand how a black-box model behaves in a global and local circunstances. This does not mean that for every situtation the model will behave as our explanations points. But the interpretations drawn here can help the researcher to understand the phenomenon behind the data.