by William Suzuki

The objective of this work is to explore data on real estate launches in the metropolitan region of São Paulo.

I would like to thank Ian Fukushima for suggestions, ideas, and encouragement. I would also like to thank my advisor Márcio Laurini for teaching me almost everything that I know about spatial statistics, and for suggesting, through his course, the dataset that I am using.

The data source is Centro de Estudos da Metrópole (CEM). Let’s explore what the dataset contains.

Load, Explore and Transform

library(rgdal)
library(tidyverse)
library(Amelia)
library(ggmap)
library(broom) #for tidy (alternative to fortify)
library(rgeos) #gBuffer, gUnaryUnion
library(ggvoronoi) #stat_voronoi
library(spdep)
library(INLA)
library(kdensity) #for kdensity
library(raster) #raster()
library(gstat) #idw function
library(tmap) #to use tmap function
library(sf) #for sf objects
library(gbm) #gradient boosting
library(knitr) #for markdown kable
#this dataset was used during the Spatial Statistics course taught by Marcio Laurini
#it was produced by the Centro de Estudos da Metropole (CEM)
#the dataset contains information about launch prices of real estate
#for the metropolitan region of Sao Paulo

You should change path1 to the data location on your machine. The path below points to my Dropbox folder with the data used.

path1 = "~/Dropbox/working/RAW_DATA/BASESCEM"
#real estate launches in the metropolitan region
spreal=readOGR( paste(path1,"/LanRes_85_13_RMSP_CEM.shp",sep=""), layer="LanRes_85_13_RMSP_CEM")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/william/Dropbox/working/RAW_DATA/BASESCEM/LanRes_85_13_RMSP_CEM.shp", layer: "LanRes_85_13_RMSP_CEM"
## with 16935 features
## It has 85 fields
## Integer64 fields read as strings:  ID CEP_4DIG AR_TT_TERR PC_TT_UN VLR_US__CO COOPERATIV
pathExport = "/home/william/Dropbox/working/projects_git/sp_real_estate/export_objects"
#create coordinate variables
coords = coordinates(spreal)
data = spreal@data
data$lat = coords[,2]
data$long = coords[,1]
#transform all variables that are factor into character
data2 = data
len1 = length(names(data2)) 
ii = 1
for (ii in 1:len1) {
  vv = data[,ii]
  if (is.factor(vv)) {
    vv1 = as.character(vv)
    data2[,ii] = vv1
  }
}
str(data2)
## 'data.frame':    16935 obs. of  87 variables:
##  $ ID        : chr  "9037" "16908" "16565" "15812" ...
##  $ COD_EMP   : int  9056 16935 16592 15839 15840 15390 15910 9326 14147 8239 ...
##  $ TIPO_EMP  : chr  "HORIZ" "VERTIC" "HORIZ" "HORIZ" ...
##  $ MES_LAN   : chr  "15-JUL-2004" "15-DEC-2013" "15-SEP-2013" "15-DEC-2012" ...
##  $ ANO_LAN   : int  2004 2013 2013 2012 2012 2012 2012 2004 2010 2003 ...
##  $ DATA_ENT  : chr  "15-JUL-2004" "15-APR-2016" "15-AUG-2015" "15-MAR-2015" ...
##  $ DIST      : chr  NA NA NA NA ...
##  $ SUBPREF   : chr  NA NA NA NA ...
##  $ MUNICIPIO : chr  "COTIA" "COTIA" "COTIA" "COTIA" ...
##  $ ENDERECOCO: chr  "Ouro Preto (Rua) 303" "Agua Espraiada (Estr) 320" "Caucaia (Estr) 3 000" "Jose Teixeira de Oliveira (Rua) 620" ...
##  $ LOGRADOURO: chr  "Ouro Preto" "Agua Espraiada" "Caucaia" "Jose Teixeira de Oliveira" ...
##  $ TIT_VIA   : chr  NA NA NA NA ...
##  $ TIPO_VIA  : chr  "Rua" "Estr" "Estr" "Rua" ...
##  $ NUM       : chr  "303" "320" "3000" "620" ...
##  $ CEP       : chr  "04330-045" "06726-400" "06727-190" "06726-486" ...
##  $ CEP_4DIG  : chr  "4330" "6726" "6727" "6726" ...
##  $ NOME_EMP  : chr  "OURO PRETO" "PARQUE SAINT BENJAMIN" "RES VIVA CAUCAIA - 1\xaa FASE" "RES AGUASSAI I" ...
##  $ ZONA      : chr  "R" "R" "R" "R" ...
##  $ SETOR     : chr  "M" "M" "M" "M" ...
##  $ QUADRA    : chr  "S" "S" "S" "S" ...
##  $ LOTE      : chr  "P" "P" "P" "P" ...
##  $ DORM_UNID : int  4 2 2 2 3 3 3 2 3 2 ...
##  $ BANH_UNID : int  4 1 1 1 2 2 2 1 2 1 ...
##  $ GAR_UNID  : int  4 1 1 2 3 2 2 2 2 1 ...
##  $ ELEV      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ COB       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BLOCOS    : int  1 7 1 1 0 1 1 1 1 1 ...
##  $ UNIDAND   : int  0 7 0 0 0 0 0 0 0 0 ...
##  $ ANDARES   : int  2 4 2 2 2 2 2 2 1 1 ...
##  $ AR_UT_UNID: chr  "180" "44" "48" "61,44" ...
##  $ AR_TT_UNID: chr  "500" "50" "70" "90" ...
##  $ AR_TT_TERR: chr  "9986" "16165" "25470" "32100" ...
##  $ TT_UNID   : int  10 192 198 108 36 20 22 70 11 78 ...
##  $ DORM_EMP  : int  40 384 396 216 108 60 66 140 33 156 ...
##  $ BANH_EMP  : int  40 192 198 108 72 40 44 70 22 78 ...
##  $ GAR_EMP   : int  40 192 198 216 108 40 44 140 22 78 ...
##  $ AU_EMP    : chr  "1800" "8427" "9563" "6635,52" ...
##  $ AT_EMP    : chr  "5000" "9663" "13860" "9720" ...
##  $ PC_TT_UN  : chr  "290000" "129558" "135800" "139000" ...
##  $ PC_M2_AU  : chr  "1611,11" "2952" "2812" "2262,37" ...
##  $ PC_M2_AT  : chr  "580" "2574,17" "1940" "1544,44" ...
##  $ PC_TT_ATU : chr  "489006,07" "129558" "137983,74" "146683,58" ...
##  $ PC_AU_ATU : chr  "2716,7" "2952" "2857,22" "2387,43" ...
##  $ PC_AT_ATU : chr  "978,01" "2574,17" "1971,2" "1629,81" ...
##  $ PC_EMP_ATU: chr  "4890060,73" "24875136" "27320780,31" "15841827,03" ...
##  $ VLR_US__CO: chr  "3" "2" "2" "2" ...
##  $ PC_TT_UN_U: chr  "93942,34" "54944,02" "57178,95" "65596,98" ...
##  $ PC_M2_AU_U: chr  "521,9" "1251,86" "1183,83" "1067,66" ...
##  $ PC_M2_AT_U: chr  "187,88" "1091,68" "816,84" "728,85" ...
##  $ SIST_FINAN: chr  "Preco Fechado" "Preco Fechado" "Preco Fechado" "Preco Fechado" ...
##  $ AGENTE    : chr  NA "BANCO DO BRASIL" "CEF" "SFH" ...
##  $ INCORPORAD: chr  "LM Incorporacoes e Empreendimentos Ltda" "MRV Engenharia e Participacoes S/A" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
##  $ VENDEDORA : chr  "Formanova/Proin/G3I/3\xbaMI/Especonc/Ellit" "Habitcasa Consultoria de Imoveis Ltda" "Drive Imoveis" "Atam Empreendimentos" ...
##  $ CONSTRUTOR: chr  "Consil Engenharia S/C Ltda" "MRV Engenharia e Participacoes S/A" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
##  $ ENGENHEIRO: chr  "Nao Informado" "Nao Informado" "Nao Informado" "Nao Informado" ...
##  $ ARQUITETO : chr  "Nao Informado" "Nao Informado" "Nao Informado" "Nao Informado" ...
##  $ HOTELARIA : chr  NA NA NA NA ...
##  $ INCORPOR_A: chr  "LM Incorporacoes e Empreendimentos Ltda" "MRV Engenharia e Participacoes S" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
##  $ INCORPOR_B: chr  NA NA NA NA ...
##  $ INCORPOR_C: chr  NA NA NA NA ...
##  $ INCOPORADO: chr  NA NA NA NA ...
##  $ INCORPOR_D: chr  NA NA NA NA ...
##  $ INCORPOR_E: chr  NA NA NA NA ...
##  $ VENDEDOR_A: chr  "Formanova" "Habitcasa" "Drive Imoveis" "Atam Empreendimentos" ...
##  $ VENDEDORA2: chr  "Proin" NA NA NA ...
##  $ VENDEDORA3: chr  "G3I Geracao de Inv e Incorp Imob Ltda" NA NA NA ...
##  $ VENDEDORA4: chr  "3\xbaMI" NA NA NA ...
##  $ VENDEDORA5: chr  "Especonc" NA NA NA ...
##  $ VENDEDORA6: chr  "Ellit" NA NA NA ...
##  $ CONSTRUT_A: chr  "Consil Engenharia S" "MRV Engenharia e Participacoes S" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
##  $ CONSTRUT_B: chr  NA NA NA NA ...
##  $ CONSTRUT_C: chr  NA NA NA NA ...
##  $ CONSTRUT_D: chr  NA NA NA NA ...
##  $ COOPERATIV: chr  "0" "0" "0" "0" ...
##  $ HOTEL     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLAT      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ EXFLAT    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AP2010    : num  3.51e+12 3.51e+12 3.51e+12 3.51e+12 3.51e+12 ...
##  $ SC_2010   : num  3.51e+14 3.51e+14 3.51e+14 3.51e+14 3.51e+14 ...
##  $ RENRESP91 : chr  "112291,99" "112291,99" "112291,99" "112291,99" ...
##  $ RENRESP00 : chr  "570,95" "570,95" "570,95" "570,95" ...
##  $ RENRESP10 : chr  "954,75" "954,75" "954,75" "954,75" ...
##  $ PCMEDAU91 : chr  NA NA NA NA ...
##  $ PCMEDAU00 : chr  NA NA NA NA ...
##  $ PCMEDAU10 : chr  NA "2470,97" NA "2566,94" ...
##  $ lat       : num  -23.7 -23.7 -23.7 -23.7 -23.7 ...
##  $ long      : num  -47 -47 -47 -47.1 -47.1 ...
#data has 16935 rows
#map of missing data
missmap(data2)

Many of the missing values are in variables that we don't use, for example the seller and the construction firm, so this is not a problem.

#explore year variable
ano=data2$ANO_LAN
summary(ano)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1985    1995    2003    2002    2009    2013
#ano goes from 1985 till 2013
ano %>% class #integer
## [1] "integer"
sum(is.na(ano)) #there are no missing values for ano
## [1] 0
table(ano)
## ano
## 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 
##  367  745  276  342  452  287  216  212  432  533  578  570  548  402  390  558 
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
##  521  608  684  659  617  656  867  874  765  982  992  758 1044

Let's make a time series of how many launches there are per year.

ano_which = names(table(ano))
ano_ts = as.numeric(table(ano))
plot(ano_which,ano_ts, type='l')

length(ano_ts)
## [1] 29

The dataset has 29 years of data. More recent years have more real estate launches.

Idea: Make spatiotemporal models. Make predictions.

The following variable descriptions are summarized from the paper Albuquerque_2016_preco_imovel:


  • DORM_UNID: number of bedrooms per unit.

  • BANH_UNID: number of bathrooms per unit.

  • GAR_UNID: number of parking spaces per unit.

  • ANDARES: number of floors in the development.

  • ELEV: number of elevators per launch.

  • COB: number of penthouse units.

  • BLOCOS: number of blocks in the development.

  • UNIDAND: number of units per floor.

  • AR_UT_UNID: usable area of the unit in m².

  • PC_TT_UN: the dependent variable, the sale price of the unit.

#change data2
data3 = data2
#transform character into numeric
list_variables = c("AR_UT_UNID", "PC_TT_UN_U","PC_TT_ATU", "PC_TT_UN")
variable = "PC_TT_UN"
for (variable in list_variables) {
  fl1 = data3[,variable]
  fl2 = as.numeric(gsub(",",".",fl1))
  data3[variable] = fl2
}

AR_UT_UNID is the usable area of the unit.

PC_TT_UN_U, PC_TT_ATU, and PC_TT_UN are all price variables. We want to look at all of them to see what each one contains.

data3$AR_UT_UNID %>% summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.72   57.68   74.00  103.55  120.00 1975.00
data3$PC_TT_ATU %>% summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    57202   239701   384850   656059   704158 31843099
#note how the median is smaller than the mean
#which makes sense because the distribution of prices is asymmetric
data3$PC_TT_UN %>% summary
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 1.210e+05 2.661e+05 2.651e+07 7.342e+05 2.112e+09
data3$PC_TT_UN_U %>% summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7803   59475  103459  176922  195511 6377875
#we'll use PC_TT_ATU
hist(data3$PC_TT_ATU, breaks=1000)

#there are some extreme values
#the maximum value is 31,843,099 
vv2 = data3$PC_TT_ATU
vv3 = vv2[vv2<2000000] #take out some outliers
#see distribution of only houses less than 2 000 000 reais
hist(vv3, breaks=1000)

#tree methods don't have problems with non-gaussian variables
#take some variables only
data3_1 = data3[,c(list_variables,"ANO_LAN") ]
#time series graphs for MEAN
time_data3 = aggregate(data3_1, by=list(data3$ANO_LAN), FUN=mean, na.rm=TRUE)
#time series for PC_TT_ATU
plot(time_data3$ANO_LAN,time_data3$PC_TT_ATU,type='l')

#time series for PC_TT_UN
plot(time_data3$ANO_LAN,time_data3$PC_TT_UN,type='l')

#time series for PC_TT_UN_U
plot(time_data3$ANO_LAN,time_data3$PC_TT_UN_U,type='l')

#PC_TT_ATU must be the updated price (preco atualizado)
#make graphs for MEDIAN
#time series analysis for median
time_data3_median = aggregate(data3_1, by=list(data3$ANO_LAN),
                              FUN=median, na.rm=TRUE)
#time series for PC_TT_ATU
plot(time_data3_median$ANO_LAN,time_data3_median$PC_TT_ATU,type='l')

#time series for PC_TT_UN
plot(time_data3_median$ANO_LAN,time_data3_median$PC_TT_UN,type='l')

#[1] "AR_UT_UNID" "PC_TT_UN_U" "PC_TT_ATU"  "PC_TT_UN"  
#time series for PC_TT_UN_U
plot(time_data3_median$ANO_LAN,time_data3_median$PC_TT_UN_U,type='l')

#plot PC_TT_ATU with the mean and the median side by side
plot(time_data3_median$ANO_LAN, time_data3_median$PC_TT_ATU,type='l')
lines(time_data3$ANO_LAN, time_data3$PC_TT_ATU,type='l', col='dark red')

The black line is the median, the dark red line is the mean.

The median hovers below the mean; the distribution is asymmetric.

#see how many municipalities there are in the dataset
data3$MUNICIPIO %>% table %>% kable
MUNICIPIO Freq
ARUJA 11
BARUERI 276
CAIEIRAS 1
CAJAMAR 39
CARAPICUIBA 52
COTIA 226
DIADEMA 105
EMBU 13
FERRAZ DE VASCON 28
FRANCO DA ROCHA 2
GUARAREMA 2
GUARULHOS 490
ITAPECERICA DA S 4
ITAPEVI 7
ITAQUAQUECETUBA 31
JANDIRA 25
MAIRIPORA 1
MAUA 107
MOGI DAS CRUZES 145
OSASCO 239
POA 33
SANTANA DE PARNA 49
SANTO ANDRE 539
SAO BERNARDO DO 691
SAO CAETANO DO S 413
SAO PAULO 13248
SUZANO 60
TABOAO DA SERRA 90
VARGEM GRANDE PA 8

This table shows the launch counts over all 29 years of data.

data3$MUNICIPIO %>% table %>% length
## [1] 29

There are only 29 municipalities in the dataset. The data is for the metropolitan region of São Paulo.

Another place where you can find the data and more: https://properatidata.carto.com/tables/lanres_85_13_rmsp_cem/public/map

Shrink Data

#shrink data -------------------------------
#this link: http://adv-r.had.co.nz/Subsetting.html
#have very interesting material
#
#we'll only use the last three years of data so that we can fit it into memory
#we'll use years 2011, 2012 and 2013
pos1 = (ano==2011|ano==2012|ano==2013)
#pos1 = ano %in% c(2011,2012,2013)
#create data31
data31  = data3[pos1,]
data31 %>% dim
## [1] 2794   87
#only 2794 observations, more feasible without crashing the computer

We use only the last three years of data (2011, 2012 and 2013) because the full dataset is too large for the memory of my machine to hold all the objects created, and a smaller dataset is faster to estimate on.

We will fit a cross-sectional boosting model. Our objective is to predict prices.

#take only variables that we'll use
data4 = data31[,c(
                "DORM_UNID", #DORM_UNID: quantity of bedrooms
                "BANH_UNID", #BANH_UNID: quantity of bathrooms
                "GAR_UNID",  #GAR_UNID: quantity of garage spaces
                "ELEV",      #ELEV: quantity of elevators
                "BLOCOS",    #BLOCOS: qnt of blocks in the condominium
                "UNIDAND",   #UNIDAND: qnt of units per floor
                "ANDARES",   #ANDARES: qnt of floors in the building
                "AR_UT_UNID",#AR_UT_UNID: usable area of the unit
                "PC_TT_ATU", #PC_TT_ATU: price of the unit
                "ANO_LAN",   #ANO_LAN: year of launch
                "long",      #long: longitude
                "lat")]      #lat: latitude
data4 %>% str
## 'data.frame':    2794 obs. of  12 variables:
##  $ DORM_UNID : int  2 2 2 3 3 3 2 2 2 2 ...
##  $ BANH_UNID : int  1 1 1 2 2 2 1 1 1 1 ...
##  $ GAR_UNID  : int  1 1 2 3 2 2 2 1 1 1 ...
##  $ ELEV      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BLOCOS    : int  7 1 1 0 1 1 1 7 1 1 ...
##  $ UNIDAND   : int  7 0 0 0 0 0 0 4 0 0 ...
##  $ ANDARES   : int  4 2 2 2 2 2 2 4 1 2 ...
##  $ AR_UT_UNID: num  44 48 61.4 76.8 86.9 ...
##  $ PC_TT_ATU : num  129558 137984 146684 185729 282754 ...
##  $ ANO_LAN   : int  2013 2013 2012 2012 2012 2012 2012 2012 2012 2013 ...
##  $ long      : num  -47 -47 -47.1 -47.1 -47 ...
##  $ lat       : num  -23.7 -23.7 -23.7 -23.7 -23.6 ...

All variables are numeric.

Some Maps

#apply coordinates into data frame to create spatial points 
#create spreal2
spreal2 = data4
coordinates(spreal2) = ~long+lat
spreal2@proj4string = spreal@proj4string
spreal2 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
plot(spreal2)

These are the observation points for the smaller sample of the last three years.

For our models we will use only these points. There are 2794 observations.

spreal2@data[1:4, ] %>% kable
DORM_UNID BANH_UNID GAR_UNID ELEV BLOCOS UNIDAND ANDARES AR_UT_UNID PC_TT_ATU ANO_LAN
2 2 1 1 0 7 7 4 44.00 129558.0 2013
3 2 1 1 0 1 0 2 48.00 137983.7 2013
4 2 1 2 0 1 0 2 61.44 146683.6 2012
5 3 2 3 0 0 0 2 76.80 185728.9 2012
#to save the df to disk
#saveRDS(spreal2, "spreal2.RData")
#all points in our dataset
#load municipality borders -------------------------
#we use the municipality borders only to make the maps;
#the models do not use the municipality polygons
munic = readOGR("~/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer="municip07")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/william/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer: "municip07"
## with 5561 features
## It has 3 fields
#pick up only municipalities that are in the state of sao paulo
state = substr(munic$CODIGOIB1,1,2)
munic=munic[state=='35',] #keep only state code 35, Sao Paulo
plot(munic)

Those are the municipalities in the state of São Paulo.

#transform cod ibge from munic into character
munic$CODIGOIB1 = as.character(munic$CODIGOIB1)

Municipalities in the metropolitan region:

  • North: Caieiras, Cajamar, Francisco Morato, Franco da Rocha and Mairipora.

  • East: Aruja, Biritiba Mirim, Ferraz de Vasconcelos, Guararema, Guarulhos, Itaquaquecetuba, Mogi das Cruzes, Poa, Salesopolis, Santa Isabel and Suzano.

  • Southeast: Diadema, Maua, Ribeirao Pires, Rio Grande da Serra, Santo Andre, Sao Bernardo do Campo and Sao Caetano do Sul.

  • Southwest: Cotia, Embu, Embu Guacu, Itapecerica da Serra, Juquitiba, Sao Lourenco da Serra, Taboao da Serra and Vargem Grande Paulista.

  • West: Barueri, Carapicuiba, Itapevi, Jandira, Osasco, Pirapora do Bom Jesus and Santana de Parnaiba.

#I took those codes by hand, but there must be a better way to get them
list_metropolitan = c("350900","350920","351630","351640","352850",
                      "350390","350660","351570","351830","351880",
                      "352310","353060","353980","354500","354680",
                      "355250","351380","352940","354330","354410",
                      "354780","354870","354880","351300","351500",
                      "351510","352220","352620","354995","355280",
                      "355645","350570","351060","352250","352500",
                      "353440","353910","354730","355030")
length(list_metropolitan)
## [1] 39

There are 39 municipalities in the metropolitan region of São Paulo.

# a test to find municipality in list 
sum(munic@data$CODIGOIB1 == "355030") 
## [1] 1
#select only the municipalities of the metropolitan region
match_places = match(list_metropolitan, munic@data$CODIGOIB1)
#select wanted municipalities in munic dataset
munic2 = munic[match_places,]

Let vec1 and vec2 be two vectors. The match(vec1, vec2) function returns, for each element of vec1, the position of its first match in vec2. Here, out of all the municipalities of the state of São Paulo, we want only those in the vector list_metropolitan.
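
A tiny made-up example of that behavior (vec1 and vec2 here are illustrative vectors, not objects from this analysis):

vec1 = c("355030", "999999")
vec2 = c("350390", "355030", "355030")
match(vec1, vec2) #position of the first match of each element of vec1 in vec2, NA when absent
## [1]  2 NA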

#map of the metropolitan region of sao paulo
plot(munic2)

#get map from google
#                        ( west,south, east,north) #determined by hand on google maps
map = get_map(location =c(-47.2,-24.1,-45.6,-23.1),maptype = 'satellite')
plot(map)

This is a map from Google of the metropolitan region.

#procedures for ggplot
posReal = data.frame(lat=coordinates(spreal2)[,2],
                     lon=coordinates(spreal2)[,1])

munic_fortify = fortify(munic2, region="CODIGOIB1")

(sp_real_state =
    ggmap(map,  ylab = "Latitude", xlab = "Longitude") + 
    geom_polygon(data=munic_fortify, color='blue', alpha=0,
                 aes(y=lat, x=long, group=group, fill='grey'))+
    geom_point(data=posReal, aes(x=lon, y=lat),size=.5, color='red', alpha=.5))

ggsave("images/sp_real_state_last2years.jpg", width = 20, height = 20, units = "cm",dpi = 600)

Red points are the observation points for the smaller sample. The blue borders are the municipalities of the metropolitan region.

#plot raster map --------------------------------------
#create polygon to build raster
spdf = spreal2 #spdf is a working copy of the spatial points object
proj4string(munic2) #no projection
## [1] NA
pol = as(munic2,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 10000) #roughly 10 000 little squares
gridded(grid) = TRUE
proj4string(grid) = proj4string(spdf) #give the grid the same projection as spreal
#plot of the grid of little squares
plot(grid)

This grid has roughly 10 000 squares. We use it to make an IDW interpolation and represent the data in raster format.

The object munic2 is a spatial polygon with all the municipalities of the metropolitan region.
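
IDW predicts the value at each grid cell as a weighted average of the observed prices, with weights given by an inverse power of the distance to each observation (the idp argument of gstat's idw(), which defaults to 2):

$$
\hat{z}(s_0) = \frac{\sum_{i} w_i \, z(s_i)}{\sum_{i} w_i}, \qquad w_i = \lVert s_0 - s_i \rVert^{-p}
$$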

#we must use spreal2 which contains data about prices
v = spreal2$PC_TT_ATU
df1 = data.frame(v)
spdf = spreal2 #spdf is a working copy of the spatial points object
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) 
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "variable")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
  tm_raster("variable",  style = "quantile", #breaks = c(0,0.5,1,2,3,4,5,6),
            n = 20, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.4, col = "black", lwd=.3)+
  tm_layout(legend.text.size = .5,
            legend.position = c("left","bottom"),
            title="Prices",
            title.size=1,
            title.position = c(0,.9))
tm1

tmap_save(tm = tm1,
          filename = "name1.jpg",
          width = 3, height = 3)
#some areas of the map are actually non informative because 
#we don't have data

Bluer regions have higher prices.

This map may be misleading, because in many places there are few observed points. We make the area of interpolation smaller so that we get a more accurate view of how prices are distributed on the map.

The step we change is where we create the grid:

pol = as(munic2, "SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 10000)

We replace munic2 with another spatial polygons object (the dissolved buffer built below).

#just an experiment with hexagons... 
size = 0.03
hex_points = spsample(munic2, type = "hexagonal", cellsize = size)
hex_points %>% class
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
hex_grid = HexPoints2SpatialPolygons(hex_points, dx = size)
hex_grid %>% plot

#we could study the spatial point process of occurrences of launches

This is an experiment in using hexagonal polygons.

Idea: make a price map over these polygons instead of a continuous raster-like map, as sketched below. The problem is that we would need information covering a larger area, not just concentrated in a few places.
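
A rough sketch of that idea (hex_price and hex_spdf are hypothetical helper objects, and I assume hex_grid is first given the same CRS as spreal2; this is not run in this post):

#average the launch prices of the points falling inside each hexagon, instead of interpolating
proj4string(hex_grid) = proj4string(spreal2) #give the hexagons the CRS of the points
hex_price = over(hex_grid, spreal2[, "PC_TT_ATU"], fn = mean) #mean price per hexagon
hex_spdf = SpatialPolygonsDataFrame(hex_grid, data = hex_price, match.ID = FALSE)
spplot(hex_spdf, "PC_TT_ATU") #hexagons with no launches show up as NA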

#experimenting with the sf package ------------------------------
#I want to build a radius around each point of observation
#and build a spatial polygon from it
#exercise taken from answer:
#https://gis.stackexchange.com/a/292336

spreal2 %>% names
##  [1] "DORM_UNID"  "BANH_UNID"  "GAR_UNID"   "ELEV"       "BLOCOS"    
##  [6] "UNIDAND"    "ANDARES"    "AR_UT_UNID" "PC_TT_ATU"  "ANO_LAN"
matcor = coordinates(spreal2)
points2 = st_as_sf(spreal2,coords=c("long","lat"))
points2 %>% class
## [1] "sf"         "data.frame"
points2 %>% plot

Here I'm experimenting with the sf package. This plot shows the points colored by the intensity of each variable, all of which are numeric or integer.

#make polygon union all municipalities in metropolitan region
unionMunic = gUnaryUnion(munic2)
unionMunic %>% class
## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"
unionMunic %>% plot

The gUnaryUnion() function takes a spatial polygons object as input and outputs just the outer border, dissolving the internal boundaries.

#Convert sp to sf:
sPol1 = st_as_sf(unionMunic)
sPol1 %>% class
## [1] "sf"         "data.frame"
sPol1 %>% plot

The plot of an sf polygon object looks the same as that of an sp spatial polygon.

#points2 has a coordinate system, take it to the sPol1: 
st_crs(sPol1)
## Coordinate Reference System: NA
st_crs(points2)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
st_crs(sPol1) = st_crs(points2)

Here we assign a projection to the object sPol1. Note that sPol1 does not have a projection; we give it the same projection as points2.

#Transform to metric coordinate system:
points2Km = st_transform(points2, "+proj=utm +zone=42N +datum=WGS84 +units=km")
sPol1Km = st_transform(sPol1, "+proj=utm +zone=42N +datum=WGS84 +units=km")

Here we transform the projection from long-lat to UTM. UTM coordinates are measured in meters (here kilometers, given units=km) instead of degrees.

points2Km %>% plot

sPol1Km %>% plot

#the plots look strange, they are upside down

When we change the projection the plots look strange, upside down, likely because the UTM zone used here (42N) is not the zone that covers São Paulo (which lies in zone 23, southern hemisphere). In this projection the map is not informative; for the maps we need a long-lat projection.

#Create 10km buffer zone:
points2_buffer = st_buffer(points2Km, 10)
points2_buffer %>% class
## [1] "sf"         "data.frame"

Here we are making a buffer zone around each point of observation. We chose a 10km radius around each point. From this buffer zone we’ll make the IDW raster map.

#find intersection between municipality polygons and bufferzone
points2_sPol1 = st_intersection(points2_buffer, sPol1Km)
points2_sPol1 %>% class
## [1] "sf"         "data.frame"

st_intersection() computes the intersection of two polygon objects; we built a buffer zone, but we also want the resulting polygon to stay inside the border of the metropolitan region. points2_buffer is the 10 km buffer zone, and sPol1Km is the metropolitan region polygon.

#https://gis.stackexchange.com/questions/239118/r-convert-sf-object-back-to-spatialpolygonsdataframe
#https://gis.stackexchange.com/a/239215
#transform sf object to spatial polygons object
buffsp = as(points2_sPol1, 'Spatial')
buffsp %>% class
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

Here we transform the sf object back into an sp spatial polygons object.

buffsp %>% plot

This is the result of making the buffer and clipping it where the 10 km circles extend beyond the metropolitan region border. Note in the southeastern part that the circles are cut.

#make union of spatial polygons
unionSp = gUnaryUnion(buffsp)
unionSp %>% plot

This is the result when we dissolve the internal boundaries.

Remember that the object is still in the UTM projection, so it appears upside down.

#change projection
#https://stackoverflow.com/a/30018607/12552718
#https://stackoverflow.com/questions/30018098/how-to-convert-utm-coordinates-to-lat-and-long-in-r
#sputm = SpatialPoints(randompoints, proj4string=CRS("+proj=utm +zone=42 +datum=WGS84"))
unionSp = spTransform(unionSp, CRS("+proj=longlat +datum=WGS84"))

Here we change the projection from UTM back to long-lat, which puts the map in the right orientation.

unionSp %>% plot(col='grey')
spreal2 %>% plot(add=T, col='red')
munic2 %>% plot(add=T)

#we'll use unionSp to make the IDW interpolation
#and visualization of prices and other variables

Red crosses are the observation points. The gray area is the buffer. Black lines are the municipality borders of the metropolitan region.

#plot all polygons and spatial points together
munic_fortify = fortify(munic2, region="CODIGOIB1")
unionFort = fortify(unionSp, region="1")
posReal = data.frame(lat=coordinates(spreal2)[,2],
                     lon=coordinates(spreal2)[,1])
(mapPol =
    ggmap(map,  ylab = "Latitude", xlab = "Longitude") + 
    geom_polygon(data=munic_fortify, color='blue', alpha=0,
                 aes(y=lat, x=long, group=group, fill='grey'))+
    geom_polygon(data=unionFort, color='black', alpha=0,
                 aes(y=lat, x=long, group=group, fill='grey'))+
    geom_point(data=posReal, aes(x=lon, y=lat),size=1, color='red', alpha=.5))

ggsave("mapPol.jpg", width = 20, height = 20, units = "cm",dpi = 600)

The red points are the observations. Black lines are the 10 km buffer area around each point. Blue lines are the municipal borders of the metropolitan region of São Paulo.

#build raster map with new frontier -----------------------
#create polygon to build raster
spdf = spreal2 #spdf is a working copy of the spatial points object
pol = as(unionSp,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 30000) #roughly 30 000 little squares
gridded(grid) = TRUE
proj4string(grid) = proj4string(spdf) #give the grid the same projection as spreal
#plot of the grid of little squares
plot(grid)

Here we make a grid of roughly 30 000 squares for the IDW interpolation.

#we must use spreal2 which contains data about prices
v = spreal2$PC_TT_ATU
df1 = data.frame(v)
spdf = spreal2 #spdf is a working copy of the spatial points object
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) 
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "variable")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)

# map with munic borders
tm2 = tm_shape(raster_idw) +
  tm_raster("variable",  style = "quantile", 
            n = 10, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.4, col = "black", lwd=.6)+
  tm_layout(legend.text.size = .5,
            legend.position = c("right","bottom"),
            title="Real Estate Prices",
            title.size=1,
            title.position = c(0.01,.9))
tm2

This map shows a more proportional picture of how prices are distributed. In the center of São Paulo prices are higher, and the IDW interpolation spreads those values outward. Note the circular blob in the north: the price is low at its center but becomes higher toward its edges, which does not make sense. This is an artifact of IDW pulling in the higher prices from the center of the city.

Models

OLS

# apply a simple linear model to the real estate data --------------
data4 =  spreal2@data
data4 %>% names %>% kable
x
DORM_UNID
BANH_UNID
GAR_UNID
ELEV
BLOCOS
UNIDAND
ANDARES
AR_UT_UNID
PC_TT_ATU
ANO_LAN
prices = data4$PC_TT_ATU
hist(prices, breaks=200)

#prices are highly asymmetric, so we take the log
pricesl = log(prices)
hist(pricesl, breaks=200)

#now it has a more Gaussian-like shape
#create data41 to substitute in the log prices
data41 = data4
data41$PC_TT_ATU = pricesl
#standardize the data, use data41
data5 = as.data.frame(scale(data41)) 

We need to standardize the data because otherwise the inla function has convergence problems.
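
scale() stores the centering and scaling constants as attributes, so a standardized fit can be mapped back to log prices and then to reais. A minimal sketch (back_to_reais is a hypothetical helper, not used elsewhere in this post):

sc1 = scale(data41) #same standardization as above, kept as a matrix to read its attributes
ctr = attr(sc1, "scaled:center")["PC_TT_ATU"]
sdv = attr(sc1, "scaled:scale")["PC_TT_ATU"]
back_to_reais = function(z) exp(z*sdv + ctr) #undo the scaling, then undo the log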

nrow1 = nrow(data5) 
set.seed(1)
train = sample(1:nrow1,nrow1/1.3)
train %>% length
## [1] 2149
train %>% length/nrow1 #proportion of train set
## [1] 0.7691482
#create train test dataset
trainX = data5[train,] #PC_TT_ATU stays in the data frame; the lm formula below uses it as the response
trainY = data5$PC_TT_ATU[train]
testX = data5[-train,] #test set: the observations not in train
testY = data5$PC_TT_ATU[-train]
mOls = lm(PC_TT_ATU ~., data=trainX)
summary(mOls)
## 
## Call:
## lm(formula = PC_TT_ATU ~ ., data = trainX)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4559 -0.3542 -0.0175  0.3182  2.0433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0006293  0.0111588   0.056    0.955    
## DORM_UNID   -0.2063837  0.0180898 -11.409  < 2e-16 ***
## BANH_UNID    0.2661923  0.0211269  12.600  < 2e-16 ***
## GAR_UNID     0.3242184  0.0207998  15.588  < 2e-16 ***
## ELEV        -0.0161026  0.0129381  -1.245    0.213    
## BLOCOS      -0.1093757  0.0112653  -9.709  < 2e-16 ***
## UNIDAND      0.0918194  0.0145181   6.324 3.08e-10 ***
## ANDARES      0.1958803  0.0136529  14.347  < 2e-16 ***
## AR_UT_UNID   0.4264263  0.0200034  21.318  < 2e-16 ***
## ANO_LAN      0.0469628  0.0112192   4.186 2.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5169 on 2139 degrees of freedom
## Multiple R-squared:  0.7417, Adjusted R-squared:  0.7406 
## F-statistic: 682.4 on 9 and 2139 DF,  p-value: < 2.2e-16

DORM_UNID has a negative impact on price, and BANH_UNID has a positive impact.

Although the prices are in present value, the year of launch ANO_LAN has a positive impact on price.

The number of garage spaces GAR_UNID has a positive and significant impact on price.

AR_UT_UNID has a positive impact on prices.

#take the residuals of model ols
res1 = residuals(mOls)

#create spatial object for training
sptrain = spreal2[train,]

knn1 = knearneigh(sptrain, k=5, longlat = TRUE)
nb1 = knn2nb(knn1)
listw1 = nb2listw(nb1)
moran.test(res1, listw1)
## 
##  Moran I test under randomisation
## 
## data:  res1  
## weights: listw1    
## 
## Moran I statistic standard deviate = 45.218, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5683299088     -0.0004655493      0.0001582306
#there is spatial correlation between residuals of simple ols

Note that there is spatial correlation in the residuals; there is still some structure left to model. We'll explore this structure further in the next model.
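
For reference, with spatial weights w_ij (row-standardized by default in nb2listw()) Moran's I of the residuals e_i is

$$
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(e_i - \bar{e})(e_j - \bar{e})}{\sum_{i}(e_i - \bar{e})^2},
$$

and under the null of no spatial autocorrelation its expectation is -1/(n-1), about -0.00047 for the 2149 training observations, matching the expectation reported above. The observed value of about 0.57 therefore indicates strong positive spatial autocorrelation.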

yhr = predict(mOls,trainX)
resOls = yhr-trainY
mean((resOls)^2)
## [1] 0.2659673
#this is the train MSE

yh = predict(mOls,testX)
mean((yh-testY)^2)
## [1] 1.670049
#this is the test MSE
orig1 = sptrain$PC_TT_ATU
v = (resOls)*sd(orig1) #rescale the residuals by the sd of the original prices (the model itself is in standardized log prices)
df1 = data.frame(v)
spdf = spreal2 #spdf is a working copy of the spatial points object
coordinates(df1) = coordinates(sptrain)
proj4string(df1) = proj4string(sptrain)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) 
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "variable")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)

# map with munic borders
tm_resOLS = tm_shape(raster_idw) +
  tm_raster("variable",  style = "quantile", 
            n = 10, palette = "Spectral", legend.show = TRUE)+
  tm_shape(munic)+
  tm_borders(alpha=.4, col = "black", lwd=.6)+
  tm_layout(legend.text.size = .5,
            legend.position = c("right","bottom"),
            title="Residuals OLS",
            title.size=1,
            title.position = c(0.01,.9))
tm_resOLS

Map of OLS residuals.

There are clearly clusters of values. This explains the presence of spatial autocorrelation.

INLA

The INLA package uses integrated nested Laplace approximations to perform Bayesian inference.

#spatial model ----------------------
#try doing re-scale when convergence problems occur
data6 = data5
data6$ID = 1:dim(data6)[1]
spatial_data = spreal2
spatial_data@data = data6
spatial_data$PC_TT_ATU %>% hist(breaks=100) #it is gaussian like

knn1 = knearneigh(spatial_data, k=5, longlat = TRUE) #use knn, 5 nearest neighbors
nb1 = knn2nb(knn1)
nb2INLA("nc.adj", nb1) #this writes an adjacency file called nc.adj that a spatial model can use
#inla model without spatial error random effects
m1_01 = inla(PC_TT_ATU ~ DORM_UNID+BANH_UNID+GAR_UNID+ELEV+BLOCOS+UNIDAND+ANDARES+AR_UT_UNID+ANO_LAN, family="gaussian", data=data6, control.predictor=list(compute=TRUE), verbose=FALSE)

summary(m1_01)
## 
## Call:
##    c("inla(formula = PC_TT_ATU ~ DORM_UNID + BANH_UNID + GAR_UNID + ", " 
##    ELEV + BLOCOS + UNIDAND + ANDARES + AR_UT_UNID + ANO_LAN, ", " family = 
##    \"gaussian\", data = data6, verbose = FALSE, control.predictor = 
##    list(compute = TRUE))" ) 
## Time used:
##     Pre = 1.26, Running = 2.99, Post = 0.353, Total = 4.6 
## Fixed effects:
##               mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## (Intercept)  0.000 0.010     -0.019    0.000      0.019  0.000   0
## DORM_UNID   -0.194 0.016     -0.225   -0.194     -0.163 -0.194   0
## BANH_UNID    0.252 0.018      0.216    0.252      0.288  0.252   0
## GAR_UNID     0.309 0.018      0.273    0.309      0.344  0.309   0
## ELEV        -0.014 0.012     -0.037   -0.014      0.008 -0.014   0
## BLOCOS      -0.118 0.011     -0.139   -0.118     -0.097 -0.118   0
## UNIDAND      0.086 0.012      0.063    0.086      0.110  0.086   0
## ANDARES      0.194 0.012      0.171    0.194      0.217  0.194   0
## AR_UT_UNID   0.448 0.018      0.412    0.448      0.485  0.448   0
## ANO_LAN      0.046 0.010      0.027    0.046      0.066  0.046   0
## 
## Model hyperparameters:
##                                         mean    sd 0.025quant 0.5quant
## Precision for the Gaussian observations 3.75 0.099       3.55     3.74
##                                         0.975quant mode
## Precision for the Gaussian observations       3.94 3.74
## 
## Expected number of effective parameters(stdev): 10.06(0.002)
## Number of equivalent replicates : 277.62 
## 
## Marginal log-Likelihood:  -2201.94 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

INLA is a Bayesian estimation method, so each parameter has a posterior distribution and a credible interval. This model is basically the same as the OLS, and you can see that the results are almost identical. In Bayesian estimation the prior can have an important impact on the estimates.
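
The adjacency file nc.adj created above is not actually used by this model. A minimal sketch of how a spatial random effect over that k-nearest-neighbour graph could be added (m1_02 is only a sketch, not a model estimated in this post):

#intrinsic CAR ("besag") random effect over the nc.adj neighbourhood graph
m1_02 = inla(PC_TT_ATU ~ DORM_UNID + BANH_UNID + GAR_UNID + ELEV + BLOCOS + UNIDAND +
               ANDARES + AR_UT_UNID + ANO_LAN + f(ID, model = "besag", graph = "nc.adj"),
             family = "gaussian", data = data6,
             control.predictor = list(compute = TRUE), verbose = FALSE)
#summary(m1_02) #compare the fixed effects and the marginal log-likelihood with m1_01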

Boosting

#boosting --------------------------------------------------------------
#create data12
data12 = spreal2@data
data12 %>% names
##  [1] "DORM_UNID"  "BANH_UNID"  "GAR_UNID"   "ELEV"       "BLOCOS"    
##  [6] "UNIDAND"    "ANDARES"    "AR_UT_UNID" "PC_TT_ATU"  "ANO_LAN"
#create data2 another model with spatial information
data2 = data12
data2$long = coordinates(spreal2)[,1]
data2$lat = coordinates(spreal2)[,2]
data2 %>% names #include long and lat to the model
##  [1] "DORM_UNID"  "BANH_UNID"  "GAR_UNID"   "ELEV"       "BLOCOS"    
##  [6] "UNIDAND"    "ANDARES"    "AR_UT_UNID" "PC_TT_ATU"  "ANO_LAN"   
## [11] "long"       "lat"
#model with data12, without long lat
nrow1 = nrow(data12)
set.seed(1)
train = sample(1:nrow1,nrow1/1.3)
train %>% length
## [1] 2149
trainX = data12[train,]
trainY = data12$PC_TT_ATU[train]
testX = data12[-train,]
testY = data12$PC_TT_ATU[-train]
trainSpreal = spreal2[train,]
m1 = gbm(PC_TT_ATU~., data=trainX, distribution='gaussian',
         n.trees = 1000, interaction.depth = 4)
m1
## gbm(formula = PC_TT_ATU ~ ., distribution = "gaussian", data = trainX, 
##     n.trees = 1000, interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 9 predictors of which 9 had non-zero influence.
m1 %>% summary

##                   var   rel.inf
## AR_UT_UNID AR_UT_UNID 81.754502
## GAR_UNID     GAR_UNID  6.730623
## ANDARES       ANDARES  3.742729
## UNIDAND       UNIDAND  2.495743
## ELEV             ELEV  1.574726
## DORM_UNID   DORM_UNID  1.244310
## ANO_LAN       ANO_LAN  1.094080
## BANH_UNID   BANH_UNID  1.014951
## BLOCOS         BLOCOS  0.348338
#export values to disk to be used in another script
saveRDS(munic, file = paste0(pathExport,"/munic.rds"))
saveRDS(m1,     file = paste0(pathExport,"/modelBoosting.rds"))
saveRDS(trainX, file = paste0(pathExport,"/trainX.rds"))
saveRDS(trainY, file = paste0(pathExport,"/trainY.rds"))
saveRDS(testX,  file = paste0(pathExport,"/testX.rds"))
saveRDS(testY,  file = paste0(pathExport,"/testY.rds"))
saveRDS(trainSpreal, file = paste0(pathExport,"/trainSpreal.rds"))

# Restore the object
#readRDS(file = "my_data.rds")
par(mfrow=c(1,2))
plot(m1,i='AR_UT_UNID')

plot(m1,i='GAR_UNID')

#train set MSE
yh1 = predict(m1,trainX,n.trees=1000)
mean((yh1-trainY)^2)
## [1] 30646177516
#test set MSE
yh = predict(m1,testX,n.trees=1000)
mean((yh-testY)^2)
## [1] 43306155745
#model with data2, with long lat
dataC = data2 #dataC is a working copy of the data, just to hold the values
nrow1 = nrow(dataC)
set.seed(1)
train = sample(1:nrow1,nrow1/1.3)
train %>% length
## [1] 2149
trainX = dataC[train,]
trainY = dataC$PC_TT_ATU[train]
testX = dataC[-train,]
testY = dataC$PC_TT_ATU[-train]
m1 = gbm(PC_TT_ATU~., data=trainX, distribution='gaussian',
         n.trees = 1000, interaction.depth = 4)
m1 %>% summary

##                   var    rel.inf
## AR_UT_UNID AR_UT_UNID 80.4970312
## GAR_UNID     GAR_UNID  5.4575780
## long             long  4.8621959
## lat               lat  4.4585578
## UNIDAND       UNIDAND  1.5815730
## ANDARES       ANDARES  1.3118184
## BANH_UNID   BANH_UNID  0.6119743
## ELEV             ELEV  0.4719243
## ANO_LAN       ANO_LAN  0.3535569
## DORM_UNID   DORM_UNID  0.2617212
## BLOCOS         BLOCOS  0.1320689

By far AR_UT_UNID is the most important variable for explaining prices. Longitude and latitude are also near the top, although not as important as the area of the unit. This shows that location is one of the most important variables explaining real estate prices.

plot(m1,i='AR_UT_UNID')

The area of the unit AR_UT_UNID has an increasing, positive impact on prices.

plot(m1,i='GAR_UNID')

plot(m1,i='long')

plot(m1,i='lat')

Longitude and latitude show that prices are higher in the center of the map. This makes sense because that is the center of São Paulo.

Idea: we should see how long and lat behave jointly in a two-dimensional setting, and plot the result over the map.
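
gbm can produce that joint view directly through a two-variable partial dependence plot; a quick sketch (not run in this post):

#joint partial dependence of the price on longitude and latitude
plot(m1, i.var = c("long", "lat"))
#return.grid = TRUE returns the underlying grid, which could then be rasterized
#and drawn over the municipality borders
pd_grid = plot(m1, i.var = c("long", "lat"), return.grid = TRUE)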

#for long lat it would be better to make a 2-dimensional exploration and
#plot it on a map
plot(m1,i='BANH_UNID')

The number of bathrooms has a steadily increasing effect on prices and then drops; maybe units with more than 3 bathrooms are another type of housing.

plot(m1,i='DORM_UNID')

Note how DORM_UNID has a negative impact on prices.

#train set MSE
yh1 = predict(m1,trainX,n.trees=1000)
mean((yh1-trainY)^2)
## [1] 13628927882
#test set MSE
yh = predict(m1,testX,n.trees=1000)
mean((yh-testY)^2)
## [1] 22697302611
#we should fit the model with price per square meter, which is a better measure
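
A minimal sketch of that alternative target (PC_M2 and dataM2 below are new objects introduced for illustration, not part of the original analysis):

#price per square meter of usable area as the target
dataM2 = data2
dataM2$PC_M2 = dataM2$PC_TT_ATU / dataM2$AR_UT_UNID
dataM2$PC_TT_ATU = NULL #drop the total price so only the per-m2 target remains
m_m2 = gbm(PC_M2 ~ ., data = dataM2[train,], distribution = 'gaussian',
           n.trees = 1000, interaction.depth = 4)
summary(m_m2)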

To do: make a map of the residuals from the boosting model, and interpret the boosting results.