The objective of this work is to explore data on real estate launches in the metropolitan region of São Paulo.
I would like to thank Ian Fukushima for suggestions, ideas, and encouragement. I would also like to thank my advisor Márcio Laurini for teaching me almost everything I know about spatial statistics, and for suggesting, through his course, the dataset that I'm using.
The data source is the Centro de Estudos da Metrópole (CEM). Let's explore what the dataset contains.
library(rgdal)
library(tidyverse)
library(Amelia)
library(ggmap)
library(broom) #for tidy (alternative to fortify)
library(rgeos) #gBuffer, gUnaryUnion
library(ggvoronoi) #stat_voronoi
library(spdep)
library(INLA)
library(kdensity) #for kdensity
library(raster) #raster()
library(gstat) #idw function
library(tmap) #to use tmap function
library(sf) #for sf objects
library(gbm) #gradient boosting
library(knitr) #for markdown kable
#this data was used during the Spatial Statistics course by Marcio Laurini
#it was produced by the Center for Metropolitan Studies (CEM)
#the dataset contains information about launch prices of real estate
#in the metropolitan region of Sao Paulo
You should change path1 to match your machine. The path below points to my Dropbox location for the data used.
path1 = "~/Dropbox/working/RAW_DATA/BASESCEM"
#lancamento de imoveis na regiao metropolitana
spreal=readOGR( paste(path1,"/LanRes_85_13_RMSP_CEM.shp",sep=""), layer="LanRes_85_13_RMSP_CEM")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/william/Dropbox/working/RAW_DATA/BASESCEM/LanRes_85_13_RMSP_CEM.shp", layer: "LanRes_85_13_RMSP_CEM"
## with 16935 features
## It has 85 fields
## Integer64 fields read as strings: ID CEP_4DIG AR_TT_TERR PC_TT_UN VLR_US__CO COOPERATIV
pathExport = "/home/william/Dropbox/working/projects_git/sp_real_estate/export_objects"
#create coordinate variables
coords = coordinates(spreal)
data = spreal@data
data$lat = coords[,2]
data$long = coords[,1]
#transform all variables that are factor into character
data2 = data
len1 = length(names(data2))
for (ii in 1:len1) {
vv = data[,ii]
if (is.factor(vv)) {
data2[,ii] = as.character(vv)
}
}
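Since the tidyverse is already loaded, the same conversion can be written more compactly; a sketch, assuming dplyr >= 1.0:
#idiomatic alternative: convert every factor column to character at once
data2 = data %>% mutate(across(where(is.factor), as.character))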
str(data2)
## 'data.frame': 16935 obs. of 87 variables:
## $ ID : chr "9037" "16908" "16565" "15812" ...
## $ COD_EMP : int 9056 16935 16592 15839 15840 15390 15910 9326 14147 8239 ...
## $ TIPO_EMP : chr "HORIZ" "VERTIC" "HORIZ" "HORIZ" ...
## $ MES_LAN : chr "15-JUL-2004" "15-DEC-2013" "15-SEP-2013" "15-DEC-2012" ...
## $ ANO_LAN : int 2004 2013 2013 2012 2012 2012 2012 2004 2010 2003 ...
## $ DATA_ENT : chr "15-JUL-2004" "15-APR-2016" "15-AUG-2015" "15-MAR-2015" ...
## $ DIST : chr NA NA NA NA ...
## $ SUBPREF : chr NA NA NA NA ...
## $ MUNICIPIO : chr "COTIA" "COTIA" "COTIA" "COTIA" ...
## $ ENDERECOCO: chr "Ouro Preto (Rua) 303" "Agua Espraiada (Estr) 320" "Caucaia (Estr) 3 000" "Jose Teixeira de Oliveira (Rua) 620" ...
## $ LOGRADOURO: chr "Ouro Preto" "Agua Espraiada" "Caucaia" "Jose Teixeira de Oliveira" ...
## $ TIT_VIA : chr NA NA NA NA ...
## $ TIPO_VIA : chr "Rua" "Estr" "Estr" "Rua" ...
## $ NUM : chr "303" "320" "3000" "620" ...
## $ CEP : chr "04330-045" "06726-400" "06727-190" "06726-486" ...
## $ CEP_4DIG : chr "4330" "6726" "6727" "6726" ...
## $ NOME_EMP : chr "OURO PRETO" "PARQUE SAINT BENJAMIN" "RES VIVA CAUCAIA - 1\xaa FASE" "RES AGUASSAI I" ...
## $ ZONA : chr "R" "R" "R" "R" ...
## $ SETOR : chr "M" "M" "M" "M" ...
## $ QUADRA : chr "S" "S" "S" "S" ...
## $ LOTE : chr "P" "P" "P" "P" ...
## $ DORM_UNID : int 4 2 2 2 3 3 3 2 3 2 ...
## $ BANH_UNID : int 4 1 1 1 2 2 2 1 2 1 ...
## $ GAR_UNID : int 4 1 1 2 3 2 2 2 2 1 ...
## $ ELEV : int 0 0 0 0 0 0 0 0 0 0 ...
## $ COB : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BLOCOS : int 1 7 1 1 0 1 1 1 1 1 ...
## $ UNIDAND : int 0 7 0 0 0 0 0 0 0 0 ...
## $ ANDARES : int 2 4 2 2 2 2 2 2 1 1 ...
## $ AR_UT_UNID: chr "180" "44" "48" "61,44" ...
## $ AR_TT_UNID: chr "500" "50" "70" "90" ...
## $ AR_TT_TERR: chr "9986" "16165" "25470" "32100" ...
## $ TT_UNID : int 10 192 198 108 36 20 22 70 11 78 ...
## $ DORM_EMP : int 40 384 396 216 108 60 66 140 33 156 ...
## $ BANH_EMP : int 40 192 198 108 72 40 44 70 22 78 ...
## $ GAR_EMP : int 40 192 198 216 108 40 44 140 22 78 ...
## $ AU_EMP : chr "1800" "8427" "9563" "6635,52" ...
## $ AT_EMP : chr "5000" "9663" "13860" "9720" ...
## $ PC_TT_UN : chr "290000" "129558" "135800" "139000" ...
## $ PC_M2_AU : chr "1611,11" "2952" "2812" "2262,37" ...
## $ PC_M2_AT : chr "580" "2574,17" "1940" "1544,44" ...
## $ PC_TT_ATU : chr "489006,07" "129558" "137983,74" "146683,58" ...
## $ PC_AU_ATU : chr "2716,7" "2952" "2857,22" "2387,43" ...
## $ PC_AT_ATU : chr "978,01" "2574,17" "1971,2" "1629,81" ...
## $ PC_EMP_ATU: chr "4890060,73" "24875136" "27320780,31" "15841827,03" ...
## $ VLR_US__CO: chr "3" "2" "2" "2" ...
## $ PC_TT_UN_U: chr "93942,34" "54944,02" "57178,95" "65596,98" ...
## $ PC_M2_AU_U: chr "521,9" "1251,86" "1183,83" "1067,66" ...
## $ PC_M2_AT_U: chr "187,88" "1091,68" "816,84" "728,85" ...
## $ SIST_FINAN: chr "Preco Fechado" "Preco Fechado" "Preco Fechado" "Preco Fechado" ...
## $ AGENTE : chr NA "BANCO DO BRASIL" "CEF" "SFH" ...
## $ INCORPORAD: chr "LM Incorporacoes e Empreendimentos Ltda" "MRV Engenharia e Participacoes S/A" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
## $ VENDEDORA : chr "Formanova/Proin/G3I/3\xbaMI/Especonc/Ellit" "Habitcasa Consultoria de Imoveis Ltda" "Drive Imoveis" "Atam Empreendimentos" ...
## $ CONSTRUTOR: chr "Consil Engenharia S/C Ltda" "MRV Engenharia e Participacoes S/A" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
## $ ENGENHEIRO: chr "Nao Informado" "Nao Informado" "Nao Informado" "Nao Informado" ...
## $ ARQUITETO : chr "Nao Informado" "Nao Informado" "Nao Informado" "Nao Informado" ...
## $ HOTELARIA : chr NA NA NA NA ...
## $ INCORPOR_A: chr "LM Incorporacoes e Empreendimentos Ltda" "MRV Engenharia e Participacoes S" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
## $ INCORPOR_B: chr NA NA NA NA ...
## $ INCORPOR_C: chr NA NA NA NA ...
## $ INCOPORADO: chr NA NA NA NA ...
## $ INCORPOR_D: chr NA NA NA NA ...
## $ INCORPOR_E: chr NA NA NA NA ...
## $ VENDEDOR_A: chr "Formanova" "Habitcasa" "Drive Imoveis" "Atam Empreendimentos" ...
## $ VENDEDORA2: chr "Proin" NA NA NA ...
## $ VENDEDORA3: chr "G3I Geracao de Inv e Incorp Imob Ltda" NA NA NA ...
## $ VENDEDORA4: chr "3\xbaMI" NA NA NA ...
## $ VENDEDORA5: chr "Especonc" NA NA NA ...
## $ VENDEDORA6: chr "Ellit" NA NA NA ...
## $ CONSTRUT_A: chr "Consil Engenharia S" "MRV Engenharia e Participacoes S" "Metacons Construtora e Incorporadora" "HSF Incorporacao e Construcao Ltda" ...
## $ CONSTRUT_B: chr NA NA NA NA ...
## $ CONSTRUT_C: chr NA NA NA NA ...
## $ CONSTRUT_D: chr NA NA NA NA ...
## $ COOPERATIV: chr "0" "0" "0" "0" ...
## $ HOTEL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLAT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ EXFLAT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AP2010 : num 3.51e+12 3.51e+12 3.51e+12 3.51e+12 3.51e+12 ...
## $ SC_2010 : num 3.51e+14 3.51e+14 3.51e+14 3.51e+14 3.51e+14 ...
## $ RENRESP91 : chr "112291,99" "112291,99" "112291,99" "112291,99" ...
## $ RENRESP00 : chr "570,95" "570,95" "570,95" "570,95" ...
## $ RENRESP10 : chr "954,75" "954,75" "954,75" "954,75" ...
## $ PCMEDAU91 : chr NA NA NA NA ...
## $ PCMEDAU00 : chr NA NA NA NA ...
## $ PCMEDAU10 : chr NA "2470,97" NA "2566,94" ...
## $ lat : num -23.7 -23.7 -23.7 -23.7 -23.7 ...
## $ long : num -47 -47 -47 -47.1 -47.1 ...
#data have 16935 lines
#map of missing data
missmap(data2)
There are many missing values, but only for information that we don't use, for example the seller and the construction firm. So this is not a problem.
#explore year variable
ano=data2$ANO_LAN
summary(ano)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1985 1995 2003 2002 2009 2013
#ano goes from 1985 till 2013
ano %>% class #integer
## [1] "integer"
sum(is.na(ano)) #there are no missing values for ano
## [1] 0
table(ano)
## ano
## 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
## 367 745 276 342 452 287 216 212 432 533 578 570 548 402 390 558
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 521 608 684 659 617 656 867 874 765 982 992 758 1044
Let's make a time series of the number of launches per year.
ano_which = names(table(ano))
ano_ts = as.numeric(table(ano))
plot(ano_which,ano_ts, type='l')
length(ano_ts)
## [1] 29
The dataset has 29 years of data. More recent years have more real estate launches.
Idea: Make spatiotemporal models. Make predictions.
The following variable descriptions are summarized from the paper Albuquerque_2016_preco_imovel
("/home/william/Dropbox/working/projects_git/sp_real_estate/Albuquerque_2016_preco_imovel.pdf").
DORM_UNID: number of bedrooms per unit.
BANH_UNID: number of bathrooms per unit.
GAR_UNID: number of parking spaces per unit.
ANDARES: number of floors in the development.
ELEV: number of elevators per launch.
COB: number of penthouse units.
BLOCOS: number of blocks in the development.
UNIDAND: number of units per floor.
AR_UT_UNID: usable area of the unit in m2.
PC_TT_UN: dependent variable, sale price of the unit.
#change data2
data3 = data2
#transform character into numeric
list_variables = c("AR_UT_UNID", "PC_TT_UN_U","PC_TT_ATU", "PC_TT_UN")
variable = "PC_TT_UN"
for (variable in list_variables) {
fl1 = data3[,variable]
fl2 = as.numeric(gsub(",",".",fl1))
data3[variable] = fl2
}
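The loop replaces the Brazilian decimal comma with a dot before coercing to numeric. A compact alternative with the same effect:
#vectorized alternative over the listed columns
data3[list_variables] = lapply(data3[list_variables], function(x) as.numeric(gsub(",", ".", x)))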
AR_UT_UNID is the usable area of the unit. PC_TT_UN_U, PC_TT_ATU, and PC_TT_UN are all price variables. We summarize each of them to see what they contain.
data3$AR_UT_UNID %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.72 57.68 74.00 103.55 120.00 1975.00
data3$PC_TT_ATU %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57202 239701 384850 656059 704158 31843099
#note how the median is smaller than the mean
#which makes sense because the distribution of prices is asymmetric
data3$PC_TT_UN %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.210e+05 2.661e+05 2.651e+07 7.342e+05 2.112e+09
data3$PC_TT_UN_U %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7803 59475 103459 176922 195511 6377875
#we'll use PC_TT_ATU
hist(data3$PC_TT_ATU, breaks=1000)
#there are some extreme values
#the maximum value is 31,843,099
vv2 = data3$PC_TT_ATU
vv3 = vv2[vv2<2000000] #take out some outliers
#see distribution of only houses less than 2 000 000 reais
hist(vv3, breaks=1000)
#tree-based methods have no problem with non-Gaussian variables
#take some variables only
data3_1 = data3[,c(list_variables,"ANO_LAN") ]
#time series graphs for MEAN
time_data3 = aggregate(data3_1, by=list(data3$ANO_LAN), FUN=mean, na.rm=TRUE)
#time series for PC_TT_ATU
plot(time_data3$ANO_LAN,time_data3$PC_TT_ATU,type='l')
#time series for PC_TT_UN
plot(time_data3$ANO_LAN,time_data3$PC_TT_UN,type='l')
#time series for PC_TT_UN_U
plot(time_data3$ANO_LAN,time_data3$PC_TT_UN_U,type='l')
#PC_TT_ATU must be the updated (inflation-adjusted) price
#make graphs for MEDIAN
#time series analysis for median
time_data3_median = aggregate(data3_1, by=list(data3$ANO_LAN),
FUN=median, na.rm=TRUE)
#time series for PC_TT_ATU
plot(time_data3_median$ANO_LAN,time_data3_median$PC_TT_ATU,type='l')
#time series for PC_TT_UN
plot(time_data3_median$ANO_LAN,time_data3_median$PC_TT_UN,type='l')
#[1] "AR_UT_UNID" "PC_TT_UN_U" "PC_TT_ATU" "PC_TT_UN"
#time series for PC_TT_UN_U
plot(time_data3_median$ANO_LAN,time_data3_median$PC_TT_UN_U,type='l')
#make graph of PC_TT_ATU with mean and median side by side
plot(time_data3_median$ANO_LAN, time_data3_median$PC_TT_ATU,type='l')
lines(time_data3$ANO_LAN, time_data3$PC_TT_ATU,type='l', col='dark red',)
The black line is the median; the dark red line is the mean. The median hovers below the mean: the distribution is asymmetric.
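A toy illustration of how skew pulls the mean above the median:
#one large value moves the mean a lot but the median very little
x = c(1, 2, 3, 100)
mean(x) #26.5
median(x) #2.5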
#see how many municipalities are there in the base
data3$MUNICIPIO %>% table %>% kable
Municipality | Freq |
---|---|
ARUJA | 11 |
BARUERI | 276 |
CAIEIRAS | 1 |
CAJAMAR | 39 |
CARAPICUIBA | 52 |
COTIA | 226 |
DIADEMA | 105 |
EMBU | 13 |
FERRAZ DE VASCON | 28 |
FRANCO DA ROCHA | 2 |
GUARAREMA | 2 |
GUARULHOS | 490 |
ITAPECERICA DA S | 4 |
ITAPEVI | 7 |
ITAQUAQUECETUBA | 31 |
JANDIRA | 25 |
MAIRIPORA | 1 |
MAUA | 107 |
MOGI DAS CRUZES | 145 |
OSASCO | 239 |
POA | 33 |
SANTANA DE PARNA | 49 |
SANTO ANDRE | 539 |
SAO BERNARDO DO | 691 |
SAO CAETANO DO S | 413 |
SAO PAULO | 13248 |
SUZANO | 60 |
TABOAO DA SERRA | 90 |
VARGEM GRANDE PA | 8 |
This table shows all launches across all 29 years of data.
data3$MUNICIPIO %>% table %>% length
## [1] 29
Only 29 municipalities appear in the dataset, which covers the metropolitan region of São Paulo.
Another place where you can find the data and more: https://properatidata.carto.com/tables/lanres_85_13_rmsp_cem/public/map
#shrink data -------------------------------
#this link: http://adv-r.had.co.nz/Subsetting.html
#have very interesting material
#
#we'll only use the last three years of data so that we can fit it into memory
#we'll use years 2011, 2012 and 2013
pos1 = (ano==2011|ano==2012|ano==2013)
#pos1 = ano %in% c(2011,2012,2013)
#create data31
data31 = data3[pos1,]
data31 %>% dim
## [1] 2794 87
#only 2794 observations, which is feasible without crashing the computer
We use only the last three years of data because the full dataset is too large: my machine's memory cannot hold all the objects created at once, and a smaller dataset is also faster to estimate. We use the years 2011, 2012, and 2013.
We fit a cross-sectional boosting model. Our objective is to predict prices.
#take only variables that we'll use
data4 = data31[,c(
"DORM_UNID", #DORM_UNID: number of bedrooms per unit
"BANH_UNID", #BANH_UNID: number of bathrooms per unit
"GAR_UNID", #GAR_UNID: number of parking spaces per unit
"ELEV", #ELEV: number of elevators
"BLOCOS", #BLOCOS: number of blocks in the development
"UNIDAND", #UNIDAND: number of units per floor
"ANDARES", #ANDARES: number of floors in the building
"AR_UT_UNID",#AR_UT_UNID: usable area of the unit
"PC_TT_ATU", #PC_TT_ATU: price of the unit
"ANO_LAN", #ANO_LAN: year of launch
"long", #long: longitude
"lat")] #lat: latitude
data4 %>% str
## 'data.frame': 2794 obs. of 12 variables:
## $ DORM_UNID : int 2 2 2 3 3 3 2 2 2 2 ...
## $ BANH_UNID : int 1 1 1 2 2 2 1 1 1 1 ...
## $ GAR_UNID : int 1 1 2 3 2 2 2 1 1 1 ...
## $ ELEV : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BLOCOS : int 7 1 1 0 1 1 1 7 1 1 ...
## $ UNIDAND : int 7 0 0 0 0 0 0 4 0 0 ...
## $ ANDARES : int 4 2 2 2 2 2 2 4 1 2 ...
## $ AR_UT_UNID: num 44 48 61.4 76.8 86.9 ...
## $ PC_TT_ATU : num 129558 137984 146684 185729 282754 ...
## $ ANO_LAN : int 2013 2013 2012 2012 2012 2012 2012 2012 2012 2013 ...
## $ long : num -47 -47 -47.1 -47.1 -47 ...
## $ lat : num -23.7 -23.7 -23.7 -23.7 -23.6 ...
All variables are numeric.
#apply coordinates into data frame to create spatial points
#create spreal2
spreal2 = data4
coordinates(spreal2) = ~long+lat
spreal2@proj4string = spreal@proj4string
spreal2 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
plot(spreal2)
These are the observation points for the smaller sample of the last three years.
Our models will use only these points: 2794 observations.
spreal2@data[1:4, ] %>% kable
 | DORM_UNID | BANH_UNID | GAR_UNID | ELEV | BLOCOS | UNIDAND | ANDARES | AR_UT_UNID | PC_TT_ATU | ANO_LAN |
---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 1 | 1 | 0 | 7 | 7 | 4 | 44.00 | 129558.0 | 2013 |
3 | 2 | 1 | 1 | 0 | 1 | 0 | 2 | 48.00 | 137983.7 | 2013 |
4 | 2 | 1 | 2 | 0 | 1 | 0 | 2 | 61.44 | 146683.6 | 2012 |
5 | 3 | 2 | 3 | 0 | 0 | 0 | 2 | 76.80 | 185728.9 | 2012 |
#to save df in disk
#saveRDS(spreal2, "spreal2.RData")
#all points in our dataset
#load municipality borders -------------------------
#we use the municipality borders only to make the maps;
#the models do not use the municipality polygons
munic = readOGR("~/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer="municip07")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/william/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer: "municip07"
## with 5561 features
## It has 3 fields
#keep only the municipalities that are in the state of sao paulo
state = substr(munic$CODIGOIB1,1,2)
munic=munic[state=='35',] #IBGE code 35 is the state of Sao Paulo
plot(munic)
Those are the municipalities in the state of São Paulo.
#transform cod ibge from munic into character
munic$CODIGOIB1 = as.character(munic$CODIGOIB1)
Municipalities in the metropolitan region:
North: Caieiras, Cajamar, Francisco Morato, Franco da Rocha, and Mairipora.
East: Aruja, Biritiba Mirim, Ferraz de Vasconcelos, Guararema, Guarulhos, Itaquaquecetuba, Mogi das Cruzes, Poa, Salesopolis, Santa Isabel, and Suzano.
Southeast: Diadema, Maua, Ribeirao Pires, Rio Grande da Serra, Santo Andre, Sao Bernardo do Campo, and Sao Caetano do Sul.
Southwest: Cotia, Embu, Embu Guacu, Itapecerica da Serra, Juquitiba, Sao Lourenco da Serra, Taboao da Serra, and Vargem Grande Paulista.
West: Barueri, Carapicuiba, Itapevi, Jandira, Osasco, Pirapora do Bom Jesus, and Santana de Parnaiba.
#I collected these IBGE codes by hand, but there must be a better way to get them
list_metropolitan = c("350900","350920","351630","351640","352850",
"350390","350660","351570","351830","351880",
"352310","353060","353980","354500","354680",
"355250","351380","352940","354330","354410",
"354780","354870","354880","351300","351500",
"351510","352220","352620","354995","355280",
"355645","350570","351060","352250","352500",
"353440","353910","354730","355030")
length(list_metropolitan)
## [1] 39
There are 39 municipalities in the metropolitan region of São Paulo.
# a test to find municipality in list
sum(munic@data$CODIGOIB1 == "355030")
## [1] 1
#select only the municipalities of the metropolitan region
match_places = match(list_metropolitan, munic@data$CODIGOIB1)
#select wanted municipalities in munic dataset
munic2 = munic[match_places,]
Let vec1 and vec2 be two vectors; match(vec1, vec2) returns, for each element of vec1, the position of its first match in vec2. Here we want, out of all municipalities of the state of São Paulo, only those whose codes appear in the vector list_metropolitan.
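A toy illustration of match() with hypothetical values:
#for each element of the first vector, the position of its first match in the second
match(c("b", "d"), c("a", "b", "c", "b")) #gives 2 NA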
#map of the metropolitan region of sao paulo
plot(munic2)
#get map from google
# ( west,south, east,north) #determined by hand on google maps
map = get_map(location =c(-47.2,-24.1,-45.6,-23.1),maptype = 'satellite')
plot(map)
This is a satellite map from Google of the metropolitan region.
#procedures for ggplot
posReal = data.frame(lat=coordinates(spreal2)[,2],
lon=coordinates(spreal2)[,1])
munic_fortify = fortify(munic2, region="CODIGOIB1")
(sp_real_state =
ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=munic_fortify, color='blue', alpha=0,
aes(y=lat, x=long, group=group, fill='grey'))+
geom_point(data=posReal, aes(x=lon, y=lat),size=.5, color='red', alpha=.5))
ggsave("images/sp_real_state_last2years.jpg", width = 20, height = 20, units = "cm",dpi = 600)
Red points are the observations in the smaller sample. Blue borders are the municipalities of the metropolitan region.
#plot raster map --------------------------------------
#create polygon to build raster
spdf = spreal2 #spdf is a working copy of the spatial points object
proj4string(munic2) #no projection
## [1] NA
pol = as(munic2,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 10000) #10 000 little squares
gridded(grid) = TRUE
proj4string(grid) = proj4string(spdf) #give it the same projection as spreal2
#plot of the 10 000 little squares
plot(grid)
This grid has 10 000 squares. We use it to make an IDW interpolation and represent the data in raster format.
The object munic2 is a SpatialPolygons object with all municipalities of the metropolitan region.
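For reference, inverse distance weighting estimates the value at each grid cell $s_0$ as a distance-weighted average of the observed values; gstat's idw() uses the power $p = 2$ by default (we do not override it here):

$$\hat{z}(s_0) = \frac{\sum_i w_i \, z(s_i)}{\sum_i w_i}, \qquad w_i = \lVert s_i - s_0 \rVert^{-p}$$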
#we must use spreal2 which contains data about prices
v = spreal2$PC_TT_ATU
df1 = data.frame(v)
spdf = spreal2 #spdf is a working copy of the spatial points object
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid)
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "variable")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
tm_raster("variable", style = "quantile", #breaks = c(0,0.5,1,2,3,4,5,6),
n = 20, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.4, col = "black", lwd=.3)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title="Prices",
title.size=1,
title.position = c(0,.9))
tm1
tmap_save(tm = tm1,
filename = "name1.jpg",
width = 3, height = 3)
#some areas of the map are actually not informative because
#we don't have data there
Bluer regions have higher prices.
This map may be misleading, because in many places there are few observed points. Next we make the interpolation area smaller, so that we get a more accurate view of how prices are distributed on the map.
The step we change is where we create the grid:
pol = as(munic2, "SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 10000)
We replace munic2 with a different SpatialPolygons object, built below from buffers around the observation points.
#just an experiment with hexagons...
size = 0.03
hex_points = spsample(munic2, type = "hexagonal", cellsize = size)
hex_points %>% class
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
hex_grid = HexPoints2SpatialPolygons(hex_points, dx = size)
hex_grid %>% plot
#we could study the spatial point process of occurrences of launches
This is an experiment with hexagonal polygons.
Idea: map prices within polygons instead of a continuous raster-like map. The problem is that we would need information covering a wider area, not concentrated in just a few places.
#experimenting with the sf package ------------------------------
#I want to build a radius around each point of observation
#and build a spatial polygon from it
#exercise taken from answer:
#https://gis.stackexchange.com/a/292336
spreal2 %>% names
## [1] "DORM_UNID" "BANH_UNID" "GAR_UNID" "ELEV" "BLOCOS"
## [6] "UNIDAND" "ANDARES" "AR_UT_UNID" "PC_TT_ATU" "ANO_LAN"
matcor = coordinates(spreal2)
points2 = st_as_sf(spreal2,coords=c("long","lat"))
points2 %>% class
## [1] "sf" "data.frame"
points2 %>% plot
Here I'm experimenting with the sf package. This plot shows the points colored by the intensity of each variable, all of which are numeric or integer.
#make polygon union all municipalities in metropolitan region
unionMunic = gUnaryUnion(munic2)
unionMunic %>% class
## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"
unionMunic %>% plot
The gUnaryUnion() function takes a SpatialPolygons object as input and outputs just the outer border, dissolving the internal boundaries.
#Convert sp to sf:
sPol1 = st_as_sf(unionMunic)
sPol1 %>% class
## [1] "sf" "data.frame"
sPol1 %>% plot
The plot of an sf polygon object looks the same as that of a SpatialPolygons object.
#points2 has a coordinate system, take it to the sPol1:
st_crs(sPol1)
## Coordinate Reference System: NA
st_crs(points2)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +datum=WGS84 +no_defs"
st_crs(sPol1) = st_crs(points2)
Here we assign a coordinate reference system to the object sPol1. Note that sPol1 does not have one, so we copy the CRS from points2.
#Transform to a metric coordinate system:
#note: UTM zone 42N lies far from Sao Paulo (which falls in zone 23S),
#which likely explains why the plots below come out distorted and upside down
points2Km = st_transform(points2, "+proj=utm +zone=42N +datum=WGS84 +units=km")
sPol1Km = st_transform(sPol1, "+proj=utm +zone=42N +datum=WGS84 +units=km")
Here we transform the projection from lat-long to UTM. UTM coordinates are measured in meters (here kilometers, via +units=km) instead of degrees.
points2Km %>% plot
sPol1Km %>% plot
#the plots look strange, they are upside down
When we change the projection the plots look strange: they are upside down. In this UTM projection the map is not informative for visualization, so for maps we return to a long-lat projection.
#Create 10km buffer zone:
points2_buffer = st_buffer(points2Km, 10)
points2_buffer %>% class
## [1] "sf" "data.frame"
Here we make a buffer zone around each observation point, with a 10 km radius. From this buffer zone we'll build the IDW raster map.
#find intersection between municipality polygons and bufferzone
points2_sPol1 = st_intersection(points2_buffer, sPol1Km)
points2_sPol1 %>% class
## [1] "sf" "data.frame"
st_intersection() intersects two polygon objects. We made a buffer zone, but we also want the resulting polygon to stay inside the boundary of the metropolitan region: points2_buffer is the 10 km buffer zone, and sPol1Km is the metropolitan region polygon.
#https://gis.stackexchange.com/questions/239118/r-convert-sf-object-back-to-spatialpolygonsdataframe
#https://gis.stackexchange.com/a/239215
#transform sf object to spatial polygons object
buffsp = as(points2_sPol1, 'Spatial')
buffsp %>% class
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
Here we convert the sf object back to an sp SpatialPolygonsDataFrame object.
buffsp %>% plot
This is the result of building the buffers and clipping them where they extend beyond the metropolitan region boundary. Note in the southeastern part how the circles are cut.
#make union of spatial polygons
unionSp = gUnaryUnion(buffsp)
unionSp %>% plot
This is the result when we dissolve the internal boundaries.
Remember that the object is still in UTM, so it appears upside down.
#change projection
#https://stackoverflow.com/a/30018607/12552718
#https://stackoverflow.com/questions/30018098/how-to-convert-utm-coordinates-to-lat-and-long-in-r
#sputm = SpatialPoints(randompoints, proj4string=CRS("+proj=utm +zone=42 +datum=WGS84"))
unionSp = spTransform(unionSp, CRS("+proj=longlat +datum=WGS84"))
Here we change the projection from UTM back to long-lat, which puts the map in the right orientation.
unionSp %>% plot(col='grey')
spreal2 %>% plot(add=T, col='red')
munic2 %>% plot(add=T)
#we'll use unionSp to make the IDW interpolation
#and the visualization of prices and other variables
Red crosses are the observation points. The gray area is the buffer zone. Black lines are the municipal borders of the metropolitan region.
#plot all polygons and spatial points together
munic_fortify = fortify(munic2, region="CODIGOIB1")
unionFort = fortify(unionSp, region="1")
posReal = data.frame(lat=coordinates(spreal2)[,2],
lon=coordinates(spreal2)[,1])
(mapPol =
ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=munic_fortify, color='blue', alpha=0,
aes(y=lat, x=long, group=group, fill='grey'))+
geom_polygon(data=unionFort, color='black', alpha=0,
aes(y=lat, x=long, group=group, fill='grey'))+
geom_point(data=posReal, aes(x=lon, y=lat),size=1, color='red', alpha=.5))
ggsave("mapPol.jpg", width = 20, height = 20, units = "cm",dpi = 600)
The red points are the observations. Black lines are the border of the 10 km buffer area around the points. Blue lines are the municipal borders of the metropolitan region of São Paulo.
#build raster map with new frontier -----------------------
#create polygon to build raster
spdf = spreal2 #spdf is a working copy of the spatial points object
pol = as(unionSp,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 30000)
gridded(grid) = TRUE
proj4string(grid) = proj4string(spdf) #give it the same projection as spreal2
#plot of the 30 000 little squares
plot(grid)
Here we make a 30 000-square grid for the IDW interpolation.
#we must use spreal2 which contains data about prices
v = spreal2$PC_TT_ATU
df1 = data.frame(v)
spdf = spreal2 #spdf is a working copy of the spatial points object
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid)
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "variable")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
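This is the same IDW-to-raster pipeline used above (and used once more below for the OLS residuals), so a small helper could avoid the repetition; a sketch, using only the gstat, sp, and raster functions already loaded:
#sketch of a helper wrapping the repeated IDW-to-raster steps
idw_raster = function(values, points_sp, grid) {
  df = data.frame(var1 = values)
  coordinates(df) = coordinates(points_sp)
  proj4string(df) = proj4string(points_sp)
  out = as.data.frame(idw(var1 ~ 1, df, grid))
  names(out)[1:3] = c("long", "lat", "variable")
  coordinates(out) = ~ long + lat
  gridded(out) = TRUE
  r = raster(out)
  projection(r) = projection(points_sp)
  r
}
#e.g. raster_idw = idw_raster(spreal2$PC_TT_ATU, spreal2, grid)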
# map with munic borders
tm2 = tm_shape(raster_idw) +
tm_raster("variable", style = "quantile",
n = 10, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.4, col = "black", lwd=.6)+
tm_layout(legend.text.size = .5,
legend.position = c("right","bottom"),
title="Real Estate Prices",
title.size=1,
title.position = c(0.01,.9))
tm2
This map shows a more faithful picture of how prices are distributed. In the center of São Paulo prices are higher, and the IDW interpolation spreads those values outward. Note the circular blob in the north: the price is low at its center and rises toward its border, which doesn't make sense. It is an artifact of the IDW interpolation pulling in the higher prices from the center of the city.
# apply a simple linear model to the real estate data --------------
data4 = spreal2@data
data4 %>% names %>% kable
x |
---|
DORM_UNID |
BANH_UNID |
GAR_UNID |
ELEV |
BLOCOS |
UNIDAND |
ANDARES |
AR_UT_UNID |
PC_TT_ATU |
ANO_LAN |
prices = data4$PC_TT_ATU
hist(prices, breaks=200)
#prices are highly asymmetric, so we take the log
pricesl = log(prices)
hist(pricesl, breaks=200)
#now it has a more Gaussian-like shape
#create data41 to substitute log prices
data41 = data4
data41$PC_TT_ATU = pricesl
#standardize data, use data41
data5 = as.data.frame(scale(data41))
We need to standardize because otherwise the inla function has convergence problems.
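For reference, scale() standardizes each column to mean 0 and standard deviation 1:
#scale() computes (x - mean(x)) / sd(x) for each column
x = c(1, 2, 3)
as.numeric(scale(x)) #gives -1 0 1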
nrow1 = nrow(data5)
set.seed(1)
train = sample(1:nrow1,nrow1/1.3)
train %>% length
## [1] 2149
train %>% length/nrow1 #proportion of train set
## [1] 0.7691482
#create train and test datasets
trainX = data5[train,] #trainX keeps PC_TT_ATU, which lm() below uses as the response
trainY = data5$PC_TT_ATU[train]
testX = data5[-train,] #test rows are the ones not sampled into train
testY = data5$PC_TT_ATU[-train]
mOls = lm(PC_TT_ATU ~., data=trainX)
summary(mOls)
##
## Call:
## lm(formula = PC_TT_ATU ~ ., data = trainX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4559 -0.3542 -0.0175 0.3182 2.0433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0006293 0.0111588 0.056 0.955
## DORM_UNID -0.2063837 0.0180898 -11.409 < 2e-16 ***
## BANH_UNID 0.2661923 0.0211269 12.600 < 2e-16 ***
## GAR_UNID 0.3242184 0.0207998 15.588 < 2e-16 ***
## ELEV -0.0161026 0.0129381 -1.245 0.213
## BLOCOS -0.1093757 0.0112653 -9.709 < 2e-16 ***
## UNIDAND 0.0918194 0.0145181 6.324 3.08e-10 ***
## ANDARES 0.1958803 0.0136529 14.347 < 2e-16 ***
## AR_UT_UNID 0.4264263 0.0200034 21.318 < 2e-16 ***
## ANO_LAN 0.0469628 0.0112192 4.186 2.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5169 on 2139 degrees of freedom
## Multiple R-squared: 0.7417, Adjusted R-squared: 0.7406
## F-statistic: 682.4 on 9 and 2139 DF, p-value: < 2.2e-16
DORM_UNID has a negative impact on price, and BANH_UNID has a positive impact.
Although the prices are in present value, the year of launch ANO_LAN has a positive impact on price.
The number of parking spaces GAR_UNID has a positive and significant impact on price.
AR_UT_UNID has a positive impact on prices.
#take the residuals of model ols
res1 = residuals(mOls)
#create spatial object for training
sptrain = spreal2[train,]
knn1 = knearneigh(sptrain, k=5, longlat = TRUE)
nb1 = knn2nb(knn1)
listw1 = nb2listw(nb1)
moran.test(res1, listw1)
##
## Moran I test under randomisation
##
## data: res1
## weights: listw1
##
## Moran I statistic standard deviate = 45.218, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.5683299088 -0.0004655493 0.0001582306
#there is spatial correlation in the residuals of the simple ols
Note that there is spatial correlation in the residuals: there is still structure left to model. We explore this structure in the next model.
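For reference, Moran's I for values $x_i$ with spatial weights $w_{ij}$ is

$$I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}.$$

Its expectation under no spatial autocorrelation is $-1/(n-1)$, which matches the Expectation reported above ($-1/2148 \approx -0.00047$ for our 2149 training points); the observed value of about 0.57 is far above it.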
yhr = predict(mOls,trainX)
resOls = yhr-trainY
mean((resOls)^2)
## [1] 0.2659673
#this is the train MSE
yh = predict(mOls,testX)
mean((yh-testY)^2)
## [1] 1.670049
#this is the test MSE
orig1 = sptrain$PC_TT_ATU
v = (resOls)*sd(orig1)#make the value in reais
df1 = data.frame(v)
spdf = spreal2 #spdf is a working copy of the spatial points object
coordinates(df1) = coordinates(sptrain)
proj4string(df1) = proj4string(sptrain)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid)
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "variable")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm_resOLS = tm_shape(raster_idw) +
tm_raster("variable", style = "quantile",
n = 10, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.4, col = "black", lwd=.6)+
tm_layout(legend.text.size = .5,
legend.position = c("right","bottom"),
title="Residuals OLS",
title.size=1,
title.position = c(0.01,.9))
tm_resOLS
Map of the OLS residuals.
There are clear clusters of values, which explains the presence of spatial autocorrelation.
The inla package uses (integrated nested) Laplace approximations to perform Bayesian inference.
#spatial model ----------------------
#try doing re-scale when convergence problems occur
data6 = data5
data6$ID = 1:dim(data6)[1]
spatial_data = spreal2
spatial_data@data = data6
spatial_data$PC_TT_ATU %>% hist(breaks=100) #it is gaussian like
knn1 = knearneigh(spatial_data, k=5, longlat = TRUE) #use knn, 5 nearest neighbors
nb1 = knn2nb(knn1)
nb2INLA("nc.adj", nb1) #this writes a graph file called nc.adj, to be used by a spatial model
#inla model without spatial error random effects
m1_01 = inla(PC_TT_ATU ~ DORM_UNID+BANH_UNID+GAR_UNID+ELEV+BLOCOS+UNIDAND+ANDARES+AR_UT_UNID+ANO_LAN, family="gaussian", data=data6, control.predictor=list(compute=TRUE), verbose=FALSE)
summary(m1_01)
##
## Call:
## c("inla(formula = PC_TT_ATU ~ DORM_UNID + BANH_UNID + GAR_UNID + ", "
## ELEV + BLOCOS + UNIDAND + ANDARES + AR_UT_UNID + ANO_LAN, ", " family =
## \"gaussian\", data = data6, verbose = FALSE, control.predictor =
## list(compute = TRUE))" )
## Time used:
## Pre = 1.26, Running = 2.99, Post = 0.353, Total = 4.6
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 0.000 0.010 -0.019 0.000 0.019 0.000 0
## DORM_UNID -0.194 0.016 -0.225 -0.194 -0.163 -0.194 0
## BANH_UNID 0.252 0.018 0.216 0.252 0.288 0.252 0
## GAR_UNID 0.309 0.018 0.273 0.309 0.344 0.309 0
## ELEV -0.014 0.012 -0.037 -0.014 0.008 -0.014 0
## BLOCOS -0.118 0.011 -0.139 -0.118 -0.097 -0.118 0
## UNIDAND 0.086 0.012 0.063 0.086 0.110 0.086 0
## ANDARES 0.194 0.012 0.171 0.194 0.217 0.194 0
## AR_UT_UNID 0.448 0.018 0.412 0.448 0.485 0.448 0
## ANO_LAN 0.046 0.010 0.027 0.046 0.066 0.046 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 3.75 0.099 3.55 3.74
## 0.975quant mode
## Precision for the Gaussian observations 3.94 3.74
##
## Expected number of effective parameters(stdev): 10.06(0.002)
## Number of equivalent replicates : 277.62
##
## Marginal log-Likelihood: -2201.94
## Posterior marginals for the linear predictor and
## the fitted values are computed
INLA performs Bayesian estimation, so there is a credible interval for the distribution of each parameter. This model is essentially the same as the OLS, and you can see that the results are almost identical. In Bayesian estimation the prior can have an important impact on the estimates.
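The nc.adj graph file created above is meant for a spatial version of this model. A minimal sketch of adding a Besag spatial random effect, assuming data6$ID indexes the observations as created above (not run here):
#sketch: the same model plus a Besag spatial random effect over the nc.adj graph
m1_02 = inla(PC_TT_ATU ~ DORM_UNID+BANH_UNID+GAR_UNID+ELEV+BLOCOS+UNIDAND+ANDARES+AR_UT_UNID+ANO_LAN + f(ID, model="besag", graph="nc.adj"), family="gaussian", data=data6, control.predictor=list(compute=TRUE))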
#boosting --------------------------------------------------------------
#create data12
data12 = spreal2@data
data12 %>% names
## [1] "DORM_UNID" "BANH_UNID" "GAR_UNID" "ELEV" "BLOCOS"
## [6] "UNIDAND" "ANDARES" "AR_UT_UNID" "PC_TT_ATU" "ANO_LAN"
#create data2 another model with spatial information
data2 = data12
data2$long = coordinates(spreal2)[,1]
data2$lat = coordinates(spreal2)[,2]
data2 %>% names #include long and lat to the model
## [1] "DORM_UNID" "BANH_UNID" "GAR_UNID" "ELEV" "BLOCOS"
## [6] "UNIDAND" "ANDARES" "AR_UT_UNID" "PC_TT_ATU" "ANO_LAN"
## [11] "long" "lat"
#model with data12, without long lat
nrow1 = nrow(data12)
set.seed(1)
train = sample(1:nrow1,nrow1/1.3)
train %>% length
## [1] 2149
trainX = data12[train,]
trainY = data12$PC_TT_ATU[train]
testX = data12[-train,]
testY = data12$PC_TT_ATU[-train]
trainSpreal = spreal2[train,]
m1 = gbm(PC_TT_ATU~., data=trainX, distribution='gaussian',
n.trees = 1000, interaction.depth = 4)
m1
## gbm(formula = PC_TT_ATU ~ ., distribution = "gaussian", data = trainX,
## n.trees = 1000, interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 9 predictors of which 9 had non-zero influence.
m1 %>% summary
## var rel.inf
## AR_UT_UNID AR_UT_UNID 81.754502
## GAR_UNID GAR_UNID 6.730623
## ANDARES ANDARES 3.742729
## UNIDAND UNIDAND 2.495743
## ELEV ELEV 1.574726
## DORM_UNID DORM_UNID 1.244310
## ANO_LAN ANO_LAN 1.094080
## BANH_UNID BANH_UNID 1.014951
## BLOCOS BLOCOS 0.348338
#export values to disk to be used in another script
saveRDS(munic, file = paste0(pathExport,"/munic.rds"))
saveRDS(m1, file = paste0(pathExport,"/modelBoosting.rds"))
saveRDS(trainX, file = paste0(pathExport,"/trainX.rds"))
saveRDS(trainY, file = paste0(pathExport,"/trainY.rds"))
saveRDS(testX, file = paste0(pathExport,"/testX.rds"))
saveRDS(testY, file = paste0(pathExport,"/testY.rds"))
saveRDS(trainSpreal, file = paste0(pathExport,"/trainSpreal.rds"))
# Restore the object
#readRDS(file = "my_data.rds")
par(mfrow=c(1,2))
plot(m1,i='AR_UT_UNID')
plot(m1,i='GAR_UNID')
#train set MSE
yh1 = predict(m1,trainX,n.trees=1000)
mean((yh1-trainY)^2)
## [1] 30646177516
#test set MSE
yh = predict(m1,testX,n.trees=1000)
mean((yh-testY)^2)
## [1] 43306155745
#model with data2, with long lat
dataC = data2 #dataC is a working copy of the data, just to hold the values
nrow1 = nrow(dataC)
set.seed(1)
train = sample(1:nrow1,nrow1/1.3)
train %>% length
## [1] 2149
trainX = dataC[train,]
trainY = dataC$PC_TT_ATU[train]
testX = dataC[-train,]
testY = dataC$PC_TT_ATU[-train]
m1 = gbm(PC_TT_ATU~., data=trainX, distribution='gaussian',
n.trees = 1000, interaction.depth = 4)
m1 %>% summary
## var rel.inf
## AR_UT_UNID AR_UT_UNID 80.4970312
## GAR_UNID GAR_UNID 5.4575780
## long long 4.8621959
## lat lat 4.4585578
## UNIDAND UNIDAND 1.5815730
## ANDARES ANDARES 1.3118184
## BANH_UNID BANH_UNID 0.6119743
## ELEV ELEV 0.4719243
## ANO_LAN ANO_LAN 0.3535569
## DORM_UNID DORM_UNID 0.2617212
## BLOCOS BLOCOS 0.1320689
By far, AR_UT_UNID is the most important variable for explaining prices. Longitude and latitude are also near the top, although not as important as the area of the unit. This shows that location is one of the most important variables explaining real estate prices.
plot(m1,i='AR_UT_UNID')
The area of the unit AR_UT_UNID has an increasing, positive impact on prices.
plot(m1,i='GAR_UNID')
plot(m1,i='long')
plot(m1,i='lat')
Longitude and latitude show that prices are higher in the center of the map. This makes sense because it is the center of São Paulo.
Idea: we should see how long and lat behave jointly in a two-dimensional setting and plot the result over the map, as sketched below.
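A minimal sketch of that joint plot, assuming gbm's plot method accepts two variables in i.var (it supports up to three):
#joint partial dependence of price on longitude and latitude
plot(m1, i.var = c("long", "lat"), n.trees = 1000)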
plot(m1,i='BANH_UNID')
The number of bathrooms has a steadily increasing effect on prices and then drops; maybe units with more than 3 bathrooms are another type of housing.
plot(m1,i='DORM_UNID')
Note how DORM_UNID has a negative impact on prices.
#train set MSE
yh1 = predict(m1,trainX,n.trees=1000)
mean((yh1-trainY)^2)
## [1] 13628927882
#test set MSE
yh = predict(m1,testX,n.trees=1000)
mean((yh-testY)^2)
## [1] 22697302611
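For scale, taking square roots converts these MSEs back to reais, roughly:
#approximate RMSE in reais for the model with coordinates
sqrt(13628927882) #train, about 116 743
sqrt(22697302611) #test, about 150 656
sqrt(43306155745) #test without long/lat, about 208 101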
#we should redo the model with price per square meter, a better measure for comparison
To do: make a map of the residuals from the boosting model and interpret the boosting results.