Written by:
Jingqi Chen (998841054)
Yili Wang (999316533)
KaWoon Tam(913389384)
WeiZhuo Xiong(914443957)
When it comes to the home-buying market, people usually need to consider many characteristics of the house itself, such as the number of bedrooms and bathrooms, lot size, garage, backyard, and so on. However, there are many more factors to think about beyond the house itself, for example public school ratings and the local crime situation. According to a national study conducted in 2013, public school ratings in a district can affect home values, leading to an average premium of $50 per square foot. Besides, the racial composition of the neighborhood also influences home buyers' decisions. Much of the time, school ratings and racial composition interact with each other and jointly affect real estate prices. As a 2006 report of the National Fair Housing Alliance put it, "it is evident from the investigation that schools have become a proxy for the racial or ethnic composition of neighborhoods".
Therefore, we conducted this project to research the relationship between public school ratings and racial composition and their influence on home prices. Moreover, we examine the relationship between crime occurrence and house prices. Finally, we use web scraping of housing news to predict the trend of house prices.
In terms of data collection, since we are undergraduates at UC Davis, we chose the California real estate market as our sample. In order to achieve our research goal, we obtained data from the following sources:
Dataset | Source | Description | Parameters of Interest |
---|---|---|---|
Home Price | Zillow Data | Monthly median home prices for California cities | Median Home Value Per Sq Ft ($) |
Public School Rating (API) | California Department of Education | API (Academic Performance Index) of California public schools in 2013 | API score; percentage of white students; percentage of Asian students; percentage of migrant students; class size; parents' education level |
Crime | FBI Uniform Crime Reporting | Crime incidents in California by city in 2013 | Different types of crime: Murder, Rape, Robbery, Aggravated assault, Property crime, Burglary, Larceny theft, Motor vehicle theft, Arson |
For our project, there are three main datasets: home prices, the public schools' average Academic Performance Index (API), and crime. We are only interested in the relationship between these datasets within California, and the 2013 public school dataset only covers California schools. Accordingly, using the merge method in pandas, we can join the different datasets on their shared city names. After that, we compute the average home price in 2013 for each city and use it in the analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde as kde
from matplotlib.colors import Normalize
from matplotlib import cm
from scipy.stats import gaussian_kde
price = pd.read_csv("Zip_MedianValuePerSqft_AllHomes.csv")
school = pd.read_csv("CA_schools.csv")
#merge price data and school
data=pd.merge(price,school,left_on='City',right_on='DNAME',how='inner')
data=data.groupby('City').mean()
#create dataframe have API in 2013 and home price in 2013
y=np.mean(data.iloc[:,208:220],axis=1)
APIPRICE=pd.DataFrame([data['API13'],y]).dropna(axis=1)
import scipy
from scipy import stats
from sklearn import linear_model
The Academic Performance Index (API) is a measure of a school's educational performance: a school with a higher API means its students earn better grades. First, we want to find out whether API and house price are correlated. Since the original scatter plot between price and API suggests a log-like relationship, we plot the log of both price and API, which shows that the two variables appear to be positively related. The next question is: if API is positively related to price, what factors affect API?
In the following plots, we use scatter plots to examine the relationship between API and other variables, and we use the size of the points to represent home price (larger points mean a more expensive home price).
#relation between price and API (both on a log scale)
y = np.log(APIPRICE.iloc[0,:])   # log(API)
x = np.log(APIPRICE.iloc[1,:])   # log(median home price per sq ft)
fit = np.polyfit(x, y, 1)
fit_fn = np.poly1d(fit)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=10, edgecolor='')
plt.plot(x, fit_fn(x), '--k')
plt.title('relation between price and API')
plt.xlabel('log(price)')
plt.ylabel('log(API)')
plt.show()
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x,y)
print('slope:{}\nintercept:{}\nr_value:{}\np_value:{}\nstd_err:{}'.format(slope, intercept, r_value, p_value, std_err))
The first factor we thought might be correlated with API is the student composition. Since most of us are international students, we first wanted to know whether the percentage of migrant students affects API, and the answer appears to be no: the scatter of API against the percentage of migrant students looks roughly like a normal distribution, which is fairly natural, so we cannot deduce a correlation from it.
y=np.sum(data[['PCT_AS','PCT_WH']],axis=1)
y1=data['PCT_HI']
ASIANWHITE=pd.DataFrame([data['API13'],y]).dropna(axis=1)
MIGRAGE=pd.DataFrame([data['API13'],data['P_MIGED']]).dropna(axis=1)
ASWHLA=pd.DataFrame([data['API13'],y1]).dropna(axis=1)
#Percent of migrate student and API
y = MIGRAGE.iloc[0,:]
x = MIGRAGE.iloc[1,:]
size=np.mean(data.iloc[:,208:220],axis=1)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=size/10, edgecolor='')
plt.title('relation between Percent of migrate student and API')
plt.xlabel('Percent of migrate student')
plt.ylabel('API')
plt.show()
In addition, it is quite likely that the percentages of different racial groups of students contribute differently to API, since we have often heard that SAT (college admissions in the United States) requirements differ for students from different racial groups.
Since Asian and white students show a similar pattern, with relatively low percentages but high API in each school, we group them together; public schools with a higher combined percentage of Asian and white students tend to have a higher API. On the other hand, the percentage of Latino students appears to be negatively correlated with API, which suggests that the racial composition of a public school is related to its API.
#Percent of asian+white student and API
y = ASIANWHITE.iloc[0,:]
x = ASIANWHITE.iloc[1,:]
size=np.mean(data.iloc[:,208:220],axis=1)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=size/10, edgecolor='')
plt.title('relation between Percent of asian+white student and API')
plt.xlabel('Percent of asian+white student')
plt.ylabel('API')
fit = np.polyfit(x,y,1)
fit_fn = np.poly1d(fit)
plt.plot( x, fit_fn(x), '--k')
plt.show()
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x,y)
print('slope:{}\nintercept:{}\nr_value:{}\np_value:{}\nstd_err:{}'.format(slope, intercept, r_value, p_value, std_err))
#percent of latino and api
y = ASWHLA.iloc[0,:]
x = ASWHLA.iloc[1,:]
size=np.mean(data.iloc[:,208:220],axis=1)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=size/10, edgecolor='')
plt.title('relation between percent of latino student and API')
plt.xlabel('percent of latino student')
plt.ylabel('API')
fit = np.polyfit(x,y,1)
fit_fn = np.poly1d(fit)
plt.plot( x, fit_fn(x), '--k')
plt.show()
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x,y)
print('slope:{}\nintercept:{}\nr_value:{}\np_value:{}\nstd_err:{}'.format(slope, intercept, r_value, p_value, std_err))
Then we want to know whether class size affects API, since, intuitively, a smaller class lets teachers understand each student better, which might contribute to students' academic success. However, our data cannot confirm this idea: most classes have a size close to 27, which means we do not have enough information at more extreme class sizes to analyze the relationship between the two.
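To see how little the class sizes actually vary, one quick check is a histogram of the same ACS_46 column used below; this is just an illustrative sketch, not part of the original analysis.
import matplotlib.pyplot as plt
# Distribution of average class size (ACS_46) across cities.
# If most values cluster near 27, there is little variation left to relate to API.
class_size = data['ACS_46'].dropna()
plt.hist(class_size, bins=30)
plt.title('distribution of average class size')
plt.xlabel('class size (ACS_46)')
plt.ylabel('number of cities')
plt.show()
print('standard deviation of class size: {:.2f}'.format(class_size.std()))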
CLASS=pd.DataFrame([data['API13'],data['ACS_46']]).dropna(axis=1)
#class size and api
y = CLASS.iloc[0,:]
x = CLASS.iloc[1,:]
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
size=np.mean(data.iloc[:,208:220],axis=1)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=size/10, edgecolor='')
plt.title('relation between class size and API')
plt.xlabel('class size')
plt.ylabel('API')
plt.show()
Quite likely, parents' education level has a strong correlation with API, since parents who have achieved academic success may know effective ways to pursue educational improvement, and their children can learn those methods from them and inherit some of their aptitude. Fortunately, our data support this hypothesis: when we compare the average parents' education level and API for each city's public schools in California, there appears to be a strong positive correlation.
EDU=pd.DataFrame([data['API13'],data['AVG_ED']]).dropna(axis=1)
#parents education level and api
y = EDU.iloc[0,:]
x = EDU.iloc[1,:]
size=np.mean(data.iloc[:,208:220],axis=1)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=size/10, edgecolor='')
plt.title('relation between parents education level and API')
plt.xlabel('parents education level')
plt.ylabel('API')
fit = np.polyfit(x,y,1)
fit_fn = np.poly1d(fit)
plt.plot( x, fit_fn(x), '--k')
plt.show()
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x,y)
print('slope:{}\nintercept:{}\nr_value:{}\np_value:{}\nstd_err:{}'.format(slope, intercept, r_value, p_value, std_err))
To obtain a general idea about California home prices, let's draw a bar plot and take a look.
First, we reorganized the data to generate a convenient data frame (api_homeprice) containing the city name, API, and 2013 home price.
data1=pd.merge(price,school,left_on='City',right_on='DNAME',how='inner')
data0=data1.groupby('City').mean()
y=np.mean(data0.iloc[:,208:220],axis=1)
APIPRICE=pd.DataFrame([data0['API13'],y]).dropna(axis=1)
api_price = APIPRICE.T
city_name = pd.DataFrame(list(api_price.index))
API = pd.DataFrame(list(api_price.iloc[:,0]))
House_price13 = pd.DataFrame(list(api_price.iloc[:,1]))
api_homeprice = pd.concat([city_name, API,House_price13], axis=1)
api_homeprice.columns = ['city_name', 'API','House_price13']
import plotly
plotly.tools.set_credentials_file(username='', api_key='') ## Need own username and apikey
To mark the exact geographical location of each city on a map, we use a shapefile (obtained from http://basemaptutorial.readthedocs.io/en/latest/shapefile.html) to get the longitude and latitude of each California city.
# Start
import geopandas as gpd
ca = gpd.read_file("/Users/chenjingqi/Dropbox/College/2017 Winter/STA 141B/Jupyter/Project/Map_shapefile/citiesx010g_shp_nt00962/citiesx010g.shp")
# Merge api_homeprice with cities' lat and lon
ca0 = ca[ca['STATE']=="CA"]
compreh_geo = pd.merge(ca0,api_homeprice,left_on='NAME',right_on='city_name',how='inner')
lat = pd.DataFrame(list(compreh_geo['LATITUDE']))
lon = pd.DataFrame(list(compreh_geo['LONGITUDE']))
prep_data = pd.concat([api_homeprice, lat,lon], axis=1)
prep_data.columns = ['city_name', 'API','House_price13','lat','lon']
prep_data = prep_data[np.isfinite(prep_data['lat'])] # drop rows where contains NA in lat or lon
# Rank by API
prep_data = prep_data.sort_values(by=['API'])
# Rank by Home Price
prep_data_rankbyhome = prep_data.sort_values(by=['House_price13'])
import plotly.plotly as py
from plotly.graph_objs import *
%matplotlib inline
from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 12)
bar_data = [Bar(x=prep_data_rankbyhome.city_name,
y=prep_data_rankbyhome.House_price13)]
py.iplot(bar_data, validate=False, filename='lala')
From the graph we can tell that the gaps between home prices are huge. (Interact with the plot to see the home price in ANY city!)
In order to get a general idea of the geographical distribution of public schools with different academic performance ratings and homes at different price levels, we draw a map to visualize the relationship between the two factors.
import plotly
plotly.tools.set_credentials_file(username='', api_key='') ## Need own username and apikey
# Reference: http://basemaptutorial.readthedocs.io/en/latest/shapefile.html
%matplotlib inline
from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 12)
For a better visualization, we divide the cities' average public school API scores into five categories: Worst (approx. lowest 12.5%), Poor (approx. next 12.5%), Medium (approx. 25%), Good (approx. 25%), and Excellent (approx. top 25%), each marked with a different color. On the plot, home price is represented by the size of the dots: larger dots imply a higher median home price per square foot.
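The code below implements this split by slicing fixed index ranges of the data frame after sorting by API; for reference, an equivalent quantile-based labeling could be sketched with pandas' qcut (this is only an illustration, not the code used for the map).
import pandas as pd
# A sketch of the same five-way split using quantiles instead of fixed index ranges.
# The breakpoints 0, 0.125, 0.25, 0.5, 0.75, 1.0 mirror the approximate percentages above.
labels = ["Worst", "Poor", "Medium", "Good", "Excellent"]
api_category = pd.qcut(prep_data['API'], q=[0, 0.125, 0.25, 0.5, 0.75, 1.0], labels=labels)
print(api_category.value_counts())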
# How to read the plot
# size of the dots: house price (bigger means more expensive)
# color of the dots: API, in 5 levels corresponding to the color scale
import plotly.plotly as py
import pandas as pd
#df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_us_cities.csv')
#df.head()
prep_data['text'] = prep_data['city_name'] + '<br> Avg.Home Price ' + '$'+(prep_data['House_price13']).astype(str) + '/sqft'+'<br> School Rating: '+(prep_data['API']).astype(str)
limits = [(0,50),(51,100),(101,200),(201,300),(301,383)]
colors = ["lightyellow","rgb(8,18,28)","rgb(133,20,75)","rgb(255,133,27)","lightgrey"]
#colors = ~carat
cities = []
scale = 5000
name0 = ["Worst","Poor","Medium","Good","Excellent"]
for i in range(len(limits)):
lim = limits[i]
df_sub = prep_data.iloc[lim[0]:lim[1]]
city = dict(
type = 'scattergeo',
locationmode = 'ISO-3',
lon = df_sub['lon'],
lat = df_sub['lat'],
text = df_sub['text'],
marker = dict(
size = df_sub['House_price13']/10,
color = colors[i],
line = dict(width=0.5, color='rgb(40,40,40)'),
sizemode = 'area'
),
#name = '{0} - {1}'.format(lim[0],lim[1]) )
name = name0[i])
cities.append(city)
layout = dict(
title = '2013 California Public School Rating v.s Home Price<br>(Click legend to toggle traces)',
showlegend = True,
geo = dict(
scope='usa',
projection=dict( type='albers usa'),
resolution = 50,
showland = True,
landcolor = 'rgb(217, 217, 217)',
subunitwidth=1,
countrywidth=1,
subunitcolor="rgb(255, 255, 255)",
lonaxis = dict( range= [ -135.0, -115.0 ] ),
lataxis = dict( range= [ 20.0, 45.0 ] ),
),
)
fig = dict( data=cities, layout=layout )
py.iplot( fig, validate=False, filename='d3-bubble-map-populations' )
Note: This is an interactive plot. You can see detailed information about API and home price for each city in our dataset. Play with it! =)
If we take a close look at one category at a time (toggle categories by clicking the legend on the map), some interesting patterns emerge. Schools in the Worst category are clustered around L.A. Comparing them with the Poor category, the dots become bigger, which implies an increase in home price. Between the Poor and Medium categories (with only those two selected), the change in home price is not obvious, but schools of these two kinds are spread fairly evenly across Southern and Northern California instead of being clustered around L.A. When comparing Good schools with Excellent schools (with only those two selected), the dots become much larger, meaning home prices rise as public school ratings increase. If we directly compare Worst with Excellent, the contrast in dot size (that is, home price) is much sharper. As school ratings increase, home prices increase gradually. Thus, we can say there is a positive correlation between home price and school rating.
Not surprisingly, excellent schools are concentrated around SF and LA. However, the overall public school quality in Northern California is more evenly distributed and higher than in Southern California.
Next, we want to examine whether there is any relationship between crime and house price.
We first do some data selection for this purpose.
_2013 = price.columns[208:220]
casqft = price[price['State']=='CA']
price13 = casqft[_2013]
price13.loc[:,'Mean Price'] = price13.mean(1)
price13.loc[:,'City'] = casqft['City']
price13=price13.groupby('City').mean()
crime_price = pd.DataFrame({'City': list(price13.index), 'Mean Price': list(price13['Mean Price'])})
crime = pd.read_excel("table_8_offenses_known_to_law_enforcement_california_by_city_2013.xls")
crime.columns=['City', 'Population', 'Violent',
'Murder','Rape1', 'Rape',
'Robbery', 'Aggravated assault', 'Property crime', 'Burglary',
'Larceny theft', 'Motor vehicle theft', 'Arson']
crime_name=['Murder', 'Rape','Robbery', 'Aggravated assault', 'Property crime',
'Burglary','Larceny theft', 'Motor vehicle theft', 'Arson']
#crime[crime_name] = crime[crime_name].astype(int)
weights=[70.42,10.47,5.83,4.05,1.87,1.39,.37,.37,.16]
weighted_list=[]
not_weighted=[]
for name,weight in zip(crime_name,weights):
weighted_list.append(crime[name].values*weight)
not_weighted.append(crime[name].values)
weighted_crime=np.sum(weighted_list,axis=0)
not_weighted_crime=np.sum(not_weighted,axis=0)
crime['weighted_crime']=weighted_crime
crime['not_weighted_crime']=not_weighted_crime
total_data_crime=pd.merge(crime_price,crime,left_on='City',right_on='City',how='inner')
total_data_crime = total_data_crime.sort_values(by=['Mean Price'])
import plotly.plotly as py
import plotly.graph_objs as go
crime_types = ['Violent', 'Murder', 'Rape', 'Robbery', 'Aggravated assault',
               'Property crime', 'Burglary', 'Larceny theft',
               'Motor vehicle theft', 'Arson']
# one scatter trace per crime type: x = number of offenses, y = mean 2013 home price
traces = []
for name in crime_types:
    marker = dict(symbol='circle', size=5)
    if name == 'Violent':
        marker['color'] = 'rgba(156, 165, 196, 0.95)'
    traces.append(go.Scatter(
        x=total_data_crime[name],
        y=total_data_crime['Mean Price'],
        mode='markers',
        name=name,
        marker=marker,
        text=total_data_crime['City']
    ))
data = traces
layout = go.Layout(
title="Occurrence of different crime type vs Median Price",
yaxis=dict(
title='Median House Price'
),
xaxis=dict(
title='Frequency of Crime'
),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='crimemedianprice')
In order to get a first impression of the relationship between crime and house price, we plot scatter plots of the occurrences of different crime types against house price. From the plot, we can see there are some outliers, so we may need to identify those outliers and exclude them from our analysis.
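As one rough way to flag such outliers, we could mark cities whose total crime count lies far above the rest, for example beyond the 99th percentile; this is only a sketch of a possible rule, not a filter applied in the analysis below.
# Flag cities with extremely high total crime counts (simple 99th-percentile rule).
threshold = total_data_crime['not_weighted_crime'].quantile(0.99)
outliers = total_data_crime[total_data_crime['not_weighted_crime'] > threshold]
print(outliers[['City', 'not_weighted_crime', 'Mean Price']])
# A copy of the data with those cities removed, if we wanted to redo the plots without them.
crime_no_outliers = total_data_crime[total_data_crime['not_weighted_crime'] <= threshold]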
From the plot, there is no clear linear relationship between any single crime type's occurrence and house price. Therefore, we want to see whether there is any overall trend between crime occurrence and house price. We will plot a stacked bar chart of crime together with the house price in the same figure to find out.
import plotly
plotly.tools.set_credentials_file(username='', api_key='') ## Need own username and apikey
import plotly.plotly as py
import plotly.graph_objs as go
crime_types = ['Violent', 'Murder', 'Rape', 'Robbery', 'Aggravated assault',
               'Property crime', 'Burglary', 'Larceny theft',
               'Motor vehicle theft', 'Arson']
# one stacked bar per crime type on the left axis, plus the median home price as a line on the right axis
traces = []
for name in crime_types:
    kwargs = dict(x=total_data_crime['City'],
                  y=total_data_crime[name],
                  name=name,
                  yaxis='y1')
    if name == 'Violent':
        kwargs['marker'] = dict(color='rgb(158,202,225)')
    traces.append(go.Bar(**kwargs))
traces.append(go.Scatter(
    x=total_data_crime['City'],
    y=total_data_crime['Mean Price'],
    name='House Median Price',
    yaxis='y2'
))
data = traces
layout = go.Layout(
barmode='stack',
legend=dict(
x=0.01,
y=1
),
yaxis1=dict(
title='Crime',
titlefont=dict(
color='#ff7f0e'
),
tickfont=dict(
color='#ff7f0e'
)
),
yaxis2=dict(
title='House Median Price',
titlefont=dict(
color='#1f77b4'
),
tickfont=dict(
color='#1f77b4'
),
anchor='x',
overlaying='y',
side='right',
position=0.75
)
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='marker-h-bar')
#py.image.save_as(fig, 'totalcrimeandhouseprice.png')
In the plot, the blue curve is the house price, and its y-axis is on the right.
The stacked bars indicate the number of occurrences of each crime type. From the stacked bars, we can see that Los Angeles has the most crime occurrences, with San Francisco next.
If we exclude those large metropolitan areas (drag on the graph to zoom in), the graph can be roughly divided into three parts.
Therefore, we may be able to conclude that there is a negative relationship between crime occurrence and house price in California.
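As a quick numerical check on this claim, one could compute a rank correlation between the weighted crime index and the mean home price; this is just a sketch and not part of the plots above.
from scipy import stats
# Spearman rank correlation between the weighted crime index and mean 2013 home price.
# A negative coefficient with a small p-value would support a negative relationship.
rho, p_value = stats.spearmanr(total_data_crime['weighted_crime'], total_data_crime['Mean Price'])
print('Spearman rho: {:.3f}, p-value: {:.3g}'.format(rho, p_value))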
In this section, we want to know how housing news relates to house prices. We first extract housing news headlines and their release dates from CNBC. We then process each headline we extract to assign a sentiment score, where 0 is negative and 1 is positive. Next, we compute the mean sentiment score for each month and merge it with the median price per square foot for that month.
import nltk
import pandas as pd
import unicodedata
from lxml import html
import requests
import numpy as np
import os
from nltk.sentiment.vader import SentimentIntensityAnalyzer as sia
import requests_cache
from os import path
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from KaggleWord2VecUtility import KaggleWord2VecUtility
from urllib2 import Request, urlopen
from urlparse import urlparse, urlunparse
import json
import urllib2
import re
from bs4 import BeautifulSoup
requests_cache.install_cache('coll_cache')
Web scraping: We use the following code to extract the headlines from cnbc.com/housing.
# web scraping
# Find all the titles and dates on each page.
titles=[]
times =[]
opener = urllib2.build_opener()
for i in range(1,54):
url = 'http://www.cnbc.com/housing/?page='+str(i)
soup = BeautifulSoup(opener.open(url), "lxml")
    time_tags = soup.find_all('time')
    # keep the date portion of each <time> tag and the text of each headline; skip the first four entries on each page
    times = times + ['-'.join(str(t).split()[2:4]) for t in time_tags][4:]
    titles = titles + [' '.join(title.get_text().split()) for title in soup.find_all('div', {'class' : 'headline'})][4:]
# save data to data frame then output to an excel
whole = pd.DataFrame({'titles':titles,'times':times})
whole = whole.T
writer = pd.ExcelWriter('output.xlsx')
whole.to_excel(writer,'Sheet1')
writer.save()
price = pd.read_csv("City_MedianValuePerSqft_AllHomes.csv")
Here we decided to build our own sentiment analysis model, since the built-in sentiment tool in NLTK is not very accurate on these headlines.
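For reference, the built-in tool imported above (SentimentIntensityAnalyzer, i.e. VADER) can be run like this for comparison; a minimal sketch, assuming the vader_lexicon resource has been downloaded and using a made-up example headline.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')  # required once before first use
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("Mortgage rates fall as home sales pick up")  # example headline, not from our data
print(scores)  # dict with 'neg', 'neu', 'pos' and a 'compound' score in [-1, 1]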
# Read output.xlsx, in which we manually labeled 100 headlines (the 'po-neg' column) to use as training data.
new1 = pd.read_excel("output.xlsx")
train = new1.dropna()
test = new1[new1['po-neg'].isnull()]
# building training set
train.titles = train.titles.apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
train = train.reset_index(drop=True)
train['po-neg'] = train['po-neg'].astype(dtype = np.int64)
# building testing set
test.titles = test.titles.apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
test = test.reset_index(drop = True)
# denoise the training data
clean_train_reviews = []
print "Cleaning and parsing the training set news heading...\n"
for i in xrange(0, len(train["titles"])):
clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["titles"][i],True)))
# tokenize the training data set
print "Creating the bag of words...\n"
vectorizer = CountVectorizer(analyzer = "word", \
tokenizer = None, \
preprocessor = None, \
stop_words = None, \
max_features = 4000)
We use RandomForestClassifier from the sklearn package and set the number of estimators to 100, so the random forest consists of one hundred decision trees. A decision tree is a graph that models the possible outcomes of a series of decisions, and a random forest is an ensemble of such trees.
In this case, we fit the forest on the training set to build these 100 trees.
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( train_data_features, train["po-neg"])
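We do not hold out any labeled headlines for validation here; as an optional sanity check, the classifier's accuracy could be estimated with k-fold cross-validation on the 100 labeled examples. A sketch, assuming a scikit-learn version that provides sklearn.model_selection:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# 5-fold cross-validated accuracy of the same kind of model on the labeled training headlines
cv_forest = RandomForestClassifier(n_estimators = 100)
cv_scores = cross_val_score(cv_forest, train_data_features, train["po-neg"], cv=5)
print("cross-validated accuracy: %.2f (+/- %.2f)" % (cv_scores.mean(), cv_scores.std()))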
# denoise the testing data set
clean_test_reviews = []
print "Cleaning and parsing the test set headline...\n"
for i in xrange(0,len(test["titles"])):
clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["titles"][i], True)))
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
# predict the labels of the testing set with the trained classifier
# save the result to a data frame
print "Predicting test label...\n"
result = forest.predict(test_data_features)
output = pd.DataFrame( data={"titles":test["titles"], "po-neg":result} )
previous_value = list(train['po-neg'].values)
previous_value = previous_value+list(output['po-neg'].values)
new1['po-neg']= previous_value
new1.titles = new1.titles.apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
new1.to_csv("sentiment_result.csv")
means = new1.groupby("times").mean()
means.to_csv("mean.csv")
The following data frame shows the mean sentiment score for each month. The sentiment for a month is considered positive if its mean score is above 0.5, and negative otherwise.
mean=pd.read_csv("mean.csv")
mean
price_web = price.dropna()
average_price = pd.DataFrame(np.mean(price_web))
sentiment_month = pd.merge(average_price,mean,left_index=True,right_on='times',how='inner')
sentiment_month
positive_negative = sentiment_month['po-neg'].values
avg_price = sentiment_month.iloc[:,0].values
correct = 0
for i in range(len(positive_negative)-1):
    # count a month as correct if positive sentiment is followed by a price increase,
    # or negative sentiment is followed by a price decrease
    if positive_negative[i] > 0.5 and avg_price[i+1] > avg_price[i]:
        correct += 1
    elif positive_negative[i] < 0.5 and avg_price[i+1] < avg_price[i]:
        correct += 1
print 'accuracy of sentiment model is ', 100.0 * correct / (len(positive_negative) - 1)
We can see that our sentiment model achieves 100% accuracy by this measure. However, we do not really know how accurate the model is, because house prices have been increasing over the last few years and our model only produces positive sentiment scores from 2014 to 2017. We hope to obtain more news headlines from around 2006 to 2008 as a testing set for the model. Still, in this case the news does appear to track the house price. It would also be nice to extract news by city, which would give us more insight into how news affects housing prices in each city.
There are some interesting findings about the Academic Performance Index, as shown in the fifth part of the project. Moreover, after examining the relationship between API and house price, we find a pattern showing that in cities with higher school ratings, house prices are also higher.
From the results on crime and house price, the less crime occurs, the higher the house price tends to be.
Therefore, when people decide where to buy a house, school rating and the crime situation are two factors that should be taken into account along with the price.
Furthermore, information from the news can also provide a directional prediction of house prices, even though house prices have been increasing for the past few years.