In this notebook I will use data on house sales in King County to predict house prices using simple (one input) linear regression. I will:
- use SArray and SFrame functions to compute important summary statistics
- write a function to compute the slope and intercept of a simple linear regression using the closed form solution
- use the slope and intercept to make predictions (and inverse predictions)
- compare two different models by computing the RSS on test data
import turicreate
The dataset is from house sales in King County, the region where the city of Seattle, WA is located.
sales = turicreate.SFrame('home_data.sframe/')
sales.head()
id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
---|---|---|---|---|---|---|---|---|
7129300520 | 2014-10-13 00:00:00+00:00 | 221900.0 | 3.0 | 1.0 | 1180.0 | 5650.0 | 1.0 | 0 |
6414100192 | 2014-12-09 00:00:00+00:00 | 538000.0 | 3.0 | 2.25 | 2570.0 | 7242.0 | 2.0 | 0 |
5631500400 | 2015-02-25 00:00:00+00:00 | 180000.0 | 2.0 | 1.0 | 770.0 | 10000.0 | 1.0 | 0 |
2487200875 | 2014-12-09 00:00:00+00:00 | 604000.0 | 4.0 | 3.0 | 1960.0 | 5000.0 | 1.0 | 0 |
1954400510 | 2015-02-18 00:00:00+00:00 | 510000.0 | 3.0 | 2.0 | 1680.0 | 8080.0 | 1.0 | 0 |
7237550310 | 2014-05-12 00:00:00+00:00 | 1225000.0 | 4.0 | 4.5 | 5420.0 | 101930.0 | 1.0 | 0 |
1321400060 | 2014-06-27 00:00:00+00:00 | 257500.0 | 3.0 | 2.25 | 1715.0 | 6819.0 | 2.0 | 0 |
2008000270 | 2015-01-15 00:00:00+00:00 | 291850.0 | 3.0 | 1.5 | 1060.0 | 9711.0 | 1.0 | 0 |
2414600126 | 2015-04-15 00:00:00+00:00 | 229500.0 | 3.0 | 1.0 | 1780.0 | 7470.0 | 1.0 | 0 |
3793500160 | 2015-03-12 00:00:00+00:00 | 323000.0 | 3.0 | 2.5 | 1890.0 | 6560.0 | 2.0 | 0 |
view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
---|---|---|---|---|---|---|---|---|
0 | 3 | 7.0 | 1180.0 | 0.0 | 1955.0 | 0.0 | 98178 | 47.51123398 |
0 | 3 | 7.0 | 2170.0 | 400.0 | 1951.0 | 1991.0 | 98125 | 47.72102274 |
0 | 3 | 6.0 | 770.0 | 0.0 | 1933.0 | 0.0 | 98028 | 47.73792661 |
0 | 5 | 7.0 | 1050.0 | 910.0 | 1965.0 | 0.0 | 98136 | 47.52082 |
0 | 3 | 8.0 | 1680.0 | 0.0 | 1987.0 | 0.0 | 98074 | 47.61681228 |
0 | 3 | 11.0 | 3890.0 | 1530.0 | 2001.0 | 0.0 | 98053 | 47.65611835 |
0 | 3 | 7.0 | 1715.0 | 0.0 | 1995.0 | 0.0 | 98003 | 47.30972002 |
0 | 3 | 7.0 | 1060.0 | 0.0 | 1963.0 | 0.0 | 98198 | 47.40949984 |
0 | 3 | 7.0 | 1050.0 | 730.0 | 1960.0 | 0.0 | 98146 | 47.51229381 |
0 | 3 | 7.0 | 1890.0 | 0.0 | 2003.0 | 0.0 | 98038 | 47.36840673 |
long | sqft_living15 | sqft_lot15 |
---|---|---|
-122.25677536 | 1340.0 | 5650.0 |
-122.3188624 | 1690.0 | 7639.0 |
-122.23319601 | 2720.0 | 8062.0 |
-122.39318505 | 1360.0 | 5000.0 |
-122.04490059 | 1800.0 | 7503.0 |
-122.00528655 | 4760.0 | 101930.0 |
-122.32704857 | 2238.0 | 6819.0 |
-122.31457273 | 1650.0 | 9711.0 |
-122.33659507 | 1780.0 | 8113.0 |
-122.0308176 | 2390.0 | 7570.0 |
train_data,test_data = sales.random_split(.8,seed=0)
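As a quick sanity check, the two splits should cover the full dataset in roughly an 80/20 ratio:
# the training and test sets together should account for every row in sales
print(len(train_data), len(test_data), len(sales))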
In order to make use of the closed form solution, as well as take advantage of Turi Create's built-in functions, I will first review some important ones. In particular:
# computing the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extracting the price column of the sales SFrame -- this is now an SArray
# the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = len(prices) # when prices is an SArray len() returns its length
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # checking the mean function against arithmetic average
print("average price via method 1: " + str(avg_price_1))
print("average price via method 2: " + str(avg_price_2))
average price via method 1: 540088.1419053348
average price via method 2: 540088.1419053345
As we can see, we get essentially the same answer both ways (the two values differ only by floating point rounding).
# if we want to multiply every price by 0.5 it's as simple as:
half_prices = 0.5*prices
# Let's compute the sum of squares of price. We can also multiply two SArrays of the same length elementwise with *
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is an SArray of the squared prices and we want to add them up.
print("the sum of price squared is: " + str(sum_prices_squared))
the sum of price squared is: 9217325133550736.0
Armed with these SArray functions I can now use the closed form solution to compute the slope and intercept of a simple linear regression on observations stored as SArrays: input_feature, output.
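Written out, these are the standard least squares estimates that the function below computes (with $N$ observations $(x_i, y_i)$):

$$\text{slope} = \frac{\sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{N}}{\sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{N}}, \qquad \text{intercept} = \bar{y} - \text{slope}\cdot\bar{x}$$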
def simple_linear_regression(input_feature, output):
    # number of observations
    n = len(input_feature)
    # sums of the input_feature and the output
    sum_input_feature = input_feature.sum()
    sum_output = output.sum()
    # sum of the elementwise product of the input_feature and the output
    sum_product = (input_feature * output).sum()
    # sum of the squared values of the input_feature
    sum_input_feature_squared = (input_feature * input_feature).sum()
    # closed form solution for the slope
    numerator_slope = sum_product - (sum_input_feature * sum_output) / n
    denominator_slope = sum_input_feature_squared - (sum_input_feature * sum_input_feature) / n
    slope = numerator_slope / denominator_slope
    # closed form solution for the intercept: mean of output minus slope times mean of input
    intercept = sum_output / n - slope * sum_input_feature / n
    return (intercept, slope)
I can test that my function works by passing it something where I already know the answer. In particular, I can generate a feature and put the output exactly on a line: output = 1 + 1*input_feature. Then I know both my slope and intercept should be 1.
test_feature = turicreate.SArray(range(5))
test_output = turicreate.SArray(1 + 1*test_feature)
(test_intercept, test_slope) = simple_linear_regression(test_feature, test_output)
print("Intercept: " + str(test_intercept))
print("Slope: " + str(test_slope))
Intercept: 1.0
Slope: 1.0
Now that I know it works, I will build a regression model for predicting price based on sqft_living.
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])
print("Intercept: " + str(sqft_intercept))
print( "Slope: " + str(sqft_slope))
Intercept: -47116.07657494
Slope: 281.9588385676974
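As an optional cross-check, I can compare these values against Turi Create's built-in linear regression. This is only a rough sketch: linear_regression.create applies a small L2 penalty and holds out a validation set by default, so I disable both to keep it comparable to the closed form solution, and its iteratively solved coefficients may still differ slightly.
# optional cross-check against the built-in model; turn off regularization and the
# automatic validation split so the fit is ordinary least squares on train_data
sqft_model = turicreate.linear_regression.create(train_data, target='price',
                                                 features=['sqft_living'],
                                                 l2_penalty=0.0, l1_penalty=0.0,
                                                 validation_set=None, verbose=False)
print(sqft_model.coefficients)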
Now that I have the model parameters (intercept and slope), I can make predictions. Using SArrays, it's easy to multiply an SArray by a constant and add a constant value.
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values from the fitted line
    predicted_values = intercept + slope * input_feature
    return predicted_values
Now that I can calculate a prediction given the slope and intercept, I will make my first prediction: the estimated price of a house with 2650 square feet, according to the square-feet model estimated above.
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print("The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price))
The estimated price for a house with 2650 squarefeet is $700074.85
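As a quick arithmetic check, this is just the fitted line evaluated at 2650 square feet: $-47116.08 + 281.9588 \times 2650 \approx 700{,}075$, matching the prediction above up to rounding.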
Now that I have a model and can make predictions, I will evaluate it using the Residual Sum of Squares (RSS). RSS is the sum of the squared residuals, and a residual is just the difference between the predicted output and the true output.
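In symbols, with predictions $\hat{y}_i = \text{intercept} + \text{slope}\cdot x_i$:

$$RSS = \sum_i \left(y_i - \hat{y}_i\right)^2$$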
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # first get the predictions
    predictions = get_regression_predictions(input_feature, intercept, slope)
    # then compute the residuals (difference between the true and predicted output)
    residuals = output - predictions
    # square the residuals and add them up
    RSS = (residuals * residuals).sum()
    return RSS
I will now test my get_residual_sum_of_squares function by applying it to the test model, where the data lie exactly on a line. Since they lie exactly on a line, the residual sum of squares should be zero.
print(get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope)) # should be 0.0
0.0
Now I will use my function to calculate the RSS on training data from the squarefeet model calculated above.
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print('The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft))
The RSS of predicting Prices based on Square Feet is : 1201918356321968.0
print("{:e}".format(get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)))
1.201918e+15
What if I want to predict the square footage given the price? Since I have an equation y = a + b*x, I can solve it for x. So if I have the intercept (a), the slope (b), and the price (y), I can solve for the estimated square footage (x).
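Rearranging gives the estimate that the function below uses:

$$x = \frac{y - a}{b}$$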
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature
    estimated_feature = (output - intercept) / slope
    return estimated_feature
Now that I have a function to compute the square footage given the price from my simple regression model, let's see how big I might expect a house that costs $800,000 to be.
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print("The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet))
The estimated squarefeet for a house worth $800000.00 is 3004
I have made one model for predicting house prices using square feet, but there are many other features in the sales SFrame. I will now use my simple linear regression function to estimate the regression parameters for predicting price based on the number of bedrooms.
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
print("Intercept_bedrooms: " + str(bedrooms_intercept))
print("Slope_bedrooms: " + str(bedrooms_slope))
Intercept_bedrooms: 109473.18046928808
Slope_bedrooms: 127588.95217458377
Now I have two models for predicting the price of a house. How do we know which one is better? I will calculate the RSS on the TEST data for each of these models.
rss_prices_on_sqft = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)
print('The RSS of predicting Prices based on Square Feet is : ' + str("{:e}".format(rss_prices_on_sqft)))
The RSS of predicting Prices based on Square Feet is : 2.754029e+14
rss_prices_on_bedrooms = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedrooms_intercept, bedrooms_slope)
print('The RSS of predicting Prices based on bedrooms is : ' + str("{:e}".format(rss_prices_on_bedrooms)))
The RSS of predicting Prices based on bedrooms is : 4.933646e+14
It can be observed that the RSS is lower for the model based on square feet. Thus, for these two simple models, the square footage of a house is a better predictor of its price than the number of bedrooms.