Rachael Herman

Explore and Predict Housing Sale Price in R

Real estate is a critical sector of the economy, so housing market forecasts matter for sales and development. Many factors influence housing prices, and the industry is tied to the financial system through collateral, loan securitization, and leverage, which is why real estate forecasting models deserve extensive study. Real estate has distinctive price dynamics because properties are heterogeneous in location and physical attributes (Ghysels, Plazzi, Torous, & Valkanov, 2013). Predicting housing prices also comes with challenges. Transaction costs are high and trading is infrequent: buying and selling real estate takes a long time, and owners often hold the asset for extended periods, so the available price histories are short (Ghysels, Plazzi, Torous, & Valkanov, 2013).

This project explores and forecasts housing prices in Ames, Iowa. The goal is to identify significant predictors, develop a linear regression model, and determine how well the model fits the data and how accurate its predictions are. The training data set of 1,000 observations and 25 variables is used to fit the linear regression model. The testing data set of 460 observations and 25 variables is used to determine how well the predicted prices match the observed prices.

Figure 1

Packages Installed, Libraries Accessed, and Data Sets Read into RStudio

Figure 2

Summary Statistics for SalePrice

a) Test Data Set

b) Training Data Set

Figure 3

Histogram and Boxplot for Sale Price Distribution in Test Set

Figure 4

Combined Training and Testing Data Set with Histogram and Boxplot

Figure 5

Linear Regression Model Using Explanatory Variables and SalePrice as the Response

Figure 6

Removal of Missing Data and 20 SalePrice Predictions Using Test Data

Figure 7

Data Frame and Line Graph to Compare Actual Prices to Predicted Prices in First 20 Rows

The Process

After the required packages were installed and their libraries loaded, the working directory was set with the setwd() function and the training and testing files were read into RStudio. The data were stored in the housesTrain and housesTest data frames, as shown in Figure 1. Figure 2 shows the summary statistics for SalePrice in the testing data set, produced with the summary() command. Figure 3 is a histogram and boxplot of SalePrice in the testing data set, drawn and styled with the hist(), par(), boxplot(), and box() functions. Figure 4 combines the training and testing data sets using the combine() function and then replots the combined data with the same commands. The options() command disables scientific notation so that the sale prices on the x-axis print in full (Rosenberg, 2021), which makes the graph much easier to read.
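The reading, summarizing, and plotting steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code shown in the figures: the two CSV files are simulated stand-ins written to temporary paths, read.csv() is assumed as the reader, scipen = 999 is one common way to use options() for this purpose, and base R's rbind() is swapped in for the combine() call so the sketch needs no extra packages.

```r
# Simulated stand-in for the Ames files (the real CSVs are not reproduced here).
set.seed(1)
make_houses <- function(n) {
  data.frame(
    GrLivArea = round(runif(n, 600, 3000)),
    SalePrice = round(rlnorm(n, meanlog = 12, sdlog = 0.4))  # right-skewed prices
  )
}
train_path <- tempfile(fileext = ".csv")  # hypothetical file locations
test_path  <- tempfile(fileext = ".csv")
write.csv(make_houses(1000), train_path, row.names = FALSE)
write.csv(make_houses(460),  test_path,  row.names = FALSE)

# Read the training and testing files into data frames.
housesTrain <- read.csv(train_path)
housesTest  <- read.csv(test_path)

# Minimum, quartiles, median, mean, and maximum of the response variable.
summary(housesTest$SalePrice)

# A large scipen value makes R prefer fixed over scientific notation.
options(scipen = 999)

# Histogram and horizontal boxplot of the testing set, side by side.
par(mfrow = c(1, 2))
hist(housesTest$SalePrice, main = "Test set", xlab = "SalePrice")
boxplot(housesTest$SalePrice, horizontal = TRUE, main = "Test set", xlab = "SalePrice")
box()  # draw a frame around the current plot

# Stack the two sets (rbind() requires matching column names) and replot.
housesAll <- rbind(housesTrain, housesTest)
hist(housesAll$SalePrice, main = "Combined set", xlab = "SalePrice")
boxplot(housesAll$SalePrice, horizontal = TRUE, main = "Combined set", xlab = "SalePrice")
par(mfrow = c(1, 1))  # restore the single-panel layout
```

With real files, the two write.csv() calls would be dropped and the paths replaced by the actual file names in the working directory.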

Figure 5 shows a linear regression model fitted to the training set using the lm() command, with all explanatory variables as predictors and SalePrice as the response variable. In Figure 6, the na.omit() function removes rows with missing values from the testing set. The test20 object holds the first 20 rows of the cleaned testing set. The predict_price object then uses the predict() function to generate predicted values for those 20 rows from the linear regression model. Finally, Figure 7 shows a data frame and a line plot, built with the plot(), lines(), and legend() functions, that compare the actual prices in the test20 object with the prices predicted by the fitted model.
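The modeling steps above can be sketched as follows. The data here are simulated, with only three explanatory variables and two missing values injected by hand, so the numbers will not match the figures. The object names housesTrain, housesTest, test20, and predict_price come from the text, while fit, compare, and simulate_houses are names assumed for illustration.

```r
# Simulated data with a known linear relationship (an assumption for this sketch).
set.seed(42)
simulate_houses <- function(n) {
  GrLivArea   <- round(runif(n, 600, 3000))
  OverallQual <- sample(1:10, n, replace = TRUE)
  GarageArea  <- round(runif(n, 0, 900))
  SalePrice   <- 20000 + 60 * GrLivArea + 15000 * OverallQual +
    40 * GarageArea + rnorm(n, sd = 20000)
  data.frame(GrLivArea, OverallQual, GarageArea, SalePrice)
}
housesTrain <- simulate_houses(1000)
housesTest  <- simulate_houses(460)
housesTest$GarageArea[c(3, 11)] <- NA  # inject missing values to exercise na.omit()

# Fit SalePrice on all other columns; "." expands to every remaining variable.
fit <- lm(SalePrice ~ ., data = housesTrain)
summary(fit)

# Drop every row that contains at least one missing value.
housesTest <- na.omit(housesTest)

# Predict prices for the first 20 complete rows of the testing set.
test20 <- head(housesTest, 20)
predict_price <- predict(fit, newdata = test20)

# Put actual and predicted prices side by side.
compare <- data.frame(actual = test20$SalePrice, predicted = predict_price)
compare

# Line plot of actual versus predicted prices.
y_limits <- range(c(compare$actual, compare$predicted))
plot(compare$actual, type = "l", col = "blue", ylim = y_limits,
     xlab = "Observation", ylab = "SalePrice")
lines(compare$predicted, col = "red")
legend("topleft", legend = c("Actual", "Predicted"),
       col = c("blue", "red"), lty = 1)
```

Setting ylim from both series keeps the predicted line from being clipped when it extends beyond the range of the actual prices.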

Interpretation of Results

The summary statistics for the testing data set in Figure 2 show the minimum and maximum sale prices, which help put the histogram in context. The median price of $161,750 in the testing data set is close to the median of $163,995 in the training data set. Because the data contain many high-priced outliers, the median is a better indicator of a typical price than the mean ($177,958 in the testing set and $182,285 in the training set); the median is less affected by those outliers than the mean (The Balance Editors, 2022).
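A small worked example shows why the median is more robust here. The prices below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the Ames data.

```r
# Seven hypothetical sale prices with no extreme values.
prices <- c(120000, 135000, 150000, 160000, 175000, 190000, 210000)
mean(prices)    # about 162857
median(prices)  # 160000

# Add a single very expensive house.
with_outlier <- c(prices, 750000)
mean(with_outlier)    # 236250
median(with_outlier)  # 167500
```

One outlier raises the mean by more than $73,000 but moves the median by only $7,500, which mirrors the gap between the mean and median sale prices reported above.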

As seen in Figure 3, the testing set's histogram and boxplot are right-skewed, demonstrating that most housing prices in Ames, IA, fall on the lower end, at or below $210,000. The majority are between approximately $100,000 and $200,000, and the boxplot shows multiple outliers. After combining the training and testing data sets and replotting, as seen in Figure 4, the distribution is still right-skewed with multiple outliers. The boxplot is narrower in the combined graph given the larger number of observations, with most prices again falling between $100,000 and $200,000.

A linear regression model uses the characteristics of each sale to predict housing prices. A summary of the linear regression model in Figure 5 shows several significant predictors of SalePrice: Lot Area, Overall Quality, Overall Condition, Year Built, MasVnrArea (masonry veneer area), TotalBsmtSF (total basement square footage), GrLivArea (above-ground living area), BedroomAbvGr (bedrooms above ground), KitchenAbvGr (kitchens above ground), TotRmsAbvGrd (total rooms above ground), and GarageArea. All except BedroomAbvGr and KitchenAbvGr have positive coefficients. Each coefficient describes the effect of one variable with all other variables in the model held constant. Therefore, as BedroomAbvGr or KitchenAbvGr increases with everything else fixed, the predicted SalePrice decreases; as any of the other significant variables increases, the predicted SalePrice increases.
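The phrase "when all variables are considered" matters for the negative coefficients. The simulation below is a hedged illustration of one way a bedroom count can carry a negative coefficient once living area is in the model; the data-generating numbers are invented, and this is not a claim about why the Ames model behaves as it does.

```r
# Invented data: bigger houses have more bedrooms, but for a fixed living
# area each extra bedroom lowers the price (smaller rooms).
set.seed(123)
n <- 1000
GrLivArea    <- runif(n, 600, 3000)
BedroomAbvGr <- pmax(1, round(GrLivArea / 600 + rnorm(n, sd = 0.5)))
SalePrice    <- 50000 + 100 * GrLivArea - 8000 * BedroomAbvGr +
  rnorm(n, sd = 20000)
houses <- data.frame(SalePrice, GrLivArea, BedroomAbvGr)

# Bedrooms alone: the coefficient absorbs the effect of house size.
bed_only <- lm(SalePrice ~ BedroomAbvGr, data = houses)
coef(bed_only)["BedroomAbvGr"]      # positive in this simulation

# Bedrooms with living area held constant.
bed_and_area <- lm(SalePrice ~ BedroomAbvGr + GrLivArea, data = houses)
coef(bed_and_area)["BedroomAbvGr"]  # negative in this simulation
```

The sign flips because the second model compares houses of the same size, which is the sense in which each coefficient in a model with several predictors should be read.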

The significant predictor variables are those with a p-value below 0.05. Overfitting can be an issue with regression models that include many predictors; however, the training sample is large relative to the number of variables (1,000 observations and 25 variables). In overfitting, results may be overly optimistic and findings difficult or impossible to replicate on other data sets (Babyak, 2004). That does not appear to be the case here. The R-squared and adjusted R-squared values indicate how well the model fits the training data. They equal 0.8473 and 0.8423, respectively, so the model explains about 85% of the variance in SalePrice, which indicates a good fit.
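The p-values and R-squared values discussed above can be read directly from the summary of a fitted model. The sketch below uses simulated data, so its output will not match Figure 5; the names houses, fit, fit_summary, and significant are assumptions, and MoSold is included only as a predictor with no built-in effect on price.

```r
# Simulated data: two predictors that drive price and one that does not.
set.seed(7)
n <- 1000
houses <- data.frame(
  GrLivArea   = runif(n, 600, 3000),
  OverallQual = sample(1:10, n, replace = TRUE),
  MoSold      = sample(1:12, n, replace = TRUE)  # no effect by construction
)
houses$SalePrice <- 20000 + 60 * houses$GrLivArea +
  15000 * houses$OverallQual + rnorm(n, sd = 25000)

fit <- lm(SalePrice ~ ., data = houses)
fit_summary <- summary(fit)

# The coefficient table has one row per term; the last column is the p-value.
coef_table <- fit_summary$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Predictors significant at the 0.05 level, excluding the intercept.
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
significant

# Share of variance explained, and the version penalized for model size.
fit_summary$r.squared
fit_summary$adj.r.squared
```

Adjusted R-squared is never larger than R-squared, and a small gap between the two, as in the values reported above, suggests that the model is not padded with many uninformative predictors.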

The predicted prices in Figure 6 are for the first 20 observations of the testing data after records with missing values have been removed. The data frame and line graph in Figure 7 compare the actual and predicted values in table and graphical form. The actual prices are visually similar to the predicted prices, another indication that the model fits well.
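The visual comparison could be complemented with a numeric error summary. The sketch below is an optional addition that the report itself did not perform; the four price pairs are hypothetical, and compare, mae, and rmse are illustrative names.

```r
# Hypothetical actual and predicted prices (not values from Figure 7).
compare <- data.frame(
  actual    = c(200000, 150000, 180000, 220000),
  predicted = c(190000, 160000, 185000, 210000)
)

# Prediction error for each house.
compare$error <- compare$predicted - compare$actual

# Mean absolute error: the typical size of a miss, in dollars.
mae <- mean(abs(compare$error))
mae   # 8750

# Root mean squared error: like MAE, but penalizes large misses more heavily.
rmse <- sqrt(mean(compare$error^2))
rmse  # about 9014
```

Computed on the testing set, these measures express in dollars how far the predictions fall from the observed prices, which a line graph can only suggest.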


This project presented a few challenges. The first was removing the scientific notation from the histogram, which made the axis labels confusing. After extensive searching online, the options() command was identified as the solution: it disables scientific notation and gives a clearer view of the sale prices on the histogram (Rosenberg, 2021). Another challenge was determining how best to analyze the predicted values. When the actual and predicted prices sit next to each other in a data frame, the pattern is hard to see and the table does not tell a story. The comparison became much clearer once the same information was plotted as a line chart. This took extensive research into graphing and a few additional package installations.

Overall, the training set was large enough to create a reliable linear regression model to predict housing sale prices. The median prices in the summary data are the best indicators of typical prices because they are less affected by the significant outliers. Outliers in this set are values over $350,000, as shown on the boxplots for the testing and combined data sets. The training set was used to fit the linear regression model, and the testing set helped determine how well that model predicted new data. The model was judged a good fit based on sample size and R-squared values. It identified the significant predictors, most of which had positive coefficients: as those variables increased, the predicted SalePrice also increased. The two variables with negative coefficients, bedrooms and kitchens above ground level, indicate that as they increased with the other variables held constant, the predicted SalePrice decreased.

Finally, once rows with missing data were omitted from the test set, the first 20 rows were used to predict prices with the linear regression model. These predictions were then compared to the actual prices in a table. Because the table alone did not make the comparison clear, a line graph was used to visualize it. Visually, the prices were very similar, another indication that the model is a good fit for predicting housing prices in Ames, IA.

References


Babyak, M. A. (2004). What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine, 66(3), 411–421.

Ghysels, E., Plazzi, A., Torous, W., & Valkanov, R. (2013). Forecasting Real Estate Prices. Handbook of Economic Forecasting, 509–580.

Rosenberg, M., Wisconsin School of Business. (2021, July 18). 2.5 options() Command | Analytics Using R. Retrieved September 7, 2022, from https://pubs.wsb.wisc.edu/academics/analytics-using-r-2019/s-options.html

The Balance Editors. (2022, April 7). Median vs. average: What’s the difference? The Balance. Retrieved September 7, 2022, from https://www.thebalance.com/median-vs-average-what-the-difference-2682237