Best Predictor of Selling Price

 

 

Create a histogram of the variable selling price. Does it appear normally distributed? Justify your answer.
Create a correlation matrix to explore the linear correlation coefficient of each independent variable with selling price. Rank the correlation coefficients from strongest to weakest linear association.
Create scatterplots of the selling price vs each of the independent variables. Do all of the relationships appear linear? Does the variance appear constant? Are there any outliers?
Add a regression line to each of the scatterplots. Which one demonstrates the highest R²? Does this make sense given the visual appearance of the scatterplots? Why or why not?
Run a simple linear regression model for each independent variable. Complete the table below.
Examine the residual plot for each independent variable. Do the residuals appear random around zero?
Are all of the variables significant predictors of selling price? Select the model you think best explains the variation around the mean selling price. Justify your choice with information from the comparison chart in 4.
List the assumptions of linear regression. Does your model violate any of these assumptions? Justify your answer (i.e., what information helped you evaluate each assumption).

Sample Solution

Data Exploration:

  1. Histogram: Create a histogram of the selling price. Analyze the shape of the distribution.

    • Normally distributed data will resemble a symmetric, bell-shaped curve. Look for marked skewness (a longer tail on one side) or unusually heavy or light tails (kurtosis) that would indicate a departure from normality.
  2. Correlation Matrix: Calculate a correlation matrix to explore the linear correlation coefficient between each independent variable and the selling price. Rank them from strongest to weakest.

    • Correlation coefficients range from -1 to 1. Values closer to 1 indicate strong positive associations, values closer to -1 indicate strong negative associations, and values close to 0 indicate weak or no linear association. Rank by absolute value, so that a coefficient of -0.8 counts as stronger than one of 0.5.
  3. Scatterplots: Create scatterplots of the selling price vs each independent variable. Analyze each plot for:

    • Linearity: Check whether the points follow a straight or a curved pattern. A straight-line pattern suggests a linear relationship, while a curved pattern suggests a non-linear one.
    • Homoscedasticity: Observe if the spread of the points around the trendline is consistent throughout the plot. Inconsistent spread indicates heteroscedasticity, which can violate assumptions of linear regression.
    • Outliers: Identify any data points that fall far away from the main cluster of points. These might be outliers that could influence the analysis.
  4. Regression Lines: Add a regression line to each scatterplot. Calculate the R-squared (coefficient of determination) for each model. R-squared represents the proportion of variance in the selling price explained by the independent variable.

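    • A higher R-squared indicates a stronger linear relationship between the variables. However, it doesn’t guarantee that a straight line is the right model or that other important factors are absent. In simple linear regression, R-squared equals the square of the correlation coefficient, so the R-squared ranking should match the ranking of the absolute correlations.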
  5. Simple Linear Regression: Run a simple linear regression model for each independent variable. Fill out a table including:

    • Independent Variable
    • Coefficient Estimate (slope)
    • P-value
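
The correlation ranking and the per-variable regressions described in steps 2, 4, and 5 can be sketched as follows. This is a minimal illustration using only the Python standard library. The data values and the variable names (price, sqft, age, bedrooms) are hypothetical and are not taken from the assignment's dataset. P-values are not computed here, since they would normally come from a statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares slope, intercept, and R-squared for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))  # unexplained variation
    sst = sum((b - mean_y) ** 2 for b in y)                              # total variation
    return slope, intercept, 1 - sse / sst

# Hypothetical data (NOT from the assignment): price in $1000s and three predictors.
price = [200, 250, 300, 350, 400]
predictors = {
    "sqft": [1000, 1300, 1500, 1800, 2000],
    "age": [30, 28, 15, 20, 5],
    "bedrooms": [3, 2, 3, 3, 3],
}

# Correlation of each predictor with price, ranked by absolute value (strongest first).
correlations = {name: pearson_r(values, price) for name, values in predictors.items()}
ranking = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)

# One simple regression per predictor: the rows of the comparison table.
for name in ranking:
    slope, intercept, r_squared = fit_line(predictors[name], price)
    print(f"{name:10s} r={correlations[name]:+.3f} slope={slope:.3f} R^2={r_squared:.3f}")
```

Ranking by absolute value keeps a strong negative association (such as the hypothetical age variable) ahead of a weak positive one.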

Model Evaluation:

  1. Residual Plots: Examine the residual plots for each model. Residuals are the differences between the actual selling prices and the predicted values from the regression line. Ideally, residuals should be scattered randomly around zero with no apparent patterns.
  2. Significance: Analyze the p-values from the regression table. A p-value below the chosen significance level (usually 0.05) indicates that the slope differs significantly from zero, meaning the independent variable is a statistically significant predictor of the selling price.
  3. Best Model Selection: Based on the R-squared, p-values, and visual inspection of the plots, choose the model that you think best explains the variation around the mean selling price. Consider both statistical significance and how well the model visually captures the trend in the data.
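
The residuals and the significance check described above can be sketched as follows, again with the standard library only and with hypothetical data (the sqft and price values are not from the assignment). Instead of a p-value, the sketch compares the slope's t-statistic with a critical value taken from a standard t table, which leads to the same decision at the 0.05 level.

```python
from math import sqrt

def simple_regression(x, y):
    """Fit y = intercept + slope * x and return the fit with basic diagnostics."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = actual value minus the value predicted by the regression line.
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sse = sum(e * e for e in residuals)
    sst = sum((b - mean_y) ** 2 for b in y)
    # Standard error of the slope uses n - 2 degrees of freedom.
    # (Assumes the fit is not perfect, otherwise se_slope would be zero.)
    se_slope = sqrt(sse / (n - 2) / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - sse / sst,
        "residuals": residuals,
        "t_stat": slope / se_slope,
    }

# Hypothetical data (NOT from the assignment): square footage and price in $1000s.
sqft = [1000, 1300, 1500, 1800, 2000]
price = [200, 250, 300, 350, 400]

model = simple_regression(sqft, price)

# Two-sided 5% critical value of the t distribution with n - 2 = 3 degrees of freedom,
# taken from a standard t table.
T_CRITICAL_DF3 = 3.182
is_significant = abs(model["t_stat"]) > T_CRITICAL_DF3

print("residuals:", [round(e, 2) for e in model["residuals"]])
print("slope significant at 0.05:", is_significant)
```

Plotting the residuals against the independent variable (or against the fitted values) gives the residual plot referred to in step 1.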

Linear Regression Assumptions:

Linear regression relies on several assumptions:

  • Linearity: The relationship between the independent and dependent variables should be linear.
  • Independence: The errors (residuals) should be independent of each other.
  • Homoscedasticity: The variance of the errors should be constant across all values of the independent variable.
  • Normality: The errors (residuals) should be normally distributed.

Assumption Violations:

Your analysis throughout the steps above should provide information to evaluate these assumptions.

  • Look for non-linear patterns in the scatterplots to assess linearity violations.
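  • Random scatter around zero in the residual plots, with no trend or runs when the residuals are viewed in observation order, is consistent with independence. Independence also depends on how the data were collected, for example each sale being a separate observation.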
  • Consistent spread of residuals indicates homoscedasticity.
  • Normality of residuals can be assessed with a histogram or normal probability (Q-Q) plot of the residuals, or with a formal normality test such as Shapiro-Wilk.
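
A few rough numeric checks can supplement the visual checks listed above. The sketch below uses only the standard library, and the function names and residual values are hypothetical illustrations. It computes the Durbin-Watson statistic (values near 2 suggest no first-order autocorrelation, and it is only meaningful when the observations have a natural order), a crude ratio of residual spread between the upper and lower halves of the independent variable, and the sample skewness of the residuals. These are simplified stand-ins, not replacements for the formal tests in a statistics package.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order autocorrelation."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def spread_ratio(x, residuals):
    """Mean squared residual in the upper half of x divided by that in the lower half.

    A ratio far from 1 hints at non-constant variance (heteroscedasticity).
    With an odd number of points the middle observation is left out.
    """
    pairs = sorted(zip(x, residuals), key=lambda pair: pair[0])
    half = len(pairs) // 2
    lower = [e for _, e in pairs[:half]]
    upper = [e for _, e in pairs[-half:]]
    mean_sq_lower = sum(e * e for e in lower) / len(lower)
    mean_sq_upper = sum(e * e for e in upper) / len(upper)
    return mean_sq_upper / mean_sq_lower

def skewness(values):
    """Sample skewness: near 0 for a symmetric distribution such as the normal."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical residuals and predictor values (NOT from the assignment).
x_values = [1000, 1200, 1400, 1600, 1800, 2000]
example_residuals = [4.0, -6.0, 3.0, -2.0, 5.0, -4.0]

print("Durbin-Watson:", round(durbin_watson(example_residuals), 3))
print("Spread ratio:", round(spread_ratio(x_values, example_residuals), 3))
print("Skewness:", round(skewness(example_residuals), 3))
```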

By analyzing the data and model results, you can identify potential violations of these assumptions and determine if the linear regression models are suitable for this data.
