This week, working on my subsetted Mars crater data, I will test a multiple regression model to examine my research question.
Research Question:
- Is crater diameter associated with crater depth?
In addition to my primary explanatory variable (crater diameter), I will consider the effect of a potential confounding variable (Ejecta_Layers) on the association between crater depth and crater diameter.
Null Hypothesis (H0):
- Crater depth is not associated with crater diameter.
Alternative Hypothesis (H1):
- Crater depth is associated with crater diameter.
Crater diameter and crater depth are both quantitative variables, so I centered crater diameter (the explanatory variable) by subtracting its mean. The additional explanatory variable, Ejecta_Layers, is categorical. I recoded Ejecta_Layers so that its first category starts from zero. I then performed both linear and polynomial regression analyses and generated appropriate diagnostic plots in Python. The Python code is shown below.
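As a rough illustration of the centering and recoding steps described above, here is a minimal sketch in pandas. The toy values and the column names (`diameter`, `depth`, `ejecta_layers`) are assumptions made for this example, not the actual Mars crater data or its variable names, and subtracting the minimum is only one possible way to make the first category zero.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "diameter": [1.2, 3.5, 10.0, 25.3, 60.0],
    "depth": [0.10, 0.35, 0.80, 1.40, 2.10],
    "ejecta_layers": [1, 2, 1, 3, 2],
})

# Center the quantitative explanatory variable by subtracting its mean,
# so that zero corresponds to the average crater diameter.
df["diameter_c"] = df["diameter"] - df["diameter"].mean()

# Recode the categorical variable so that its lowest category is zero
# (assumption: shifting by the minimum is enough for this toy coding).
df["ejecta_layers_0"] = df["ejecta_layers"] - df["ejecta_layers"].min()

print(df[["diameter_c", "ejecta_layers_0"]])
```

After centering, the mean of `diameter_c` is zero up to rounding error, which makes the intercept of a regression interpretable as the expected depth at the average diameter.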
Interpretation of Basic Linear and Multiple Regression Models
Summary of Results
After adjusting for the potential confounder (Ejecta_Layers), crater diameter (first order Beta = 0.0596, p < 0.0001; second order Beta = -0.0011, p < 0.0001; third order Beta = 7.943e-06, p < 0.0001) was significantly and positively associated with crater depth. When crater diameter enters the model as a first order (linear) term, it confounds the relationship between crater depth and Ejecta_Layers. However, Ejecta_Layers becomes statistically significant when a third order polynomial fit is applied to crater diameter. The multiple regression model accounts for 54.4% of the total variability seen in the response variable (crater depth).
Statistics for Centered Explanatory Variable
First Model: Linear Regression between Crater Diameter (explanatory) and Crater Depth (response)
The first order linear regression model above shows that crater diameter (Beta = 0.044, p-value < 0.0001) is significantly and positively associated with crater depth. The blue line of best fit in the scatterplot above represents the first order linear regression model. R-squared is 0.511, which implies that crater diameter can explain 51% of the variability in crater depth. Therefore, I will examine other explanatory variables (Number of Ejecta_Layers) to investigate if the R-Squared can be improved.
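A first order model of this kind can be sketched with the statsmodels formula interface. The data below are synthetic stand-ins generated only for this example, and the column names `diameter_c` and `depth` are hypothetical, so the printed numbers will not match the results reported above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Mars crater data); names are hypothetical.
rng = np.random.default_rng(0)
n = 200
diameter = rng.uniform(1, 60, n)
depth = 0.2 + 0.04 * diameter + rng.normal(0, 0.3, n)
df = pd.DataFrame({"diameter_c": diameter - diameter.mean(), "depth": depth})

# First order model: depth explained by centered diameter alone.
model1 = smf.ols("depth ~ diameter_c", data=df).fit()

print(model1.params)    # intercept and slope (Beta)
print(model1.pvalues)   # p-value for each coefficient
print(model1.rsquared)  # share of the variance in depth that is explained
```

Calling `model1.summary()` prints the full regression table, which is the kind of output the interpretation above refers to.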
Second Model: Linear Regression between Ejecta_Layers (explanatory) and Crater Depth (response)
The linear regression model above shows that Ejecta_Layers (Beta = 0.005, p-value < 0.0001) is significantly and positively associated with crater depth. R-squared is 0.102, which implies that number of Ejecta Layers can explain 10% of the variability in crater depth. Therefore, my next step is to combine both explanatory variables (Crater Diameter and Ejecta_Layers) in a multiple regression model to investigate if the R-Squared can be improved.
Third Model: Multiple Regression between First Order Crater Diameter (explanatory), Ejecta_Layers (potential confounder) and Crater Depth (response)
The table above shows the multiple regression results for my explanatory variable (crater diameter) and the potential confounding variable (Ejecta_Layers). Ejecta_Layers is no longer significantly associated with crater depth after controlling for crater diameter: its p-value is 0.422, which is greater than 0.05, and its 95% confidence interval (-0.011 to 0.005) contains zero. I conclude that crater diameter confounds the relationship between crater depth and Ejecta_Layers, because Ejecta_Layers loses statistical significance once crater diameter is included in the model. The next step is to investigate whether a polynomial fit better describes the association between crater diameter and crater depth.
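The confounding check can be illustrated by fitting the model with and without the second variable and comparing the coefficient of interest. The sketch below uses synthetic data in which depth depends on diameter only while the layer count is merely correlated with diameter. All names and values are assumptions for illustration, and `ejecta_layers` is treated as a numeric code here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Mars crater data); names are hypothetical.
rng = np.random.default_rng(1)
n = 500
diameter = rng.uniform(1, 60, n)
# Layer count is made to depend on diameter; depth depends on diameter only.
layers = np.clip(np.round(diameter / 20 + rng.normal(0, 0.5, n)), 0, 3)
depth = 0.2 + 0.04 * diameter + rng.normal(0, 0.3, n)
df = pd.DataFrame({
    "diameter_c": diameter - diameter.mean(),
    "ejecta_layers": layers,
    "depth": depth,
})

# Model without and with the diameter term.
unadjusted = smf.ols("depth ~ ejecta_layers", data=df).fit()
adjusted = smf.ols("depth ~ diameter_c + ejecta_layers", data=df).fit()

# 95% confidence interval of the layer coefficient in the adjusted model.
low, high = adjusted.conf_int().loc["ejecta_layers"]

print("unadjusted:", unadjusted.params["ejecta_layers"],
      unadjusted.pvalues["ejecta_layers"])
print("adjusted:  ", adjusted.params["ejecta_layers"],
      adjusted.pvalues["ejecta_layers"])
print("adjusted 95% CI:", low, high)
```

In this constructed example the layer coefficient shrinks sharply once diameter is in the model, which is the pattern that signals confounding. If the adjusted interval spans zero, the variable is no longer significant at the 5% level.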
Fourth Model: Third Order Polynomial Model between Crater Diameter (explanatory) and Crater Depth (response)
The table above summarizes the results of the third order (cubic) polynomial fit between crater diameter and crater depth. The linear, quadratic and cubic terms all have p-values < 0.0001, indicating a significant association. A positive linear coefficient (0.0589) together with a negative quadratic coefficient (-0.0011) indicates that the curve rises and is concave around the mean crater diameter, where the centered variable equals zero. The R-squared for this cubic fit is 0.544, an improvement over the 0.511 obtained with the first order linear regression model. I will therefore keep the third order polynomial terms in my model because they capture the observed curvilinear relationship between crater diameter and crater depth. My next step will be to check whether crater diameter, modeled as a third order polynomial, still confounds the relationship between crater depth and Ejecta_Layers. In the scatterplot above, the blue line represents the first order linear regression model and the green line represents the third order polynomial regression model.
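Polynomial terms can be added to a statsmodels formula with `I(...)`, which tells the formula parser to evaluate the expression as ordinary arithmetic. The following sketch compares a linear and a cubic fit on synthetic data with a built-in curvature. The data and column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a concave trend (NOT the Mars crater data).
rng = np.random.default_rng(2)
n = 300
diameter = rng.uniform(1, 60, n)
x = diameter - diameter.mean()  # centered diameter
depth = 0.9 + 0.06 * x - 0.001 * x**2 + rng.normal(0, 0.3, n)
df = pd.DataFrame({"diameter_c": x, "depth": depth})

# First order fit versus third order (cubic) polynomial fit.
linear = smf.ols("depth ~ diameter_c", data=df).fit()
cubic = smf.ols(
    "depth ~ diameter_c + I(diameter_c**2) + I(diameter_c**3)", data=df
).fit()

print("linear R-squared:", linear.rsquared)
print("cubic R-squared: ", cubic.rsquared)
print(cubic.params)  # intercept, linear, quadratic and cubic coefficients
```

Because the linear model is nested inside the cubic one, the cubic R-squared can never be lower. The p-values of the higher order terms indicate whether the extra curvature is statistically supported.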
Fifth Model: Multiple Regression between Third Order Polynomial Crater Diameter (explanatory) and Ejecta_Layers (explanatory) and Crater Depth (response)
In the table above, all the p-values of the explanatory variables are less than 0.0001, indicating a significant association with the response variable (crater depth). Ejecta_Layers therefore becomes significant after controlling for a third order polynomial in crater diameter. However, its practical contribution is minimal: R-squared stays at 0.544, the same as in the previous model. The combined explanatory variables (polynomial crater diameter and Ejecta_Layers) thus still explain 54.4% of the variability in crater depth.
The results so far support my alternative hypothesis that crater diameter is associated with crater depth. Ejecta_Layers makes only a minimal contribution, and only when crater diameter is modeled with a third order polynomial.
Interpretation of Regression Diagnostic Plots
Several diagnostic plots were generated to evaluate my polynomial regression model for evidence of misspecification. The residuals (the differences between actual and predicted crater depth) were used to check how well the model fits and whether its assumptions hold. The leverage (influence) plot and the standardized residuals were used to check whether any outliers strongly influence the estimation of the regression coefficients.
Q-Q Plot of Multiple Polynomial Regression Model
The Q-Q plot above describes my final regression model, which uses the third order polynomial crater diameter and Ejecta_Layers as explanatory variables to predict crater depth. The plot reveals that the residuals are mostly normally distributed but deviate at the lower and upper quantiles, so they do not follow a perfect normal distribution. This implies that the variability observed between crater depth and crater diameter may not be fully explained by my model. However, since the residuals follow the normal reference line closely over most of their range, it is reasonable to treat the model as an adequate fit for the available data.
Standardized Residuals
The plot above shows that most of the residuals lie within 2 standard deviations above or below the mean of zero (the range expected to contain about 95% of normally distributed residuals). However, a notable number of residuals lie more than 2 standard deviations (up to 4 standard deviations) from the mean in either direction. This suggests the presence of some outliers.
The plot in the upper right hand corner above shows the residuals for each observation at different values of crater diameter. The absolute values of the residuals are noticeably larger mainly at lower crater diameters and at crater diameters greater than 40 km. This is consistent with the partial regression plot in the lower left hand corner and with the other regression diagnostic plots, which suggest that the model does not predict crater depth very well for very small crater diameters and for crater diameters greater than 40 km. Either these data points are outliers or more explanatory variables are needed to improve the model. To check how much influence these outliers have, I generated the leverage plot below.
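The share of residuals beyond 2 standard deviations can also be counted directly. This is a minimal sketch, assuming synthetic data and hypothetical column names. It uses the internally studentized residuals from `get_influence()` as the standardized residuals.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Mars crater data); names are hypothetical.
rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(-30, 30, n)
depth = 0.9 + 0.06 * x - 0.001 * x**2 + rng.normal(0, 0.3, n)
df = pd.DataFrame({"diameter_c": x, "depth": depth})

model = smf.ols(
    "depth ~ diameter_c + I(diameter_c**2) + I(diameter_c**3)", data=df
).fit()

# Standardized (internally studentized) residuals, one per observation.
stdres = model.get_influence().resid_studentized_internal

# Fraction of observations more than 2 standard deviations from zero.
share_beyond_2 = np.mean(np.abs(stdres) > 2)
print("share beyond 2 SD:", share_beyond_2)
print("largest |residual| in SD units:", np.abs(stdres).max())
```

For normally distributed residuals roughly 5% are expected beyond 2 standard deviations, so a clearly larger share, or values near 4, points to outliers or a misspecified model.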
Leverage (Influence) Plot
The plot above shows low leverage values (<0.01) across all observations, which indicates that outliers have little influence on the model. A notable number of residuals lie more than 2 standard deviations above or below the mean, but these observations have very low leverage values (between 0 and 0.002). The observations with relatively higher leverage (0.002 to 0.01) all lie within 2 standard deviations of the mean. Overall, no outlier has undue influence on the model. Based on these diagnostic plots, the model seems good at predicting crater depth for most crater diameter values.
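The same check can be made numerically by combining leverage with standardized residuals. The sketch below uses synthetic data and hypothetical names. The cut-off of twice the average leverage is a common rule of thumb that I assume here, not a threshold taken from the analysis above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Mars crater data); names are hypothetical.
rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(-30, 30, n)
depth = 0.9 + 0.06 * x - 0.001 * x**2 + rng.normal(0, 0.3, n)
df = pd.DataFrame({"diameter_c": x, "depth": depth})

model = smf.ols(
    "depth ~ diameter_c + I(diameter_c**2) + I(diameter_c**3)", data=df
).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag               # one leverage value per row
stdres = influence.resid_studentized_internal      # standardized residuals
p = len(model.params)                              # number of coefficients

# Flag observations that are both outlying and high leverage.
# Assumed rule of thumb: leverage above twice the average value p / n.
flagged = (np.abs(stdres) > 2) & (leverage > 2 * p / n)

print("max leverage:", leverage.max())
print("outlying and high leverage:", int(flagged.sum()))
```

The leverage values always sum to the number of estimated coefficients, so their average is `p / n`. statsmodels also provides `sm.graphics.influence_plot` for a graphical version of this comparison.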
Python Code: Testing a Multiple Regression Model
Data Loading, Error Handling and Data Conditioning
Multiple Regression Model
Regression Diagnostics to Evaluate Model Fit
Author:
Posted on February 21, 2016 by Okechukwu Ossai