This is the first week of the Data Analysis Tools course, the second course in the Data Analysis and Interpretation specialization taught by Wesleyan University via Coursera. The Data Analysis Tools course is a continuation of the Data Management and Visualization course, which is the first course in the Data Analysis and Interpretation specialization. Previous weekly assignments for the first course can be found in the blog archives.
The objective of this first-week assignment is to identify a research question, formulate a hypothesis and test the hypothesis using the Analysis of Variance (ANOVA) F-test. I will continue to work with the Mars Crater dataset and will examine the research question below.
- Does crater depth depend on crater location (crater latitude)?
Null Hypothesis (H0):
- Crater depth is not associated with crater location/latitude.
Alternative Hypothesis (H1):
- Crater depth is associated with crater location/latitude.
The above research question will examine whether the depth of impact craters on Mars is related to the location of the crater. In other words, does crater depth vary across different regions or latitudes on Mars? To answer this question, I will look at two variables: DEPTH_RIMFLOOR_TOPOG, which is Crater Depth measured in kilometres, and LATITUDE_CIRCLE_IMAGE, which is Crater Latitude measured in decimal degrees North. Both variables are quantitative. However, the ANOVA F-test requires a categorical explanatory variable and a quantitative response variable. Therefore, I created a new variable called MARS_REGION by collapsing Crater Latitude into 3 categories:
- Category 3: South Pole; -90 to -45 degrees Latitude
- Category 2: Near Equator; -45 to 45 degrees Latitude
- Category 1: North Pole; 45 to 90 degrees Latitude
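A minimal sketch of how this collapsing could be done in Python is shown here. The function name `mars_region` is hypothetical, and because the ranges above share the endpoints -45 and 45, the sketch assumes that both boundary values fall in the Near Equator category.

```python
def mars_region(latitude):
    """Map a latitude in decimal degrees North to a MARS_REGION code (1, 2 or 3)."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if latitude < -45:
        return 3  # South Pole: -90 up to (but not including) -45
    if latitude <= 45:
        return 2  # Near Equator: -45 to 45 inclusive (boundary handling is an assumption)
    return 1      # North Pole: above 45 up to 90


# Example latitudes, chosen only to exercise each category and both boundaries.
regions = [mars_region(lat) for lat in (-80.0, -45.0, 0.0, 45.0, 60.0)]
print(regions)  # [3, 2, 2, 2, 1]
```

With a pandas DataFrame, the same function could be applied to the LATITUDE_CIRCLE_IMAGE column with `.apply(mars_region)` to build the MARS_REGION column.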
The ANOVA F-test will examine differences in the mean of the response variable (Crater Depth) across the categories (MARS_REGION) of the explanatory variable (Crater Latitude). Since I have more than 2 groups, a significant ANOVA does not tell me which groups differ from the others. So, I performed a post hoc test, Tukey's Honestly Significant Difference (HSD) test, to evaluate the difference between each pair of group means. The hypothesis testing was conducted in Python using the statsmodels statistical library.
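The workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact script behind the results in this post: the data is synthetic (randomly generated, not the Mars Crater dataset), the group means and sizes are made up, and only the column names are taken from the text.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Synthetic stand-in data (NOT the Mars Crater dataset): three groups with
# made-up mean depths, used only to show the calls involved.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "MARS_REGION": np.repeat([1, 2, 3], 200),
    "DEPTH_RIMFLOOR_TOPOG": np.concatenate([
        rng.normal(0.2, 0.3, 200),
        rng.normal(0.6, 0.3, 200),
        rng.normal(0.3, 0.3, 200),
    ]),
})

# One-way ANOVA: C() marks MARS_REGION as a categorical explanatory variable.
model = smf.ols("DEPTH_RIMFLOOR_TOPOG ~ C(MARS_REGION)", data=demo).fit()
f_value, p_value = model.fvalue, model.f_pvalue
df_between, df_within = model.df_model, model.df_resid
print(f"F({df_between:.0f}, {df_within:.0f}) = {f_value:.1f}, p = {p_value:.3g}")

# Tukey HSD post hoc test: one comparison per pair of regions.
comparison = multi.MultiComparison(demo["DEPTH_RIMFLOOR_TOPOG"], demo["MARS_REGION"])
tukey = comparison.tukeyhsd()
print(tukey.summary())
```

The `reject` column of the Tukey summary shows, for each pair of groups, whether the hypothesis of equal means is rejected at the chosen alpha (0.05 by default).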
Model Results: ANOVA Summary Table
Model Interpretation for ANOVA:
I tested whether Crater Depth is associated with Crater Location / Latitude. Crater Latitude was collapsed into 3 groups, which represent 3 separate regions of Mars. A one-way ANOVA with these 3 groups (MARS_REGION 1, 2 and 3) indicated significant variation of crater depth across the regions of Mars for all Fresh Uneroded Craters (my sample). The group means show that craters located in the middle of Mars, around the Equator (MARS_REGION = 2), are deeper on average (Mean=0.621 km, s.d.=0.349 km) than craters located in either the North Pole region (MARS_REGION = 1; Mean=0.179 km, s.d.=0.272 km) or the South Pole region (MARS_REGION = 3; Mean=0.276 km, s.d.=0.291 km). F(2, 18062) = 2456, p < 0.001 (reported as 0.00).
Assumed Alpha = 0.05. The F value is very high (2456), which indicates that the between-group variability is large relative to the within-group variability. The reported p value of 0.00 is a rounded value: if the null hypothesis were true, the probability of observing an F statistic at least this large would be far below the 0.05 threshold. So, I reject the Null Hypothesis. This supports a relationship between crater depth and the region of Mars where the craters are located.
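To make the meaning of the F value concrete, here is a small sketch that computes a one-way ANOVA F statistic by hand as the ratio of the between-group mean square to the within-group mean square. The function name `one_way_f` and the toy numbers are illustrative assumptions, not values from the crater data.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data: three small groups whose means clearly differ.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With 3 groups the between-group degrees of freedom are always 2, and the within-group degrees of freedom equal the number of observations minus 3, which matches the form F(2, 18062) reported above.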
Model Results: Post Hoc ANOVA Summary Table
Model Interpretation for Post Hoc ANOVA Results:
Post hoc comparison of means also revealed significant differences between the pairs of means of the 3 groups. Craters located near the equator in the middle of Mars (MARS_REGION = 2) are deeper (higher mean) than craters in the Polar Regions (lower means).
Comparing the 2 Polar Regions, it is also evident that craters in the South Pole region (MARS_REGION = 3) are deeper (higher mean) than craters in the North Pole region (MARS_REGION = 1). The Tukey test rejected the null hypothesis of equal means for every pairwise comparison.
Group Mean Visualization:
The differences between the group means and within group variability can be best visualized using the boxplot below.
Posted on January 03, 2016 by Okechukwu Ossai