The Week 3 assignment involves making data management decisions for the variables chosen to answer my research questions. These decisions include removing invalid data, selecting a working subset or representative sample, creating secondary variables, and binning or grouping continuous variables.
There are 384,343 craters in the original Mars database. On inspection, I discovered that 10 craters have negative depths and 307,529 craters have depths equal to zero. It is physically implausible for an impact crater to have a depth less than or equal to zero. These ambiguous depth values are possibly edge effects. Therefore, all 307,539 craters with negative or zero depths were excluded as invalid. I also made a scatter plot of crater diameter vs crater depth to visualize the data trend. Points that plotted away from the main data cluster were treated as outliers, and an additional 292 craters were excluded on that basis.
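As a rough illustration of the depth check described above, here is a minimal sketch in plain Python. The function name `split_by_depth` and the sample values are made up for illustration. They are not taken from the Mars database or from the linked assignment code.

```python
def split_by_depth(depths_km):
    """Partition crater depths into negative, zero, and positive groups."""
    negative = [d for d in depths_km if d < 0]
    zero = [d for d in depths_km if d == 0]
    positive = [d for d in depths_km if d > 0]
    return negative, zero, positive


# Tiny made-up sample (hypothetical values, not real crater data).
sample_depths = [-0.01, 0.0, 0.0, 0.22, 1.5, 0.0, 0.8]

negative, zero, positive = split_by_depth(sample_depths)

# Craters with negative or zero depth are the ones treated as invalid.
invalid_count = len(negative) + len(zero)
print("negative:", len(negative), "zero:", len(zero), "invalid:", invalid_count)
```

Counting the negative and zero groups separately before dropping them makes it easy to report how many records each rule removes.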
The final working dataset includes only craters with:
- Crater depth greater than zero and less than or equal to 3 km
- Crater diameter greater than zero and less than or equal to 100 km
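The two criteria above can be sketched as a simple filter. This is only an illustration in plain Python: the field names `depth_km` and `diameter_km`, the helper `is_in_working_set`, and the sample records are assumptions, not names or values from the actual database.

```python
MAX_DEPTH_KM = 3.0       # upper depth limit stated above
MAX_DIAMETER_KM = 100.0  # upper diameter limit stated above


def is_in_working_set(crater):
    """Return True if a crater meets both the depth and diameter criteria."""
    return (0 < crater["depth_km"] <= MAX_DEPTH_KM
            and 0 < crater["diameter_km"] <= MAX_DIAMETER_KM)


# Made-up records (hypothetical, for illustration only).
sample = [
    {"depth_km": 0.5, "diameter_km": 20.0},   # kept
    {"depth_km": 0.0, "diameter_km": 5.0},    # dropped: zero depth
    {"depth_km": 3.0, "diameter_km": 100.0},  # kept: both upper limits are inclusive
    {"depth_km": 3.2, "diameter_km": 80.0},   # dropped: too deep
    {"depth_km": 1.0, "diameter_km": 150.0},  # dropped: too wide
]

working = [c for c in sample if is_in_working_set(c)]
print(len(working), "of", len(sample), "craters retained")
```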
A total of 76,512 of the original 384,343 craters were retained as valid data in the new working dataset. Histograms of the original crater population and the working dataset look very similar, confirming that the selected working dataset is an excellent representative sample of the population. Therefore, any conclusions inferred from the working dataset will be valid for the entire population.
Using the working dataset, I then created 3 new variables (DEPTH_GROUP, DIAMETER_GROUP and LATITUDE_GROUP) by binning the original variables. Frequency distribution tables were constructed for each variable.
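To show what binning and a frequency table look like in practice, here is a minimal sketch in plain Python for a depth grouping. The bin edges, the labels, and the sample depths are illustrative assumptions. The bins actually used for DEPTH_GROUP are in the linked code, not reproduced here.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical upper bin edges in km; bins are right-closed: (0, 1], (1, 2], (2, 3].
DEPTH_EDGES_KM = [1.0, 2.0, 3.0]
DEPTH_LABELS = ["0-1 km", "1-2 km", "2-3 km"]


def bin_label(value, upper_edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    Assumes the value is already within the working range; a value above
    the last edge would raise IndexError.
    """
    return labels[bisect_left(upper_edges, value)]


# Made-up depths from a filtered working set (hypothetical values).
depths_km = [0.2, 0.9, 1.0, 1.5, 2.4, 3.0]

depth_group = [bin_label(d, DEPTH_EDGES_KM, DEPTH_LABELS) for d in depths_km]

# Frequency distribution: count and percentage per bin.
counts = Counter(depth_group)
total = len(depth_group)
for label in DEPTH_LABELS:
    print(f"{label:8s} {counts[label]:3d} {100 * counts[label] / total:6.1f}%")
```

The same pattern would apply to the diameter and latitude groupings, each with its own edges and labels.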
This week's solution blog includes:
- Python code
- 3 scatterplots: the original crater database, the selected working dataset, and the crater outliers
- Frequency distributions of 4 discrete and collapsed (binned) crater variables
Click on each link below to view the solutions to the Week 3 assignment, including the full Python code and an interpretation of the individual frequency distributions.
1. Python Data Analysis Code (Week 3): click here to view
2. Results and Interpretation (Week 3): click here to view