This project investigates datasets from Gapminder World and aims to gain insights into countries' environment and economy from their consumption of energy. The analysis covers steps ranging from wrangling data to exploratory data analysis and drawing conclusions.
- Software
  - conda 4.6.3 or a similar version
  - Python 3.7.2 (or Python 3)
- Packages
  - pandas
  - numpy
  - copy (Python standard library)
  - matplotlib.pyplot
  - seaborn
- Raw Data
  - All `.csv` files of the raw data were manually downloaded from Gapminder World. The Indicator Name of each dataset is listed below in italics. The names of the Python objects that contain the data in each file are provided in parentheses next to the Indicator Name.
    - Energy
      - *Oil > Oil consumption, total* (`df_energy_oil`)
      - *Coal > Coal consumption, total* (`df_energy_coal`)
      - *Total > Energy use, per person* (`df_energy_cons`)
      - *Total > Energy production /person* (`df_energy_prod`)
    - *Environment > Emissions > CO2 emissions yearly* (`df_env_co2`)
    - *Economy > Incomes & growth > GDP/capita (USD, inflation-adjusted)* (`df_gdp`)
  - Although the six datasets used in this project are not large, the repository includes neither the raw datasets nor the cleaned datasets because
    - the raw datasets can be manually downloaded from Gapminder World by searching for the Indicator Name.
    - the cleaned datasets can be obtained by cleaning the raw datasets according to the step-by-step wrangling operations documented in the `.ipynb` file.
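Once a file has been downloaded, loading it might look like the minimal sketch below. The helper name `load_indicator`, the wide layout (one row per country, one column per year), and the in-memory sample values are assumptions made for illustration; the notebook's actual loading code and file names are not shown in this README.

```python
import io

import pandas as pd


def load_indicator(source):
    """Read one indicator CSV into a DataFrame indexed by country.

    Assumption: the first column holds country names and every other
    column is a year (a wide layout).
    """
    df = pd.read_csv(source, index_col=0)
    df.index.name = "country"
    return df


# In-memory stand-in for a downloaded file; the values are made up.
sample_csv = io.StringIO(
    "country,2000,2001\n"
    "A,1.0,2.0\n"
    "B,,4.0\n"
)
df_energy_oil = load_indicator(sample_csv)

# Missing cells are read as NaN, which the wrangling step would then handle.
print(df_energy_oil.isna().sum())
```

In practice `source` would be the path of a downloaded `.csv` file rather than a `StringIO` buffer.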
- Introduction
  - provides the motivation for investigating and analyzing datasets across Energy, Environment, and Economy.
  - categorizes the investigation and analysis into two separate sections.
    - CO2 Emissions vs. Oil and Coal Consumption
    - GDP vs. Energy Consumption
  - provides the Indicator Name of all raw datasets used in each section.
  - introduces the main questions for each section, which are investigated through the subsequent analyses.
- Data Wrangling
  - identifies several general properties of the raw datasets in each section.
  - builds on these general properties to define and execute the operations for cleaning the raw datasets.
- Exploratory Data Analysis
  - applies data visualization and descriptive statistics to analyze the cleaned datasets and derive insights.
  - answers the main questions posed in the introduction.
- Conclusions
  - summarizes the findings from the exploratory data analysis, including trends, correlations, distributions, and possible explanations for the observations.
  - qualifies the findings with limitations of the analysis in terms of investigating causation.
- Data Wrangling Phase > Identify Outliers
  - The Data Cleaning section of v1.0 mentioned that visual inspection of the scatter plot was used to determine the lower bounds of `mean_gdp` and `mean_cons` for the outliers. To support this statement, the initial scatter plot used to identify the outliers was added to the end of the Data Cleaning section for GDP vs. Energy Consumption.
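Separating outliers with visually chosen lower bounds can be sketched as follows. Only the column names `mean_gdp` and `mean_cons` come from this README; the helper `split_outliers`, the threshold values, the sample data, and the choice to flag a country when either bound is reached are assumptions, not the notebook's actual rule.

```python
import pandas as pd


def split_outliers(df, gdp_lower, cons_lower):
    """Return (kept, outliers) using lower bounds read off a scatter plot.

    Assumption: a row is an outlier when it reaches either bound.
    """
    is_outlier = (df["mean_gdp"] >= gdp_lower) | (df["mean_cons"] >= cons_lower)
    return df[~is_outlier], df[is_outlier]


# Made-up values for four hypothetical countries.
df_means = pd.DataFrame(
    {
        "mean_gdp": [1000.0, 5000.0, 60000.0, 8000.0],
        "mean_cons": [500.0, 1500.0, 9000.0, 12000.0],
    },
    index=["A", "B", "C", "D"],
)

kept, outliers = split_outliers(df_means, gdp_lower=40000.0, cons_lower=10000.0)
print(outliers)
```

Passing the bounds as parameters keeps the visually chosen values in one place, so they are easy to revisit if the scatter plot is regenerated.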
- Code Functionality > Warnings
  - When cleaning the datasets `df_energy_cons_v2` and `df_gdp_v2` to create `df_energy_cons_v3` and `df_gdp_v3`, `copy.copy()` was used to create copies of `df_energy_cons_v2` and `df_gdp_v2` instead of making changes to the original datasets.
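The effect of copying before cleaning can be illustrated with a small sketch. The dataframe contents below are made up; only the object names and the use of `copy.copy()` come from this README. The sketch assumes the goal was to keep the `_v2` dataframe unchanged while the `_v3` version is edited.

```python
import copy

import pandas as pd

# Made-up stand-in for the real df_energy_cons_v2.
df_energy_cons_v2 = pd.DataFrame(
    {"2000": [1.0, 2.0], "2001": [3.0, 4.0]},
    index=["A", "B"],
)

# Work on a copy so that edits do not reach the v2 dataframe.
df_energy_cons_v3 = copy.copy(df_energy_cons_v2)
df_energy_cons_v3.loc["A", "2000"] = 0.0

print(df_energy_cons_v2.loc["A", "2000"])  # the original value is untouched
```

The pandas method `DataFrame.copy()` is another common way to get an independent copy.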
- Code Functionality > Use of Functions
  - In the Exploratory Data Analysis section, right before the analysis for CO2 Emissions vs. Oil and Coal Consumption, a new sub-section, Custom Functions, was added. It introduces two functions that streamline the generation of basic plots such as bar plots and histograms.
    - `plot_bar()`
      - requires input parameters for the dataframe, the variables for the x and y axes, a list containing the labels for the title and the two axes, and the width of each bar
      - generates a bar plot based on the provided input parameters
      - used in CO2 Emissions vs. Oil and Coal Consumption for plotting the mean annual consumption of oil and coal for the top five countries and the mean annual emissions of CO2 for the top five contributors.
    - `plot_hist()`
      - requires input parameters for the variable, the number of bins, and a list containing the labels for the title and the two axes
      - generates a histogram based on the provided input parameters
      - used in GDP vs. Energy Consumption for plotting the distribution of each of the two variables, mean annual energy use per person and mean annual GDP/capita.
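The notebook's own implementations are not reproduced in this README, so the following is only a minimal sketch of two helpers that match the parameters described above. The parameter names, the label order (title, x-axis, y-axis), the returned `Axes` object, and the demo data are all assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_bar(df, x, y, labels, width):
    """Draw a bar plot of df[y] against df[x].

    labels is assumed to be [title, x-axis label, y-axis label].
    """
    fig, ax = plt.subplots()
    ax.bar(df[x], df[y], width=width)
    ax.set_title(labels[0])
    ax.set_xlabel(labels[1])
    ax.set_ylabel(labels[2])
    return ax


def plot_hist(values, bins, labels):
    """Draw a histogram of values with the given number of bins."""
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_title(labels[0])
    ax.set_xlabel(labels[1])
    ax.set_ylabel(labels[2])
    return ax


# Made-up demo data for three hypothetical countries.
demo = pd.DataFrame({"country": ["A", "B", "C"], "mean_oil": [3.0, 2.0, 1.0]})
ax_bar = plot_bar(demo, "country", "mean_oil", ["Mean oil use", "Country", "Amount"], 0.5)
ax_hist = plot_hist(demo["mean_oil"], 2, ["Distribution", "Amount", "Count"])
```

Returning the `Axes` lets a caller adjust ticks or add annotations after the basic plot is drawn; calling `plt.show()` displays the figures in an interactive session.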
- Exploration Phase > Single-variable Exploration
  - CO2 Emissions vs. Oil and Coal Consumption
    - Two new bar plots were added, each of which focuses on a single variable:
      - the mean annual consumption of oil and coal for the top five countries and
      - the mean annual emissions of CO2 for the top five contributors
    - Findings from these plots were used to compare the quantities observed across the five countries.
  - GDP vs. Energy Consumption
    - Two new histograms were added, each of which focuses on a single variable:
      - the mean annual energy use per person in each country and
      - the mean annual GDP/capita in each country
    - Findings from these plots were used to elaborate on the inference, drawn from the scatter plot, that the distribution of each variable must be right-skewed.
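A histogram shows right skew visually; a quick numeric check can complement it. The sketch below uses made-up values and is not taken from the notebook. It relies on the pandas method `Series.skew()` and on the rule of thumb that a right-skewed sample has a mean above its median.

```python
import pandas as pd

# Made-up sample with one large value pulling the tail to the right.
mean_gdp = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 20.0], name="mean_gdp")

skewness = mean_gdp.skew()  # sample skewness; positive means a longer right tail
is_right_skewed = skewness > 0 and mean_gdp.mean() > mean_gdp.median()

print(f"skewness={skewness:.2f}, right-skewed={is_right_skewed}")
```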
- Conclusion > Limitations
  - A Limitations sub-section was added to the Conclusions to discuss the limitations of the datasets and why, given these limitations, causality between two variables could not be established with certainty. Possible ways to address these limitations were also briefly mentioned.
Jong Min (Jay) Lee
- This project was completed as one of the mandatory requirements for the Data Analyst Nanodegree at Udacity.
- The variety of datasets in Gapminder World, covering topics across industries and sectors, provided the motivation for applying data analysis knowledge and technical skills to real datasets and drawing insights from them.
- Online resources, including discussions on Stack Overflow and the technical documentation for packages and methods, were referenced throughout the data wrangling and analysis processes.