BirthratePrediction

🖥️ Project Introduction

The purpose of creating a birth rate prediction model is to analyze and identify the factors that most significantly influence birth rates.

Proposing Solutions from a Socio-Economic Standpoint Based on the Identified Causes.

⚙️ Environment Setting

Python 3
Interpreter : Jupyter Notebook

🧑‍🤝‍🧑 Members

Team member 1 : 19102085 Park Sehong
Team member 2 : 19102088 Park Jinsoo
Team member 3 : 21102044 Oh Seyeon

📌 Process

Sources - links

Period : 1990 ~ 2021
15 countries(OECD)

Country Code	Country Name
AUT	Austria
CAN	Canada
CHE	Switzerland
DEU	Germany
ESP	Spain
FIN	Finland
FRA	France
GRC	Greece
HUN	Hungary
ITA	Italy
JPN	Japan
KOR	Korea
LUX	Luxembourg
POL	Poland
PRT	Portugal

Libraries Used

pandas: Utilized for data manipulation and analysis.
numpy: Used for numerical operations and calculations.
matplotlib: Employed for creating visualizations, including scatter plots.
scikit-learn (sklearn): Applied for machine learning tasks, such as model training, evaluation, and hyperparameter tuning.
xgboost: Integrated for implementing XGBoost regression models.
eli5: Used for permutation importance analysis.
shap: Employed for interpreting SHAP values and creating summary plots.

Data Merge

Collecting data
Convert collected data to the same format
Merge data based on Country ID

data.describe()

5 Examples

	FertilityRate	AvgHoursWorked	BothWorking	FirstBirthAge	MarriageAge
conunt	480.000000	453.000000	191.00000	455.000000	174.000000
mean	1.466667	1714.458318	53.27801	29.860879	28.442529
std	0.199506	165.567786	8.97173	1.456630	2.434869
min	0.810000	1319.131000	33.90000	25.600000	22.200000
25%	1.330000	1592.659000	46.70000	29.000000	26.600000
50%	1.420000	1716.341000	52.00000	29.900000	28.800000
75%	1.570000	1811.852000	61.90000	30.900000	30.475000
max	2.020000	2228.000000	71.00000	33.400000	32.900000

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 28 columns):

#	Column	Non-Null Count	Dtype
0	ID	480	object
1	Year	480	int64
2	Country Name	480	object
3	FemaleLaborParticipationRate	480	float64
4	AvgHoursWorked	453	float64
5	BothWorking	191	float64
6	FirstBirthAge	455	float64
7	MarriageAge	174	float64
8	MarriageRate	406	float64
9	EmploymentRate	465	float64
10	UnemploymentRate	434	float64
11	HousingPrice	414	float64
12	InterestRate	432	float64
13	PartTimeRate	462	float64
14	FamilyExpenditure	445	float64
15	HealthExpenditure	318	float64
16	LaborMarketExpenditure	415	float64
17	UnemploymentExpenditure	446	float64
18	GDI	470	float64
19	GDP	479	float64
20	GNI	455	float64
21	PovertyGap	308	float64
22	EduExpenditureOfGDP	415	float64
23	EduExpenditureOfGov	370	float64
24	TotalLaborParticipationRate	480	float64
25	InflationRate	480	float64
26	Population	480	float64
27	FertilityRate	480	float64

dtypes: float64(25), int64(1), object(2)
memory usage: 105.1+ KB
📁merged_data.xlsx

Data Preprocessing - Code

Initial Analysis(calculate the percentage of missing values for each feature)
Handling Missing Values

Removing Columns with more than 20% Missing Values

columns_to_drop = data.columns[data.isnull().mean() > 0.2]
data = data.drop(columns_to_drop, axis=1)

Method: Dropped columns with more than 20% missing values.
Reason: Columns with a high percentage of missing values may not provide meaningful insights.

Imputing Missing Values by Country

# Interpolated missing values for each column based on time, considering the direction of NaNs
for dataset in country_data_list:
    dataset['Year'] = pd.to_datetime(dataset['Year'], format='%Y')
    dataset.set_index('Year', inplace=True)
    
    for column in dataset.columns:
        dataset[column] = interpolate_with_direction(dataset[column])

Method: Interpolated missing values based on time, considering the direction of NaNs for each column.
Reason: Time-based interpolation is suitable for sequential data, and considering the direction of NaNs helps maintain the trend.

Creating Country-Specific Datasets
Automated Handling of Missing Values for Each Country
Merging Country-Specific Datasets and Visualization

Before

After

📁real_merged_data.xlsx

Feature Extraction - Code

Create new features from existing features
- Work Leisure Balance Index = 'AvgHoursWorked' / 'TotalLaborParticipationRate'
  It reflects the interplay between work intensity and leisure time in the labor market. A high index suggests a scenario with prolonged working hours and reduced participation, indicating an environment of intense work and limited leisure. This condition may contribute to lower birth rates. In contrast, a low index points to shorter working hours and higher participation, indicating a balanced environment with ample leisure time. Such conditions may foster a more favorable atmosphere for higher birth rates.
- Labor Market Stability = ‘EmploymentRate’ / ‘UnemploymentRate’
  It is served as a crucial indicator for evaluating the health and stability of the labor market. A higher ratio signifies a stable labor market characterized by high employment and low unemployment rates. Conversely, a lower ratio indicates an unstable labor market with lower employment and higher unemployment rates.
- Hosing Affordability Index = ‘HousingPrice’ / ‘PerCapitaGDP’
  It offers comprehensive measure of the economic feasibility of homeownership relative to individual or household income levels. A lower index value suggests that purchasing a house is more affordable, potentially alleviating financial stress on families and encouraging higher birth rates. Conversely, a higher index value indicates that buying a house is financially burdensome in relation to personal income, possibly leading to increased financial strain and acting as a deterrent to family expansion.
Deletion of Unnecessary Features
- PerCapitaGDP = ‘GDP’ / ‘Population’
- Population
- DivorceRate

📁after_extraction_data.xlsx

Preliminary Experiment - Code

Modeling with XGBoost regression to predict 'BirthRate'
Evaluates the model's performance

Correlation Analysis_RFECV - Code

Correlation Analysis using heatmap

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm',
            xticklabels=corr_matrix.columns,
            yticklabels=corr_matrix.columns)
plt.title('Correlation Matrix')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

Purpose: Examine pairwise feature correlations.
Methodology: Heatmap using Seaborn.
Result:

Feature Selection utilizing Recursive Feature Elimination with Cross-Validation (RFECV) using a RandomForestRegressor and XGBRegressor
Ranks features based on their importance
Removes less important features identified by both RandomForestRegressor and XGBRegressor

# Model selection
models = {
    'RandomForestRegressor': RandomForestRegressor(random_state=0),
    'XGBRegressor': XGBRegressor(random_state=0),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0)
}

# Feature ranking
feature_rankings = pd.DataFrame()

for model_name, model in models.items():
    rfecv = RFECV(estimator=model, step=1, cv=5, scoring='neg_mean_squared_error')
    rfecv.fit(X, y)
    
    # Create a ranking dataframe
    ranking_df = pd.DataFrame({
        'Feature': X.columns,
        f'Ranking_{model_name}': rfecv.ranking_
    })
    
    # Merge rankings
    if feature_rankings.empty:
        feature_rankings = ranking_df
    else:
        feature_rankings = feature_rankings.merge(ranking_df, on='Feature')

# Sort and display rankings
feature_rankings = feature_rankings.sort_values(by='Ranking_RandomForestRegressor')
feature_rankings

Purpose: Identify vital features for predicting FertilityRate.
Methodology: RFECV applied using RandomForest, XGBoost, and DecisionTree models.
Result:

Feature	RandomForest	XGBRegressor	DecisionTree
FemaleLaborParticipationRate	1	1	1
WorkLeisureBalanceIndex	1	4	8
TotalLaborParticipationRate	1	1	4
EduExpenditureOfGDP	1	1	1
GDP	1	1	1
GDI	1	1	1
UnemploymentExpenditure	1	1	1
LaborMarketStability	1	3	6
FamilyExpenditure	1	1	1
InterestRate	1	7	2
HousingPrice	1	1	1
UnemploymentRate	1	1	1
EmploymentRate	1	1	1
FirstBirthAge	1	1	1
AvgHoursWorked	1	1	1
HousingAffordabilityIndex	1	1	5
MarriageRate	2	2	9
InflationRate	3	8	1
GNI	4	6	7
PartTimeRate	5	5	3

📁Final_Data.xlsx

Final Data Analysis - Code

Train/Valid/Test Dataset

Train : 1990-2025
Valid : 2016-2028
Test : 2019-2021

Model Training

RandomForestRegressor

a machine learning model that uses an ensemble of decision trees to make predictions, particularly useful for regression tasks. It's known for its high accuracy and ability to handle large datasets with numerous input variables.
Hyperparameter
- 'n_estimators': The number of trees in the forest
- 'max_depth': The maximum depth of each tree
- 'min_samples_split': The minimum number of samples required to split an internal node
- 'min_samples_leaf': The minimum number of samples required to be at a leaf node
Best Hyperparameter
- Check all combinations to determine best hyperparameters based on MSE
- 'n_estimators': 100
- 'max_depth': 20
- 'min_samples_split': 2
- 'min_samples_leaf': 1
Performance
- MSE : 0.028746
- R2 : 0.276193
- MAE : 0.121922

XGBoostRegressor

This model uses a gradient boosting method that combines multiple decision trees to make predictions, and is known for its high performance and fast processing speed. XGBoostRegressor provides high accuracy even in complex datasets, and can optimize model performance by adjusting various hyperparameters.
Hyperparameter
- 'n_estimators': The number of boosting stages to be run
- 'max_depth': The maximum depth of a tree
- 'learning_rate': The step size shrinkage used in updating tree weights
Best Hyperparameter
- 'n_estimators': 300
- 'max_depth': 3
- 'learning_rate': 0.1
Performance
- MSE : 0.012142
- R2 : 0.694262
- MAE : 0.089414

DecisionTreeRegressor

a machine learning model used for regression tasks, which predicts continuous output values. It builds a tree-like model of decisions, where each node represents a feature, and branches represent decision rules, ultimately leading to a prediction at the leaves. This model is known for its simplicity, interpretability, and ability to capture non-linear relationships in data.
Hyperparameter
- 'max_depth': The maximum depth of each tree
- 'min_samples_split': The minimum number of samples required to split an internal node
- 'min_samples_leaf': The minimum number of samples required to be at a leaf node
Best Hyperparameter
- 'max_depth': 10
- 'min_samples_split': 2
- 'min_samples_leaf': 4

Model Selection

XGBoostRegressor : Based on the results, the XGBoostRegressor is the best choice among the three models. It achieves the lowest mean squared error (MSE) on the test set (0.012142) and the highest R² score (0.694262).

Compare performance

RandomForest Regressor

XGBoost Regressor

DecisionTree Regressor

Results

SHAP(XGBoost Regressor)

Why we used "SHAP"
- We used SHAP to determine which characteristics are most influential on fertility. By calculating the impact of each trait on the forecast, we can understand the importance of the traits and use this to inform policy discussions around traits.
- SHAP (SHapley Additive exPlanations) is a tool that helps explain model predictions and understand how much each trait contributes to the prediction. SHAP values evaluate the contribution of any combination of traits, so it can be computationally expensive for large datasets or complex models, but we felt that we were able to account for relatively all of them, and it helped us understand the importance of traits.

The results are not surprising, but they are generally understood. The most influential is the age at first birth, with higher ages having a negative effect on fertility. The data below can be interpreted in a similar way to the age at first birth.

Limitations

The time-based interpolation method was used to fill in the missing values of the data taken from OECD data, but the accuracy of the data and the reliability of the predicted values are poor. To secure this, it seems that a more accurate data set should be obtained during data collection and a more accurate interpolation method should be used during data preprocessing. In addition, there are too many factors that contribute to Korea's declining fertility rate. It is doubtful that the specific characteristics of the 15-country dataset can be applied to Korea and lead to a positive change in the fertility rate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BirthratePrediction

🖥️ Project Introduction

⚙️ Environment Setting

🧑‍🤝‍🧑 Members

📌 Process

Sources - links

Libraries Used

Data Merge

Data Preprocessing - Code

Before

After

Feature Extraction - Code

Preliminary Experiment - Code

Correlation Analysis_RFECV - Code

Final Data Analysis - Code

Train/Valid/Test Dataset

Model Training

Model Selection

Compare performance

RandomForest Regressor

XGBoost Regressor

DecisionTree Regressor

Results

SHAP(XGBoost Regressor)

Limitations

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 94 Commits
Code		Code
Data		Data
README.md		README.md

jinny908/BirthratePrediction

Folders and files

Latest commit

History

Repository files navigation

BirthratePrediction

🖥️ Project Introduction

⚙️ Environment Setting

🧑‍🤝‍🧑 Members

📌 Process

Sources - links

Libraries Used

Data Merge

Data Preprocessing - Code

Before

After

Feature Extraction - Code

Preliminary Experiment - Code

Correlation Analysis_RFECV - Code

Final Data Analysis - Code

Train/Valid/Test Dataset

Model Training

Model Selection

Compare performance

RandomForest Regressor

XGBoost Regressor

DecisionTree Regressor

Results

SHAP(XGBoost Regressor)

Limitations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages