Merge SPSS files, both data frame and metadata #133
I have written some code that does this and I am happy to share it.
Hi @MDS-JAnthony, thanks for the update. Yes, that is my understanding, because after a left or right merge you will get only the left or right data.
I guess you are talking about the key column? That is one possibility. The other is to try to keep everything anyway. What about the case of an inner or outer join, where the metadata in the two dataframes differs? How do you reconcile that?
Hi @ofajardo, thank you for your comments. You are totally right, there are a lot of details. However, we could include some extra parameters, something like main_meta='left' (or 'right'). And what if, for the non-key columns that are repeated, we keep them under another name?
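The main_meta='left' idea above could be sketched roughly like this. Note that merge_label_dicts is a hypothetical helper, not part of pyreadstat; it only shows one way conflicting label dictionaries might be reconciled:

```python
def merge_label_dicts(left_labels, right_labels, main_meta="left"):
    """Combine two column-label dicts; the side named by main_meta
    wins on conflicting keys. Hypothetical sketch, not a pyreadstat API."""
    if main_meta == "left":
        merged = dict(right_labels)
        merged.update(left_labels)   # left wins on conflicts
    else:
        merged = dict(left_labels)
        merged.update(right_labels)  # right wins on conflicts
    return merged

# Example: 'a' exists in both dicts with different labels.
print(merge_label_dicts({"a": "Left label"}, {"a": "Right label", "b": "B"}))
# → {'a': 'Left label', 'b': 'B'}
```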
I think for non-key columns pandas renames them as column_x, column_y? If so, the same would have to happen here. You would need to check all possible cases AND write tests for every possible corner case. Otherwise, you could just share your code snippet for people to use as is. Then they can modify it if they need to, and there is no need to maintain it and deal with user issues in the future.
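For reference, this is indeed pandas' default behavior: overlapping non-key columns get "_x"/"_y" suffixes (configurable via the suffixes parameter of DataFrame.merge). A minimal illustration:

```python
import pandas as pd

# Both frames share the key "id" and the non-key column "score".
left = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
right = pd.DataFrame({"id": [1, 2], "score": [30, 40]})

merged = left.merge(right, on="id")
print(list(merged.columns))  # → ['id', 'score_x', 'score_y']
```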
I will be tidying up my code for this today/tomorrow, so I should be able to share an initial version of the code.
Here is my initial code. Hopefully people can modify it to their needs. Any questions, just ask.
I did this:

```python
import pandas as pd
import pyreadstat


def observe_metadata(metadata):
    print("==================================== ENTER: OBSERVE METADATA ====================================")
    # Use pprint to inspect the contents of the metadata object in more detail:
    # pprint.pprint(metadata.__dict__)
    print("Column (variable) names:")
    print(metadata.column_names)
    # print("Column (variable) names and labels:")
    # print(metadata.column_names_to_labels)
    print("Number of rows:", metadata.number_rows)
    print("Number of columns:", metadata.number_columns)
    print("File encoding:", metadata.file_encoding)
    print("==================================== OUT: OBSERVE METADATA ====================================")


# Paths to the PISA 2022 data set files:
file_path_stu = 'CY08MSP_STU_QQQ_Türkiye_and_filtered_variables.sav'
file_path_sch = 'CY08MSP_SCH_QQQ_Türkiye_and_filtered_variables.sav'
# Path for saving the merged data as a .sav file:
output_file_path = 'CY08MSP_STU_and_SCH_QQQ_Türkiye_and_filtered_variables_merged_4.sav'
on_column = 'CNTSCHID'

# Read only the metadata of the PISA 2022 files.
df_stu, meta_stu = pyreadstat.read_sav(
    file_path_stu, metadataonly=True, encoding="UTF-8")
df_sch, meta_sch = pyreadstat.read_sav(
    file_path_sch, metadataonly=True, encoding="UTF-8")
print("STUDENT ORIGINAL DATA:")
observe_metadata(meta_stu)
print("SCHOOL ORIGINAL DATA:")
observe_metadata(meta_sch)

# Find common columns based on both column names and labels.
common_columns = []
for column_name, label in meta_stu.column_names_to_labels.items():
    if column_name in meta_sch.column_names_to_labels and meta_sch.column_names_to_labels[column_name] == label:
        common_columns.append(column_name)
print("Common columns:", common_columns)
print("Number of common columns:", len(common_columns))
# Keep the merge key; only the remaining common columns are dropped from the school file.
common_columns.remove(on_column)
print("Common columns:", common_columns)
print("Number of common columns:", len(common_columns))
selected_columns = list(meta_sch.column_names_to_labels.keys())
for common_column in common_columns:
    selected_columns.remove(common_column)

# Load the two datasets.
student_data, student_meta = pyreadstat.read_sav(
    file_path_stu, encoding="UTF-8")
school_data, school_meta = pyreadstat.read_sav(
    file_path_sch, usecols=selected_columns, encoding="UTF-8")

# Merge the datasets.
merged_data = student_data.merge(school_data, on=on_column, how="outer")

# Build metadata for merged_data from the student metadata plus the school labels.
# Note: this aliases student_meta; its label dicts are modified in place.
merged_meta = student_meta
for variable in school_meta.column_names_to_labels:
    merged_meta.column_names_to_labels[variable] = school_meta.column_names_to_labels[variable]
for variable in school_meta.variable_value_labels:
    merged_meta.variable_value_labels[variable] = school_meta.variable_value_labels[variable]
print(merged_meta.column_names_to_labels)

# Save the merged dataset with metadata.
pyreadstat.write_sav(merged_data, output_file_path,
                     column_labels=merged_meta.column_names_to_labels,
                     variable_value_labels=merged_meta.variable_value_labels)

df_output, meta_output = pyreadstat.read_sav(
    output_file_path, metadataonly=True, encoding="UTF-8")

print("==================================== ENTER: CHECK THE PROCESS ====================================")
calculated_col_count = (meta_stu.number_columns +
                        meta_sch.number_columns - len(common_columns) - 1)
print(
    f"Calculated column count: stu_col_count + sch_col_count - common_col_count - 1 = {meta_stu.number_columns} + {meta_sch.number_columns} - {len(common_columns)} - 1 = {calculated_col_count}")
print("Column count in the output data:", meta_output.number_columns)
if calculated_col_count == meta_output.number_columns:
    print(f"Variable count check: success. {calculated_col_count} == {meta_output.number_columns}")
else:
    print(f"Variable count check: failure. {calculated_col_count} != {meta_output.number_columns}")
print(f"The filtered data was saved to {output_file_path}.")
print("==================================== OUT: CHECK THE PROCESS ====================================")
print("OUTPUT DATA:")
observe_metadata(meta_output)
```

Result:

```
STUDENT ORIGINAL DATA:
==================================== ENTER: OBSERVE METADATA ====================================
Column (variable) names:
['CNT', 'CNTRYID', 'CNTSCHID', 'ESCS', 'MEAN_PVMATH']
Number of rows: 7250
Number of columns: 5
File encoding: UTF-8
==================================== OUT: OBSERVE METADATA ====================================
SCHOOL ORIGINAL DATA:
==================================== ENTER: OBSERVE METADATA ====================================
Column (variable) names:
['CNT', 'CNTRYID', 'CNTSCHID', 'MEAN_ESCS']
Number of rows: 196
Number of columns: 4
File encoding: UTF-8
==================================== OUT: OBSERVE METADATA ====================================
Common columns: ['CNT', 'CNTRYID', 'CNTSCHID']
Number of common columns: 3
Common columns: ['CNT', 'CNTRYID']
Number of common columns: 2
{'CNT': 'Country code 3-character', 'CNTRYID': 'Country Identifier', 'CNTSCHID': 'Intl. School ID', 'ESCS': 'Index of economic, social and cultural status', 'MEAN_PVMATH': 'Mean of the\xa0Plausible Values in Mathematics', 'MEAN_ESCS': 'Mean index of economic, social and cultural status (ESCS) of schools by Intl. School ID (CNTSCHID)'}
==================================== ENTER: CHECK THE PROCESS ====================================
Calculated column count: stu_col_count + sch_col_count - common_col_count - 1 = 5 + 4 - 2 - 1 = 6
Column count in the output data: 6
Variable count check: success. 6 == 6
The filtered data was saved to CY08MSP_STU_and_SCH_QQQ_Türkiye_and_filtered_variables_merged_4.sav.
==================================== OUT: CHECK THE PROCESS ====================================
OUTPUT DATA:
==================================== ENTER: OBSERVE METADATA ====================================
Column (variable) names:
['CNT', 'CNTRYID', 'CNTSCHID', 'ESCS', 'MEAN_PVMATH', 'MEAN_ESCS']
Number of rows: 7250
Number of columns: 6
File encoding: UTF-8
==================================== OUT: OBSERVE METADATA ====================================
```
Dataframes can be easily merged using pd.merge(); merging the metadata, on the other hand, is a pain. It would be great if it were possible to have a pyreadstat.merge_sav() method with the same parameters as pd.merge, plus metadata for left and right.
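The core of such a helper could be sketched as below. merge_with_labels is a hypothetical function, not a pyreadstat API; it works on dataframes plus plain column-label dicts (as found in metadata.column_names_to_labels), letting the left labels win on conflicts, and its result could then be passed to pyreadstat.write_sav via the column_labels argument:

```python
import pandas as pd


def merge_with_labels(left, left_labels, right, right_labels, **merge_kwargs):
    """Merge two dataframes and combine their column-label dicts.

    Hypothetical sketch: labels for columns absent from the merged
    frame are dropped; on conflicting keys the left labels win.
    """
    merged = left.merge(right, **merge_kwargs)
    labels = {**right_labels, **left_labels}  # left wins on conflicts
    labels = {col: labels[col] for col in merged.columns if col in labels}
    return merged, labels


# Usage with toy data shaped like the PISA example above:
left = pd.DataFrame({"CNTSCHID": [1], "ESCS": [0.5]})
right = pd.DataFrame({"CNTSCHID": [1], "MEAN_ESCS": [0.4]})
merged, labels = merge_with_labels(
    left, {"CNTSCHID": "Intl. School ID", "ESCS": "Index"},
    right, {"CNTSCHID": "School ID", "MEAN_ESCS": "Mean index"},
    on="CNTSCHID", how="outer")
print(list(merged.columns))  # → ['CNTSCHID', 'ESCS', 'MEAN_ESCS']
print(labels["CNTSCHID"])    # → Intl. School ID
```

The merged frame and labels could then be written out with pyreadstat.write_sav(merged, path, column_labels=labels); value labels would need the same treatment.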