Running Sao Paulo pipeline #12

Open
LotteNotelaers opened this issue Sep 10, 2024 · 1 comment

Comments

@LotteNotelaers

Dear maintainers,

I am trying to run the Sao Paulo pipeline. I have two questions.

  1. When executing stage data.spatial.zones__{...}, I get the following error: "OverflowError: Python int too large to convert to C long"
    I found that the zone IDs (column AP_2010_CH in the spatial data files) are very large (e.g. 3550308005107). How did you resolve this error when you ran the Sao Paulo pipeline?
    I was thinking about renumbering the zone IDs, but I do not have a clear overview of which pipeline stages use them (e.g. to connect spatial and census data?), and I do not want to break these connections. Could you explain in which stages the zone IDs are used? Or do you know a better way to solve this issue?

  2. When looking at raw.py for the census data, I noticed that the following columns are selected from the original census data file: ['V0001', 'V0011', 'V0221', 'V0222', 'V0601', 'V6036', 'V0401', 'V1004', 'V0010', 'V0641', 'V0642', 'V0643', 'V0644', 'V0628', 'V6529', 'V0504']. Later in the code, these are renamed to ["federationCode", "areaCode", "householdWeight", "metropolitanRegion", "personNumber", "gender", "age", "goingToSchool", "employment", "onLeave", "helpsInWork", "farmWork", "householdIncome", "motorcycleAvailability", "carAvailability", "numberOfMembers"]. I looked up the meaning of these V-codes in the documentation accompanying the census data and noticed that the order of the V-codes differs from the order of the column names used for renaming. Is this correct? Or did I miss a processing step in the pipeline that makes sure each V-code receives the correct column name?
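For reference, the pairing implied by the two lists above can be printed with a short sketch. It assumes the rename is purely positional (first V-code gets the first name, and so on), which is not confirmed from raw.py. The helper name `positional_pairs` is hypothetical. Each printed pair can then be checked against the census documentation.

```python
# V-codes and target names exactly as quoted in this issue.
codes = ['V0001', 'V0011', 'V0221', 'V0222', 'V0601', 'V6036', 'V0401', 'V1004',
         'V0010', 'V0641', 'V0642', 'V0643', 'V0644', 'V0628', 'V6529', 'V0504']
names = ["federationCode", "areaCode", "householdWeight", "metropolitanRegion",
         "personNumber", "gender", "age", "goingToSchool", "employment", "onLeave",
         "helpsInWork", "farmWork", "householdIncome", "motorcycleAvailability",
         "carAvailability", "numberOfMembers"]


def positional_pairs(codes, names):
    """Pair each code with the name at the same index (assumed rename behaviour)."""
    if len(codes) != len(names):
        raise ValueError("code and name lists differ in length")
    return list(zip(codes, names))


pairs = positional_pairs(codes, names)
for code, name in pairs:
    # Compare each line with the census documentation entry for that V-code.
    print(f"{code} -> {name}")
```

If the pipeline instead renames through an explicit code-to-name mapping, the order of the two lists would not matter, so it is worth checking which of the two approaches raw.py actually uses.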

Could you help me with these two questions?

Kind regards,
Lotte

@balacmi
Contributor

balacmi commented Sep 12, 2024

Hi @LotteNotelaers, first I have to say that we had no errors when we last used this pipeline. Since then, both Python and the libraries have changed substantially, so something may have stopped working in the meantime.

  1. Zones are essential, so rather than renaming them, a better alternative is to use a different data type when loading them. What is the full error message you get? Maybe using int instead of np.int could help.

  2. Where did you observe the inconsistency, and in which variables? Which documentation are you reading?
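On point 1, this error typically appears when a Python integer is converted to a C long that is only 32 bits wide, as is the case on Windows. The sketch below uses only the standard library to check whether the example zone ID from this issue fits into the C long of the machine running it. The helper `fits_in_signed` is hypothetical and not part of the pipeline.

```python
import ctypes

# Example zone ID quoted in this issue (column AP_2010_CH).
ZONE_ID = 3550308005107


def fits_in_signed(value, bits):
    """Return True if value is representable as a signed integer of the given width."""
    limit = 1 << (bits - 1)
    return -limit <= value < limit


# Width of the platform's C long: 32 bits on Windows, usually 64 bits on Linux/macOS.
c_long_bits = 8 * ctypes.sizeof(ctypes.c_long)

print(f"C long is {c_long_bits} bits wide on this platform")
print(f"zone ID fits in C long: {fits_in_signed(ZONE_ID, c_long_bits)}")
print(f"zone ID fits in 32 bits: {fits_in_signed(ZONE_ID, 32)}")
print(f"zone ID fits in 64 bits: {fits_in_signed(ZONE_ID, 64)}")
```

If the C long turns out to be 32 bits wide, explicitly requesting a 64-bit integer type (for example NumPy's `np.int64`) when the zone IDs are loaded is one option. Where exactly in the pipeline that cast would go is not confirmed here.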
