Running Sao Paulo pipeline #12

LotteNotelaers · 2024-09-10T17:25:44Z

Dear,

I am trying to run the Sao Paulo pipeline. I have two questions.

When executing stage data.spatial.zones__{...}, I get the following error: "OverflowError: Python int too large to convert to C long"
I found out that the zone_id numbers (column AP_2010_CH in the spatial datafiles) are very big (e.g. 3550308005107). How did you resolve this error when you were running the Sao Paulo pipeline?
I was thinking about renaming the zone_ids, but I do not have a clear overview of the pipeline stages in which these zone numbers are used (e.g. to connect spatial and census data?). I do not want to break these connections in the pipeline. Could you explain in which stages these zone_ids are used? Or do you know a better way to solve this issue?
When looking at raw.py for the census data, I noticed that these columns are selected from the original census datafile: ['V0001', 'V0011', 'V0221', 'V0222', 'V0601', 'V6036', 'V0401', 'V1004', 'V0010', 'V0641', 'V0642', 'V0643', 'V0644', 'V0628', 'V6529', 'V0504']. Later in the code they are renamed to ["federationCode", "areaCode", "householdWeight", "metropolitanRegion", "personNumber", "gender", "age", "goingToSchool", "employment", "onLeave", "helpsInWork", "farmWork", "householdIncome", "motorcycleAvailability", "carAvailability", "numberOfMembers"]. I looked up the meaning of these codes ("V...") in the documentation accompanying the census data and noticed that the order of the V-codes does not match the order of the column names used for renaming. Is this correct? Or did I miss a processing step in the pipeline that makes sure each V-code gets the correct column name?
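To make the comparison concrete, a small sketch can print the pairing that a purely positional rename of the two lists above would produce, so that each pair can be checked against the census documentation. It assumes the names are assigned by position, which has not been confirmed for the pipeline; `codes`, `names` and `positional_pairs` are illustrative variables, not names taken from raw.py.

```python
# Lists copied from the question; the variable names are illustrative only.
codes = ['V0001', 'V0011', 'V0221', 'V0222', 'V0601', 'V6036', 'V0401', 'V1004',
         'V0010', 'V0641', 'V0642', 'V0643', 'V0644', 'V0628', 'V6529', 'V0504']
names = ["federationCode", "areaCode", "householdWeight", "metropolitanRegion",
         "personNumber", "gender", "age", "goingToSchool", "employment", "onLeave",
         "helpsInWork", "farmWork", "householdIncome", "motorcycleAvailability",
         "carAvailability", "numberOfMembers"]

# Assumption: the rename is positional, i.e. the i-th code receives the i-th name.
assert len(codes) == len(names)
positional_pairs = list(zip(codes, names))

# Print each pair so it can be compared with the census documentation by hand.
for code, name in positional_pairs:
    print(f"{code} -> {name}")
```

If any pair disagrees with the documentation, passing an explicit code-to-name dictionary to pandas' `DataFrame.rename(columns=...)` would make the correspondence independent of list order.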

Could you help me with these two questions?

Kind regards,
Lotte

balacmi · 2024-09-12T13:34:02Z

Hi @LotteNotelaers, first I have to say that we had no errors when we last used this pipeline. Since then, both Python and the libraries have changed substantially, so something may have stopped working in the meantime.

The zone IDs are essential to the pipeline, so rather than renaming them, a better option is to load them with a different data type. What is the exact error message you get? Maybe using int instead of np.int could help.
Regarding the column names: for which variables did you observe the inconsistency, and which documentation are you reading?
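For the overflow in the first question, a minimal sketch of the "different type" idea is shown here. It rests on assumptions: the cause has not been confirmed in this thread, but this error typically appears when the platform's default integer is 32 bits wide (for example on Windows with NumPy versions before 2.0), and `np.int` itself was only an alias for the built-in `int` and was removed in NumPy 1.24. The `zones` DataFrame and the new `zone_id` column are hypothetical stand-ins, not the pipeline's actual code; only `AP_2010_CH` and the example value come from the question.

```python
import numpy as np
import pandas as pd

zone_id = 3550308005107  # example value quoted in the question

# The value is too large for a 32-bit signed integer but fits in 64 bits.
exceeds_int32 = zone_id > np.iinfo(np.int32).max
fits_int64 = zone_id <= np.iinfo(np.int64).max
print(exceeds_int32, fits_int64)

# Hypothetical stand-in for the zones table, with the identifier read as text.
zones = pd.DataFrame({"AP_2010_CH": [str(zone_id)]})

# Request a 64-bit integer explicitly instead of relying on the platform default.
zones["zone_id"] = zones["AP_2010_CH"].astype(np.int64)
print(zones["zone_id"].dtype)
```

Keeping the identifiers unchanged and only widening the type avoids having to trace every stage that joins spatial and census data on these IDs.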
