Aggregate functions #6

fgregg · 2017-05-09T17:51:34Z

It would be great if census_area handled the aggregations of census variables correctly.

Prior art

patwater · 2018-11-21T16:53:16Z

@fgregg hey we're looking to develop this functionality at ARGO. The key need is to aggregate census statistics like median income correctly for our California water agency partners, whose service area boundaries don't align nicely with census boundaries. You know the story :)

We have a team of CUSP grad students looking to sprint on this from mid December to mid January and would love your thoughts. The plan is a simple fork for the sprint, followed by a PR assuming everything works nicely :)

fgregg · 2018-11-21T19:25:44Z

Sounds great!

I think I would start by following the Census's guidance on aggregating statistics https://www.census.gov/content/dam/Census/library/publications/2018/acs/acs_general_handbook_2018_ch08.pdf
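As a rough illustration of that guidance: the handbook chapter approximates the margin of error of a sum of estimates as the square root of the sum of the squared component margins of error. The sketch below applies only that basic rule. The function name `approximate_sum` and the sample numbers are made up for illustration, and any refinements the handbook describes for special cases are not handled.

```python
import math


def approximate_sum(estimates_with_moe):
    """Sum ACS estimates and approximate the margin of error of the total.

    `estimates_with_moe` is an iterable of (estimate, margin_of_error) pairs.
    """
    pairs = list(estimates_with_moe)
    total = sum(estimate for estimate, _ in pairs)
    # Root-sum-of-squares of the component margins of error.
    moe = math.sqrt(sum(margin ** 2 for _, margin in pairs))
    return total, moe


# Two hypothetical tract-level counts with their margins of error.
total, moe = approximate_sum([(300, 40), (400, 30)])
```

Note that this rule is for sums of counts. Medians and other derived statistics need different treatment, which is the point of this issue.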

It would be very, very nice to make use of the variance data that the census has started to make available (https://www.census.gov/programs-surveys/acs/data/variance-tables.html), but that's probably a phase II or phase III project.

I'd also recommend that you develop the aggregation code in separate files from the existing ones, as it may be nice, in the future, to pull the aggregation code into a separate library.

dmarulli · 2018-11-26T18:58:11Z

Hey @fgregg - I put together an initial project board for our team of students. I will be continuing to update that, but wanted to drop it in this thread for those interested.

I also wanted to run the actual technical approach by you all to increase the probability of things lining up nicely.

So right now it looks like there is a family of .geo_X() methods that can return geojson-like structures with statistics and geometries for lower level census geographies within higher level ones as well as for arbitrary geometries. (Though for sf3, the naming convention changes?)

One approach came to mind that would act pretty independently of the existing codebase, which would allow us to pull things into a separate library if that ends up feeling better. In this approach, one would create a new aggregator function that takes as inputs the statistic and geometry outputs of the .geo_X() methods along with the type of statistic to aggregate and the geometry to aggregate to--thinking is that this last piece would be necessary to properly downscale the statistics for the partial edge geometries.

So something like:


    def new_aggregator_function(
        list_of_dictionaries_with_statistic_and_geometry,
        type_of_statistic,
        geometry_to_aggregate_to,
    ):
        # Pseudocode: the two helper functions called below do not exist yet.
        areally_interpolated_statistics = check_for_edge_geometries_and_downscale_statistics(...)

        aggregated_statistic = aggregate(areally_interpolated_statistics, type_of_statistic)

        return aggregated_statistic

Any feedback there?

Lastly, on the Census Data API side of things, the table and attribute names do seem cryptic, e.g. B25034_010E. I found this reference, but it still feels pretty dense.

The human-readable table/attribute name --> code direction might be tough, but the other direction doesn't seem too far-fetched and it would really be great if these codes were parsable for type of statistic. This could be used to help prevent statistical gotchas like trying to aggregate a median like an average. Not sure if you all have thought about this bit. Maybe that's for down the road, though. Hopefully explicitly asking the user to provide the type of statistic is a reasonable enough solution for now.
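To make the "code --> something readable" direction concrete, here is a minimal sketch that splits a detailed-table code such as B25034_010E into its table ID, line number, and suffix (E for estimate, M for margin of error). The names `AcsVariable` and `parse_acs_variable` are hypothetical, and the pattern is an assumption that only covers detailed-table style codes. The code itself does not say whether a value is a count, a median, or a mean, so the type of statistic would still have to come from the user or from variable metadata.

```python
import re
from typing import NamedTuple


class AcsVariable(NamedTuple):
    table: str  # table ID, e.g. "B25034"
    line: int   # line within the table, e.g. 10
    kind: str   # "estimate" or "margin_of_error"


# Assumed shape: letter + 5 digits (+ optional letters), "_", 3 digits, E or M.
_PATTERN = re.compile(r"^([A-Z]\d{5}[A-Z]*)_(\d{3})([EM])$")
_SUFFIXES = {"E": "estimate", "M": "margin_of_error"}


def parse_acs_variable(code):
    """Split a detailed-table variable code into its parts."""
    match = _PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized ACS variable code: {code!r}")
    table, line, suffix = match.groups()
    return AcsVariable(table, int(line), _SUFFIXES[suffix])


parsed = parse_acs_variable("B25034_010E")
```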

cc: @patwater @christophertull

fgregg · 2018-11-26T19:53:59Z

Could you tell me a little bit more about what you mean by "necessary to properly downscale the statistics for the partial edge geometries"?
I think it's reasonable to have the user supply the type of aggregation in the first phase. There's a lot that could be done to infer what type of aggregation is appropriate, but that can wait.

fgregg · 2018-11-26T20:01:05Z

Do you mean that the desired shape can cut across census geographies, and you'll need to figure out what data to apportion?

dmarulli · 2018-11-26T20:07:01Z

Yep, that's all I meant by that. We see that with California water district boundaries for example.

fgregg · 2018-11-26T20:25:14Z

Okay, finding the intersections is a fairly expensive operation.

When we do it here:

census_area/census_area/core.py

Lines 62 to 63 in 5e62f7d

    
    if intersection.area/area_geo.area > 0.1:
        yield area

It would probably be a good idea to go ahead and return the proportion of the census tract that falls within the target geography, and stuff it into the statistics dictionary.

That coverage proportion is probably what you would be calculating with check_for_edge_geometries_and_downscale_statistics anyway.
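A minimal sketch of that idea, assuming geometry objects that expose `.area` and `.intersection()` the way the quoted core.py lines appear to. The `Rect` class is only a stand-in for real polygon geometries so the example runs on its own, and the function name `areas_with_coverage` and the `"coverage"` key are hypothetical. The 0.1 threshold is taken from the quoted lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle standing in for a real polygon geometry."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self):
        # Clamp at zero so an empty intersection has zero area.
        return max(self.xmax - self.xmin, 0.0) * max(self.ymax - self.ymin, 0.0)

    def intersection(self, other):
        return Rect(
            max(self.xmin, other.xmin),
            max(self.ymin, other.ymin),
            min(self.xmax, other.xmax),
            min(self.ymax, other.ymax),
        )


def areas_with_coverage(target, census_areas, threshold=0.1):
    """Yield statistics dictionaries with a coverage proportion added.

    `census_areas` is an iterable of (statistics_dict, geometry) pairs.
    """
    for statistics, geometry in census_areas:
        coverage = geometry.intersection(target).area / geometry.area
        if coverage > threshold:
            # Keep the weight next to the statistics for later aggregation.
            yield {**statistics, "coverage": coverage}


target = Rect(0, 0, 10, 10)
tracts = [
    ({"tract": "A", "population": 100}, Rect(0, 0, 5, 5)),      # fully inside
    ({"tract": "B", "population": 200}, Rect(8, 0, 12, 10)),    # half inside
    ({"tract": "C", "population": 50}, Rect(20, 20, 30, 30)),   # outside
]
covered = list(areas_with_coverage(target, tracts))
```

Computing the intersection once and carrying the proportion along avoids repeating the expensive geometry step during aggregation.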

If you did it that way, the aggregator would only need three inputs: a sequence of statistics, a sequence of weights, and the type of statistic.
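That three-input interface could look roughly like the sketch below. The function name `aggregate` and the `"count"` / `"mean"` labels are assumptions for illustration, not an existing census_area API. Apportioning counts by coverage proportion also assumes the population is spread evenly within each census area, which is a simplification.

```python
def aggregate(statistics, weights, kind):
    """Combine per-area statistics using per-area weights.

    kind="count": weights are coverage proportions; returns the apportioned total.
    kind="mean":  weights are the sizes of the groups the means describe;
                  returns the weighted mean.
    """
    statistics = list(statistics)
    weights = list(weights)
    if len(statistics) != len(weights):
        raise ValueError("statistics and weights must have the same length")

    weighted_total = sum(s * w for s, w in zip(statistics, weights))
    if kind == "count":
        return weighted_total
    if kind == "mean":
        total_weight = sum(weights)
        if total_weight == 0:
            raise ValueError("weights sum to zero")
        return weighted_total / total_weight
    # Medians and other quantiles cannot be combined by simple averaging,
    # so refuse rather than return a misleading number.
    raise ValueError(f"unsupported statistic type: {kind!r}")


# Populations of two tracts, one fully covered and one half covered.
population = aggregate([100, 200], [1.0, 0.5], "count")
# Mean values for two groups of 300 and 100 people.
mean_value = aggregate([10.0, 20.0], [300, 100], "mean")
```

Raising an error for unknown types is a deliberate choice here: it keeps a median from being silently averaged.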

dmarulli · 2018-11-26T20:42:44Z

Nice, thanks a lot Forest. I'll look into that.

fgregg · 2018-11-26T20:49:38Z

Weights are going to be important because, for example, sometimes you'll want to know the size of the associated population. Anyway, I think you have enough to move forward.

fgregg · 2018-12-14T19:29:27Z

@dmarulli, any updates on your project?

patwater · 2018-12-15T16:18:23Z

His student team has their kickoff call scheduled for this upcoming Friday 12/21 so probably not.

patwater · 2019-01-28T16:40:10Z

@fgregg FYI the functionality to calculate the areal interpolation is getting pretty close, though there is some outstanding refactoring to clean up the student code. See here for the latest: https://github.com/argo-marketplace/census_area/tree/dev_branch

Do you A) have any stylistic preferences on integration to note and B) have capacity to help with that integration (bit swamped on our end)? Thanks much!

fgregg · 2019-01-28T19:50:31Z

Hi @patwater, this looks like it's pretty far from ready to be brought in. There are some nice ideas in here, but:

- there are many extraneous files
- the interface is very different than the current library
- the code needs to be split out of the one giant method
- it's out of sync with master
- there are no tests

I'm sorry to hear that you don't have the bandwidth to work on the integration. Let me know when you do.

patwater · 2019-01-28T23:50:17Z

Yeah I hear you. Part of working with grad students early in their program... will keep you posted.

christophertull · 2019-10-09T23:59:56Z

Some interest in reviving this here (also I want my Hacktoberfest contributions ;).

@fgregg I see your reference to census-data-aggregator above. Would it make sense to use census_area to fetch the data for our census units of interest and then feed that into census-data-aggregator?

This was referenced Dec 21, 2018

Modify datamade code to return areal weights with output of .geo_X() family of methods argo-marketplace/census_area#3

Open

Figure out how to implement any other aggregations approaches technically argo-marketplace/census_area#5

Open

fgregg mentioned this issue Jul 2, 2019

Any interest with combining with areal aggregation datadesk/census-data-aggregator#10

Closed
