post_search.iterate() iterator does not terminate cleanly #500

Open
chmreid opened this issue Dec 23, 2019 · 0 comments
chmreid commented Dec 23, 2019

I am seeing an issue when iterating over DSS search results with post_search.iterate(). I believe this is a corner case that occurs when the number of results returned is exactly the same as the page size; the error happens when the iterator tries to process the second page.

Here is the setup: I start by creating a DSS client, then write an Elasticsearch query that returns exactly 10 results (10 is the page size when metadata is included in the results). Here is the code to do that:

import hca.dss, json
client = hca.dss.DSSClient()

method = "*10x*"
organ = "liver"

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "files.library_preparation_protocol_json.library_construction_method.text": {
                            "value": method
                        }
                    }
                },
                {
                    "match": {
                        "files.specimen_from_organism_json.organ.text": organ
                    }
                }
            ]
        }
    }
}

executing query with post_search

Now if we execute this query with a call to post_search(), we can see that there are exactly 10 results returned:

search_results = client.post_search(
    es_query=query, replica='aws', output_format='raw')
print("post_search() found %d results"%(search_results['total_hits']))
print("post_search() returned %d results"%(len(search_results['results'])))

which results in

post_search() found 10 results
post_search() returned 10 results

executing query with post_search.iterate

Now if we want to iterate over all results returned by the query, we should use post_search.iterate() instead of post_search(). Swapping out the call:

results_generator = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')

for bundle in results_generator:
    print(f"Now processing bundle {bundle['bundle_fqid']}")

which results in the following exception:

Now processing bundle fd7a46db-1e90-4bfd-8e70-a77baa01faa5.2019-09-23T173116.106310Z
Now processing bundle fbda9910-5076-47a6-83d6-cfff39d17606.2019-09-26T051748.268160Z
Now processing bundle fb2ae8b7-06b0-4881-ad9f-1f37255b91b6.2019-09-23T173116.107225Z
Now processing bundle c65efd23-bbc4-459a-ac60-d3cde705193d.2019-09-23T173116.107641Z
Now processing bundle c59a8de8-d4f3-424b-b716-06b7152b980a.2019-09-23T173116.106782Z
Now processing bundle be9f2d04-77ee-4f59-a0f7-f0b58034cf8c.2019-09-23T173116.105576Z
Now processing bundle 82164816-64d4-4975-a248-b66c4fdad6f8.2019-09-26T054646.254919Z
Now processing bundle 56cce395-634e-4c53-976c-931727d22dfa.2019-09-26T074801.713933Z
Now processing bundle 3a7af639-ac18-49a7-aef9-2eb4b1ecf598.2019-09-26T072342.935554Z
Now processing bundle 2f62f508-6503-4c2e-a714-8298f55bdaa2.2019-09-26T064659.900169Z

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-5afaae8a16d1> in <module>
      1 results_generator = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')
      2 
----> 3 for bundle in results_generator:
      4     print(f"Now processing bundle {bundle['bundle_fqid']}")

~/codes/data-consumer-vignettes/vp/lib/python3.6/site-packages/hca/util/__init__.py in iterate(self, **kwargs)
    235                     yield file
    236             else:
--> 237                 for collection in page.json().get('collections'):
    238                     yield collection
    239 

TypeError: 'NoneType' object is not iterable
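
Until the iterator handles this case, one possible stopgap is to wrap it so that this specific TypeError is treated as the end of the results. This is only a sketch: `stop_on_empty_page` is a hypothetical helper, not part of hca, and `fake_iterate` is a stand-in that mimics the failure above so the example runs without a DSS connection. With the real client, the generator returned by client.post_search.iterate(...) would be passed in instead.

```python
def stop_on_empty_page(results):
    """Yield items from `results`, ending quietly on the NoneType TypeError.

    Hypothetical helper: it assumes the error only ever signals an empty
    trailing page, so it could also hide a genuine failure mid-stream.
    """
    iterator = iter(results)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            return
        except TypeError as err:
            # Only swallow the exact error from the traceback above.
            if "'NoneType' object is not iterable" in str(err):
                return
            raise
        yield item


def fake_iterate():
    """Stand-in for post_search.iterate(): 10 results, then the same failure."""
    for i in range(10):
        yield {"bundle_fqid": f"bundle-{i}"}
    for item in None:  # raises TypeError, like the empty second page
        yield item


bundles = list(stop_on_empty_page(fake_iterate()))
print(f"Processed {len(bundles)} bundles")
```

Only the `next()` call is wrapped, so a TypeError raised by the caller's own per-bundle code is not swallowed.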

If I modify the query to search for a different organ type, the number of results returned is no longer a multiple of 10, and the bug does not occur. The bug only occurs when the number of results is an exact multiple of the page size (here, exactly one full page of 10). In that case the iterator goes on to a second page that is completely empty, and iterate() does not handle it: as the traceback shows, page.json().get('collections') returns None, which the for loop then tries to iterate over.
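
A minimal sketch of how the loop could tolerate an empty trailing page, assuming that page decodes to a JSON object without the expected list key. `iterate_pages` and the page dictionaries are hypothetical stand-ins for illustration; this is not the code in hca/util/__init__.py.

```python
def iterate_pages(pages, key="results"):
    """Yield every item from a sequence of decoded page bodies.

    `pages` stands in for the JSON bodies of successive responses.
    A page may lack `key` entirely (assumed shape of the empty page).
    """
    for body in pages:
        # `or []` turns a missing key (None) into an empty list,
        # so an empty page simply contributes no items.
        for item in body.get(key) or []:
            yield item


full_page = {"results": [{"bundle_fqid": f"bundle-{i}"} for i in range(10)]}
empty_page = {}  # assumption: the second page carries no list at all

fqids = [b["bundle_fqid"] for b in iterate_pages([full_page, empty_page])]
print(f"Iterated over {len(fqids)} bundles without error")
```

The same guard applied to the `collections` branch in the traceback would let the generator finish normally when the last page is empty.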

chmreid added the BUG label Dec 23, 2019
chmreid changed the title post_search.iterate() iterator does not terminate cleanly post_search.iterate() iterator does not terminate cleanly Dec 23, 2019