-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add threshold logic to adjust flags for certain input sizes. #489
base: master
Are you sure you want to change the base?
Conversation
tneymanov
commented
Jun 11, 2019
- Extracts manually supplied flags from argv.
- Uses supplied args and input size estimations to check whether flags should be enabled or not.
- Extracts manually supplied flags from argv. - Uses supplied args and input size estimations to check whether flags should be enabled or not.
Pull Request Test Coverage Report for Build 1736
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Tural! I have added a few comments.
@@ -0,0 +1,157 @@ | |||
# Copyright 2019 Google Inc. All Rights Reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new copyright format?
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
"""Util class used to optimize default values for flags, based on provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A module doc string starts with one line summary. http://go/pystyle#382-modules
|
||
class Dimensions(object): | ||
"""Contains dimensions of the input data and the manually supplied args.""" | ||
def __init__(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add one empty line.
class Dimensions(object): | ||
"""Contains dimensions of the input data and the manually supplied args.""" | ||
def __init__(self, | ||
line_count=None, # type: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave two spaces between ,
and # type
class Threshold(Dimensions): | ||
"""Describes the limits the input needs to pass to enable a certain flag. | ||
|
||
Unlike Dimensions object, should not have supplied_args set and not all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the extra spaces in the beginning of this line and the followed one. https://engdoc.corp.google.com/eng/doc/devguide/py/style/index.md?cl=head#38-comments-and-docstrings
known_args.optimize_for_large_inputs = True | ||
if INFER_HEADERS_TS.soft_pass(input_dimensions, operator.le): | ||
known_args.infer_headers = True | ||
if NUM_BIGQUERY_WRITE_SHARDS_TS.soft_pass(input_dimensions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, To help decide the threshold, the 1000 genomes in the integration tests need to have this flag set, while the latest release doesn't need it. You can test on these dataset.
if INFER_ANNOTATION_TYPES_TS.soft_pass(input_dimensions, operator.le): | ||
known_args.infer_annotation_types = True | ||
if SHARD_VARIANTS_TS.soft_pass(input_dimensions, operator.le): | ||
known_args.shard_variants = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think shard variants is more related with the variants count in each file: if we have a lot of small files, which means there are lots of records as well, we don't need to do sharding in this case.
known_args.shard_variants = False | ||
|
||
def _optimize_pipeline_args(pipeline_args, known_args, input_dimensions): | ||
if NUM_WORKERS_TS.hard_pass(input_dimensions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases, I think setting the number of workers doesn't mean much since no matter what you set, Dataflow will kind of ignore this number and do autoscaling. But it do matter in the annotation pipeline since it means how many VMs we are going to bring up to run VEP.
Another one we can consider is the worker-machine-type, since the job may fail or too slow due to memory issues. I would recommend to try to optimize the machine-type using the input_dimensions.
file_count) | ||
self.flag_name = flag_name | ||
|
||
def not_supplied(self, state): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think replacing state with input_dimensions (you also used in some other places) is easier to follow.
class Threshold(Dimensions): | ||
"""Describes the limits the input needs to pass to enable a certain flag. | ||
|
||
Unlike Dimensions object, should not have supplied_args set and not all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overriding seems weird, especially it should not have supplied_args. How about having Dimensions
just describes the dimensions of the input data (by removing supplied_args)? In this way, you can also remove the assignments of these data to know_args (https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/vcf_to_bq.py#L271).
Moreover, this class describes the limit the input dimension needs to pass to enable the flag, I think we can move the checking of whether the corresponding flag is supplied or not to the outer layer rather than checking inside this class.