Network parameter not being read #696
Hi there, thanks for the note. A few code changes have been made recently. The first one you should check out is here. There are a couple of changes; one is using region instead of zones. If you want to specify your network and subnet, follow these instructions (which we will push to the main branch soon): docker run gcr.io/cloud-lifesciences/gcp-variant-transforms. Please let us know if these updates work. Thanks! |
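[Editor's note] As a hedged illustration of the zones-to-region change mentioned above (flag names and values here are assumptions based on this thread, not verified against the current release):

```bash
# Hypothetical sketch -- flag names/values are assumptions from the comment above.
# Before (zone-based):
#   --zones us-central1-b
# After (region-based):
#   --region us-central1
```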
I tried using this script:
I got:
Any suggestions for what to try next? NOTE: I updated to: |
I tried eliminating the
|
You listed: |
I tried:
Does the "network resource" mean the source or destination? |
Are you using a custom network or the default network? If custom VPC are you using automode or custom-mode subnets? |
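[Editor's note] For anyone unsure how to answer this, one way to check is with the gcloud CLI (the network name below is a placeholder; output details may vary by gcloud version):

```bash
# List VPC networks; the SUBNET_MODE column shows AUTO vs CUSTOM.
gcloud compute networks list

# Inspect one network; autoCreateSubnetworks: true indicates an auto-mode VPC.
gcloud compute networks describe my-custom-network --format="yaml(autoCreateSubnetworks)"
```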
I'm in an AI notebook. |
Oh, the network. Hold on... |
I tried:
|
Try adding this to the $COMMAND (we're updating some documentation here in GitHub but haven't been able to do a release yet): --network $NETWORK |
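[Editor's note] For readers following along, a minimal sketch of what that might look like (the values are placeholders, and whether the current release actually accepts --network is exactly what this issue is about):

```bash
# Hypothetical sketch -- placeholder values; --network support is the open question here.
NETWORK=my-custom-network
COMMAND="${COMMAND} --network ${NETWORK}"
```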
Yeah, I think it's not ready for prime time. It doesn't like the network option.
|
Hmm... The last time I ran this (in the last 2 weeks), this is what I used for both a custom VPC and a custom subnet. If you have a custom VPC (i.e., not the default network named "default"), you need to pass the network name. If you created a custom VPC but use auto-mode subnets, you don't need to pass the --subnet option.
Parameters to replace: GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
GOOGLE_CLOUD_PROJECT=PROJECT docker run -v ~/.config:/root/.config |
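[Editor's note] The docker command in that comment is truncated. As a hedged reconstruction only (values and some flag names are assumptions pieced together from this thread, not a verified invocation), the overall shape is roughly:

```bash
# Hedged sketch, not a verified command -- values and some flag names are assumptions.
GOOGLE_CLOUD_PROJECT=my-project     # project containing the BigQuery dataset
SUBNETWORK=my-subnet                # only needed for custom-mode subnets
NETWORK=my-custom-network           # only needed for a non-default VPC

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --temp_location gs://my-bucket/temp \
  --subnetwork "${SUBNETWORK}" \
  --network "${NETWORK}" \
  "${COMMAND}"
# Note: --network is the flag this issue reports is not yet parsed by pipelines_runner.sh.
```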
@rcowin-gcp It looks like the command parser in the code is missing the network argument: https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docker/pipelines_runner.sh#L25
And the subsequent case expression below that does not assign the network either if one is specified. It was probably an assumption in the coding that a network is not required if a subnet is assigned, but based on other documentation the opposite is the case. That logic should probably be coded explicitly so there's no guesswork about the expectations. Hope it helps, |
You only need to specify the Network and Subnetwork if you are using a non-default network and/or custom-mode subnets. |
Thank you, I read the REST spec below -- I was just providing a recommendation on where the code can be updated with a simple fix: https://lifesciences.googleapis.com/$discovery/rest?version=v2beta |
Thank you. We're working on a handful of small updates to variant transforms and the documentation both, stay tuned. :) |
Howdy. I'm getting precisely the same problem. Any updates on when this will be fixed? It's a complete blocker, sigh... |
@slagelwa -- you had some insight into where the code needed to be fixed. Perhaps if you posted it, someone could write the patch. |
Is there any reason to think the transform will work if run from github per https://github.com/googlegenomics/gcp-variant-transforms#running-from-github? |
@abalter I think @slagelwa means that the command parser in the code is missing the network argument; see https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docker/pipelines_runner.sh#L25
|
@ypouliot running from GitHub will not provide different results, as the parser is still the same and doesn't include network. Correct that it does not parse for network, just subnetwork. Since it's not currently possible to run across multiple regions, the only argument it takes is the subnetwork, so you provide the subnet that corresponds to the region in question. Regarding the other issue about the format of the subnet: you just use the name of the subnet, not the full path with project or regions. The docs have been updated to clarify that. |
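[Editor's note] In other words (values below are placeholders), the subnetwork flag expects the short name rather than the full resource path:

```bash
# Placeholder values; per the comment above, pass only the subnet's short name.
--subnetwork my-subnet

# Not the full resource path:
#   projects/my-project/regions/us-central1/subnetworks/my-subnet
```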
@moschetti You mean not possible because it's not implemented in the Life Sciences API, or because GCP would not allow it? The Life Sciences API below seems to have an option for it (and GCP/Google Storage allows multi-region configurations), and it was also suggested by @rcowin-gcp:
Thanks, |
@pgrosu Dataflow does not support splitting workers across different regions. Life Sciences API does, but in the case of Variant Transforms Life Sciences API is only starting the first VM that kicks off the Dataflow job. So since the Dataflow workers will all run in a single region we can infer the network based on the single subnetwork that was provided. However, if that isn't working for folks, we'd be curious to hear the use case to better understand how it's being used and if this is something to look into. |
@moschetti Of course you can have multi-region Dataflow via standard functional programming methodologies when the flow is data-driven and coordinated through global Cloud Logging: one Dataflow job in one region can launch another set of Dataflow jobs in other regions, given the geo-location information about the data. This way you have a distributed global flow where the code moves adaptively to the geo-location of the distributed data, guided by the global log, which saves on costs. |
@pgrosu valid point. Variant Transforms, however, is not set up to start Dataflow jobs in multiple regions, so I'm not sure the network flag gains anything in this use case. If the lack of the network flag is blocking something, I'd be glad to hear the concern to see if it's something we need to address. But I believe that for Variant Transforms, which ingests from buckets, multi-region wouldn't gain much: if you had lots of input data in different regions, you could also run a separate pipeline per region, since the overhead of the job orchestrator is relatively small. Happy to listen, though, if that is a concern for you. |
@moschetti Not all buckets are created equal ;) Regional buckets provide huge cost savings over multi-region ones, which is why one would prefer that the code co-locate with the data. For example, here is the monthly Cloud Storage cost for 100 TB, calculated with the Google Cloud Pricing Calculator, for a regional site (Iowa) compared to a multi-region one (whole US). The result is an additional $614/month for multi-region buckets, which can be quite a lot for folks who might need that money for other Cloud resources during their analysis:
For the Iowa (regional) location: $2,048/month
For the US (multi-region) location: $2,662/month
Additional egress costs: $1,024/month
On top of that, there could be egress charges for data moves, which can for instance add an extra $1,024 ($3,072 - $2,048) to the total cost, making an even stronger case for moving the code, which is free:
Hope it helps, |
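[Editor's note] For readers skimming the numbers, the deltas quoted above can be recomputed directly (the dollar figures are the commenter's pricing-calculator estimates, not official pricing):

```bash
# Figures are the commenter's pricing-calculator estimates, not official pricing.
echo $((2662 - 2048))   # 614  -> extra monthly cost of multi-region storage alone
echo $((3072 - 2048))   # 1024 -> extra monthly cost once egress charges are included
```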
I'm using this script:
And yet the error says that the network was not specified, and the network slot is empty in the JSON output.
What change do I need to make to my script? Or is some other format needed to specify the network?
The script template doesn't include a network or subnet parameter at all.