Network parameter not being read #696
Open · abalter opened this issue May 11, 2021 · 27 comments

@abalter commented May 11, 2021

I'm using this script:

#!/bin/bash
# Parameters to replace:
# The GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
GOOGLE_CLOUD_PROJECT=psjh-eacri-data
INPUT_PATTERN=https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz
# INPUT_PATTERN=gs://gcp-public-data--gnomad/release/2.1.1/vcf/exomes/*.vcf.bgz
OUTPUT_TABLE=eacri-genomics:gnomad.gnomad_hg19_2_1_1
TEMP_LOCATION=gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp

COMMAND="vcf_to_bq \
    --input_pattern ${INPUT_PATTERN} \
    --output_table ${OUTPUT_TABLE} \
    --temp_location ${TEMP_LOCATION} \
    --job_name vcf-to-bigquery \
    --runner DataflowRunner \
    --zones us-east1-b \
    --network projects/phs-205720/global/networks/psjh-shared01 \
    --subnet projects/phs-205720/regions/us-east1/subnetworks/subnet01"
    
docker run -v ~/.config:/root/.config \
    gcr.io/cloud-lifesciences/gcp-variant-transforms \
    --project "${GOOGLE_CLOUD_PROJECT}" \
    --temp_location ${TEMP_LOCATION} \
    "${COMMAND}"

And yet the error says that the network was not specified, and the network field is empty in the JSON output.

What change do I need to make to my script? Or is some other format needed to specify the network?

The script template doesn't include a network or subnet parameter at all.

(base) jupyter@balter-genomics:~$ ./script.sh
 --project 'psjh-eacri-data' --temp_location 'gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp' -- 'vcf_to_bq     --input_pattern gs://gcp-public-data--gnomad/release/2.1.1/vcf/exomes/*.vcf.bgz     --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1     --temp_location gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp     --job_name vcf-to-bigquery     --runner DataflowRunner     --zones us-east1-b     --subnet subnet03'
Your active configuration is: [variant]
{
  "pipeline": {
    "actions": [
      {
        "commands": [
          "-c",
          "mkdir -p /mnt/google/.google/tmp"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-genomics-pipelines/io",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      },
      {
        "commands": [
          "-c",
          "/opt/gcp_variant_transforms/bin/vcf_to_bq --input_pattern gs://gcp-public-data--gnomad/release/2.1.1/vcf/exomes/*.vcf.bgz --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1 --temp_location gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp --job_name vcf-to-bigquery --runner DataflowRunner --zones us-east1-b --subnet subnet03 --project psjh-eacri-data --region us-east1 --temp_location gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-lifesciences/gcp-variant-transforms",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      },
      {
        "alwaysRun": true,
        "commands": [
          "-c",
          "gsutil -q cp /google/logs/output gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp/runner_logs_20210510_230717.log"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-genomics-pipelines/io",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      }
    ],
    "environment": {
      "TMPDIR": "/mnt/google/.google/tmp"
    },
    "resources": {
      "regions": [
        "us-east1"
      ],
      "virtualMachine": {
        "disks": [
          {
            "name": "google",
            "sizeGb": 10
          }
        ],
        "machineType": "g1-small",
        "network": {},
        "serviceAccount": {
          "scopes": [
            "https://www.googleapis.com/auth/cloud-platform",
            "https://www.googleapis.com/auth/devstorage.read_write"
          ]
        }
      }
    }
  }
}
Pipeline running as "projects/447346450878/locations/us-central1/operations/13027962545459232820" (attempt: 1, preemptible: false)
Output will be written to "gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp/runner_logs_20210510_230717.log"
23:07:26 Worker "google-pipelines-worker-ab367d994b1cd7881ebf66950fec6c17" assigned in "us-east1-b" on a "g1-small" machine
23:07:26 Execution failed: allocating: creating instance: inserting instance: Invalid value for field 'resource.networkInterfaces[0].network': ''. The referenced network resource cannot be found.
23:07:27 Worker released
"run": operation "projects/447346450878/locations/us-central1/operations/13027962545459232820" failed: executing pipeline: Execution failed: allocating: creating instance: inserting instance: Invalid value for field 'resource.networkInterfaces[0].network': ''. The referenced network resource cannot be found. (reason: INVALID_ARGUMENT)
(base) jupyter@balter-genomics:~$ ./script.sh
 --project 'psjh-eacri-data' --temp_location 'gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp' -- 'vcf_to_bq     --input_pattern https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz     --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1     --temp_location gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp     --job_name vcf-to-bigquery     --runner DataflowRunner     --zones us-east1-b     --subnet subnet03'
Your active configuration is: [variant]
{
  "pipeline": {
    "actions": [
      {
        "commands": [
          "-c",
          "mkdir -p /mnt/google/.google/tmp"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-genomics-pipelines/io",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      },
      {
        "commands": [
          "-c",
          "/opt/gcp_variant_transforms/bin/vcf_to_bq --input_pattern https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1 --temp_location gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp --job_name vcf-to-bigquery --runner DataflowRunner --zones us-east1-b --subnet subnet03 --project psjh-eacri-data --region us-east1 --temp_location gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-lifesciences/gcp-variant-transforms",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      },
      {
        "alwaysRun": true,
        "commands": [
          "-c",
          "gsutil -q cp /google/logs/output gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp/runner_logs_20210511_000846.log"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-genomics-pipelines/io",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      }
    ],
    "environment": {
      "TMPDIR": "/mnt/google/.google/tmp"
    },
    "resources": {
      "regions": [
        "us-east1"
      ],
      "virtualMachine": {
        "disks": [
          {
            "name": "google",
            "sizeGb": 10
          }
        ],
        "machineType": "g1-small",
        "network": {},
        "serviceAccount": {
          "scopes": [
            "https://www.googleapis.com/auth/cloud-platform",
            "https://www.googleapis.com/auth/devstorage.read_write"
          ]
        }
      }
    }
  }
}
Pipeline running as "projects/447346450878/locations/us-central1/operations/3293803574088782620" (attempt: 1, preemptible: false)
Output will be written to "gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp/runner_logs_20210511_000846.log"
00:08:56 Worker "google-pipelines-worker-e05c2864661a5ba9f1b29012de1ac56d" assigned in "us-east1-d" on a "g1-small" machine
00:08:56 Execution failed: allocating: creating instance: inserting instance: Invalid value for field 'resource.networkInterfaces[0].network': ''. The referenced network resource cannot be found.
00:08:57 Worker released
"run": operation "projects/447346450878/locations/us-central1/operations/3293803574088782620" failed: executing pipeline: Execution failed: allocating: creating instance: inserting instance: Invalid value for field 'resource.networkInterfaces[0].network': ''. The referenced network resource cannot be found. (reason: INVALID_ARGUMENT)
@rcowin-gcp (Contributor) commented May 14, 2021

Hi there,

Thanks for the note. A few code changes have been made recently. The first one you should check out is here; among other changes, it uses region instead of zones.

If you want to specify your network and subnet, follow these instructions (which we will push to the main branch soon).

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region us-central1 \
  --location us-central1 \
  --temp_location "${TEMP_LOCATION}" \
  --subnetwork regions/us-central1/subnetworks/my-subnet \
  "${COMMAND}"

Please let us know if these updates work for you. Thanks!

@abalter (Author) commented May 17, 2021

I tried using this script:

#!/bin/bash
    
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=$$$$$$$$$
GOOGLE_CLOUD_REGION=us-east1-b
GOOGLE_CLOUD_LOCATION=us-east1-b
TEMP_LOCATION=gs://$$$$$$$$/*.vcf.bgz/tmp
INPUT_PATTERN=https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz
OUTPUT_TABLE=eacri-genomics:gnomad.gnomad_hg19_2_1_1

COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner"

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

I got:

 --project 'eacri-genomics' --region 'us-east1-b' --temp_location 'gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp' -- 'us-east1-b' 'vcf_to_bq   --input_pattern https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz   --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1   --job_name vcf-to-bigquery   --runner DataflowRunner'
getopt: unrecognized option '--location'

Any suggestions for what to try next?

NOTE: I updated to gcp-variant-transforms:latest.

@abalter (Author) commented May 17, 2021

I tried eliminating the --location flag, and this happened:

 --project 'psjh-eacri-data' --region 'us-east1-b' --temp_location 'gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp' -- 'vcf_to_bq   --input_pattern https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz   --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1   --job_name vcf-to-bigquery   --runner DataflowRunner'
{
  "pipeline": {
    "actions": [
      {
        "commands": [
          "-c",
          "mkdir -p /mnt/google/.google/tmp"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-genomics-pipelines/io",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      },
      {
        "commands": [
          "-c",
          "/opt/gcp_variant_transforms/bin/vcf_to_bq --input_pattern https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1 --job_name vcf-to-bigquery --runner DataflowRunner --project psjh-eacri-data --region us-east1-b --temp_location gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-lifesciences/gcp-variant-transforms",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      },
      {
        "alwaysRun": true,
        "commands": [
          "-c",
          "gsutil -q cp /google/logs/output gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp/runner_logs_20210517_170123.log"
        ],
        "entrypoint": "bash",
        "imageUri": "gcr.io/cloud-genomics-pipelines/io",
        "mounts": [
          {
            "disk": "google",
            "path": "/mnt/google"
          }
        ]
      }
    ],
    "environment": {
      "TMPDIR": "/mnt/google/.google/tmp"
    },
    "resources": {
      "regions": [
        "us-east1-b"
      ],
      "virtualMachine": {
        "disks": [
          {
            "name": "google",
            "sizeGb": 10
          }
        ],
        "machineType": "g1-small",
        "network": {},
        "serviceAccount": {
          "scopes": [
            "https://www.googleapis.com/auth/cloud-platform",
            "https://www.googleapis.com/auth/devstorage.read_write"
          ]
        }
      }
    }
  }
}
Pipeline running as "projects/447346450878/locations/us-central1/operations/5404432620223078014" (attempt: 1, preemptible: false)
Output will be written to "gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp/runner_logs_20210517_170123.log"
17:01:32 Execution failed: allocating: selecting resources: selecting region and zone: no regions/zones match request
"run": operation "projects/447346450878/locations/us-central1/operations/5404432620223078014" failed: executing pipeline: Execution failed: allocating: selecting resources: selecting region and zone: no regions/zones match request (reason: NOT_FOUND)

@moschetti (Member) commented May 17, 2021

You listed --region 'us-east1-b', but the trailing letter indicates a zone. It should be either --region 'us-east1' or --zone 'us-east1-b'.

@abalter (Author) commented May 17, 2021

I tried: --region us-east1

"run": operation "projects/447346450878/locations/us-central1/operations/13169224402019687354" failed: executing pipeline: Execution failed: allocating: creating instance: inserting instance: Invalid value for field 'resource.networkInterfaces[0].network': ''. The referenced network resource cannot be found. (reason: INVALID_ARGUMENT)

Does the "network resource" mean the source or destination?

@rcowin-gcp (Contributor) commented May 17, 2021

Are you using a custom network or the default network? If it's a custom VPC, are you using auto-mode or custom-mode subnets?

@abalter (Author) commented May 17, 2021

I'm in an AI notebook.

@abalter (Author) commented May 17, 2021

Oh, the network. Hold on...

@abalter (Author) commented May 17, 2021

[screenshot attached]

@abalter (Author) commented May 17, 2021

I tried:

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region us-east1 \
  --temp_location "${TEMP_LOCATION}" \
  --subnetwork psjh-shared01/subnet01  \
  "${COMMAND}"

17:47:51 Worker released
"run": operation "projects/447346450878/locations/us-central1/operations/5710526366526649430" failed: executing pipeline: Execution failed: allocating: creating instance: inserting instance: Invalid value for field 'resource.networkInterfaces[0].subnetwork': 'projects/psjh-eacri-data/regions/us-east1/subnetworks/psjh-shared01/subnet01'. The URL is malformed. (reason: INVALID_ARGUMENT)

@rcowin-gcp (Contributor):

Try adding this to the $COMMAND (we're updating some documentation here on GitHub but haven't been able to do a release yet):

--network $NETWORK
--subnetwork regions/$REGION/subnetworks/$SUBNETWORK
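
For reference, applied to the values from the script at the top of this thread, the COMMAND block would look roughly like this (a sketch only; psjh-shared01, us-east1, and subnet01 are taken from the original script):

COMMAND="vcf_to_bq \
    --input_pattern ${INPUT_PATTERN} \
    --output_table ${OUTPUT_TABLE} \
    --temp_location ${TEMP_LOCATION} \
    --job_name vcf-to-bigquery \
    --runner DataflowRunner \
    --network psjh-shared01 \
    --subnetwork regions/us-east1/subnetworks/subnet01"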

@abalter (Author) commented May 17, 2021

Yeah, I think it's not ready for prime time. It doesn't like the network option.

 --project 'psjh-eacri-data' --region 'us-east1' --temp_location 'gs://psjh-eacri/balter/gnomad_tmp/vcf/exomes/*.vcf.bgz/tmp' --subnetwork 'regions//subnetworks/subnet01' -- 'psjh-shared01'
getopt: unrecognized option '--network'
./gcp_vcf.sh: line 27: vcf_to_bq   --input_pattern https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.*.vcf.bgz   --output_table eacri-genomics:gnomad.gnomad_hg19_2_1_1   --job_name vcf-to-bigquery   --runner DataflowRunner: No such file or directory

@rcowin-gcp (Contributor):

Hmm... The last time I ran this (within the last 2 weeks), this is what I used for both a custom VPC and a custom subnet. If you have a custom VPC (i.e., not the one named "default"), you need to pass the network name. If you created a custom VPC but use auto-mode subnets, you don't need to pass the --subnet option.

# Parameters to replace:
# The GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
GOOGLE_CLOUD_PROJECT=PROJECT
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=PROJECT:DATASET.TABLE
TEMP_LOCATION=gs://na1287-test-042621/temp
NETWORK=CUSTOM_NETWORK
SUBNETWORK=CUSTOM_MODE_SUBNET

COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner \
  --network $NETWORK \
  --subnetwork regions/$REGION/subnetworks/$SUBNETWORK"

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region REGION \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

@pgrosu commented May 18, 2021

@rcowin-gcp It looks like the command parser in the code is missing the network parameter. If you look at this line, it does not parse a network option, which is probably why it is always empty:

https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docker/pipelines_runner.sh#L25

getopt -o '' -l project:,temp_location:,docker_image:,region:,subnetwork:,use_public_ips:,service_account:,location: -- "$@"

And the case statement below it does not assign the network either, even if one is specified. The code probably assumes a network is not required when a subnet is assigned, but other documentation suggests it's the other way around. That logic should probably be made explicit so there's no guesswork about the expectations.
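
A minimal sketch of what that could look like in pipelines_runner.sh (the variable name "network" and the exact handling are illustrative assumptions, not an actual patch from the repo):

# Add "network:" to the long-options list:
opts=$(getopt -o '' -l project:,temp_location:,docker_image:,region:,network:,subnetwork:,use_public_ips:,service_account:,location: -- "$@")
eval set -- "$opts"

network=''      # hypothetical variable
subnetwork=''
while true; do
  case "$1" in
    --network)    network="$2";    shift 2 ;;
    --subnetwork) subnetwork="$2"; shift 2 ;;
    --)           shift; break ;;
    *)            shift ;;   # remaining options handled as in the original script
  esac
done
# $network would then need to be forwarded wherever the script passes
# --subnetwork today (an assumption; not shown here).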

Hope it helps,
~p

@rcowin-gcp (Contributor):

You only need to specify the Network and Subnetwork if you are using a non-default network and/or custom-mode subnets.
The documentation is not up-to-date right now but we are working on a new release.

@pgrosu commented May 18, 2021

Thank you, I read the REST spec below -- I was just providing a recommendation on where the code can be updated with a simple fix:

https://lifesciences.googleapis.com/$discovery/rest?version=v2beta

@rcowin-gcp (Contributor):

Thank you. We're working on a handful of small updates to both Variant Transforms and the documentation; stay tuned. :)

@ypouliot:

Howdy. I'm getting precisely the same problem. Any updates on when this will be fixed? It's a complete blocker, sigh...

@abalter (Author) commented Oct 17, 2021

@slagelwa -- you had some insight into where the code needs to be fixed. Perhaps if you posted it, someone could write the patch.

@ypouliot:

Is there any reason to think the transform will work if run from GitHub per https://github.com/googlegenomics/gcp-variant-transforms#running-from-github?

@pgrosu commented Oct 17, 2021

@abalter I think @slagelwa was referring to the fact that the command parser in the code is missing the network parameter. If you look at this line, it does not parse a network option, which is probably why it is always empty:

https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docker/pipelines_runner.sh#L25

getopt -o '' -l project:,temp_location:,docker_image:,region:,subnetwork:,use_public_ips:,service_account:,location: -- "$@"

@moschetti (Member):

@ypouliot Running from GitHub will not give different results, as the parser is the same and still doesn't include network.

Correct, it does not parse network, just subnetwork. Since it's not currently possible to run across multiple regions, the only argument it takes is the subnetwork, so you provide the subnet that corresponds to the region in question.

Regarding the other issue about the format of the subnet: you just use the name of the subnet, not the full path with the project or regions. The docs have been updated to clarify that.
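
For the setup from the top of this thread, that would mean an invocation roughly like the following (a sketch; us-east1 and subnet01 are taken from the original script, and --network is left out because the runner script doesn't parse it):

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region us-east1 \
  --temp_location "${TEMP_LOCATION}" \
  --subnetwork subnet01 \
  "${COMMAND}"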

@pgrosu commented Oct 19, 2021

@moschetti Do you mean not possible because it's not implemented in the Life Sciences API, or because GCP would not allow it? The Life Sciences API spec below seems to have an option for it (and GCP/Google Storage allows multi-region configurations), plus it was also suggested by @rcowin-gcp:

        "network": {
          "type": "string",
          "description": "The network name to attach the VM's network interface to. The value will be prefixed with `global/networks/` unless it contains a `/`, in which case it is assumed to be a fully specified network resource URL. If unspecified, the global default network is used."
        },
        "subnetwork": {
          "description": "If the specified network is configured for custom subnet creation, the name of the subnetwork to attach the instance to must be specified here. The value is prefixed with `regions/*/subnetworks/` unless it contains a `/`, in which case it is assumed to be a fully specified subnetwork resource URL. If the `*` character appears in the value, it is replaced with the region that the virtual machine has been allocated in.",
          "type": "string"
        },

Thanks,
~p

@moschetti (Member):

@pgrosu Dataflow does not support splitting workers across different regions. The Life Sciences API does, but in the case of Variant Transforms the Life Sciences API only starts the first VM, which kicks off the Dataflow job. Since the Dataflow workers will all run in a single region, we can infer the network from the single subnetwork that was provided.

However, if that isn't working for folks, we'd be curious to hear the use case to better understand how it's being used and if this is something to look into.

@pgrosu commented Oct 20, 2021

@moschetti Of course you can have multi-region Dataflow jobs via standard functional programming methodologies when the process is data-driven through global Cloud Logging: one Dataflow job in one region can launch another set of Dataflow jobs in multiple regions, given geo-location information about the data. This way you have a distributed global flow where the code moves adaptively to the geo-location of the distributed data, guided by the global log, which saves on costs.

@moschetti (Member):

@pgrosu Valid point. Variant Transforms, however, is not set up to start Dataflow jobs in multiple regions, so I'm not sure the network flag gains anything in this use case. If the lack of a network flag is blocking something, I'd be glad to hear the concern to see if it's something we need to address.

But I believe that for Variant Transforms, which ingests from buckets, multi-region wouldn't gain much: if you had lots of input data in different regions, you could also run a separate pipeline per region, since the overhead of the job orchestrator is relatively small. Happy to listen though if that is a concern for you.

@pgrosu commented Oct 21, 2021

@moschetti Not all buckets are created equal ;) Regional buckets provide huge cost savings over multi-region ones, which is why one would prefer the code to co-locate to those sites. For example, here's the monthly Cloud Storage cost for 100 TB, calculated using the Google Cloud Pricing Calculator, for a regional site (Iowa) compared to a multi-region location (whole US). The result is that multi-region buckets would cost an additional ~$614/month, which can be quite a lot for folks who might need that budget for other Cloud resources during their analysis:

For the Iowa (regional) location ($2048/month)

1x Standard Storage
Location: Iowa
Total Amount of Storage: 102,400 GiB
Egress - Data moves within the same location: 0 GiB
Always Free usage included: No
USD 2,048.000

For the US (multi-region) location ($2662/month)

1x Standard Storage
Location: United States
Total Amount of Storage: 102,400 GiB
Egress - Data moves within the same location: 0 GiB
Always Free usage included: No
USD 2,662.400

Additional Egress Costs ($1024/month)

On top of that, there could be egress charges for data moves, which can for instance add an extra $1,024 ($3,072 - $2,048) to the total cost, making an even stronger case for moving the code (which is free) instead:

1x Standard Storage
Location: Iowa
Total Amount of Storage: 102,400 GiB
Egress - Data moves between different locations on the same continent: 102,400 GiB
Always Free usage included: No
USD 3,072.000
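
Working out the differences from the calculator figures above:

$2,662.40 - $2,048.00 = $614.40/month extra for multi-region storage
$3,072.00 - $2,048.00 = $1,024.00/month extra once same-continent egress is added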

Hope it helps,
Paul
