Network parameter not being read #696
Hi there, thanks for the note. A few code changes have been made recently. The first one you should check out is here. There are a couple of changes; one is using region instead of zones. If you want to specify your network and subnet, follow these instructions (which we will push to the main branch soon): docker run gcr.io/cloud-lifesciences/gcp-variant-transforms. Please let us know if these updates work. Thanks! |
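[Editor's note] As a hedged illustration of the zones-to-region change mentioned above (flag names and values here are assumptions based on this thread, not verified against the current release):

```bash
# Hypothetical sketch -- flag names/values are assumptions from the comment above.
# Before (zone-based):
#   --zones us-central1-b
# After (region-based):
#   --region us-central1
```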
I tried using this script:
I got:
Any suggestions for what to try next? NOTE: I updated to: |
I tried eliminating the
|
You listed: |
I tried:
Does the "network resource" mean the source or destination? |
Are you using a custom network or the default network? If custom VPC are you using automode or custom-mode subnets? |
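[Editor's note] For anyone unsure how to answer this, one way to check is with the gcloud CLI (the network name below is a placeholder; output details may vary by gcloud version):

```bash
# List VPC networks; the SUBNET_MODE column shows AUTO vs CUSTOM.
gcloud compute networks list

# Inspect one network; autoCreateSubnetworks: true indicates an auto-mode VPC.
gcloud compute networks describe my-custom-network --format="yaml(autoCreateSubnetworks)"
```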
I'm in an AI notebook. |
Oh, the network. Hold on... |
I tried:
|
Try adding this to the $COMMAND (we're updating some documentation here in GitHub but haven't been able to do a release yet): --network $NETWORK |
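[Editor's note] For readers following along, a minimal sketch of what that might look like (the values are placeholders, and whether the current release actually accepts --network is exactly what this issue is about):

```bash
# Hypothetical sketch -- placeholder values; --network support is the open question here.
NETWORK=my-custom-network
COMMAND="${COMMAND} --network ${NETWORK}"
```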
Yeah, I think it's not ready for prime time. It doesn't like the network option.
|
Hmm... The last time I ran this (in the last 2 weeks), this is what I used for both a custom VPC and a custom subnet. If you have a custom VPC (i.e., not the default network named "default"), you need to pass the network name. If you created a custom VPC but use auto-mode subnets, you don't need to pass the --subnet option.
Parameters to replace: GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
GOOGLE_CLOUD_PROJECT=PROJECT docker run -v ~/.config:/root/.config |
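[Editor's note] The docker command in that comment is truncated. As a hedged reconstruction only (values and some flag names are assumptions pieced together from this thread, not a verified invocation), the overall shape is roughly:

```bash
# Hedged sketch, not a verified command -- values and some flag names are assumptions.
GOOGLE_CLOUD_PROJECT=my-project     # project containing the BigQuery dataset
SUBNETWORK=my-subnet                # only needed for custom-mode subnets
NETWORK=my-custom-network           # only needed for a non-default VPC

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --temp_location gs://my-bucket/temp \
  --subnetwork "${SUBNETWORK}" \
  --network "${NETWORK}" \
  "${COMMAND}"
# Note: --network is the flag this issue reports is not yet parsed by pipelines_runner.sh.
```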
@rcowin-gcp It looks like the command parser in the code is missing the network argument: https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docker/pipelines_runner.sh#L25
And the subsequent case expression below that does not assign the network either if one is specified. It was probably an assumption in the coding that a network is not required if a subnet is assigned, but based on other documentation the opposite is the case. That logic should probably be coded explicitly so there's no guesswork about the expectations. Hope it helps, |
You only need to specify the Network and Subnetwork if you are using a non-default network and/or custom-mode subnets. |
Thank you, I read the REST spec below -- I was just providing a recommendation on where the code can be updated with a simple fix: https://lifesciences.googleapis.com/$discovery/rest?version=v2beta |
Thank you. We're working on a handful of small updates to variant transforms and the documentation both, stay tuned. :) |
Howdy. I'm getting precisely the same problem. Any updates on when this will be fixed? It's a complete blocker, sigh... |
@slagelwa -- you had some insight into where the code needed to be fixed. Perhaps if you posted it, someone could write the patch. |
Is there any reason to think the transform will work if run from github per https://github.com/googlegenomics/gcp-variant-transforms#running-from-github? |
@abalter I think @slagelwa means that the command parser in the code is missing the network argument; see https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docker/pipelines_runner.sh#L25
|
@ypouliot running from GitHub will not provide different results, as the parser is still the same and doesn't include network. Correct that it does not parse for network, just subnetwork. Since it's not currently possible to run across multiple regions, the only argument it takes is the subnetwork, so you provide the subnet that corresponds to the region in question. Regarding the other issue about the format of the subnet: you just use the name of the subnet, not the full path with project or regions. The docs have been updated to clarify that. |
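[Editor's note] In other words (values below are placeholders), the subnetwork flag expects the short name rather than the full resource path:

```bash
# Placeholder values; per the comment above, pass only the subnet's short name.
--subnetwork my-subnet

# Not the full resource path:
#   projects/my-project/regions/us-central1/subnetworks/my-subnet
```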
@moschetti You mean not possible because it's not implemented in the Life Sciences API, or because GCP would not allow it? The Life Sciences API below seems to have an option for it (and GCP/Google Storage allows multi-region configurations), and it was also suggested by @rcowin-gcp:
Thanks, |
@pgrosu Dataflow does not support splitting workers across different regions. Life Sciences API does, but in the case of Variant Transforms Life Sciences API is only starting the first VM that kicks off the Dataflow job. So since the Dataflow workers will all run in a single region we can infer the network based on the single subnetwork that was provided. However, if that isn't working for folks, we'd be curious to hear the use case to better understand how it's being used and if this is something to look into. |
@moschetti Of course you can have multi-region Dataflow via standard functional programming methodologies when the flow is data-driven and coordinated through global Cloud Logging: one Dataflow job in one region can launch another set of Dataflow jobs in other regions, given the geo-location information about the data. This way you have a distributed global flow where the code moves adaptively to the geo-location of the distributed data, guided by the global log, which saves on costs. |
@pgrosu valid point. Variant Transforms, however, is not set up to start Dataflow jobs in multiple regions, so I'm not sure the network flag gains anything in this use case. If the lack of the network flag is blocking something, I'd be glad to hear the concern to see if it's something we need to address. But I believe that for Variant Transforms, which ingests from buckets, multi-region wouldn't gain much: if you had lots of input data in different regions, you could also run a separate pipeline per region, since the overhead of the job orchestrator is relatively small. Happy to listen, though, if that is a concern for you. |
@moschetti Not all buckets are created equal ;) Regional buckets provide huge cost savings over multi-region ones, which is why one would prefer that the code co-locate with the data. For example, here is the monthly Cloud Storage cost for 100 TB, calculated with the Google Cloud Pricing Calculator, for a regional site (Iowa) compared to a multi-region one (whole US). The result is an additional $614/month for multi-region buckets, which can be quite a lot for folks who might need that money for other Cloud resources during their analysis:
For the Iowa (regional) location: $2,048/month
For the US (multi-region) location: $2,662/month
Additional egress costs: $1,024/month
On top of that, there could be egress charges for data moves, which can for instance add an extra $1,024 ($3,072 - $2,048) to the total cost, making an even stronger case for moving the code, which is free:
Hope it helps, |
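[Editor's note] For readers skimming the numbers, the deltas quoted above can be recomputed directly (the dollar figures are the commenter's pricing-calculator estimates, not official pricing):

```bash
# Figures are the commenter's pricing-calculator estimates, not official pricing.
echo $((2662 - 2048))   # 614  -> extra monthly cost of multi-region storage alone
echo $((3072 - 2048))   # 1024 -> extra monthly cost once egress charges are included
```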
I'm using this script:
And yet the error says that the network was not specified, and the network slot is empty in the JSON output.
What change do I need to make to my script? Or is some other format needed to specify the network?
The script template doesn't include a network or subnet parameter at all.