
Update downloader.sh #18

Open · wants to merge 1 commit into base: main
110 changes: 76 additions & 34 deletions google/downloader.sh
@@ -6,58 +6,100 @@
# From: https://gist.github.com/jeffmccune/e7d635116f25bc7e12b2a19efbafcdf8
# From: https://gist.github.com/n0531m/f3714f6ad6ef738a3b0a

# Script to retrieve and organize Google and Google Cloud IP ranges.

set -euo pipefail
set -x

# Check for required dependencies
Owner: A dependency check isn't needed for this script, because it only runs in GitHub workers.

for cmd in curl dig jq mktemp; do
  if ! command -v "$cmd" &> /dev/null; then
    echo "Error: $cmd is not installed or not in PATH" >&2
    exit 1
  fi
done

# Create a temporary directory and ensure cleanup on exit
Owner: I think it's not needed:

  • /tmp is available every time
  • the worker lives for less than 1 minute

temp_dir=$(mktemp -d)
trap 'rm -rf -- "$temp_dir"' EXIT

# Function to download files with retries and error handling
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retries not need too, ~ 3 years this repo exists and I've never had this problem

download_file() {
  local url=$1
  local output_file=$2
  local retries=3
  local count=0
  # -f makes curl return a nonzero status on HTTP errors, so the retry loop can see them
  until curl -fs "$url" -o "$output_file"; do
    count=$((count + 1))
    if [[ $count -ge $retries ]]; then
      echo "Error: Failed to download $url after $retries attempts" >&2
      exit 1
    fi
    sleep 2 # wait before retrying
  done
}

# get from public ranges
curl -s https://www.gstatic.com/ipranges/goog.txt > /tmp/goog.txt
curl -s https://www.gstatic.com/ipranges/cloud.json > /tmp/cloud.json
# Parallel downloads with retries
Owner: How many processor cores are available in GitHub workers? Why would this be necessary when downloading files smaller than 1 MB?

Author: In the free plan, if I'm not wrong, you may use up to 2 cores :)

download_file "https://www.gstatic.com/ipranges/goog.txt" "$temp_dir/goog.txt" &
download_file "https://www.gstatic.com/ipranges/cloud.json" "$temp_dir/cloud.json" &
download_file "https://developers.google.com/search/apis/ipranges/googlebot.json" "$temp_dir/googlebot.json" &
wait # Ensure all downloads finish
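
One caveat worth noting here: in bash, a bare "wait" returns zero even when a background job failed, and the "exit 1" inside download_file only exits the background subshell. A per-PID wait, sketched below (not part of the PR), would propagate download failures:

    # Sketch: collect PIDs and wait on each one, so a failed download aborts the script.
    pids=()
    download_file "https://www.gstatic.com/ipranges/goog.txt" "$temp_dir/goog.txt" & pids+=($!)
    download_file "https://www.gstatic.com/ipranges/cloud.json" "$temp_dir/cloud.json" & pids+=($!)
    download_file "https://developers.google.com/search/apis/ipranges/googlebot.json" "$temp_dir/googlebot.json" & pids+=($!)
    for pid in "${pids[@]}"; do
      wait "$pid"  # nonzero status propagates, and set -e then stops the script
    done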

# Public GoogleBot IP ranges
# From: https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot
curl -s https://developers.google.com/search/apis/ipranges/googlebot.json > /tmp/googlebot.json
# Fetch Google netblocks using dig command
Owner: This seems like it could be useful.

fetch_netblocks() {
  local idx=2
  local txt
  txt="$(dig TXT _netblocks.google.com +short @8.8.8.8 || true)"
Owner: The || true will hide network problems from us that we might otherwise have noticed; in that case we just won't know anything?

Author (@fabriziosalmi, Oct 28, 2024): Uhm.. maybe the best option would be to collect errors as artifacts for further investigation.. but as you already said, it works, and I completely agree with you: this PR is not needed :) Keep it as my thanks message for your immensely useful repo. I've used it every day for years :)

  while [[ -n "$txt" ]]; do
    echo "$txt" | tr '[:space:]+' "\n" | grep ':' | cut -d: -f2- >> "$temp_dir/netblocks.txt"
    txt="$(dig TXT _netblocks${idx}.google.com +short @8.8.8.8 || true)"
    ((idx++))
  done
}
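
Regarding the || true discussion above: a middle ground is to keep the guard (set -e would otherwise abort the script on a transient dig failure) while recording the failure for later inspection. A minimal sketch, where the dig_errors.log file and any artifact-upload step are assumptions, not part of the PR:

    # Sketch: keep the set -e guard, but log dig failures instead of discarding them.
    txt="$(dig TXT _netblocks.google.com +short @8.8.8.8)" || {
      echo "dig failed for _netblocks.google.com" >> "$temp_dir/dig_errors.log"  # assumed log file
      txt=""
    }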

# get from netblocks
txt="$(dig TXT _netblocks.google.com +short @8.8.8.8)"
idx=2
while [[ -n "${txt}" ]]; do
echo "${txt}" | tr '[:space:]+' "\n" | grep ':' | cut -d: -f2- >> /tmp/netblocks.txt
txt="$(dig TXT _netblocks${idx}.google.com +short @8.8.8.8)"
((idx++))
done
fetch_netblocks
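
For context, the TXT record feeding this pipeline looks roughly like the following (illustrative values, not fetched for this PR); the tr/grep/cut chain splits it on whitespace, keeps only the tokens containing a colon, and strips the ip4:/ip6: label:

    $ dig TXT _netblocks.google.com +short @8.8.8.8
    "v=spf1 ip4:64.233.160.0/19 ip4:66.102.0.0/20 ip4:74.125.0.0/16 ~all"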

# get from other netblocks
# Function to resolve DNS SPF records recursively with validation
get_dns_spf() {
  dig @8.8.8.8 +short txt "$1" |
    tr ' ' '\n' |
    while read -r entry; do
Owner: "-r: do not allow backslashes to escape any characters". What's that for?

(A short demonstration of the -r difference follows the function below.)

      case "$entry" in
        ip4:*) echo "${entry#*:}" ;;
        ip6:*) echo "${entry#*:}" ;;
        include:*) get_dns_spf "${entry#*:}" ;;
      esac
    done || {
      # send the error to stderr so it never lands in netblocks.txt
      echo "Error: Failed to fetch DNS SPF records for $1" >&2
      exit 1
    }
}
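
The demonstration promised above: without -r, read consumes backslashes in its input as escape characters; with -r they pass through untouched (a toy input, not data this script would actually see):

    $ printf 'a\\tb\n' | { read entry; echo "$entry"; }
    atb
    $ printf 'a\\tb\n' | { read -r entry; echo "$entry"; }
    a\tb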

get_dns_spf "_cloud-netblocks.googleusercontent.com" >> /tmp/netblocks.txt
get_dns_spf "_spf.google.com" >> /tmp/netblocks.txt
# Fetch additional SPF-based netblocks with error handling
get_dns_spf "_cloud-netblocks.googleusercontent.com" >> "$temp_dir/netblocks.txt"
get_dns_spf "_spf.google.com" >> "$temp_dir/netblocks.txt"

# Separate IPv4 and IPv6 ranges
grep -v ':' "$temp_dir/goog.txt" > "$temp_dir/google-ipv4.txt"
jq -r '.prefixes[] | select(.ipv4Prefix != null) | .ipv4Prefix' "$temp_dir/cloud.json" >> "$temp_dir/google-ipv4.txt"
jq -r '.prefixes[] | select(.ipv4Prefix != null) | .ipv4Prefix' "$temp_dir/googlebot.json" >> "$temp_dir/google-ipv4.txt"
grep -v ':' "$temp_dir/netblocks.txt" >> "$temp_dir/google-ipv4.txt"

# save ipv4
grep -v ':' /tmp/goog.txt > /tmp/google-ipv4.txt
jq '.prefixes[] | [.ipv4Prefix][] | select(. != null)' -r /tmp/cloud.json >> /tmp/google-ipv4.txt
jq '.prefixes[] | [.ipv4Prefix][] | select(. != null)' -r /tmp/googlebot.json >> /tmp/google-ipv4.txt
grep -v ':' /tmp/netblocks.txt >> /tmp/google-ipv4.txt
grep ':' "$temp_dir/goog.txt" > "$temp_dir/google-ipv6.txt"
jq -r '.prefixes[] | select(.ipv6Prefix != null) | .ipv6Prefix' "$temp_dir/cloud.json" >> "$temp_dir/google-ipv6.txt"
jq -r '.prefixes[] | select(.ipv6Prefix != null) | .ipv6Prefix' "$temp_dir/googlebot.json" >> "$temp_dir/google-ipv6.txt"
grep ':' "$temp_dir/netblocks.txt" >> "$temp_dir/google-ipv6.txt"

# save ipv6
grep ':' /tmp/goog.txt > /tmp/google-ipv6.txt
jq '.prefixes[] | [.ipv6Prefix][] | select(. != null)' -r /tmp/cloud.json >> /tmp/google-ipv6.txt
jq '.prefixes[] | [.ipv6Prefix][] | select(. != null)' -r /tmp/googlebot.json >> /tmp/google-ipv6.txt
grep ':' /tmp/netblocks.txt >> /tmp/google-ipv6.txt
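
For context, the jq filters above assume the shape Google publishes in cloud.json and googlebot.json: a prefixes array whose entries carry either an ipv4Prefix or an ipv6Prefix key (the values below are illustrative):

    $ jq '.prefixes[0]' "$temp_dir/cloud.json"
    {
      "ipv4Prefix": "34.0.0.0/15",
      "service": "Google Cloud",
      "scope": "us-east1"
    }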
# Sort and deduplicate results, and ensure target directory exists
output_dir="google"
mkdir -p "$output_dir"
Owner: The directory already exists; this file is in it.

sort -u "$temp_dir/google-ipv4.txt" > "$output_dir/ipv4.txt"
Owner: Use sort -V:

   -V, --version-sort
         natural sort of (version) numbers within text

Without this sorting, it's gonna look bad; see #5 - Q3.
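
To illustrate the difference on prefixes like these (hypothetical input): plain sort compares byte by byte, while sort -V compares numeric runs naturally; sort -uV would combine deduplication with version ordering:

    $ printf '8.8.4.0/24\n10.0.0.0/8\n100.64.0.0/10\n' | sort
    10.0.0.0/8
    100.64.0.0/10
    8.8.4.0/24
    $ printf '8.8.4.0/24\n10.0.0.0/8\n100.64.0.0/10\n' | sort -V
    8.8.4.0/24
    10.0.0.0/8
    100.64.0.0/10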

sort -u "$temp_dir/google-ipv6.txt" > "$output_dir/ipv6.txt"

# Verify files are written correctly
if [[ ! -s "$output_dir/ipv4.txt" || ! -s "$output_dir/ipv6.txt" ]]; then
echo "Error: Output files are empty or failed to generate."
exit 1
fi

# sort & uniq
sort -V /tmp/google-ipv4.txt | uniq > google/ipv4.txt
sort -V /tmp/google-ipv6.txt | uniq > google/ipv6.txt
echo "IP ranges saved in $output_dir/ipv4.txt and $output_dir/ipv6.txt"