
Update downloader.sh #18

Open · fabriziosalmi wants to merge 1 commit into main

Conversation

fabriziosalmi

Hello my Lord.... I'm trying to contribute to this gem, so I want to check whether PRs are welcome by proposing a simple update for a single downloader. If appreciated, I can cover all the downloaders quickly :)

List of improvements:

  • Dependency check: Ensures curl, dig, jq, and mktemp are installed before proceeding.
  • Error handling in downloads: Added retries and error messages for failed downloads.
  • Background download: Parallelized downloads for faster execution using & and wait.
  • Error handling in dig calls: Added fallback || true to prevent script failure if DNS resolution fails.
  • Recursive SPF record resolution: Improved robustness with additional error handling.
  • Temp directory with trap: Ensured that the temporary directory is always cleaned up upon script exit.
  • Improved sorting: Used sort -u for efficient sorting and deduplication in one step.
  • Output directory creation: Ensured the output directory is created before saving the results.
  • File size check: Added validation to check if the output files are empty or failed to generate.
  • Logging: Enhanced logging for error messages and successful file creations. (The last three items are sketched right after this list.)

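The last three items look roughly like this in the updated script (a sketch only: the paths and log messages below are illustrative, and the exact code is in the diff):

    # Ensure the output directory exists, then sort and deduplicate in one step
    output_dir="google"
    mkdir -p "$output_dir"
    sort -u /tmp/google-ipv4.txt > "$output_dir/ipv4.txt"

    # -s is true only if the file exists and is non-empty
    if [ ! -s "$output_dir/ipv4.txt" ]; then
        echo "Error: $output_dir/ipv4.txt is empty or was not generated" >&2
        exit 1
    fi
    echo "Successfully created $output_dir/ipv4.txt"
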
Owner @lord-alfred left a comment

Thanks for the great MR, but I don't understand why there are so many changes: this script runs in GitHub workers and everything works fine there. After these optimizations it looks like the script could run on any machine, but that is not the goal.

set -euo pipefail
set -x

# Check for required dependencies
Owner

The dependency check isn't needed for this script, because it runs only in GitHub workers.
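
For reference, the guard under discussion looks roughly like this (a sketch reconstructed from the tools named in the PR description and the fi/done context below; the exact loop body may differ):

    for cmd in curl dig jq mktemp; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Error: required command '$cmd' is not installed" >&2
            exit 1
        fi
    done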

fi
done

# Create a temporary directory and ensure cleanup on exit
Owner

I think it's not needed:

  • /tmp is available every time
  • the worker lives less than 1 minute
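
The pattern being debated is, in sketch form (variable names are illustrative, not necessarily the PR's):

    temp_dir="$(mktemp -d)"
    # remove the temporary directory even if the script exits early
    trap 'rm -rf "$temp_dir"' EXIT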


# Function to download files with retries and error handling
Owner

Retries aren't needed either: this repo has existed for ~3 years and I've never had this problem.
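
For reference, a retry wrapper of the kind described in the PR; this sketch leans on curl's built-in --retry flag, so treat it as illustrative rather than the PR's exact code:

    download_file() {
        local url="$1" dest="$2"
        # -f fails on HTTP errors; --retry repeats transient failures
        if ! curl -sf --retry 3 --retry-delay 2 "$url" -o "$dest"; then
            echo "Error: failed to download $url after retries" >&2
            exit 1
        fi
    }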

# get from public ranges
curl -s https://www.gstatic.com/ipranges/goog.txt > /tmp/goog.txt
curl -s https://www.gstatic.com/ipranges/cloud.json > /tmp/cloud.json
# Parallel downloads with retries
Owner

How many processor cores are available in GitHub workers? Why would this be necessary when downloading files smaller than 1 MB?

Author

On the free plan, if I'm not wrong, you may use up to 2 cores :)
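
Applied to the two downloads quoted above, the proposed pattern is simply:

    curl -s https://www.gstatic.com/ipranges/goog.txt > /tmp/goog.txt &
    curl -s https://www.gstatic.com/ipranges/cloud.json > /tmp/cloud.json &
    wait   # block until both background jobs finish

One caveat: a bare wait returns 0 even if a background curl failed, so errors from parallel downloads need their own checks.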

# Public GoogleBot IP ranges
# From: https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot
curl -s https://developers.google.com/search/apis/ipranges/googlebot.json > /tmp/googlebot.json
# Fetch Google netblocks using dig command
Owner

This seems like it could be useful

fetch_netblocks() {
local idx=2
local txt
txt="$(dig TXT _netblocks.google.com +short @8.8.8.8 || true)"
Owner

|| true will hide network problems that we might otherwise have noticed; in that case we just won't know anything happened?

Author @fabriziosalmi Oct 28, 2024

Hmm... maybe the best option would be to collect errors as artifacts for further investigation... but as you already said, it works, and I completely agree with you: this PR is not needed :) Keep it as my thank-you message for your immensely useful repo. I've been using it every day for years :)
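
A hypothetical middle ground along those lines would log the failure instead of discarding it with || true (the log path here is made up for illustration):

    if ! txt="$(dig TXT _netblocks.google.com +short @8.8.8.8)"; then
        # record the failure for later inspection instead of hiding it
        echo "dig failed for _netblocks.google.com" >> /tmp/dns-errors.log
    fi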

get_dns_spf() {
  dig @8.8.8.8 +short txt "$1" |
  tr ' ' '\n' |
-  while read entry; do
+  while read -r entry; do
Owner

   -r     do not allow backslashes to escape any characters

What's that for?
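
The practical difference: without -r, read treats a backslash as an escape character and silently drops it, for example:

    printf '%s\n' 'a\b' | while read entry;    do echo "$entry"; done   # prints: ab
    printf '%s\n' 'a\b' | while read -r entry; do echo "$entry"; done   # prints: a\b

SPF tokens shouldn't contain backslashes, so here -r is mostly a defensive habit (it is also what ShellCheck recommends, SC2162).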

grep ':' /tmp/netblocks.txt >> /tmp/google-ipv6.txt
# Sort and deduplicate results, and ensure target directory exists
output_dir="google"
mkdir -p "$output_dir"
Owner

The directory already exists; this file is already in it.

# Sort and deduplicate results, and ensure target directory exists
output_dir="google"
mkdir -p "$output_dir"
sort -u "$temp_dir/google-ipv4.txt" > "$output_dir/ipv4.txt"
Owner

sort -V:

   -V, --version-sort
         natural sort of (version) numbers within text

Without this sorting, it's gonna look bad, see #5 - Q3
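
For example, lexicographic sort misorders CIDR blocks that version sort handles (and -V combines with deduplication as sort -Vu):

    $ printf '10.2.0.0/16\n10.10.0.0/16\n' | sort
    10.10.0.0/16
    10.2.0.0/16
    $ printf '10.2.0.0/16\n10.10.0.0/16\n' | sort -V
    10.2.0.0/16
    10.10.0.0/16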
