Skip to content

Latest commit

 

History

History
343 lines (247 loc) · 16.7 KB

README.md

File metadata and controls

343 lines (247 loc) · 16.7 KB

movie stills
Movie Stills

A Go CLI application to scrap various websites in order to get high-quality movie snapshots. See the list of Supported Websites.

CI Go Report Card Test Scrapers Docker Image Size

Installation

There are various ways to install or use the application:

Binaries

Download the latest binary from the releases page for your OS. Then you can simply execute the binary like this:

# list all the implemented scrapers
./moviestills --list

# You can also use environment variables 
# instead of CLI arguments.
# scrap the blubeaver website with default settings
WEBSITE=blubeaver ./moviestills

See Usage to check what settings you can pass through CLI arguments or environment variables.

Docker images

Docker comes to the rescue, providing an easy way how to run moviestills on most platforms.

GitHub Registry

The "latest" image is built from the master branch on every push. You can see all the other tags (releases) available here.

docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --rm ghcr.io/kinoute/moviestills:latest \
    --website movie-screencaps \
    --async

Docker Hub

You can can see all the image tags available on the Docker Hub here.

docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --env WEBSITE=blubeaver \
    --env ASYNC=true \
    --rm hivacruz/moviestills:latest

As you can see, you can also use environment variables instead of CLI arguments.

By default, the docker run command above will always pull before running to get the latest image changes for the specified tag. If you don't like this behavior, remove the --pull=always flag from the command.

Usage

Output of ./moviestills --help:

Usage: moviestills [--website WEBSITE] [--list] [--parallel PARALLEL] [--delay DELAY] [--async] [--timeout TIMEOUT] [--proxy PROXY] [--cache-dir CACHE-DIR] [--data-dir DATA-DIR] [--hash] [--debug] [--no-colors] [--no-style]

Options:
  --website WEBSITE, -w WEBSITE
                         Website to scrap movie stills on [env: WEBSITE]
  --list, -l             List all available scrapers implemented [default: false, env: LIST]
  --parallel PARALLEL, -p PARALLEL
                         Limit the maximum parallelism [default: 2, env: PARALLEL]
  --delay DELAY, -r DELAY
                         Add some random delay between requests [default: 1s, env: RANDOM_DELAY]
  --async, -a            Enable asynchronus running jobs [default: false, env: ASYNC]
  --timeout TIMEOUT, -t TIMEOUT
                         Set the default request timeout for the scraper [default: 15s, env: TIMEOUT]
  --proxy PROXY, -x PROXY
                         The proxy URL to use for scraping [env: PROXY]
  --cache-dir CACHE-DIR, -c CACHE-DIR
                         Where to cache scraped websites pages [default: cache, env: CACHE_DIR]
  --data-dir DATA-DIR, -f DATA-DIR
                         Where to store movie snapshots [default: data, env: DATA_DIR]
  --hash                 Hash image filenames with md5 [default: false, env: HASH]
  --debug, -d            Set Log Level to Debug to see everything [default: false, env: DEBUG]
  --no-colors            Disable colors from output [default: false, env: NO_COLORS]
  --no-style             Disable styling and colors entirely from output [default: false, env: NO_STYLE]
  --help, -h             display this help and exit
  --version              display version and exit

Note: CLI arguments will always override environment variables. Therefore, if you set WEBSITE as an environment variable and also use —website as a CLI argument, only the latter will be passed to the app.

For boolean arguments such as --async or --debug, their equivalent as environment variables is, for example, ASYNC=false or ASYNC=true.

Examples

The simple goal of this CLI app is to scrap movie snapshots from a few supported websites. You can use either the binary or the Docker image to run moviestills.

See supported websites

To see the list of supported websites on which we can download movie snapshots, you can do:

# see all supported websites
./moviestills --list
# or
LIST=true ./moviestills

The list of supported websites with way more details can also be seen below.

Scrap a website with asynchronous jobs

Asynchronous jobs make the scraping faster. You can also increase the maximum number of simultaneous requests by changing the --parallel flag or the PARALLEL environment variable (set to 2 by default).

# with CLI arguments
./moviestills --website film-grab --async

# with environment variables
WEBSITE=film-grab ASYNC=true ./moviestills

# increase to 10 the maximum number of simultaneous requests
./moviestills --website dvdbeaver --async --parallel 10

Docker, hashed filenames, random delays and no colors

A complete example that scraps a website with:

  • Docker ;
  • Asynchronous jobs ;
  • Random delays between requests ;
  • Hashed filenames ;
  • No colors on the terminal output.
# docker with CLI arguments
docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --rm ghcr.io/kinoute/moviestills:latest \
    --website movie-screencaps \
    --async \
    --delay 5s \
    --hash \
    --no-colors

With Docker, settings can also be set with environment variables instead of CLI arguments with the —env or -e flag.

Proxies

You can set up a proxy URL to use for scraping using the --proxy CLI agument or the PROXY environment variable. At the moment, you can set only one proxy but the app might support multiple proxies in a round robin fashion later.

Cache

By default, every scraped page will be cached in the cache folder. You can change the name or path to the folder through the options with —cache-dir or the CACHE_DIR environment variable. This is an important folder as it stores everything that was scraped.

It avoids requesting again some websites pages when there is no need to. It is a nice thing as we don't want to flood these websites with thousands of useless requests. It is also handy to continue an early-stopped scraping job.

In case you are using our Docker image to run moviestills, don't forget to change the volume path to the new internal cache folder, if you set up a custom internal cache folder. But you should not bother editing this internal cache folder anyway, since you have volumes and can set the desired path on your host machine for the cache folder.

Data

By default, each scraped website will have its own subfolder in the data folder. Inside, every movie will have its own folder with the scraped movie snapshots found on the website.

Example:

data # where to store movie snapshots
├── blubeaver # website's name
│   ├── 12\ Angry\ Men # movie's title
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_1.jpg
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_2.jpg
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_3.jpg

You can change the default data folder with the —data-dir CLI argument or the DATA_DIR environment variable.

If you use our Docker image to run moviestills, don't forget to change the volume path in case you edited the internal data folder. Again, you should not even bother editing the internal data folder's path or name anyway as you have volumes to store and get access to these files on the host machine.

Hash filenames

To get some consistency, you can use the MD5 hash function to normalize image filenames. All images will then use 32 hexadecimal digits as filenames. To enable the hashing, use the —hash CLI argument or the HASH=true environment variable.

Supported Websites

As today, scrapers were implemented for the following websites in moviestills:

Website Simplified Name 1 Description Movies 3
BluBeaver blubeaver Extension of DVDBeaver, this time dedicated to Blu-Ray reviews only. Reviews are great to check the quality of BD releases with lot of technical details. Only snapshots on "free" access are scraped. ~5069
BlusScreens bluscreens Website with high resolution screen captures taken directly from different Blu-ray releases by Blusscreens. ~452
DVDBeaver dvdbeaver Not recommended. 2 A massive list of DVD reviews with a lot of movie snapshots. This task only includes the DVD reviews. BD snapshots are available in BluBeaver. It is advised to use the latter instead. ~4098
EvanERichards evanerichards A short but interesting list of movies with a lot of snapshots for each. Also includes some TV Series but they are ignored by the scraper. ~245
Film-Grab film-grab A great list of movies with a few snapshots for each. Snapshots were cherry-picked and show nice cinematography. ~2829
HighDefDiscNews highdefdiscnews A few hundreds movies featured with high-quality snapshots (png, lossless) in native resolution. ~209
Movie-Screencaps movie-screencaps Website with DVD, BD, and 4K BD movie snapshots. Since thousands of snapshots are available for each movie (one per second or so), we only take some of them per paginated page. ~715
ScreenMusings screenmusings A small list of movies but with nice cherry-picked snapshots. ~260
StillsFrmFilms stillsfrmfilms A very small list of movies but, again, the snapshots were nicely chosen and depict perfectly the atmosphere of each movie. ~63

1 : The simplified name column displays the website's name to use with moviestills to start the scraping job. Eg —website blubeaver for the BluBeaver.ca website.

2 : While DVDBeaver provides a lot of movie snapshots from great DVD reviews, it is harder to filter correctly the images on the reviews pages. Expect a lot of false positives (DVD covers, banners etc) and average quality overall.

3 : Approximate number of movies calculated on October 5th, 2021.

Contribute: If you want to fix or add a new website to the scraper, please read how to set up a development workflow and how to contribute.

Development

Clone the repo

If you already have Go installed – 1.22+ supported –, you can simply clone the repo and run the application from the folder:

# clone the repo
git clone https://github.com/kinoute/moviestills

# go inside the project
cd moviestills

Then:

# download the dependencies
go mod download
go mod tidy

# you can run the app without compiling
go run . --help

# optional: compile the binary
go build -v

# use the compiled app (example on Linux)
./moviestills --help

Docker

If you don't have Go installed, you can build a Docker Image to start developing in a Go environment. To do that, clone the repository and inside the folder:

# build development image
docker build --tag moviestills-dev . \
    --target base

Then start and go inside the container:

# start the development container
# and go inside it
docker run \
    --name moviestills-dev \
    --volume "${PWD}:/app" \
    --interactive \
    --tty \
    --rm moviestills-dev

A volume is created to report any change you make to the code inside the container.

To run your code, you can use go run . inside the container, test it, build it etc. Do note that these commands might be slow at first.

Build final image

You can also build locally the final image of the app to test it. To do that, run in the repository:

# build final image with binary
docker build -t moviestills .

# test it
docker run \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --rm moviestills:latest \
    --website film-grab \
    --debug 

Contribute

This is my first public project in Golang. Therefore, pull requests, suggestions or bug reports are appreciated. A major refactoring is not excluded since I'm still learning the language.

A lot of things included here might look overkill for such a small app (linting, packages etc) but I thought setting a complete workflow would be interesting. Don't hesitate to make it better!

Add a new website

You can contribute to this scraper by adding a new website which provides high-quality movie snapshots. To do that, there are four steps:

  1. Create a new file in the websites folder with the simplified name of the website (eg. yahoo.go). Check how other websites were implemented and scraped with the Colly library. You need to define a main URL as a constant (the first URL that will get visited) and a main function such as yahooScraper() where your Colly logic will do the work.
  2. Once you created a scraper for a website, you need to add this new scraper in the available options of the app in main.go. Edit the sites map by adding the simplified name of the website as a key and the scraper's function as a value (eg. websites.yahooScraper).
  3. Create a unit test for the website, eg yahoo_test.go. For that test, we are not going to test with Colly but only with GoQuery, a library that makes HTML/CSS parsing easy, on which Colly is based. We just want to make sure the CSS selectors we use in our scraper are still up-to-date and are still filtering correctly the data we are looking for.
  4. Edit the Supported Websites table in the README file and write detailed informations about the website you added – please make sure websites are sorted alphabetically in the table.

Support

Most of the websites we are scraping are owned by individuals who just want to share nice movie snapshots. Please don't abuse these websites and limit your scraping activity to not arm them in any way.

You can support some of the webmasters behind these nice galleries:

  • The owner of DVDBeaver/BluBeaver has a Patreon page where you can support him and also get access to a lot more movie snapshots in great quality, which we don't scrap. Here is the link: https://www.patreon.com/dvdbeaver ;
  • The man behind Film-Grab also has a Patreon page where you can send him some love: https://www.patreon.com/filmgrab.

I couldn't find any support page for the other websites but you can support some of them by using their affiliated links for example.

Just be gentle while scraping: some are hosting the images on their own servers!

Credits