Inspired by the Six Degrees of Kevin Bacon and The Oracle of Bacon.
How many degrees of separation are there between two movie actors? MovieGraph builds a graph from the available IMDb datasets and uses breadth-first search (BFS) to find the shortest path between two actors. Find it at http://movie-graph.com.
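As a rough illustration of the BFS idea, here is a minimal sketch on an in-memory graph. It is not MovieGraph's actual implementation: the function name `shortest_path`, the dict-of-sets graph layout, and the toy actor labels are all assumptions made for this example.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest list of actors linking start to goal, or None."""
    if start == goal:
        return [start]
    # parent[x] records the actor from which x was first reached.
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor in parent:
                continue  # already reached by an equal or shorter route
            parent[neighbor] = current
            if neighbor == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical toy graph: each actor maps to the set of their co-stars.
toy_graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),
}
path = shortest_path(toy_graph, "A", "D")
degrees = len(path) - 1  # number of co-star links on the path
```

Because BFS explores actors in order of distance from the start, the first time it reaches the goal it has found a path with the fewest links. Disconnected actors (like "E" here) yield `None`.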
IMDb's datasets don't include images, so MovieGraph fetches images from TMDb using tmdbsimple.
Information on the available datasets can be found on IMDb's website. The zipped TSV files can be downloaded directly from https://datasets.imdbws.com/.
-
Clone the repo and cd into the root project directory.
-
Download the following datasets from IMDb into the data directory (./scores/data):
-
Set up and activate a virtual environment. You can set one up with Python 3 and, on macOS or Linux, activate it from your shell:
python3 -m venv /path/to/virtual/env
source /path/to/virtual/env/bin/activate
-
Install the project dependencies:
pip install -r requirements.txt
-
Create a .env file in the moviegraph subdirectory (not the root project directory) to set up environment variables (touch ./moviegraph/.env) and add the following line to the file:
DEBUG=True
The .env file is read from both settings.py (in the same directory) and the scores app. This means that when the scores app reads .env, it generates a warning that the .env file is in a different directory from the file that's reading it; you can ignore this. See the django-dotenv docs for more information.
-
Obtain a Django Secret Key using MiniWebTool and add it to your .env file:
SECRET_KEY=<secret key goes here>
-
Obtain an API key from TMDb and add it to your .env file:
TMDB_API_KEY=<api key goes here>
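To make the KEY=VALUE format concrete, here is a small standard-library sketch of how such lines can be read. MovieGraph itself relies on django-dotenv for this, so the `parse_env` helper below is purely hypothetical and only illustrates the file format.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Placeholder contents mirroring the lines described above.
sample = (
    "DEBUG=True\n"
    "SECRET_KEY=<secret key goes here>\n"
    "TMDB_API_KEY=<api key goes here>\n"
)
env = parse_env(sample)
# Every value arrives as a string, so flags need an explicit comparison.
debug = env.get("DEBUG") == "True"
```

Note that values read this way are always strings, which is why a flag such as DEBUG has to be compared against "True" rather than used directly.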
-
Create a postgres database for the project (Digital Ocean has a good walkthrough of the process) and add it to your .env file:
- Log into postgres:
psql
- Create a database:
CREATE DATABASE <databasename>;
- Create a user:
CREATE USER <user> WITH PASSWORD '<password>';
- Modify connection parameters for the user:
ALTER ROLE <user> SET client_encoding TO 'utf8';
ALTER ROLE <user> SET default_transaction_isolation TO 'read committed';
ALTER ROLE <user> SET timezone TO 'UTC';
- Grant user access rights to the database:
GRANT ALL PRIVILEGES ON DATABASE <databasename> TO <user>;
- Grant user permission to create a database (for testing):
ALTER USER <user> CREATEDB;
- Add the database URL to .env:
DATABASE_URL=postgresql://<user>:<password>@localhost:5432/<databasename>
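The DATABASE_URL packs the user, password, host, port, and database name into one string. How settings.py consumes it is not shown here, so the following is only a sketch, using the standard library, of how such a URL maps onto the keys of a Django DATABASES entry. The `parse_database_url` helper and the sample credentials are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def parse_database_url(url):
    """Split a postgresql:// URL into Django-style connection settings."""
    parts = urlsplit(url)
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": parts.path.lstrip("/"),
        # Credentials may be percent-encoded if they contain special characters.
        "USER": unquote(parts.username or ""),
        "PASSWORD": unquote(parts.password or ""),
        "HOST": parts.hostname or "",
        "PORT": str(parts.port or ""),
    }


# Made-up credentials, for illustration only.
config = parse_database_url("postgresql://bacon:s3cret@localhost:5432/moviegraph")
```

If the password contains characters such as "@" or "/", it needs to be percent-encoded in the URL for this kind of parsing to work.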
-
Set up static files to run locally by commenting out the following line in settings.py:
STATICFILES_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
MovieGraph is set up to load its static files from S3 for deployment. To continue using S3, specify the appropriate values for AWS_STORAGE_BUCKET_NAME and AWS_S3_REGION_NAME in settings.py. Also make sure to add the appropriate AWS access keys to .env:
AWS_ACCESS_KEY_ID=<access key id>
AWS_SECRET_ACCESS_KEY=<secret access key>
You'll also need to configure your S3 bucket to allow access from the Django app.
-
Run migrations:
python manage.py makemigrations
python manage.py migrate
-
Load IMDb data from tsv files into the database:
python manage.py importdata names
python manage.py importdata titles
python manage.py importdata principals
N.B. Because only partial datasets are available, some of the title ids and name ids in the join table are not found in the Name and Title tables when you try to import principals. If this happens, uncomment the code in load_principals() that checks every entry and prints a set of bad name ids and a set of bad title ids, then run the command to load principals (this takes about a day). Once this has completed, replace the existing constants, BAD_NAMES and BAD_TITLES, in importdata.py with the new information, comment out the code again, and rerun the command.
-
Generate and load graph into the database:
-
Create a csv file of the graph, graph.csv, in the same directory as the tsv files:
python manage.py importgraph
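The importgraph command itself is not reproduced here. As a sketch of the general idea, the code below turns (title id, name id) pairs into co-star rows and writes them tab-delimited in the column order id, star_id, costar_id, titles. The function names, the comma-joined encoding of the titles column, and the sample ids are all assumptions for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import permutations


def build_edges(principals):
    """Map each (star_id, costar_id) pair to the sorted titles they share."""
    cast = defaultdict(set)
    for title_id, name_id in principals:
        cast[title_id].add(name_id)
    edges = defaultdict(set)
    for title_id, names in cast.items():
        # permutations yields both (a, b) and (b, a), so every actor
        # appears in the star column at least once.
        for star, costar in permutations(sorted(names), 2):
            edges[(star, costar)].add(title_id)
    return {pair: sorted(titles) for pair, titles in edges.items()}


def write_graph(edges, out):
    """Write one tab-delimited row per directed co-star pair."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    rows = enumerate(sorted(edges.items()), start=1)
    for row_id, ((star, costar), titles) in rows:
        # Assumption: titles are stored as a comma-joined string.
        writer.writerow([row_id, star, costar, ",".join(titles)])


# Made-up (title id, name id) pairs.
principals = [
    ("tt1", "nm1"), ("tt1", "nm2"),
    ("tt2", "nm1"), ("tt2", "nm2"), ("tt2", "nm3"),
]
edges = build_edges(principals)
buffer = io.StringIO()
write_graph(edges, buffer)
first_line = buffer.getvalue().splitlines()[0]
```

Storing each pair in both directions doubles the row count, but it means a lookup by star_id alone returns all of an actor's co-stars.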
-
Load the graph into the database by copying the csv file directly into postgres:
- Log into postgres with
psql <databasename>
- Run the following command:
COPY scores_graph(id, star_id, costar_id, titles) FROM 'full/path/to/file.csv' WITH DELIMITER E'\t';
-
Update the in_graph column in the Name table for those actors who are in the graph:
- In postgres:
UPDATE scores_name SET in_graph=True WHERE scores_name.id IN (SELECT DISTINCT(star_id) FROM scores_graph);
-
IMDb's datasets include only the principal cast, not the full cast, so Bacon scores won't always match (and can be higher than) what you find on Google, the Oracle of Bacon, and similar sources.
-
How do you find the right Emma Stone in IMDb?
There's no good fix for this. Right now, MovieGraph just searches for a movie actor named Emma Stone whose birth date is not null, and then takes the first result if there's more than one Emma Stone.
Filtering by movie actors and then by non-null birth dates means that there's often only one result returned, but this is still limiting. What if you wanted to expand the search beyond actors? Or what if you wanted the lesser-known Emma Stone, who was in White Angel (and who doesn't have a birth date in IMDb)?
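The selection rule described above can be sketched as a simple filter. This is not MovieGraph's actual query: the `pick_actor` function, the dictionary field names, and the sample records are hypothetical and exist only to show the behavior and its limitation.

```python
def pick_actor(candidates, name):
    """Return the first actor with this name and a known birth date, or None."""
    acting = {"actor", "actress"}
    matches = [
        person for person in candidates
        if person["name"] == name
        and person["birth_date"] is not None
        and acting & set(person["professions"])
    ]
    # With several matches, the first one wins, as described above.
    return matches[0] if matches else None


# Made-up records; ids and dates are placeholders.
people = [
    {"id": "nm_a", "name": "Emma Stone", "birth_date": None,
     "professions": ["actress"]},
    {"id": "nm_b", "name": "Emma Stone", "birth_date": "1988",
     "professions": ["actress", "producer"]},
    {"id": "nm_c", "name": "Emma Stone", "birth_date": "1970",
     "professions": ["writer"]},
]
chosen = pick_actor(people, "Emma Stone")
```

Here the record without a birth date ("nm_a") can never be selected, which is exactly the limitation noted above for the lesser-known Emma Stone.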
-
How do you find the right Emma Stone in TMDb (images)?
Ideally, when pulling info from the API, you'd match against the IMDb id, but the tmdbsimple API wrapper doesn't return the IMDb ids for titles or names. It returns results based on popularity (so if there's more than one Spencer Tracy, MovieGraph assumes you're looking for the most famous/popular one).
To fix this, MovieGraph would need to implement its own API wrapper that incorporates the IMDb id strings into its interface. There's an existing API wrapper that does this, but it isn't compatible with Python 3.