
iMADS

Website for searching and creating transcription factor binding predictions and preferences. It searches prediction and preference data by gene lists and custom genomic ranges, and creates predictions/preferences for user-uploaded DNA sequences.

Major Components

Predictions Config File

imadsconf.yaml - this config file determines what data will be downloaded and how the prediction/preference database will work
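
For a quick look at what load.py will act on, you can load the config in Python. This is just a sketch that inspects whatever keys are present; it assumes PyYAML is installed and makes no claims about the config's actual schema.

```python
# Minimal sketch: inspect imadsconf.yaml without assuming its schema.
# Assumes PyYAML is installed and the file sits in the repository root.
import yaml

with open("imadsconf.yaml") as handle:
    conf = yaml.safe_load(handle)

# List the top-level keys that drive load.py and webserver.py.
for key in sorted(conf):
    print(key, "->", type(conf[key]).__name__)
```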

Predictions Database

Postgres database containing indexed gene lists, custom user data, and prediction/preference data for use by webserver.py
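
As an illustration only, the sketch below connects to the 'pred' database and lists its tables via the Postgres catalog. The host and user are placeholders (not values taken from this repository), and the password is read from the same DB_PASS_ENV variable used in .env.

```python
# Hypothetical sketch: list tables in the 'pred' database using psycopg2.
# Host and user are placeholders -- match them to your docker-compose setup.
import os
import psycopg2

conn = psycopg2.connect(
    host="localhost",          # assumption: database is exposed locally
    dbname="pred",
    user="postgres",           # assumption: adjust to your configuration
    password=os.environ["DB_PASS_ENV"],
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
```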

Database Loading Script

load.py - downloads files and loads the database based on imadsconf.yaml

Webserver

webserver.py serves the web portal and the API for accessing the 'pred' database

Database Vacuum Script

vacuum.py deletes old user data from the 'pred' database

Web Portal

The portal/ directory contains the React.js project that builds static/js/bundle.js for webserver.py to serve.

Custom Prediction/Preference Worker

Calculates predictions and preferences for user-uploaded sequences. See https://github.com/Duke-GCB/iMADS-worker.

Running

Deployment

We deploy using the imads.yml playbook from https://github.com/Duke-GCB/gcb-ansible.

Run via docker-compose

Download docker-compose.yml and .env_sample. Rename .env_sample to .env, then change DB_PASS_ENV and POSTGRES_PASSWORD to the password you want. Start the database and webserver:

docker-compose up -d

Populate the database (this can take quite a while, depending on imadsconf.yaml):

docker-compose run --no-deps --rm web python load.py

Javascript unit tests

Requires mocha and chai. Setup:

cd portal
npm install -g mocha
npm install --dev

To run:

cd portal
npm run test

Python unit tests

From the root directory run this:

nosetests

Integration tests are skipped by default (they are run by CircleCI). See skip_postgres_tests in tests/test_integration.py for instructions on running them manually.

Config file updates

The util directory contains a Python script for updating the config file. It can be run like so:

cd util
python create_conf.py

This will look up the latest predictions based on the DATA_SOURCE_URL in create_conf.yaml. If you want to add a new gene list, you will need to update GENOME_SPECIFIC_DATA in create_conf.yaml.
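
If you want to see what create_conf.py will use before running it, a small sketch like the one below reads create_conf.yaml and prints the two settings mentioned above. The assumption that GENOME_SPECIFIC_DATA maps genome names to their entries is illustrative and may not match the file's exact layout.

```python
# Sketch: peek at the settings create_conf.py reads (run from the util directory).
# The nested structure of GENOME_SPECIFIC_DATA shown here is an assumption.
import yaml

with open("create_conf.yaml") as handle:
    settings = yaml.safe_load(handle)

print("Latest predictions are looked up from:", settings["DATA_SOURCE_URL"])
for genome, entries in settings["GENOME_SPECIFIC_DATA"].items():
    print(genome, "->", type(entries).__name__)
```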

Data provenance

This database consists of datasets generated using the following programs:

Binding Predictions

Binding predictions were generated for each transcription factor on both the FASTA-formatted hg19 and hg38 genome assemblies, using predict_tf_binding.py from https://github.com/Duke-GCB/Predict-TF-Binding. The work was divided to run the program once for each combination of (a sketch enumerating this job matrix follows the list):

  • Genome Assembly (hg19, hg38)
  • Chromosome (chr1, chr2, chr3, ...)
  • Model/core combination (E2f1 GCGC, E2f1 GCGG, E2f4 GCGC, E2f4 GCGG, ...)
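
To make the fan-out concrete, here is a small Python sketch that enumerates the job matrix. The chromosome list and model/core pairs are illustrative examples; the actual runs were driven by the project's own tooling, not by this snippet.

```python
# Illustrative only: enumerate the (assembly, chromosome, model/core) job matrix.
# Chromosome and model/core lists are examples, not the authoritative set.
from itertools import product

assemblies = ["hg19", "hg38"]
chromosomes = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]
model_cores = [("E2f1", "GCGC"), ("E2f1", "GCGG"), ("E2f4", "GCGC"), ("E2f4", "GCGG")]

jobs = [
    (assembly, chrom, model, core)
    for assembly, chrom, (model, core) in product(assemblies, chromosomes, model_cores)
]
print(len(jobs), "prediction runs, e.g.", jobs[0])
```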

Configuration arguments for the model/core combinations are taken from tracks-predictions.yaml. Each invocation of predict_tf_binding.py produced a BED-format file containing the genomic coordinates and the probability (score) that the TF in question binds at that site.
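
The per-chromosome BED output can be read with a few lines of Python. This sketch assumes the binding probability sits in the standard fifth BED column, and the filename shown is hypothetical.

```python
# Sketch: read genomic coordinates and binding probabilities from a BED file.
# Assumes the score is the standard fifth BED column; adjust the index if the
# Predict-TF-Binding output places it elsewhere. The path below is hypothetical.
import csv

def read_predictions(path):
    with open(path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            chrom, start, end = row[0], int(row[1]), int(row[2])
            score = float(row[4])
            yield chrom, start, end, score

# Example: print the ten highest-scoring candidate binding sites.
top_sites = sorted(read_predictions("E2f1_GCGC_chr1.bed"),
                   key=lambda site: site[3], reverse=True)[:10]
for site in top_sites:
    print(*site)
```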

These per-chromosome and per-model/core files were combined to produce a single bigBed format file for each transcription factor on each assembly (hg19 E2f1, hg38 E2f1, hg19 E2f4, hg38 E2f4), using a CWL workflow: bigbed-workflow-no-resize.cwl in https://github.com/Duke-GCB/TrackHubGenerator/.

The browser tracks are published at http://trackhub.genome.duke.edu/gordanlab/tf-dna-binding-predictions/. Scores from these tracks are ingested using load.py.

Binding Preferences

Binding preferences were generated for the pairs of transcription factors within a family, as enumerated in predict-TF-preference.R. The preference data are derived from the prediction data, starting with the BED-format files generated by predict_tf_binding.py.

The collections of per-assembly, per-chromosome, and per-model/core files were fed into predict-TF-preference.R in https://github.com/Duke-GCB/Predict-TF-Preference.

The preference scores were generated for each of the pairs using a CWL workflow: preference-bigbed-workflow.cwl in https://github.com/Duke-GCB/TrackHubGenerator/. This workflow considers the binding prediction at each site, determines preference using predict-TF-preference.R, and filters out insignificant preferences. This produced a bigBed format track, containing the preference score at each site.

The browser tracks are published at http://trackhub.genome.duke.edu/gordanlab/tf-dna-preferences/. Scores from these tracks are ingested using load.py.