This document describes the architecture of GN2. Because GN2 is evolving, only a high-level overview is given here.
Reproducible data analysis and software interoperability should be key goals for any system that aims to bring research groups together. These goals are increasingly relevant with growing data sizes and increasingly complex analysis pipelines. Rigor, reproducibility, and robustness start with data that abide by the Findable, Accessible, Interoperable, and Re-usable (FAIR) principles (see the Wilkinson Nature paper on FAIR Guiding Principles for scientific data management and stewardship).
GeneNetwork (GN2) solves this by assigning unique identifiers (cryptographic HASH values calculated over immutable data content), including these values in file or directory names, and making them available through web interfaces (e.g., through a REST API). This means that at any point in the future the exact same data can be retrieved using a known, unchangeable identifier (see also https://github.com/pjotrp/genenetwork2/blob/staging/doc/submit-data.org).
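The idea of content-derived identifiers can be sketched in a few lines of Python. This is a minimal illustration and not GN2's actual code: the choice of SHA-256, the 12-character prefix, and the naming scheme (stem, hash prefix, suffix) are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of the data content (assumed hash function)."""
    return hashlib.sha256(data).hexdigest()


def hashed_name(filename: str, data: bytes, length: int = 12) -> str:
    """Embed a prefix of the content hash in the file name (hypothetical scheme)."""
    path = Path(filename)
    return f"{path.stem}-{content_hash(data)[:length]}{path.suffix}"


# Identical content always yields the same name; changed content yields a new one.
name_a = hashed_name("BXD.geno", b"@name:BXD\n")
name_b = hashed_name("BXD.geno", b"@name:BXD\n")
name_c = hashed_name("BXD.geno", b"@name:BXD2\n")
```

Because the name depends only on the content, two sites that hold the same bytes arrive at the same identifier without coordinating with each other.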
Synchronisation, integrity checking and backups become trivial using these HASH values, even for very large datasets. Since everything is managed at the file system level we can also use Unix authorisation systems. HIPAA compliance is achieved by using HASH references and bringing the software into the controlled HIPAA environment.
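An integrity check of this kind could look roughly as follows. It is a sketch under assumptions: SHA-256 stands in for whatever hash GN2 uses, and the function names `file_digest` and `is_intact` are hypothetical. The file is read in chunks so that large datasets need not fit in memory.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file chunk by chunk and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected: str) -> bool:
    """True if the file content still matches the recorded identifier."""
    return file_digest(path) == expected


# Usage: record the digest once, then re-check after a copy, sync or restore.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "data.csv")
    with open(original, "wb") as handle:
        handle.write(b"marker,BXD1\nrs6269442,B\n")
    recorded = file_digest(original)
    intact = is_intact(original, recorded)
    with open(original, "ab") as handle:
        handle.write(b"modified\n")
    still_intact = is_intact(original, recorded)
```

Synchronisation then reduces to comparing identifiers: only files whose digests differ need to be transferred.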
In the context of GeneNetwork we are using git for version control of software source code (https://github.com/genenetwork/). Software can be treated just like data: git uses HASH identifiers to retrieve specific versions of source. In other words, versions of source code are identifiable and retrievable and can be matched with data into an analysis pipeline. The combination of software and data, again, makes a unique HASH value which identifies the analysis pipeline.
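How a software identifier and a data identifier combine into one pipeline identifier can be illustrated as below. The text does not say how GN2 derives this value, so the function `pipeline_id`, the use of SHA-256, and the colon separator are assumptions for the sake of the example.

```python
import hashlib


def pipeline_id(software_id: str, data_id: str) -> str:
    """Derive one identifier from a software identifier and a data identifier.

    Both inputs are assumed to be hex strings, for example a git commit id and
    a content hash of the data. The separator keeps ("ab", "c") and ("a", "bc")
    from producing the same combined string.
    """
    combined = f"{software_id}:{data_id}".encode("utf-8")
    return hashlib.sha256(combined).hexdigest()


# Made-up identifiers, for illustration only.
pid = pipeline_id("a1b2c3", "d4e5f6")
```

Changing either the source version or the data changes the resulting identifier, which is what makes it usable as a name for the pipeline as a whole.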
For combining runnable software and data into an analysis pipeline we use GNU Guix which, yet again, turns everything into a unique HASH value which allows for exact retrieval and reproducibility. Not only that, GNU Guix gives control of the software and all its dependencies, calculating a HASH value for all dependencies, all the way down to versions of R, BLAS and glibc. This way of packaging software ensures that identical software pipelines are easily set up on different systems or in the Cloud, meaning that everyone ends up using the exact same combination of software versions in a pipeline.
For software development we use GNU Guix for integration testing and deployment (described in the JOSS paper). We also use automated test tools (Ruby mechanize) for integration testing of the web services and we use unit testing of all backend services. All our software source code is published as ‘free and open source software’ (FOSS), which means that anyone can view the code on GitHub, comment on it, or even contribute to it. GeneNetwork is becoming increasingly modular and has a growing number of contributors who subscribe to the principles of THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS (https://github.com/pjotrp/bioinformatics), which we drew up and which was signed by over fifty bioinformaticians.
The main GN2 webserver is built on Python Flask and this GN2 source code can be found on GitHub in the wqflask directory. The routing tables are defined in views.py. For example, the main page is loaded from a template named index_page.html in the templates directory. In the template you can see that the form gets filled by a JavaScript routine defined in data_select_menu.js, which picks up a static JSON file for the menu. This static file is generated with gen_select_dataset.py. Note that this JSON data is served by gn_server in the latest version, see GnServer (REST).
When you hit a search with, for example, ‘http://localhost:5003/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=&search_terms_and=MEAN%3D%2815+16%29+LRS%3D%2823+46%29+&FormID=searchResult’ it has the menu items as parameters. According to the routing table, the search is executed and Redis caching is used (we’ll probably change that to the level of the gn_server). The logic is in search_result.py which invokes database functions in wqflask/dbFunction/webqtlDatabaseFunction.py, for example. The receiving template lives at search_result_page.html.
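To see which menu items and search terms travel in that URL, the query string can be decoded with Python's standard library. This only illustrates the request format; it is not how views.py or search_result.py process the request.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "http://localhost:5003/search?species=mouse&group=BXD"
    "&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or="
    "&search_terms_and=MEAN%3D%2815+16%29+LRS%3D%2823+46%29+"
    "&FormID=searchResult"
)

parts = urlsplit(url)
# keep_blank_values=True retains the empty search_terms_or parameter.
decoded = parse_qs(parts.query, keep_blank_values=True)
# Each parameter occurs once here, so keep only the first value.
params = {key: values[0] for key, values in decoded.items()}

for key, value in params.items():
    print(f"{key} = {value!r}")
```

The decoded `search_terms_and` value is `MEAN=(15 16) LRS=(23 46) `, that is, two range terms separated by spaces.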
For what happens at the database level see database.org.
A view consists of an HTML template with JS libraries for managing menus, tables etc. For example, for the search results see search_result_page.html, which is a Flask template. The first section puts the search in plain English, e.g. ‘We searched Hippocampus Consortium M430v2 (Jun06) PDNN to find all records with MEAN between 15 and 16 and with LRS between 23 and 46.’ Then the results are added to a table which is displayed using a JS DataTable container.
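Turning range terms into such a sentence can be sketched as follows. This is an illustration only: the regular expression and the function `describe_search` are hypothetical and do not reproduce the logic of the actual template or of search_result.py. It handles only terms of the form `NAME=(low high)`.

```python
import re

# Matches terms such as "MEAN=(15 16)": a name followed by a lower and upper bound.
RANGE_TERM = re.compile(r"(\w+)=\(\s*([\d.]+)\s+([\d.]+)\s*\)")


def describe_search(dataset_name: str, search_terms: str) -> str:
    """Render range terms as an English sentence (illustrative only)."""
    clauses = [
        f"{name} between {low} and {high}"
        for name, low, high in RANGE_TERM.findall(search_terms)
    ]
    return (
        f"We searched {dataset_name} to find all records with "
        + " and with ".join(clauses)
        + "."
    )


sentence = describe_search(
    "Hippocampus Consortium M430v2 (Jun06) PDNN",
    "MEAN=(15 16) LRS=(23 46) ",
)
print(sentence)
```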
The GnServer REST API is built on high-performance Elixir with Maru. GnServer mainly serves JSON requests, for example to fetch data from the database. To get the menu data in YAML you can do something like
curl localhost:8880/int/menu/main.json|ruby extra/json2yaml.rb
(json2yaml.rb is in the gn_server repo). For the current API definition see GnServer REST API documentation.
GnExec, also written in Elixir, executes commands using a separate daemon.
Phenotypes are stored in the SQL database. For what happens at the database level see database.org. A test database can be downloaded - see the installation instructions.
Genotypes are stored in genotype files. These are part of the GNU Guix distribution, see the installation instructions. Genotype files are currently in GN1 format, and will be aligned with the R/qtl2 formats.
The GN1-style format (still the default in GN2) for the stored file BXD.geno looks like:
@name:BXD
@type:riset
@mat:B
@pat:D
@het:H
@unk:U
Chr Locus cM Mb BXD1 BXD2 BXD5 BXD6 BXD8 BXD9 BXD11 BXD12 BXD13 BXD14 BXD15 BXD16 BXD18 BXD19 BXD20 BXD21 BXD22 BXD23 BXD24a BXD24 BXD25 BXD27 BXD28 BXD29 BXD30 BXD31 BXD32 BXD33 BXD34 BXD35 BXD36 BXD37 BXD38 BXD39 BXD40 BXD41 BXD42 BXD43 BXD44 BXD45 BXD48 BXD49 BXD50 BXD51 BXD52 BXD53 BXD54 BXD55 BXD56 BXD59 BXD60 BXD61 BXD62 BXD63 BXD64 BXD65 BXD66 BXD67 BXD68 BXD69 BXD70 BXD71 BXD72 BXD73 BXD74 BXD75 BXD76 BXD77 BXD78 BXD79 BXD80 BXD81 BXD83 BXD84 BXD85 BXD86 BXD87 BXD88 BXD89 BXD90 BXD91 BXD92 BXD93 BXD94 BXD95 BXD96 BXD97 BXD98 BXD99 BXD100 BXD101 BXD102 BXD103
1 rs6269442 0.0 3.482275 B B D D D B B D B B D D B D D D D B B B D B D D B B B B B B B B B D B D B B D B B H H B D B B H H B B D D D D D B B H B B B B D B D B D D D D D H B D D B D B B D D B D D B B B B B B B D
1 rs6365999 0.0 4.811062 B B D D D B B D B B D D B D D D D B B B D B D D B B B B B B B B B D B D B B D B B H H B D B B H H B B D D D D D B B H B B B B D B D B D D D D D H B D D B D B B D D B D D B B B B B B U D
...
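Reading this layout can be sketched in Python as follows. It is a minimal parser written for illustration, not the loader used by GN2: the function name `parse_geno` is hypothetical, fields are assumed to be whitespace-separated, and the embedded sample is cut down to the first three BXD columns of the listing above.

```python
from typing import Dict, List, Tuple

SAMPLE = """\
@name:BXD
@type:riset
@mat:B
@pat:D
@het:H
@unk:U
Chr Locus cM Mb BXD1 BXD2 BXD5
1 rs6269442 0.0 3.482275 B B D
1 rs6365999 0.0 4.811062 B B D
"""


def parse_geno(text: str) -> Tuple[Dict[str, str], List[str], List[dict]]:
    """Split GN1-style .geno text into metadata, sample names, and marker rows."""
    metadata: Dict[str, str] = {}
    samples: List[str] = []
    markers: List[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Metadata lines have the form @key:value.
            key, _, value = line[1:].partition(":")
            metadata[key] = value
        elif not samples:
            # First non-metadata line is the header: Chr Locus cM Mb <samples...>
            samples = line.split()[4:]
        else:
            chrom, locus, cm, mb, *calls = line.split()
            markers.append(
                {"chr": chrom, "locus": locus, "cM": float(cm),
                 "Mb": float(mb), "calls": calls}
            )
    return metadata, samples, markers


metadata, samples, markers = parse_geno(SAMPLE)
```

The `@mat`, `@pat`, `@het` and `@unk` entries give the symbols used in the rows, so a caller can interpret each call without hard-coding B, D, H and U.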
In the method run_rqtl_geno, for example, this file gets loaded. For GnServer, however, we only want to deal with standardized R/qtl formatted data, so with gn_extra we convert the original format into R/qtl format using geno2rqtl, with one adaptation: the geno table is transposed and becomes
marker,BXD1,BXD2,BXD5,BXD6,BXD8,BXD9,BXD11,BXD12,BXD13,BXD14,BXD15,BXD16,BXD18,BXD19,BXD20,BXD21,BXD22,BXD23,BXD24a,BXD24,BXD25,BXD27,BXD28,BXD29,BXD30,BXD31,BXD32,BXD33,BXD34,BXD35,BXD36,BXD37,BXD38,BXD39,BXD40,BXD41,BXD42,BXD43,BXD44,BXD45,BXD48,BXD49,BXD50,BXD51,BXD52,BXD53,BXD54,BXD55,BXD56,BXD59,BXD60,BXD61,BXD62,BXD63,BXD64,BXD65,BXD66,BXD67,BXD68,BXD69,BXD70,BXD71,BXD72,BXD73,BXD74,BXD75,BXD76,BXD77,BXD78,BXD79,BXD80,BXD81,BXD83,BXD84,BXD85,BXD86,BXD87,BXD88,BXD89,BXD90,BXD91,BXD92,BXD93,BXD94,BXD95,BXD96,BXD97,BXD98,BXD99,BXD100,BXD101,BXD102,BXD103
1,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,B,D
2,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
3,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,D,D,B,B,H,H,B,B,B,B,H,H,B,B,D,D,D,D,B,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
...
That is, individuals are columns and markers are rows. Alternatively, with marker names as row identifiers, it could look like
marker,BXD1,BXD2,BXD5,BXD6,BXD8,BXD9,BXD11,BXD12,BXD13,BXD14,BXD15,BXD16,BXD18,BXD19,BXD20,BXD21,BXD22,BXD23,BXD24a,BXD24,BXD25,BXD27,BXD28,BXD29,BXD30,BXD31,BXD32,BXD33,BXD34,BXD35,BXD36,BXD37,BXD38,BXD39,BXD40,BXD41,BXD42,BXD43,BXD44,BXD45,BXD48,BXD49,BXD50,BXD51,BXD52,BXD53,BXD54,BXD55,BXD56,BXD59,BXD60,BXD61,BXD62,BXD63,BXD64,BXD65,BXD66,BXD67,BXD68,BXD69,BXD70,BXD71,BXD72,BXD73,BXD74,BXD75,BXD76,BXD77,BXD78,BXD79,BXD80,BXD81,BXD83,BXD84,BXD85,BXD86,BXD87,BXD88,BXD89,BXD90,BXD91,BXD92,BXD93,BXD94,BXD95,BXD96,BXD97,BXD98,BXD99,BXD100,BXD101,BXD102,BXD103
rs6269442,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,B,D
rs6365999,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,B,D,B,B,H,H,B,D,B,B,H,H,B,B,D,D,D,D,D,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
rs6376963,B,B,D,D,D,B,B,D,B,B,D,D,B,D,D,D,D,B,B,B,D,B,D,D,B,B,B,B,B,B,B,B,B,D,B,D,B,D,D,B,B,H,H,B,B,B,B,H,H,B,B,D,D,D,D,B,B,B,H,B,B,B,B,D,B,D,B,D,D,D,D,D,H,B,D,D,B,D,B,B,D,D,B,D,D,B,B,B,B,B,B,U,D
This is also the format provided by R/qtl in https://github.com/rqtl/qtl2data/tree/master/DO_Recla which we will use as the baseline for the REST server. In the meta JSON file the genotype data is tagged as transposed:
{
"description": "DO data from Recla et al. (2014) Mamm Genome 25:211-222",
"crosstype": "do",
"geno": "recla_geno.csv",
"geno_transposed": true,
"founder_geno": "recla_foundergeno.csv",
"founder_geno_transposed": true,
"genotypes": {
"1": "1",
"2": "2",
"3": "3"
},
"pheno": "recla_pheno.csv",
"pheno_transposed": false,
"covar": "recla_covar.csv",
"sex": {
"covar": "Sex",
"female": "female",
"male": "male"
},
"x_chr": "X",
"cross_info": {
"covar": "ngen"
},
"gmap": "recla_gmap.csv",
"pmap": "recla_pmap.csv",
"alleles": ["A", "B", "C", "D", "E", "F", "G", "H"]
}
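A reader of such files can use the `geno_transposed` flag to decide how to orient the table. The sketch below is an assumption-laden illustration, not R/qtl2 or GnServer code: the function `genotypes_by_individual` is hypothetical, a missing flag is treated as false, and the CSV excerpt reuses the first three BXD columns shown earlier instead of the Recla data.

```python
import csv
import io
import json
from typing import Dict

# Only the two keys needed here are taken from the meta file.
CONTROL = json.loads('{"geno": "recla_geno.csv", "geno_transposed": true}')

GENO_CSV = "marker,BXD1,BXD2,BXD5\nrs6269442,B,B,D\nrs6365999,B,B,D\n"


def genotypes_by_individual(csv_text: str, transposed: bool) -> Dict[str, Dict[str, str]]:
    """Return {individual: {marker: call}} regardless of the on-disk orientation."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0][1:], rows[1:]
    result: Dict[str, Dict[str, str]] = {}
    for row in body:
        row_id, values = row[0], row[1:]
        for col_id, call in zip(header, values):
            # Transposed: rows are markers and columns are individuals.
            individual, marker = (col_id, row_id) if transposed else (row_id, col_id)
            result.setdefault(individual, {})[marker] = call
    return result


table = genotypes_by_individual(GENO_CSV, CONTROL.get("geno_transposed", False))
```

Normalising to one in-memory orientation keeps the rest of the code independent of how a particular dataset happens to be stored.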
Meanwhile the gmap file looks like
marker,chr,pos,Mb
rs6269442,1,0.0,3.482275
rs6365999,1,0.0,4.811062
rs6376963,1,0.895,5.008089
rs3677817,1,1.185,5.176058
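The relation between the GN1-style rows and the two CSV files can be made concrete with a short sketch. It is not the geno2rqtl converter from gn_extra; the function `split_geno` is hypothetical, fields are assumed to be whitespace-separated, and the input is limited to the first three BXD columns and two markers.

```python
import csv
import io
from typing import List, Tuple

GENO_LINES = [
    "Chr Locus cM Mb BXD1 BXD2 BXD5",
    "1 rs6269442 0.0 3.482275 B B D",
    "1 rs6365999 0.0 4.811062 B B D",
]


def split_geno(lines: List[str]) -> Tuple[str, str]:
    """Return (geno_csv, gmap_csv) text built from a GN1-style header and marker lines."""
    samples = lines[0].split()[4:]
    geno, gmap = io.StringIO(), io.StringIO()
    geno_writer = csv.writer(geno, lineterminator="\n")
    gmap_writer = csv.writer(gmap, lineterminator="\n")
    geno_writer.writerow(["marker", *samples])
    gmap_writer.writerow(["marker", "chr", "pos", "Mb"])
    for line in lines[1:]:
        chrom, locus, cm, mb, *calls = line.split()
        # Genotype calls go to the geno table, positions to the map table.
        geno_writer.writerow([locus, *calls])
        gmap_writer.writerow([locus, chrom, cm, mb])
    return geno.getvalue(), gmap.getvalue()


geno_csv, gmap_csv = split_geno(GENO_LINES)
print(geno_csv)
print(gmap_csv)
```

Each GN1 row is thus split in two: the marker name plus the calls form one row of the transposed geno table, and the marker name plus chromosome, cM and Mb positions form one row of the gmap file.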