- Please update to this version! Merging and meta analyses did not work properly in previous versions.
- Fixed issue with MAF calculation in .gwama()
Older patchnodes can be found in PATCHNOTES.md
A python package for working with GWAS summary statistics data in Python.
This package is designed to make it easy to read summary statistics, perform QC, merge summary statistics and perform meta-analysis.
Meta-analysis can be performed with .meta()
with inverse-variance weighted or samplesize-weighted methods.
GWAMA as described in Baselmans, et al. (2019) can be performed using the .gwama()
function in merged summary statistics.
The plotting package uses matplotlib.pyplot for generating figures, so the functions are generally compatible with matplotlib.pyplot colors, and Figure and Axis objects.
Warning: merging with low_memory enabled is still highly experimental.
Using the pysumstats package for a publication, or something similar? That is awesome!
There is no publication attached to this package,
and I am not going to force anyone to reference me or make me a co-author or whatever, I want this to remain easily accessible.
But I would greatly appreciate it if you add a link to this github, or a reference to it in the acknowledgements or something like that.
If you have any questions, want to help add methods or want to let me know you are planning a publication with this, you can get in touch via the pypi website of this project.
If you use the .gwama()
method, please refer to the original publication: Baselmans, et al. (2019).
This package was made for Python 3.7. Clone the package directly from this github, or install with
pip3 install --upgrade pysumstats
import pysumstats as sumstats
s1 = sumstats.SumStats("sumstats1.csv.gz", phenotype='GWASsummary1')
s2 = sumstats.SumStats("sumstats2.txt.gz", phenotype='GWASsummary2', gwas_n=350492)
s3 = sumstats.SumStats("sumstats3.csv", phenotype='GWASsummary3',
column_names={
'rsid': 'weird_name_for_rsid',
'chr': 'weird_name_for_chr',
'bp': 'weird_name_for_bp',
'ea': 'weird_name_for_ea',
'oa': 'weird_name_for_oa',
'maf': 'weird_name_for_maf',
'b': 'weird_name_for_b',
'se': 'weird_name_for_se',
'p': 'weird_name_for_p',
'hwe': 'weird_name_for_p_hwe',
'info': 'weird_name_for_info',
'n': 'weird_name_for_n',
'eaf': 'weird_name_for_eaf',
'oaf': 'weird_name_for_oaf'})
s1.qc(maf=.01)
s2.qc(maf=.01, hwe=1e-6, info=.9)
s3.qc() # MAF .01 is the default
merge1 = s1.merge(s2)
n_weighted_meta = merge1.meta_analyze(name='meta1', method='samplesize') # N-weighted meta analysis
ivw_meta = merge1.meta_analyze(name='meta1', method='ivw') # Standard inverse-variance weighted meta analysis
gwama = merge1.gwama(name='meta1', method='ivw') # GWAMA as described in Baselmans, et al. (2019)
exc_meta = exc.gwama(h2_snp={'ntr_exc': .01, 'ukb_ssoe': .02}, name='exc', method='ivw')
# Either read it from a file:
import pandas as pd
cov_z = pd.read_csv('my_cov_z.csv') # Note it should be pandas dataframe with column names and index names equal to your phenotypes
# Or generate it from a phenotype file yourself:
phenotypes = pd.read_csv('my_phenotype_file.csv')
cov_z = sumstats.cov_matrix_from_phenotype_file(phenotypes, phenotypes=['GWASsummary1', 'GWASsummary2'])
gwama = exc.gwama(cov_matrix=cov_z, h2_snp={'GWASsummary1': .01, 'GWASsummary2': .02}, name='meta1', method='ivw')
gwama.describe()
gwama.head()
gwama.head(n_chromosomes=23)
gwama.manhattan(filename='meta_manhattan.png')
gwama.qqplot(filename='meta_qq.png')
exc.save('exc_sumstats.csv')
exc.save('exc_sumstats.pickle')
merged = gwama.merge(s3)
merged.prep_for_mr(exposure='GWASsummary3', outcome='meta1',
filename=['GWAS3-Meta.csv', 'Meta-GWAS3.csv'],
p_cutoff=5e-8, bidirectional=True, index=False)
The resulting files will have the following column names, per specification of the MendelianRandomization package in R:
rsid chr bp exposure.A1 exposure.A2 outcome.A1 outcome.A2 exposure.se exposure.b outcome.se outcome.b
# See column names of the file
gpc_neuro.columns
# SumStats support for standard indexing is growing:
exc[0] # Get the full output of the first SNP
exc[:10] # Get the full output of the first 10 SNPs
exc[:10, 'p'] # Get the p value of the first 10 SNPs
exc['p'] # Get the p values of all SNPs
exc['rs78948828'] # Get the full output of 1 specific rsid
exc[['rs78948828', 'rs6057089', 'rs55957973']] # Get the full output of multiple specific rsids
exc[['rs78948828', 'rs6057089', 'rs55957973'], 'p'] # Get the p-value for specific rsids
# If for whatever reason you want to do stuff with each SNP individually you can also loop over the entire file
for snp_output in exc:
if exc['p'] < 5e-8:
print('Yay significant SNP!')
# do something
# If you only want to loop over some specific columns, you can
for rsid, b, se, p in exc[['rsid', 'b', 'se', 'p']].values:
if p < 5e-8:
print('Yay significant SNP!')