Home
Welcome to the py-icare wiki!
Example datasets are provided in the `data/` directory of this repository. Users can use them to explore the different features of iCARE and examine the outputs that they generate.
- `breast_cancer_model_formula.txt`: a patsy formula, which is a symbolic description of the covariate model to be fitted. Patsy is a Python substitute for R's formula class objects. If you are an R programmer, please read the patsy manual for differences from R, since patsy is not a perfect drop-in replacement for R's formula syntax.
- `breast_cancer_72_snps_info.csv`: published information (SNP name, odds ratio, and allele frequency) on 72 breast cancer-associated SNPs. Reference: Michailidou, Kyriaki, et al. "Association analysis identifies 65 new breast cancer risk loci." Nature 551.7678 (2017): 92-94.
- `breast_cancer_model_log_odds_ratios.json`: breast cancer log odds ratios associated with each risk factor in the covariate model (`breast_cancer_model_formula.txt`). They were estimated from cohort studies participating in the Breast and Prostate Cancer Cohort Consortium (BPC3). Reference: Maas, Paige, et al. "Breast cancer risk from modifiable and nonmodifiable risk factors among white women in the United States." JAMA oncology 2.10 (2016): 1295-1302.
- `breast_cancer_model_log_odds_ratios_post_50.json`: breast cancer log odds ratios associated with each risk factor in the covariate model (`breast_cancer_model_formula.txt`) for women aged 50 years or older.
- `reference_covariate_data.csv`: a simulated reference dataset specifying some of the breast cancer-associated risk factors (see table below) for 14,137 individuals. The simulation is based on the National Health Interview Survey (NHIS) and National Health and Nutrition Examination Survey (NHANES). This dataset is representative of the US population. Reference: 1) 2010 National Health Interview Survey (NHIS) Public Use Data Release, NHIS Survey Description. 2011. (Accessed at ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2010/srvydesc.pdf.); 2) Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Questionnaire. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2010.
Variable name | Description | Value encoding |
---|---|---|
`id` | Subject ID | A unique identifier for each individual. |
`family_history` | Family history of breast cancer among first degree relatives. | {0: "absence" (reference), 1: "presence"} |
`age_at_menarche` | Age at menarche (years) | {<=11 (reference), 11-11.5, 11.5-12, 12-13, 13-14, 14-15, >=15} |
`parity` | Parity (number of full-term pregnancies) | {nulliparous (reference), 1, 2, 3, >=4} |
`age_at_first_child_birth` | Age at first child birth (years) | {<=19 (reference), 19-22, 22-23, 23-25, 25-27, 27-30, 30-34, 34-38, >=38} |
`age_at_menopause` | Age at menopause (years) | {<=40 (reference), 40-45, 45-47, 47-48, 48-50, 50-51, 51-52, 52-53, 53-55, >=55} |
`height` | Height (meters) | {<=1.55 (reference), 1.55-1.57, 1.57-1.60, 1.60-1.61, 1.61-1.63, 1.63-1.65, 1.65-1.66, 1.66-1.68, 1.68-1.71, >=1.71} |
`bmi` | Body mass index (kg/m²) | {21.5 (reference), 21.5-23, 23-24.2, 24.2-25.3, 25.3-26.5, 26.5-27.8, 27.8-29.3, 29.3-31.4, 31.4-34.6, >=34.6} |
`menopause_hrt` | Use of Hormone Replacement Therapy (HRT) | {0: "pre-menopausal" (reference), 1: "post-menopausal and never HRT user", 2: "post-menopausal and ever HRT user"} |
`menopause_hrt_e` | Use of estrogen-only therapy | {0: "otherwise" (reference), 1: "post-menopausal and ever user of estrogen-only therapy"} |
`menopause_hrt_c` | Use of estrogen + progesterone combined therapy | {0: "otherwise" (reference), 1: "post-menopausal and ever user of combined therapy"} |
`current_hrt` | Current use of HRT | {0: "otherwise" (reference), 1: "post-menopausal and current HRT user"} |
`alcohol_consumption` | Alcohol (drinks/week) | {"none" (reference), 0-0.4, 0.4-0.8, 0.8-1.5, 1.5-3.2, 3.2-5.7, 5.7-9.8, >9.8} |
`smoking_status` | Smoking status | {"never" (reference), "ever"} |
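
As a quick way to get a feel for a reference dataset, the covariate distribution can be tabulated one variable at a time. The sketch below uses only the standard library and a few made-up rows in place of `data/reference_covariate_data.csv`. The literal cell values (for example `never`/`ever`) are assumptions about how the encodings in the table appear in the file, and `category_shares` is a hypothetical helper, not part of iCARE.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for data/reference_covariate_data.csv.
# The cell values are assumptions; check the real file for the exact encoding.
SAMPLE_CSV = """id,family_history,smoking_status
1,0,never
2,1,ever
3,0,never
4,0,ever
"""


def category_shares(csv_file, column):
    """Return {category: share of rows} for one column of a CSV file object."""
    counts = Counter(row[column] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


shares = category_shares(io.StringIO(SAMPLE_CSV), "family_history")
print(shares)
```

To try this on the real file, pass `open("data/reference_covariate_data.csv", newline="")` instead of the in-memory sample.
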
- `reference_covariate_data_post_50.csv`: another simulated reference dataset specifying some of the breast cancer-associated risk factors for a new set of 14,137 individuals. This dataset is similar to `reference_covariate_data.csv`. It was also simulated based on NHIS and NHANES, and is representative of the US population. The primary difference is that the covariate distribution represents women over the age of 50 years, which is taken here as the median age of menopause. It is well documented that the distribution of many known breast cancer-associated risk factors differs between pre-menopausal and post-menopausal women. This dataset provides the risk factor distribution for post-menopausal women. The `reference_covariate_data.csv` dataset provides the risk factor distribution for pre-menopausal women. To fit absolute risk models separately for the two groups, please see the `icare.absolute_risk_main.compute_absolute_risk_split_interval()` function.
- `age_specific_breast_cancer_incidence_rates.csv`: age-specific breast cancer incidence rates. Reference: Surveillance, Epidemiology, and End Results (SEER) Program SEER*Stat Database: Incidence - SEER 18 Regs Research Data, Nov 2011 Sub, Vintage 2009 Pops (2000-2009) <Katrina/Rita Population Adjustment> - Linked To County Attributes - Total U.S., 1969-2010 Counties. In: National Cancer Institute D, Surveillance Research Program, Surveillance Systems Branch, ed. SEER18 ed.
- `age_specific_all_cause_mortality_rates.csv`: age-specific all-cause mortality rates. Reference: Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS). Underlying Cause of Death 1999-2011 on CDC WONDER Online Database, released 2014. Data are from the Multiple Cause of Death Files, 1999-2011, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at http://wonder.cdc.gov/ucd-icd10.html on Aug 26, 2014.
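
The incidence and mortality rates are the two age-specific inputs to an absolute risk calculation: disease incidence drives the risk, while all-cause mortality acts as a competing risk. The following is a simplified discrete-time sketch of that idea, not iCARE's implementation. The rate values, the function name `absolute_risk`, and the single `relative_risk` multiplier are illustrative assumptions, and the sketch treats the incidence rates directly as the individual's baseline hazard.

```python
import math


def absolute_risk(incidence, mortality, age_start, age_end, relative_risk=1.0):
    """Approximate probability of developing the disease in [age_start, age_end).

    incidence and mortality map integer age -> yearly rate. Each year's
    disease hazard is weighted by the probability of having stayed both
    disease-free and alive up to the start of that year.
    """
    risk = 0.0
    cumulative_hazard = 0.0  # disease + death hazard accumulated so far
    for age in range(age_start, age_end):
        disease_hazard = incidence[age] * relative_risk
        risk += disease_hazard * math.exp(-cumulative_hazard)
        cumulative_hazard += disease_hazard + mortality[age]
    return risk


# Illustrative rates only; they are not taken from the files described above.
incidence = {50: 0.002, 51: 0.002, 52: 0.003}
mortality = {50: 0.004, 51: 0.004, 52: 0.005}

baseline = absolute_risk(incidence, mortality, 50, 53)
doubled = absolute_risk(incidence, mortality, 50, 53, relative_risk=2.0)
print(f"3-year risk: {baseline:.5f} (RR=1), {doubled:.5f} (RR=2)")
```

Because the competing mortality removes people from the at-risk pool, setting all mortality rates to zero in this sketch yields a slightly higher disease risk than the baseline call.
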
- `query_covariates_profile.csv`: a query dataset specifying the risk factors (same variables as in the reference covariate dataset `reference_covariate_data.csv`) for three hypothetical individuals. Missing values, if present, are handled by iCARE.
- `query_snp_profile.csv`: a query dataset specifying the allele dosages for the breast cancer-associated SNPs (same SNPs as in the `breast_cancer_72_snps_info.csv` file) for three hypothetical individuals. Note that some SNPs are missing for some individuals; these are handled by iCARE.
- `validation_cohort_data.csv`: a simulated dataset of a full cohort study of 50,000 individuals. This dataset helps illustrate the model validation capabilities of iCARE. The risk factors included in this dataset are the same as in the reference covariate dataset (`reference_covariate_data.csv`). The variables listed in the table below are also included.
Variable name | Description | Value encoding |
---|---|---|
`study_entry_age` | Age at study entry (years) | continuous (integer) |
`study_exit_age` | Age at study exit (years) | continuous (integer) |
`observed_outcome` | Disease status | {0: "normal", 1: "case"} |
`time_of_onset` | Time (in years) from study entry to the development of the disease. Set to Inf if the subject did not develop the disease during the follow-up period. | continuous (float) |
`observed_followup` | Number of years that the subject was followed-up in the study i.e. the difference between the age at study entry and the age at study exit. | continuous (integer) |
`inclusion` | Is the individual selected for nested case-control study? If so, the sample is included in `validation_nested_case_control_data.csv` | {0: "no", 1: "yes"} |
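
The definitions in the table imply a few consistency checks that can be run on a cohort file before using it for validation. The sketch below applies two of them to made-up rows standing in for `validation_cohort_data.csv`. `check_cohort_row` is a hypothetical helper, and the assumptions that ages are written as plain integers and that `Inf` appears literally in the file follow the descriptions above rather than an inspection of the real data.

```python
import csv
import io
import math

# Made-up rows; the third one has an inconsistent follow-up on purpose.
SAMPLE_CSV = """id,study_entry_age,study_exit_age,observed_outcome,time_of_onset,observed_followup
1,50,60,0,Inf,10
2,55,58,1,2.4,3
3,62,70,0,Inf,7
"""


def check_cohort_row(row):
    """Return a list of problems found in one cohort row (empty if none)."""
    problems = []
    entry = int(row["study_entry_age"])
    exit_ = int(row["study_exit_age"])
    followup = int(row["observed_followup"])
    onset = float(row["time_of_onset"])  # float("Inf") parses to infinity
    is_case = row["observed_outcome"] == "1"

    if followup != exit_ - entry:
        problems.append("observed_followup != study_exit_age - study_entry_age")
    if not is_case and not math.isinf(onset):
        problems.append("non-case has a finite time_of_onset")
    return problems


for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    print(row["id"], check_cohort_row(row) or "ok")
```
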
- `validation_nested_case_control_data.csv`: a simulated dataset of a case-control study of 5,285 individuals, nested within a cohort study (see `validation_cohort_data.csv`). In addition to the variables in the cohort study, this dataset contains the allele dosages of the 72 breast cancer-associated SNPs (see `breast_cancer_72_snps_info.csv`).
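
Allele dosages such as those in this dataset and in `query_snp_profile.csv` are commonly combined with per-SNP odds ratios, like those in `breast_cancer_72_snps_info.csv`, into an additive score on the log scale. The sketch below shows that arithmetic on made-up values. The SNP names, odds ratios, and allele frequencies are invented, `polygenic_log_risk` is a hypothetical helper, and replacing a missing dosage with its expected value (twice the allele frequency) is a simplification chosen for this example rather than a description of how iCARE treats missing SNPs.

```python
import math

# Invented SNP information: name -> (odds ratio per allele, allele frequency).
snp_info = {
    "snp_a": (1.10, 0.30),
    "snp_b": (0.95, 0.50),
    "snp_c": (1.25, 0.10),
}


def polygenic_log_risk(dosages, snp_info):
    """Sum of dosage * log(odds ratio) over SNPs.

    dosages maps SNP name -> allele count (0, 1, 2) or None when missing.
    A missing dosage is replaced by its expected value, 2 * allele frequency.
    """
    score = 0.0
    for name, (odds_ratio, frequency) in snp_info.items():
        dosage = dosages.get(name)
        if dosage is None:
            dosage = 2.0 * frequency
        score += dosage * math.log(odds_ratio)
    return score


profile = {"snp_a": 2, "snp_b": 0, "snp_c": None}  # snp_c is missing
score = polygenic_log_risk(profile, snp_info)
print(f"log score: {score:.4f}, exponentiated: {math.exp(score):.4f}")
```

The exponentiated score is relative to a profile with a dosage of zero at every SNP, for which the function returns exactly 0.
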