New section describing familial PEDIGREE headers #413

Open: 5 commits proposed for merging into base branch master
38 changes: 30 additions & 8 deletions VCFv4.3.tex
@@ -238,6 +238,7 @@ \subsubsection{Contig field format}


\subsubsection{Sample field format}
\label{meta-sample}
It is possible to define sample to genome mappings as shown below:
{\scriptsize
\begin{verbatim}
@@ -256,13 +257,38 @@ \subsubsection{Pedigree field format}
##PEDIGREE=<ID=SomaticNonTumour,Original=GermlineID>
##PEDIGREE=<ID=ChildID,Father=FatherID,Mother=MotherID>
##PEDIGREE=<ID=SampleID,Name_1=Ancestor_1,...,Name_N=Ancestor_N>
##PEDIGREE=<ID=ChildID,MonozygoticTwin=OtherChildID>
##PEDIGREE=<ID=ChildID,DizygoticTwin=AnotherChildID>
##PEDIGREE=<ID=ChildID,Sibling=SiblingChildID>
\end{verbatim}
\noindent or a link to a database:

The first two lines assert that the DNA in genomes TumourSample and SomaticNonTumour is asexually or clonally derived with mutations from the DNA in genome GermlineID.
The third line describes a family relationship between genomes.
A VCF will therefore contain one entry per trio.
The fourth line is an example of the most general form of a pedigree line.
It means that the genome SampleID is derived from the $N \ge 1$ genomes Ancestor\_1, ..., Ancestor\_N.
The fifth and sixth lines describe relationships between twins.
Regular siblings can be inferred implicitly from trios like the third line, but if the parents are unknown, the seventh line describes a sibling relationship explicitly.
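
As a rough illustration of how a reader might consume these lines, the following Python sketch splits a single PEDIGREE header line into its key/value pairs. The function name \texttt{parse\_pedigree\_header} is hypothetical and not part of this specification, and the sketch assumes simple unquoted values that contain neither commas nor equals signs.

```python
def parse_pedigree_header(line):
    """Parse one ##PEDIGREE header line into a dict of key -> value.

    Hypothetical helper for illustration only. It assumes unquoted
    values with no embedded ',' or '=' characters.
    """
    prefix = "##PEDIGREE=<"
    line = line.strip()
    if not (line.startswith(prefix) and line.endswith(">")):
        raise ValueError(f"not a PEDIGREE header line: {line!r}")

    body = line[len(prefix):-1]  # text between '<' and '>'
    fields = {}
    for item in body.split(","):
        key, sep, value = item.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed key=value pair: {item!r}")
        if key in fields:
            raise ValueError(f"duplicate key: {key!r}")
        fields[key] = value

    if "ID" not in fields:
        raise ValueError("PEDIGREE line lacks an ID field")
    return fields


trio = parse_pedigree_header(
    "##PEDIGREE=<ID=ChildID,Father=FatherID,Mother=MotherID>"
)
twin = parse_pedigree_header(
    "##PEDIGREE=<ID=ChildID,MonozygoticTwin=OtherChildID>"
)
print(trio)
print(twin)
```

The sketch treats every key other than ID as an opaque relationship name, so it accepts the general form with arbitrary ancestor names as well as the Father, Mother, twin and Sibling keys.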

Mother and Father are optional (e.g.\ if unknown) and have the same meaning as in PED files. Consider the following example PED line (the columns are Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype, Genotypes):
\begin{verbatim}
FAM001 9 7 8 1 2 A A
\end{verbatim}

The family described in that line can be expressed in VCF as:

\begin{verbatim}
##PEDIGREE=<ID=9,Father=7,Mother=8>
\end{verbatim}
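
The mapping from a PED line to a PEDIGREE header can be sketched in Python as follows. The function name \texttt{ped\_line\_to\_pedigree} is hypothetical. The sketch assumes whitespace-separated PED columns in the order listed above and assumes that an unknown parent is encoded as \texttt{0}, as is conventional in PED files; an unknown parent is simply omitted, and no header is produced when both parents are unknown.

```python
def ped_line_to_pedigree(ped_line, missing="0"):
    """Convert one PED line into a ##PEDIGREE header line.

    Hypothetical helper for illustration only. Returns None when both
    parents are unknown, since there is then no relationship to record.
    """
    columns = ped_line.split()
    if len(columns) < 4:
        raise ValueError("a PED line needs at least four columns")

    # Columns: Family ID, Individual ID, Paternal ID, Maternal ID, ...
    _family_id, individual_id, paternal_id, maternal_id = columns[:4]

    fields = [f"ID={individual_id}"]
    if paternal_id != missing:
        fields.append(f"Father={paternal_id}")
    if maternal_id != missing:
        fields.append(f"Mother={maternal_id}")

    if len(fields) == 1:  # neither parent is known
        return None
    return "##PEDIGREE=<" + ",".join(fields) + ">"


header = ped_line_to_pedigree("FAM001 9 7 8 1 2 A A")
print(header)
```

The Sex, Phenotype and Genotypes columns are deliberately ignored here, because they are expressed elsewhere in the VCF rather than in the PEDIGREE line.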

Phenotypes can be expressed as explained in Section \ref{meta-sample}, and genotypes as in Section \ref{genotype-fields}.

If samples and the relationships between them are described in an external resource such as a database or PED file, it is also possible to provide a link:
\begin{verbatim}
##pedigreeDB=URL

This is not the line I'm interested in, but I can't comment on the other one. PED does not support more than 2 ancestors; do we want to support that in VCF? I have never seen this used, so I don't think many people will miss it if we drop it. That would make it trivial to add some examples for trios.

@pd3 @lbergelson what should we do about this? Dropping support for more than 2 ancestors would render some files incorrect, but as I said in my previous comment I have never seen that syntax being used.

I don't think there would be a lot of pushback against dropping multiple ancestor lines because they seem pretty ambiguously defined at the moment and I haven't ever seen one used... That said, we should maybe not be making breaking changes to an existing spec?

Are they intended only for asexual ancestry where each is the parent of the next? Or are they just an unsorted bag of ancestors that could represent any tree of parentage?

i.e., does <ID=SampleID,Name_1=Ancestor_1,...,Name_N=Ancestor_N>
imply

SampleID -> Ancestor_1 -> Ancestor_2

or could it also mean

 Ancestor_1 <- SampleID -> Ancestor_2

I lean towards removing or deprecating it if we don't know exactly what it means and no one seems to be using it.

I'm also not clear on which of these are controlled vocabulary. Is there a specific ontology of relationships that are allowed? Are we allowed to specify something like Sibling in the case where we don't have the parents in the VCF, or is that handled with dummy trios that point to unique but absent parent IDs?

Are you allowed to include only 1 parent or are trios required?

I assume the example would address some of these questions.

\end{verbatim}

\noindent See \ref{PedigreeInDetail} for details.
\noindent See Section \ref{ClonalPedigree} for more details about clonal relationships.


\subsection{Header line syntax}
@@ -390,6 +416,7 @@ \subsubsection{Fixed fields}
\end{itemize}

\subsubsection{Genotype fields}
\label{genotype-fields}
If genotype information is present, then the same types of data must be present for all samples.
First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
This is followed by one data block per sample, with the colon-separated data corresponding to the types specified in the format.
@@ -1261,7 +1288,7 @@ \subsubsection{Sample mixtures}
\normalsize

\subsubsection{Clonal derivation relationships}
\label{PedigreeInDetail}
\label{ClonalPedigree}
In cancer, each VCF file represents several genomes from a patient, but one genome is special in that it represents the germline genome of the patient.
This genome is contrasted to a second genome, the cancer tumor genome.
In the simplest case the VCF file for a single patient contains only these two genomes.
@@ -1281,11 +1308,6 @@ \subsubsection{Clonal derivation relationships}
\end{verbatim}

This line asserts that the DNA in genome DerivedID is asexually or clonally derived with mutations from the DNA in genome OriginalID.
This is the asexual analog of the VCF format that has been proposed for family relationships between genomes, i.e., there is one entry per trio of the form:

\begin{verbatim}
##PEDIGREE=<ID=ChildID,Mother=MotherID,Father=FatherID>
\end{verbatim}

Let's consider a cancer patient VCF file with 4 genomes: germline, primary\_tumor, secondary\_tumor1, and secondary\_tumor2 as illustrated in Figure 10.
The primary\_tumor is derived from the germline and the secondary tumors are each derived independently from the primary tumor, in all cases by clonal derivation with mutations.