\documentclass[8pt]{article}
\usepackage{enumerate}
\usepackage{graphicx}
\usepackage{longtable}
\usepackage{lscape}
\usepackage{makecell}
\usepackage[margin=0.75in]{geometry}
\usepackage[pdfborder={0 0 0}]{hyperref}
\usepackage{listings}
\lstset{
basicstyle=\ttfamily,
mathescape
}
\usepackage{color}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
\begin{document}
\input{VCFv4.5.ver}
\title{The Variant Call Format Specification \\ \vspace{0.5em} \large VCFv4.5 and BCFv2.2}
\date{\headdate}
\maketitle
\begin{quote}\small
The master version of this document can be found at \url{https://github.com/samtools/hts-specs}.\\
This printing is version~\commitdesc\ from that repository, last modified on the date shown above.
\end{quote}
\vspace*{1em}
\newpage
\tableofcontents
\newpage
\section{The VCF specification}
VCF is a text file format (most likely stored in a compressed manner).
It contains meta-information lines (prefixed with ``\verb|##|''), a header line (prefixed with ``\verb|#|''), and data lines each containing information about a position in the genome and genotype information on samples for each position (text fields separated by tabs).
Zero-length fields are not allowed; a dot (``.'') must be used instead.
In order to ensure interoperability across platforms, VCF compliant implementations must support both LF (``\verb|\n|'') and CR+LF (``\verb|\r\n|'') newline conventions.
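As a rough illustration of the layout just described, the following Python sketch splits the text of a VCF file into meta-information lines, the header line, and data lines, accepting both LF and CR+LF line endings.
The function name \verb|split_vcf_lines| is hypothetical and not part of this specification, and the sketch classifies lines by prefix only, without validating their content.

```python
def split_vcf_lines(text):
    """Return (meta_lines, header_line, data_lines) for a VCF text.

    Illustrative only: accepts LF and CR+LF line endings and classifies
    each line purely by its prefix.
    """
    meta, header, data = [], None, []
    for raw in text.split("\n"):
        # Tolerate CR+LF by dropping a trailing carriage return.
        line = raw[:-1] if raw.endswith("\r") else raw
        if not line:
            continue  # e.g. the empty string after the final line separator
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            header = line
        else:
            data.append(line)
    return meta, header, data
```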
\subsection{An example}
\scriptsize
\begin{verbatim}
##fileformat=VCFv4.5
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
\end{verbatim}
\normalsize
This example shows (in order): a good simple SNP, a possible SNP that has been filtered out because its quality is below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error), a site that is called monomorphic reference (i.e.\ with no alternate alleles), and a microsatellite with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T).
Genotype data are given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth and haplotype qualities (the latter only for the phased samples) given as well as the genotypes.
The microsatellite calls are unphased.
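To make the per-sample columns of the example concrete, the Python sketch below pairs the colon-separated FORMAT keys with the values of one sample.
It assumes, as in the second record above where the third sample omits HQ, that trailing values may be dropped; the helper name \verb|parse_sample| is hypothetical and no type conversion is attempted.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values.

    Keys whose trailing values were dropped map to None; comma-separated
    values are split into lists of strings.
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    parsed = {}
    for i, key in enumerate(keys):
        if i < len(values):
            parsed[key] = values[i].split(",")
        else:
            parsed[key] = None  # trailing field omitted for this sample
    return parsed
```

For instance, \verb|parse_sample("GT:GQ:DP:HQ", "0/0:41:3")| maps HQ to \verb|None| because the sample carries only three values.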
\subsection{Character encoding, non-printable characters and characters with special meaning}
\label{character-encoding}
The character encoding of VCF files is UTF-8.
UTF-8 is a multi-byte character encoding that is a strict superset of 7-bit ASCII and has the property that none of the bytes in any multi-byte characters are 7-bit ASCII bytes.
As a result, most software that processes VCF files does not have to be aware of the possible presence of multi-byte UTF-8 characters.
VCF files must not contain a byte order mark.
Note that non-printable characters U+0000--U+0008, U+000B--U+000C, U+000E--U+001F are disallowed.
Line separators must be CR+LF or LF and they are allowed only as line separators at end of line.
Some characters have a special meaning when they appear (such as the field delimiters `\verb|;|' in INFO or `\verb|:|' in FORMAT fields), and for any other meaning they must be represented with the capitalized percent encoding:
\begingroup\footnotesize
\begin{tabular}{l l l}
\%3A & : & (colon) \\
\%3B & ; & (semicolon) \\
\%3D & = & (equal sign) \\
\%25 & \% & (percent sign) \\
\%2C & , & (comma) \\
\%0D & CR & \\
\%0A & LF & \\
\%09 & TAB &
\end{tabular}
\endgroup
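A minimal Python sketch of this encoding follows.
The function names \verb|vcf_encode| and \verb|vcf_decode| are hypothetical; the mapping itself is copied from the table above, and only the capitalized forms are recognised when decoding.

```python
# Characters with special meaning and their capitalized percent encodings,
# taken from the table above.
_ENCODE = {
    "%": "%25",
    ":": "%3A",
    ";": "%3B",
    "=": "%3D",
    ",": "%2C",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
}
_DECODE = {code: char for char, code in _ENCODE.items()}


def vcf_encode(value):
    """Percent-encode the special characters in a field value."""
    return "".join(_ENCODE.get(ch, ch) for ch in value)


def vcf_decode(value):
    """Reverse vcf_encode, leaving any other '%XX' sequence untouched."""
    out, i = [], 0
    while i < len(value):
        chunk = value[i:i + 3]
        if chunk in _DECODE:
            out.append(_DECODE[chunk])
            i += 3
        else:
            out.append(value[i])
            i += 1
    return "".join(out)
```

Decoding scans left to right and consumes each three-character code once, so an encoded percent sign followed by hexadecimal digits is not decoded a second time.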
\subsection{Data types}
Data types supported by VCF are: Integer (32-bit, signed), Float (32-bit IEEE-754, formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively),%
\footnote{Note Java's {\tt Double.valueOf} is particular about capitalisation, so additional code is needed to parse all VCF infinite/NaN values.}
Flag, Character, and String.
For the Integer type, the values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}.
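The constraints on Float and Integer can be checked mechanically.
The Python sketch below compiles the two Float patterns quoted above and applies the Integer range restriction; the helper names are hypothetical, and the accepted Integer syntax (an optional sign followed by decimal digits) is an assumption, since no pattern for Integer is given here.

```python
import re

# The two Float patterns quoted above; the second is case insensitive.
_FLOAT_NUMERIC = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$")
_FLOAT_SPECIAL = re.compile(r"^[-+]?(INF|INFINITY|NAN)$", re.IGNORECASE)

# Integer is 32-bit signed, minus the eight lowest values reserved by BCF.
INT_MIN = -2**31 + 8
INT_MAX = 2**31 - 1


def is_vcf_float(text):
    """True if text matches one of the two Float patterns."""
    return bool(_FLOAT_NUMERIC.fullmatch(text) or _FLOAT_SPECIAL.fullmatch(text))


def is_vcf_integer(text):
    """True if text is a decimal integer within the permitted range."""
    # Assumed syntax: optional sign followed by digits.
    if not re.fullmatch(r"[-+]?[0-9]+", text):
        return False
    return INT_MIN <= int(text) <= INT_MAX
```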
\subsection{Meta-information lines}
File meta-information lines start with ``\verb|##|'' and must appear first in the VCF file, before the header line (section~\ref{header-line}) and data record lines (section~\ref{data-lines}).
They may be either \emph{unstructured} or \emph{structured}.
An \emph{unstructured} meta-information line consists of a~\emph{key} (denoting the type of meta-information recorded) and a~\emph{value} (which may not be empty and must not start with a `\verb|<|' character), separated by an `\verb|=|' character:
\begin{quote}
\verb|##|\emph{key}\verb|=|\emph{value}
\end{quote}
Several unstructured meta-information lines are defined in this specification, notably \verb|##fileformat|.
Others not defined by this specification, e.g.\ \verb|##fileDate| and \verb|##source|, are commonly found in VCF files.
These typically have meanings that are obvious, or they are immaterial for processing the file, or both.
A \emph{structured} meta-information line is similar, but the value is itself a comma-separated list of key=value pairs, enclosed within `\verb|<|' and `\verb|>|' characters:
\begin{quote}
\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>|
\end{quote}
All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix.
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, optional fields can be included.
For example:
\begin{verbatim}
##INFO=<ID=ALLELEID,Number=A,Type=String,Description="Allele ID",Source="ClinVar",Version="20220804">
\end{verbatim}
In the above example, the optional fields of ``Source'' and ``Version'' are provided.
The values of optional fields must be written as quoted strings, even for numeric values.
Other structured lines not defined by this specification may also be used; the only required field for such lines is \verb|ID|.
It is recommended in VCF and required in BCF that the header includes tags describing the reference and contigs backing the data contained in the file.
These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above).
To aid human readability, the order of fields should be ID, Number, Type, Description, then any optional fields.
Implementations must not rely on the order of the fields within structured lines and are not required to preserve field ordering.
Meta-information lines are optional, but if they are present then they must be completely well-formed.
Other than \verb|##fileformat|, they may appear in any order.
Note that BCF, the binary counterpart of VCF, requires that all entries are present.
It is recommended to include meta-information lines describing the entries used in the body of the VCF file.
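The structured-line syntax can be parsed with a small hand-written scanner.
The Python sketch below is illustrative only: the name \verb|parse_structured_line| is hypothetical, quoted values are assumed to use backslash escapes, and well-formed input is assumed.
Values that are neither quoted nor free of commas (for example bracketed lists) are outside its scope.

```python
def parse_structured_line(line):
    """Parse '##KEY=<k=v,k="quoted, value",...>' into (KEY, dict).

    Quoted values may contain commas; inside quotes, a backslash escapes
    the next character. Unquoted values run to the next comma.
    """
    if not line.startswith("##"):
        raise ValueError("not a meta-information line")
    key, sep, body = line[2:].partition("=")
    if not sep or not (body.startswith("<") and body.endswith(">")):
        raise ValueError("not a structured meta-information line")
    body = body[1:-1]
    fields, i, n = {}, 0, len(body)
    while i < n:
        eq = body.index("=", i)  # end of the field name
        name = body[i:eq]
        i = eq + 1
        if i < n and body[i] == '"':  # quoted value
            i += 1
            chars = []
            while body[i] != '"':
                if body[i] == "\\":  # backslash escapes the next character
                    i += 1
                chars.append(body[i])
                i += 1
            i += 1  # skip the closing quote
            value = "".join(chars)
        else:  # unquoted value
            end = body.find(",", i)
            end = n if end == -1 else end
            value = body[i:end]
            i = end
        fields[name] = value
        if i < n and body[i] == ",":
            i += 1  # skip the field separator
    return key, fields
```

Returning a plain dictionary reflects the rule that implementations must not rely on the order of fields within a structured line.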
\subsubsection{File format}
A single `fileformat' line is always required, must be the first line in the file, and details the VCF format version number.
For VCF version 4.5, this line is:
\begin{verbatim}
##fileformat=VCFv4.5
\end{verbatim}
\subsubsection{Information field format}
INFO meta-information lines are structured lines with required fields ID, Number, Type, and Description, and recommended optional fields Source and Version:
\begin{verbatim}
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">
\end{verbatim}
Possible Types for INFO fields are: Integer, Float, Flag, Character, and String.
The Number entry is an Integer that describes the number of values that can be included with the INFO field.
For example, if the INFO field contains a single number, then this value must be $1$; if the INFO field describes a pair of numbers, then this value must be $2$ and so on.
There are also certain special characters used to define special cases:
\begin{itemize}
\item A: The field has one value per alternate allele.
The values must be in the same order as listed in the ALT column (described in section \ref{data-lines}).
\item R: The field has one value for each possible allele, including the reference.
The order of the values must be the reference allele first, then the alternate alleles as listed in the ALT column.
\item G: The field has one value for each possible genotype.
The values must be in the same order as prescribed in section \ref{genotype-fields:genotype-ordering} (see \textsc{Genotype Ordering}).
\item . (dot): The number of possible values varies, is unknown or unbounded.
\end{itemize}
The `Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number must be $0$ in this case.
The Description value must be surrounded by double-quotes.
A double-quote character within the value must be escaped with a backslash ($\backslash$), and a backslash must be written as $\backslash\backslash$.
Source and Version values likewise must be surrounded by double-quotes and specify the annotation source (case-insensitive, e.g.\ \verb|"dbsnp"|) and exact version (e.g.\ \verb|"138"|), respectively for computational use.
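The Number entry translates directly into an expected count of values.
The Python sketch below is a hedged illustration: the function name and the \verb|ploidy| parameter are hypothetical, and the count used for G (the number of unordered genotypes, i.e.\ multisets of alleles) is the standard combinatorial formula, assumed here to agree with the genotype ordering section.

```python
from math import comb


def expected_value_count(number, n_alt, ploidy=2):
    """Return how many values a field should carry, or None if unbounded.

    number : the Number entry from the header ("A", "R", "G", "." or digits)
    n_alt  : number of alternate alleles listed in the ALT column
    ploidy : assumed ploidy (only used for "G")
    """
    if number == ".":
        return None
    if number == "A":
        return n_alt
    if number == "R":
        return n_alt + 1
    if number == "G":
        n_alleles = n_alt + 1
        # Unordered genotypes: multisets of size `ploidy` over the alleles.
        return comb(n_alleles + ploidy - 1, ploidy)
    return int(number)
```

With two alternate alleles and a diploid sample this gives 2 values for A, 3 for R, and 6 for G.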
\subsubsection{Filter field format}
FILTER meta-information lines are structured lines with required fields ID and Description that define the possible content of the FILTER column in the VCF records:
\begin{verbatim}
##FILTER=<ID=ID,Description="description">
\end{verbatim}
\subsubsection{Individual format field format}
FORMAT meta-information lines are structured lines with required fields ID, Number, Type, and Description that define the possible content of the per-sample/genotype columns in the VCF records:
\begin{verbatim}
##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
\end{verbatim}
Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field).
The Number field is defined as per the INFO Number field with the following additional possibilities:
\begin{itemize}
\item LA: Identical to A except that only the alternate alleles defined in the $LAA$ field are considered present.
\item LR: Identical to R except that only the alternate alleles defined in the $LAA$ field are considered present.
\item LG: Identical to G except that only the alternate alleles defined in the $LAA$ field are considered present.
\item P: The field has one value for each allele value defined in $GT$.
\item M: The field has one value for each possible base modification for the corresponding ChEBI ID.
\end{itemize}
The cardinality of M fields is determined by genotype and number of possible base modifications for the corresponding alleles.
The ID of all M fields must end with A, C, G, T, U, or N which defines the base(s) that the modification can occur on.
U must be treated as synonymous with T.
If any base modification key is present for a sample, GT must be defined for that sample.
The number of base modification values for a given allele is the number of bases on either strand in the allele sequence that could contain the base modification.
The order of the base modification values is the order that these bases occur in the allele.
For N base modifications, the field contains values for both the positive and negative strands with the negative strand value immediately after the positive strand value.
For example, an allele of CGA has 2 M5mC values, the first defining the methylation rate on forward strand C at the first base pair, and the second defining the methylation rate for reverse strand C at the second base pair.
The order and number of alleles encoded in these fields is determined by the order and phasing in the genotype.
Base modification values are encoded in their GT order with one value for each possible base modification in the concatenated genotype allele bases.
Unphased allele values are aggregated and encoded at the position of the first occurrence of the unphased allele value.
MISSING allele values and symbolic alleles are treated as containing no relevant bases and thus encode no base modification values.
Unstranded base modification information should be stored at the base with the lowest POS with the other values MISSING.
Unstranded N base modifications should be stored on the positive strand with the values MISSING.
For example, unstranded 5mC CpG methylation should be stored on the VCF record containing the C, with the M5mC value of the subsequent G set to MISSING or omitted entirely. Similarly, unstranded MxaoN values should be stored in the positive strand value with the negative strand value MISSING.
Examples:
\vspace{0.5em}
\begin{tabular}{ l l l l l l l l l l}
\#CHROM & POS & REF & ALT & FORMAT & SAMPLE\\
chr & $10$ & C & A & GT:M5mC & \tt{0/1:0.95}\\
chr & $20$ & C & CTAG & GT:M5mC & \tt{0/1:0,0.5,0.7}\\
chr & $30$ & C & . & GT:M5mC:M5hmC & \tt{0|0:0.9,0:0,0.1}\\
chr & $40$ & C & A,T,G,ACG & GT:M5mC & \tt{/3|1/0|4|0/0/3/1:0.25,0.1,0.5,0.6,.}\\
\end{tabular}
The first record encodes a 95 percent methylation on the REF C.
Since the ALT A cannot be 5mC methylated, only one value is present.
The second record encodes the methylation of the REF (since it is the first allele occurring in the GT field), followed by the methylation values of the first and fourth base of the CTAG ALT.
The third record encodes that both 5mC and 5hmC modifications are present at the homozygous C but are mutually exclusive on each haplotype: 90 percent 5mC and no 5hmC on the first haplotype, and 10 percent 5hmC with no 5mC on the second haplotype.
The fourth record demonstrates the encoded ordering of the methylation state of a partially phased locally-octoploid sample.
The first allele value (unphased G) encodes a 25 percent methylation of the 2 unphased copies of the G allele (encoded first since /3 occurs first in GT).
The second allele value (phased A) is not relevant to 5mC methylation so there is nothing to encode.
The third allele value (unphased C) encodes a 10 percent methylation rate for both unphased copies of the C REF allele.
The fourth allele value (phased ACG) encodes the 50 and 60 percent methylation rates of the second and third base pairs of the ACG allele.
The fifth allele value (phased C) encodes an unknown methylation rate of the single phased copy of the C REF allele.
The sixth allele value (unphased C) was already encoded as part of the third allele value so there is nothing more to encode.
The seventh allele value (unphased G) was already encoded as part of the first allele value so there is nothing more to encode.
The eighth allele value (unphased A) is not relevant to 5mC methylation so there is nothing to encode.
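The per-allele counts used in these examples can be reproduced with a short Python sketch.
It is one reading of the rules above rather than a definitive implementation: the function name is hypothetical, only alleles made of explicit A, C, G, T bases are handled, and the treatment of N modifications (two values per position) is an assumption drawn from the strand rule.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def modification_value_count(allele, target_base):
    """Count the values an M field carries for one allele sequence.

    target_base is the last letter of the field ID (e.g. "C" for M5mC).
    A value is needed wherever the target base occurs on either strand,
    i.e. where the allele has the base itself or its complement.
    """
    allele = allele.upper()
    target = target_base.upper().replace("U", "T")  # U is synonymous with T
    if target == "N":
        return 2 * len(allele)  # assumed: both strands at every position
    return sum(1 for base in allele if base in (target, _COMPLEMENT[target]))
```

For M5mC this yields 2 for CGA, 2 for CTAG, 1 for C, and 0 for A, matching the counts described for the records above.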
\subsubsection{Alternative allele field format} \label{altfield}
ALT meta-information lines are structured lines with required fields ID and Description that describe the possible symbolic alternate alleles in the ALT column of the VCF records:
\begin{verbatim}
##ALT=<ID=type,Description="description">
\end{verbatim}
\noindent \textbf{Structural Variants} \newline
In symbolic alternate alleles for structural variants, the ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes.
ID values are case sensitive strings and must not contain whitespace, commas or angle brackets (see \ref{fixed-fields}.\ref{fixed-fields-alt}).
The first level type must be one of the following:
\begin{itemize}
\item DEL Region of lowered copy number relative to the reference, or a deletion breakpoint
\item INS Insertion of novel sequence relative to the reference
\item DUP Region of elevated copy number relative to the reference, or a tandem duplication breakpoint
\item INV Inversion of reference sequence
\item CNV Region of uniform copy number (may be deletion, duplication or copy number neutral)
\end{itemize}
The CNV symbolic allele should not be used when a more specific one (e.g.\ DEL, CNV:TR) can be applied.
Implementations are free to define their own subtypes.
The presence of a subtype does not change either the copy number or breakpoint interpretation of a symbolic structural variant allele.
The following subtypes are recommended:
\begin{itemize}
\item CNV:TR Tandem repeat. See \ref{tandem-repeats} for further details.
\item DUP:TANDEM Tandem duplication
\item DEL:ME Deletion of mobile element relative to the reference
\item INS:ME Insertion of a mobile element relative to the reference
\end{itemize}
Note that the position of symbolic structural variant alleles is the position of the base immediately preceding the variant.
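As an illustration of the colon-separated type and subtype structure, the Python sketch below splits a symbolic structural variant allele into its first-level type and any subtypes.
The function name \verb|parse_sv_allele| is hypothetical, and the sketch deliberately covers structural variant alleles only, so other symbolic alleles are rejected.

```python
FIRST_LEVEL_TYPES = {"DEL", "INS", "DUP", "INV", "CNV"}


def parse_sv_allele(alt):
    """Split a symbolic structural variant allele such as '<DUP:TANDEM>'.

    Returns (first_level_type, [subtypes]); raises ValueError when the
    string is not angle-bracketed or the first-level type is unknown.
    """
    if len(alt) < 3 or not (alt.startswith("<") and alt.endswith(">")):
        raise ValueError("not a symbolic allele: %r" % alt)
    parts = alt[1:-1].split(":")
    if parts[0] not in FIRST_LEVEL_TYPES:  # IDs are case sensitive
        raise ValueError("unknown structural variant type: %r" % parts[0])
    return parts[0], parts[1:]
```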
\bigskip
\noindent \textbf{IUPAC ambiguity codes} \newline
Symbolic alleles can be used also to represent genuinely ambiguous data in VCF, for example:
\begin{verbatim}
##ALT=<ID=R,Description="IUPAC code R = A/G">
##ALT=<ID=M,Description="IUPAC code M = A/C">
\end{verbatim}
\subsubsection{Assembly field format}
Breakpoint assemblies for structural variations may use an external file:
\begin{verbatim}
##assembly=url
\end{verbatim}
The URL field specifies the location of a fasta file containing breakpoint assemblies referenced in the VCF records for structural variants via the BKPTID INFO key.
\subsubsection{Contig field format}
\label{sec-contig-field}
It is recommended for VCF, and required for BCF, that the header includes tags describing the contigs referred to in the file.
The structured \texttt{contig} field must include the ID attribute and can include additional optional attributes with
the following ones reserved:
\begin{itemize}
\item length: the length of the sequence
\item md5: MD5 checksum of the sequence as defined in the SAM specification v1.\footnote{See the Reference MD5 calculation
section in \href{https://samtools.github.io/hts-specs/SAMv1.pdf}{\tt SAM Format Specification}.} Briefly, the digest
is calculated after excluding all characters outside of the inclusive range 33 (`\char33') to 126 (`\char126')
and converting all lowercase characters to uppercase. The MD5 digest is calculated as described in
\href{https://tools.ietf.org/html/rfc1321}{\sl RFC 1321} and presented as a 32-character lowercase hexadecimal number.
\item URL: tag to indicate where the sequence can be found
\end{itemize}
For example:
{\scriptsize
\begin{verbatim}
##contig=<ID=ctg1,length=81195210,URL=ftp://somewhere.example/assembly.fa,md5=f126cdf8a6e0c7f379d618ff66beb2da,...>
\end{verbatim}
}
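The md5 attribute can be computed along the following lines.
This Python sketch follows the brief description above using the standard \verb|hashlib| module; the function name \verb|contig_md5| is hypothetical, and the input is assumed to be the sequence characters only, without any FASTA header line.

```python
import hashlib


def contig_md5(sequence):
    """MD5 digest of a reference sequence, per the rules summarised above.

    Characters outside the inclusive ASCII range 33..126 (newlines, spaces
    and so on) are dropped, lowercase letters are uppercased, and the
    digest is returned as 32 lowercase hexadecimal characters.
    """
    cleaned = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()
```

Because line breaks and letter case are normalised first, the same sequence yields the same digest however the source file is wrapped or soft-masked.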
\noindent
Contig names follow the same rules as the SAM format's reference sequence names:
they may contain any printable ASCII characters in the range \verb|[!-~]| apart from `{\tt\verb|\|\,,\,"`'\,()\,[]\,\verb|{}|\,<>}' and may not start with `{\tt *}' or `{\tt =}'.
Thus they match the following regular expression:
\begin{verbatim}
[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
\end{verbatim}
\noindent
In particular, excluding commas facilitates parsing \verb|##contig| lines, and excluding the characters `\verb|<>[]|' and initial~`{\tt *}' avoids clashes with symbolic alleles.
The contig names must not use a reserved symbolic allele name.
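A contig name can be checked against the regular expression above as in this Python sketch.
The helper name \verb|is_valid_contig_name| is hypothetical, and the check does not cover the separate rule about reserved symbolic allele names.

```python
import re

# The regular expression quoted above, compiled unchanged.
CONTIG_NAME = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_contig_name(name):
    """True if the whole name matches the contig name pattern."""
    return CONTIG_NAME.fullmatch(name) is not None
```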
\subsubsection{Sample field format}
It is possible to define sample to genome mappings as shown below:
{\scriptsize
\begin{verbatim}
##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>
##META=<ID=Disease,Type=String,Number=.,Values=[None, Cancer]>
##META=<ID=Ethnicity,Type=String,Number=.,Values=[AFR, CEU, ASN, MEX]>
##META=<ID=Tissue,Type=String,Number=.,Values=[Blood, Breast, Colon, Lung, ?]>
##SAMPLE=<ID=Sample1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germline genome from unaffected",DOI=url>
##SAMPLE=<ID=Sample2,Assay=Exome,Ethnicity=CEU,Disease=Cancer,Tissue=Breast,Description="European patient exome from breast cancer">
\end{verbatim}}
\subsubsection{Pedigree field format}
It is possible to record relationships between genomes using the following syntax:
\begin{verbatim}
##PEDIGREE=<ID=TumourSample,Original=GermlineID>
##PEDIGREE=<ID=SomaticNonTumour,Original=GermlineID>
##PEDIGREE=<ID=ChildID,Father=FatherID,Mother=MotherID>
##PEDIGREE=<ID=SampleID,Name_1=Ancestor_1,...,Name_N=Ancestor_N>
\end{verbatim}
\noindent or a link to a database:
\begin{verbatim}
##pedigreeDB=URL
\end{verbatim}
\noindent See \ref{PedigreeInDetail} for details.
\subsection{Header line syntax}
\label{header-line}
The mandatory header line names the 8 fixed, mandatory columns. These columns are as follows:
\begin{center}
\#CHROM
\qquad POS
\qquad ID
\qquad REF
\qquad ALT
\qquad QUAL
\qquad FILTER
\qquad INFO
\end{center}
\noindent
If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs.
Duplicate sample IDs are not allowed.
The header line is tab-delimited and there must be no tab characters at the end of the line.
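The header line rules can be checked with a few lines of Python.
In this sketch the function name \verb|parse_header_line| is hypothetical, and the line is assumed to have had its line separator removed already.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_header_line(line):
    """Validate a '#CHROM ...' header line and return its sample IDs."""
    if line.endswith("\t"):
        raise ValueError("header line must not end with a tab")
    columns = line.split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("the 8 fixed columns are missing or misnamed")
    if len(columns) == 8:
        return []  # no genotype data in this file
    if columns[8] != "FORMAT":
        raise ValueError("column 9 must be FORMAT when samples are present")
    samples = columns[9:]
    if len(set(samples)) != len(samples):
        raise ValueError("duplicate sample IDs are not allowed")
    return samples
```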
\subsection{Data lines}
\label{data-lines}
All data lines are tab-delimited with no tab character at the end of the line.
The last data line must end with a line separator.
In all cases, missing values are specified with a dot (`.').
\subsubsection{Fixed fields}
\label{fixed-fields}
There are 8 fixed fields per record.
Fixed fields are:
\begin{enumerate}
\item CHROM --- chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf.\ the \#\#assembly line in the header).
All entries for a specific CHROM must form a contiguous block within the VCF file.
(String, no whitespace permitted, Required).
\item POS --- position: The reference position, with the 1st base having position 1.
Positions are sorted numerically, in increasing order, within each reference sequence CHROM.
It is permitted to have multiple records with the same POS.
Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig.
(Integer, Required)
\item ID --- identifier: Semicolon-separated list of unique identifiers where available.
If this is a dbSNP variant the rs number(s) should be used.
No identifier should be present in more than one data record.
If there is no identifier available, then the MISSING value should be used.
(String, no whitespace or semicolons permitted, duplicate values not allowed.)
\item REF --- reference base(s): Each base must be one of A,C,G,T,N (case insensitive).
Multiple bases are permitted.
The value in the POS field refers to the position of the first base in the String.
For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the variant (which must be reflected in the POS field), unless the variant occurs at position 1 on the contig in which case it must include the base after the variant; this padding base is not required (although it is permitted) e.g.\ for complex substitutions or other variants where all alleles have at least one base represented in their Strings.
If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
The exception to this is the $<$*$>$ symbolic allele for which the reference call interval includes the POS base.
Tools processing VCF files are not required to preserve case in the REF allele Strings. (String, Required).
If the reference sequence contains IUPAC ambiguity codes not allowed by this specification (such as R = A/G), the ambiguous reference base must be reduced to a concrete base by using the one that is first alphabetically (thus R as a reference base is converted to A in VCF.)
\item ALT --- alternate base(s): Comma-separated list of alternate non-reference alleles.
\label{fixed-fields-alt}
These alleles do not have to be called in any of the samples.
Each allele in this list must be one of: a non-empty String of bases (A,C,G,T,N; case insensitive); the `*' symbol (allele missing due to overlapping deletion); the MISSING value `.' (no variant); an angle-bracketed ID String (``$<$ID$>$''); the unspecified allele ``$<$*$>$'' as described in Section \ref{unspecified-allele}; or a breakend replacement string as described in Section \ref{Breakends}.
If there are no alternative alleles, then the MISSING value must be used.
Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive.
(String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
\item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT, i.e.\ $-10\log_{10}$ prob(call in ALT is wrong).
If ALT is `.' (no variant) then this is $-10\log_{10}$ prob(variant), and if ALT is not `.' this is $-10\log_{10}$ prob(no variant).
If unknown, the MISSING value must be specified. (Float)
\item FILTER --- filter status: PASS if this position has passed all filters, i.e., a call is made at this position.
Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g.\ ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples.
`0' is reserved and must not be used as a filter String.
If filters have not been applied, then this field must be set to the MISSING value.
(String, no whitespace or semicolons permitted, duplicate values not allowed.)
\item INFO --- additional information: Semicolon-separated series of additional information fields, or the MISSING value `{\tt .}'\ if none are present.
Each subfield consists of a short \emph{key} with optional \emph{values} in the format: key[=value[,\,\ldots,value]].
Literal semicolon (`{\tt ;}') and equals sign (`{\tt =}') characters are not permitted in these values, and literal commas (`{\tt ,}') are permitted only as delimiters for lists of values; characters with special meaning can be encoded using percent encoding, see Section~\ref{character-encoding}.
Space characters are allowed in values.
INFO keys must match the regular expression \texttt{\^{}([A-Za-z\_][0-9A-Za-z\_.]*|1000G)\$}; note that ``1000G'' is allowed as a special legacy value.
Duplicate keys are not allowed.
Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} and described below are reserved (albeit optional).
The exact format of each INFO key should be specified in the meta-information (as described above).
Example of a complete INFO field: {\tt DP=154;MQ=52;H2}.
Keys without corresponding values may be used to indicate group membership (e.g.\ H2 indicates the SNP is found in HapMap 2).
See Section~\ref{sv-info-keys} for additional reserved INFO keys used to encode structural variants.
\end{enumerate}
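As a worked illustration of the phred scale used by QUAL (the numeric values are chosen for the example only), a probability $p$ that the assertion is wrong and its QUAL value are related by
\[
\mathrm{QUAL} = -10 \log_{10} p
\qquad\Longleftrightarrow\qquad
p = 10^{-\mathrm{QUAL}/10}
\]
so QUAL=30 on a record with a non-missing ALT corresponds to prob(no variant) $= 10^{-3}$, and QUAL=13 corresponds to a probability of roughly $0.05$.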
\begin{longtable}[c]{ | p{2.5cm} | p{1.5cm} | p{1.5cm} | p{10.3cm} | }
\hline
Key & Number & Type & Description \\ \hline
\endfirsthead
\multicolumn{4}{l}{\small\emph{\ldots Continued from previous page}} \\[0.7ex]
\hline
Key & Number & Type & Description \\ \hline
\endhead
\hline
\multicolumn{4}{r}{\small\emph{Continued on next page\ldots}} \\
\caption[]{Reserved INFO keys}
\endfoot
\hline
\multicolumn{4}{l}{} \\
\caption{Reserved INFO keys}
\label{table:reserved-info}
\endlastfoot
AA & 1 & String & Ancestral allele \\
AC & A & Integer & Allele count in genotypes, for each ALT allele, in the same order as listed \\
AD & R & Integer & Total read depth for each allele \\
ADF & R & Integer & Read depth for each allele on the forward strand \\
ADR & R & Integer & Read depth for each allele on the reverse strand \\
AF & A & Float & Allele frequency for each ALT allele in the same order as listed (estimated from primary data, not called genotypes) \\
AN & 1 & Integer & Total number of alleles in called genotypes \\
BQ & 1 & Float & RMS base quality \\
CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
DB & 0 & Flag & dbSNP membership \\
DP & 1 & Integer & Combined depth across samples \\
END & 1 & Integer & Deprecated. Present for backwards compatibility with earlier versions of VCF. \\
H2 & 0 & Flag & HapMap2 membership \\
H3 & 0 & Flag & HapMap3 membership \\
MQ & 1 & Float & RMS mapping quality \\
MQ0 & 1 & Integer & Number of MAPQ == 0 reads \\
NS & 1 & Integer & Number of samples with data \\
SB & 4 & Integer & Strand bias \\
SOMATIC & 0 & Flag & Somatic mutation (for cancer genomics) \\
VALIDATED & 0 & Flag & Validated by follow-up experiment \\
1000G & 0 & Flag & 1000 Genomes membership \\
\end{longtable}
\begin{itemize}
\renewcommand{\labelitemii}{$\circ$}
\item END: Deprecated.
Retained for backwards compatibility with earlier versions of VCF and older VCF indexing software which rely on this field being present.
This is a computed field that, when present, must be set to the maximum end reference position (1-based) of:
the position of the final base of the REF allele,
the end position corresponding to the SVLEN of a symbolic SV allele,
and the end positions calculated from FORMAT LEN for the $<$*$>$ symbolic allele.
The computed value of this field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
\end{itemize}
\subsubsection{Genotype fields}
If genotype information is present, then the same types of data must be present for all samples.
First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
This is followed by one data block per sample, with the colon-separated data corresponding to the types specified in the format.
The first key must always be the genotype (GT) if it is present.
If any local-allele field is present, LAA must also be present and precede all fields other than GT.
There are no required keys.
Additional Genotype keys can be defined in the meta-information; however, software support for them is not guaranteed.
If any of the fields is missing, it is replaced with the MISSING value.
For example if the FORMAT is GT:GQ:DP:HQ then $0\mid0:.:23:23,34$ indicates that GQ is missing.
If a field contains a list of missing values, it can be represented either as a single MISSING value (`.') or as a list of missing values (e.g.\ `.,.,.' if the field was Number=3).
Trailing fields can be dropped, with the exception of the GT field, which should always be present if specified in the FORMAT field.
If a field and its local-allele equivalent are both defined, they must encode identical information, or one of them must be ignored by containing the MISSING value or being omitted.
As with the INFO field, there are several common, reserved keywords that are standards across the community.
See their detailed definitions below, as well as Table~\ref{table:reserved-genotypes} for their reference Number, Type and Description.
See also Section~\ref{sv-format-keys} for a list of genotype keys reserved for structural variants.
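As an illustration of the rules above for missing and trailing fields (the sample values are invented for the example, and columns are aligned with spaces for readability), the following three sample columns should all be read as the same data under FORMAT GT:GQ:DP:HQ, given that HQ is a Number=2 field:
\footnotesize
\begin{verbatim}
FORMAT        SAMPLE
GT:GQ:DP:HQ   0/1:48:.:.,.
GT:GQ:DP:HQ   0/1:48:.:.
GT:GQ:DP:HQ   0/1:48
\end{verbatim}
\normalsize
The first form lists every missing HQ value, the second collapses the list to a single MISSING value, and the third drops the trailing DP and HQ fields altogether.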
\begin{longtable}[c]{ | p{2.5cm} | p{1.5cm} | p{1.5cm} | p{10.3cm} | }
\hline
Field & Number & Type & Description \\ \hline
\endfirsthead
\multicolumn{4}{l}{\small\emph{\ldots Continued from previous page}} \\[0.7ex]
\hline
Field & Number & Type & Description \\ \hline
\endhead
\hline
\multicolumn{4}{r}{\small\emph{Continued on next page\ldots}} \\
\caption[]{Reserved genotype keys}
\endfoot
\hline
\multicolumn{4}{l}{} \\
\caption{Reserved genotype keys}
\label{table:reserved-genotypes}
\endlastfoot
AD & R & Integer & Read depth for each allele \\
ADF & R & Integer & Read depth for each allele on the forward strand \\
ADR & R & Integer & Read depth for each allele on the reverse strand \\
DP & 1 & Integer & Read depth \\
EC & A & Integer & Expected alternate allele counts \\
LEN & 1 & Integer & Length of $<$*$>$ reference block \\
FT & 1 & String & Filter indicating if this genotype was ``called'' \\
GL & G & Float & Genotype likelihoods \\
GP & G & Float & Genotype posterior probabilities \\
GQ & 1 & Integer & Conditional genotype quality \\
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
LA & . & Integer & Reserved \\
LAA & . & Integer & 1-based indices into ALT, indicating which alleles are relevant (local) for the current sample \\
LAD & LR & Integer & Local-allele representation of AD \\
LADF & LR & Integer & Local-allele representation of ADF \\
LADR & LR & Integer & Local-allele representation of ADR \\
LEC & LA & Integer & Local-allele representation of EC \\
LGL & LG & Float & Local-allele representation of GL \\
LGP & LG & Float & Local-allele representation of GP \\
LPL & LG & Integer & Local-allele representation of PL \\
LPP & LG & Integer & Local-allele representation of PP \\
M[0-9]+[ACGTUN] & M & Float & Fraction of bases modified with the given ChEBI ID. \\
DPM[0-9]+[ACGTUN] & M & Integer & Total read depth for reads able to detect the base modification with the given ChEBI ID. \\
ADM[0-9]+[ACGTUN] & M & Integer & Read depth for reads with the base modification with the given ChEBI ID. \\
M5mC & M & Float & Alias for M27551C 5-Methylcytosine \\
DPM5mC & M & Integer & Alias for DPM27551C \\
ADM5mC & M & Integer & Alias for ADM27551C \\
M5hmC & M & Float & Alias for M76792C 5-Hydroxymethylcytosine \\
DPM5hmC & M & Integer & Alias for DPM76792C \\
ADM5hmC & M & Integer & Alias for ADM76792C \\
M5fC & M & Float & Alias for M76794C 5-Formylcytosine \\
DPM5fC & M & Integer & Alias for DPM76794C \\
ADM5fC & M & Integer & Alias for ADM76794C \\
M5caC & M & Float & Alias for M76793C 5-Carboxylcytosine \\
DPM5caC & M & Integer & Alias for DPM76793C \\
ADM5caC & M & Integer & Alias for ADM76793C \\
M5hmU & M & Float & Alias for M16964T 5-Hydroxymethyluracil \\
DPM5hmU & M & Integer & Alias for DPM16964T \\
ADM5hmU & M & Integer & Alias for ADM16964T \\
M5fU & M & Float & Alias for M80961T 5-Formyluracil \\
DPM5fU & M & Integer & Alias for DPM80961T \\
ADM5fU & M & Integer & Alias for ADM80961T \\
M5caU & M & Float & Alias for M17477T 5-Carboxyluracil \\
DPM5caU & M & Integer & Alias for DPM17477T \\
ADM5caU & M & Integer & Alias for ADM17477T \\
M6mA & M & Float & Alias for M28871A 6-Methyladenine \\
DPM6mA & M & Integer & Alias for DPM28871A \\
ADM6mA & M & Integer & Alias for ADM28871A \\
M8oxoG & M & Float & Alias for M44605G 8-Oxoguanine \\
DPM8oxoG & M & Integer & Alias for DPM44605G \\
ADM8oxoG & M & Integer & Alias for ADM44605G \\
MXaoN & M & Float & Alias for M18107N Xanthosine \\
DPMXaoN & M & Integer & Alias for DPM18107N \\
ADMXaoN & M & Integer & Alias for ADM18107N \\
MQ & 1 & Integer & RMS mapping quality \\
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
PQ & 1 & Integer & Phasing quality \\
PS & 1 & Integer & Phase set \\
PSL & P & String & Phase set list \\
PSO & P & Integer & Phase set list ordinal \\
PSQ & P & Integer & Phase set list quality \\
\end{longtable}
\begin{itemize}
\renewcommand{\labelitemii}{$\circ$}
\item AD, ADF, ADR (Integer): Per-sample read depths for each allele; total (AD), on the forward (ADF) and the reverse (ADR) strand.
\item DP (Integer): Read depth at this position for this sample.
\item EC (Integer): Comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field.
Typically used in association analyses.
\item LEN (Integer): length of the $<$*$>$ reference block for this sample.
\item FT (String): Sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field).
Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied.
These values should be described in the meta-information in the same way as FILTERs.
No whitespace or semicolons permitted.
\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10\log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
\item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities.
\item GT (String): Genotype, encoded as allele values, each preceded by either $/$ or $\mid$ depending on whether that allele is considered phased.
The first phasing indicator may be omitted and is implicitly defined as $/$ if any phasing indicators are $/$ and $\mid$ otherwise.
The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT and so on.
For diploid calls examples could be $0/1$, $1\mid0$, $/0/1$, or $1/2$, etc.
Haploid calls, e.g.\ on Y, male non-pseudoautosomal X, or mitochondria, should be indicated by having only one allele value.
A triploid call might look like $0/0/1$, and a partially phased triploid call could be $|0/1/2$ to indicate that the first allele is phased with another variant in the VCF.
If a call cannot be made for a sample at a given locus, `$.$' must be specified for each missing allele in the {\tt GT} field (for example `$./.$' for a diploid genotype and `$.$' for haploid genotype).
The meanings of the phasing indicators are as follows (see the {\tt PS} and {\tt PSL} fields below for more details on incorporating phasing information into the genotypes):
\begin{itemize}
\item $/$ : allele is unphased
\item $\mid$ : allele is phased (according to the phase-set indicated in {\tt PS} or {\tt PSL})
\end{itemize}
For symbolic structural variant alleles, GT=0 indicates the absence of any of the ALT symbolic structural variants defined in the record.
Implementers should note that merging a VCF record containing only symbolic structural variant ALT alleles with a record containing other alleles will result in a change in the meaning of the GT=0 haplotypes from the record containing only symbolic SVs.
\item GL (Float): Genotype likelihoods comprising comma-separated floating point $\log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
In the presence of the GT field the same ploidy is expected; without a GT field, diploidy is assumed.
\textsc{Genotype Ordering.} \label{genotype-fields:genotype-ordering}
In the general case of ploidy P and N alternate alleles (0 is the REF and $1\ldots N$ the alternate alleles), the ordering of genotypes for the likelihoods can be expressed by the following pseudocode with as many nested loops as ploidy:
\footnote{Note that we use inclusive \texttt{for} loop boundaries.}
\begingroup
\small
\begin{lstlisting}
for $a_P = 0\ldots N$
for $a_{P-1} = 0\ldots a_P$
$\ldots$
for $a_1 = 0\ldots a_{2}$
println $a_1 a_2 \ldots a_P$
\end{lstlisting}
\endgroup
Alternatively, the same can be achieved recursively with the following pseudocode:
\begingroup
\small
\begin{lstlisting}
Ordering($P$, $N$, suffix=""):
for $a$ in $0\ldots N$
if ($P == 1$) println str($a$) + suffix
if ($P > 1$) Ordering($P$-1, $a$, str($a$) + suffix)
\end{lstlisting}
\endgroup
Conversely, the index of the value corresponding to the genotype $k_1\le k_2\le\ldots\le k_P$ is
\begingroup
\small
\begin{lstlisting}
Index($k_1/k_2/\ldots/k_P$) = $\sum_{m=1}^{P} {k_m + m - 1 \choose m}$
\end{lstlisting}
\endgroup
Examples:
\begin{itemize}
\item for $P$=2 and $N$=1, the ordering is 00,01,11
\item for $P$=2 and $N$=2, the ordering is 00,01,11,02,12,22
\item for $P$=3 and $N$=2, the ordering is 000, 001, 011, 111, 002, 012, 112, 022, 122, 222
\item for $P$=1, the index of the genotype $a$ is $a$
\item for $P$=2, the index of the genotype ``$a/b$'', where $a\le b$, is $b (b+1)/2 + a$
\item for $P$=2 and arbitrary $N$, the ordering can be easily derived from a triangular matrix
\newline
\hbox{\hskip5em\footnotesize
\begin{tabular}{l|llll}
$b\setminus a$ & 0 & 1 & 2 & 3 \\ \hline \\[-0.5em]
0 & 0 & & & \\
1 & 1 & 2 & & \\
2 & 3 & 4 & 5 & \\
3 & 6 & 7 & 8 & 9
\end{tabular}
}
\end{itemize}
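As a worked check of the index formula (a sketch for one arbitrarily chosen case, $P$=3 and $N$=2), the genotype $0/1/2$ has $k_1=0$, $k_2=1$ and $k_3=2$, giving
\[
\mathrm{Index}(0/1/2) = {0 \choose 1} + {2 \choose 2} + {4 \choose 3} = 0 + 1 + 4 = 5
\]
which agrees with the position of 012 in the $P$=3, $N$=2 ordering listed above, counting from 0.
Likewise $1/2/2$ gives ${1 \choose 1} + {3 \choose 2} + {4 \choose 3} = 1 + 3 + 4 = 8$, the position of 122.
The ten genotypes in that ordering also match the usual count of multisets of size $P$ drawn from $N+1$ alleles, ${N+P \choose P} = {5 \choose 3} = 10$.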
\item HQ (Integer): Haplotype qualities, two comma separated phred qualities.
\item LAA is a list of $n$ distinct integers, giving the 1-based indices of the ALT alleles that are observed in the sample.
In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS.
Usually, few of these alleles are actually observed in any one sample, but each genotype must supply fields like PL and AD for all of the alleles---a very inefficient representation as PL's size is quadratic in the allele count.
Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference.
To prevent this growth in VCF size, one can choose to specify the genotype, allele depth and the genotype likelihood against a subset of ``Local Alleles''.
LAA is the 1-based index into ALT, defining the alleles that are actually in-play for that sample and the order in which they are interpreted.
LAA is required when interpreting local-allele fields and must be present if any local-allele fields are neither omitted nor MISSING.
Since BCF encodes zero length vectors as MISSING, a LAA containing the MISSING value should be treated as the empty vector (i.e. a REF-only site) if any local-allele fields are neither omitted nor MISSING.
All specification-defined A, R and G FORMAT fields have a local-allele equivalent that should be interpreted in the same manner as its matching field except for the ALT alleles considered present and the order in which they are interpreted.
For example, if REF is G, ALT is A,C,T,\verb!<*>! and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] and not contain likelihood values for genotypes that involve A or T.
GQ is still the genotype quality, even when the genotype is given against the local alleles.
In the following example, the records with the same POS encode the same information (some columns removed for clarity):
\vspace{0.5em}
\begin{tabular}[l]{llllll}
POS &REF& ALT&FORMAT&sample\\
1&G&A,C,T,\textless*\textgreater& GT:LAA:LAD:LPL& 2/4:2,4:20,30,10:90,80,0,100,110,120\\
1&G&A,C,T,\textless*\textgreater& GT:AD:PL& 2/4:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120\\
2&A&C,G,T,\textless*\textgreater& GT:LAA:LAD:LPL& 0/3:3:15,25:40,0,80\\
2&A&C,G,T,\textless*\textgreater& GT:AD:PL&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.,.\\
3&C&G,T,\textless*\textgreater& GT:LAA:LAD:LPL& 0/0:3:30,1:0,30,80\\
3&C&G,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.,1:0,.,.,.,.,.,30,.,.,80\\
4&G&A,T,\textless*\textgreater& GT:LAA:LAD:LPL& 0/0::30:0\\
4&G&A,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.,.:0,.,.,.,.,.,.,.,.,.\\
\end{tabular}
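To illustrate how the LPL values of the first record map onto the PL values of the second (a sketch that assumes diploid genotypes and uses the index $b(b+1)/2 + a$ from the GL paragraph), LAA=2,4 makes the local alleles 0, 1 and 2 correspond to alleles 0, 2 and 4 of the record:
\vspace{0.5em}
\begin{tabular}{lllll}
Local allele pair & LPL index & Allele pair & PL index & Value \\
0/0 & 0 & 0/0 & 0 & 90 \\
0/1 & 1 & 0/2 & 3 & 80 \\
1/1 & 2 & 2/2 & 5 & 0 \\
0/2 & 3 & 0/4 & 10 & 100 \\
1/2 & 4 & 2/4 & 12 & 110 \\
2/2 & 5 & 4/4 & 14 & 120 \\
\end{tabular}
The remaining nine PL entries involve A or T and are therefore MISSING in the expanded record.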
Due to BCF encoding empty vectors as missing, implementation-defined Number=LA local-allele fields should not be used if distinguishing between zero-length data and missing data is required at REF-only sites.
It is recommended that VCF libraries provide an API in which local allele encoding can be abstracted away from the API consumer and values accessed through their corresponding non-local key.
\item LPL (Integer): A list of ${n + \mathrm{Ploidy} \choose \mathrm{Ploidy}}$ integers, where $n$ is the number of alleles listed in LAA, giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the LAA local alleles.
The precise ordering is defined in the GL paragraph.
\item M[0-9]+[ACGTUN] (Float): Fraction of DNA or RNA bases modified with the given ChEBI ID.
All FORMAT keys matching the given regular expression are considered reserved keys, even for ChEBI IDs that do not correspond to valid base modifications.
The alias keys M5mC, M5hmC, M5fC, M5caC, M5hmU, M5fU, M5caU, M6mA, M8oxoG, and MXaoN should be used instead of their corresponding ChEBI keys.
Values must be between 0 and 1 and indicate how prevalent the modified base is in the sample.
When base modification information is present in the FORMAT field of a reference block record, the base modification information applies to all applicable bases covered by that reference block.
\item DPM[0-9]+[ACGTUN] (Integer): Total read depth for reads able to detect the base modification with the given ChEBI ID.
All FORMAT keys matching the given regular expression are considered reserved keys, even for ChEBI IDs that do not correspond to valid base modifications.
The alias keys DPM5mC, DPM5hmC, DPM5fC, DPM5caC, DPM5hmU, DPM5fU, DPM5caU, DPM6mA, DPM8oxoG, and DPMXaoN should be used instead of their corresponding ChEBI keys.
\item ADM[0-9]+[ACGTUN] (Integer): Read depth for reads with the base modification with the given ChEBI ID.
All FORMAT keys matching the given regular expression are considered reserved keys, even for ChEBI IDs that do not correspond to valid base modifications.
The alias keys ADM5mC, ADM5hmC, ADM5fC, ADM5caC, ADM5hmU, ADM5fU, ADM5caU, ADM6mA, ADM8oxoG, and ADMXaoN should be used instead of their corresponding ChEBI keys.
Note that ADFM[0-9]+[ACGTUN] and ADRM[0-9]+[ACGTUN] are not reserved fields, as Number=M fields are intrinsically stranded and unstranded information should be encoded using the MISSING value.
Unstranded CpG methylation counts should be placed in the C position with value for the subsequent G base MISSING.
Stranded CpG methylation counts should be placed in both values with the C position effectively encoding ADF, and the G effectively encoding ADR due to the strand the C in the CpG occurs on.
The following example contains unphased, unstranded CpG methylation information for the CpG at chr:10-11 and phased, stranded CpG methylation information for the CpG at chr:20-21.
\vspace{0.5em}
\begin{tabular}{ l l l l l l l l l l}
\#CHROM & POS & REF & ALT & FORMAT & SAMPLE\\
chr & $10$ & C & . & GT:M5mC:DPM5mC:ADM5mC & \tt{0/0:0.5:2:1}\\
chr & $11$ & G & . & GT:M5mC:DPM5mC:ADM5mC & \tt{0/0:.:.:.}\\
chr & $20$ & C & . & GT:PS:M5mC:DPM5mC:ADM5mC & \tt{0|0:20:0.75,.:4,.:3,.}\\
chr & $21$ & G & A & GT:PS:M5mC:DPM5mC:ADM5mC & \tt{0|1:20:0.33:3:1}\\
\end{tabular}
Note that in the above example, the second record could be omitted entirely without any change in meaning.
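As an observation on the example above, and not a requirement stated by this specification, each reported fraction equals the modified read depth divided by the total depth able to detect the modification:
\[
\mathrm{M5mC} = \frac{\mathrm{ADM5mC}}{\mathrm{DPM5mC}}: \qquad \frac{1}{2} = 0.5, \qquad \frac{3}{4} = 0.75, \qquad \frac{1}{3} \approx 0.33
\]
for the records at positions 10, 20 and 21 respectively.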
\item MQ (Integer): RMS mapping quality, similar to the version in the INFO field.
\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
\item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).
We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality.
\item PS (non-negative 32-bit Integer): Phase set, defined as a set of phased genotypes to which this genotype belongs.
Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set.
A phase set specifies multi-marker haplotypes for the phased genotypes in the set.
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\item PSL (List of Strings): The list of phase sets, one for each allele value specified in the {\tt GT}.
Unphased alleles (without a $\mid$ separator before them) must have the value `$.$' in their corresponding position in the list.
Unlike {\tt PS} (which is defined per CHROM), records with different CHROM but the same phase-set name are considered part of the same phase set.
If an implementation cannot guarantee uniqueness of phase-set names across the VCF (for example, when phasing a streaming VCF or when each CHROM is processed independently in parallel), new phase-set names should be of the format CHROM*POS*ALLELE-NUMBER of the ``first'' allele which is included in this set, with ALLELE-NUMBER being the index of the allele in the {\tt GT} field, since multiple distinct phase-sets could start at the same position. \footnote{The `*' character is used as a separator since `:' is not reserved in the CHROM column.}
A given sample-genotype must not have values for both PS and PSL.
In addition, PS and PSL are not interoperable, in that a PS mentioned in one variant cannot be referenced in a PSL in another, since when used in PS it isn't connected to any specific haplotype (i.e. first or second), but PSL is.
Example:
\vspace{0.5em}
\begin{tabular}{ l l l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & SAMPLE1\\
chr19 & $5$ & . & T & G & . & PASS & DP=100 & GT:PSL & \tt{|0/1:chr9*5*1,.}\\
chr20 & $10$ & . & A & T,G & . & PASS & DP=100 & GT:PSL & \tt{|1/2|3:chr20*10*1,.,chr9*5*1} \\
chr20 & $15$ & . & G & C & . & PASS & DP=100 & GT:PSL & \tt{1|2:.,chr20*10*1}\\
\end{tabular}
\item PSO (List of integers): List of phase set ordinals.
For each phase-set name, defines the order in which variants are encountered when traversing a derivative chromosome.
The missing value `$.$' should be used when the corresponding PSO value is missing.
For each phase-set name, PSO should be defined if any allele with that phase-set name on any record is a symbolic structural variant or in breakpoint notation.
Variants in breakpoint notation must have the same PSL and PSO on both records.
Without explicitly specifying the derivative chromosome traversal order, multiple derivative chromosome reconstructions are possible.
Take for example this tandem duplication in a triploid organism with SNVs (ID/QUAL/FILTER columns removed for clarity):
\vspace{0.5em}
\begin{tabular}{ l l l l l l l l l l}
\#CHROM & POS & REF & ALT & INFO & FORMAT & SAMPLE1\\
chr1 & $10$ & T & $<$DUP$>$ & SVCLAIM=DJ & GT:PSL:PSO & \tt{/0/0|1:.,.,chr1*10*1:.,.,3}\\
chr1 & $20$ & A & G & . & GT:PSL:PSO & \tt{/0/0|0|1:.,.,chr1*10*1,chr1*10*1:.,.,4,1} \\
chr1 & $30$ & G & T & . & GT:PSL:PSO & \tt{/0/0|0|1:.,.,chr1*10*1,chr1*10*1:.,.,2,5} \\
\end{tabular}
Without defining PSO, it would be ambiguous as to which copy of the duplicated region the SNVs occur on.
In this example, the presence of the PSO field clarifies that the SNVs are cis phased with the duplication, the first SNV occurs on the first copy of the duplicated region, and second SNV on the second copy.
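One way to read the example (an interpretation of the ordinals, assuming they are visited in increasing order along the derivative chromosome) is as the following traversal of phase set chr1*10*1:
\vspace{0.5em}
\begin{tabular}{llll}
PSO & POS & Allele & Reading \\
1 & 20 & G (ALT) & first copy carries the first SNV \\
2 & 30 & G (REF) & first copy is reference at the second SNV \\
3 & 10 & $<$DUP$>$ & duplication joins the first copy to the second \\
4 & 20 & A (REF) & second copy is reference at the first SNV \\
5 & 30 & T (ALT) & second copy carries the second SNV \\
\end{tabular}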
\item PSQ (List of integers): The list of PQs, one for each phase set in PSL (encoded like PQ).
The missing value `$.$' should be used when the corresponding PSL value is missing, or when the phasing is of unknown quality.
\end{itemize}
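As a small worked illustration of how the GL and PL conventions above relate (the likelihood values are invented for the example), a diploid genotype at a site with one ALT allele and GL=$-3.2,-0.01,-5.0$ has phred-scaled likelihoods
\[
-10 \times (-3.2,\; -0.01,\; -5.0) = (32,\; 0.1,\; 50)
\]
which round to PL=$32,0,50$ for the genotypes 0/0, 0/1 and 1/1 in that order.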
\section{Understanding the VCF format and the haplotype representation}
VCF records use a single general system for representing genetic variation data composed of:
\begin{itemize}
\item Allele: representing single genetic haplotypes (A, T, ATC).
\item Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus.
\item VCF record: a record holding all segregating alleles at a locus (as well as genotypes, if appropriate, for multiple individuals containing alleles at that locus).
\end{itemize}
VCF records use a simple haplotype representation for REF and ALT alleles to describe variant haplotypes at a locus.
ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases.
In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele.
\subsection{VCF tag naming conventions}
Several tag names follow conventions which should be used for implementation-defined tags as well:
\begin{itemize}
\item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$.
Likelihoods are represented on a $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL).
In some cases the likelihood can also be represented on the phred scale in a separate tag (e.g.\ PL).
\item The `P' suffix means \emph{probability} as linear-scale probability in the posterior distribution, which is $\Pr(\mathrm{Model}|\mathrm{Data})$. Examples are GP, CNP.
\item The `Q' suffix means \emph{quality} as log-complementary-phred-scale posterior probability, $-10 \log_{10} (1 - \Pr(\mathrm{Model}|\mathrm{Data}))$, where the model is the most likely genotype that appears in the GT field.
Examples are GQ, CNQ.
The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number).
\item The `L' prefix indicates the local-allele equivalent of a Number=A, R or G field.
\end{itemize}
\section{INFO keys used for structural variants}
\label{sv-info-keys}
\begin{samepage}
The following INFO keys are reserved for encoding structural variants.
In general, when these keys are used by imprecise variants, the values should be best estimates.
When present, per allele values must be specified for all ALT alleles (including non-structural alleles).
Except in lists of strings, the missing value should be used as a placeholder for the ALT alleles for which the key does not have a meaningful value.
The empty string should be used to encode missing values in lists of strings.
\footnotesize
\begin{verbatim}
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
\end{verbatim}
\normalsize
Indicates that this record contains an imprecise structural variant $ALT$ allele. ALT alleles missing $CIPOS$ are to be interpreted as imprecise variants with an unspecified confidence interval.
If a precise ALT allele is present in a record with the $IMPRECISE$ flag, $CIPOS$ must be explicitly set for that allele, even if it is `0,0'.
\footnotesize
\begin{verbatim}
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="Deprecated. Present for backwards compatibility with earlier versions of VCF.">
\end{verbatim}
\normalsize
$END$ has been deprecated in favour of INFO SVLEN and FORMAT LEN.
\footnotesize
\begin{verbatim}
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
\end{verbatim}
\normalsize
\end{samepage}
This field has been deprecated due to redundancy with ALT.
Refer to section \ref{altfield} for the set of valid ALT field symbolic structural variant alleles.
\footnotesize
\begin{verbatim}
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
\end{verbatim}
\normalsize
One value for each ALT allele.
SVLEN must be specified for symbolic structural variant alleles.
SVLEN is defined for $INS$, $DUP$, $INV$, and $DEL$ symbolic alleles as the number of the inserted, duplicated, inverted, and deleted bases respectively.
SVLEN is defined for $CNV$ symbolic alleles as the length of the segment over which the copy number variant is defined.
The missing value $.$ should be used for all other ALT alleles, including ALT alleles using breakend notation.
For backwards compatibility, a missing SVLEN should be inferred from the $END$ field.
For backwards compatibility, a negative SVLEN should be interpreted as its absolute value.
Note that for structural variant symbolic alleles, $POS$ corresponds to the base immediately preceding the variant.
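As an illustrative sketch (the coordinates and alleles are invented, and the columns are aligned with spaces for readability), a record carrying one symbolic deletion and one SNV could set SVLEN as follows:
\footnotesize
\begin{verbatim}
#CHROM  POS   ID  REF  ALT      QUAL  FILTER  INFO
chr1    1000  .   A    <DEL>,G  .     .       SVLEN=200,.
\end{verbatim}
\normalsize
Because $POS$ is the base immediately preceding the deletion, the deleted bases are 1001 to 1200, so under the definitions above the end position implied by SVLEN is $1000 + 200 = 1200$.
A legacy file writing SVLEN=$-200$ for the same deletion would be read as 200.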
\footnotesize
\begin{verbatim}
##INFO=<ID=CIPOS,Number=.,Type=Integer,Description="Confidence interval around POS for symbolic structural variants">
\end{verbatim}
\normalsize
If present, the number of entries must be twice the number of ALT alleles.
$CIPOS$ consists of successive pairs of values indicating the start and end offsets relative to $POS$ of the confidence interval for each ALT allele.
For example, $CIPOS=-5,5,0,0$ indicates a 5bp confidence interval in each direction for the first ALT allele, and an exact position for the second ALT allele.
When breakpoint sequence homology exists, $CIPOS$ should be used in conjunction with $HOMSEQ$ to specify the interval of homology.
If both $IMPRECISE$ and $CIPOS$ are omitted, $CIPOS$ is implicitly defined as 0,0 for all alleles.
Each $CIPOS$ interval must span 0. That is, the lower bound cannot be greater than 0, and the upper bound cannot be less than 0.
\footnotesize
\begin{verbatim}
##INFO=<ID=CIEND,Number=.,Type=Integer,Description="Confidence interval around the inferred END for symbolic structural variants">
\end{verbatim}
\normalsize
If present, the number of entries must be twice the number of ALT alleles.
$CIEND$ consists of successive pairs of values encoding the confidence interval start and end offsets relative to the $END$ position inferred by $SVLEN$ for each ALT allele.
For symbolic structural variants, the first in the pair must not be greater than 0, and the second must not be less than 0.
For all other alleles, both should be the missing value $.$.
For example, $CIEND=-5,5,.,.$ indicates a 5bp confidence interval in each direction around the end position for the first ALT allele, and no $CIEND$ is defined for the second ALT allele.
If $CIEND$ is missing, it is assumed to match $CIPOS$.
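As a worked sketch with invented coordinates, take a single $<$DEL$>$ allele at $POS$=1000 with SVLEN=200, CIPOS=$-5,5$ and CIEND=$-10,10$.
Taking the end position inferred from SVLEN to be $1000 + 200 = 1200$, the confidence intervals in reference coordinates are
\[
[1000 - 5,\; 1000 + 5] = [995,\; 1005]
\qquad\mbox{and}\qquad
[1200 - 10,\; 1200 + 10] = [1190,\; 1210]
\]
Omitting CIEND from this record would, by the rule above, give the end position the same offsets as CIPOS, that is $[1195,\; 1205]$.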
\footnotesize
\begin{verbatim}
##INFO=<ID=HOMLEN,Number=A,Type=Integer,Description="Length of base pair identical micro-homology at breakpoints">
\end{verbatim}
\normalsize
\footnotesize
\begin{verbatim}
##INFO=<ID=HOMSEQ,Number=A,Type=String,Description="Sequence of base pair identical micro-homology at breakpoints">
\end{verbatim}
\normalsize
\footnotesize
\begin{verbatim}
##INFO=<ID=BKPTID,Number=A,Type=String,Description="ID of the assembled alternate allele in the assembly file">
\end{verbatim}
\normalsize
For precise variants, the consensus sequence of the alternate allele assembly is derivable from the REF and ALT fields.
However, the alternate allele assembly file may contain additional information about the characteristics of the alt allele contigs.
\footnotesize
\begin{verbatim}
##INFO=<ID=MEINFO,Number=.,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY">
\end{verbatim}
\normalsize
If present, the number of entries must be four (4) times the number of ALT alleles.
$MEINFO$ consists of successive quadruplets of records for each ALT allele.
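For illustration, a record with two mobile element ALT alleles would carry eight entries, four per allele, in ALT order.
The element names, coordinates and polarities in this fragment are invented for the example, and START and END are assumed here to be coordinates within the mobile element sequence.
\footnotesize
\begin{verbatim}
MEINFO=AluYa5,5,307,+,L1HS,1,6019,-
\end{verbatim}
\normalsize
Here the first quadruplet (AluYa5, 5, 307, +) describes the first ALT allele and the second quadruplet (L1HS, 1, 6019, -) describes the second.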
\footnotesize
\begin{verbatim}
##INFO=<ID=METRANS,Number=.,Type=String,Description="Mobile element transduction info of the form CHR,START,END,POLARITY">
\end{verbatim}
\normalsize
If present, the number of entries must be four (4) times the number of ALT alleles.
$METRANS$ consists of successive quadruplets of records for each ALT allele.
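As a hypothetical fragment for a record with a single ALT allele, the four entries below name the chromosome, start, end and polarity of the transduced sequence.
The values are invented for the example.
\footnotesize
\begin{verbatim}
METRANS=1,94820,95080,+
\end{verbatim}
\normalsize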
\footnotesize
\begin{verbatim}
##INFO=<ID=DGVID,Number=A,Type=String,Description="ID of this element in Database of Genomic Variation">
##INFO=<ID=DBVARID,Number=A,Type=String,Description="ID of this element in DBVAR">
##INFO=<ID=DBRIPID,Number=A,Type=String,Description="ID of this element in DBRIP">
##INFO=<ID=MATEID,Number=A,Type=String,Description="ID of mate breakend">
##INFO=<ID=PARID,Number=A,Type=String,Description="ID of partner breakend">
##INFO=<ID=EVENT,Number=A,Type=String,Description="ID of associated event">
##INFO=<ID=EVENTTYPE,Number=A,Type=String,Description="Type of associated event">
\end{verbatim}
\normalsize
Whilst simple events such as deletions and duplications can be wholly represented by a single VCF record, complex rearrangements such as chromothripsis result in a large number of breakpoints.
VCF uses the $EVENT$ field to group such related records together, and $EVENTTYPE$ to classify these events.
All records with the same $EVENT$ value are considered to be part of the same event.
The following $EVENTTYPE$ values are reserved and should be used when appropriate:
\begin{itemize}
\item DEL - Deletion
\item DEL:ME - Deletion of mobile element with respect to the reference
\item INS - Insertion
\item INS:ME - Insertion of mobile element
\item DUP - Duplication
\item DUP:TANDEM - Tandem duplication
\item DUP:DISPERSED - Dispersed duplication
\item INV - Inversion
\item TRA - Translocation
\item TRA:BALANCED - Balanced inter-chromosomal translocation
\item TRA:UNBALANCED - Unbalanced inter-chromosomal translocation
\item CHROMOTHRIPSIS - Chromothripsis
\item CHROMOPLEXY - Chromoplexy
\item BFB - Breakage fusion bridge
\item DOUBLEMINUTE - Double minute
\end{itemize}
The semantics of other $EVENTTYPE$ values is implementation-defined.
$EVENT$ is not restricted to structural variation and can also be used to associate non-symbolic alleles.
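The grouping can be sketched with a hypothetical reciprocal translocation between chromosomes 2 and 13, written as four breakend records.
The positions, bases, record IDs and the event name \texttt{tra1} are invented for the example, and the records are abbreviated to the fields discussed here.
\footnotesize
\begin{verbatim}
#CHROM  POS     ID     REF  ALT            QUAL  FILTER  INFO
2       321681  bnd_W  G    G[13:123457[   .     PASS    MATEID=bnd_X;EVENT=tra1;EVENTTYPE=TRA:BALANCED
2       321682  bnd_V  T    ]13:123456]T   .     PASS    MATEID=bnd_U;EVENT=tra1;EVENTTYPE=TRA:BALANCED
13      123456  bnd_U  C    C[2:321682[    .     PASS    MATEID=bnd_V;EVENT=tra1;EVENTTYPE=TRA:BALANCED
13      123457  bnd_X  A    ]2:321681]A    .     PASS    MATEID=bnd_W;EVENT=tra1;EVENTTYPE=TRA:BALANCED
\end{verbatim}
\normalsize
$MATEID$ links each breakend to its mate, while the shared $EVENT$ value marks all four records as parts of one event, which $EVENTTYPE$ classifies as a balanced translocation.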