Note: This repository is no longer maintained. The code has been moved to:
https://github.com/HadoopGenomics/Hadoop-BAM
Hadoop-BAM: a library for the manipulation of files in common bioinformatics
formats using the Hadoop MapReduce framework, and command line tools in the
vein of SAMtools.
The file formats currently supported are:
- BAM (Binary Alignment/Map)
- SAM (Sequence Alignment/Map)
- FASTQ
- FASTA (input only)
- QSEQ
- VCF (Variant Call Format)
- BCF (Binary VCF) (output is always BGZF-compressed)
For a longer high-level description of Hadoop-BAM, refer to the article
"Hadoop-BAM: directly manipulating next generation sequencing data in the
cloud" in Bioinformatics Volume 28 Issue 6 pp. 876-877, also available online
at: http://dx.doi.org/10.1093/bioinformatics/bts054
If you are interested in using Apache Pig (http://pig.apache.org/) with
Hadoop-BAM, refer to SeqPig at: http://seqpig.sourceforge.net/
Note that the library part of Hadoop-BAM is primarily intended for developers
with experience in using Hadoop. The command line tools of Hadoop-BAM should be
understandable to all users, but they are limited in scope. SeqPig is a more
versatile and higher-level interface to the file formats supported by
Hadoop-BAM.
------------
Dependencies
------------
Hadoop. Tested with 1.1.2 and 2.2.0. Older stable versions as far back
as 0.20.2 should also work. Version 4.2.0 of Cloudera's distribution,
CDH, has also been tested. Use other versions at your own risk. You
can change the version of Hadoop linked against by modifying the
corresponding parameter in the pom.xml build file.
Picard SAM-JDK. Version 1.107 is required. Later versions may also
work but have not been tested. A version of Picard is distributed via
the unofficial maven repository (see below).
Availability:
Hadoop - http://hadoop.apache.org/
Picard - http://picard.sourceforge.net/
------------
Installation
------------
A precompiled "hadoop-bam-X.Y.jar" built against Hadoop 2.2.0 is provided. You
may also build it yourself using the commands below, a necessary step if you
are using an incompatible version of Hadoop.
The easiest way to compile Hadoop-BAM is to use Maven (version 3.0.4 at least)
with the following command:
mvn clean package -DskipTests
The previous command will create two files:
target/hadoop-bam-X.Y-SNAPSHOT.jar
target/hadoop-bam-X.Y-SNAPSHOT-jar-with-dependencies.jar
The former contains only Hadoop-BAM whereas the latter one also contains all
dependencies and can be run directly via
hadoop jar target/hadoop-bam-X.Y-SNAPSHOT-jar-with-dependencies.jar
Javadoc documentation is generated automatically and can then be found in
the target/apidocs subdirectory.
Finally, the unit tests can be run via:
mvn test
-------------
Library usage
-------------
Hadoop-BAM provides the standard set of Hadoop file format classes for the file
formats it supports: a FileInputFormat and one or more RecordReaders for input,
and a FileOutputFormat and one or more RecordWriters for output.
Note that Hadoop-BAM is based around the newer Hadoop API introduced in the
0.20 Hadoop releases instead of the older, deprecated API.
See the Javadoc as well as the command line plugins' source code (in
src/main/java/fi/tkk/ics/hadoop/bam/cli/plugins/*.java) for more information. In
particular, for MapReduce usage, recommended examples are
src/main/java/fi/tkk/ics/hadoop/bam/cli/plugins/FixMate.java and
src/main/java/fi/tkk/ics/hadoop/bam/cli/plugins/VCFSort.java.
When using Hadoop-BAM as a library in your program, remember to have
hadoop-bam-X.Y.jar as well as the Picard .jars (including the Commons JEXL .jar)
in your CLASSPATH and HADOOP_CLASSPATH; alternatively, use the
*-jar-with-dependencies.jar, which already contains all dependencies.
Linking against Hadoop-BAM
..........................
If your Maven project relies on Hadoop-BAM, the easiest way to link against
it is by adding our unofficial Maven repository, which also provides matching
versions of the dependencies. You need to add the following to your pom.xml:
<project>
  ...
  <repositories>
    <repository>
      <id>hadoop-bam-sourceforge</id>
      <url>http://hadoop-bam.sourceforge.net/maven/</url>
    </repository>
  </repositories>
  ...
  <dependencies>
    <dependency>
      <groupId>fi.tkk.ics.hadoop.bam</groupId>
      <artifactId>hadoop-bam</artifactId>
      <version>6.2</version>
    </dependency>
    <dependency>
      <groupId>variant</groupId>
      <artifactId>variant</artifactId>
      <version>1.107</version>
    </dependency>
    <dependency>
      <groupId>tribble</groupId>
      <artifactId>tribble</artifactId>
      <version>1.107</version>
    </dependency>
    <dependency>
      <groupId>cofoja</groupId>
      <artifactId>cofoja</artifactId>
      <version>1.0</version>
    </dependency>
    <dependency>
      <groupId>picard</groupId>
      <artifactId>picard</artifactId>
      <version>1.107</version>
    </dependency>
    <dependency>
      <groupId>samtools</groupId>
      <artifactId>samtools</artifactId>
      <version>1.107</version>
    </dependency>
    ...
  </dependencies>
  ...
</project>
------------------
Command-line usage
------------------
Hadoop-BAM can be used as a command-line tool, with functionality in the form
of plugins that provide commands to which hadoop-bam-X.Y.jar is a frontend.
Hadoop-BAM provides some commands of its own, but any others found in the Java
class path will be used as well.
Running under Hadoop
....................
To use Hadoop-BAM under Hadoop, the easiest method is to use the
jar that comes packaged with all dependencies via
hadoop jar hadoop-bam-with-dependencies.jar
Alternatively, you can use the "-libjars" command line argument when
running Hadoop-BAM to provide different versions of dependencies as follows:
hadoop jar hadoop-bam-X.Y.jar \
    -libjars sam-1.107.jar,picard-1.107.jar,variant-1.107.jar,tribble-1.107.jar,commons-jexl-2.1.1.jar
Finally, all jar files can also be added to HADOOP_CLASSPATH in the Hadoop
configuration's hadoop-env.sh.
The command used should print a brief help message listing the Hadoop-BAM
commands available. To run a command, give it as the first command-line
argument. For example, the provided SAM/BAM sorting command, "sort":
hadoop jar hadoop-bam-with-dependencies-X.Y.jar sort
This will give a help message specific to that command.
File paths under Hadoop
.......................
When running under Hadoop, keep in mind that file paths refer to the
distributed file system, HDFS. To explicitly access a local file, instead of
using the plain path such as "/foo/bar", you must use a file: URI, such as
"file:/foo/bar". Note that paths in file: URIs must be absolute.
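The distinction can be seen with the standard java.net.URI class; this toy
snippet (not part of Hadoop-BAM, purely illustrative) shows how a plain path
differs from a file: URI:

```java
import java.net.URI;

public class FileUriDemo {
    // Returns the local path named by a file: URI, or null if the string
    // has no scheme (i.e. Hadoop would take it as an HDFS path).
    static String localPath(String s) {
        URI uri = URI.create(s);
        return "file".equals(uri.getScheme()) ? uri.getSchemeSpecificPart() : null;
    }

    public static void main(String[] args) {
        System.out.println(localPath("file:/foo/bar"));  // /foo/bar
        System.out.println(localPath("/foo/bar"));       // null: plain paths go to HDFS
    }
}
```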
Output of MapReduce-using commands
..................................
An example of a MapReduce-using command is "sort". Like all such commands
should, it takes a working directory argument and places its output there in
parts. Each part is the output of one reduce task. By default, these parts are
not complete and usable files! They are /not/ BAM or SAM files: they are only
parts of BAM or SAM files, containing output records but lacking headers and
footers.
For convenience, the provided MapReduce-using commands support a "-o" parameter
to output single complete files instead of the individual parts.
For concatenating the outputs of tools that wish to output complete SAM and BAM
files from each reducer, the "cat" command is provided.
Note that some commands, such as the provided "view" and "index" commands, do
not use MapReduce: they are simply useful for operating directly on files
stored in HDFS.
Running without Hadoop
......................
Hadoop-BAM can be run directly, outside Hadoop, as long as it and the Picard
and Hadoop .jar files as well as the Apache Commons CLI .jar provided by Hadoop
("lib/commons-cli-1.2.jar" for version 1.1.2) are in the Java class path.
Alternatively, use the bundled jar (hadoop-bam-jar-with-dependencies-X.Y.jar). In
addition, depending on the Hadoop version, there may be more dependencies from
the Hadoop lib/ directory. A command such as the following:
java fi.tkk.ics.hadoop.bam.cli.Frontend
is equivalent to the "hadoop jar hadoop-bam-X.Y.jar" command used earlier. This has
limited application, but it can be used e.g. for testing purposes.
Note that the "-libjars" way of passing the paths to the Picard .jars will not
work when running Hadoop-BAM like this.
------------------
Summarizer plugins
------------------
This part explains some behaviour of the summarizing plugins, available in the
command line interface as "hadoop jar hadoop-bam-X.Y.jar summarize" and "hadoop jar
hadoop-bam-X.Y.jar summarysort". Unless you are a Chipster user, this section is
unlikely to be relevant to you, and even then, this is not likely to be
something you are interested in.
Summarization is typically best done with the "hadoop jar hadoop-bam-X.Y.jar
summarize --sort -o output-directory" command. Then there is no need to worry
about concatenating nor sorting the output, as both are done automatically in
this one command. But if you do not pass the "--sort" option, do remember that
Chipster needs the outputs sorted before it can make use of them. For this, you
need to run a separate "hadoop jar hadoop-bam-X.Y.jar summarysort" command for each
summary file output by "summarize".
Output format
.............
The summarizer's output format is tabix-compatible. It is composed of rows of
tab-separated data:
<reference sequence ID> <left coordinate> <right coordinate> <count>
The coordinate columns are 1-based and both ends are inclusive.
The 'count' field represents the number of alignments that have been summarized
into that single range. Note that it may not exactly match any of the 'level'
arguments passed to Summarize, due to Hadoop splitting the file at a boundary
which is not an even multiple of the requested level.
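For illustration, the four-column rows are easy to consume outside Hadoop. The
small parser below is not part of Hadoop-BAM; it is only a sketch of the
format described above, with a made-up example row:

```java
public class SummaryRow {
    final String refId;
    final long left, right, count;  // coordinates are 1-based, both ends inclusive

    SummaryRow(String line) {
        // Rows are tab-separated: <ref seq ID> <left> <right> <count>
        String[] f = line.split("\t");
        refId = f[0];
        left  = Long.parseLong(f[1]);
        right = Long.parseLong(f[2]);
        count = Long.parseLong(f[3]);
    }

    // Number of bases covered by the range (inclusive coordinates).
    long span() { return right - left + 1; }

    public static void main(String[] args) {
        SummaryRow r = new SummaryRow("chr1\t101\t200\t42");
        System.out.println(r.refId + " spans " + r.span()
                           + " bases, count " + r.count);
    }
}
```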
Note that the output files are BGZF-compressed, but do not include the empty
terminating block which would make them valid BGZF files. This is to avoid
having to remove it from the end of each output file in distributed usage (when
not using the "-o" convenience parameter): it is much simpler to append an
empty gzip block to the end of the concatenated output.
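That terminating block is the fixed 28-byte BGZF end-of-file marker from the
SAM/BAM specification. The snippet below (independent of Hadoop-BAM) shows
that the marker is itself a complete, empty gzip member, which is why simply
appending it finishes off a concatenated file:

```java
import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;

public class BgzfEof {
    // The fixed 28-byte BGZF EOF block, as given in the SAM format spec.
    static final byte[] EOF_BLOCK = {
        0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, (byte) 0xff,
        0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00,
        0x03, 0x00, 0, 0, 0, 0, 0, 0, 0, 0
    };

    // Decompressing the marker yields zero bytes: it is an empty gzip member.
    static int decompressedLength() throws Exception {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(EOF_BLOCK))) {
            int n = 0;
            while (in.read() != -1) n++;
            return n;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(EOF_BLOCK.length + " bytes, decompresses to "
                           + decompressedLength() + " bytes");
    }
}
```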