GlusterFS Hadoop Plugin
=======================
INTRODUCTION
------------
This document describes how to use GlusterFS (http://www.gluster.org/) as a backing store with Hadoop.
This plugin replaces the default Hadoop file system (typically the Hadoop Distributed File System) with
GlusterFileSystem, which writes to a local directory backed by a FUSE mount of a GlusterFS volume.
REQUIREMENTS
------------
* Supported OS is GNU/Linux
* GlusterFS installed on all machines in the cluster
* Java Runtime Environment (JRE)
* Maven 3.x (needed if you are building the plugin from source)
* JDK 6+ (needed if you are building the plugin from source)
NOTE: The plugin relies on two *nix command line utilities to function properly. They are:
* mount: used to mount GlusterFS volumes.
* getfattr: used to fetch extended attributes of a file.
Make sure both are installed on all hosts in the cluster and their locations are in the $PATH
environment variable.
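To verify that both utilities are available on a host, you can run something like:

# type mount getfattr

(getfattr is typically provided by the attr package on most distributions.)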
INSTALLATION
------------
** NOTE: The example below is for Hadoop version 0.20.2 ($GLUSTER_HOME/hdfs/0.20.2) **
* Building the plugin from source [Maven (http://maven.apache.org/) and a JDK are required to build the plugin]
Change to the glusterfs-hadoop directory in the GlusterFS source tree and build the plugin.
# cd $GLUSTER_HOME/hdfs/0.20.2
# mvn package
On a successful build the plugin will be present in the `target` directory.
(NOTE: the version number is part of the plugin file name)
# ls target/
classes glusterfs-0.20.2-0.1.jar maven-archiver surefire-reports test-classes
^^^^^^^^^^^^^^^^^^
Copy the plugin to the lib/ directory in your $HADOOP_HOME dir.
# cp target/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib
Copy the sample configuration file that ships with this source (conf/core-site.xml) to the conf
directory in your $HADOOP_HOME dir.
# cp conf/core-site.xml $HADOOP_HOME/conf
* Installing the plugin from RPM
See the plugin documentation for installing from RPM.
CLUSTER INSTALLATION
--------------------
Doing the above step(s) on every host in the cluster can be tedious; instead, use the build-and-deploy.py script
to build the plugin in one place and deploy it (along with the configuration file) to all other hosts.
The script should be run on the Hadoop master [Job Tracker] host.
* STEPS (you would have done steps 1 and 2 anyway while deploying Hadoop)
1. Edit conf/slaves file in your hadoop distribution; one line for each slave.
2. Set up password-less SSH between the Hadoop master and the slave(s).
3. Edit conf/core-site.xml with all glusterfs related configurations (see CONFIGURATION)
4. Run the following
# cd $GLUSTER_HOME/hdfs/0.20.2/tools
# python ./build-and-deploy.py -b -d /path/to/hadoop/home -c
This will build the plugin and copy it (and the config file) to all slaves listed in $HADOOP_HOME/conf/slaves.
Script options (a full example invocation follows this list):
-b : build the plugin
-d : location of hadoop directory
-c : deploy core-site.xml
-m : deploy mapred-site.xml
-h : deploy hadoop-env.sh
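For example, to build the plugin and deploy core-site.xml, mapred-site.xml, and hadoop-env.sh in one go
(the Hadoop path below is illustrative; use your own):

# python ./build-and-deploy.py -b -d /usr/lib/hadoop -c -m -h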
CONFIGURATION
-------------
All plugin configuration is done in a single XML file (core-site.xml) with <name><value> tags in each <property>
block.
A brief explanation of the tunables and the values they accept (change them wherever needed) follows; a sample
file appears after the list.
name: fs.glusterfs.impl
value: org.apache.hadoop.fs.glusterfs.GlusterFileSystem
The default FileSystem API to use (there is little reason to modify this).
name: fs.default.name
value: glusterfs://server:port
The default name that Hadoop uses to represent files as URIs (typically a server:port tuple). Use any host
in the cluster as the server and any port number. This option must be in server:port format for Hadoop
to create file URIs, but it is not otherwise used by the plugin.
name: fs.glusterfs.volname
value: volume-dist-rep
The volume to mount.
name: fs.glusterfs.mount
value: /mnt/glusterfs
This is the directory that the plugin will use to mount (FUSE mount) the volume.
name: fs.glusterfs.server
value: 192.168.1.36, hackme.zugzug.org
To mount a volume the plugin needs to know the hostname or the IP of a GlusterFS server in the cluster.
Specify it here.
name: quick.slave.io
value: [On/Off], [Yes/No], [1/0]
NOTE: This option is currently untested.
This is a performance tunable. Hadoop schedules jobs to hosts that hold the part of the file data the job
needs. The job then does I/O on the file (via FUSE in the case of GlusterFS). When this option is set, the
plugin will try to do I/O directly against the backing filesystem (ext3, ext4, etc.) on which the file
resides, which improves read performance and makes jobs run faster.
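Putting it together, a minimal core-site.xml might look like the following (the server address, port, volume
name, and mount point are placeholders taken from the descriptions above; substitute values for your cluster):

  <configuration>
    <property>
      <name>fs.glusterfs.impl</name>
      <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>glusterfs://192.168.1.36:9000</value>
    </property>
    <property>
      <name>fs.glusterfs.volname</name>
      <value>volume-dist-rep</value>
    </property>
    <property>
      <name>fs.glusterfs.mount</name>
      <value>/mnt/glusterfs</value>
    </property>
    <property>
      <name>fs.glusterfs.server</name>
      <value>192.168.1.36</value>
    </property>
    <property>
      <name>quick.slave.io</name>
      <value>Off</value>
    </property>
  </configuration>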
USAGE
-----
Once configured, start the Hadoop Map/Reduce daemons:
# cd $HADOOP_HOME
# ./bin/start-mapred.sh
Once the Map/Reduce job/task trackers are up, all I/O will be done through GlusterFS.
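A quick smoke test (file and directory names below are illustrative) verifies that I/O goes through the plugin:

# ./bin/hadoop fs -mkdir /tmp/gtest
# ./bin/hadoop fs -put README /tmp/gtest/
# ./bin/hadoop fs -ls /tmp/gtest

The same files should then be visible under the FUSE mount point (e.g. /mnt/glusterfs).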
FOR HACKERS
-----------
* Source Layout (./src/)
org.apache.hadoop.fs.glusterfs/GlusterFSBrickClass.java
org.apache.hadoop.fs.glusterfs/GlusterFSXattr.java <--- Fetch/Parse Extended Attributes of a file
org.apache.hadoop.fs.glusterfs/GlusterFUSEInputStream.java <--- Input Stream (instantiated during open() calls; quick read from backed FS)
org.apache.hadoop.fs.glusterfs/GlusterFSBrickRepl.java
org.apache.hadoop.fs.glusterfs/GlusterFUSEOutputStream.java <--- Output Stream (instantiated during create() calls)
org.apache.hadoop.fs.glusterfs/GlusterFileSystem.java <--- Entry Point for the plugin (extends Hadoop FileSystem class)
org.gluster.test.AppTest.java <--- Your test cases go here (if any :-))
./tools/build-deploy-jar.py <--- Build and Deployment Script
./conf/core-site.xml <--- Sample configuration file
./pom.xml <--- Maven build file
./COPYING <--- License
./README <--- This file
JENKINS
-------
At the moment, Jenkins needs to run as root because of the mount command issued by the GlusterFileSystem.
This can be arranged by modifying the Jenkins init.d script (or its sysconfig file), as shown below.
#Method 1) Modify JENKINS_USER in /etc/sysconfig/jenkins
JENKINS_USER=root
#Method 2) Directly modify /etc/init.d/jenkins
#daemon --user "$JENKINS_USER" --pidfile "$JENKINS_PID_FILE" $JAVA_CMD $PARAMS > /dev/null
echo "WARNING: RUNNING AS ROOT"
daemon --user root --pidfile "$JENKINS_PID_FILE" $JAVA_CMD $PARAMS > /dev/null
BUILDING
--------
Building requires a working gluster mount for the unit tests.
The unit tests read test resources from glusterconfig.properties, a file which should be present.
1) Edit your .bashrc, or run at your terminal:
export GLUSTER_VOLUME=MyVolume <-- replace with your preferred volume name (default is HadoopVol)
export GLUSTER_HOST=192.0.1.2 <-- replace with your host (default will be determined at runtime in the JVM)
2) run:
mvn package
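If you only need the jar and do not have a live gluster mount available, Maven's standard flag for skipping
unit tests can be used instead:

mvn package -DskipTests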