Hadoop Cluster on AWS VPC with Apache Whirr
Setting up a Hadoop cluster on cloud providers has been made relatively easy by tools such as Apache Whirr, Cloudera Manager, and jclouds. Whirr uses jclouds internally. But what if one wanted to create a cluster that's not in the open public cloud? What if one wanted to create a cluster in an AWS VPC or on one's own internal machines? Whirr has a Bring-Your-Own-Nodes (BYON) feature built on top of the jclouds BYON feature. Let's see how this is done.
To build a Hadoop cluster (or any other Whirr-supported cluster) inside an AWS VPC (Virtual Private Cloud) or an intranet, the prerequisites are:
- All machines/nodes must be up and running.
- Each node must have a known private or public IP address.
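Before pointing Whirr at the nodes, it can help to confirm that each one actually accepts connections. Here is a minimal sketch in Python, assuming the nodes listen for SSH on port 22; the function name and the timeout value are my own choices for illustration, not part of Whirr or jclouds.

```python
import socket


def ssh_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `ssh_reachable("10.0.0.12")` for each node address (the address here is a placeholder) only tells you that the port is open; it does not verify the key or the user.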
Here is what goes into the Whirr config file:
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.zookeeper.install-function=install_cdh_zookeeper
whirr.zookeeper.configure-function=configure_cdh_zookeeper
whirr.hbase.install-function=install_cdh_hbase
whirr.hbase.configure-function=configure_cdh_hbase
whirr.private-key-file=/opt/whirr/.ssh/id_rsa
whirr.public-key-file=/opt/whirr/.ssh/id_rsa.pub
whirr.java.install-function=install_oab_java
whirr.provider=byon
whirr.service-name=byon
whirr.bootstrap-user=whirr
jclouds.byon.endpoint=file:///opt/whirr/byon.yaml
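The last four properties (whirr.provider, whirr.service-name, whirr.bootstrap-user and jclouds.byon.endpoint) are the important ones. We change the provider from the usual AWS provider to byon, and we specify the location of a YAML file. This file holds details about all the nodes in the cluster-to-be.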
nodes:
    - id: node1
      hostname: xxx.xxx.xxx.xxx
      os_arch: x86_64
      os_family: centOS
      os_description: centOS
      os_version: 6.3
      username: root
      credential_url: file:///opt/whirr/.ssh/id_rsa
    - id: node2
      hostname: xxx.xxx.xxx.xxx
      os_arch: x86_64
      os_family: centOS
      os_description: centOS
      os_version: 6.3
      username: root
      credential_url: file:///opt/whirr/.ssh/id_rsa
In the YAML file above, hostname needs to be an IP address and not a real hostname. This is due to a bug in jclouds as of this writing. Replace xxx.xxx.xxx.xxx with each node's AWS VPC private IP address. One can also supply an RSA key directly by using the credential field instead of credential_url.
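Since a real hostname in the hostname field trips up jclouds, a quick pre-flight check can flag such entries. The sketch below is only an illustration: it assumes the node entries have already been loaded into a list of dictionaries (for example with a YAML parser of your choice), and the function name and sample addresses are made up.

```python
import ipaddress


def non_ip_hostnames(nodes):
    """Return (id, hostname) pairs whose hostname is not a literal IP address."""
    bad = []
    for node in nodes:
        host = str(node.get("hostname", ""))
        try:
            ipaddress.ip_address(host)  # raises ValueError for anything but an IP
        except ValueError:
            bad.append((node.get("id"), host))
    return bad


# Hypothetical node entries mirroring the fields used in byon.yaml
nodes = [
    {"id": "node1", "hostname": "10.0.0.12"},
    {"id": "node2", "hostname": "ip-10-0-0-13.ec2.internal"},
]
print(non_ip_hostnames(nodes))
```

An empty result means every node is listed by IP address; the unreplaced placeholder xxx.xxx.xxx.xxx would be reported as well.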
An example YAML file is available on the jclouds GitHub page.
Here are some errors that can result from problems in the config and/or YAML file:
error: java.lang.IllegalArgumentException: URI is not absolute
This is related to the credential file URL (credential_url). It must be an absolute URL that starts with file://.
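One way to catch this before launching is to check the credential_url value yourself. This is a rough sketch using Python's standard urllib; the helper name is hypothetical, and the check is deliberately stricter than jclouds needs to be (it only accepts file URLs with an absolute path).

```python
from urllib.parse import urlparse


def is_absolute_file_url(url):
    """Return True for URLs such as file:///opt/whirr/.ssh/id_rsa."""
    parsed = urlparse(url)
    # A bare path like /opt/whirr/.ssh/id_rsa has an empty scheme and is rejected
    return parsed.scheme == "file" and parsed.path.startswith("/")
```

A plain filesystem path without the file:// prefix fails this check, which matches the situation that produces the "URI is not absolute" error.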
error: java.lang.RuntimeException: java.net.UnknownHostException: byon.yaml
jclouds is looking for the default YAML file. This error occurs if the YAML file location (jclouds.byon.endpoint) is missing from the config file or if the file does not exist.
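Both causes can be checked up front. The following is a minimal sketch, assuming the config is a simple key=value properties file like the one above; the two helper names are my own, and the parser ignores properties-file features such as line continuations.

```python
import os
from urllib.parse import urlparse


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def endpoint_problem(props):
    """Return a description of what is wrong with the BYON endpoint, or None."""
    endpoint = props.get("jclouds.byon.endpoint")
    if not endpoint:
        return "jclouds.byon.endpoint is not set"
    parsed = urlparse(endpoint)
    if parsed.scheme != "file":
        return "endpoint is not a file:// URL: " + endpoint
    if not os.path.isfile(parsed.path):
        return "no such file: " + parsed.path
    return None
```

Reading the config file into a string, passing it through `read_properties`, and then calling `endpoint_problem` on the result returns None when the endpoint is set and the YAML file is present.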
If everything goes well, running launch-cluster results in a running cluster. Some things to keep in mind:
- I have used a whirr user on my machines. This user must exist on all the machines and must have sudo access.
- The launch-cluster command must be run as a non-root user. Invoking it as root results in errors.
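If you wrap the launch in a script, the non-root requirement can be enforced before Whirr is even started. This is a small sketch, assuming the whirr executable is on the PATH and that the config file is called hadoop.properties (a hypothetical name); the function only builds the command and leaves running it to the caller.

```python
import os


def launch_command(config_path, euid=None):
    """Build the whirr launch-cluster command, refusing to do so for root."""
    if euid is None:
        euid = os.geteuid()  # effective user id of the current process (Unix only)
    if euid == 0:
        raise PermissionError("run launch-cluster as a non-root user")
    return ["whirr", "launch-cluster", "--config", config_path]
```

The returned list can be handed to `subprocess.run(launch_command("hadoop.properties"), check=True)` from the whirr user's session.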