Hadoop Cluster on AWS VPC with Apache Whirr

Setting up a Hadoop cluster on cloud providers has been made relatively easy by tools such as Apache Whirr, Cloudera Manager, and jclouds (Whirr uses jclouds internally). But what if you want to create a cluster that is not in the open public cloud, for example in an AWS VPC or on your own internal machines? Whirr has a Build-Your-Own-Node (BYON) feature, built on top of the jclouds BYON feature, for exactly this case. Let's see how it is done.

To build a Hadoop cluster (or any other Whirr-supported cluster) inside an AWS VPC (Virtual Private Cloud) or an intranet, the prerequisites are:

  • All machines/nodes must be up and running.
  • All nodes must have known private or public IP addresses.
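
One rough way to check the first prerequisite is to see whether each node accepts a TCP connection on its SSH port. The sketch below assumes SSH listens on port 22; the function name `ssh_reachable` is only illustrative and is not part of Whirr or jclouds.

```python
import socket


def ssh_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `ssh_reachable("xxx.xxx.xxx.xxx")` for each node's IP address before launching gives an early hint that a machine is down or blocked by a security group. It only proves the port is open, not that login works.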

Here is what goes into the Whirr config file:

whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.zookeeper.install-function=install_cdh_zookeeper
whirr.zookeeper.configure-function=configure_cdh_zookeeper
whirr.hbase.install-function=install_cdh_hbase
whirr.hbase.configure-function=configure_cdh_hbase
whirr.private-key-file=/opt/whirr/.ssh/id_rsa
whirr.public-key-file=/opt/whirr/.ssh/id_rsa.pub
whirr.java.install-function=install_oab_java
whirr.provider=byon
whirr.service-name=byon
whirr.bootstrap-user=whirr
jclouds.byon.endpoint=file:///opt/whirr/byon.yaml

The last four properties are the important ones. We change the provider from the usual AWS to byon, and we specify the location of a yaml file. This file holds the details of every node in the cluster-to-be.

nodes:
  - id: node1
    hostname: xxx.xxx.xxx.xxx
    os_arch: x86_64
    os_family: centOS
    os_description: centOS
    os_version: 6.3
    username: root
    credential_url: file:///opt/whirr/.ssh/id_rsa
  - id: node2
    hostname: xxx.xxx.xxx.xxx
    os_arch: x86_64
    os_family: centOS
    os_description: centOS
    os_version: 6.3
    username: root
    credential_url: file:///opt/whirr/.ssh/id_rsa

In the yaml file above, hostname needs to be an IP address and not a real hostname. This is due to a bug in jclouds at the time of writing. Replace xxx.xxx.xxx.xxx with your AWS VPC private IP address. You can also supply an RSA key directly by putting it in the credential field instead of using the credential_url field.
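
With more than a couple of nodes, writing this file by hand gets tedious. The following is a minimal sketch that renders the same fields from a list of IP addresses. The function name `render_byon_yaml`, the example addresses, and the default OS values (copied from the listing above) are assumptions for illustration. It rejects anything that is not a literal IP address, since hostnames do not work here.

```python
import ipaddress


def render_byon_yaml(ips, key_path="/opt/whirr/.ssh/id_rsa", username="root"):
    """Render a BYON node list with the same fields as the listing above."""
    if not key_path.startswith("/"):
        raise ValueError("key_path must be an absolute path")
    lines = ["nodes:"]
    for index, ip in enumerate(ips, start=1):
        # Raises ValueError for anything that is not a literal IP address.
        ipaddress.ip_address(ip)
        lines.extend([
            f"  - id: node{index}",
            f"    hostname: {ip}",
            "    os_arch: x86_64",
            "    os_family: centOS",
            "    os_description: centOS",
            "    os_version: 6.3",
            f"    username: {username}",
            # An absolute path after file:// yields the file:/// form.
            f"    credential_url: file://{key_path}",
        ])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder private addresses; replace with your own nodes.
    print(render_byon_yaml(["10.0.0.11", "10.0.0.12"]), end="")
```

Redirect the output to the path named in jclouds.byon.endpoint, for example /opt/whirr/byon.yaml.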

Here is an example yaml file from the jclouds GitHub page.

Here are some errors caused by problems in the config and/or yaml file:

 error: java.lang.IllegalArgumentException: URI is not absolute

This error relates to the credential file URL (credential_url). It needs to start with file://.

 error: java.lang.RuntimeException: java.net.UnknownHostException: byon.yaml

jclouds is looking for the default yaml file. This error occurs if the yaml file location is missing from the config file or the file does not exist.
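
Both errors can be caught before launching. The sketch below is a rough pre-flight check under a few assumptions: the config is a plain key=value properties file, the endpoint is a local file:// URL, and the yaml is scanned line by line rather than parsed properly. The function names are hypothetical and not part of Whirr or jclouds.

```python
import os
from urllib.parse import urlparse


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_byon_config(props):
    """Return a list of human-readable problems; empty means none found."""
    endpoint = props.get("jclouds.byon.endpoint")
    if not endpoint:
        return ["jclouds.byon.endpoint is not set"]
    parsed = urlparse(endpoint)
    if parsed.scheme != "file":
        return [f"endpoint {endpoint!r} is not a file:// URL (not checked here)"]
    if not os.path.isfile(parsed.path):
        return [f"yaml file {parsed.path!r} does not exist"]
    problems = []
    with open(parsed.path) as handle:
        for number, line in enumerate(handle, start=1):
            stripped = line.strip()
            # Naive scan: looks only at lines that begin with credential_url.
            if stripped.startswith("credential_url:"):
                url = stripped.split(":", 1)[1].strip()
                if not url.startswith("file://"):
                    problems.append(
                        f"line {number}: credential_url must start with file:// (got {url!r})"
                    )
    return problems
```

Read the config file, pass its text to `parse_properties`, and print whatever `check_byon_config` returns. An empty list means neither of the two problems above was detected.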

If everything goes well, running launch-cluster results in a running cluster. Some things to keep in mind:

  • I used a whirr user on my machines. This user must exist on all the machines and must have sudo access.
  • The launch-cluster command must be run as a non-root user. Invoking it as root results in errors.
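
The second point can be enforced with a small wrapper. This is a sketch that assumes the `whirr` script is on the PATH and that the config shown earlier is saved under a name of your choosing (`hadoop.properties` here is a placeholder). The helper names are illustrative.

```python
import os
import subprocess


def build_launch_command(config_path, whirr_bin="whirr"):
    """Build the argument list for `whirr launch-cluster --config <file>`."""
    return [whirr_bin, "launch-cluster", "--config", config_path]


def launch_cluster(config_path, euid=None):
    """Run launch-cluster, refusing to proceed when invoked as root."""
    # os.geteuid() is available on Unix-like systems only.
    euid = os.geteuid() if euid is None else euid
    if euid == 0:
        raise PermissionError("launch-cluster should be run as a non-root user")
    # check=True raises CalledProcessError if whirr exits with a non-zero status.
    return subprocess.run(build_launch_command(config_path), check=True)
```

Calling `launch_cluster("hadoop.properties")` as the whirr user then behaves like typing the command by hand, while an accidental run as root fails early with a clear message.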