Hadoop Cluster on AWS VPC with Apache Whirr
Setting up a Hadoop cluster on cloud providers has been made relatively easy by tools such as Apache Whirr, Cloudera Manager, and jclouds. Whirr uses jclouds internally. But what if you want to create a cluster that is not in the open public cloud? What if you want to create a cluster in an AWS VPC or on your own internal machines? Whirr has a Bring-Your-Own-Nodes (BYON) feature built on top of the jclouds BYON feature. Let's see how this is done.
To build a Hadoop cluster (or any other Whirr-supported cluster) inside an AWS VPC (Virtual Private Cloud) or on an intranet, the prerequisites are:
- All machines/nodes must be up and running.
- All nodes must have known private or public IP addresses.
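The prerequisites above can be checked before Whirr is involved at all. The following is a minimal sketch, assuming the nodes accept SSH on port 22 (the port number and the helper name `is_port_open` are assumptions for illustration, not something Whirr provides):

```python
import socket


def is_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open` once for each private IP address before launching gives an early signal that a node is down or unreachable from the machine running Whirr.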
Here is what goes into the Whirr config file:
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.zookeeper.install-function=install_cdh_zookeeper
whirr.zookeeper.configure-function=configure_cdh_zookeeper
whirr.hbase.install-function=install_cdh_hbase
whirr.hbase.configure-function=configure_cdh_hbase
whirr.private-key-file=/opt/whirr/.ssh/id_rsa
whirr.public-key-file=/opt/whirr/.ssh/id_rsa.pub
whirr.java.install-function=install_oab_java
whirr.provider=byon
whirr.service-name=byon
whirr.bootstrap-user=whirr
jclouds.byon.endpoint=file:///opt/whirr/byon.yaml
The last four properties are the important ones. We change the provider from the usual AWS provider to byon, and we specify the location of a YAML file. This file holds details about all the nodes that will make up the cluster.
nodes:
    - id: node1
      hostname: xxx.xxx.xxx.xxx
      os_arch: x86_64
      os_family: centOS
      os_description: centOS
      os_version: 6.3
      username: root
      credential_url: file:///opt/whirr/.ssh/id_rsa
    - id: node2
      hostname: xxx.xxx.xxx.xxx
      os_arch: x86_64
      os_family: centOS
      os_description: centOS
      os_version: 6.3
      username: root
      credential_url: file:///opt/whirr/.ssh/id_rsa
In the YAML file above, hostname needs to be an IP address and not a real hostname. This is due to a bug in jclouds at the time of writing. Replace xxx.xxx.xxx.xxx with your AWS VPC private IP address. You can also supply an RSA key directly by putting it in the credential field instead of credential_url.
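With more than a couple of nodes, writing these entries by hand gets tedious. Here is a minimal sketch that renders the same YAML layout from a list of IP addresses and rejects anything that is not a literal IP. The function name `render_byon_yaml` and its defaults are assumptions for illustration; the field names and values are copied from the file above.

```python
import ipaddress


def render_byon_yaml(ips, key_url="file:///opt/whirr/.ssh/id_rsa", username="root"):
    """Return byon.yaml text with one node entry per IP address."""
    lines = ["nodes:"]
    for index, ip in enumerate(ips, start=1):
        # ip_address() raises ValueError for anything that is not a literal
        # IPv4/IPv6 address, so real hostnames are rejected early.
        ipaddress.ip_address(ip)
        lines += [
            f"    - id: node{index}",
            f"      hostname: {ip}",
            "      os_arch: x86_64",
            "      os_family: centOS",
            "      os_description: centOS",
            "      os_version: 6.3",
            f"      username: {username}",
            f"      credential_url: {key_url}",
        ]
    return "\n".join(lines) + "\n"
```

For example, `render_byon_yaml(["10.0.0.11", "10.0.0.12"])` (hypothetical VPC addresses) produces a two-node file, while passing a name such as `node1.example.com` raises `ValueError` instead of producing a file that jclouds would choke on later.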
An example YAML file is available on the jclouds GitHub page.
Here are some errors caused by problems in the config and/or YAML file:
error: java.lang.IllegalArgumentException: URI is not absolute
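This is related to the credential file URL. It needs to start with file://, because a bare path such as /opt/whirr/.ssh/id_rsa has no scheme and is therefore not an absolute URI.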
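A quick way to catch this before launching is to check the scheme of the value you put in credential_url. This is a small sketch using only the Python standard library; the helper names `is_file_url` and `to_file_url` are made up for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_file_url(value):
    """True when value carries an explicit file: scheme."""
    return urlparse(value).scheme == "file"


def to_file_url(path):
    """Turn a filesystem path into a file:// URL."""
    # as_uri() requires an absolute path, so make it absolute first.
    return Path(path).absolute().as_uri()
```

A bare path fails `is_file_url`, and `to_file_url` converts it into the form the YAML file expects.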
error: java.lang.RuntimeException: java.net.UnknownHostException: byon.yaml
jclouds is looking for the default YAML file. This error occurs if the YAML file location is missing from the config file or the file does not exist.
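Both causes can be detected up front by reading the config file and checking the endpoint. The sketch below assumes a simple key=value properties file and a file:// endpoint, as in the config shown earlier; the function names are hypothetical and the parser deliberately ignores the more exotic parts of the Java properties format.

```python
import os
from urllib.parse import urlparse


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def byon_endpoint_problem(props):
    """Return a description of the problem, or None if the endpoint looks fine."""
    endpoint = props.get("jclouds.byon.endpoint")
    if endpoint is None:
        return "jclouds.byon.endpoint is not set"
    parsed = urlparse(endpoint)
    if parsed.scheme != "file":
        # Assumption: this sketch only understands file:// endpoints.
        return "endpoint is not a file:// URL: " + endpoint
    if not os.path.isfile(parsed.path):
        return "YAML file does not exist: " + parsed.path
    return None
```

Feeding it the contents of the Whirr config file and printing any non-None result is enough to catch a missing property or a mistyped path before jclouds does.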
If everything goes well, running launch-cluster results in a running cluster. Some things to keep in mind:
- I have used a whirr user on my machines. This user must exist on all the machines and must have sudo access.
- The launch-cluster command needs to be run as a non-root user. Invoking it as root results in errors.
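The non-root requirement is easy to forget, so it can help to wrap the launch in a small guard. This is a sketch only: it assumes a POSIX system, a `whirr` executable on the PATH, and the usual `launch-cluster --config` invocation; the function name `build_launch_command` is hypothetical.

```python
import os


def build_launch_command(config_path, euid=None):
    """Return the whirr launch command, refusing to build it for root."""
    if euid is None:
        euid = os.geteuid()  # POSIX only
    if euid == 0:
        raise PermissionError("run launch-cluster as a non-root user")
    return ["whirr", "launch-cluster", "--config", config_path]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the account that owns the SSH key pair named in the config file.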