Everyone wants to do big data, and Microsoft is no exception. Nobody wants to miss the chance to jump on the big data bandwagon and cash in. There are already many players in the market: AWS, Cloudera, MapR, Hortonworks, IBM, Intel and, of course, the open source Hadoop ecosystem. The stakes are high and Microsoft knows it. That is why they jumped in with Windows Azure and Windows HDInsight: IaaS and PaaS.
I had a chance to work with MS IaaS and PaaS services. I have been using Windows HDInsight for some time now and, based on my experience with AWS and Cloudera based services, I can tell that HDInsight is going nowhere. It is a half-baked solution thrown out in haste to establish a presence. The strategy seems simple: grab some lab rats and improve the offering at clients’ expense. HDInsight comes with Hadoop, Hive, Oozie, Pig and Sqoop preinstalled, and even has desktop shortcuts to the JobTracker and NameNode UIs. Now that's some neat work. The cluster stores all its data on the Windows blob store (similar to S3) by default.
Microsoft has come up with what it is best at (understatement?): a good UI. It offers few-click abstractions for complex tasks such as spinning up a Hadoop cluster. Once one is over the feel-good factor of the UI, clear problems come up.
Here is a list of the areas I found most problematic:
- Very poor documentation. For example, you would never know that they have changed the namenode so that it points to the blob store. There is a new address scheme that one needs to use to specify the namenode location.
- An HDInsight cluster is fixed in size in terms of nodes, memory, storage etc. One cannot grow or shrink it on demand.
- No monitoring tools for all the nodes in a cluster.
- No way to install 3rd party software. Say your Python reducers need numpy installed on all nodes. Not possible.
- Forces users to use GUI based tools like RDP to connect to a cluster. Now, that is a deal breaker compared to the powerful Unix shell and SSH. Poor development toolset in general. Imagine viewing a log in Notepad. Yes, Notepad.
- Do not expect a new tool, like HBase, to work out of the box. It's just not supported. One can create an HBase cluster with Azure VMs, but that would be all *nix.
- Some dev team at MS keeps changing things without notifying anyone, including their own support and sales staff.
- One morning my cluster was “re-imaged” because they considered themselves responsible for servicing the OS. That reverted all the custom changes I had made to the Hadoop config, and all the software I had installed, e.g. Mahout, was gone too.
- Another day, they decided on stronger security and encrypted all the keys. Again without any notification. (Startup mode?)
- There are only two machine types to choose from on HDInsight.
- For reasons I could not explain, a simple hadoop fs -ls / would take 30-60 seconds.
- A running cluster with a small or moderate data set would take at least 3 minutes to finish a MapReduce job, while the same job with the same configuration would finish in 45 seconds on AWS!
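If you want to check the timing complaints above on your own cluster, here is a minimal Python sketch of how one could measure wall-clock time for a command such as hadoop fs -ls /. The helper names time_command and summarize are my own and not part of any Hadoop or Azure tooling. The sketch assumes Python 3 and that the hadoop CLI is on the PATH of the machine you run it from. It only measures elapsed time and says nothing about why a call is slow.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the command's output; only the elapsed time matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min, median and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example call (assumption: the hadoop CLI is installed and on PATH):
# print(summarize(time_command(["hadoop", "fs", "-ls", "/"])))
```

Running the same helper against the same command on two different clusters gives a rough, like-for-like comparison, as long as the data set and configuration match.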
And the list could go on. One could argue that they are still adapting to the ecosystem, but they are doing it at clients’ expense. The only use case I can see HDInsight fitting is running the Hadoop wordcount example.