Hadoop has become de-facto standard for big data and batch processing. Think of data pipeline and end up with Hadoop. The Hadoop eco system is changing and changing at a rapid pace. Hadoop serves very well with Hive for batch oriented tasks. But as the eco system move towards maturity, users have started demanding traditional BI features. Hive queries are slow, and that is by design. Hive’s slowness is not a by product, it's there because it relies on map reduce. There is a huge demand for querying clustered data fast, in seconds rather than in minutes or hours. That is where Impala comes into picture. Imapla is based on google’s Dremel paper. It is a a real time querying system keeping data in memory. It does not rely on map reduce to run queries, rather can access data directly from storage thus bypassing map reduce overhead. Also, Impala supports HIVE SQL.
Impala’s stated goal is to be faster than Hive, which it may very well be depending on the configuration and the specific query in question. I was more interested in finding out how does it compare to mysql with basic setup. As the title of the post suggest, it's like comparing apples and oranges because of how these systems are designed, configured and clustered. Mysql has been around for years and works very well with traditional relational data sets while keeping focus on disk storage and providing configurable memory usage. Impala on the other hand keeps everything in memory. For the purpose of this test, i will use below basic configurations:
Use AWS large instance Create one MySql instance Create Impala Cluster with one Worker node and one Master (which holds hadoop Name node, Job Tracker) Create indexes on MySql tables Tweak my.cnf to improve performance Do not do any optimisations on Impala, enable short circuit reads only. Load 1.2 million records in tables in both mysql and impala. (Why 1.2?, i had a dataset with that many) Fire a a moderately ugly and complex query at both to measure query times.
One would say 1.2 million is not too big a data set and thats why we are dealing with single machines and not huge clusters. The idea is to see if imapla provides any out of the box benefits over MySql when Mysql has some optimisation and impala has none. MySql optimisations are based on my current understanding and are mostly to do with innodb memory settings. I am sure there are people out there with much more expertise on that but for tests here, its just basic and may be good enough settings.
Here is relevant part of my.cnf that i will use:
# Query Cache Configuration query_cache_limit = 64M query_cache_size = 128M innodb_buffer_pool_size=4096M innodb_flush_method=O_DIRECT
The important thing to note is innodb buffer size which is bumped up to 4GB and flush method. I did some of the reading from this article.
With all above considerations in mind, let's move to installations. While Mysql installation is trivial, impala installation is not.
A very good installation guide is provided by Cloudera. It uses apache whirr to create a impala cluster. Despite a very well written blog, i had some problems:
- ganglia install does not work as fedora rpm download url has changed. - http://dl.fedoraproject.org/pub/epel/6/x86_64/ and the version is 6.8 instead of 6.5
- After installation completes, impala-shell fails to query table created in hive. Error is wrong file system. - in /etc/default/impala change arguments to imapalad, -nn and –nn_port. they point to localhost by default.
- After installation, when impala is running. firing queries generates this error in logs: - WARN hdfs.DomainSocketFactory: error creating DomainSocket java.net.ConnectException: connect(2) error: No such file or directory when trying to connect to '/var/run/hadoop-hdfs/dn.50010'
- This is a misconfig for direct reads or short circuit reads. hdfs-site.xml or core-site.xml inside impala conf is missing this:
- dfs.client.read.shortcircuit true dfs.domain.socket.path /var/run/hadoop-hdfs/dn._PORT
The Data Set
The data set we have has 5 fields: timestamp, username, useragent, id and company. Here is a MySql create table statement:
CREATE TABLE `data` ( `datetime` datetime DEFAULT NULL, `user` varchar(255) DEFAULT NULL, `useragent` varchar(255) DEFAULT NULL, `partnerId` varchar(255) DEFAULT NULL, `company` varchar(255) DEFAULT NULL, KEY `pk` (`user`,`datetime`,`useragent`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and its equivalent hive statement:
CREATE TABLE data(ts as timestamp, username String, useragentString, id as String, company as String) STORED AS TEXTFILE;
After loading data in both the tables, i ran some queries to check query performances. The query i ran was based on the understanding that the data set has a certain amount of noise. We want to eliminate that noise. The data set has rows with the same username with a timestamp spaced at exactly one hour. The goal of the query is to identify such rows and eliminate all but one row for sets of rows with that behaviour.
After several tries i came up with this MySql query:
select * from data a where a.datetime in ( select a.datetime from data b where (a.user=b.user and a.useragent = b.useragent and a.datetime!=b.datetime and date(b.datetime) = date(a.datetime) and TIMESTAMPDIFF(MINUTE,b.datetime,a.datetime)&gt;60) or(a.datetime) order by b.datetime desc ) and a.datetime not in ( select a.datetime from data c where a.user=c.user and a.useragent = c.useragent and a.datetime ) order by a.user desc, a.datetime desc, a.useragent desc;
Again, may be there are better and more efficient ways to write this query. I could come up with this. Here is the explain plan of the query:
id select_type table type possible_keys key key_len ref rows Extra 1 PRIMARY a ALL NULL NULL NULL NULL 1206873 Using where; Using filesort 3 DEPENDENT SUBQUERY c ref user,datetime,useragent,user_2 user_2 768 alertme.a.user 28 Using where; Using index 2 DEPENDENT SUBQUERY b index user,useragent,user_2 user_2 1545 NULL 1206873 Using where; Using index
Now filesort in the outermost query is bad, may be there is a way to improve that. Running this query against a Mysql box with 4GB of buffer space took somewhere in the range of 5 to 7 minutes.
Translating above query to a impala query proved to be difficult that i had hoped for. Here is a translation that i came up with:
select * from (select e.ts, e.username, e.useragent, e.id, e.company, f.ts as ts2 from (select DISTINCT a.* from data_clean_ts a LEFT JOIN (select * from data_clean_ts order by username DESC, ts DESC LIMIT 1300000) b on a.username=b.username and a.useragent = b.useragent and a.ts!=b.ts and to_date(b.ts) = to_date(a.ts) and ( (hour(a.ts)-hour(b.ts))*60 + (minute(a.ts)-minute(b.ts))*60)&gt;60 order by a.username DESC, a.ts DESC LIMIT 1300000) e LEFT JOIN (select DISTINCT c.* from data_clean_ts c LEFT JOIN (select * from data_clean_ts order by username DESC, ts DESC LIMIT 1300000) d on c.username=d.username and c.useragent = d.useragent and c.ts order by c.username DESC, c.ts DESC LIMIT 1300000) f on e.username=f.username and e.useragent = f.useragent and e.ts=f.ts order by e.username DESC, e.ts DESC LIMIT 1300000) g where ts2 is null LIMIT 25;
I had to use a LIMIT clause as impala requires a limit clause with order by. This query on impala finishes in around 60 to 80 seconds. That is about *~ 4 to 5 times *faster.
It seems that impala surely provides performance benefits even when compared to MySql. It would therefore surely prove to be faster than hive, much faster in fact. We can not rely on one test and say for sure that impala is faster than mysql but this definitely gives an indication. There would be some queries for sure when MySql would beat Impala with proper optimisations. This test was out of the box for impala. And in that context, it is faster.
I would explore more avenues for tests and update this post at a later stage with results.