Comparison Matrix: Real-time data processing systems

Several tools and frameworks are available that help process data as it arrives. In the past I did a comparative study of the following four systems:

  • Apache Kafka
  • Facebook Scribe
  • Cloudera Flume
  • Apache Chukwa

|  | Kafka | Scribe | Flume | Chukwa |
|---|---|---|---|---|
| Current Version | 0.6 [1] | 2.2? | 0.9.4 [1], [2] | 0.4 [1] |
| Site & Docs | Average | Very Poor | Good | Poor |
| Topology | P2P | Master/Slave [3] | Master/Slave [3], [4] | P2P |
| Central Node Management | No | No | Yes | No |
| Configurable Level of Reliability | No | No | Yes | No |
| Installation | Easy | Many Dependencies | Fairly Easy | Fairly Easy |
| Zookeeper Integration | Yes | No | Yes | No |
| Configuration | Manual. | Manual. | Centralised, dynamic configuration. | Manual. Needs Agents, Collectors and HICC configurations, Tomcat and Mysql database for web UI. |
| Hadoop Integration | Possible to store data in HDFS | Possible to store data in HDFS | Possible to store data in HDFS | High. Needs a Hadoop cluster to operate! |
| Centralised Liveness Monitoring | No | No | Yes | No |
| Language Support | Java | Many | Java, Shell Scripts? | Java, Shell Scripts |
| Output Bucketing | Yes, custom bucketing | Yes, custom bucketing | Yes, custom bucketing with default time and IP based bucketing | Yes, but it seems manual; nothing inherent in the framework. |
| In-Flight Transformations | Yes | Yes | Yes | Yes (write map-reduces on collected sink files; even the documentation is not too optimistic about this feature) |
| Data Flow | Pull (consumers pull from producers) | Push (producers push to consumers) | Push (producers push to consumers) | Push (producers push to consumers) |

Two rows give values for only two of the four systems, without indicating which:

Transactional Guarantees: High, Adjustable$
Data Storage^: Disk, Disk

[1] = Under Apache Incubator
[2] = First Apache release would be 0.9.5
[3] = A cluster of master nodes, no single point of failure
[4] = Multiple masters is an experimental feature
^ = Nodes in the framework maintain a log for failure recovery
$ = System can be tweaked to maintain order and delivery guarantees

P2P = No single point of failure
Master/Slave = Single point of failure
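
The Output Bucketing row mentions time and IP based bucketing. As a rough illustration of what such bucketing could look like, here is a minimal sketch. The function names (`bucket_path`, `bucket_events`), the directory layout, and the event tuple shape are all assumptions made for this example; they are not taken from any of the four systems.

```python
from datetime import datetime, timezone


def bucket_path(root, source_ip, timestamp):
    """Build a hypothetical output directory from the event hour and source IP."""
    hour = timestamp.strftime("%Y-%m-%d/%H")
    return f"{root}/{hour}/{source_ip}"


def bucket_events(root, events):
    """Group (source_ip, timestamp, line) tuples by their bucket path."""
    buckets = {}
    for source_ip, timestamp, line in events:
        path = bucket_path(root, source_ip, timestamp)
        buckets.setdefault(path, []).append(line)
    return buckets


if __name__ == "__main__":
    ts = datetime(2011, 5, 1, 13, 30, tzinfo=timezone.utc)
    events = [
        ("10.0.0.1", ts, "login ok"),
        ("10.0.0.2", ts, "login failed"),
        ("10.0.0.1", ts, "logout"),
    ]
    for path, lines in bucket_events("/logs", events).items():
        print(path, lines)
```

Custom bucketing, in this picture, would amount to swapping in a different `bucket_path` function.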

Each of the systems above has at least two components. At a high level, each one has message producers and message consumers, both of which are clusters of machines. Master/Slave setups add a further layer of machines, the controllers.
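
The Data Flow row separates pull from push delivery between producers and consumers. The difference can be sketched in a few lines. The class names (`PullBroker`, `PushCollector`) and their methods are hypothetical and exist only for this illustration; they do not correspond to the API of Kafka, Scribe, Flume, or Chukwa.

```python
class PullBroker:
    """Pull model: messages are stored, and consumers ask for them."""

    def __init__(self):
        self.log = []  # append-only list of published messages

    def publish(self, message):
        self.log.append(message)

    def fetch(self, offset, max_messages):
        # The consumer chooses when to read and how many messages to take.
        return self.log[offset:offset + max_messages]


class PushCollector:
    """Push model: each message is handed to the consumers as it is produced."""

    def __init__(self):
        self.sinks = []

    def subscribe(self, sink):
        self.sinks.append(sink)

    def publish(self, message):
        # Delivery happens at the producer's pace, not the consumer's.
        for sink in self.sinks:
            sink(message)


def demo():
    # Pull: the consumer tracks its own offset and reads in batches of two.
    broker = PullBroker()
    for i in range(5):
        broker.publish(f"event-{i}")
    pulled, offset = [], 0
    while True:
        batch = broker.fetch(offset, 2)
        if not batch:
            break
        pulled.extend(batch)
        offset += len(batch)

    # Push: the consumer registers a callback and receives every message.
    pushed = []
    collector = PushCollector()
    collector.subscribe(pushed.append)
    for i in range(5):
        collector.publish(f"event-{i}")
    return pulled, pushed


if __name__ == "__main__":
    print(demo())
```

In the pull variant the consumer keeps its own position and can slow down or re-read, while in the push variant the producer side drives delivery and the consumer must keep up.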