Spark vs. Hadoop MapReduce – who wins?
The next generation of Big Data tools is ready to launch. New frameworks like Apache Spark are entering the market and achieving great success at large companies such as IBM, Intel, and Spotify. Spark can keep queries and data directly in memory and spread them in parallel across different nodes, leading to an immense increase in performance. In the Daytona GraySort benchmark, Spark led the 100-terabyte class and set a new record: at 23 minutes, it was three times faster than the old record of 72 minutes, which was set by a Hadoop MapReduce cluster. On top of that, Spark did it with one-tenth of the computing capacity. So is Spark the new Hadoop?
Development and properties of Spark
Apache Spark was developed by around 250 programmers at the AMPLab of the University of California, Berkeley. The idea was to create a user-friendly interface and to resolve the latency problems that occur in batch processing, as with MapReduce. This accelerates analysis and simplifies read and write access. Apache Spark is open source software and can be downloaded for free. It is currently supported by the company Databricks. Spark's main workplace is the Hadoop Distributed File System (HDFS); however, Cassandra, HBase, and Amazon S3 can also be integrated with Spark. Besides Hadoop's own cluster, Spark can run on YARN and Mesos, and it offers SQL access to Hive. Spark's programming interface builds on Scala, Java, and Python, with Scala as the central language.
Spark's strengths and weaknesses compared to MapReduce
Spark's high-level tool package includes Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. These tools can all be combined in a single application. Spark optimizes query response times through efficient process management, special indexes, and caching mechanisms. The developers say that Spark requires up to 80% less code and reaches up to 100 times the speed of MapReduce. If the RAM of the nodes is large enough, Spark keeps the data completely in memory, and this in-memory technology improves performance significantly. However, if the amount of data is too large to be stored entirely in memory, Spark's algorithms manage the data themselves and move it between RAM and normal storage. Under these circumstances Spark cannot use its cache, which means it will very likely be much slower than the batch processing of Hadoop MapReduce.
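The effect of in-memory caching described above can be sketched in plain Python, without Spark. This is a conceptual illustration only, not the Spark API: the class and method names below are invented for the sketch, and `cache()` merely mimics the idea of Spark keeping a dataset in RAM so repeated queries avoid recomputation (or rereading from disk).

```python
# Conceptual sketch (plain Python, NOT the Spark API): a lazily
# evaluated dataset that can optionally keep its result in memory,
# loosely mimicking the effect of caching in an engine like Spark.

class LazyDataset:
    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._keep_in_memory = False
        self._result = None
        self.compute_count = 0       # how often the computation actually ran

    def cache(self):
        # Mark this dataset to be kept in memory after first computation.
        self._keep_in_memory = True
        return self

    def collect(self):
        # Return the data, recomputing only if no cached copy exists.
        if self._result is not None:
            return self._result
        self.compute_count += 1
        data = self._compute()
        if self._keep_in_memory:
            self._result = data
        return data

# Without caching: every query recomputes, like rereading from disk.
uncached = LazyDataset(lambda: [x * 2 for x in range(5)])
uncached.collect()
uncached.collect()
print(uncached.compute_count)  # 2

# With caching: the first query computes, later queries hit memory.
cached = LazyDataset(lambda: [x * 2 for x in range(5)]).cache()
cached.collect()
cached.collect()
print(cached.compute_count)  # 1
```

The second dataset answers its second query without recomputing anything, which is the core reason repeated or iterative workloads run so much faster on an in-memory engine than on disk-based batch processing.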
And the winner is …
It quickly becomes clear that Spark will become very important in the field of Big Data analysis and will reach levels that are impossible for Hadoop alone to achieve. Especially for applications requiring low-latency queries, iterative processing, or real-time processing, Spark will be the first choice. This means that some current and future applications will use Spark. But Spark will not completely replace Hadoop. Which technology to use depends on several factors, such as current industry trends and, above all, the customer-specific use case. There is therefore no general answer to whether disk-based or RAM-based computing is the right choice.
But one thing is certain: there is no way around Big Data!
Big Data Greetings