Sven Löffler
17 March 2015

Spark vs. Hadoop Map Reduce – Who wins?

The next generation of Big Data tools is already prepared to launch. New tools like Apache Spark are penetrating the market and achieving great success at large companies like IBM, Intel and Spotify. Spark can hold queries and data directly in memory and distribute them in parallel across different nodes, leading to an immense increase in performance. In the Daytona Gray Sort Benchmark, Spark was the leader in the 100-terabyte class and set a new record: at 23 minutes it was three times faster than the old record of 72 minutes, which had been set by a Hadoop MapReduce cluster. In addition, Spark did it with one-tenth of the computing capacity. So is Spark the new Hadoop?

Development and properties of Spark

Apache Spark was developed by 250 programmers in the AMP Lab of the University of California, Berkeley. The idea was to create a user-friendly interface and to resolve the latency problems that occur during batch processing with MapReduce. This accelerates analysis and simplifies read and write access. Apache Spark is open-source software and can be downloaded for free. At the moment it is still supported by the company Databricks. Spark's main place of work is the Hadoop Distributed File System (HDFS), but Cassandra, HBase and Amazon S3 can also be integrated with Spark. Besides Hadoop, Spark also runs on YARN and Mesos and offers SQL access to Hive. Spark's programming interfaces build on Scala, Java and Python, with Scala as the central language.

Spark's strengths and weaknesses compared to MapReduce

Spark's high-level tool package includes Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. These tools can all be combined in a single application. Spark optimizes query response times through efficient process management, special indexes and caching mechanisms. The developers say that Spark requires up to 80% less code and reaches up to 100 times the speed of MapReduce. If the RAM of the nodes is large enough, Spark keeps the data completely in memory, and this in-memory technology improves performance significantly. However, if the amount of data is too large to be stored completely in memory, Spark's algorithms can manage the data themselves and move it between RAM and disk storage. Under these circumstances Spark cannot use its cache, which means it will very probably be much slower than the batch processing of Hadoop MapReduce.
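To make the difference between the two models concrete, here is a minimal, framework-free sketch in plain Python. It is an illustration only, not real Spark or Hadoop code: the function names and data are invented for this example, and plain lists stand in for RDDs and HDFS blocks. It shows a word count in the map/reduce style, and how keeping the intermediate result in memory (Spark's cache idea) lets further queries reuse it instead of recomputing from scratch.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs -- the 'map' step of a word count."""
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    """Sum the counts per key -- the 'reduce' step."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark is fast", "hadoop is batch", "spark caches in memory"]

# MapReduce-style: every job reads the input and recomputes from scratch.
counts = reduce_phase(map_phase(lines))

# Spark-style: the intermediate result is computed once and kept in memory
# ("cached"), so iterative queries reuse it instead of re-reading the data.
cached_pairs = map_phase(lines)       # computed once, held in RAM
query1 = reduce_phase(cached_pairs)   # first query reuses the cached pairs
query2 = {w: n for w, n in query1.items() if n > 1}  # second query, same data

print(counts["spark"])  # 2
```

The point of the sketch is the last three lines: once the mapped pairs are cached, each further query touches only in-memory data, which is exactly where Spark's speed advantage over disk-based MapReduce comes from.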

And the winner is …

It quickly becomes clear that Spark will become very important in the field of Big Data analysis and reach levels that are impossible for Hadoop to achieve. Especially for applications requiring low-latency queries, iterative processing and real-time processing, Spark will be the first choice. This means that some current and future applications will use Spark. But Spark will not completely replace Hadoop. Which technology to use will depend on different factors, such as current industry trends and, above all, the customer-specific use case. So you cannot make a general decision between disk-based and RAM-based computing.
But one thing is certain: there is no way around Big Data!

Big Data Greetings
Sven Löffler
