Realized Design: PolyBase vs. Spark vs. Hive

Wednesday, November 25, 2015

Hadoop has been gaining ground in the last few years, and as it grows, some of its weaknesses are starting to show. For analytics, one issue has been a combination of complexity and speed. Because Hadoop is designed to store unstructured data, at least the first phases of ETL/ELT/EL against that data will be complex. And yes, call it what you want, but it is in fact a form of ETL/ELT/EL. Once the data has been organized (let's say structured), the next questions are: where do we put the data, and how do we manipulate it?

In the early days of Hadoop, the typical approach appeared to be to transfer the data to a more traditional database. It might be an MPP system, such as Vertica or Teradata, or a relational database such as SQL Server. Or you could move the data to a Hive table. Hive uses many of the SQL commands, but the early design of Hive was slow. HBase was also available, but few if any analysts knew its cryptic commands.

Improvements to Analysis

These weaknesses have been addressed through one of two approaches: improve the current Hadoop functionality, or create new external tools that address both the complexity and the speed issues.

YARN & SPARK
MapReduce against Hadoop is slow. Spark provides a cluster computing engine that can run against HDFS, or against a few other non-Hadoop data stores. The enabler for HDFS was Hadoop 2 with YARN. Spark runs under YARN, but much faster than MapReduce. If you are running Hadoop, you will want to include Spark.

Over time, Hive has also improved. The introduction of ORC (Optimized Row Columnar) tables greatly improved performance (see ORC File in HDP 2: Better Compression, Better Performance). For 2016, an even faster Hive will be introduced, called Hive LLAP (Live Long and Process). The goal is to provide sub-second response for Hive tables, including for queries coming from external tools. Here are two links that provide additional details on Hive LLAP:

INTERACTIVE SQL ON HADOOP WITH HIVE LLAP
HIVE LLAP PREVIEW ENABLES SUB-SECOND SQL ON HADOOP AND MORE
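
As a small illustration of the ORC tables mentioned above: in Hive the storage format is chosen per table, so an existing text-backed table can be copied into ORC with a single CREATE TABLE ... STORED AS ORC AS SELECT statement. The sketch below only builds that HiveQL string in Python; the table names are hypothetical, and it assumes you would run the statement through whatever Hive client you already use.

```python
def orc_copy_statement(source_table: str, orc_table: str) -> str:
    """Build a HiveQL statement that copies a table into a new ORC table."""
    return (
        f"CREATE TABLE {orc_table} STORED AS ORC "
        f"AS SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    # Hypothetical table names, used only for illustration.
    print(orc_copy_statement("weblogs_text", "weblogs_orc"))
```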

Working with PolyBase, we have found that once we get the data set up, using it is straightforward. But we still had to get the data into some form of structure. Looking over the Spark documentation, Spark has a lot to offer, but it too begs to have the data first organized into a standardized structure (see Is Apache Spark going to replace Hadoop).

These tools are all somewhat different, so the question might be: how do I choose between MapReduce, Spark, Hive and PolyBase? If you are already using SQL Server and have access to PolyBase, that is the best place to start. You can access traditional delimited text files in Hadoop, as well as the ORC tables in Hive. PolyBase allows you to use the T-SQL commands you already know, and will bypass MapReduce as needed. If you have both PolyBase and Hadoop/Spark, it is not an either/or question. The question is which tool is best for the problem at hand.
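
To make the PolyBase setup a little more concrete, here is a minimal sketch of the three T-SQL statements typically involved in exposing Hadoop data to SQL Server: an external data source, an external file format (delimited text or ORC), and an external table. The Python helper just assembles the statements as strings. The function name, table name, column list, namenode address and HDFS path are all hypothetical placeholders, not values from our environment.

```python
def polybase_ddl(table, columns, hdfs_uri, hdfs_path,
                 file_format="DELIMITEDTEXT", field_terminator="|"):
    """Return the T-SQL statements that expose an HDFS location as a table.

    columns is a list of (name, sql_type) pairs. All object names are
    derived from the table name purely for illustration.
    """
    if file_format not in ("DELIMITEDTEXT", "ORC"):
        raise ValueError("file_format must be DELIMITEDTEXT or ORC")

    source_name = f"{table}_source"
    format_name = f"{table}_format"

    if file_format == "DELIMITEDTEXT":
        format_options = (
            "FORMAT_TYPE = DELIMITEDTEXT, "
            f"FORMAT_OPTIONS (FIELD_TERMINATOR = '{field_terminator}')"
        )
    else:
        format_options = "FORMAT_TYPE = ORC"

    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)

    return [
        f"CREATE EXTERNAL DATA SOURCE {source_name} "
        f"WITH (TYPE = HADOOP, LOCATION = '{hdfs_uri}')",
        f"CREATE EXTERNAL FILE FORMAT {format_name} WITH ({format_options})",
        f"CREATE EXTERNAL TABLE {table} ({column_list}) "
        f"WITH (LOCATION = '{hdfs_path}', DATA_SOURCE = {source_name}, "
        f"FILE_FORMAT = {format_name})",
    ]


if __name__ == "__main__":
    # Hypothetical example: an ORC-backed web log table.
    for statement in polybase_ddl(
        "weblogs",
        [("ip", "VARCHAR(45)"), ("hits", "INT")],
        "hdfs://namenode:8020",
        "/data/weblogs/",
        file_format="ORC",
    ):
        print(statement + ";")
```

Once the external table exists, it can be queried with ordinary SELECT statements and joined to regular SQL Server tables, which is what makes PolyBase an easy starting point for a SQL Server shop.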

See also:
http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html

http://www.infoworld.com/article/3019754/application-development/16-things-you-should-know-about-hadoop-and-spark-right-now.html#tk.drr_mlt