PolyBase vs. Spark vs. Hive
Hadoop has been gaining ground in the last few years, and as
it grows, some of its weaknesses are starting to show. For analytics, one issue has been a
combination of complexity and speed. Because Hadoop is designed to store unstructured
data, at least the first phases of ETL/ELT/EL against that
data will be complex. And yes, call it what you want, but it is
in fact a form of ETL/ELT/EL. Once the data has been organized, let's say
structured, the next questions are: where do we put the data, and how do
we manipulate it?
In the early days of Hadoop, the typical
approach appeared to be to transfer the data to a more traditional database. It might be an MPP system, such as Vertica or
Teradata, or a relational database such as SQL Server. Or you could move the
data to a Hive table. Hive uses many of the SQL commands, but the early design
of Hive was slow. HBase was also available, but few if any analysts knew its
cryptic commands.
Improvements to Analysis
These weaknesses have been addressed with one of two
approaches: improve the current Hadoop
functionality, or create new external tools that address both the complexity
and the speed issues.
YARN & SPARK
MapReduce against Hadoop is slow. Spark is a cluster computing engine that can run against HDFS or a few other non-Hadoop data stores. The enabler for HDFS was Hadoop 2 with YARN. Spark runs under YARN, but is much faster than MapReduce. If you are running Hadoop, you will want to include Spark.
Over time, Hive has also improved, with the introduction of
ORC (Optimized Row Columnar) tables, which greatly improved performance (see ORC File in HDP 2: Better Compression, Better Performance). For 2016, an even faster Hive will be introduced, called Hive LLAP (Live Long and Process). The goal is to provide sub-second response for Hive tables. Here are two links that provide additional details on Hive LLAP:
INTERACTIVE SQL ON HADOOP WITH HIVE LLAP
HIVE LLAP PREVIEW ENABLES SUB-SECOND SQL ON HADOOP AND MORE
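As a rough illustration of the ORC point, here is a minimal HiveQL sketch that copies a delimited text table into an ORC table. The table names, columns, and HDFS path are hypothetical and not taken from any real cluster; treat it as a starting point under those assumptions rather than a tuned design.

```sql
-- Hypothetical staging table over tab-delimited text files already in HDFS.
-- The table name, columns, and LOCATION path are assumptions for illustration.
CREATE EXTERNAL TABLE web_logs_text (
    log_time   STRING,
    user_id    STRING,
    url        STRING,
    bytes_sent BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs/text';

-- Same columns, stored as ORC with ZLIB compression.
CREATE TABLE web_logs_orc (
    log_time   STRING,
    user_id    STRING,
    url        STRING,
    bytes_sent BIGINT
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Rewrite the text data into the columnar ORC layout.
INSERT OVERWRITE TABLE web_logs_orc
SELECT log_time, user_id, url, bytes_sent
FROM web_logs_text;
```

Queries that touch only a few columns then read just those columns from the ORC files, which is where much of the performance gain comes from.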
For external tools, consider PolyBase. Working with PolyBase, we have found that once we get the
data set up, using it is straightforward. But we still had to get the data
into some form of structure. Looking over the Spark documentation, Spark has a
lot to offer, but it too wants the data first organized into a
standardized structure (see Is Apache Spark going to replace Hadoop). The tools are all somewhat different, so the question might
be: how do I choose between MapReduce,
Spark, Hive and PolyBase? If you are already using SQL Server, and have
access to PolyBase, that is the best place to start. You can access traditional text files in
Hadoop, as well as Hive tables stored as ORC or as delimited text. PolyBase
allows you to use the T-SQL commands you already know, and will bypass MapReduce
as needed. If you have both PolyBase and Hadoop/Spark, it is not an either/or question. The question is which tool is best for the problem at hand.
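To give a feel for what "use the T-SQL you already know" looks like, here is a minimal sketch of exposing a Hive ORC table to SQL Server 2016 through PolyBase. The host names, ports, HDFS path, and object names are all hypothetical, and it assumes PolyBase is installed and Hadoop connectivity is already configured on the instance.

```sql
-- Point SQL Server at the Hadoop cluster.
-- Host names and ports below are placeholders, not real endpoints.
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://namenode.example.com:8020',
    -- Needed only if you want PolyBase to be able to push work to the cluster.
    RESOURCE_MANAGER_LOCATION = 'namenode.example.com:8050'
);

-- Describe how the files are stored. For plain text files,
-- FORMAT_TYPE = DELIMITEDTEXT would be used instead.
CREATE EXTERNAL FILE FORMAT OrcFormat
WITH (FORMAT_TYPE = ORC);

-- Map a T-SQL table definition onto the directory holding the Hive ORC table.
-- The path and columns are assumptions for illustration.
CREATE EXTERNAL TABLE dbo.WebLogs (
    log_time   NVARCHAR(30),
    user_id    NVARCHAR(50),
    url        NVARCHAR(400),
    bytes_sent BIGINT
)
WITH (
    LOCATION = '/apps/hive/warehouse/web_logs_orc/',
    DATA_SOURCE = HadoopCluster,
    FILE_FORMAT = OrcFormat
);

-- Query it like any other table. The hint asks PolyBase to read the data
-- directly rather than push computation down to the Hadoop cluster.
SELECT user_id, SUM(bytes_sent) AS total_bytes
FROM dbo.WebLogs
GROUP BY user_id
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Swapping the hint for `OPTION (FORCE EXTERNALPUSHDOWN)` asks for the opposite behavior; with no hint, the optimizer decides.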
See also:
http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
http://www.infoworld.com/article/3019754/application-development/16-things-you-should-know-about-hadoop-and-spark-right-now.html#tk.drr_mlt
8 comments:
Can PolyBase connect to a Hive ORC table? We are getting some errors trying to connect.
Yes, see
http://realizeddesign.blogspot.com/2015/11/connect-polybase-to-your-hive-orc-table.html
Hahaha that's awesome
Hello, can PolyBase leverage Spark instead of MapReduce? Is this possible?
RK, interesting question. I do not know the internals of PolyBase, so I do not know the answer. If we do find out, we'll be sure to post it. Thanks.