Realized Design: September 2016

Thursday, September 29, 2016

Hadoop Format Speed Tests: Parquet,ORC, w;w/o compression

For Hadoop/HDFS, which format is faster?

ORC vs RCfile

According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity. (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).

Parquet vs ORC

On Stackoverflow, contributor Rahul posted an extensive list of results he did comparing ORC vs. Parquet, along with different compressions. You can find the full results here: http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy.

Below are the results that were posted by Rahul:

       
Table A - Text File Format- 2.5GB

Table B - ORC - 652MB

Table C - ORC with Snappy - 802MB

Table D - Parquet - 1.9 GB

Parquet was worst as far as compression for my table is concerned.

My tests with the above tables yielded following results.

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec

Parquet Format Cumulative CPU - 255.2 sec

ORC Format Cumulative CPU - 120.48 sec

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec

ORC Format Cumulative CPU - 165.5 sec

ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU - 72.48 sec

Parquet Format Cumulative CPU - 136.4 sec

ORC Format Cumulative CPU - 96.63 sec

ORC with SNAPPY Cumulative CPU - 82.05 sec

Friday, September 23, 2016

Columnstore Indexes - a reference

With the release of SQL Server 2016, which included a more complete codebase for columnstore indexes, I recently discovered a listing of 86 blog posting by Niko Neugebauer. Perhaps all you might want to know about columnstore indexes.

You can find it here: Columnstore Indexes Blog by Niko Neugebauer

Also:

Trace Flag 834: Avoid when using Columnstore indexes.
See KB3210239: Interoperability of Columnstore indexes with large page memory model

Monday, September 5, 2016

Use a web browser to verify the hadoop IPC port is accessible

A quick, simple method to verify that your Hadoop / hdfs / IPC port is accessible is to simply use a web browser. Setting up PolyBase, you'll need to create an external data source. This is where you point to the Hadoop IPC port. PolyBase is not known for its robust error messages, so just because you created a data sources successfully does not mean it will work.

Lets say your Hadoop hdfs port has this address:

hdfs://192.168.1.120:8020

Simply replace hdfs with http and try to connect with a browser. Like this:

http://192.168.1.120:8020/

You should get the message:

       It looks like you are making an HTTP request to a Hadoop 
       IPC port. This is not the correct port for the web 
       interface on this daemon.

If you don't get this message, you'll need to work out why your Hadoop system is not responding.

Command(s) completed successfully.

PolyBase does not validate your Hadoop Location when you create a new external data source. For example, we created our new, but invalid data source with the following command. One that has a totally invalid hdfs location value:

CREATE EXTERNAL DATA SOURCE [invalid]
WITH (
TYPE = HADOOP
, LOCATION = N'hdfs://192.168.1.130:8020'
, RESOURCE_MANAGER_LOCATION = N'192.168.1.130:8050'
)
GO

PolyBase simple returned this value: Command(s) completed successfully. The point, is that just because you can create a Hadoop external data source does not mean that it is valid.

Pages