Tuesday, November 17, 2015

Connect PolyBase to your Hive ORC Table

Using PolyBase to connect to a plain text Hive table (file) is no different from connecting to any other file in Hadoop. (See:  Connect PolyBase to your Hive database Table: SQL Server 2016) But the future of Hive is moving to the optimized row columnar (ORC) format. According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity.  (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).

Setting up ORC tables in PolyBase is a three step process:


External Data Source      - no sharing between file format types!
External File Format        - specific for ORC
External Table                   - relies on the ORC file format

This follows the same approach used to connect to plain text files. But, do we need to make changes to all three? To some extent, yes. External tables rely on the external file format, so we'll need to either create a new external table, or modify an existing external table. The external file format is where we specify that the source table is an ORC. And finally, the external data source can only support one format type. So, you will need different external data sources for your plain text file formats and your ORC file formats - even if they are pointing to the same Hadoop cluster.  On MSDN - CREATE EXTERNAL TABLE(Transact-SQL), near the bottom in the examples section is a note about data sources. Specifically it states:

System_CAPS_noteNote
All data sources must have the same format type. You cannot have some data
sources with text-delimited file formats and some with RCFILE formats.

ORC Specific External Data Source


For our test, we created a specific external data source just for ORC Hive tables.

CREATE EXTERNAL DATA SOURCE hdp23_orc with 
 (
  TYPE = HADOOP,
  LOCATION ='hdfs://192.168.1.120:8020',
  RESOURCE_MANAGER_LOCATION='192.168.1.120:8050'
 ); 

ORC Specific External File Format


Next, we created three external file formats just for our ORC tables.  Notice that unlike the DELIMITEDTEXT external file formats, there is no need for a field terminator. string delimiter or date format. Besides the FORMAT_TYPE option, the only other option for the ORC format type is for compression.  Here, you can either ask your Hadoop administrator, or experiment to see which one works. We created all three external file formats. One for each of the two explicit data compression options, and a third where we omitted the data compression completely. In our Hortonworks cluster, both the SnappyCodec data compression format and the file format where we omitted the declaration for the data compression worked. Here they are:

  CREATE EXTERNAL FILE FORMAT ORCdefault
  WITH (
        FORMAT_TYPE = ORC
       ,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
       );
  go

  CREATE EXTERNAL FILE FORMAT ORCsnappy
  WITH (
        FORMAT_TYPE = ORC
       ,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
       );
  go

  CREATE EXTERNAL FILE FORMAT ORCnocompress
  WITH (
        FORMAT_TYPE = ORC
       );
 go


ORC Specific External Table


With the ORC specific data source setup and the ORC specific file formats ready to use, all we need is to setup the table. We found that PolyBase wants a strongly typed ORC Hive Table, so here you will need to make sure your SQL data types match those that are present in Hadoop. (See PolyBase wants a strongly typed ORC Hive Table)


   CREATE EXTERNAL TABLE [dbo].sample_07c (
     codeid          nvarchar(200) NULL ,
     descrip         nvarchar(200) NULL ,
     total_emp             int null,
     salary                int null
       )
       WITH (
             LOCATION    ='/apps/hive/warehouse/sample_07c/000000_0',
             DATA_SOURCE = hdp23_orc ,
             FILE_FORMAT = ORCsnappy                  

            );



See also:



12 comments:

Dennes said...

Hi,

Did you tried to connect polybase to HDInsight ?

I had no success yet. I can connect to wasb but not to hdinsight (to enable pushdown).

Do you have some example ? Thank you !

Andrew Peterson said...

Hi Dennes,

It should work with HDInsight - the first release of PolyBase was part of APS (analytic platform system) that connected to HDInsight. But, the current documentation for PolyBase with SQLSvr2016 does not explicitly list HDInsight. So we are left to guess what configuration settings to use.

If you do get it working, please do let me know. And when I do have time, I will take a look at HDInsight.
This posting on Azure lists out the HDInsight version and how it compares to the Hortonworks version (which are listed for the configuration settings):

HDInsight versions and Hadoop components
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/

Dennes said...

Hi, Andrew,

Thank you, I will check this link.

At this moment my problem is that I can't access the ports for name node and resource manager (8020 and 8032) from a remote host, like my machine.

Besides this problem, HDInsight doesn't have classpath property configured and I'm also not sure about how to pass authentication when hdinsight is configured to simple authentication. The documentation says to use credential when hdinsight have kerberos authentication, but says nothing about simple authentication.

Cheers,

Dennes said...

Hi, Andrew,

Thank you, I will check this link.

At this moment my problem is that I can't access the ports for name node and resource manager (8020 and 8032) from a remote host, like my machine.

Besides this problem, HDInsight doesn't have classpath property configured and I'm also not sure about how to pass authentication when hdinsight is configured to simple authentication. The documentation says to use credential when hdinsight have kerberos authentication, but says nothing about simple authentication.

Cheers,

Hendrik Schulmeyer said...

Hi Dennes,

i've got the same problem.
I am trying to connect to Azure HDInsight on Linux via Polybase. But there is a Error, every time i try to do a query on a external table.

Did you solve this problem already? I would appreciate some help.

Cheers,

Hendrik

Andrew Peterson said...

Hendrik,
Dennes added a comment on the PolyBase errors page http://realizeddesign.blogspot.com/2015/10/polybase-error-connecting-to-hadoop-file.html

here is this comment
"Hi,

Finally it worked !

I figured that the message "could not obtain block" usually means the port 50010 (data node port) is not responding to SQL Server.

HDP 2.3 sandbox in marketplace has this behavior, doesn't answer the SQL Server request in 50010, even when telnet to this port is working.

Cheers,"

Dennes said...

Hi, Hendrik,

I didn't connected to HDInsight. I connected to a local HDP sandbox, it worked.

The problem with HDInsight are the ports: The ports that polybase use to connect to hadoop aren't open in hdinsight.

You can try a telnet to check this and check in the main node the opened ports. You will need to figure out which ports you should use.

Hendrik Schulmeyer said...

Thanks Andrew and Dennes for your help.

I tried a HDP Sandbox in the past. It worked aswell.
But with HDInsight, it doesnt'.

The HDInsight Cluster and the VM with SQL Server 2016 are in the same virutal network. A portscan gives the following result.


Scanning 10.1.0.13 [65535 ports]

Discovered open port 8888/tcp on 10.1.0.13

Discovered open port 22/tcp on 10.1.0.13

Discovered open port 19888/tcp on 10.1.0.13

Discovered open port 16001/tcp on 10.1.0.13

Discovered open port 10001/tcp on 10.1.0.13

Discovered open port 8141/tcp on 10.1.0.13

Discovered open port 10033/tcp on 10.1.0.13

Discovered open port 8188/tcp on 10.1.0.13

Discovered open port 8035/tcp on 10.1.0.13

Discovered open port 6188/tcp on 10.1.0.13

Discovered open port 30200/tcp on 10.1.0.13

Discovered open port 37821/tcp on 10.1.0.13

Discovered open port 34259/tcp on 10.1.0.13

Discovered open port 9083/tcp on 10.1.0.13

Discovered open port 30070/tcp on 10.1.0.13

Discovered open port 30111/tcp on 10.1.0.13

Discovered open port 11000/tcp on 10.1.0.13

Discovered open port 8670/tcp on 10.1.0.13

Discovered open port 8009/tcp on 10.1.0.13

Discovered open port 8088/tcp on 10.1.0.13

Discovered open port 8019/tcp on 10.1.0.13

Discovered open port 10020/tcp on 10.1.0.13

Discovered open port 61181/tcp on 10.1.0.13

Discovered open port 8020/tcp on 10.1.0.13

Discovered open port 50073/tcp on 10.1.0.13

Discovered open port 1270/tcp on 10.1.0.13

Discovered open port 61310/tcp on 10.1.0.13

Discovered open port 10200/tcp on 10.1.0.13

Discovered open port 34862/tcp on 10.1.0.13

Discovered open port 8021/tcp on 10.1.0.13


Do i have to force polybase to use a specific port? Where i have to change this setting?

Best regards
Hendrik

Hendrik Schulmeyer said...

Here's the error report ...


12/16/2015 10:34:00 AM [Thread:4452] [EngineInstrumentation:EngineExecuteQueryErrorEvent] (Error, High):
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1362)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4413)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:893)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
] occurred while accessing external file.'
Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException: EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1362)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4413)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:893)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
] occurred while accessing external file.' ---> Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsAccessException: Java exception raised on call to HdfsBridge_IsDirExist: Error [Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1362)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4413)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:893)


.
.
.

Dennes said...

Hi,

You do this in the datasource configuration. Try 8020 port.

Cheers,

Hendrik Schulmeyer said...

Port 8020 doesn't work.

Could it be the the following setting?


hdfs


I already tried "hue" "admin" and the hadoop credentials.




Hendrik Schulmeyer said...

Oh, there some lines missing in the last post...

Should be:


setting name="HadoopUserName" serializeAs="String"
value>hdfs</value
setting