Using PolyBase to connect to a plain text Hive table (file)
is no different from connecting to any other file in Hadoop. (See: Connect PolyBase to your Hive database Table: SQL Server 2016) But the future of
Hive is moving to the optimized row columnar (ORC) format. According to a
posting on the Hortonworks site, both the compression and the performance of
ORC files are vastly superior to plain text Hive tables and RCFile tables.
For compression, ORC files are listed as 78% smaller than plain text files. And
for performance, ORC files support predicate pushdown and improved indexing
that can result in a 44x (4,400%) improvement. Needless to say, ORC files will
only gain in popularity for Hive. (You can
read the posting here: ORC File in HDP 2: Better Compression, Better Performance.)
Setting up ORC tables in PolyBase is a three-step process:
- External Data Source - no sharing between file format types!
- External File Format - specific for ORC
- External Table - relies on the ORC file format
This follows the same approach used to connect to plain text
files. But, do we need to make changes to all three? To some extent, yes. External
tables rely on the external file format, so we'll need to either create a new
external table, or modify an existing external table. The external file format
is where we specify that the source table is an ORC. And finally, the external
data source can only support one format type. So, you will need different
external data sources for your plain text file formats and your ORC file
formats - even if they are pointing to the same Hadoop cluster. On MSDN - CREATE EXTERNAL TABLE(Transact-SQL), near the bottom in the examples section is
a note about data sources. Specifically it states:
Note: All data sources must have the same format type. You cannot have some data sources with text-delimited file formats and some with RCFILE formats.
ORC Specific External Data Source
For our test, we created a specific external data source
just for ORC Hive tables.
CREATE EXTERNAL DATA SOURCE hdp23_orc WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://192.168.1.120:8020',
    RESOURCE_MANAGER_LOCATION = '192.168.1.120:8050'
);
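As a quick sanity check (optional, but handy), you can confirm the data source was registered by querying the sys.external_data_sources catalog view:
-- List the external data sources in this database, including the
-- Hadoop location and resource manager settings.
SELECT name, type_desc, location, resource_manager_location
FROM sys.external_data_sources;
go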
ORC Specific External File Format
Next, we created three external file formats just for our ORC
tables. Notice that unlike the DELIMITEDTEXT external
file formats, there is no need for a field terminator, string delimiter, or date
format. Besides the FORMAT_TYPE option, the only other option for the ORC
format type is data compression. Here, you can either ask your Hadoop
administrator or experiment to see which one works. We created all three
external file formats: one for each of the two explicit data compression
options, and a third where we omitted the data compression completely. In our
Hortonworks cluster, both the SnappyCodec data compression format and the file
format where we omitted the data compression declaration worked. Here they are:
CREATE EXTERNAL FILE FORMAT ORCdefault
WITH (
FORMAT_TYPE = ORC
,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
go
CREATE EXTERNAL FILE FORMAT ORCsnappy
WITH (
FORMAT_TYPE = ORC
,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
go
CREATE EXTERNAL FILE FORMAT ORCnocompress
WITH (
FORMAT_TYPE = ORC
);
go
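Again as an optional check, the new file formats and their compression settings can be reviewed through the sys.external_file_formats catalog view:
-- Show each external file format along with its format type
-- and data compression codec.
SELECT name, format_type, data_compression
FROM sys.external_file_formats;
go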
ORC Specific External Table
With the ORC specific data source set up and the ORC specific
file formats ready to use, all we need to do is set up the table. We found that PolyBase
wants a strongly typed ORC Hive table, so you will need to make sure your
SQL data types match those present in Hadoop. (See PolyBase wants a strongly typed ORC Hive Table.)
CREATE EXTERNAL TABLE [dbo].sample_07c (
    codeid nvarchar(200) NULL,
    descrip nvarchar(200) NULL,
    total_emp int NULL,
    salary int NULL
)
WITH
(
    LOCATION = '/apps/hive/warehouse/sample_07c/000000_0',
    DATA_SOURCE = hdp23_orc,
    FILE_FORMAT = ORCsnappy
);
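Once the external table is created, it can be queried like any other table. A minimal smoke test (the column names come from the table definition above; the row count is just an example):
-- Pull a few rows through PolyBase to confirm the ORC file can be read.
SELECT TOP 10 codeid, descrip, total_emp, salary
FROM dbo.sample_07c;
go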
See also:
- Getting Started with PolyBase in SQL Server 2016
- Create an External Data Source to Hadoop in SQL Server 2016
- Create a PolyBase CSV External File Format in SQL Server 2016
- Create PolyBase External Table in SQL Server 2016
- PolyBase YARN configurations in SQL Server 2016
- PolyBase Settings Folder in SQL Server 2016
Comments:
Hi,
Did you try to connect PolyBase to HDInsight?
I had no success yet. I can connect to wasb but not to HDInsight (to enable pushdown).
Do you have some example? Thank you!
Hi Dennes,
It should work with HDInsight - the first release of PolyBase was part of APS (Analytics Platform System), which connected to HDInsight. But the current documentation for PolyBase with SQL Server 2016 does not explicitly list HDInsight, so we are left to guess what configuration settings to use.
If you do get it working, please do let me know. And when I do have time, I will take a look at HDInsight.
This posting on Azure lists the HDInsight versions and how they compare to the Hortonworks versions (which are the ones referenced for the configuration settings):
HDInsight versions and Hadoop components
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/
Hi, Andrew,
Thank you, I will check this link.
At this moment my problem is that I can't access the ports for the name node and resource manager (8020 and 8032) from a remote host, like my machine.
Besides this problem, HDInsight doesn't have the classpath property configured, and I'm also not sure how to pass authentication when HDInsight is configured for simple authentication. The documentation says to use a credential when HDInsight has Kerberos authentication, but says nothing about simple authentication.
Cheers,
Hi Dennes,
I've got the same problem.
I am trying to connect to Azure HDInsight on Linux via PolyBase, but there is an error every time I try to run a query on an external table.
Did you solve this problem already? I would appreciate some help.
Cheers,
Hendrik
Hendrik,
Dennes added a comment on the PolyBase errors page http://realizeddesign.blogspot.com/2015/10/polybase-error-connecting-to-hadoop-file.html
Here is that comment:
"Hi,
Finally it worked !
I figured that the message "could not obtain block" usually means the port 50010 (data node port) is not responding to SQL Server.
The HDP 2.3 sandbox in the marketplace has this behavior - it doesn't answer the SQL Server request on 50010, even when telnet to this port works.
Cheers,"
Hi, Hendrik,
I didn't connect to HDInsight. I connected to a local HDP sandbox, and it worked.
The problem with HDInsight is the ports: the ports that PolyBase uses to connect to Hadoop aren't open in HDInsight.
You can try a telnet to check this, and check the open ports on the main node. You will need to figure out which ports you should use.
Thanks Andrew and Dennes for your help.
I tried an HDP Sandbox in the past. It worked as well.
But with HDInsight, it doesn't.
The HDInsight cluster and the VM with SQL Server 2016 are in the same virtual network. A port scan gives the following result.
Scanning 10.1.0.13 [65535 ports]
Discovered open port 8888/tcp on 10.1.0.13
Discovered open port 22/tcp on 10.1.0.13
Discovered open port 19888/tcp on 10.1.0.13
Discovered open port 16001/tcp on 10.1.0.13
Discovered open port 10001/tcp on 10.1.0.13
Discovered open port 8141/tcp on 10.1.0.13
Discovered open port 10033/tcp on 10.1.0.13
Discovered open port 8188/tcp on 10.1.0.13
Discovered open port 8035/tcp on 10.1.0.13
Discovered open port 6188/tcp on 10.1.0.13
Discovered open port 30200/tcp on 10.1.0.13
Discovered open port 37821/tcp on 10.1.0.13
Discovered open port 34259/tcp on 10.1.0.13
Discovered open port 9083/tcp on 10.1.0.13
Discovered open port 30070/tcp on 10.1.0.13
Discovered open port 30111/tcp on 10.1.0.13
Discovered open port 11000/tcp on 10.1.0.13
Discovered open port 8670/tcp on 10.1.0.13
Discovered open port 8009/tcp on 10.1.0.13
Discovered open port 8088/tcp on 10.1.0.13
Discovered open port 8019/tcp on 10.1.0.13
Discovered open port 10020/tcp on 10.1.0.13
Discovered open port 61181/tcp on 10.1.0.13
Discovered open port 8020/tcp on 10.1.0.13
Discovered open port 50073/tcp on 10.1.0.13
Discovered open port 1270/tcp on 10.1.0.13
Discovered open port 61310/tcp on 10.1.0.13
Discovered open port 10200/tcp on 10.1.0.13
Discovered open port 34862/tcp on 10.1.0.13
Discovered open port 8021/tcp on 10.1.0.13
Do I have to force PolyBase to use a specific port? Where do I have to change this setting?
Best regards
Hendrik
Here's the error report ...
12/16/2015 10:34:00 AM [Thread:4452] [EngineInstrumentation:EngineExecuteQueryErrorEvent] (Error, High):
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1362)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4413)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:893)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
] occurred while accessing external file.'
Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException: EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1362)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4413)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:893)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:835)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
] occurred while accessing external file.' ---> Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsAccessException: Java exception raised on call to HdfsBridge_IsDirExist: Error [Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1362)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4413)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:893)
.
.
.
Hi,
You do this in the data source configuration. Try port 8020.
Cheers,
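(For illustration, the port goes into the LOCATION of the external data source - and, for pushdown, the RESOURCE_MANAGER_LOCATION. A minimal sketch, assuming the 10.1.0.13 head node address from the port scan above and a hypothetical data source name; the resource manager port is a guess and may differ on HDInsight:)
-- Hypothetical data source pointing at the HDInsight head node;
-- verify the ports the cluster actually exposes before using this.
CREATE EXTERNAL DATA SOURCE hdi_orc WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.1.0.13:8020',
    RESOURCE_MANAGER_LOCATION = '10.1.0.13:8050'
);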
Port 8020 doesn't work.
Could it be the following setting?
hdfs
I already tried "hue", "admin", and the Hadoop credentials.
Oh, there are some lines missing in the last post...
It should be:
<setting name="HadoopUserName" serializeAs="String">
  <value>hdfs</value>
</setting>
Hello,
I have created the Hive table from PolyBase, but somehow I'm not able to see it in Ambari. How can I see a table created under PolyBase in the Ambari table list?