Monday, September 14, 2015

Connect PolyBase to your Hive database Table: SQL Server 2016

A straightforward way to connect PolyBase to your text-delimited Hive tables in Hadoop is to use the DELIMITEDTEXT format method (for non-ORC files). The only difference is that with Hive tables, the file data has been registered internally as a Hive table rather than left as a simple file. But while it is treated as a Hive table, it is still a file within Hadoop, and you can access that file using the PolyBase External Table declaration syntax. (For ORC tables, see Connect PolyBase to your Hive ORC Table.)

Before you can create an External Table in PolyBase, you first need both an External Data Source and an appropriate External File Format. And as with any text-delimited External Table in PolyBase, you'll need to define how the columns are delimited, the column names and data types, and the physical location of the file on the Hadoop server.
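As a reference, here is a minimal sketch of the External Data Source used later in this post. The name hdp23 matches the DATA_SOURCE in the table definition below; the namenode address is only a placeholder that you would replace with your own cluster's host and port.

-- Sketch only: hdp23 is the data source name referenced later in this post,
-- and the LOCATION address is a placeholder for your own Hadoop namenode.
CREATE EXTERNAL DATA SOURCE hdp23
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://192.168.1.120:8020'
);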

To get started, first navigate to the Hive user interface in Hadoop. For this example, we are using Hortonworks. With the UI open, select/click the Tables tab. This will show a listing of the available Hive tables.

HIVE table listing - notice the database name: xademo.
 



With the table listing open, locate the target table; in our case it is call_detail_records. Select/click on it to bring up the Table Metadata page.

Metadata page shows the column names and data types stored in HIVE.




The Metadata page shows how Hive has named the columns and their data types. We will need this to create our External Table. On the left, locate and select/click the View File Location action. This takes you to the folder where the file is located and shows the actual file name stored in the Hive database.

File Browser - File Folder Location.


In our example, the file and folder name to be used for the External Table is:

  /apps/hive/warehouse/xademo.db/call_detail_records/cdrs.txt

In the file-folder name, notice the folder xademo.db. This represents the Hive database named xademo. The next folder, call_detail_records, is identical to the Hive table name. The last value, cdrs.txt, is the file name used within Hadoop on the server. You will also notice that you are now in the File Browser, not Hive.

Select/click the file (in our case cdrs.txt). This opens the file in viewing mode so you can discover the field delimiter being used by Hive. In our case, it is a pipe (|).

Preview of the file data in Hadoop for the file cdrs.txt


For the pipe delimiter, we'll need to create a new External File Format with this syntax:

CREATE EXTERNAL FILE FORMAT Pipefile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);

Using the column and data type values discovered earlier, we can now create the External Table using this syntax:

CREATE EXTERNAL TABLE [dbo].HIVE_asfile_cdrs (
    PHONE         varchar(100) NULL,
    TYPE_1        varchar(40)  NULL,
    IN_OUT        varchar(40)  NULL,
    REC_LOCAL     varchar(40)  NULL,
    DURATION      varchar(40)  NULL,
    REC_DATE      varchar(40)  NULL,
    TIME_         varchar(40)  NULL,
    ROAMING       varchar(40)  NULL,
    AMOUNT        varchar(40)  NULL,
    IN_NETWORK    varchar(40)  NULL,
    IS_PROMO      varchar(40)  NULL,
    TOLL_FREE     varchar(40)  NULL,
    BYTES         varchar(40)  NULL,
    TYPE_2        varchar(40)  NULL
)
WITH (
    LOCATION = '/apps/hive/warehouse/xademo.db/call_detail_records/cdrs.txt',
    DATA_SOURCE = hdp23,
    FILE_FORMAT = Pipefile
);

With the External Table created, test the connection by running a simple SELECT. If all goes well, your Hive table is now available via a PolyBase connection.
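For example, a quick check against the table created above:

-- Simple smoke tests against the new external table
SELECT TOP 10 * FROM [dbo].HIVE_asfile_cdrs;
SELECT COUNT(*) FROM [dbo].HIVE_asfile_cdrs;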



14 comments:

kiran kumar said...

Andrew,

I followed all the steps, but when I execute a SELECT from SQL Server 2016 against the Hive table, it keeps executing without returning anything. Any idea on this issue?

-Kiran

Andrew Peterson said...

First verify that you can connect to a plain text file. If your Hive table uses the RCFile format, I'm not sure this approach will work.

Dennes said...

Hi, Andrew,

I tried your example and a few more using the Hortonworks sandbox in Windows Azure (HDP 2.3), but had no success. I always get the message "could not obtain block".

Do you have any idea on this issue ?

Thank you !

Andrew Peterson said...

Dennes, it sounds like you have a VM in Azure running Hortonworks on Windows? Is that correct? I have not seen that error during my setups. I'm hoping to look over a Windows setup later, so perhaps then I might see the error. I do suggest that you start with a small, simple CSV text file. Also check the sp_configure setup for the correct version of Hortonworks for Windows; HDP 2.3 uses YARN, so you will need that as well.
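A minimal sketch of that sp_configure check (the correct 'hadoop connectivity' value depends on your distribution and OS, so verify it against the PolyBase Connectivity Configuration documentation):

-- Show the current 'hadoop connectivity' setting
EXEC sp_configure @configname = 'hadoop connectivity';

-- Set it to the option matching your Hadoop distribution and OS
-- (e.g. 7 for HDP 2.x per the SQL Server 2016 docs; confirm for your version)
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;
-- Restart SQL Server and the PolyBase services after changing this setting.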

If you can, list out the actual error message, thanks

Dennes said...

Hi, Andrew,

Thank you for your help.

Yes, I have a VM in Azure, but it's running Linux. I created it from the Azure Marketplace, using the Hortonworks 2.3 sandbox.

I tried hadoop connectivity with 5 and 7, but had no success with either option. When I do a SELECT over the table, the message is:

"Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase."

It's too generic and doesn't explain anything. When I try to create statistics over the table, I get a more specific message:

"Could not obtain block: BP-57164570-172.16.137.139-1445956759306:blk_1073742372_1556"

The block number changes when I create different tables.

In SQL Server logs I get two messages:

"Failed stream scan. hr: -2097151974, resultcode : 26"
"shut down stream scan with handle 8 - tname PBQTable{438F0B04-DE57-4CC2-8282-6CAE9F5BA42B}"

I already followed your suggestion; I created a file with only one record, and the error message was the same.

I checked the HDFS log and I can see the access to the file as a success. The name node log has some replication errors, but I don't think they are related, although I can't be sure.

Thank you !

Andrew Peterson said...

Yea, the error messages for PolyBase don't help. Good idea to check the HDFS log.

Wayne said...

I have a table in Hive that consists of 10 files; how can I connect to the table instead of one of the underlying files?

I also have a view in Hive that I would like to connect to; is this possible?

Dinnu said...

Trying to create an external table using Polybase and this is the error I get

EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_Connect: Error [java.net.UnknownHostException: sandbox.hortonworks.com] occurred while accessing external file.'


Any ideas?

Andrew Peterson said...

Dinnu, we have not seen that error. Let us know what you find out.
Also, PolyBase will not read all file types in HDFS. Since it is not clear whether you have verified your overall setup and connections, be sure to check everything, and use a very simple test file at first.

Dennes said...

I didn't see this exact error, but this link solved all my PolyBase pushdown problems: https://community.hortonworks.com/questions/38806/yarn-classpath-value-for-polybase-pushdown.html

Dinnu said...

All I am trying to do is get data from a CSV file with 2 columns and 2 rows. I am just trying to make sure I can access the file. FYI, the Hortonworks sandbox is on Microsoft Azure, and the machine I am trying to access the sandbox from is local to my home. These are the steps, in sequence, to create the external table:

1.
CREATE External DATA SOURCE HortonWorksSandbox
WITH (
TYPE = HADOOP,
LOCATION = 'hdfs://sandbox.hortonworks.com:8020'

)

2.
CREATE EXTERNAL FILE FORMAT CSVFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR =',',
USE_TYPE_DEFAULT = TRUE)
)

3.
CREATE EXTERNAL TABLE Expenses
(TransDate VARCHAR(1000),
Amount INT
)
WITH (
LOCATION = '/apps/hive/warehouse/TestExpenses/TestExpenses.csv',
Data_Source = HortonWorksSandbox,
FILE_FORMAT = CSVFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1)

On researching, folks on Google said I should see the file C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config, but I don't see a Settings folder. I installed SQL Server 2016 on Windows Server 2012 and on Windows 10, but with no luck finding the Settings folder.

Andrew Peterson said...

Are you using the true Azure location, or the one you posted above? "LOCATION = 'hdfs://sandbox.hortonworks.com:8020'"

Have you verified that you can access the Hadoop port from your machine?
Using a browser, you can do a simple test just to make sure you can access the hdfs port.
Just switch hdfs to http, for example: http://192.168.1.120:8020/. You will get a message like this:
"It looks like you are making an HTTP request to a Hadoop IPC port. This is not the correct port for the web interface on this daemon."

And you will need to set up your SQL Server 2016 for YARN.
The file/folder you seek is in:
C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf\

See my posts:
https://realizeddesign.blogspot.com/2015/09/setting-up-polybase-for-yarn-in-sql.html
https://realizeddesign.blogspot.com/2015/09/setting-up-polybase-in-sql-server-2016.html

Dinnu said...

Andrew,
Thanks a lot for the info. I modified the yarn-site.xml file as specified. Using the true Azure location, I get the error below. Is it still because of some access issue?

Could not obtain block: BP-595454498-172.16.137.143-1456768655900:blk_1073742472_1658 file=/apps/hive/warehouse/TestExpenses/TestExpenses.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException: Could not obtain block: BP-595454498-172.16.137.143-1456768655900:blk_1073742472_1658 file=/apps/hive/warehouse/TestExpenses/TestExpenses.csv at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone) at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)

Andrew Peterson said...

I have not seen that error. All I can suggest is to modify the LOCATION in the external table definition to point to just the folder in Hadoop, and not the full file name. See "Best Practices - use a folder" in https://realizeddesign.blogspot.com/2015/09/create-polybase-external-table-in-sql.html.
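As a sketch along those lines, reusing the table definition from the comment above but with only the folder path in the LOCATION:

-- Same table as step 3 above, but LOCATION points at the folder, not the file
CREATE EXTERNAL TABLE Expenses (
    TransDate VARCHAR(1000),
    Amount INT
)
WITH (
    LOCATION = '/apps/hive/warehouse/TestExpenses/',
    DATA_SOURCE = HortonWorksSandbox,
    FILE_FORMAT = CSVFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1
);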

Also, PolyBase has limits on the types of files it can read. If the file is UTF-16, for example, PolyBase cannot read it. If the file is not ASCII or UTF-8, that could be the issue.