Monday, September 14, 2015

Connect PolyBase to your Hive database Table: SQL Server 2016

A straightforward way to connect PolyBase to your text-delimited Hive tables in Hadoop is to use the DELIMITEDTEXT format method (for non-ORC files). The only difference is that with Hive tables, the file data has been registered internally as a Hive table rather than left as a simple file. But while it is treated as a Hive table, it is still a file within Hadoop, and you can access that file using the PolyBase External Table declaration syntax. (For ORC tables, see Connect PolyBase to your Hive ORC Table.)

Before you can create an External Table in PolyBase, you first need both an External Data Source and an appropriate External File Format. And as with any text-delimited External Table in PolyBase, you'll need to define how the columns are delimited, the column names and data types, and the physical location of the file on the Hadoop server.
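As a reference, here is a minimal sketch of the External Data Source used later in this post. The name hdp23 matches the DATA_SOURCE in the table definition below; the namenode address is only a placeholder that you would replace with your own cluster's host and port.

-- Sketch only: hdp23 is the data source name referenced later in this post,
-- and the LOCATION address is a placeholder for your own Hadoop namenode.
CREATE EXTERNAL DATA SOURCE hdp23
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://192.168.1.120:8020'
);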

To get started, first navigate to the Hive user interface in Hadoop. For this example, we are using Hortonworks. With the UI open, select/click the Tables tab. This will show a listing of the available Hive tables.

HIVE table listing - notice the database name: xademo.
 



With the table listing open, locate the target table; in our case it is call_detail_records. Select/click on it to bring up the Table Metadata page.

Metadata page shows the column names and data types stored in HIVE.




The Metadata page shows how Hive has named the columns and their data types. We will need this to create our External Table. On the left, locate and select/click the View File Location action. This takes you to the folder where the file is located and shows the actual file name stored in the Hive database.

File Browser - File Folder Location.


In our example, the file and folder name to be used for the External Table is:

  /apps/hive/warehouse/xademo.db/call_detail_records/cdrs.txt

In the file-folder name, notice the folder xademo.db. This represents the Hive database named xademo. The next folder, call_detail_records, is identical to the Hive table name. The last value, cdrs.txt, is the file name used within Hadoop on the server. You will also notice that you are now in the File Browser, not Hive.

Select/click the file (in our case cdrs.txt). This opens the file in viewing mode so you can discover the field delimiter being used by Hive. In our case, it is a pipe (|).

Preview of the file data in Hadoop for the file cdrs.txt


For the pipe delimiter, we'll need to create a new External File Format with this syntax:

CREATE EXTERNAL FILE FORMAT Pipefile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);

Using the column and data type values discovered earlier, we can now create the External Table using this syntax:

CREATE EXTERNAL TABLE [dbo].HIVE_asfile_cdrs (
    PHONE         varchar(100) NULL,
    TYPE_1        varchar(40)  NULL,
    IN_OUT        varchar(40)  NULL,
    REC_LOCAL     varchar(40)  NULL,
    DURATION      varchar(40)  NULL,
    REC_DATE      varchar(40)  NULL,
    TIME_         varchar(40)  NULL,
    ROAMING       varchar(40)  NULL,
    AMOUNT        varchar(40)  NULL,
    IN_NETWORK    varchar(40)  NULL,
    IS_PROMO      varchar(40)  NULL,
    TOLL_FREE     varchar(40)  NULL,
    BYTES         varchar(40)  NULL,
    TYPE_2        varchar(40)  NULL
)
WITH (
    LOCATION = '/apps/hive/warehouse/xademo.db/call_detail_records/cdrs.txt',
    DATA_SOURCE = hdp23,
    FILE_FORMAT = Pipefile
);

With the External Table created, test the connection by running a simple SELECT. If all goes well, your Hive table is now available via a PolyBase connection.
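For example, a quick check against the table created above:

-- Simple smoke tests against the new external table
SELECT TOP 10 * FROM [dbo].HIVE_asfile_cdrs;
SELECT COUNT(*) FROM [dbo].HIVE_asfile_cdrs;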



14 comments:

kiran kumar said...

Andrew,

I followed all the steps, but when I execute a SELECT from SQL Server 2016 against the Hive table, it keeps executing without returning anything. Any idea on this issue?

-Kiran

Andrew Peterson said...

First verify that you can connect to a plain text file. If your Hive table uses the RCFile format, I'm not sure this approach will work.

Dennes said...

Hi, Andrew,

I tried your example and a few more using the Hortonworks sandbox in Windows Azure (HDP 2.3), but had no success. I always get the message "could not obtain block".

Do you have any idea on this issue ?

Thank you !

Andrew Peterson said...

Dennes, it sounds like you have a VM in Azure running Hortonworks on Windows? Is that correct? I have not seen that error during my setups. I'm hoping to look over a Windows setup later, so perhaps then I might see the error. I do suggest that you start with a small, simple CSV text file. Also check the sp_configure setup for the correct version of Hortonworks for Windows; HDP 2.3 uses YARN, so you will need that as well.
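A minimal sketch of that sp_configure check (the correct 'hadoop connectivity' value depends on your distribution and OS, so verify it against the PolyBase Connectivity Configuration documentation):

-- Show the current 'hadoop connectivity' setting
EXEC sp_configure @configname = 'hadoop connectivity';

-- Set it to the option matching your Hadoop distribution and OS
-- (e.g. 7 for HDP 2.x per the SQL Server 2016 docs; confirm for your version)
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;
-- Restart SQL Server and the PolyBase services after changing this setting.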

If you can, list out the actual error message, thanks

Dennes said...

Hi, Andrew,

Thank you for your help.

Yes, I have a VM in Azure, but it's running Linux. I created it from the Azure Marketplace, using the Hortonworks 2.3 sandbox.

I tried hadoop connectivity with 5 and 7, but had no success with either option. When I do a SELECT over the table, the message is:

"Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase."

It's too generic and doesn't explain anything. When I try to create statistics over the table, I get a more specific message:

"Could not obtain block: BP-57164570-172.16.137.139-1445956759306:blk_1073742372_1556"

The block number changes when I create different tables.

In SQL Server logs I get two messages:

"Failed stream scan. hr: -2097151974, resultcode : 26"
"shut down stream scan with handle 8 - tname PBQTable{438F0B04-DE57-4CC2-8282-6CAE9F5BA42B}"

I already followed your suggestion; I created a file with only one record, and the error message was the same.

I checked the HDFS log and I can see the access to the file as a success. The name node log has some replication errors, but I don't think they are related, although I can't be sure.

Thank you !

Andrew Peterson said...

Yea, the error messages for PolyBase don't help. Good idea to check the HDFS log.

Wayne said...

I have a table in Hive that consists of 10 files; how can I connect to the table instead of one of the underlying files?

I also have a view in Hive that I would like to connect to; is this possible?

Dinnu said...

Trying to create an external table using Polybase and this is the error I get

EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_Connect: Error [java.net.UnknownHostException: sandbox.hortonworks.com] occurred while accessing external file.'


Any ideas?

Andrew Peterson said...

Dinnu, we have not seen that error. Let us know what you find out.
Also, PolyBase will not read all file types in HDFS. Since it is not clear whether you have verified your overall setup and connections, be sure to check everything, and use a very simple test file at first.

Dennes said...

I didn't see this exact error, but this link solved all my PolyBase pushdown problems: https://community.hortonworks.com/questions/38806/yarn-classpath-value-for-polybase-pushdown.html

Dinnu said...

All I am trying to do is get data from a CSV file with 2 columns and 2 rows. I am just trying to make sure I can access the file. FYI, the Hortonworks sandbox is on Microsoft Azure, and the machine I am trying to access the sandbox from is local to my home. These are the steps, in sequence, to create the external table:

1.
CREATE External DATA SOURCE HortonWorksSandbox
WITH (
TYPE = HADOOP,
LOCATION = 'hdfs://sandbox.hortonworks.com:8020'

)

2.
CREATE EXTERNAL FILE FORMAT CSVFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR =',',
USE_TYPE_DEFAULT = TRUE)
)

3.
CREATE EXTERNAL TABLE Expenses
(TransDate VARCHAR(1000),
Amount INT
)
WITH (
LOCATION = '/apps/hive/warehouse/TestExpenses/TestExpenses.csv',
Data_Source = HortonWorksSandbox,
FILE_FORMAT = CSVFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1)

On researching, folks on Google said I should see the file C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config, but I don't see a Settings folder. I installed SQL Server 2016 on Windows Server 2012 and on Windows 10, but with no luck finding the Settings folder.

Andrew Peterson said...

Are you using the true Azure location, or the one you posted above? "LOCATION = 'hdfs://sandbox.hortonworks.com:8020'"

Have you verified that you can access the Hadoop port from your machine?
Using a browser, you can do a simple test just to make sure you can access the hdfs port.
Just switch hdfs to http, for example: http://192.168.1.120:8020/. You will get a message like this:
"It looks like you are making an HTTP request to a Hadoop IPC port. This is not the correct port for the web interface on this daemon."

And you will need to set up your SQL Server 2016 for YARN.
The file/folder you seek is in:
C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf\

See my posts:
https://realizeddesign.blogspot.com/2015/09/setting-up-polybase-for-yarn-in-sql.html
https://realizeddesign.blogspot.com/2015/09/setting-up-polybase-in-sql-server-2016.html

Dinnu said...

Andrew,
Thanks a lot for the info. I modified the yarn-site.xml file as specified. Using the true Azure location, I get the error below. Is it still because of some access issue?

Could not obtain block: BP-595454498-172.16.137.143-1456768655900:blk_1073742472_1658 file=/apps/hive/warehouse/TestExpenses/TestExpenses.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException: Could not obtain block: BP-595454498-172.16.137.143-1456768655900:blk_1073742472_1658 file=/apps/hive/warehouse/TestExpenses/TestExpenses.csv at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone) at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)

Andrew Peterson said...

I have not seen that error. All I can suggest is to modify the LOCATION in the external table definition to point to just the folder in Hadoop, and not the full file name. See "Best Practices - use a folder" in https://realizeddesign.blogspot.com/2015/09/create-polybase-external-table-in-sql.html.
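As a sketch along those lines, reusing the table definition from the comment above but with only the folder path in the LOCATION:

-- Same table as step 3 above, but LOCATION points at the folder, not the file
CREATE EXTERNAL TABLE Expenses (
    TransDate VARCHAR(1000),
    Amount INT
)
WITH (
    LOCATION = '/apps/hive/warehouse/TestExpenses/',
    DATA_SOURCE = HortonWorksSandbox,
    FILE_FORMAT = CSVFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1
);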

Also, PolyBase has limits on the types of files it can read. If the file is UTF-16, for example, PolyBase cannot read it. If the file is not ASCII or UTF-8, that could be the issue.