Tuesday, November 17, 2015

PolyBase wants a strongly typed ORC Hive Table

It appears that PolyBase wants a strongly typed external table definition to make a connection to a Hive ORC table in Hortonworks.  For new tables, or new Hadoop connections, I initially declare all of the columns as varchar or nvarchar. For this test, my ORC table had four columns, the first two were string, and the others were int.
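To make the starting point concrete, here is a sketch of the sort of all-nvarchar definition I'd start with (the column names, data source, and file format names match the working DDL later in the post; only the types differ):

```sql
-- Sketch of the initial attempt: every column declared nvarchar,
-- even though total_emp and salary are int in the Hive ORC table.
-- This definition did not work against the ORC data.
CREATE EXTERNAL TABLE [dbo].sample_07c (
    codeid    nvarchar(200) NULL,
    descrip   nvarchar(200) NULL,
    total_emp nvarchar(200) NULL,   -- actually int in Hive
    salary    nvarchar(200) NULL    -- actually int in Hive
)
WITH (
    LOCATION = '/apps/hive/warehouse/sample_07c/000000_0',
    DATA_SOURCE = hdp23_orc,
    FILE_FORMAT = ORCsnappy
);
GO
```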

So, starting off, all four columns were declared as nvarchar.  But this was a Hive ORC table, and it did not work.  The return message was:

     Msg 106000, Level 16, State 1, Line 75
     org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
     OLE DB provider "SQLNCLI11" for linked server "(null)" returned message
     "Unspecified error".

     Msg 7421, Level 16, State 2, Line 76
     Cannot fetch the rowset from OLE DB provider "SQLNCLI11" for linked server " (null)"..


Looking closely, the key phrase in the error message was "cannot be cast": PolyBase was trying to read Hive IntWritable values into columns declared as text. Once the two int columns were properly declared as int, the create table command ran successfully, and the table could now be queried.  The successful DDL:

CREATE EXTERNAL TABLE [dbo].sample_07c (
    codeid    nvarchar(200) NULL,
    descrip   nvarchar(200) NULL,
    total_emp int NULL,
    salary    int NULL
)
WITH (
    LOCATION = '/apps/hive/warehouse/sample_07c/000000_0',
    DATA_SOURCE = hdp23_orc,
    FILE_FORMAT = ORCsnappy
);
GO
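With the types lined up, the external table queries like any local table. A quick sanity check might look like this (the TOP 10 and ORDER BY are just illustrative):

```sql
-- Sanity check against the new external table.
SELECT TOP 10 codeid, descrip, total_emp, salary
FROM [dbo].sample_07c
ORDER BY salary DESC;
GO
```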


Also note that I had a dedicated data source for the ORC file type. According to MSDN, all external tables that share a data source must have the same file format associated with them.  So, you'll need one data source for your DELIMITEDTEXT format types, and another for your ORC format types. The exact warning in MSDN is:

     Note: All data sources must have the same format type.
     You cannot have some data sources with text-delimited file formats
     and some with RCFILE formats.

            See:  MSDN: CREATE EXTERNAL TABLE (Transact-SQL)
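For completeness, here is a sketch of how the supporting objects might be defined. The names hdp23_orc and ORCsnappy are the ones used above; the Hadoop namenode address is a placeholder for your own cluster:

```sql
-- Dedicated data source for ORC-backed external tables.
-- Replace the LOCATION with your own namenode address and port.
CREATE EXTERNAL DATA SOURCE hdp23_orc
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
);
GO

-- File format for Snappy-compressed ORC files.
CREATE EXTERNAL FILE FORMAT ORCsnappy
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO
```

Keeping one data source per file format, as the MSDN note requires, means a second DELIMITEDTEXT data source would point at the same cluster under a different name.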


