A straightforward way to connect PolyBase to your text-delimited Hive tables in Hadoop is to use the DELIMITEDTEXT format method (for non-ORC files). The only difference with Hive tables is that the file data has been registered with Hive, turning a simple file into a Hive table. But while it is treated as a Hive table, the data is still a file within Hadoop, and you can access that file using the PolyBase External Table declaration syntax. (For ORC tables, see Connect PolyBase to your Hive ORC Table.)
Before you can create an External Table in PolyBase, you first need both an External Data Source and an appropriate External File Format. And as with any text-delimited External Table in PolyBase, you'll need to define how the columns are delimited, the column names and data types, and the physical location of the file on the Hadoop server.
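As a sketch, the External Data Source might be created like this. The name hdp23 matches the one used in the External Table later in this article, but the NameNode host and port are placeholders; substitute the values for your own cluster:

```sql
-- Hypothetical External Data Source pointing at the Hadoop cluster.
-- Replace the host and port with your cluster's NameNode address.
CREATE EXTERNAL DATA SOURCE hdp23
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://your-hadoop-host:8020'
);
```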
To get started, navigate to the Hive user interface in Hadoop. For this example, we are using Hortonworks. With the UI open, select/click the Tables tab. This shows a listing of the available Hive tables.
HIVE table listing - notice the database name: xademo.
With the Table listing open, locate the table; our target is call_detail_records. Select/click it to bring up the Table Metadata page.
Metadata page shows the column names and data types stored in Hive.
The Metadata page shows how Hive has named the columns and their data types. We will need this information to create our External Table. On the left, locate and select/click the View File Location action. This takes you to the folder where the file is located and shows the actual file name stored in the Hive database.
File Browser - File Folder Location.
In our example, the file and folder name to be used for the External Table is:
In the file path, notice the folder xademo.db; this represents the Hive database named xademo. The next folder, call_detail_records, matches the Hive table name. The last element, cdrs.txt, is the file name used within Hadoop on the server. You will also notice that you are now in the File Browser, not Hive.
Select/click the file (in our case, cdrs.txt). This opens the file in viewing mode so you can discover the delimiter Hive is using. In our case, it is a pipe (|).
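For illustration only (these values are hypothetical, not taken from the actual cdrs.txt), a pipe-delimited row looks something like:

```
5551234567|CALL|OUT|LOCAL|120|2015-08-01|14:32|NO|0.25|YES|NO|NO|0|VOICE
```

Each pipe separates one column value, in the same order as the columns shown on the Metadata page.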
Preview of the file data in Hadoop for the file cdrs.txt
For the pipe delimiter, we'll need to create a new External File Format, with this syntax:

CREATE EXTERNAL FILE FORMAT Pipefile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
Using the column names and data types discovered earlier, we can now create the External Table with this syntax:
CREATE EXTERNAL TABLE [dbo].[HIVE_asfile_cdrs] (
    PHONE varchar(100) NULL,
    TYPE_1 varchar(40) NULL,
    IN_OUT varchar(40) NULL,
    REC_LOCAL varchar(40) NULL,
    DURATION varchar(40) NULL,
    REC_DATE varchar(40) NULL,
    TIME_ varchar(40) NULL,
    ROAMING varchar(40) NULL,
    AMOUNT varchar(40) NULL,
    IN_NETWORK varchar(40) NULL,
    IS_PROMO varchar(40) NULL,
    TOLL_FREE varchar(40) NULL,
    BYTES varchar(40) NULL,
    TYPE_2 varchar(40) NULL
)
WITH (
    -- Use the folder path shown earlier in the File Browser;
    -- this example assumes the default Hortonworks Hive warehouse path.
    LOCATION = '/apps/hive/warehouse/xademo.db/call_detail_records/',
    DATA_SOURCE = hdp23,
    FILE_FORMAT = Pipefile
);
With the External Table created, test the connection by running a simple SELECT. If all goes well, your Hive table is now available via a PolyBase connection.
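A minimal smoke test might look like this, using the table and column names defined above:

```sql
-- Quick check: pull a few rows from the Hive file through PolyBase.
SELECT TOP 10 PHONE, TYPE_1, DURATION, REC_DATE
FROM dbo.HIVE_asfile_cdrs;
```

If the query returns rows, PolyBase is reading the underlying cdrs.txt file correctly; if it errors, recheck the delimiter, the LOCATION path, and the External Data Source settings.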