A straightforward way to connect PolyBase to your text-delimited Hive tables in Hadoop is to use the DELIMITEDTEXT format method (for non-ORC files). The only difference with Hive tables is that
the file data has been internally migrated from a simple file
format into a Hive table. But while it is treated as a Hive table, it is still
a file within Hadoop, and you can access that file using the PolyBase External
Table declaration syntax. (For ORC tables, see
Connect PolyBase to your Hive ORC Table.)
Before you can create an External Table in PolyBase, you
first need both an
External Data Source and an appropriate
External File Format. And as with any text-delimited External Table in PolyBase, you'll need
to define how the columns are delimited, the column names and data types, and
the physical location of the file on the Hadoop server.
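If you don't already have an External Data Source pointing at your Hadoop cluster, it can be created along these lines. The name hdp23 matches the data source referenced later in this article; the HDFS host name and port are placeholders you'll replace with your own cluster's values:

```sql
-- Sketch: a Hadoop External Data Source for PolyBase.
-- The name hdp23 is reused later in this article; the hdfs://
-- host and port below are placeholders for your own NameNode.
CREATE EXTERNAL DATA SOURCE hdp23
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://your-hadoop-namenode:8020'
);
```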
To get started, first navigate to the Hive user interface in
Hadoop. For this example, we are using Hortonworks. With the UI open, click the Tables
tab to show a listing of the available Hive tables.
Hive table listing - notice the database name: xademo.
With the table listing open, locate the target table; ours is
called call_detail_records. After
locating the table, click it. This brings up the Table Metadata page.
Metadata page showing the column names and data types stored in Hive.
The Metadata page shows how Hive has named the columns and their data types. We will need this information to create our External Table. On the left, locate and click the
View File Location action. This
takes you to the folder where the file is located, as well as providing the
actual file name stored in the Hive database.
File Browser - File Folder Location.
In our example, the file
and folder path to be used for the External Table is:
/apps/hive/warehouse/xademo.db/call_detail_records/cdrs.txt
In that path, notice the folder xademo.db. This represents
the Hive database named xademo. The next folder, call_detail_records, is identical to the Hive table name. The last value, cdrs.txt, is the file name used within Hadoop on the server.
You will also notice that you are now in the File Browser, not Hive.
Click the file (in our case cdrs.txt) to open it
in viewing mode, so you can discover the file delimiter being used by Hive. In our case, it is a pipe (|).
Preview of the file data in Hadoop for the file cdrs.txt
For the pipe delimiter, we'll need to create a new External File Format with this syntax:
CREATE EXTERNAL FILE FORMAT Pipefile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
Using the column and data type values discovered earlier, we
can now create the External Table using this syntax:
CREATE EXTERNAL TABLE [dbo].HIVE_asfile_cdrs (
    PHONE      varchar(100) NULL,
    TYPE_1     varchar(40) NULL,
    IN_OUT     varchar(40) NULL,
    REC_LOCAL  varchar(40) NULL,
    DURATION   varchar(40) NULL,
    REC_DATE   varchar(40) NULL,
    TIME_      varchar(40) NULL,
    ROAMING    varchar(40) NULL,
    AMOUNT     varchar(40) NULL,
    IN_NETWORK varchar(40) NULL,
    IS_PROMO   varchar(40) NULL,
    TOLL_FREE  varchar(40) NULL,
    BYTES      varchar(40) NULL,
    TYPE_2     varchar(40) NULL
)
WITH (
    LOCATION = '/apps/hive/warehouse/xademo.db/call_detail_records/cdrs.txt',
    DATA_SOURCE = hdp23,
    FILE_FORMAT = Pipefile
);
With the External Table created, test the connection by
running a simple SELECT. If all goes well, your Hive table is now available via
a PolyBase connection.
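The test can be as simple as sampling a few rows from the External Table we just created:

```sql
-- Pull a small sample through PolyBase to confirm the External Table
-- resolves the underlying Hadoop file (TOP 10 is an arbitrary sample size).
SELECT TOP 10 *
FROM [dbo].HIVE_asfile_cdrs;
```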