Using PolyBase to connect to a plain text Hive table (file) is no different from connecting to any other file in Hadoop. (See: Connect PolyBase to your Hive database Table: SQL Server 2016) But the future of Hive is moving to the optimized row columnar (ORC) format. According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity. (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).
Setting up ORC tables in PolyBase is a three step process:
External Data Source - no sharing between file format types!
External File Format - specific for ORC
External Table - relies on the ORC file format
This follows the same approach used to connect to plain text files. But, do we need to make changes to all three? To some extent, yes. External tables rely on the external file format, so we'll need to either create a new external table, or modify an existing external table. The external file format is where we specify that the source table is an ORC. And finally, the external data source can only support one format type. So, you will need different external data sources for your plain text file formats and your ORC file formats - even if they are pointing to the same Hadoop cluster. On MSDN - CREATE EXTERNAL TABLE(Transact-SQL), near the bottom in the examples section is a note about data sources. Specifically it states:
All data sources must have the same format type. You cannot have some data
sources with text-delimited file formats and some with RCFILE formats.
ORC Specific External Data Source
For our test, we created a specific external data source just for ORC Hive tables.
CREATE EXTERNAL DATA SOURCE hdp23_orc with
TYPE = HADOOP,
ORC Specific External File Format
Next, we created three external file formats just for our ORC tables. Notice that unlike the DELIMITEDTEXT external file formats, there is no need for a field terminator. string delimiter or date format. Besides the FORMAT_TYPE option, the only other option for the ORC format type is for compression. Here, you can either ask your Hadoop administrator, or experiment to see which one works. We created all three external file formats. One for each of the two explicit data compression options, and a third where we omitted the data compression completely. In our Hortonworks cluster, both the SnappyCodec data compression format and the file format where we omitted the declaration for the data compression worked. Here they are:
CREATE EXTERNAL FILE FORMAT ORCdefault
FORMAT_TYPE = ORC
,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
CREATE EXTERNAL FILE FORMAT ORCsnappy
FORMAT_TYPE = ORC
,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
CREATE EXTERNAL FILE FORMAT ORCnocompress
FORMAT_TYPE = ORC
ORC Specific External Table
With the ORC specific data source setup and the ORC specific file formats ready to use, all we need is to setup the table. We found that PolyBase wants a strongly typed ORC Hive Table, so here you will need to make sure your SQL data types match those that are present in Hadoop. (See PolyBase wants a strongly typed ORC Hive Table)
CREATE EXTERNAL TABLE [dbo].sample_07c (
codeid nvarchar(200) NULL ,
descrip nvarchar(200) NULL ,
total_emp int null,
salary int null
DATA_SOURCE = hdp23_orc ,
FILE_FORMAT = ORCsnappy
- Getting Started with PolyBase in SQL Server 2016
- Create an External Data Source to Hadoop in SQL Server 2016
- Create a PolyBase CSV External File Format in SQL Server 2016
- Create PolyBase External Table in SQL Server 2016
- PolyBase YARN configurations in SQL Server 2016
- PolyBase Settings Folder in SQL Server 2016