PolyBase - Creating an External File Format
This article continues the series on setting up PolyBase in SQL Server 2016 CTP 2.2 and covers the basic requirements for creating one or more External File Formats. With PolyBase set up and configured in SQL Server 2016, and an External Data Source created, the next step is to create a file format, known as an External File Format.
Why do we need an External File Format?
When file data is loaded into Hadoop, it is loaded as is: unstructured, semi-structured, or perhaps structured. Hadoop does not care what it looks like. The Hadoop file system is simply there to hold data, in whatever form it may take. So with Hadoop, the approach is to load the data first and define the structure later. This allows you, the user, to overlay any format you want onto the data. It even allows you to create different formats for the same data!
But that flexibility comes at a slight cost: before you can use the file data in Hadoop, you first need to tell Hadoop how to read it. In our case, we tell PolyBase how to read the file, and PolyBase translates that for Hadoop. For PolyBase, this is a two-step process. The first step is to define how to read the file, including how to find the column delimiters, if any. The second step is to take that data and place it into a nicely formatted table. We'll cover the second step in the next post, on how to set up an External Table.
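As a quick preview of how the two steps fit together, here is a minimal sketch. The table name, columns, file path, and data source name are placeholders for illustration only; the actual External Table syntax is covered in the next post.

--Step 1 (this post): define how to read the file, e.g. the CSVfile format created below.
--Step 2 (next post): overlay a table structure that uses that format.
--dbo.SampleData, its columns, the path, and MyHadoopCluster are all placeholders.
CREATE EXTERNAL TABLE dbo.SampleData
(
    Id   INT,
    Name VARCHAR(50)
)
WITH (
    LOCATION    = '/user/demo/sampledata/',   --file or folder path in Hadoop
    DATA_SOURCE = MyHadoopCluster,            --the External Data Source created earlier
    FILE_FORMAT = CSVfile                     --the External File Format from step 1
);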
So PolyBase needs to know how to read the file. Is there a field terminator, and if so, what is it? Are there any string delimiters? Are there dates, and if so, what format are they in? And finally, has Hadoop compressed the file data? Answering those questions is what the External File Format does. The External File Format is a reusable format that can be used by one or more of the External Tables that will eventually be created.
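To make that concrete, here is a sketch of a format that answers each of those questions explicitly. The format name and option values are examples only; the generic CSV format we build below needs far less.

--Each format option answers one of the questions above (values are examples only).
CREATE EXTERNAL FILE FORMAT DemoTextFile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',               --the field terminator
        STRING_DELIMITER = '"',               --the string delimiter
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss'   --the format of any date columns
    ),
    --has Hadoop compressed the file data? Here, yes, with Gzip
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);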
Reusable Formats
Over time, you'll end up with a core set of generic, stock file formats for the most common file types stored in Hadoop. One for comma-separated values (CSV) comes to mind, and perhaps one for pipe ('|') delimited files. And undoubtedly, you will end up with some custom, special-use file formats.
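For example, a stock pipe-delimited format is nearly identical to the CSV format created below; only the FIELD_TERMINATOR changes (the name PipeFile is just a suggestion).

CREATE EXTERNAL FILE FORMAT PipeFile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',    --pipe instead of comma
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);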
A Generic External File Format for a CSV file in Hadoop
For this post, we are going to focus only on Hadoop file data, and specifically on comma-separated values (CSV), not on Hive tables. We'll cover the syntax for Hadoop Hive tables later.
The syntax for a Hadoop file is as follows (see the full MSDN syntax for External File Format):

--Create an external file format for a Hadoop text-delimited file.
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
    FORMAT_TYPE = DELIMITEDTEXT
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
    [ , DATA_COMPRESSION = {
          'org.apache.hadoop.io.compress.GzipCodec'
        | 'org.apache.hadoop.io.compress.DefaultCodec'
      } ]
);

<format_options> ::=
{
    FIELD_TERMINATOR = field_terminator
  | STRING_DELIMITER = string_delimiter
  | DATE_FORMAT = datetime_format
  | USE_TYPE_DEFAULT = { TRUE | FALSE }
}
Today, we just want a basic, generic format to read a CSV
file loaded into our Hadoop system.
Here is the script to create a generic External File Format
for a CSV file.
-------------------------------------------------------------------
-- 3) CREATE EXTERNAL FILE FORMAT (used by table)
-------------------------------------------------------------------
use [the same database where you created the External Data Source];
go

--Create an external file format for a Hadoop text-delimited file.
CREATE EXTERNAL FILE FORMAT CSVfile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
Item | Description
file_format_name | User-defined name; here, CSVfile
FORMAT_TYPE = DELIMITEDTEXT | The format to read a delimited text file. The other format types are for Hive tables
FIELD_TERMINATOR = ',' | Since this is a CSV format, a comma is our field delimiter
USE_TYPE_DEFAULT = TRUE | When TRUE, adds a default value when a field value is missing: zero (0) for numeric columns, an empty string for string columns. FALSE stores all missing values as NULL
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec' | DefaultCodec specifies that the data is not compressed
While basic, the CSVfile format is a good starting point for the next step: creating an External Table against a file in Hadoop. One format option we skipped that is frequently needed is STRING_DELIMITER; a sketch of it follows.
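Here is a variation of the generic format above for CSV files that wrap string values in double quotes (the name CSVfileQuoted is just a suggestion; adjust the delimiter to match your files).

CREATE EXTERNAL FILE FORMAT CSVfileQuoted
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',   --strips the quotes wrapping string values
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);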