Thursday, September 3, 2015

Create PolyBase CSV External File Format

PolyBase - Creating an External File Format

This article continues the series on setting up PolyBase in SQL Server 2016 CTP 2.2 and covers the basic requirements for creating one or more External File Formats. With PolyBase installed and configured in SQL Server 2016, and an External Data Source created, the next step is to create a file format, known as an External File Format.

Why do we need an External File Format?


When file data is loaded into Hadoop, it's just loaded as is: unstructured, semi-structured, or perhaps structured. Hadoop does not care what it looks like. The Hadoop file system is just there to hold data, in whatever form it may take. So with Hadoop, the approach is to load the data first and define the structure later. This allows you, the user, to overlay any format you want onto the data. And it allows you to create different formats for the same data!

But that flexibility comes at a slight cost: before you can use the file data in Hadoop, you first need to tell Hadoop how to read it. In our case, we tell PolyBase how to read the file, and PolyBase translates that for Hadoop. For PolyBase, this is a two-step process. The first step is to define how to read a file and how to find the column delimiters, if any. The second step is to take this data and place it into a nicely formatted table; we'll cover that next, in how to set up an External Table.
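As a quick sketch of those two steps (the names QuickCSV, dbo.SensorReadings, the column list, the path, and the data source HadoopStorage are all hypothetical, purely for illustration):

-- Step 1: define how to read the file (the External File Format).
CREATE EXTERNAL FILE FORMAT QuickCSV
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS ( FIELD_TERMINATOR = ',' )
);

-- Step 2: overlay a table shape onto the file (the External Table).
-- HadoopStorage stands in for the External Data Source created earlier.
CREATE EXTERNAL TABLE dbo.SensorReadings (
    ReadingId   INT,
    ReadingDate DATE
)
WITH (
    LOCATION = '/data/sensors/',
    DATA_SOURCE = HadoopStorage,
    FILE_FORMAT = QuickCSV
);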

So PolyBase needs to know how to read the file. Is there a field terminator, and if so, what is it? Are there any string delimiters? Is there a date, and if so, what format is it in? And finally, has Hadoop compressed the file data? Answering those questions is what the External File Format does. The External File Format is a reusable definition that can serve one or more of the External Tables that will eventually be created.

Reusable Formats


Over time, you'll end up with a core set of generic, stock file formats for the most common file types stored in Hadoop. One for comma-separated values (CSV) comes to mind, and perhaps one for pipe-delimited ('|') files, as sketched below. And undoubtedly, you will also end up with some custom, special-use file formats.
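For example, a stock pipe-delimited format might be as simple as this (a sketch; the name PipeFile is just an illustration):

-- One reusable definition can back any number of External Tables.
CREATE EXTERNAL FILE FORMAT PipeFile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' )
);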

A Generic External File Format for a CSV file in Hadoop

For this post, we are going to focus only on Hadoop file data, not on Hive tables; specifically, on comma-separated values (CSV). We'll cover the syntax for Hadoop Hive tables later.

The syntax for a Hadoop text-delimited file is as follows (see the full MSDN syntax for CREATE EXTERNAL FILE FORMAT for every option):


      
--Create an external file format for a Hadoop text-delimited file.
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
    FORMAT_TYPE = DELIMITEDTEXT
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
    [ , DATA_COMPRESSION = {
           'org.apache.hadoop.io.compress.GzipCodec'
         | 'org.apache.hadoop.io.compress.DefaultCodec'
        }
     ]);

<format_options> ::=
{
    FIELD_TERMINATOR = field_terminator
    | STRING_DELIMITER = string_delimiter
    | DATE_FORMAT = datetime_format
    | USE_TYPE_DEFAULT = { TRUE | FALSE }
}

Today, we just want a basic, generic format to read a CSV file loaded into our Hadoop system.

Here is the script to create a generic External File Format for a CSV file.

      
-------------------------------------------------------------------
-- 3)  CREATE EXTERNAL FILE FORMAT  (used by table)
-------------------------------------------------------------------

USE [the same database where you created the External Data Source];
GO

--Create an external file format for a Hadoop text-delimited file.
CREATE EXTERNAL FILE FORMAT CSVfile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        USE_TYPE_DEFAULT = TRUE ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
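Once the script has run, you can confirm the new format exists by querying the sys.external_file_formats catalog view:

-- Lists every External File Format defined in the current database,
-- including its terminators and compression settings.
SELECT * FROM sys.external_file_formats;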

Here is what each item in the script means:

file_format_name
    User-defined name; in this example, CSVfile.

FORMAT_TYPE = DELIMITEDTEXT
    The format type for reading a delimited text file. The other format types are for Hive tables.

FIELD_TERMINATOR = ','
    Since this is a CSV format, a comma is our field delimiter.

USE_TYPE_DEFAULT = TRUE
    When TRUE, a default value is substituted for any missing field value: zero (0) for numeric columns and an empty string for string columns. FALSE stores all missing values as NULL.

DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
    The DefaultCodec specifies that the data is not compressed.
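To illustrate USE_TYPE_DEFAULT, here is a made-up example row (assume the target table has an int, a varchar, and a date column):

-- Hypothetical CSV row in Hadoop; the second (string) field is empty:
--   42,,2015-09-03
--
-- USE_TYPE_DEFAULT = TRUE  -> the missing field is stored as '' (empty string)
-- USE_TYPE_DEFAULT = FALSE -> the missing field is stored as NULL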


One format option we skipped that is frequently needed is the STRING_DELIMITER option; a sketch using it follows below. While basic, this format is a good starting point for going to the next step and creating an External Table against a file in Hadoop.
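If your file data wraps string fields in quotes, a variant like the following handles it (a sketch; the name QuotedCSVfile is my own illustration, not from the original series):

-- Same CSV format, but string fields are wrapped in double quotes,
-- so commas embedded inside strings are read correctly.
CREATE EXTERNAL FILE FORMAT QuotedCSVfile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        USE_TYPE_DEFAULT = TRUE )
);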
