Friday, September 4, 2015

Create PolyBase External Table in SQL Server 2016

To set up a new External Table in PolyBase against a file in Hadoop, you first need to collect some basic information about how the table will be mapped against the file. This is a logical mapping, much like a view. No changes are made to the file in Hadoop.

  1. Where is the file: What is the file name and path on the Hadoop server?
  2. Data Source: You will need an External Data Source, create a new one, or use existing?
  3. File format:   Create new a external file format or Use an existing External File Format, 
  4. Table Column definition:  You will need to determine the number of columns that defines your file, including column names, data types, and data type size or length. All based on how you want to interpret the file. That is the magic of Hadoop. You can organize Hadoop files anyway you want.  However, for many files, perhaps most file, the column break will be along a typical format such as CSV. 

The formal syntax in MSDN on how to create an External Table is as follows:

       
--Create a new external table
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name
    (  [ ,...n ] )
    WITH (
        LOCATION = 'folder_or_filepath',
        DATA_SOURCE = external_data_source_name,
        FILE_FORMAT = external_file_format_name
        [ ,  [ ,...n ] ]
    )
[;]

 ::=
{
    | REJECT_TYPE = value | percentage
    | REJECT_VALUE = reject_value
    | REJECT_SAMPLE_VALUE = reject_sample_value

}
 

Locating the File in Hadoop

If you do not already know the folder location and name of the file you have a few options. One is to ask the Hadoop administrator. Or, if you are using Hortonworks, you can connect to Hadoop using a browser and navigate to the File Browser section.




Using the File Browser in Hortonworks, once you locate your target file, File Browser will provide you with both the file name, and the folder navigation path. File NYSE_dividends_A.csv  is located in the folder /user/hue.  So, when we create the External Table, we'll set LOCATION='/user/hue/NYSE_dividends_A.csv'
Update:  if you plan on inserting or update data in Hadoop from PolyBase, the Location can only be a directory (folder). See PolyBase - Insert data into new Hadoop Directory).

Best Practices - use a folder

Use Folders!  For production, create a new folder for each file type.  Then point your LOCATION= parameter to just the folder, and not the specific file. Why?  1) you can add more files to the directory, and your Polybase External table will automagically read them. 2) you can do INSERTS and UPDATES from PolyBase back to your files in Hadoop. 3) It's cleaner. ( See PolyBase - Insert data into a Hadoop Hue Directory).

Data Source

The list of available data sources for your database are located in External Resources. We've created two - they are identical except for the name. In an earlier post, we covered how to create an External Data Source.




What happens if you reference an invalid data source?  You get this message:

Msg 46501, Level 16, State 1, Line 82
External table references 'DATA_SOURCE' that does not exist.



File Format and Table Layout

For both the file format and the table layout, you will need to assess the overall structure of your file in Hadoop, and how you want the data to be presented in a table format. For our example, we are using a fairly simple CSV file in Hadoop that has four distinct columns. The first two are string values, the third is a date column and the last is numeric.

If you have approached these posts as a tutorial to install, setup and configure PolyBase, up to this point, you don't know if your PolyBase connection to Hadoop is properly setup. To minimize problems, for this initial table test, we kept the file and the table configuration simple. That includes setting up all four columns as long string values.Once we have confirmed our PolyBase configuration is working, we can update the table column data types. For our simple CSV file, we will use the following syntax:


       
       CREATE EXTERNAL TABLE [dbo].[stock] (
              [sname] varchar(200) NULL,
              [CO]    varchar(200) NULL ,
              [DATE]  varchar(200) NULL ,
              [div]   varchar(200) NULL
       )
       WITH (
              LOCATION='/user/hue/NYSE_dividends_A.csv',
              DATA_SOURCE = hdpDemo ,
              FILE_FORMAT = CSVfile
            );
       
 

( If you are trying to connect to a Hive table, please refer to: Connect PolyBase to your Hive Database.)

Final Test

With our table created, we can run a simple query in SSMS against our Hadoop table. To keep the record set small, we'll limit the result set to the top 100 records.

             select top 100  sname, CO, DATE, div  from [dbo].[stock];

If all goes well, we see something like this:




So what happens if you reference an unknown file format? You get this message:

Msg 46501, Level 16, State 2, Line 82
External table references 'FILE_FORMAT' that does not exist.

And what happens if you create an external table, but the declared column width is too small. For example, if we declare the column to be type varchar(2) and it is too small for the incoming data.  PolyBase returns the following message:

Msg 107090, Level 16, State 1, Line 79
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
OLE DB provider "SQLNCLI11" for linked server "(null)" returned message "Unspecified error".
Msg 7421, Level 16, State 2, Line 79
Cannot fetch the rowset from OLE DB provider "SQLNCLI11" for linked server "(null)". .

And what Happens if you try to create an External Table, and the External Data Source is not valid? To test this, we created a bogus data source with an invalid IP address in the LOCATION option. First off, we were able to create the bogus External Data Source with no error. But when you try to create a table using the invalid External Data Source, you'll get this message:

Msg 105019, Level 16, State 1, Line 82
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [Call From WIN-RC147EPV3T3/192.168.1.138 to 192.168.1.113:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.1.113/192.168.1.113:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout] occurred while accessing external file.'


Verify the table connection values using SSMS Table Properties:


Simply highlight the external table in SSMS, select properties, and then click on External Table.
The popup will display the information that was provided for Location, File Format and Data Source when the table was created.








2 comments:

Mehmet Agop Bakkaloglu said...

Hi,

I am trying to create an external table to a file on Azure Blob Storage. And I'm getting the following error:

EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [com.microsoft.windowsazure.services.core.storage.StorageException: The value for one of the HTTP headers is not in the correct format.] occurred while accessing external file.'
However, creating an external table on another Azure Blob Storage file is successful.

This makes me think that there is a version difference between these two Azure Blob containers?? Perhaps I need to change the Azure tools I have installed to something older/newer?

Any suggestions on how to resolve this?

Andrew Peterson said...

Hi Mehmet,

It's great that you are working on connections to Azure Blob's. I've only focused on making connections to Hadoop, so I have not seen that error. One of my big issues with PolyBase is the is the lack of rich PolyBase error messages.

If you can, move a copy of the file that works on the first Azure system to the one that is not working. This will help eliminate one possible problem.

It might also be that your Location address has a hidden character that is causing a problem.

If you do get this resolved, please do add what steps you followed.