Wednesday, September 2, 2015

Create PolyBase External Data Source to Hadoop in SQL Server 2016

With PolyBase installed and configured, the next step is to create an external data source. To quote James Serra on his blog about PolyBase, this is similar to setting up a linked server. But in our case, the linked server is Hadoop.

Using PolyBase, T-SQL commands are directed against what is called an External Table. External Tables are mapped to an External Data Source, and before you can create an External Table, you also need to create an External File Format. So the order of creation, as sketched below, would typically be:
  1. Create an External Data Source
  2. Create one or more External File Formats
  3. Create one or more External Tables
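
For orientation, here is a minimal sketch of that three-step sequence. The file format name, table name, columns, field terminator, and HDFS path below are placeholders for illustration only; the actual data source script we used appears further down in this post.

 -- 1. External Data Source (the real script we used is later in this post)
 CREATE EXTERNAL DATA SOURCE hdp2 WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://NameNode_IP:8020');

 -- 2. External File Format describing how the files are delimited
 CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

 -- 3. External Table mapping a folder in HDFS to columns,
 --    tied to the data source and file format above
 CREATE EXTERNAL TABLE dbo.HdfsTestFile (
    Id   INT,
    Name NVARCHAR(50))
 WITH (
    LOCATION = '/user/test/',
    DATA_SOURCE = hdp2,
    FILE_FORMAT = TextFileFormat);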


Creating an External Data Source to Hadoop

From MSDN-CREATE EXTERNAL DATA SOURCE, the syntax to create an External Data Source to Hadoop is as follows:

--Create an external data source for a Hadoop cluster
CREATE EXTERNAL DATA SOURCE data_source_name
    WITH (
        TYPE = HADOOP,
        LOCATION = 'hdfs://NameNode_URI:port'
        [, RESOURCE_MANAGER_LOCATION = 'ip_address:port' ]
        [, CREDENTIAL = credential_name ]
    )
[;]

Here, data_source_name is a name value you define. The current documentation just specifies that it must be unique within the database. I initially created an External Data Source in the DWConfiguration database, which worked. Using the same script, I then created an External Data Source in a different user database - which is most likely the preferred approach. Check with your Hadoop administrator for the correct IP address and port. For our test, we followed MSDN-CREATE EXTERNAL DATA SOURCE, which tells us that the port for our Hortonworks Hadoop system is 8020.

      Examples:

            Hortonworks 2.0, 2.1, 2.2, or Cloudera 5.1
            Copy this code and replace NameNode_IP with the IP address of your Name node.

               LOCATION = 'hdfs://NameNode_IP:8020'
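
The syntax above also lists an optional CREDENTIAL argument. Whether your Hadoop cluster actually needs one depends on how it is secured, so check with your Hadoop administrator. Purely as a hypothetical sketch - the credential name, identity, and passwords below are placeholders - it would look something like this:

 -- A database scoped credential requires a database master key first
 CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';

 CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
 WITH IDENTITY = 'HadoopUserName', SECRET = '<HadoopPassword>';

 -- The credential is then referenced by name in the data source definition
 CREATE EXTERNAL DATA SOURCE hdp2_secured WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://NameNode_IP:8020',
    CREDENTIAL = HadoopCredential);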


Resource Manager Location

While the argument for the RESOURCE_MANAGER_LOCATION is optional, we included it in our test. MSDN tells us that when RESOURCE_MANAGER_LOCATION is specified, the PolyBase query optimizer makes a cost-based decision to determine whether Hadoop should first pre-process the data before passing it on to PolyBase for inclusion in the query. See MSDN or check with your Hadoop administrator to determine what port number to use for the Resource Manager.
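
If you want to override that cost-based decision for an individual query, SQL Server 2016 provides query hints for it. The table name below is a placeholder:

 -- Force the pre-processing (pushdown) step on the Hadoop side
 SELECT COUNT(*)
 FROM dbo.HdfsTestFile
 OPTION (FORCE EXTERNALPUSHDOWN);

 -- Or keep all of the work in SQL Server
 SELECT COUNT(*)
 FROM dbo.HdfsTestFile
 OPTION (DISABLE EXTERNALPUSHDOWN);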

To create our External Data Source, we used the following script. Run this script in your user database.

No Validation

The LOCATION and RESOURCE_MANAGER_LOCATION values are NOT validated when you execute the CREATE EXTERNAL DATA SOURCE script. At this point, the only way to test the connection is with a test file.
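
In other words, once the data source, a file format, and an external table over a small known file all exist, a plain query is what actually exercises the connection. The table name below is a placeholder:

 SELECT TOP (10) *
 FROM dbo.HdfsTestFile;
 -- If LOCATION or RESOURCE_MANAGER_LOCATION is wrong, the error shows up
 -- here, not when the data source was created.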

T-SQL Script to create a new External Data Source

 -------------------------------------------------------------------
 --  CREATE EXTERNAL DATA SOURCE
 -------------------------------------------------------------------

 -- Create an external data source (Hadoop)
 -- online reference: https://msdn.microsoft.com/en-us/library/dn935022.aspx

 CREATE EXTERNAL DATA SOURCE hdp2 WITH (
    TYPE = HADOOP,
           -- Hortonworks 2.0, 2.1, 2.2, or Cloudera 5.1:
           --   LOCATION = 'hdfs://NameNode_IP:8020'
    LOCATION = 'hdfs://192.168.1.112:8020',
           -- Hortonworks HDP 2.0, 2.1, 2.2 on Linux:
           --   RESOURCE_MANAGER_LOCATION = 'NameNode_IP:8050'
    RESOURCE_MANAGER_LOCATION = '192.168.1.112:8050');



A new External Data Source
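
Besides checking Object Explorer, you can confirm the data source from T-SQL by querying the catalog view sys.external_data_sources in each database where you created one:

 SELECT name, type_desc, location, resource_manager_location
 FROM sys.external_data_sources;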

Multiple data sources used with different File Format Types - Not Allowed

MSDN, in the documentation for External Tables, explicitly states:

Note
All data sources must have the same format type. You cannot have some data sources with text-delimited file formats and some with RCFILE formats.

For PolyBase, you will most likely use the text-delimited format. 
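
As a hypothetical sketch of what that text-delimited format might look like - the name and option values below are examples, so match them to how your files are actually written:

 CREATE EXTERNAL FILE FORMAT PipeDelimitedText
 WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',      -- column separator in the files
        STRING_DELIMITER = '"',      -- quoting character, if any
        DATE_FORMAT = 'yyyy-MM-dd',  -- layout of date columns
        USE_TYPE_DEFAULT = TRUE));   -- replace missing values with type defaults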

Can you have more than one actual external Hadoop or Azure data source?

First we need to differentiate between a physical instance of Hadoop or Azure and a named External Data Source. We can create many named External Data Sources even if they all point to the same physical instance - and in multiple databases. But can we have SQL Server 2016's PolyBase point to more than one physical Hadoop or Azure instance?

To some extent, yes - according to the documentation, but only if you choose option 1 or 4. That leaves Azure or an older version of Hadoop. When you update the server for 'hadoop connectivity' [see Setting up PolyBase in SQL Server 2016], you have to define which Hadoop Server Option you will be using. This immediately limits the range of Hadoop or Azure Blob storage types you can use. Next, if you are using a YARN-enabled Hadoop server, the specific yarn.application.classpath value from your Hadoop server must be added to the SQL Server installation. That seems to limit Hadoop access to a single Hadoop server. For an Azure Blob, if you can use option 4, you should be able to connect to multiple Azure sites.
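
For reference, the 'hadoop connectivity' setting mentioned above is changed with sp_configure. The value below is a placeholder - pick the option number that matches your distribution from the MSDN connectivity configuration topic, and note that SQL Server (and the PolyBase services) must be restarted before the change takes effect:

 EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
 GO
 RECONFIGURE;
 GO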

