With PolyBase installed and configured, the next step is to
create an external data source. To quote James Serra on his blog about PolyBase,
this is similar to setting up a linked server - but in our case, the linked
server is Hadoop.
Using PolyBase, T-SQL commands are directed against what
is called an External Table. Our External
Tables are mapped against an External Data Source, and before you can create
an External Table, you also need to create an External File Format. So, the order of creation would
typically be (a skeletal sketch of the sequence follows the list):
- Create an External Data Source
- Create one or more External File Formats
- Create one or more External Tables
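The sketch below shows the shape of those three statements, in order. It is only a rough outline: the object names, column list, delimiter, and file path are placeholders rather than the values used later in this post.
-- 1. External Data Source: points at the Hadoop cluster
CREATE EXTERNAL DATA SOURCE MyHadoop WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://NameNode_IP:8020');

-- 2. External File Format: describes how the files are stored
CREATE EXTERNAL FILE FORMAT MyTextFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- 3. External Table: maps a file or folder in HDFS to a table definition
CREATE EXTERNAL TABLE dbo.MyExternalTable (
    Id INT,
    Name VARCHAR(50))
WITH (
    LOCATION = '/somefolder/somefile.txt',
    DATA_SOURCE = MyHadoop,
    FILE_FORMAT = MyTextFormat);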
Creating an External Data Source to Hadoop
From MSDN-CREATE EXTERNAL DATA SOURCE, the syntax to create an External Data Source to
Hadoop is as follows:
--Create an external data source for a Hadoop cluster
CREATE EXTERNAL DATA SOURCE data_source_name
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://NameNode_URI:port'
    [, RESOURCE_MANAGER_LOCATION = 'ip_address:port' ]
    [, CREDENTIAL = credential_name ]
)
[;]
Here, data_source_name is a name value you define. The current documentation
just specifies that it must be unique within the database. I initially created an External Data Source in the DWConfiguration database, which worked. Using the same script, I then created an External Data Source in a different, user database, which is most likely the preferred approach. Check with your Hadoop administrator for the correct IP
address and port. For our test, we followed MSDN-CREATE EXTERNAL DATA SOURCE, which tells
us that the port for our Hortonworks Hadoop system is 8020.
Examples:
Hortonworks 2.0, 2.1, 2.2, or Cloudera 5.1
Copy this code and replace NameNode_IP with the IP address of your Name node.
LOCATION = 'hdfs://NameNode_IP:8020'
Resource Manager Location
While the RESOURCE_MANAGER_LOCATION argument is optional,
we included it in our test. MSDN tells us that when RESOURCE_MANAGER_LOCATION is
specified, the PolyBase query optimizer uses a cost-based decision to determine
if Hadoop should first pre-process the data before passing it on to PolyBase
for inclusion in the query. See MSDN or check with your Hadoop administrator to
determine what port number to use for the Resource Manager.
To create our External Data Source, we used the following script. Run this script in your user database.
No Validation
The LOCATION and the RESOURCE_MANAGER_LOCATION values are NOT validated when you execute the create data source script. At this point, the only way to test the connection is with a test file.
T-SQL Script to create a new External Data Source
-------------------------------------------------------------------
-- CREATE EXTERNAL DATA SOURCE
-------------------------------------------------------------------
-- Create an external data source (Hadoop)
-- online reference: https://msdn.microsoft.com/en-us/library/dn935022.aspx
CREATE EXTERNAL DATA SOURCE hdp2 WITH (
    TYPE = HADOOP,
    -- Hortonworks 2.0, 2.1, 2.2, or Cloudera 5.1:
    -- LOCATION = 'hdfs://NameNode_IP:8020'
    LOCATION = 'hdfs://192.168.1.112:8020',
    -- Hortonworks HDP 2.0, 2.1, 2.2 on Linux:
    -- RESOURCE_MANAGER_LOCATION = 'NameNode_IP:8050'
    RESOURCE_MANAGER_LOCATION = '192.168.1.112:8050');
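Since the location values are not validated at create time, it can still be reassuring to confirm that the object exists and that its values were captured as intended. A simple way to do that (a metadata check only, not a connectivity test) is to query the sys.external_data_sources catalog view:
-- List the external data sources defined in the current database.
-- This confirms the metadata only; it does NOT verify that the
-- Hadoop cluster at that location is reachable.
SELECT name, type_desc, location, resource_manager_location
FROM sys.external_data_sources;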
Multiple data sources used with different File Format Types - Not Allowed
MSDN, in the documentation for External Tables, explicitly states:
Note: All data sources must have the same format type. You cannot have some data sources with text-delimited file formats and some with RCFILE formats. For PolyBase, you will most likely use the text-delimited format.
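Tying this back to the No Validation note above: once a text-delimited External File Format exists, a small External Table pointed at a known test file is what actually exercises the connection. In the sketch below, the format name, table definition, and HDFS file path are hypothetical placeholders; only the hdp2 data source name comes from the script above.
-- A text-delimited file format; the delimiter must match the test file.
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- An external table mapped to a test file in HDFS via the hdp2 data source.
CREATE EXTERNAL TABLE dbo.TestFile (
    Col1 VARCHAR(100))
WITH (
    LOCATION = '/user/test/testfile.txt',
    DATA_SOURCE = hdp2,
    FILE_FORMAT = TextFileFormat);

-- Querying the external table is what finally validates the connection.
SELECT TOP 10 * FROM dbo.TestFile;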
Can you have more than one actual external Hadoop or Azure data source?
First, we need to differentiate between a physical instance of Hadoop or
Azure vs. creating a named External Data Source. We can create many named
External Data Sources even if they all point to the same physical instance - and in multiple databases. But
can we have SQL Server 2016's PolyBase point to more than one physical Hadoop
or Azure instance?