Tuesday, July 5, 2016

PolyBase - Insert data into a Hadoop Hue Directory

You can use PolyBase as an ETL tool to insert data into your Hadoop installation from SQL Server. For this example, we will insert data from the Adventure Works data warehouse into a new, empty Hue Directory on HortonWorks. Please Note: this example installs data into the Hadoop file repository. We are not inserting data into a Hive or HCat table.


  1. Hadoop: Create a new Directory in Hue **
  2. PolyBase: Create a new External Table pointing to the Directory
  3. Run your INSERT command

The Details:

With PolyBase, it inserts records into a directory (folder) - not a file. 
Using Hue, go to the file browser, and create a new Directory.

We named our new Directory: AWDW2012_SalesData.  (we are using an old Adventure Works data warehouse database for this example. No reason, it was just there.)

Navigate to the directory, and you'll see that it is empty.

Using the full directory path from Hue, create a new external table.  
(see Have PolyBase INSERT create a new Directory for you on how you can use the Create External Table DDL to instruct Hue to create a new Directory for you). 
DataTypes:  For our external table, our data types match exactly the data types in our SQL Server data warehouse. 

  ModelName  nvarchar(200) NULL ,
  ShipDate  datetime      NULL ,
  ExtendedAmt  money         NULL ,
  OrderQty  smallint   NULL 
 WITH (LOCATION='/user/hue/AWDW2012_SalesData',
    DATA_SOURCE = hdp23 ,
    FILE_FORMAT = CSVfile  

Test your new external table.  Results will return an empty set:

      select  * from  [dbo].AWDW_CSV_Sales_Date;

     (0 row(s) affected)

Now, all we need to do is run the INSERT.

   INSERT INTO  [dbo].AWDW_CSV_Sales_Date (ModelName, ShipDate, ExtendedAmt, OrderQty)
   SELECT top 100 p.[ModelName], s.[ShipDate], s.[ExtendedAmount], s.[OrderQuantity]
     FROM [AdventureWorksDW2012].[dbo].[FactInternetSales] as s 
 left outer join [AdventureWorksDW2012].[dbo].[DimProduct] as p 
           on s.[ProductKey] = p.[ProductKey];


   (100 row(s) affected)

To test, re-run the select query: 

      select  * from  [dbo].AWDW_CSV_Sales_Date;

      Road-150,2005-07-26 00:00:00.000,3578.27,1
      Road-150,2005-07-26 00:00:00.000,3578.27,1
      Road-150,2005-07-26 00:00:00.000,3578.27,1
     (100 row(s) affected)

Let's go back to Hadoop, and review the contents of the Hue directory. PolyBase created eight (8) files to hold 100 records.

With the PolyBase external table setup, you can insert additional records.

No comments: