Tuesday, September 29, 2015

Power BI Desktop - upgrade

On September 23, 2015, Microsoft released the awaited upgrade to the Power BI Desktop tool.
With the December 21, 2015 update, the added a Beta connector to R.  You can find out more at Visualizing and operationalizing R data in Power BI.

What is amazing, is that there is a very long list of enhancements to the desktop tool. And, especially so, since Desktop is a free tool. Has Microsoft lost it? Or perhaps have they finally decided to once again try to take the lead in BI.

A hidden gem is support to connect to an on-premises SSAS Multi-dimensional database. In earlier versions, only Tabular was supported. And the Navigator connection tool only needs to know the name of the server and the authentication user/pw.

Power BI 2.0 desktop September, 2015 update to read about the upgrade in detail.
Power BI 2.0 desktop December 21, 2015 update to read about the upgrade in detail.

 Power BI Get Data Navigator listing the SSAS Multi-dimensional databases


Power BI Relationships Tab - displaying the imported Multi-dimensional table


Power BI Waterfall displaying the imported Multi-dimensional table

Thursday, September 24, 2015

Power BI Updates for September 22, 2015

The most recent Power BI weekly updates ( Sept 22, 2015) added some much needed improvements to the available list of tile sizes. Until this release, there were only a small number of sizes available.

While the introduction of Power BI 2.0 has been well received, the limited tile sizing was a presentation drawback, This release greatly expands tile sizing flexibility that will give report designers a greater ability to deliver the reports being requested.

In this weeks release, it was also suggest that an update to Power BI Desktop will be release very soon.  For more details go to the Power BI Weekly Service Update report.

Friday, September 18, 2015

Power BI: Comparing the Waterfall, Line and Column Charts

The Waterfall visualization (often called a progressive bar chart) combines a column chart with a line chart. In effect, it lays out the incremental data as individual columns that add incrementally to the total. Visually, each record provides you with a proportionately sized column that is added to the total over time. It is a great visual for showing unit and sales growth.

For our example below, we used the monthly gross profit numbers for the Adventure Works sample database. Using a waterfall chart, the value for each month shows up like a step, sized according to the impact it has on the total. Showing sales, gross profit, or any other incremental item allows you to see not only the overall trend, but also the individual contribution by month. It is a really nice way to display these types of trends.

These three graphs display the same data to compare the Waterfall Chart with the Column Chart and the Line Chart.

Waterfall Chart

Line Chart

Column Chart

You can visit the Power BI tutorial at support.powerbi.com.

Data type Changes when loading Power BI Data

During the data loading process, Power BI Desktop evaluates the incoming data, and then assigns what it thinks is the correct data type. And in most cases, it makes a good choice. But what if it makes the wrong choice? 

For example, what if one of the columns is a text field made up of the month and the year? Such as December, 2015. Power BI Desktop decides that this is a data field, and changes the value to 12/1/2015. Great if that is what you want.

Editing the data types

During the initial load process, one of the first screens will be the Navigator. On the bottom right, select/click on the Edit button.

This will bring up the Query Editor. The Query Editor is one of the most powerful tools available to the Power BI Desktop, Over time, you'll find this tool more and more useful. On the right side of the Query Editor, locate the Applied Steps section. For this example, the last step the query editor performed was the Change Type.

Simply right click on the Change Type item, and select Delete. Or, simply select/click on the X immediately to the left of the Change Type item. Once the change has been made, the Sales Month column is now back to the value we wanted.

While we are in the Query Editor, lets update the data type for the monetary columns. Holding the Ctrl button, select/click on the monetary columns. On the Data Type combo box, select the Fixed Decimal Number value. 

For our example, the incoming column in Excel was configured as Currency. Once we change the Data Type to Fixed Decimal Number, Power BI recognizes the columns as Currency.

When you are finished, select/click the Close & Load button. After loading the data, you can review using the Data Tools. The currency columns are now presented as currency values.

Thursday, September 17, 2015

Power BI 2.0 - Desktop or Online Portal for File Loads

This article continues in our exploration on how users can benefit from the free version of Power BI.  Perhaps as a way to demonstrate how Power BI can benefit your organization and to justify an upgrade to the Pro version. Or, you just need to make the most of a free tool. This post focuses on the single table limitation placed on reports built using the Online Portal Report Builder.

Analysts spend endless hours creating and maintaining Excel files to build custom reports for management. And as good as Power BI is, that will not change soon. Excel is an excellent tool to explore and manipulate the data. If an analyst can imagine a way to manipulate data, it almost always can be done in Excel. This free-form ability of Excel works as a great compliment to the free version of Power BI.

Getting started - which tool:  Power BI Desktop or the Online Portal Builder

Since data sources for the free tool are limited, the main determinate is:  can you get the final data set down to a single table? ( see Is the Free version of Power BI 2.0Worth Using?)

        Multiple Data Files as a source?

Power BI Desktop
Online Portal Report Builder
Single Excel File
Multiple Excel Files

Using the free Online Portal Report Builder for Power BI, the design framework expects, and really only supports a single two-dimensional Excel table. Conversely, the Power BI Desktop allows you to work with data from multiple local databases and file sets.  If you can use all the tools in Excel to build out to a single sheet (table) in Excel, the Online Portal is fine. If not, consider the Power BI Desktop.

Viewing the data?

Using the Power BI Desktop, once data has been loaded into the Online Portal, as of Sept 16, 2105, there were no tools available to review the data you just uploaded. If you are the type that needs to see the data, then the Online Portal will be a bit of a hassle. Fortunately, if you use the Power BI Desktop, you can not only have multiple data sources, you can view, and review the data. Including tools to filter the data.

Using Excel Data: Format as Table

Using the Online Portal to import an Excel file from your OneDrive, Power BI looks for a sheet where the style has been setup using Format as Table.  Desktop does not have this limitation. You can have additional supporting pages, with sheet references. You just need the primary/final sheet to be setup as a table using the Format as Table command. On the Home tab of the Excel ribbon bar, before you can use your Excel file in Power BI, you will need to update the "style" of file to a "table."

Over time, you'll find that while the Online Portal is quick, the Desktop is the best way to build out reports for Power BI. 

Wednesday, September 16, 2015

Is the Free version of Power BI 2.0 Worth Using?

There has been a lot of excitement following the general release of Power BI 2.0 earlier this summer. And part of that focus has been on the entry level price - Free. But free has its limits, and Microsoft is not known for distributing free software, let alone robust free software.

One big advantage of software with a free version is that it allows analysts and staff to work with the application, and ideally, use it while they try to convince their manager to pay for a full version. And Power BI 2.0 is no different. So the question might be, is the free version sufficiently robust that I can do real work with the tool. Can I deploy reports that my boss and others can use. And once I have a report built using the free version, just how easy will it be for me to update the data if I don't have the Pro version?

The table below outlines the current differences between the free version and the Pro version (as of Sept 16, 2015).

$9.99 user / month
Data capacity limit
1 GB/user
10 GB/user
Create, view and share your personal dashboards and reports
Author content with the Power BI Desktop
Explore data with Natural Language1
Access your dashboards on mobile devices using native apps for iOS, Windows, and Android
Consume curated content packs for services like Dynamics, Salesforce, and Google Analytics
Import data and reports from Excel, CSV and Power BI Desktop files
Data Refresh
Consume content that is scheduled to refresh
Consume streaming data in your dashboards and reports
10K rows/hour
1M rows/hour
Consume live data sources with full interactivity
Access on-premises data using the Data Connectivity Gateways (Personal and Data Management)
Collaborate with your team using Office 365 Groups in Power BI
Create, publish and view organizational content packs
Manage access control and sharing through Active Directory groups2
Shared data queries through the Data Catalog

Source:  Microsoft (as of Sept 16, 2015)

With the free version, all of the core functionality is available.  And perhaps the best free option is the full featured Power BI Desktop. There is only one version, used by both the free and Pro versions. It allows you to pull data from a nearly complete list of file, database, Azure and other cloud and remote data sources. Unfortunately, the free version of the Online Portal greatly limits your access to data.

Since Power BI Desktop is the same for both the free and Pro version, lets take a look at this tool. You can easily obtain data from multiple sources, and importantly, there is an extensive list of built-in tools to filter the data both vertically (by removing columns) and horizontally (by filtering out rows). The relationship editor allows you to review and adjust how the data tables are joined. You can use the automated matching logic, or make a change to fit your needs. So, while the Online Portal is restricted, for the free tool user, Desktop gives you a lot of flexibility and power.

Data refresh is another area that separates the free version from the Pro version.  The free version only supports a limited number of sources that allow a simple data refresh.  Files in your OneDrive can be used to refresh the free version, but only a manual refresh. 

Another data source available to the free version is the external data service provider. Power BI has a long list of supported sources such as Google Analytics and Salesforce. To see if the one you need is supported this will be a good place to start. Unfortunately, if your data comes from an Azure SQL database, or from an on-premises database your free version options are limited. You'll need to first open the report in Power BI Desktop, refresh the data on your personal workstation, and then upload the full Power BI report file to your Power BI Portal. If your source is only a file, you build your Power BI report using the limited online tools, and upload a new file. So, for initial trials and low budget operations it is very doable. Longer term, you'll want to find a way to move to the relatively inexpensive Pro version.

One limitation with both the free and Pro versions of Power BI 2.0 is where you can publish the final report. With both there is only one place to publish, and that is at the Power BI cloud portal. If you are already making the move to Office 365 you are all set. If not, then management may or may not be ready for cloud based reports. However, this limitation is changing. Pyramid Analytics is introducing a platform where Power BI Desktop reports can be published. ( Power BI Desktop to Pyramid Analytics ). On October 29, 2015, the Microsoft team reported suggested that there will be additional ways/locations to publish Power BI reports, That would be excellent.

So, if you are looking for a great BI tool, and have the ability to publish to the cloud, then Power BI is worth taking a look at for your BI needs. Even if management never approves the monthly cost of Pro, your end users can still access your reports. 

A few months ago, I published an article on the newly released Power BI 2.0 at Redmond Magazine. Called First Look: Power BI 2.0.  If you would like to get started with the new Power BI 2.0, the best place to start is on the Getting Started Page,

Monday, September 14, 2015

Connect PolyBase to your Hive database Table: SQL Server 2016

A straight forward way to connect PolyBase to your Hadoop text-delimited Hive tables is to use the DELIMITEDTEXT format method (non-ORC files). The only difference is that with Hive tables, the file data has been internally migrated from a simple file format into a Hive table. But, while it is treated as a Hive table, it is still a file within Hadoop. And you can access that file using the PolyBase External Table declaration syntax. (For ORC tables see Connect PolyBase to your Hive ORC Table)

Before you can create an External Table in PolyBase, you first need both an External Data Source, and an appropriate External File format. And like any text-delimited External Tables in PolyBase, you'll need to define how the columns are delimited, the column names and data types, and the physical location of the file on the Hadoop server.

To get started, first navigate to the Hive user interface in Hadoop. For this example, we are using Hortonworks.  With the UI open, select/click on the Tables tab. This will show a listing of the Hive tables available.

HIVE table listing - notice the database name:  xademo.

With the Table listing open, locate the table. Our target table is call_detail_records. After locating the table, select/click on it. This brings up the Table Metadata page.

Metadata page shows the column names and datatypes stored in HIVE.

The Metadata page defines how Hive has named the column and their data type. We will need this to create our External Table. On the left, locate and select/click on the View File Location action. This takes you to the folder where the file is located, as well as providing the actual file name stored in the Hive database. 

File Browser - File Folder Location.

In our example, the file and folder name to be used for the External Table is:


In the file-folder name, notice the folder xademo.db. This represents the Hive database named xademo. The next folder, call_detail_records is identical to the Hive table name. The last value, cdrs.txt, is the file name used within Hadoop on the server. You will also notice that you are now in the File Browser, and not Hive.

Select/click the file (in our case cdrs.txt). This opens the file in viewing mode, so you can discover the file delimiter being used by Hive. In our case, it is a pipe.

Preview of the file data in Hadoop for the file cdrs.txt

For the Pipe delimiter, we'll need to create a new External File format, with this syntax:

              , FORMAT_OPTIONS (
              FIELD_TERMINATOR = '|'                  
                     ,USE_TYPE_DEFAULT =  TRUE ),
              DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'

Using the column and data type values discovered earlier, we can now create the External Table using this syntax:

       CREATE EXTERNAL TABLE [dbo].HIVE_asfile_cdrs (
              PHONE         varchar(100) NULL ,
              TYPE_1        varchar(40) NULL ,
              IN_OUT        varchar(40) NULL ,
              REC_LOCAL     varchar(40) NULL ,
              DURATION      varchar(40) NULL ,
              REC_DATE      varchar(40) NULL ,
              TIME_         varchar(40) NULL ,
              ROAMING       varchar(40) NULL ,
              AMOUNT        varchar(40) NULL ,
              IN_NETWORK    varchar(40) NULL ,
              IS_PROMO      varchar(40) NULL ,
              TOLL_FREE     varchar(40) NULL ,
              BYTES         varchar(40) NULL ,
              TYPE_2        varchar(40) NULL
           WITH (LOCATION='/apps/hive/warehouse/xademo.db/call_detail_records/cdrs.txt',
              DATA_SOURCE = hdp23 ,
              FILE_FORMAT = Pipefile           

With the External Table created, test the connection by running a simple SELECT. If all goes well, your Hive table is now available via a PolyBase connection.

Sunday, September 6, 2015

PolyBase Settings Folder in SQL Server 2016

PolyBase in SQL Server 2016 currently has multiple configuration files.
Kerberos:  several have kerberos sections, see https://msdn.microsoft.com/en-us/library/mt712797.aspx, for more details under the kerberos section.

Located in the folder:
C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\

  1. Dms.config
  2. DWEngineService.exe.config

  1. jaas.conf

  1. FailoverDiagnosticsConfig.xml
  2. EngineDiagnosticsConfig.xml
  3. EngineDiagnosticsCloudMdsConfig.xml
  4. DmsDiagnosticsConfig.xml
  5. DmsDiagnosticsCloudMdsConfig.xml
  6. DmsCommonDiagnosticsConfig.xml
  7. CommonDiagnosticsConfig.xml

Located in the folder:     C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\ MSSQL\Binn\Polybase\hadoop\conf
  1. core-site.xml
  2. yarn-site.xml
  3. mapred-site.xml
  4. hive-site.xml
  5. hdfs-site.xml

Earlier CTP releases:

we had these additional files.  It is unclear
  1. Backup.config
  2. Copy.config
  3. CostModel.config
  4. Dsql.config
  5. Hadoop.config
  6. Monitoring.config
  7. Server.config

For CTP 2.2, most, perhaps all of these may never need to be changed.
These are located in the folder:
C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\.


Located in Hadoop.config file, the setting HadoopUserName has a default of pdw_user.

      <setting name="HadoopUserName" serializeAs="String">

We did not have to change this value, but we were using the fairly open Hortonworks Sandbox with the user Hue. However, this value does seem to be a setting used by HDInsight. Until additional documentation is released, if you are have having connection issues, this might be a setting you want to consider updating. Thanks to Hilmar Buchta for pointing out this hidden user name setting.

PolyBase Resources

Just What is PolyBase, and how do I find out more about it:  a list of PolyBase resources 

PolyBase Explained, James Serra, an excellent overview on PolyBase, and what it does, and why you might want to use it.

It's a bit old, having been published on November 15, 2012, but ZDNet's Microsoft's PolyBase mashes up SQL Server and Hadoop is a nice summary of PolyBase and other MPP mash-ups.

Gray Systems Lab at Microsoft - on PolyBase, its origins, including links to publications by David Dewitt.

Another Paper by David DeWitt, Split Query Processing in PolyBase MSDN: Getting Started with PolyBase, SQL Server 2016,  MSDN, always a good reference.

PolyBase vs. Scoop; Ginger Grant,  Scoop provides a link between the relational world and Hadoop, but it is very different from PolyBase.  Ginger Grant provides a nice comparison.

PolyBase in APS - Yet another SQL over Hadoop solution?, from the SQL Server team, some commentary on their view of PolyBase.

SQLPass 2014 Event - The Role of PolyBase in the MDW (PDF), by Brian Mitchell. Based on the APS, but still relevant to the upcoming release in SQL Server 2016.  A good overview for PolyBase. Downloads the file: 24HOP_BMitchell_The_Role_Polybase_in_the_MDW.pdf

SQLPass:  Big Data on the Microsoft Platform (a pdf), by Andrew Brust. Download the file 24HOP_session3_Brust.pdf. Nothing on PolyBase, but nice summary of other Hadoop related items and how they might work with SQL Server.

APS documentation:  PolyBase was first introduced with APS (formerly Parallel Data Warehouse). So, while intended for APS, some of the documents relate to how the database, PolyBase and Hadoop interact. It may not be a direct, one-to-one comparison, but they provide insight. A good place to start is the document: Optimizing Distributed Database Design for Analytics Platform System.docx.  If you download the APS help file {aps.au3.chm}, if it appears empty, go to properties, and at the bottom of the General tab, click on the unblock command.

Hive DDL Language Manual,  A first resource for Hive data definition language (DDL) syntax. Scroll down to the section on External Tables. When you create an External Table in Hive you are just creating a schema against data located elsewhere. Similar to creating a linked table (or perhaps a view). Apparently the PolyBase designers decided to continue with this format.

Polybase in SQL Server 2016 CTP 2, Hilmar Buchta has a brief overview on setting up PolyBase in SQL Server 2016 CTP2.  His post is the only reference I have found that discusses the Hadoop.config file on the SQL Server machine, and updating the HadoopUserName setting.

Friday, September 4, 2015

Create PolyBase External Table in SQL Server 2016

To set up a new External Table in PolyBase against a file in Hadoop, you first need to collect some basic information about how the table will be mapped against the file. This is a logical mapping, much like a view. No changes are made to the file in Hadoop.

  1. Where is the file: What is the file name and path on the Hadoop server?
  2. Data Source: You will need an External Data Source, create a new one, or use existing?
  3. File format:   Create new a external file format or Use an existing External File Format, 
  4. Table Column definition:  You will need to determine the number of columns that defines your file, including column names, data types, and data type size or length. All based on how you want to interpret the file. That is the magic of Hadoop. You can organize Hadoop files anyway you want.  However, for many files, perhaps most file, the column break will be along a typical format such as CSV. 

The formal syntax in MSDN on how to create an External Table is as follows:

--Create a new external table
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name
    (  [ ,...n ] )
    WITH (
        LOCATION = 'folder_or_filepath',
        DATA_SOURCE = external_data_source_name,
        FILE_FORMAT = external_file_format_name
        [ ,  [ ,...n ] ]

    | REJECT_TYPE = value | percentage
    | REJECT_VALUE = reject_value
    | REJECT_SAMPLE_VALUE = reject_sample_value


Locating the File in Hadoop

If you do not already know the folder location and name of the file you have a few options. One is to ask the Hadoop administrator. Or, if you are using Hortonworks, you can connect to Hadoop using a browser and navigate to the File Browser section.

Using the File Browser in Hortonworks, once you locate your target file, File Browser will provide you with both the file name, and the folder navigation path. File NYSE_dividends_A.csv  is located in the folder /user/hue.  So, when we create the External Table, we'll set LOCATION='/user/hue/NYSE_dividends_A.csv'
Update:  if you plan on inserting or update data in Hadoop from PolyBase, the Location can only be a directory (folder). See PolyBase - Insert data into new Hadoop Directory).

Best Practices - use a folder

Use Folders!  For production, create a new folder for each file type.  Then point your LOCATION= parameter to just the folder, and not the specific file. Why?  1) you can add more files to the directory, and your Polybase External table will automagically read them. 2) you can do INSERTS and UPDATES from PolyBase back to your files in Hadoop. 3) It's cleaner. ( See PolyBase - Insert data into a Hadoop Hue Directory).

Data Source

The list of available data sources for your database are located in External Resources. We've created two - they are identical except for the name. In an earlier post, we covered how to create an External Data Source.

What happens if you reference an invalid data source?  You get this message:

Msg 46501, Level 16, State 1, Line 82
External table references 'DATA_SOURCE' that does not exist.

File Format and Table Layout

For both the file format and the table layout, you will need to assess the overall structure of your file in Hadoop, and how you want the data to be presented in a table format. For our example, we are using a fairly simple CSV file in Hadoop that has four distinct columns. The first two are string values, the third is a date column and the last is numeric.

If you have approached these posts as a tutorial to install, setup and configure PolyBase, up to this point, you don't know if your PolyBase connection to Hadoop is properly setup. To minimize problems, for this initial table test, we kept the file and the table configuration simple. That includes setting up all four columns as long string values.Once we have confirmed our PolyBase configuration is working, we can update the table column data types. For our simple CSV file, we will use the following syntax:

       CREATE EXTERNAL TABLE [dbo].[stock] (
              [sname] varchar(200) NULL,
              [CO]    varchar(200) NULL ,
              [DATE]  varchar(200) NULL ,
              [div]   varchar(200) NULL
       WITH (
              DATA_SOURCE = hdpDemo ,
              FILE_FORMAT = CSVfile

( If you are trying to connect to a Hive table, please refer to: Connect PolyBase to your Hive Database.)

Final Test

With our table created, we can run a simple query in SSMS against our Hadoop table. To keep the record set small, we'll limit the result set to the top 100 records.

             select top 100  sname, CO, DATE, div  from [dbo].[stock];

If all goes well, we see something like this:

So what happens if you reference an unknown file format? You get this message:

Msg 46501, Level 16, State 2, Line 82
External table references 'FILE_FORMAT' that does not exist.

And what happens if you create an external table, but the declared column width is too small. For example, if we declare the column to be type varchar(2) and it is too small for the incoming data.  PolyBase returns the following message:

Msg 107090, Level 16, State 1, Line 79
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
OLE DB provider "SQLNCLI11" for linked server "(null)" returned message "Unspecified error".
Msg 7421, Level 16, State 2, Line 79
Cannot fetch the rowset from OLE DB provider "SQLNCLI11" for linked server "(null)". .

And what Happens if you try to create an External Table, and the External Data Source is not valid? To test this, we created a bogus data source with an invalid IP address in the LOCATION option. First off, we were able to create the bogus External Data Source with no error. But when you try to create a table using the invalid External Data Source, you'll get this message:

Msg 105019, Level 16, State 1, Line 82
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [Call From WIN-RC147EPV3T3/ to failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout] occurred while accessing external file.'

Verify the table connection values using SSMS Table Properties:

Simply highlight the external table in SSMS, select properties, and then click on External Table.
The popup will display the information that was provided for Location, File Format and Data Source when the table was created.

Thursday, September 3, 2015

Create PolyBase CSV External File Format

PolyBase - Creating an External File Format

This article continues the series on setting up PolyBase in SQL Server 2016 CTP 2.2 and covers some of the basic requirements for setting up one or more External File Formats. With PolyBase setup and configured in SQL Server 2016, and an External Data Source created, the next step is to create a file format, known as an External File Format.

Why do we need an External File Format?

When file data is loaded into Hadoop, it's just loaded as is. Unstructured, semi-structured or perhaps structured. Hadoop does not care what it looks like. The Hadoop file system is just there to hold data, in whatever form it may have. So with Hadoop, the approach is load data first, define the structure later. This allows you, the user to overlay any format you want onto the data. And it allows you to create different formats for the same data!

But that flexibility comes at a slight cost. And that cost is that before you can use the file data in Hadoop, you first need to tell Hadoop how to read the data. In our case, we tell PolyBase how to read the file and PolyBase translates that for Hadoop.  For PolyBase, this is a two-step process. The first step is to define how to read a file and how to find the column delimiters if any. The second step is how to take this data and place it into a nicely formatted table. We'll cover that next in how to setup an External Table.

So PolyBase needs to know how to read the file. Is there a field terminator?  What is the field terminator? Are there any string delimiters?  If there is a date, and if so what is the format of the data?  And finally, has Hadoop compressed the file data? That is what the External File Format does. The External File Format is a reusable format that can be used for one or more External Tables that will eventually be created.

Reusable Formats

Over time, you'll end with a core set of generic, stock file formats for the most common file types stored in Hadoop.  One for comma separated values (CSV) comes to mind. And perhaps one for a pipe '|'.  And undoubtedly, you will end up with some custom special use file formats.

A Generic External File Format for a CSV file in Hadoop

For this post, we are going to focus only on Hadoop file data, and not on HIVE tables. And specifically, comma separated (CSV) values. We'll cover the syntax for Hadoop Hive tables later.

Syntax for a Hadoop file is as follows: Full MSDN Syntax for External File Format

--Create an external file format for a Hadoop text-delimited file.
    [ , FORMAT_OPTIONS (  [ ,...n  ] ) ]
         | 'org.apache.hadoop.io.compress.DefaultCodec'

    FIELD_TERMINATOR = field_terminator
    | STRING_DELIMITER = string_delimiter
    | DATE_FORMAT = datetime_format

Today, we just want a basic, generic format to read a CSV file loaded into our Hadoop system.

Here is the script to create a generic External File Format for a CSV file.

-- 3)  CREATE EXTERNAL FILE FORMAT  (used by table)

 use [the same database where you created the External Data Source];

 --Create an external file format for a Hadoop text-delimited file.
    WITH (
           , FORMAT_OPTIONS (
             FIELD_TERMINATOR = ','
             ,USE_TYPE_DEFAULT =  TRUE ),
          DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'

User defined name:  CSVfile
Format to read a file. The others are for HIVE tables
Since this is a CSV format, a comma is our field delimiter
When TRUE, add in a default value when a field value is missing. Zero (0) for numeric, Empty string for string columns.  FALSE stores all missing values as a NULL
 DATA_COMPRESSION = 'org.apache.hadoop.io.
The DefaultCodec specifies that the data is not compressed.

 One format option we skipped that is frequently needed was the STRING_DELIMITER option. While basic, this is a good starting point for going to the next step and creating an External Table against a file in Hadoop.

Wednesday, September 2, 2015

Create PolyBase External Data Source to Hadoop in SQL Server 2016

With PolyBase installed and configured the next step is to create an external data source. To quote James Serra on his blog about PolyBase, this is similar to setting up a linked server. But in our case, the linked server is Hadoop.

Using PolyBase, the T-SQL commands are directed against what is called an External Table.  Our External Tables are mapped against an External Data Source. And, before you can create an External Table, you also need to create an External  File Format. So, the order of creation would typically be:
  1. Create an External Data Source
  2. Create one, or more External File Formats
  3. Create one or more External Tables

Creating an External Data Source to Hadoop

From MSDN-CREATE EXTERNAL DATA SOURCE, the syntax to create an External Data Source to Hadoop is as follows:

--Create an external data source for a Hadoop cluster
    WITH (
        TYPE = HADOOP
        LOCATION = 'hdfs://NameNode_URI:port'
        [, RESOURCE_MANAGER_LOCATION = 'ip_address:port' ]
        [, CREDENTIAL = credential_name ]

Here, data_source_name is a name value you define.  The current documentation just specifies that it must be unique within the database.  I Initially created an External Data Source in the DWConfiguration database, which worked. Using the same script, I then created an External Data Source in a different, user database - which is most likely the preferred approach. Check with your Hadoop administrator for the correct IP address and port. For our test, we followed the MSDN-CREATE EXTERNAL DATA SOURCE that tells us that the port our Hortonworks Hadoop system is 8020.


            Hortonworks 2.0, 2.1, 2.2, or Cloudera 5.1
            Copy this code and replace NameNode_IP with the IP address of your Name node.

               LOCATION = 'hdfs://NameNode_IP:8020'

Resource Manager Location

While the argument for the RESOURCE_MANAGER_LOCATION is optional, we included it in our test. MSDN tells us that when the RESOURCE_MANAGER_LOCATION is specified, the PolyBase query optimizer uses a cost-based decision to determine if Hadoop should first pre-process the data before passing it on to PolyBase for inclusion in the query. See MSDN or check with your Hadoop administrator to determine what port number to use for the Resource Manager.

To create our External Data Source, we used the following script. Run this script in your user database.

No Validation

The LOCATION and the RESOURCE_MANAGER_LOCATION  values are NOT validated when you execute the create data source script. At this point, the only way to test the connection is with a test file.

T-SQL Script to create a new External Data Source


 -- Create an external data source (Hadoop)
 -- online reference: https://msdn.microsoft.com/en-us/library/dn935022.aspx

           -- Hortonworks 2.0, 2.1, 2.2, or Cloudera 5.1:  = 
           --   LOCATION = 'hdfs://NameNode_IP:8020'
    LOCATION ='hdfs://',
           -- Hortonworks HDP 2.0, 2.1, 2.2 on Linux: 
           -- RESOURCE_MANAGER_LOCATION = 'NameNode_IP:8050'

A new External Data Source

Multiple data sources used with different File Format Types - Not Allowed

MSDN, in the documentation for External Tables explicitly states:

All data sources must have the same format type. You cannot have some data sources with text-delimited file formats and some with RCFILE formats.

For PolyBase, you will most likely use the text-delimited format. 

Can you Have more than one actual external Hadoop or Azure Data source?

First we need to differentiate between a physical instance of Hadoop or Azure  vs. creating a named External Data Source.  We can create many named External Data Sources even if they all point to the same physical instance - and in multiple databases. But can we have SQL Server 2016's PolyBase point to more than one physical Hadoop or Azure instance?

To some extent, yes - according to the documentation, but only if you choose option 1 or 4. That leaves Azure or an older version of Hadoop. When you update the server for 'hadoop connectivity'  [See Setting up PolyBase in SQL Server 2016], you have to define which Hadoop Server Option you will be using. This immediately limits your type range of Hadoop or Azure Blob storage. And next, if you are using a YARN enabled Hadoop server, the specific yarn.application.classpath value from your Hadoop server must be added to the SQL Server installation.  That seems to limit Hadoop access to a single Hadoop Server. For an Azure Blob, if you can use option 4, you should be able to connect to multiple Azure sites.