Monday, November 30, 2015

Troubleshooting Master Data Services Installation

SQL Server 2016 continues to ship with Master Data Services (MDS). Whether or not you find it useful, if you move forward with the installation and setup of MDS, there is a good chance you'll get a notification that it was not installed properly.

If the installation and setup problems focus on IIS, then all you may need to do to get MDS up and running is to add the missing IIS components using Server Manager's Add Roles and Features wizard. Once you have added the necessary components, start the MDS Configuration Manager. If it still reports missing components, add those items and restart the MDS Configuration Manager (be sure to close it completely first).

For a detailed listing of troubleshooting items for MDS, refer to the TechNet article SQL Server Troubleshooting: Master Data Services Installation and Configuration.

Wednesday, November 25, 2015

PolyBase vs. Spark vs. Hive


Hadoop has been gaining ground over the last few years, and as it grows, some of its weaknesses are starting to show. For analysis/analytics, one issue has been a combination of complexity and speed. Given that Hadoop is designed to store unstructured data, at least the first phases of ETL/ELT/EL against that data will be complex. And yes, call it what you want, but it is in fact a form of ETL/ELT/EL. Once the data has been organized (let's say structured), the next questions are: where do we put the data, and how do we manipulate it?

In the early days of Hadoop, the typical approach was to transfer the data to a more traditional database. It might be an MPP system, such as Vertica or Teradata, or a relational database such as SQL Server. Or you could move the data to a Hive table. Hive supports many of the familiar SQL commands, but the early design of Hive was slow. HBase was also available, but few if any analysts knew its cryptic commands.

Improvements to Analysis

These weaknesses have been addressed in one of two approaches:  Improve the current Hadoop functionality, or create new external tools that address both the complexity and the speed issues.

YARN & SPARK
MapReduce against Hadoop is slow. Spark provides a cluster computation engine that can run against HDFS, as well as a few other non-Hadoop data stores. The enabler for HDFS was Hadoop 2 with YARN. Spark runs under YARN, but much faster than MapReduce. If you are running Hadoop, you will want to include Spark.

And over time, Hive has improved with the introduction of ORC (optimized row columnar) tables, which greatly improved performance (see ORC File in HDP 2: Better Compression, Better Performance). For 2016, an even faster Hive will be introduced, called Hive LLAP (Live Long and Process). The goal: sub-second response for Hive tables. Here are two links that provide additional details on Hive LLAP:

         INTERACTIVE SQL ON HADOOP WITH HIVE LLAP
         HIVE LLAP PREVIEW ENABLES SUB-SECOND SQL ON HADOOP AND MORE


Working with PolyBase, we have found that once we get the data set up, using it is straightforward. But we still had to get the data into some form of structure. Looking over the Spark documentation, Spark has a lot to offer, but it too begs to have the data first organized into a standardized structure (see Is Apache Spark going to replace Hadoop). The tools are all somewhat different, so the question might be: how do I choose between MapReduce, Spark, Hive, and PolyBase? If you are already using SQL Server and have access to PolyBase, that is the best place to start. You can access traditional text files in Hadoop, as well as ORC tables (or delimited text tables) in Hive. PolyBase lets you use the T-SQL commands you already know, and will bypass MapReduce as needed. If you have both PolyBase and Hadoop/Spark, it is not an either/or question. The question is which tool is best for the problem at hand.
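To make that concrete, here is a minimal T-SQL sketch of querying a PolyBase external table. The table and column names are illustrative (they mirror the sample_07c example used elsewhere on this blog), and the pushdown hint assumes the external data source was created with a RESOURCE_MANAGER_LOCATION:

```sql
-- Query a PolyBase external table exactly like a local table.
-- Hypothetical names; assumes an external table dbo.sample_07c already exists.
SELECT TOP (100)
       s.codeid, s.descrip, s.salary
FROM   dbo.sample_07c AS s
WHERE  s.salary > 50000
OPTION (FORCE EXTERNALPUSHDOWN);  -- push the predicate down to Hadoop as a MapReduce job
-- Use OPTION (DISABLE EXTERNALPUSHDOWN) instead to stream rows into SQL Server.
```

Without either hint, the engine decides for itself whether pushing the predicate to Hadoop is cheaper than streaming the data.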

See also:
http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html

http://www.infoworld.com/article/3019754/application-development/16-things-you-should-know-about-hadoop-and-spark-right-now.html#tk.drr_mlt


Tuesday, November 24, 2015

PolyBase and HDInsight

HDInsight has two variants: Azure, and a "difficult to find" on-premise version. In this posting, we'll focus primarily on the on-premise version. But it's not clear (as of Nov 24, 2015) just what HDInsight for Windows Server is, or whether it even still exists. Several years ago, I downloaded an early version, but I can no longer find a download. Now, I can only find an install for the Emulator (below). The other references point to www.microsoft.com/bigdata/ and those point to SQL Server.

Before you get started: overall, it seems that the on-premise HDInsight product has been abandoned by Microsoft in favor of the Hortonworks for Windows kits. The focus is on Azure HDInsight and Hortonworks for Windows. The documentation is old, the links do not always work, and there are few comments about this elusive on-premise package; those that exist are about how it does not work. So, if you want Hadoop running on Windows, skip directly to Hortonworks (see link below). And keep in mind that there are no listed sp_configure options for HDInsight - none!
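For comparison, PolyBase's supported Hadoop distributions are selected on the SQL Server side through sp_configure. A sketch of what that looks like; the configuration value shown here targeted Hortonworks HDP 2.x in the CTP-era documentation, so verify the current value for your distribution before running it:

```sql
-- Tell PolyBase which Hadoop distribution to expect.
-- The value is distribution-specific; 7 was the Hortonworks HDP 2.x on Linux
-- option at the time of writing - check the documentation for the current list.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;
-- A restart of the SQL Server services is required for the change to take effect.
```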

Can we get it to Work?

Be forewarned: several people have commented that they cannot connect PolyBase to HDInsight in Azure, and I have not yet had time to give it a go.

Where do I get started?

Start with the TechNet overview: HDInsight Server.
This will take you to another TechNet article, Getting Started with the Windows Azure HDInsight Emulator. So, what does that mean? From some of the phrasing, it seems the emulator is limited to a single node, for testing purposes. Further down the page, another reference takes you to a Hortonworks partnership page: Hortonworks & Microsoft: Bringing Apache Hadoop to Windows. That page has links to try HDP on Windows, or to try HDP in Azure.

Deep in one of the pages, you'll find your way to the Microsoft Azure page:  Install the HDInsight Emulator.

HDInsight Emulator for Windows Azure 

Once you get to the Azure page, you'll find a link to the install page: Microsoft HDInsight Emulator for Windows Azure. There is also an earlier preview version; it is undated, so it might not be current, but here is the link: Microsoft HDInsight Emulator for Windows Azure (Preview). I found one comment that it does not even show up in the installer! If you have any success with the Emulator, please let me know.

Is HDP in Azure different from or the same as HDInsight?

According to this partners page:  Hortonworks Data Platform, they are different. Perhaps similar, but different.

PolyBase to the HDInsight Emulator

Once I have more information, I'll update.


Other Resources

Introduction to Azure HDInsight
Microsoft HDInsight Server for Windows and Windows Azure HDInsight Service Announced
How to Install Microsoft HDInsight Server Hadoop on Windows 8 Professional

Friday, November 20, 2015

Setting Up Sqoop to connect to SQL Server

Looking for tutorials, reference guides, and info on how to set up Sqoop to connect to SQL Server? Below are some good starting points.


Resources


Hortonworks tutorial: Import from Microsoft SQL Server into the Hortonworks Sandbox using Sqoop

Microsoft SQL Server - Hadoop Connector User Guide, a PDF on how to set up and use Sqoop with Hadoop

The Apache.org Sqoop User Guide (v1.4.3)

Plus some other sites that may be useful:

Sqoop and Microsoft SQL Server

Thursday, November 19, 2015

Power BI Mobile - a hidden gem

One of the hidden gems of the new Power BI 2.0 is the mobile app. It's free, available for iPhone, Android and Windows mobile devices, and it is very good. Starting with the December 2015 release of the Power BI mobile app, you can view Reporting Services 2016 (SSRS) reports and KPIs as well.




Getting Started

There are only two steps to getting started with the Power BI mobile app: download it from the app store, and sign in with your Power BI login. Once you have signed in, you instantly have access to all your dashboards.

                       Main Menu                                                                 My Workspace



  
After selecting a dashboard, you are presented with all of its visualizations. Holding the phone in the normal portrait mode, the visualizations are stacked vertically:



But if you hold the phone in landscape mode, the visualizations are shown similar to how they are organized in the dashboard.



Moving about the app

It's easy to move about the dashboards, reports and visualizations in the app. The visualizations are scaled well and are easy to read. Tap on any individual visualization, and the app zooms into just that visualization. On the iPhone, you can also send a visualization via AirDrop, email it, add it to Notes, print it, and so on.


Favorites and Sharing

While you are reviewing a dashboard, you can share it with another Power BI user. Or you can add individual visualizations to your favorites list - handy once you have more than a few dashboards in your workspace.



Alerts

You can set up an alert, but only on what are called card tiles. These are the single-number visualizations, such as the ones you find in the Retail Analysis Sample - for example: This Year's Sales.


Weaknesses

One of the big selling features for Power BI is the interactive nature of the dashboards. Select an item, and all of the visualizations in the dashboard are adjusted. In the mobile tool, interactivity is limited to the single visualization.

Alerts are another area where the usage is limited. For now, you can only set up an alert on a card tile. Great if your dashboard has one, but for many users, it will be awkward to add an endless number of card tiles just to support alerts on the mobile app.

Conclusion

If you are already using Power BI, this is a great addition, and for those that have a need to access their reports using a mobile device, the Power BI mobile app is a great start. Microsoft continues to improve Power BI, and has really ramped up the BI presentation toolkit. There are also indications that the upcoming release of Reporting Services for SQL Server 2016 will include support for mobile. That will benefit everyone.



Tuesday, November 17, 2015

Connect PolyBase to your Hive ORC Table

Using PolyBase to connect to a plain text Hive table (file) is no different from connecting to any other file in Hadoop. (See: Connect PolyBase to your Hive database Table: SQL Server 2016) But the future of Hive is moving to the optimized row columnar (ORC) format. According to a posting on the Hortonworks site, both the compression and the performance of ORC files are vastly superior to plain text Hive tables and RCFile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity. (You can read the posting here: ORC File in HDP 2: Better Compression, Better Performance.)

Setting up ORC tables in PolyBase is a three-step process:

External Data Source - no sharing between file format types!
External File Format - specific for ORC
External Table - relies on the ORC file format

This follows the same approach used to connect to plain text files. But do we need to make changes to all three? To some extent, yes. External tables rely on the external file format, so we'll need to either create a new external table or modify an existing one. The external file format is where we specify that the source table is an ORC table. And finally, an external data source can only support one format type, so you will need different external data sources for your plain text file formats and your ORC file formats - even if they point to the same Hadoop cluster. On MSDN - CREATE EXTERNAL TABLE (Transact-SQL), near the bottom in the examples section, is a note about data sources. Specifically, it states:

Note: All data sources must have the same format type. You cannot have some data
sources with text-delimited file formats and some with RCFILE formats.

ORC Specific External Data Source


For our test, we created a specific external data source just for ORC Hive tables.

CREATE EXTERNAL DATA SOURCE hdp23_orc with 
 (
  TYPE = HADOOP,
  LOCATION ='hdfs://192.168.1.120:8020',
  RESOURCE_MANAGER_LOCATION='192.168.1.120:8050'
 ); 

ORC Specific External File Format


Next, we created three external file formats just for our ORC tables. Notice that unlike the DELIMITEDTEXT external file formats, there is no need for a field terminator, string delimiter, or date format. Besides the FORMAT_TYPE option, the only other option for the ORC format type is data compression. Here, you can either ask your Hadoop administrator or experiment to see which one works. We created all three external file formats: one for each of the two explicit data compression options, and a third where we omitted the data compression completely. In our Hortonworks cluster, both the SnappyCodec compression format and the format with no declared compression worked. Here they are:

  CREATE EXTERNAL FILE FORMAT ORCdefault
  WITH (
        FORMAT_TYPE = ORC
       ,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
       );
  go

  CREATE EXTERNAL FILE FORMAT ORCsnappy
  WITH (
        FORMAT_TYPE = ORC
       ,DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
       );
  go

  CREATE EXTERNAL FILE FORMAT ORCnocompress
  WITH (
        FORMAT_TYPE = ORC
       );
 go


ORC Specific External Table


With the ORC-specific data source set up and the ORC-specific file formats ready to use, all we need is to set up the table. We found that PolyBase wants a strongly typed ORC Hive table, so here you will need to make sure your SQL data types match those that are present in Hadoop. (See PolyBase wants a strongly typed ORC Hive Table)


   CREATE EXTERNAL TABLE [dbo].sample_07c (
     codeid    nvarchar(200) NULL,
     descrip   nvarchar(200) NULL,
     total_emp int NULL,
     salary    int NULL
   )
   WITH (
         LOCATION    = '/apps/hive/warehouse/sample_07c/000000_0',
         DATA_SOURCE = hdp23_orc,
         FILE_FORMAT = ORCsnappy
        );



See also:



PolyBase wants a strongly typed ORC Hive Table

It appears that PolyBase wants a strongly typed external table definition to make a connection to a Hive ORC table in Hortonworks.  For new tables, or new Hadoop connections, I initially declare all of the columns as varchar or nvarchar. For this test, my ORC table had four columns, the first two were string, and the others were int.




So, starting off, all four columns were declared as nvarchar. But this was a Hive ORC table, and it did not work. The returned message was:

     Msg 106000, Level 16, State 1, Line 75
     org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
     OLE DB provider "SQLNCLI11" for linked server "(null)" returned message
     "Unspecified error".

     Msg 7421, Level 16, State 2, Line 76
     Cannot fetch the rowset from OLE DB provider "SQLNCLI11" for linked server " (null)"..


Looking closely, the key phrase in the error message was "cannot be cast." Once the two int columns were properly declared as int, the CREATE TABLE command ran successfully, and the table could be queried. The successful DDL:

   CREATE EXTERNAL TABLE [dbo].sample_07c (
     codeid    nvarchar(200) NULL,
     descrip   nvarchar(200) NULL,
     total_emp int NULL,
     salary    int NULL
   )
   WITH (
         LOCATION    = '/apps/hive/warehouse/sample_07c/000000_0',
         DATA_SOURCE = hdp23_orc,
         FILE_FORMAT = ORCsnappy
        );
   go


Also note that I had a dedicated data source for the ORC file type. According to MSDN, all data sources used by an external table must have the same format type. So you'll need one data source for your DELIMITEDTEXT format types, and another for your ORC format types. The exact warning in MSDN is:

Note: All data sources must have the same format type.
You cannot have some data sources with text-delimited file formats
and some with RCFILE formats.

            See:  MSDN: CREATE EXTERNAL TABLE (Transact-SQL)



Monday, November 16, 2015

Add GUI to Hortonworks Sandbox

If you are not familiar with Linux, the minimal install that comes with the Hortonworks Sandbox can prove difficult.  Fortunately, several bloggers have posted details on how to add a user interface to the Hortonworks Sandbox - and for that matter, any other CentOS minimal install that you may end up using.

Once you have made the changes, Hortonworks will display lots of logins, and none of them will seem to work - try admin with the default admin password: it does not work. Go to the bottom of the list and log in as root. Then create a new user from the menu SYSTEM | ADMINISTRATION | USERS AND GROUPS. I also had to set up a network connection.

You can find the details here:


Thursday, November 12, 2015

Yes! - Microsoft Finally has a Reporting Strategy

On October 29, 2015, the SQL Server Blog team released Microsoft Business Intelligence - our reporting roadmap. It's a good read, and outlines how Microsoft plans to pull together the latest update to Reporting Services (SSRS) with Power BI, the push into mobile reporting, and the ever-present Excel.

If you have used Reporting Services, it has been a capable, if minimalist, reporting server. In the past, perhaps its biggest problem was the inability to support any browser other than Internet Explorer. Fortunately, the upcoming update to SSRS includes support for HTML5, with the hope that this will cover all modern browsers. There is also a suggestion that mobile will be supported as part of SSRS.

For those of you who have started to use the updated Power BI toolkit (see my Power BI 2.0 First Look at Redmond Magazine), perhaps the biggest change was the focus on presentation. Looking back at all of the Microsoft BI tools, Power BI 2.0 is the first to really focus on presentation and usability. And Microsoft really got this part right. But what was not mentioned is that Power BI already has its own mobile apps. Will these be one and the same as the ones for SSRS, or different? [Answer: as of December 2015, SSRS 2016 will use the new Power BI mobile apps - see Use Power BI mobile apps to view SSRS 2016 reports.]

And, for all of the strategy, the one part that is really missing is the core, back-end server piece that will pull it all together. We still have SSRS report formats that are different from the Power BI report formats, with no indication that the two will be merged (1). And where is SharePoint in all this? This is a great weakness. There are indications that Power BI reports will be supported on an internal server, but that is in the future. And it may, or may not, include Pyramid Analytics (see my blog posts on Power BI).

(1) 2016 update: postings by both the Power BI team and the SSRS team have indicated that, eventually, SSRS will support 'managed' Power BI Desktop reports. That is great news - but what are managed Power BI Desktop reports?
(2) 2017 update: Microsoft announced Power BI Report Server. For now, you'll need some cash to get the integration of Power BI and SSRS.

Still, it is great to see that Microsoft has finally decided to get back into the Business Intelligence game, and in a big way. Now, if only they would show that they are not abandoning multidimensional SSAS. Tabular is nice for some uses, but not all of them.

Power BI - a History

Tim Rodman has a historical listing of Power BI. You can find it here:
What is Power BI? - Tim Rodman

Query Optimization - Limit Tables in Join

Years ago, I was called in to stabilize the SQL Server databases for a large internet platform. Among the problems discovered was that there was no real concern for limiting how many tables should be included in a query. It was just assumed that the query optimizer would automagically work, whether for one table, two tables, or dozens of tables.

Digging through the internal statistics and various DMVs, a few queries were found to be causing most of the problems. One query in particular had 192 tables in its join. Given that a join with 10 tables can have over one million possible join orders, the number for 192 tables is astronomical. On his blog, Benjamin Nevarez has written about how the number of possible join orders grows as the number of tables in the join increases. You can find it here: Optimizing Join Orders.

If you just want a quick overview, here is a reprint of the table posted in the article by Benjamin Nevarez:

Tables    Left-Deep Trees    Bushy Trees
1         1                  1
2         2                  2
3         6                  12
4         24                 120
5         120                1,680
6         720                30,240
7         5,040              665,280
8         40,320             17,297,280
9         362,880            518,918,400
10        3,628,800          17,643,225,600
11        39,916,800         670,442,572,800
12        479,001,600        28,158,588,057,600
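The growth pattern behind these figures: the left-deep count for n tables is simply n!, and each bushy count is the previous one multiplied by 2(2n - 3). A small recursive T-SQL CTE (a sketch; the column names are mine) reproduces the table:

```sql
-- Reproduce the join-order counts: left-deep trees = n!,
-- and bushy(n) = bushy(n-1) * 2 * (2n - 3), with both counts = 1 for n = 1.
WITH join_orders AS (
    SELECT 1                 AS tables,
           CAST(1 AS bigint) AS left_deep_trees,
           CAST(1 AS bigint) AS bushy_trees
    UNION ALL
    SELECT tables + 1,
           left_deep_trees * (tables + 1),             -- n! grows by a factor of n
           bushy_trees * 2 * (2 * (tables + 1) - 3)    -- multiply by 2(2n - 3)
    FROM join_orders
    WHERE tables < 12
)
SELECT tables, left_deep_trees, bushy_trees
FROM join_orders;
-- The final row is 12, 479001600, 28158588057600 - matching the table above.
```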

Tuesday, November 10, 2015

SQL Server 2016 First Look - Review

The upcoming release of SQL Server 2016 promises to have some big enhancements, based on the various prereleases. PolyBase is just one of the major enhancements.

You can find my published review covering SQL Server 2016 community technology preview (CTP) at Redmond Magazine. Find it here: SQL Server 2016 Preview Reveals Path to Azure and Big Data.

Another exciting addition in SQL Server 2016 is the Query Store. Here, Microsoft stayed with a simple and meaningful name: the Query Store literally stores past queries, along with performance statistics that allow DBAs to monitor and manage how those queries should run in the future. Redgate's Simple Talk web site has started a series of posts that cover the Query Store in depth. You can find it here: The SQL Server 2016 Query Store: Overview and Architecture.
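As a quick taste, here is a sketch of working with the Query Store; the database name is a placeholder, and the aggregation simply surfaces the queries with the highest total duration from the Query Store catalog views:

```sql
-- Enable the Query Store on a database (SQL Server 2016 and later).
ALTER DATABASE MyDatabase SET QUERY_STORE = ON;

-- Ask the Query Store which captured queries have consumed the most time.
-- avg_duration is reported in microseconds.
SELECT TOP (10)
       q.query_id,
       qt.query_sql_text,
       SUM(rs.count_executions)                   AS executions,
       SUM(rs.avg_duration * rs.count_executions) AS total_duration_us
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt
     ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
     ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
     ON rs.plan_id = p.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_duration_us DESC;
```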