To set up a new External Table in PolyBase against a file in
Hadoop, you first need to collect some basic information about how the table will be mapped against the file. This is a logical mapping, much like a view. No changes are made to the file in Hadoop.
- Where is the file: What is the file name and path on the Hadoop server?
- Data Source: Will you create a new External Data Source, or use an existing one?
- File format: Will you create a new External File Format, or use an existing one?
- Table column definition: Determine the number of columns that define your file, including column names, data types, and data type size or length, all based on how you want to interpret the file. That is the magic of Hadoop: you can organize Hadoop files any way you want. For many files, though, perhaps most, the column breaks will follow a typical format such as CSV.
The formal syntax on MSDN for creating an External Table is as follows:
--Create a new external table
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name
    ( <column_definition> [ ,...n ] )
    WITH (
        LOCATION = 'folder_or_filepath',
        DATA_SOURCE = external_data_source_name,
        FILE_FORMAT = external_file_format_name
        [ , <reject_options> [ ,...n ] ]
    )
[;]

<reject_options> ::=
{
    | REJECT_TYPE = value | percentage
    | REJECT_VALUE = reject_value
    | REJECT_SAMPLE_VALUE = reject_sample_value
}
Locating the File in Hadoop
If you do not already know the folder location and name of the file, you have a few options. One is to ask the Hadoop administrator. Or, if you are using Hortonworks, you can connect to Hadoop using a browser and navigate to the File Browser section.
Using the File Browser in Hortonworks, once you locate your target file, File Browser will provide you with both the file name and the folder navigation path. In our example, the file NYSE_dividends_A.csv is located in the folder /user/hue. So, when we create the External Table, we'll set LOCATION='/user/hue/NYSE_dividends_A.csv'.
Update: if you plan on inserting or updating data in Hadoop from PolyBase, the LOCATION can only be a directory (folder). (See PolyBase - Insert data into a new Hadoop Directory.)
Best Practices - use a folder
Use folders! For production, create a new folder for each file type, then point your LOCATION= parameter at the folder rather than at a specific file. Why? 1) You can add more files to the directory, and your PolyBase External Table will automagically read them. 2) You can do INSERTs and UPDATEs from PolyBase back to your files in Hadoop. 3) It's cleaner. (See PolyBase - Insert data into a Hadoop Hue Directory.)
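As a rough illustration, a folder-based version of the table we create later in this post might look like the following. The folder /user/hue/dividends/ and the table name are hypothetical; the data source and file format are the ones used in the example further down.

--Point LOCATION at a folder so new files dropped into it are picked up automatically
CREATE EXTERNAL TABLE [dbo].[stock_dividends] (
    [sname] varchar(200) NULL,
    [CO] varchar(200) NULL,
    [DATE] varchar(200) NULL,
    [div] varchar(200) NULL
)
WITH (
    LOCATION = '/user/hue/dividends/',   -- folder, not a single file
    DATA_SOURCE = hdpDemo,
    FILE_FORMAT = CSVfile
);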
Data Source
The list of available data sources for your database is located under External Resources. We've created two - they are identical except for the name.
In an earlier post, we covered how to create an External Data Source.
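For reference, a Hadoop data source like the hdpDemo one used here is created along these lines. This is only a sketch: the hdfs address below is a placeholder, not the address of our cluster, and RESOURCE_MANAGER_LOCATION is optional.

CREATE EXTERNAL DATA SOURCE hdpDemo
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.0.0.10:8020'   -- placeholder name node address
    --, RESOURCE_MANAGER_LOCATION = '10.0.0.10:8050'  -- optional, enables predicate pushdown
);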
What happens if you reference an invalid data source? You get this message:
Msg 46501, Level 16, State 1, Line 82
External table references 'DATA_SOURCE' that does not exist.
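If you hit that error, a quick way to see which data source names actually exist in the current database is the catalog view:

SELECT name, type_desc, location FROM sys.external_data_sources;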
File Format and Table Layout
For both the file format and the table layout, you will need to assess the overall structure of your file in Hadoop, and how you want the data to be presented in a table format. For our example, we are using a fairly simple CSV file in Hadoop that has four distinct columns. The first two are string values, the third is a date column and the last is numeric.
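The CSVfile external file format referenced in the table definition below was created along these lines. This is only a sketch, assuming a plain comma-delimited file with no text qualifiers:

CREATE EXTERNAL FILE FORMAT CSVfile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        USE_TYPE_DEFAULT = TRUE
    )
);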
If you have approached these posts as a tutorial to install, set up, and configure PolyBase, then up to this point you don't know whether your PolyBase connection to Hadoop is properly configured. To minimize problems for this initial table test, we kept the file and the table configuration simple. That includes defining all four columns as long string values. Once we have confirmed that our PolyBase configuration is working, we can update the table column data types. For our simple CSV file, we will use the following syntax:
CREATE EXTERNAL TABLE [dbo].[stock] (
    [sname] varchar(200) NULL,
    [CO] varchar(200) NULL,
    [DATE] varchar(200) NULL,
    [div] varchar(200) NULL
)
WITH (
    LOCATION = '/user/hue/NYSE_dividends_A.csv',
    DATA_SOURCE = hdpDemo,
    FILE_FORMAT = CSVfile
);
(If you are trying to connect to a Hive table, please refer to: Connect PolyBase to your Hive Database.)
Final Test
With our table created, we can run a simple query in SSMS against our Hadoop table. To keep the record set small, we'll limit the result set to the top 100 records.
select top 100 sname, CO, DATE, div from [dbo].[stock];
If all goes well, we get back the first 100 rows of the dividend file, displayed in the SSMS results grid just like any other table.
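Once the all-varchar version is confirmed working, the columns can be tightened up. Here is a sketch of one way to do that, assuming the date and dividend values in the file parse cleanly (remember the table is only a logical mapping, so dropping it does not touch the file in Hadoop):

DROP EXTERNAL TABLE [dbo].[stock];
CREATE EXTERNAL TABLE [dbo].[stock] (
    [sname] varchar(200) NULL,
    [CO] varchar(200) NULL,
    [DATE] date NULL,
    [div] decimal(18,4) NULL
)
WITH (
    LOCATION = '/user/hue/NYSE_dividends_A.csv',
    DATA_SOURCE = hdpDemo,
    FILE_FORMAT = CSVfile
);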
So what happens if you reference an unknown file format? You
get this message:
Msg 46501, Level 16, State 2, Line 82
External table references 'FILE_FORMAT' that does not exist.
And what happens if you create an external table, but the declared column width is too small? For example, suppose we declare a column as varchar(2) and it is too small for the incoming data. PolyBase returns the following message:
Msg 107090, Level 16, State 1, Line 79
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
OLE DB provider "SQLNCLI11" for linked server "(null)" returned message "Unspecified error".
Msg 7421, Level 16, State 2, Line 79
Cannot fetch the rowset from OLE DB provider "SQLNCLI11" for linked server "(null)".
And what happens if you try to create an External Table when the External Data Source is not valid? To test this, we created a bogus data source with an invalid IP address in its LOCATION option. First off, we were able to create the bogus External Data Source with no error. But when you try to create a table using the invalid External Data Source, you'll get this message:
Msg 105019, Level 16, State 1, Line 82
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [Call From WIN-RC147EPV3T3/192.168.1.138 to 192.168.1.113:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.1.113/192.168.1.113:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout] occurred while accessing external file.'
Verify the table connection values using SSMS Table Properties:
Simply highlight the external table in SSMS, select Properties, and then click on External Table. The popup will display the Location, File Format, and Data Source values that were provided when the table was created.
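The same values can also be pulled with a query against the catalog views, which is handy when you have more than a couple of external tables:

SELECT t.name AS table_name,
       t.location,
       ds.name AS data_source,
       ff.name AS file_format
FROM sys.external_tables t
JOIN sys.external_data_sources ds ON t.data_source_id = ds.data_source_id
JOIN sys.external_file_formats ff ON t.file_format_id = ff.file_format_id;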
4 comments:
Hi,
I am trying to create an external table to a file on Azure Blob Storage. And I'm getting the following error:
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist: Error [com.microsoft.windowsazure.services.core.storage.StorageException: The value for one of the HTTP headers is not in the correct format.] occurred while accessing external file.'
However, creating an external table on another Azure Blob Storage file is successful.
This makes me think that there is a version difference between these two Azure Blob containers?? Perhaps I need to change the Azure tools I have installed to something older/newer?
Any suggestions on how to resolve this?
Hi Mehmet,
It's great that you are working on connections to Azure Blobs. I've only focused on making connections to Hadoop, so I have not seen that error. One of my big issues with PolyBase is the lack of rich error messages.
If you can, move a copy of the file that works on the first Azure system to the one that is not working. This will help eliminate one possible problem.
It might also be that your Location address has a hidden character that is causing a problem.
If you do get this resolved, please do add what steps you followed.
I am trying to enable the force pushdown option in my PolyBase setup. Without the option the query executes successfully, but when I use the force pushdown hint I get the error below:
Cannot execute the query "Remote Query" against OLE DB provider "SQLNCLI11" for linked server "(null)". Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.
Even after defining the classpath values in yarn-site.xml and mapred-site.xml as described for Cloudera Hadoop, I am getting the same error.
Hi, I am facing the issue below while doing selects against ORC file formats in Hadoop (Hortonworks). It works fine for CSV, but when I create an external table with ORC and do a select, I get this error. Any suggestions on how to resolve this?
Msg 7320, Level 16, State 110, Line 103
Cannot execute the query "Remote Query" against OLE DB provider "SQLNCLI11" for linked server "(null)". 110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsAccessException, Message: Java exception raised on call to HdfsBridge_CreateRecordReader: Error [HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader.] occurred while accessing external file [/apps-dev/hive/warehouse/dev_attunity.db/agt0180_lkp_compagency_master_COPY2/delta_1713242_1713242_0000/bucket_00009][0].
PolyBase DMS LOGS:
[2019-08-23 01:38:36,414] [WARN ] [Thread-1766] [com.microsoft.polybase.client.HdfsBridge] HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader.InputFormat: com.microsoft.polybase.storage.input.ORCFileToPaxBlockInputFormat, OutputBufferSize: null
[2019-08-23 01:38:36,414] [WARN ] [Thread-1766] [com.microsoft.polybase.client.HdfsBridgeException] HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader.Stack trace: java.lang.ArrayIndexOutOfBoundsException