Import complex data structure in Hive with custom separators

andy_64bit

I have a huge dataset with the following structure

fieldA,fieldB,fieldC;fieldD|fieldE,FieldF;fieldG|fieldH,FieldI ...

where:

fieldA, fieldB and fieldC are strings that should be imported into separate columns

fieldD|fieldE,FieldF;fieldG|fieldH,FieldI is an array (elements separated by semicolons) of maps (key separated from value by |) of arrays (elements separated by commas, e.g. fieldE,FieldF)

My problem is that the initial array is also separated from fieldA,fieldB,fieldC by a semicolon. My question is: how do I set the separators correctly when I create the table?

This attempt does not recognize the array, although I provide a semicolon as the field separator:

CREATE TABLE string_array(
    first_part STRING            -- stores fieldA,fieldB,fieldC
   ,second_part ARRAY<STRING>    -- stores fieldD|fieldE,FieldF;fieldG|fieldH,FieldI, split by semicolon
       )
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u003b'
    COLLECTION ITEMS TERMINATED BY '\u003b'
    MAP KEYS TERMINATED BY '|'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '...' INTO TABLE string_array;

Any ideas on how to make this work so I can build upon it? Thanks a lot in advance!

nochum

Great question.

I think that we can break this problem up into two discrete pieces: (1) Hive table structure, and (2) data delimiters.

Let's start by looking at the Hive table structure. If I understood your data structure correctly (please correct me if I didn't), the table structure that best describes your data can be represented as:

CREATE TABLE string_array
AS
SELECT 'fieldA,fieldB,fieldC' AS first_part, array(map('fieldD', array('fieldE', 'FieldF')), map('fieldG', array('fieldH','FieldI'))) AS second_part;

Note that the field second_part is an array of maps, where the key to each map references an array of strings. In other words, the field second_part consists of an array within a map within an array.

If I use the statement above to create a table, I can then copy the resulting table to the local filesystem and look at how Hive assigns default delimiters to it. I know that you don't want to use the default delimiters, but please bear with me here. A hex dump of the table's serialized on-disk representation looks like this:

00000000  66 69 65 6c 64 41 2c 66  69 65 6c 64 42 2c 66 69  |fieldA,fieldB,fi|
00000010  65 6c 64 43 01 66 69 65  6c 64 44 04 66 69 65 6c  |eldC.fieldD.fiel|
00000020  64 45 05 46 69 65 6c 64  46 02 66 69 65 6c 64 47  |dE.FieldF.fieldG|
00000030  04 66 69 65 6c 64 48 05  46 69 65 6c 64 49 0a     |.fieldH.FieldI.|

Looking at this dump, we note that Hive actually uses five levels of delimiters:

delimiter 1 = x'01' (between fieldC & fieldD) -- between first_part and second_part
delimiter 2 = x'02' (between fieldF & fieldG) -- between the two maps in the array of maps
delimiter 3 = x'03' not used
delimiter 4 = x'04' (between fieldD & fieldE) -- between the key and the array of fields within the map
delimiter 5 = x'05' (between fieldE & fieldF) -- between the fields within the array within the map

And herein lies your problem. Current versions of Hive (as of 0.11.0) only allow you to override three levels of delimiters. But due to the nesting in your data, Hive needs more than three levels.

My suggestion would be to pre-process your data to use Hive's default delimiters. With this approach you should be able to load your data into Hive and reference it.
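As a sketch of that pre-processing step, the snippet below converts one record from your custom format into Hive's default delimiters (x'01' between fields, x'02' between array elements, x'04' between a map key and its value, x'05' within the inner array), matching the hex dump above. It assumes one record per line and that the first semicolon separates the plain fields from the array; the function name to_hive_default is just for illustration.

```python
def to_hive_default(line):
    # The first ';' splits the plain fields from the array of maps.
    first_part, _, rest = line.partition(';')
    maps = []
    for element in rest.split(';'):              # ';' separates array elements (maps)
        key, _, values = element.partition('|')  # '|' separates the map key from its value
        # ',' separates the inner array's elements -> \x05
        maps.append(key + '\x04' + values.replace(',', '\x05'))
    # \x01 between top-level fields, \x02 between array elements
    return first_part + '\x01' + '\x02'.join(maps)

record = 'fieldA,fieldB,fieldC;fieldD|fieldE,FieldF;fieldG|fieldH,FieldI'
print(repr(to_hive_default(record)))
# -> 'fieldA,fieldB,fieldC\x01fieldD\x04fieldE\x05FieldF\x02fieldG\x04fieldH\x05FieldI'
```

After running your file through something like this, the CREATE TABLE needs no ROW FORMAT clause at all, since the defaults now match the data, and LOAD DATA should populate both columns correctly.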
