我在猪脚本中具有以下关系:
my_relation: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,timeseries,([value#50.0,timestamp#1388675231000]))
(++JRGOCZQD,timeseries,([value#50.0,timestamp#1388592317000],[value#25.0,timestamp#1388682237000]))
(++GCYI1OO4,timeseries,())
(++JYY0LOTU,timeseries,())
bytearray列中可以有任意数量的值/时间戳对(甚至为零)。
我想将此关系转换为此(每个entityId,attributeName,值,timestamp四重奏一行):
++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
++GCYI1OO4,timeseries,,
++JYY0LOTU,timeseries,,
另外,这也很好-我对没有值/时间戳的行不感兴趣
++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
有任何想法吗?基本上,我想规范化bytearray列中的映射元组,以使架构如下所示:
my_relation: {entityId: chararray,
attributeName: chararray,
value: float,
timestamp: int}
我是猪的初学者,所以很抱歉,如果这很明显!我需要UDF来做到这一点吗?
这个问题类似,但到目前为止没有答案:如何在Pig中将许多地图的元组拆分为不同的行
我正在运行Apache Pig版本0.12.0-cdh5.1.2
编辑-添加到目前为止我所做的详细信息。
这是一个猪脚本片段,其输出如下:
-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF, both java.
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);
c_no_flatten = foreach b generate
$0 as entityId,
$1 as attributeName,
TOBAG($2 ..);
c = foreach b generate
$0 as entityId,
$1 as attributeName,
FLATTEN(TOBAG($2 ..));
d = foreach c generate
entityId,
attributeName,
(float)$2#'value' as value,
(int)$2#'timestamp' as timestamp;
dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;
输出如下。请注意,在关系“ c”中,第二个值/时间戳对[value#52.0,timestamp#1388683516000]如何丢失。
(++JIYMIS2D,RechargeTimeSeries,([value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000]))
(++JRGOCZQD,RechargeTimeSeries,([value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries,())
a: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries)
b: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,{([value#50.0,timestamp#1388675231000])})
(++JRGOCZQD,RechargeTimeSeries,{([value#50.0,timestamp#1388592317000])})
(++GCYI1OO4,RechargeTimeSeries,{()})
c_no_flatten: {entityId: chararray,attributeName: chararray,{(bytearray)}}
(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries,)
c: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,50.0,1388675231000)
(++JRGOCZQD,RechargeTimeSeries,50.0,1388592317000)
(++GCYI1OO4,RechargeTimeSeries,,)
d: {entityId: chararray,attributeName: chararray,value: float,timestamp: int}
这应该可以解决问题。首先,展平图元组以摆脱封装元组:
b = foreach a generate entityId, attributeName, FLATTEN($2);
现在,我们可以将除前两个字段之外的所有内容转换为一个包。可以将袋子扁平化(请参阅http://pig.apache.org/docs/r0.12.0/basic.html#flatten)以获取每个值/时间戳对的行:
c = foreach b generate
$0 as entityId,
$1 as attributeName,
FLATTEN(TOBAG($2 ..));
最后,从地图中获取所需的值:
d = foreach c generate
entityId,
attributeName,
(float)$2#'value' as value,
(int)$2#'timestamp' as timestamp;
更新:一些其他方法可以从地图元组中制作出一袋地图:
foo()
答案中的Python UDF:Pig-如何在地图包上进行迭代本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句