Reading XML using PIG
I am trying to read data from an XML file using PIG, but the output I get is incomplete.
Input file:
<document>
<url>htp://www.abc.com/</url>
<category>Sports</category>
<usercount>120</usercount>
<reviews>
<review>good site</review>
<review>This is Avg site</review>
<review>Bad site</review>
</reviews>
</document>
The code I am using is:
register 'Desktop/piggybank-0.11.0.jar';
A = load 'input3' using org.apache.pig.piggybank.storage.XMLLoader('document') as (data:chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(data,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?<reviews>.*?<review>\s*([^>]*?)\s*</review>.*?</reviews>.*?</document>')) as (url:chararray,category:chararray,usercount:int,review:chararray);
The output I get is:
(htp://www.abc.com/,Sports,120,good site)
This output is incomplete. Can someone please help me figure out what I am missing?
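Hehe!! Finally got it working using cross. I am using XPath here; you can use regex instead if you prefer. I find XPath simpler and cleaner than regex, and I think you will see that too. Don't forget to replace testXML.xml with your own XML file.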
XPath way:
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'document/url'), XPath(x, 'document/category'), XPath(x, 'document/usercount');
C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
D = FOREACH C GENERATE XPath(review,'review');
E = cross B,D;
dump E;
Regex way:
A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?</document>')) as (url:chararray,category:chararray,usercount:int);
C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
D = FOREACH C GENERATE FLATTEN(REGEX_EXTRACT_ALL(review,'<review>([^>]*?)</review>'));
E = cross B,D;
dump E;
Output:
(htp://www.abc.com/,Sports,120,Bad site)
(htp://www.abc.com/,Sports,120,This is Avg site)
(htp://www.abc.com/,Sports,120,good site)
Isn't this what you were expecting? ;)
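For anyone wondering why the original script returned only one review: the pattern has a single capture group for `<review>`, and REGEX_EXTRACT_ALL returns one value per capture group, so only the first review can come back. Below is a rough sketch of that behaviour and of the cross fix. It uses Python's `re` module as a stand-in for the Java regex engine that Pig uses, on the assumption that the two behave the same for this pattern. The helper names (`single_match`, `cross_fields_with_reviews`) are made up for illustration and are not part of Pig or piggybank.

```python
import re
from itertools import product

# Sample input copied from the question.
XML = """<document>
<url>htp://www.abc.com/</url>
<category>Sports</category>
<usercount>120</usercount>
<reviews>
<review>good site</review>
<review>This is Avg site</review>
<review>Bad site</review>
</reviews>
</document>"""

# The question's pattern: four capture groups, so at most four values.
SINGLE_MATCH = re.compile(
    r"(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>"
    r".*?<usercount>([^>]*?)</usercount>.*?<reviews>.*?<review>\s*([^>]*?)\s*</review>"
    r".*?</reviews>.*?</document>"
)

# The answer's two patterns: document-level fields, and one review at a time.
FIELDS = re.compile(
    r"(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>"
    r".*?<usercount>([^>]*?)</usercount>.*?</document>"
)
REVIEW = re.compile(r"<review>([^>]*?)</review>")


def single_match(xml):
    """Mimic the question's approach: one match over the whole document."""
    m = SINGLE_MATCH.fullmatch(xml)
    return m.groups() if m else None


def cross_fields_with_reviews(xml):
    """Mimic the answer: extract fields and reviews separately, then cross them."""
    m = FIELDS.fullmatch(xml)
    if m is None:
        return []
    reviews = REVIEW.findall(xml)  # every <review>, not just the first
    return [fields + (review,) for fields, review in product([m.groups()], reviews)]


if __name__ == "__main__":
    print(single_match(XML))
    for row in cross_fields_with_reviews(XML):
        print(row)
```

The sketch returns the rows in document order; Pig does not guarantee the order of CROSS output, which is presumably why the dump above lists them differently. One caveat worth keeping in mind: CROSS pairs every row of B with every row of D, so with more than one `<document>` in the file each review would be attached to every document, not only its own.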