Parsing complex nested JSON in Pig
I want to load the billionaires JSON dataset into Pig. The JSON file can be found here.
Each entry looks like this:
{
  "wealth": {
    "worth in billions": 1.2,
    "how": {
      "category": "Resource Related",
      "from emerging": true,
      "industry": "Mining and metals",
      "was political": false,
      "inherited": true,
      "was founder": true
    },
    "type": "privatized and resources"
  },
  "company": {
    "sector": "aluminum",
    "founded": 1993,
    "type": "privatization",
    "name": "Guangdong Dongyangguang Aluminum",
    "relationship": "owner"
  },
  "rank": 1372,
  "location": {
    "gdp": 0.0,
    "region": "East Asia",
    "citizenship": "China",
    "country code": "CHN"
  },
  "year": 2014,
  "demographics": {
    "gender": "male",
    "age": 50
  },
  "name": "Zhang Zhongneng"
}
Attempt 1
I tried to load this data in grunt with the following command:
billionaires = LOAD 'billionaires.json' USING JsonLoader('wealth:
(worth in billions:double, how: (category:chararray, from
emerging:chararray, industry:chararray, was political:chararray,
inherited:chararray, was founder:chararray), type:chararray), company:
(sector:chararray,founded:int,type:chararray,name:chararray,relationship:chararray),rank:int,location:(gdp:double,region:chararray,citizenship:chararray,country
code:chararray), year:int, demographics: (gender:chararray,age:int),
name:chararray');
However, this gave me the error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'in' expecting RIGHT_PAREN
Attempt 2
Next I tried the loader from Twitter's elephantbird project, com.twitter.elephantbird.pig.load.JsonLoader. The code for this UDF is here. This is what I did:
billionaires = LOAD 'billionaires.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
names = foreach billionaires generate json#'name' AS name;
dump names;
Now this runs and I get no errors! But nothing is displayed. I get output like this:
Input(s): Successfully read 0 records (1445335 bytes) from:
"hdfs://localhost:9000/user/purak/billionaires.json"
Output(s): Successfully stored 0 records in:
"hdfs://localhost:9000/tmp/temp-1399280624/tmp-477607570"
Counters: Total records written : 0 Total bytes written : 0 Spillable
Memory Manager spill count : 0 Total bags proactively spilled: 0 Total
records proactively spilled: 0
Job DAG: job_1478889184960_0005
What am I doing wrong?
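This may not be the best approach, but here is what I ended up doing:
- Remove the spaces from the field names: in the JSON dataset I replaced fields such as "worth in billions" and "from emerging" with "worth_in_billions", "from_emerging", and so on. (I did this with a simple find and replace.)
- Convert the comma-separated JSON to newline-delimited JSON: the JSON file I had was in the format
[{"_comment":"first entry" ...},{"_comment":"second entry" ...}]
but the JsonLoader in Pig treats each line as a new entry. To make the JSON file newline-delimited instead of comma-separated, I used jq, a command-line JSON processor. Install it with
sudo apt-get install jq
and run
cat billionaires.json | jq -c ".[]" > newBillionaires.json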
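For anyone without jq, the two preprocessing steps above can also be sketched in Python. This is only an illustration, not what I actually ran: the function names underscore_keys and array_to_ndjson are made up for this example, and it assumes the input file is a single top-level JSON array that fits in memory.

```python
import json


def underscore_keys(value):
    """Recursively replace spaces in dict keys with underscores."""
    if isinstance(value, dict):
        return {k.replace(" ", "_"): underscore_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [underscore_keys(v) for v in value]
    return value


def array_to_ndjson(src_path, dst_path):
    """Read one top-level JSON array and write one compact object per line.

    Assumes the whole file fits in memory. Returns the number of entries written.
    """
    with open(src_path, encoding="utf-8") as src:
        entries = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for entry in entries:
            line = json.dumps(underscore_keys(entry), separators=(",", ":"))
            dst.write(line + "\n")
    return len(entries)
```

Calling array_to_ndjson("billionaires.json", "newBillionaires.json") would produce the newline-delimited file. Unlike a plain find and replace, this only touches keys, so values that contain spaces (such as "Mining and metals") stay as they are. Python keeps the keys in their original order, so the field order in the output follows the input file.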
The newBillionaires.json file now has one entry per line. Load this file into Pig with the following commands:
copyFromLocal /home/purak/Desktop/newBillionaires.json /user/purak
billionaires = LOAD 'newBillionaires.json' USING
JsonLoader('name:chararray, demographics:
(age:int,gender:chararray),year:int,location:(country_code:chararray,citizenship:chararray,region:chararray,gdp:double),rank:int,company:
(relationship:chararray,name:chararray,type:chararray,founded:int,sector:chararray),
wealth:(type:chararray,how:(was_founder:chararray,inherited:chararray,was_political:chararray,industry:chararray,
from_emerging:chararray,category:chararray),worth_in_biilions:double)');
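Note: running the file through jq reversed the order of the fields in each entry. So in the load command above, all the fields are in the reverse order compared with the load command in the question.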
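Since the schema has to list the fields in the same order as they appear in the file, it may be less error-prone to derive the schema string from one sample line than to type it by hand. A rough Python sketch follows. The helper name pig_schema is hypothetical, it assumes the first entry is representative of all entries, and it maps booleans to chararray only because that is the choice made in the load command above. Arrays and nulls are not handled properly (they simply fall through to chararray).

```python
def pig_schema(record):
    """Build a JsonLoader-style schema string from one parsed JSON object.

    Fields are emitted in the order they appear in the record.
    """
    fields = []
    for key, value in record.items():
        if isinstance(value, dict):
            # Nested objects become Pig tuples: name:(inner fields)
            fields.append(f"{key}:({pig_schema(value)})")
        elif isinstance(value, bool):
            # bool must be checked before int, since bool is a subclass of int
            fields.append(f"{key}:chararray")
        elif isinstance(value, int):
            fields.append(f"{key}:int")
        elif isinstance(value, float):
            fields.append(f"{key}:double")
        else:
            # Strings and anything unhandled (null, arrays) default to chararray
            fields.append(f"{key}:chararray")
    return ",".join(fields)
```

To use it, parse the first line of newBillionaires.json with json.loads and pass the resulting dict to pig_schema; the returned string can then be pasted into the JsonLoader('...') call after checking it against the data.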
- You can now flatten each nested tuple using:
billionairesFinal = foreach billionaires generate name,
demographics.age as age, demographics.gender as gender, year,
location.country_code as countryCode, location.citizenship as
citizenship, location.region as region, location.gdp as gdp, rank,
company.relationship as companyRelationship, company.name as
companyName, company.type as companyType, company.founded as
companyFounded, company.sector as companySector, wealth.type as
wealthType, wealth.how.was_founder as wasFounder, wealth.how.inherited
as inherited, wealth.how.was_political as wasPolitical,
wealth.how.industry as industry, wealth.how.from_emerging as
fromEmerging, wealth.how.category as category,
wealth.worth_in_biilions as worthInBillions;
- Check the resulting structure once using
describe billionairesFinal;
which prints:
billionairesFinal: {name: chararray,age: int,gender: chararray,year:
int,countryCode: chararray,citizenship: chararray,region:
chararray,gdp: double,rank: int,companyRelationship:
chararray,companyName: chararray,companyType:
chararray,companyFounded: int,companySector: chararray,wealthType:
chararray,wasFounder: chararray,inherited: chararray,wasPolitical:
chararray,industry: chararray,fromEmerging: chararray,category:
chararray,worthInBillions: double}
This is the structure I wanted for the data in Pig! Now I can go on to analyze the dataset :)