Convert JSON data into a specific table format using Pig
I have a JSON file in the following format:
"Properties2":[{"K":"A","T":"String","V":"M "}, {"K":"B","T":"String","V":"N"}, {"K":"D","T":"String","V":"O"}]
"Properties2":[{"K":"A","T":"String","V":"W"},{"K":"B","T":"String","V":"X"},{"K":"C","T":"String","V":"Y"},{"K":"D","T":"String","V":"Z"}]
I want to use Pig to extract the data from the JSON above into a table format.
Expected format:
Note: in the first record, column C should be null or empty, because that record has no value for C.
I tried JsonLoader and the elephant-bird jar but did not get the expected output. Please suggest a correct way to get it.
Can you try this custom UDF?
Sample input 1:
input.json
{"Properties2":[{"K":"A","T":"String","V":"M "}, {"K":"B","T":"String","V":"N"}, {"K":"D","T":"String","V":"O"}]}
{"Properties2":[{"K":"A","T":"String","V":"W"},{"K":"B","T":"String","V":"X"},{"K":"C","T":"String","V":"Y"},{"K":"D","T":"String","V":"Z"}]}
Pig script:
REGISTER jsonparse.jar;
A = LOAD 'input.json' USING JsonLoader('Properties2:{(K:chararray,T:chararray,V:chararray)}');
B = FOREACH A GENERATE FLATTEN(STRSPLIT(mypackage.JSONPARSE(BagToString(Properties2)),'_',4));
STORE B INTO 'output' USING PigStorage();
Output:
M N O
W X Y Z
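To see why a missing C column still leaves an empty slot in the output, it can help to trace the first record by hand. The sketch below is plain Java with no Pig dependency. It assumes that BagToString joins the bag's fields with its default "_" delimiter and that STRSPLIT behaves like Java's String.split with a limit; the class name FlowTrace and the trimmed value "M" are made up for illustration.

```java
import java.util.Arrays;

class FlowTrace {
    // Mimics the final STRSPLIT(..., '_', 4) step on the string the UDF returns.
    static String[] splitColumns(String udfOutput) {
        // A positive limit keeps empty fields, so a missing column keeps its position.
        return udfOutput.split("_", 4);
    }

    public static void main(String[] args) {
        // What BagToString is assumed to produce for record 1 (key, type, value triples).
        String flattened = "A_String_M_B_String_N_D_String_O";
        // What the UDF returns for that string: the values for A, B, C, D in order.
        String udfOutput = "M_N__O";
        System.out.println(flattened);
        System.out.println(Arrays.toString(splitColumns(udfOutput)));
    }
}
```

The third element of the resulting array is the empty string, which is what PigStorage then writes as an empty C column.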
Sample input 2:
{"Properties2":[{"K":"A","T":"String","V":"W"},{"K":"B","T":"String","V":"X"},{"K":"C","T":"String","V":"Y"},{"K":"D","T":"String","V":"Z"}]}
{"Properties2":[{"K":"A","T":"String","V":"M"},{"K":"B","T":"String","V":"N"},{"K":"D","T":"String","V":"O"}]}
{"Properties2":[{"K":"A","T":"String","V":"J"}]}
{"Properties2":[{"K":"B","T":"String","V":"X"}]}
{"Properties2":[{"K":"C","T":"String","V":"Y"}]}
{"Properties2":[{"K":"D","T":"String","V":"Z"}]}
Output 2:
W X Y Z
M N O
J
X
Y
Z
UDF code: the Java file below is compiled and packaged as jsonparse.jar.
(This is only throwaway Java code; you can optimize or modify it to suit your needs.)
JSONPARSE.java
package mypackage;

import java.io.IOException;
import java.util.LinkedHashMap;

import org.apache.commons.lang.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class JSONPARSE extends EvalFunc<String> {
    @Override
    public String exec(Tuple arg0) throws IOException {
        try {
            // Get the input: the bag flattened by BagToString, e.g. "A_String_M_B_String_N"
            String input = (String) arg0.get(0);
            // Split the input on the "_" delimiter
            String[] parts = input.split("_");
            // Init the map with keys A, B, C, D (in output order) and empty values
            LinkedHashMap<String, String> mymap = new LinkedHashMap<String, String>();
            mymap.put("A", "");
            mymap.put("B", "");
            mymap.put("C", "");
            mymap.put("D", "");
            // Each entry is a (K, T, V) triple: i points at the key, j at the value.
            // Stopping at j < parts.length skips a trailing triple whose value is missing.
            for (int i = 0, j = 2; j < parts.length; i = i + 3, j = j + 3) {
                // Find each key from the input and update the respective value
                if (mymap.containsKey(parts[i])) {
                    mymap.put(parts[i], parts[j]);
                }
            }
            // Build the final output, appending "_" after each value
            String output = "";
            for (String key : mymap.keySet()) {
                output = output + mymap.get(key) + "_";
            }
            // Remove the extra trailing "_" from the output
            return StringUtils.removeEnd(output, "_");
        } catch (Exception e) {
            throw new IOException("Caught exception while processing the input row ", e);
        }
    }
}
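Because the UDF needs the Pig and commons-lang jars just to compile, it can be handy to check the key-to-column logic on its own first. The following is a minimal sketch of the same idea in plain Java, not a replacement for the UDF: the class name PivotSketch, the method pivot, and the varargs column list are hypothetical, and it assumes the input is the "_"-joined (K, T, V) triples described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PivotSketch {
    // Turns "K_T_V_K_T_V..." into the values for the given column keys, joined by "_".
    static String pivot(String flattened, String... columns) {
        Map<String, String> row = new LinkedHashMap<>();
        for (String c : columns) {
            row.put(c, ""); // every column starts out empty
        }
        String[] parts = flattened.split("_");
        // Each entry occupies three slots: key, type, value.
        for (int i = 0; i + 2 < parts.length; i += 3) {
            if (row.containsKey(parts[i])) {
                row.put(parts[i], parts[i + 2]);
            }
        }
        // String.join adds no trailing delimiter, so nothing has to be trimmed afterwards.
        return String.join("_", row.values());
    }

    public static void main(String[] args) {
        System.out.println(pivot("A_String_M_B_String_N_D_String_O", "A", "B", "C", "D"));
        System.out.println(pivot("C_String_Y", "A", "B", "C", "D"));
    }
}
```

Passing the column keys as arguments is only one possible design; it avoids hard-coding A, B, C, D, but like the UDF it still breaks if a value itself contains "_".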
How to compile and build the jar file:
1. Download the two jar files (apache-commons-lang.jar, piggybank.jar) from the links below:
http://www.java2s.com/Code/Jar/a/Downloadapachecommonslangjar.htm
http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm
2. Add both jar files to your classpath:
>> export CLASSPATH=/tmp/piggybank.jar:/tmp/apache-commons-lang.jar
3. Create a directory named mypackage:
>> mkdir mypackage
4. Compile JSONPARSE.java (make sure both jars are on the classpath, otherwise compilation will fail):
>> javac JSONPARSE.java
5. Move the class file into the mypackage folder:
>> mv JSONPARSE.class mypackage/
6. Create the jar file jsonparse.jar:
>> jar -cvf jsonparse.jar mypackage/
7. jsonparse.jar is now created; include it in your Pig script with the REGISTER command.
Command-line example:
$ ls
JSONPARSE.java input.json
$ javac JSONPARSE.java
$ mkdir mypackage
$ mv JSONPARSE.class mypackage/
$ jar -cvf jsonparse.jar mypackage/
$ ls
JSONPARSE.java input.json jsonparse.jar mypackage