MapReduce filtering to get customers not in order list?
I'm currently learning MapReduce and trying to figure out how to code this in Java.
There are two input files, named customers.txt and car_orders.txt:
customers.txt
===================
12345 Peter
12346 Johnson
12347 Emily
12348 Brad
[custNum, custName]
car_orders.txt
===================
00034 12345 23413
00035 12345 94832
00036 12346 8532
00037 12348 9483
[orderNo, custNum, carValue]
The idea is to apply MapReduce and output the customers who have not placed any car order. In the scenario above, that is Emily.
Output:
===================
12347 Emily
This is what I have in mind:
Map phase:
1. Read the data inside customers.txt, get key-value pair, (custNum, custName)
2. Read the data inside car_orders.txt, get key-value pair, (custNum, [orderNo, carValue])
3. Partition into groups based on the key
Reduce phase:
1. Compare key-value A and key-value B, if key-value B is NULL
2. Output key-value A
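Any help with this application, in the form of pseudocode, would be much appreciated.
It is basically a reduce-side join in which you discard every key where both sides are populated, which is the same thing you describe in your pseudocode.
The code in Hadoop MapReduce would look like this: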
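import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class TextMap extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] a = value.toString().split(" "); // assuming space separation
        if (a.length == 2) {
            // customers.txt line: emit (custNum, custName)
            context.write(new Text(a[0]), new Text(a[1]));
        } else if (a.length == 3) {
            // car_orders.txt line: emit (custNum, carValue)
            context.write(new Text(a[1]), new Text(a[2]));
        }
    }
}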
That would emit:
12345 Peter
12346 Johnson
12347 Emily
12348 Brad
12345 23413
12345 94832
12346 8532
12348 9483
So the reducer looks fairly simple:
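import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class TextReduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> vals = new ArrayList<>();
        for (Text t : values) {
            vals.add(t.toString());
        }
        // A single value means only the customer record arrived, i.e. no orders
        // (assumes every custNum in car_orders.txt also appears in customers.txt).
        if (vals.size() == 1) {
            context.write(key, new Text(vals.get(0)));
        }
    }
}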
That should emit only Emily.
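To check the logic without a Hadoop cluster, the same map, group-by-key, and reduce steps can be simulated in plain Java. This is only a rough sketch: the class name AntiJoinSketch and the method customersWithoutOrders are made up for illustration, a TreeMap stands in for the shuffle phase, and it relies on the same assumption as the reducer above, namely that every custNum in car_orders.txt also appears in customers.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of Hadoop: simulates the job in memory.
class AntiJoinSketch {

    // Map step: mirrors TextMap by adding (custNum, payload) to the grouped map.
    static void map(String line, Map<String, List<String>> groups) {
        String[] a = line.trim().split(" "); // assuming space separation
        if (a.length == 2) {
            // customers.txt line: (custNum, custName)
            groups.computeIfAbsent(a[0], k -> new ArrayList<>()).add(a[1]);
        } else if (a.length == 3) {
            // car_orders.txt line: (custNum, carValue)
            groups.computeIfAbsent(a[1], k -> new ArrayList<>()).add(a[2]);
        }
    }

    // Reduce step: keep only keys with exactly one value (customer record, no orders).
    static List<String> customersWithoutOrders(List<String> customers, List<String> orders) {
        Map<String, List<String>> groups = new TreeMap<>(); // stands in for shuffle/sort
        for (String line : customers) {
            map(line, groups);
        }
        for (String line : orders) {
            map(line, groups);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            if (e.getValue().size() == 1) {
                out.add(e.getKey() + " " + e.getValue().get(0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = List.of(
                "12345 Peter", "12346 Johnson", "12347 Emily", "12348 Brad");
        List<String> orders = List.of(
                "00034 12345 23413", "00035 12345 94832",
                "00036 12346 8532", "00037 12348 9483");
        // Prints: 12347 Emily
        customersWithoutOrders(customers, orders).forEach(System.out::println);
    }
}
```

If the orders file could contain a custNum with no matching customer record, counting values would no longer be enough. In that case the mapper would need to tag each value with its source file so the reducer can tell the two sides apart.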