How can I split words by the second tab for MapReduce?
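I'm running MapReduce on some web data. (I'm new to MapReduce, so think classic WordCount-type stuff.) The input file looks like this, with each number followed by a tab:
3 2 2 4 2 2 2 3 3
While I understand how to get the classic 'word count' numbers, what I really want to do is evaluate the numbers in pairs, so the mapper would read the line above as "3 2", "2 2", "2 4", "2 2", and so on. How would I go about this? I suppose all that's needed is to adjust StringTokenizer to split words on every second tab or something similar, but how would I do that? Is it even possible?
Here is the Java code I'm working with, which so far is just the classic WordCount example in MapReduce: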
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
You can easily modify WordCount to get the behavior you expect.
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text pair = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    // Split the whole line on tabs. StringTokenizer already breaks on tabs
    // and returns single numbers, so there would be nothing left to pair up.
    String[] numbers = value.toString().trim().split("\t");
    // you need at least two numbers to make one pair
    if (numbers.length >= 2) {
      int first = Integer.parseInt(numbers[0]);
      for (int i = 1; i < numbers.length; ++i) {
        int second = Integer.parseInt(numbers[i]);
        pair.set(first + "\t" + second);
        context.write(pair, one);
        // your second will be the first in the next loop iteration
        first = second;
      }
    }
  }
}
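The pairing loop above can be tried without a Hadoop cluster. The following is only a sketch of the same sliding-window idea in plain Java; the class name `OverlappingPairs` and the helper `overlappingPairs` are made up for illustration and are not part of the answer or of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

class OverlappingPairs {
    // Hypothetical helper: returns every adjacent pair of a tab-separated line,
    // so each number (except the first and last) appears in two pairs.
    static List<String> overlappingPairs(String line) {
        List<String> pairs = new ArrayList<>();
        String[] numbers = line.trim().split("\t");
        for (int i = 1; i < numbers.length; i++) {
            pairs.add(numbers[i - 1] + "\t" + numbers[i]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The sample line from the question, with explicit tabs.
        for (String pair : overlappingPairs("3\t2\t2\t4\t2\t2\t2\t3\t3")) {
            System.out.println(pair);
        }
    }
}
```

Inside a mapper, each returned string would be written as the key with a count of one, as in the code above.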
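Try this:
String data = "0\t0\t1\t2\t4\t5\t3\t4\t6\t7";
// Backslashes in \G and \w must be doubled inside a Java string literal.
String[] array = data.split("(?<=\\G\\w{1,3}\t\\w{1,3})\t");
for (String s : array) {
  System.out.println(s);
}
where {1,3} is the allowed range for the number of digits in each number (here, one to three).
Output: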
0 0
1 2
4 5
3 4
6 7
For your code:
String[] pairsArray = value.toString().split("(?<=\\G\\w{1,3}\t\\w{1,3})\t");
for (String pair : pairsArray) {
  context.write(new Text(pair), one);
}
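Note that this split produces disjoint pairs (numbers 1 and 2, then 3 and 4, and so on), not overlapping ones. As a rough, regex-free sketch of the same disjoint pairing, the loop below steps through the numbers two at a time, which also removes the limit on digit count. The names `DisjointPairs` and `disjointPairs` are invented for this example, and dropping a leftover number on an odd-length line is a choice made here, not something the answer specifies.

```java
import java.util.ArrayList;
import java.util.List;

class DisjointPairs {
    // Hypothetical helper: groups a tab-separated line into non-overlapping pairs.
    static List<String> disjointPairs(String line) {
        List<String> pairs = new ArrayList<>();
        String[] numbers = line.trim().split("\t");
        // Step by two; a trailing unpaired number is skipped.
        for (int i = 0; i + 1 < numbers.length; i += 2) {
            pairs.add(numbers[i] + "\t" + numbers[i + 1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : disjointPairs("0\t0\t1\t2\t4\t5\t3\t4\t6\t7")) {
            System.out.println(pair);
        }
    }
}
```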
Thanks everyone for the help! This ended up being the solution I came up with (after adding some leading zeros to help with formatting):
public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      String data = value.toString();
      // Every number is zero-padded to two digits and followed by a tab,
      // so each field is 3 characters wide and a pair spans 5 characters.
      for (int i = 0; i < (data.length() / 3) - 1; i++) {
        String pair = data.substring(i*3, (i*3)+5);
        context.write(new Text(pair), one);
      }
    }
  }
  // rest of the class not shown
}
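The fixed-width trick relies on every number having the same width and on each line ending with a tab. A hedged sketch of the same idea with the width made explicit is shown below; `FixedWidthPairs`, `pad`, and `fixedWidthPairs` are hypothetical names, and the zero-padding step is only a guess at how the leading zeros might have been added.

```java
import java.util.ArrayList;
import java.util.List;

class FixedWidthPairs {
    // Hypothetical helper: zero-pads every number in a tab-separated line.
    static String pad(String line, int width) {
        String[] numbers = line.trim().split("\t");
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = String.format("%0" + width + "d", Integer.parseInt(numbers[i]));
        }
        return String.join("\t", numbers);
    }

    // Slides a window of two fixed-width numbers along the padded line.
    static List<String> fixedWidthPairs(String data, int width) {
        List<String> pairs = new ArrayList<>();
        int step = width + 1;          // one number plus its tab
        int pairLength = 2 * width + 1; // two numbers and the tab between them
        // The bound works whether or not the line ends with a tab.
        for (int start = 0; start + pairLength <= data.length(); start += step) {
            pairs.add(data.substring(start, start + pairLength));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String padded = pad("3\t2\t2\t4", 2);
        for (String pair : fixedWidthPairs(padded, 2)) {
            System.out.println(pair);
        }
    }
}
```

Checking `start + pairLength <= data.length()` rather than dividing the length by three avoids losing the last pair when a line has no trailing tab.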