Mapreduce: Average calculation is not working
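I am trying to use this MapReduce code to calculate an average, but for some reason the average is not computed correctly. The idea is to calculate the average movie rating for each year.
Mapper code: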
public class AverageRatingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable>
{
    //initialize the writable datatype variables
    private final static DoubleWritable tempWritable = new DoubleWritable(0);
    private Text ReleaseYear = new Text();

    //Override the original map method; map takes in three parameters
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        //creating an array called line of type String
        //Split the values in the fields and store each value in one array element
        String[] line = value.toString().split("\t");
        //create a variable Year of type String and store the 4th element (index 3) from array line. This contains the year in the data file
        String Year = line[3];
        //Set the value of the ReleaseYear object created from the Text class to be the value of the Year read from the data file
        ReleaseYear.set(Year);
        //Create a variable, temp, of type double and convert the value stored in the 15th element (index 14) of the line array from String to double
        double temp = Double.parseDouble(line[14].trim());
        //Store temp in tempWritable
        tempWritable.set(temp);
        //Emit Year and the average rating in tempWritable to the Reducer class
        context.write(ReleaseYear, tempWritable);
    }
}
Reducer code:
public class AverageRatingReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
{
    //Create an ArrayList of type Double called ratingList
    ArrayList<Double> ratingList = new ArrayList<Double>();

    //Override reduce method; reduce takes in three parameters
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException
    {
        //create a variable SumofRating of type double and initialize it to 0
        double SumofRating = 0.0;
        //use a for loop to store ratings in the ratingList array and sum the ratings in the SumofRating variable
        for(DoubleWritable value : values)
        {
            ratingList.add(value.get());
            //calculate the cumulative sum
            SumofRating = SumofRating + value.get();
        }
        //get the number of ratings in the ArrayList
        int size = ratingList.size();
        //calculate the average rating
        double averageRating = SumofRating/size;
        //Emit the year and the average rating to the output file
        context.write(key, new DoubleWritable(averageRating));
    }
}
Main class:
public class AverageRating
{
    public static void main(String[] args) throws Exception
    {
        //Create an object, conf, from the Configuration class
        Configuration conf = new Configuration();
        if (args.length != 3)
        {
            System.err.println("Usage: MeanTemperature <input path> <output path>");
            System.exit(-1);
        }
        //create an object, job, from the Job class
        Job job;
        //configure the parameters for the job
        job = Job.getInstance(conf, "Average Rating");
        //specify the driver class in the JAR file
        job.setJarByClass(AverageRating.class);
        //setting the input and output paths for the job
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        //Set the mapper and reducer for the job
        job.setMapperClass(AverageRatingMapper.class);
        job.setReducerClass(AverageRatingReducer.class);
        //Set the key class (Text) and value class (DoubleWritable) for the job output data
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        //Delete output if it exists
        FileSystem hdfs = FileSystem.get(conf);
        Path outputDir = new Path(args[2]);
        if(hdfs.exists(outputDir))
        {
            hdfs.delete(outputDir, true);
        }
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Data sample, as requested:
imdbtitle | title | year | avg_rating |
---|---|---|---|
tt0000009 | XYZ | 1911 | 6.9 |
tt0001892 | PQR | 1912 | 6.2 |
tt0002154 | ABC | 1912 | 8.2 |
tt0000458 | JKL | 1913 | 6.3 |
tt0015263 | TGH | 1913 | 7.1 |
tt0000053 | PLO | 1912 | 4.9 |
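Note: there are more columns; I have only included the important ones.
The result for the first year is correct, but the results for the second and third years are completely wrong.
- 1911 6.9
- 1912 5.25
- 1913 2.2
Can anyone help me?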
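You don't really need an `ArrayList` in the reduce function, because each call to `reduce` receives all the values grouped under a given key (so in this case, each call gets all the ratings for one year).
Also, declaring the `ArrayList` outside the body of the reduce function is asking for trouble, since you only use the list to count values that the reducer already has in hand. My guess is that the list is never cleared between `reduce` calls, so it keeps filling up with the ratings of each following key (that is, year). The first key-value pair therefore comes out right, but the later ones do not, because the number of elements in the list (the divisor) keeps growing.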
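To make that guess concrete, here is a minimal plain-Java sketch (no Hadoop involved) that mimics a single reducer instance handling the sample rows one year at a time. The class name `StaleListDemo` and its methods are hypothetical stand-ins, and the sketch assumes all three years go to the same reducer instance in sorted key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the question's reducer; not a Hadoop class.
class StaleListDemo {
    // Instance field, like ratingList in the question: it survives across reduce() calls.
    private final List<Double> ratingList = new ArrayList<>();

    // Mirrors the question's logic: sums only this key's values,
    // but divides by the size of the shared, ever-growing list.
    double reduce(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            ratingList.add(v);
            sum += v;
        }
        return sum / ratingList.size();
    }

    // Feeds the sample rows to ONE reducer instance, key by key (assumed sorted order).
    static Map<String, Double> run() {
        Map<String, List<Double>> grouped = new LinkedHashMap<>();
        grouped.put("1911", Arrays.asList(6.9));
        grouped.put("1912", Arrays.asList(6.2, 8.2, 4.9));
        grouped.put("1913", Arrays.asList(6.3, 7.1));

        StaleListDemo reducer = new StaleListDemo();
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            out.put(e.getKey(), reducer.reduce(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        run().forEach((year, avg) -> System.out.println(year + "\t" + avg));
    }
}
```

With these rows the divisor is 1 for 1911, then 4 for 1912, then 6 for 1913, so only the first year is right. The 1913 value is 13.4 / 6, roughly 2.23, which is in line with the 2.2 reported in the question.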
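You can keep it more conventional: use a simple `int` variable named `numOfRatings` to count the values (that is, the ratings for the given year), and divide the sum by that variable to get the average rating for each year.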
public static class AverageRatingReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
{
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException
    {
        // running sum and count, both local so they start from zero for every key
        double sumOfRating = 0.0;
        int numOfRatings = 0;
        // sum the ratings for this key and count how many there are
        for(DoubleWritable value : values)
        {
            sumOfRating += value.get();
            numOfRatings++;
        }
        // calculate the average rating
        double averageRating = sumOfRating/numOfRatings;
        // emit the year and the average rating to the output file
        context.write(key, new DoubleWritable(averageRating));
    }
}
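As a quick sanity check of the sum-and-count logic outside Hadoop, the loop can be lifted into a plain static method and run on the sample ratings. This is only a sketch: `AverageCheck` and `average` are hypothetical names, and the inputs are the six sample rows from the question, not the full data set.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for checking the averaging logic; not part of the job.
class AverageCheck {
    // Same approach as the reduce() body: a local sum and a local counter.
    static double average(List<Double> values) {
        double sumOfRating = 0.0;
        int numOfRatings = 0;
        for (double value : values) {
            sumOfRating += value;
            numOfRatings++;
        }
        return sumOfRating / numOfRatings;
    }

    public static void main(String[] args) {
        // Sample ratings from the question, grouped by year.
        System.out.println("1911\t" + average(Arrays.asList(6.9)));
        System.out.println("1912\t" + average(Arrays.asList(6.2, 8.2, 4.9)));
        System.out.println("1913\t" + average(Arrays.asList(6.3, 7.1)));
    }
}
```

For the sample rows this should give roughly 6.9, 6.43 and 6.7. Because both variables are local to the method, nothing carries over from one year to the next.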