Why do we use Hadoop MapReduce for data processing? Why not do it on a local machine?
I'm confused. I tried to compute the probability mass of one million random numbers. I did the same thing twice: once with MapReduce on Google Dataproc, and once by running a plain Python script in Spyder. The local machine was faster. So why would we use MapReduce for this?
I used the following code.
#!/usr/bin/env python3
import timeit
start = timeit.default_timer()
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

# Generate one million random "grades" between 1 and 99
x = np.random.randint(low=1, high=100, size=1000000)
counts = Counter(x)
total = sum(counts.values())
# Probability mass of each grade
d1 = {k: v / total for k, v in counts.items()}
grad = d1.keys()
prob = d1.values()

# 'normed' was removed in recent matplotlib versions; use density=True instead
plt.hist(list(prob), bins=20, density=True, facecolor='blue', alpha=0.5)
plt.xlabel('Probability')
plt.ylabel('Number Of Students')
plt.title('Histogram of Students Grade')
plt.subplots_adjust(left=0.15)
plt.show()

stop = timeit.default_timer()
print('Time: ', stop - start)
#!/usr/bin/env python3
"""mapper.py"""
import sys

# Read input lines from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()
    # Split the line into individual grade values
    for probability_mass in line.split():
        # Emit a single key so all values reach the same reducer
        print("None\t{}".format(probability_mass))
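To sanity-check the mapper before submitting a job, you can feed it a few lines by hand. This is only a minimal local sketch: the sample values and the assumption of one grade per input line are illustrative, not part of the original job.
# Minimal local check of mapper.py (sample data is illustrative only)
import subprocess

sample = "72\n85\n72\n"          # pretend input: one grade per line
result = subprocess.run(
    ["python3", "mapper.py"], input=sample,
    capture_output=True, text=True
)
print(result.stdout)
# Expected output: one key/value pair per grade, e.g. "None\t72"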
#!/usr/bin/env python3
"""reducer.py"""
import sys
from collections import defaultdict

counts = defaultdict(int)

# Read the mapper output from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()
    # Skip empty lines
    if not line:
        continue
    # Parse the key/value pair emitted by mapper.py
    k, v = line.split('\t', 1)
    # Count how many times each grade value occurs
    counts[v] += 1

total = float(sum(counts.values()))
# Probability mass of each grade value
probability_mass = {k: v / total for k, v in counts.items()}
print(probability_mass)
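You can also emulate the whole Hadoop Streaming flow (map, then shuffle/sort, then reduce) on the local machine before running it on Dataproc. This is a rough sketch; it assumes mapper.py and reducer.py are in the current directory and that a file such as grades.txt (a hypothetical name) holds one grade per line.
# Rough local emulation of the Streaming pipeline; file names are placeholders
import subprocess

with open("grades.txt") as f:
    mapped = subprocess.run(
        ["python3", "mapper.py"], stdin=f,
        capture_output=True, text=True
    ).stdout

# Hadoop sorts the mapper output by key between the map and reduce phases
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))

reduced = subprocess.run(
    ["python3", "reducer.py"], input=shuffled,
    capture_output=True, text=True
).stdout
print(reduced)   # the probability-mass dictionary printed by reducer.py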
Hadoop is used to store and process big data. In Hadoop, data is stored on inexpensive commodity servers that run as a cluster. Its distributed file system allows concurrent processing and fault tolerance, and the Hadoop MapReduce programming model is what lets you store and retrieve data from those nodes quickly.
Google Dataproc is Apache Hadoop on the cloud. Map/Reduce pays off when the data is too large for a single machine to handle. One million values is a small amount.
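To put the size in perspective, here is a quick back-of-the-envelope check (assuming the default 64-bit integer dtype): the entire one-million-value dataset is only about 8 MB, which fits comfortably in the memory of any laptop, so the fixed overhead of scheduling a distributed job outweighs any gain at that scale.
# Size of the full dataset in memory (assuming a 64-bit integer dtype)
import numpy as np

x = np.random.randint(low=1, high=100, size=1_000_000)
print(x.nbytes / 1e6, "MB")   # roughly 8 MB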