Cannot print lines from rdd after using **persist()**
I am using the following code:
def process_row(row):
    items = row.replace('"', '')
    items2 = items.split(' ')
    for i in range(len(items2)):
        #if we find ‘-’ we will replace it with ‘0’
        if(row[-1]=='-'):
            row[i]='0'
    return [items2[0], items2[1], items2[2],items2[3], items2[4], int(items2[5])]
nasa = (
    nasa_raw.map(process_row)
)
nasa = nasa.persist()
first10 = nasa.collect()[:10]
for i, line in enumerate(first10):
    print(f'Line {i+1:02}: {line}')
on this text file:
in24.inetnebr.com [01/Aug/1995:00:00:01] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt" 200 1839
uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/ksclogo-medium.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/MOSAIC-logosmall.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/USA-logosmall.gif" 304 0
ix-esc-ca2-07.ix.netcom.com [01/Aug/1995:00:00:09] "GET /images/launch-logo.gif" 200 1713
uplherc.upl.com [01/Aug/1995:00:00:10] "GET /images/WORLD-logosmall.gif" 304 0
slppp6.intermind.net [01/Aug/1995:00:00:10] "GET /history/skylab/skylab.html" 200 1687
piweba4y.prodigy.com [01/Aug/1995:00:00:10] "GET /images/launchmedium.gif" 200 11853
slppp6.intermind.net [01/Aug/1995:00:00:11] "GET /history/skylab/skylab-small.gif" 200 9202
I am trying to print the first 10 elements of the RDD, with each element on its own output line.
I get an error like
"object does not support item assignment"
Any ideas?
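With the sample data you provided, your code runs fine (I reformatted it as shown below). I suspect the problem comes from your data itself. Have you tried splitting up or limiting your dataset?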
in24.inetnebr.com [01/Aug/1995:00:00:01] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt" 200 1839
uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/ksclogo-medium.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/MOSAIC-logosmall.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/USA-logosmall.gif" 304 0
ix-esc-ca2-07.ix.netcom.com [01/Aug/1995:00:00:09] "GET /images/launch-logo.gif" 200 1713
uplherc.upl.com [01/Aug/1995:00:00:10] "GET /images/WORLD-logosmall.gif" 304 0
slppp6.intermind.net [01/Aug/1995:00:00:10] "GET /history/skylab/skylab.html" 200 1687
piweba4y.prodigy.com [01/Aug/1995:00:00:10] "GET /images/launchmedium.gif" 200 11853
slppp6.intermind.net [01/Aug/1995:00:00:11] "GET /history/skylab/skylab-small.gif" 200 9202
def process_row(row):
    items = row.replace('"', '')
    items2 = items.split(' ')
    for i in range(len(items2)):
        #if we find ‘-’ we will replace it with ‘0’
        if(row[-1]=='-'):
            row[i]='0'
    return [items2[0], items2[1], items2[2],items2[3], items2[4], int(items2[5])]
nasa_raw = sc.textFile('a.txt')
nasa = (
    nasa_raw.map(process_row)
)
nasa = nasa.persist()
first10 = nasa.collect()[:10]
for i, line in enumerate(first10):
    print(f'Line {i+1:02}: {line}')
# Output
Line 01: ['in24.inetnebr.com', '[01/Aug/1995:00:00:01]', 'GET', '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt', '200', 1839]
Line 02: ['uplherc.upl.com', '[01/Aug/1995:00:00:07]', 'GET', '/', '304', 0]
Line 03: ['uplherc.upl.com', '[01/Aug/1995:00:00:08]', 'GET', '/images/ksclogo-medium.gif', '304', 0]
Line 04: ['uplherc.upl.com', '[01/Aug/1995:00:00:08]', 'GET', '/images/MOSAIC-logosmall.gif', '304', 0]
Line 05: ['uplherc.upl.com', '[01/Aug/1995:00:00:08]', 'GET', '/images/USA-logosmall.gif', '304', 0]
Line 06: ['ix-esc-ca2-07.ix.netcom.com', '[01/Aug/1995:00:00:09]', 'GET', '/images/launch-logo.gif', '200', 1713]
Line 07: ['uplherc.upl.com', '[01/Aug/1995:00:00:10]', 'GET', '/images/WORLD-logosmall.gif', '304', 0]
Line 08: ['slppp6.intermind.net', '[01/Aug/1995:00:00:10]', 'GET', '/history/skylab/skylab.html', '200', 1687]
Line 09: ['piweba4y.prodigy.com', '[01/Aug/1995:00:00:10]', 'GET', '/images/launchmedium.gif', '200', 11853]
Line 10: ['slppp6.intermind.net', '[01/Aug/1995:00:00:11]', 'GET', '/history/skylab/skylab-small.gif', '200', 9202]
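One likely explanation for why the sample works while the full dataset fails, offered as an assumption since the failing rows were not posted: none of the ten sample lines ends with '-', so the branch containing row[i]='0' never runs. row is a Python str, and strings are immutable, so as soon as a line does end with '-' that assignment raises a TypeError. The sketch below reproduces the message in plain Python and shows one possible fix that replaces '-' in the split list instead. dash_line is a made-up example line, and treating '-' as 0 is an interpretation of the comment in the original code.

```python
def process_row(row):
    """Parse one log line into [host, timestamp, method, path, status, size]."""
    # Drop the double quotes around the request, then split on single spaces.
    items = row.replace('"', '').split(' ')
    # Replace '-' fields in the list. The list is mutable; the input str is not.
    items = ['0' if item == '-' else item for item in items]
    return [items[0], items[1], items[2], items[3], items[4], int(items[5])]


# One line from the sample data, and a hypothetical line whose size field is '-'.
ok_line = 'uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0'
dash_line = 'uplherc.upl.com [01/Aug/1995:00:00:07] "GET /missing.html" 404 -'

# The original pattern only fails when the branch is actually taken.
error_message = None
try:
    row = dash_line
    if row[-1] == '-':
        row[0] = '0'  # same kind of assignment as row[i] = '0'
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(process_row(ok_line))
print(process_row(dash_line))
```

With Spark the function is used exactly as before, nasa_raw.map(process_row). Because map is lazy, the exception only surfaces when an action such as collect() runs, which is probably why it looks related to persist(). To print only ten rows, nasa.take(10) avoids pulling the whole RDD to the driver the way nasa.collect()[:10] does.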