Return data from same row as a given query in Pyspark
In Pyspark (Spark/Hadoop input): I want to look up a keyword in a dataset, for example "SJC", and return the text from the second column of every row where the keyword "SJC" is found.
For example, given the following dataset:
[Year][Delay][Dest][Flight #]
|1987| |-5| |SJC| |500|
|1987| |-5| |SJC| |250|
|1987| |07| |SFO| |700|
|1987| |09| |SJC| |350|
|1987| |-5| |SJC| |650|
I would like to be able to query for "SJC" and get the [Delay] values back as a list or a string.
I have gotten this far, but with no luck:
import sys
from pyspark import SparkContext
logFile = "hdfs://<ec2 host address>:9000/<dataset folder (on ec2)>"
sc = SparkContext("local", "simple app")
logData = sc.textFile(logFile).cache()
numSJC = logData.filter(lambda line: 'SJC' in line).first()
print "Lines with SJC:" + ''.join(numSJC)
Thanks for your help!
You've almost done it yourself.
Suppose you have a pipe-delimited file "/tmp/demo.txt":
Year|Delay|Dest|Flight #
1987|-5|SJC|500
1987|-5|SJC|250
1987|07|SFO|700
1987|09|SJC|350
1987|-5|SJC|650
In PySpark you would do the following:
# First, point Spark to the file
log = sc.textFile('file:///tmp/demo.txt')
# Second, replace each line with a list of its values, so the string
# '1987|-5|SJC|500' becomes ['1987', '-5', 'SJC', '500']
log = log.map(lambda line: line.split('|'))
# Now filter, keeping only the rows whose 3rd element ('Dest') equals 'SJC'
log = log.filter(lambda x: x[2]=='SJC')
# Now keep only the second column, 'Delay'
log = log.map(lambda x: x[1])
# And here's the result
log.collect()
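With the demo file above, the final collect() returns the 'Delay' values of the 'SJC' rows as strings. As a minimal follow-up sketch, assuming you also want them as Python integers (the variable names here are just illustrative):

# collect() gives the filtered 'Delay' column as strings,
# e.g. ['-5', '-5', '09', '-5'] for the demo data above
delays = log.collect()
# If you need numbers instead, cast inside the RDD before collecting
delay_ints = log.map(lambda d: int(d)).collect()  # [-5, -5, 9, -5]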