How to filter out lines starting with 'URL' in a PySpark RDD
I have initialized a PySpark SparkContext (sc).
task1 = (text.filter(lambda x: len(x)>0 )) # to filter empty lines
task1.collect()
My goal is to filter out the lines that start with 'URL' in this text snippet:
['URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html',
'WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes.
How can I do this easily with PySpark syntax?
You can use a regular expression:
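import re
# Negative lookahead: matches only lines that do NOT begin with "URL"
reg = re.compile(r'^(?!URL).*')
# Return an explicit boolean so the predicate's intent is clear
task1 = text.filter(lambda x: reg.match(x) is not None)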
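For a fixed prefix like this, a plain str.startswith check may be easier to read than a regular expression. The sketch below is only an illustration: keep_line and the sample lines are hypothetical names and data, and the predicate is exercised on a Python list so it can be checked without a running Spark context.

```python
def keep_line(line: str) -> bool:
    """Return True for non-empty lines that do not start with 'URL'."""
    return len(line) > 0 and not line.startswith("URL")


# Hypothetical sample lines standing in for the RDD contents.
sample = [
    "URL: http://www.example.com/article.html",
    "",
    "Stellar pitching kept the Mets afloat.",
]

# Plain-Python check of the predicate.
kept = [line for line in sample if keep_line(line)]
print(kept)

# With the RDD named `text` from the question, the same predicate would be
# passed to filter, combining the empty-line and URL checks in one step:
#     task1 = text.filter(keep_line)
```

Because the predicate is an ordinary function, it can be unit-tested on its own before being handed to text.filter.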
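The question needs sample input and output. I am assuming the data provided are rows in a table; if that is not the case, I am happy to change the answer after clarification. If it is, say the data is: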
+---+--------------------+
|SID| Attribute|
+---+--------------------+
| 1|[URL: http://www....|
| 2|scherzer-baffles-...|
| 3|kept the Mets afl...|
+---+--------------------+
Let's use filter together with PySpark's expr(), a SQL function that evaluates SQL-like expressions on a DataFrame:
from pyspark.sql.functions import expr

# Keeps values that start with "[" and are at least 3 characters long
df.filter(expr("Attribute like '[__%'")).show()
+---+--------------------+
|SID| Attribute|
+---+--------------------+
| 1|[URL: http://www....|
+---+--------------------+
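To make the LIKE pattern above concrete: in SQL, "_" matches exactly one character and "%" matches any run of characters, so '[__%' means "a '[' followed by at least two more characters". The sketch below emulates that rule in plain Python purely for illustration; like_to_regex and like are hypothetical helpers, not part of PySpark, and escape characters in LIKE patterns are not handled.

```python
import re


def like_to_regex(pattern: str):
    """Translate a simple SQL LIKE pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.DOTALL)


def like(value: str, pattern: str) -> bool:
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


# Values taken from the Attribute column shown above.
rows = ["[URL: http://www.", "scherzer-baffles-", "kept the Mets afl"]
matching = [r for r in rows if like(r, "[__%")]
not_matching = [r for r in rows if not like(r, "[__%")]
print(matching)
print(not_matching)
```

Note that the filter shown above keeps the URL row. To drop it instead, as the question asks, the condition would need to be negated, for example with a NOT LIKE expression inside expr().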