regexp in PySpark
I am trying to reproduce the result of a Django ORM query in PySpark:
social_filter = '(facebook|flipboard|linkedin|pinterest|reddit|twitter)'
Collection.objects.filter(social__iregex=social_filter)
My main problem is that the match has to be case-insensitive.
I tried this:
social_filter = "social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'"
df = sessions.filter(social_filter)
which results in the following error:
Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.22] failure: end of input expected
social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'
And the following expression:
social_filter = "social ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)"
df = sessions.filter(social_filter)
crashes with:
Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.17] failure: identifier expected
social ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Any help would be appreciated!
How about the following:
>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([Row(name='bob', social='TWITter'),
...                       Row(name='steve', social='facebook')])
>>> df = sqlContext.createDataFrame(rdd)
>>> df.where("LOWER(social) LIKE 'twitter'").collect()
[Row(name=u'bob', social=u'TWITter')]
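If you need pattern matching rather than exact equality, repeat this LIKE condition for each social network you want, joined with OR. If an exact match is enough, you can do this instead: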
>>> df.where("LOWER(social) IN ('twitter', 'facebook')").collect()
[Row(name=u'bob', social=u'TWITter'), Row(name=u'steve', social=u'facebook')]
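Writing one LIKE clause per network by hand gets tedious, so here is a minimal sketch that builds the filter string from a list. The helper name build_like_filter and the NETWORKS list are made up for illustration. The % wildcards are my addition: they turn the comparison into a substring match, which is closer to how the unanchored iregex lookup in the question behaves. It assumes the values are plain words with no quotes or LIKE wildcards in them.

```python
# Hypothetical list of networks, copied from the pattern in the question.
NETWORKS = ['facebook', 'flipboard', 'linkedin', 'pinterest', 'reddit', 'twitter']


def build_like_filter(column, values):
    """Build a SQL condition that is true when `column` contains any of
    `values`, ignoring case. Values are assumed to be plain words."""
    clauses = ["LOWER({0}) LIKE '%{1}%'".format(column, v.lower())
               for v in values]
    return " OR ".join(clauses)


social_filter = build_like_filter('social', NETWORKS)

# With the `sessions` DataFrame from the question, the string would be
# passed to filter() like any other SQL condition (not executed here):
# df = sessions.filter(social_filter)
```

Lower-casing both sides in the query sidesteps ILIKE, which is the keyword the parser in the question appears to choke on.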
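You can now also do this with a UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
import re

def filter_fn(s):
    # NULL values reach the UDF as None, which re.search cannot handle.
    if s is None:
        return False
    return re.search('(facebook|flipboard|linkedin|pinterest|reddit|twitter)', s, re.IGNORECASE) is not None

# Wrap the Python function as a UDF that returns a boolean column.
filter_udf = F.udf(filter_fn, BooleanType())
sessions_filtered = sessions.filter(filter_udf(sessions['social']))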
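A Python UDF sends every row through a Python worker, so it may be worth trying a built-in regex match first. The sketch below puts the case-insensitivity into the pattern itself with the inline (?i) flag, which both Python's re module and Java's regex engine (the one Spark uses) understand. The names SOCIAL_PATTERN and matches_social are hypothetical. The Spark line is left as a comment because it needs a running session, and I am assuming your Spark version exposes Column.rlike.

```python
import re

# Same alternation as in the question, with an inline case-insensitive flag.
SOCIAL_PATTERN = '(?i)(facebook|flipboard|linkedin|pinterest|reddit|twitter)'


def matches_social(s):
    """Local check of the pattern with Python's re; None counts as no match."""
    return s is not None and re.search(SOCIAL_PATTERN, s) is not None


# In PySpark the same pattern string could be handed to Column.rlike
# (not executed here; `sessions` is the DataFrame from the question):
# sessions_filtered = sessions.filter(sessions['social'].rlike(SOCIAL_PATTERN))
```

Checking the pattern locally with matches_social on a few sample strings is a cheap way to confirm it behaves like the iregex lookup before running it on the cluster.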