PySpark string pattern from column values and a regexp expression
Hi, I have a DataFrame with 2 columns:
+----------------------------------------+----------+
| Text                                   | Key_word |
+----------------------------------------+----------+
| First random text tree cheese cat      | tree     |
| Second random text apple pie three     | text     |
| Third random text burger food brain    | brain    |
| Fourth random text nothing thing chips | random   |
+----------------------------------------+----------+
I want to generate a 3rd column containing the word that appears immediately before the key_word in the text.
+----------------------------------------+----------+-------------------+
| Text                                   | Key_word | word_bef_key_word |
+----------------------------------------+----------+-------------------+
| First random text tree cheese cat      | tree     | text              |
| Second random text apple pie three     | text     | random            |
| Third random text burger food brain    | brain    | food              |
| Fourth random text nothing thing chips | random   | Fourth            |
+----------------------------------------+----------+-------------------+
I tried this, but it doesn't work:
df2=df1.withColumn('word_bef_key_word',regexp_extract(df1.Text,('\w+)'df1.key_word,1))
Here is the code to create the example DataFrame:
df = sqlCtx.createDataFrame(
    [
        ('First random text tree cheese cat', 'tree'),
        ('Second random text apple pie three', 'text'),
        ('Third random text burger food brain', 'brain'),
        ('Fourth random text nothing thing chips', 'random')
    ],
    ('Text', 'Key_word')
)
Update
You can also do this by using pyspark.sql.functions.expr to pass the column value as the pattern to pyspark.sql.functions.regexp_extract:
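from pyspark.sql.functions import expr

# '\\w' is written with two backslashes because Spark's SQL parser unescapes
# string literals by default, which leaves '\w' for the regex engine.
df = df.withColumn(
    'word_bef_key_word',
    expr(r"regexp_extract(Text, concat('\\w+(?= ', Key_word, ')'), 0)")
)
df.show(truncate=False)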
#+--------------------------------------+--------+-----------------+
#|Text |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat |tree |text |
#|Second random text apple pie three |text |random |
#|Third random text burger food brain |brain |food |
#|Fourth random text nothing thing chips|random |Fourth |
#+--------------------------------------+--------+-----------------+
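To make it easier to see what the SQL expression computes for each row, here is a rough plain-Python emulation. The helper name emulate_regexp_extract is hypothetical, and Python's re module stands in for the Java regex engine that Spark uses; the two should behave the same for a pattern this simple. One thing the sketch mirrors is that regexp_extract returns an empty string, not null, when the pattern does not match.

```python
import re

def emulate_regexp_extract(text, key_word, idx=0):
    """Hypothetical stand-in for regexp_extract(Text, concat(...), 0)."""
    # concat('\\w+(?= ', Key_word, ')') builds a separate pattern for each row
    pattern = r'\w+(?= ' + key_word + ')'
    match = re.search(pattern, text)
    # Group 0 is the whole match; an empty string signals "no match"
    return match.group(idx) if match else ''

rows = [
    ('First random text tree cheese cat', 'tree'),
    ('Second random text apple pie three', 'text'),
    ('Third random text burger food brain', 'brain'),
    ('Fourth random text nothing thing chips', 'random'),
]

results = [emulate_regexp_extract(text, key_word) for text, key_word in rows]
print(results)
```

If you need null instead of an empty string in the real DataFrame, that would have to be handled as a separate step after the extraction.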
Original answer
One way is to use a udf to run the regular expression:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_previous_word(text, key_word):
    # Collect every word that is immediately followed by a space and key_word
    matches = re.findall(r'\w+(?= {kw})'.format(kw=key_word), text)
    return matches[0] if matches else None

get_previous_word_udf = udf(get_previous_word, StringType())

df = df.withColumn('word_bef_key_word', get_previous_word_udf('Text', 'Key_word'))
df.show(truncate=False)
#+--------------------------------------+--------+-----------------+
#|Text |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat |tree |text |
#|Second random text apple pie three |text |random |
#|Third random text burger food brain |brain |food |
#|Fourth random text nothing thing chips|random |Fourth |
#+--------------------------------------+--------+-----------------+
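The regular expression pattern '\w+(?= {kw})'.format(kw=key_word) matches a word that is followed by a space and the key_word. If there are multiple matches, we return the first one. If there are no matches, the function returns None.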
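The lookahead behaviour described above can be checked without Spark. This is only a small sketch: the sample sentence and the variable names are made up for illustration and are not part of the question's data.

```python
import re

# Made-up sentence in which the key word 'apple' occurs twice
sample_text = 'red apple pie and green apple tart'

# (?= apple) is a lookahead: it requires ' apple' to follow the word,
# but does not include it in the match itself
pattern = r'\w+(?= {kw})'.format(kw='apple')
all_matches = re.findall(pattern, sample_text)

# Same rule as in get_previous_word: first match, or None if there is none
first_match = all_matches[0] if all_matches else None

# A key word that never appears yields an empty list, hence None
no_matches = re.findall(r'\w+(?= {kw})'.format(kw='banana'), sample_text)
missing = no_matches[0] if no_matches else None

print(all_matches)
print(first_match)
print(missing)
```

Note that key_word is inserted into the pattern as-is, so a key word containing regex metacharacters would be interpreted as part of the pattern. If that is a concern, wrapping it in re.escape is one option.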