Compare Rows To Make Noun Chunk in PySpark
I have a Spark DataFrame in which each row is one token of a sentence, along with its part of speech. I am trying to find the best way to compare each row with the next one in order to build the longest noun chunk.
+------+-----------+---------------------------+--------+-------+-------+-----+
|REV_ID|    SENT_ID|                   SENTENCE|TOKEN_ID|  TOKEN|  LEMMA|  POS|
+------+-----------+---------------------------+--------+-------+-------+-----+
|     1|          1|Ice hockey game took hours.|       1|    Ice|    ice| NOUN|
|     1|          1|Ice hockey game took hours.|       2| hockey| hockey| NOUN|
|     1|          1|Ice hockey game took hours.|       3|   game|   game| NOUN|
|     1|          1|Ice hockey game took hours.|       4|   took|   take| VERB|
|     1|          1|Ice hockey game took hours.|       5|  hours|   hour| NOUN|
+------+-----------+---------------------------+--------+-------+-------+-----+
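For reference, a minimal sketch that recreates this sample frame (column names taken from the table above; an active SparkSession named spark is assumed):

rows = [
    (1, 1, "Ice hockey game took hours.", 1, "Ice",    "ice",    "NOUN"),
    (1, 1, "Ice hockey game took hours.", 2, "hockey", "hockey", "NOUN"),
    (1, 1, "Ice hockey game took hours.", 3, "game",   "game",   "NOUN"),
    (1, 1, "Ice hockey game took hours.", 4, "took",   "take",   "VERB"),
    (1, 1, "Ice hockey game took hours.", 5, "hours",  "hour",   "NOUN"),
]
df = spark.createDataFrame(
    rows, ["REV_ID", "SENT_ID", "SENTENCE", "TOKEN_ID", "TOKEN", "LEMMA", "POS"])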
I know a for loop is not efficient, but I am not sure how else to get the expected result shown below:
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
|REV_ID|    SENT_ID|                   SENTENCE|TOKEN_ID|  TOKEN|  LEMMA|  POS|      NOUN_CHUNK|
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
|     1|          1|Ice hockey game took hours.|       1|    Ice|    ice| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       2| hockey| hockey| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       3|   game|   game| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       4|   took|   take| VERB|            NULL|
|     1|          1|Ice hockey game took hours.|       5|  hours|   hour| NOUN|            hour|
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
Try window functions: a running count of non-NOUN tokens gives every contiguous run of NOUNs the same value, and a second window then concatenates the lemmas within each run.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("SENT_ID").orderBy("TOKEN_ID")
w1 = Window.partitionBy("SENT_ID", "chunk_id")

(df
 # Running count of non-NOUN tokens: rows in one contiguous NOUN run share a value.
 .withColumn("chunk_id",
             F.sum(F.when(F.col("POS") == "NOUN", F.lit(0)).otherwise(F.lit(1))).over(w))
 # Null out the identifier on non-NOUN rows so they never join a chunk.
 .withColumn("chunk_id", F.expr("IF(POS != 'NOUN', null, chunk_id)"))
 # Concatenate the lemmas within each chunk; non-NOUN rows keep a null NOUN_CHUNK.
 .withColumn("NOUN_CHUNK",
             F.when(F.col("chunk_id").isNotNull(),
                    F.array_join(F.collect_list("LEMMA").over(w1), " "))
              .otherwise(F.lit(None)))
 .drop("chunk_id").orderBy("SENT_ID", "TOKEN_ID").show())
#+------+-------+--------------------+--------+------+------+----+---------------+
#|REV_ID|SENT_ID|            SENTENCE|TOKEN_ID| TOKEN| LEMMA| POS|     NOUN_CHUNK|
#+------+-------+--------------------+--------+------+------+----+---------------+
#|     1|      1|Ice hockey game t...|       1|   Ice|   ice|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       2|hockey|hockey|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       3|  game|  game|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       4|  took|  take|VERB|           null|
#|     1|      1|Ice hockey game t...|       5| hours|  hour|NOUN|           hour|
#+------+-------+--------------------+--------+------+------+----+---------------+
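To see why the grouping works, here is a sketch that inspects just the running counter (the chunk_id helper from the snippet above) before it is nulled out and dropped:

df.withColumn(
    "chunk_id",
    F.sum(F.when(F.col("POS") == "NOUN", F.lit(0)).otherwise(F.lit(1))).over(w)
).select("TOKEN_ID", "TOKEN", "POS", "chunk_id").orderBy("TOKEN_ID").show()
# Ice/hockey/game -> 0, took -> 1, hours -> 1: each NOUN run shares one value,
# and the VERB row's value is nulled afterwards, so "hours" forms its own chunk.

If SENT_ID values can repeat across reviews, you will likely also want REV_ID in both partitionBy clauses.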