PySpark SQL query to return row with most number of words

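I'm trying to come up with a PySpark SQL query that returns the row of the review DataFrame whose text column contains the most words.

I'd like to return both the full text and the word count. The question concerns reviews from the Yelp dataset. Here's what I have so far, but apparently it's not (entirely) correct: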

query = """
SELECT text,LENGTH(text) - LENGTH(REPLACE(text,' ', '')) + 1 as count
    FROM review
    GROUP BY text
    ORDER BY count DESC
"""
spark.sql(query).show()

Here's an example of a few rows from the DataFrame:

[Row(business_id='ujmEBvifdJM6h6RLv4wQIg', cool=0, date='2013-05-07 04:34:36', funny=1, review_id='Q1sbwvVQXV2734tPgoKj4Q', stars=1.0, text='Total bill for this horrible service? Over Gs. These crooks actually had the nerve to charge us  for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.', useful=6, user_id='hG7b0MtEbXx5QzbzE6C_VA'),
 Row(business_id='NZnhc2sEQy3RmzKTZnqtwQ', cool=0, date='2017-01-14 21:30:33', funny=0, review_id='GJXCdrto3ASJOqKeVWPi6Q', stars=5.0, text="I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure.  That was superb!  Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen.  The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement.  It was so much fun to be there! \n\nNext Travis started with the flat iron.  The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable.  It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists.  At the end of the blowout & style my hair was perfectly bouncey and looked terrific.  The only thing better?  That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas.  You make me feel beauuuutiful!", useful=0, user_id='yXQM5uF2jS6es16SJzNHfg'),
 Row(business_id='WTqjgwHlXbSFevF32_DJVw', cool=0, date='2016-11-09 20:09:03', funny=0, review_id='2TzJjDVDEuAW6MR5Vuc1ug', stars=5.0, text="I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for  something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!", useful=3, user_id='n6-Gk65cPZL6Uz8qRm3NYw')]

Expected output, if this were the review containing the most words:

I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for  something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!

followed by Word count = xxxx

Edit: here is sample output for the first review returned by this code:

query = """
SELECT text, size(split(text, ' ')) AS word_count 
    FROM review 
    ORDER BY word_count DESC
"""
spark.sql(query).show(20, False)

The review returned as having the most words:

Got a date with de$tiny?
 
                          ** A ROMANTIC MOMENT WITH ** 
                            ** THE BEST VIEW IN TOWN**                                                 

                         ------------------------------------------------
                      /                   **CN TOWER'S**                  \ 
                     /         **REVOLVING RESTAURANT**     \         
                      \                                                                     /
                        \  ----------------------------------------------- /
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                               |                 |
                                             /                     \
                                            ===========

               o     o~
               /|~  ~|\
               /\     /  \        uhm, maybe not. the view may be great but a  to  
                                   bleh $teak ain't necessarily gonna get you some
                                  action later. Cheaper to get takeout from Harvey's and 
                                  eat and the beach!                                                                                                        |4329      |

You can express the logic you have in native Spark SQL by splitting the string into an array of words and taking the size of that array.

spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)
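As a plain-Python analogy (an illustration only, not Spark code), the sketch below mimics what `size(split(text, ' '))` counts and compares it with the `LENGTH`/`REPLACE` arithmetic from the question. The helper names `count_by_split` and `count_by_length` are hypothetical. Both approaches count the number of spaces plus one, so every extra consecutive space adds a phantom "word".

```python
def count_by_split(text: str) -> int:
    # Analogue of size(split(text, ' ')): splitting on a single space
    # keeps an empty string between each pair of consecutive spaces.
    return len(text.split(" "))


def count_by_length(text: str) -> int:
    # Analogue of LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) + 1:
    # the number of space characters, plus one.
    return len(text) - len(text.replace(" ", "")) + 1


# Hypothetical sample strings; the second pads two words with six spaces.
samples = ["This sentence has 5 words.", "wide" + " " * 6 + "gap"]

for s in samples:
    print(repr(s), count_by_split(s), count_by_length(s))
```

Under this reading, the two queries in the question should rank rows the same way, and heavily space-padded text will score far higher than its real word count.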

Example

data = [("This is a sentence.",),  ("This sentence has 5 words.", )]

review = spark.createDataFrame(data, ("text", ))
review.createOrReplaceTempView("review")  # registerTempTable is deprecated since Spark 2.0

spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)

Output

+--------------------------+----------+
|text                      |word_count|
+--------------------------+----------+
|This sentence has 5 words.|5         |
|This is a sentence.       |4         |
+--------------------------+----------+