How to print the total count of elements in a file using pyspark?

I have a text file containing the names of many products separated by commas (','). I want to save the total number of products present in this file to a file. How can I do this using pyspark?

So far I have tried the code below, and the count I get doesn't seem right.

import pyspark

# Create a SparkContext only if one doesn't already exist
if 'sc' not in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("anytext.txt")
counts = text_file.flatMap(lambda line: line.split(","))

counts.count()

Can anyone help me with this? Thanks in advance.
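One plausible reason the RDD count looks inflated: `str.split(',')` yields empty strings for blank lines or trailing commas, and `flatMap` keeps them as elements. A minimal plain-Python illustration (no Spark needed; the sample lines are made up):

```python
# Each blank line or trailing comma contributes an empty-string "product".
lines = ["milk,bread,", "", "eggs"]

raw = [item for line in lines for item in line.split(",")]
print(len(raw))  # 5 -- includes two empty strings

# Filtering out empty/whitespace-only tokens gives the real product count.
cleaned = [item for item in raw if item.strip()]
print(len(cleaned))  # 3
```

In Spark terms, the equivalent fix would be a `.filter(lambda w: w.strip())` before `.count()`.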

Try it as below -

Initial data -

df = spark.read.text("/FileStore/word_count/citrus_fruit.txt")
df.show(5, False)

+-------------------------------------------------------------------+
|value                                                              |
+-------------------------------------------------------------------+
|citrus fruit,semi-finished bread,margarine,ready soups             |
|tropical fruit,yogurt,coffee                                       |
|whole milk                                                         |
|pip fruit,yogurt,cream cheese ,meat spreads                        |
|other vegetables,whole milk,condensed milk,long life bakery product|
+-------------------------------------------------------------------+

Now count the number of words in each line -

from pyspark.sql import functions as F

df2 = df.withColumn('count', F.size(F.split('value', ',')))
df2.show(5, False)

+-------------------------------------------------------------------+-----+
|value                                                              |count|
+-------------------------------------------------------------------+-----+
|citrus fruit,semi-finished bread,margarine,ready soups             |4    |
|tropical fruit,yogurt,coffee                                       |3    |
|whole milk                                                         |1    |
|pip fruit,yogurt,cream cheese ,meat spreads                        |4    |
|other vegetables,whole milk,condensed milk,long life bakery product|4    |
+-------------------------------------------------------------------+-----+
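For intuition, `F.size(F.split('value', ','))` is just "split each line on commas and take the length". The same per-line logic in plain Python, using three lines copied from the data above:

```python
lines = [
    "citrus fruit,semi-finished bread,margarine,ready soups",
    "tropical fruit,yogurt,coffee",
    "whole milk",
]

# Same idea as F.size(F.split('value', ',')): split on ',' and count tokens.
counts = [len(line.split(",")) for line in lines]
print(counts)  # [4, 3, 1]
```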

Once you have the per-line counts, you just need to sum them up as below -

df3 = df2.groupBy().agg(F.sum('count').alias('word_count_sum'))

df3.show()
+--------------+
|word_count_sum|
+--------------+
|43368         |
+--------------+

Now you can write this dataframe to a file as below (I'm using csv here) -

df3.write.format('csv').mode('overwrite').save('/FileStore/tmp/')
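For reference, the whole pipeline (read, split per line, sum, write a one-cell csv) can be sketched in plain Python. This is a toy equivalent of the Spark job above, not a substitute for it; the file paths here are hypothetical temp files:

```python
import os
import tempfile

# Toy input: each line holds comma-separated product names.
src = os.path.join(tempfile.mkdtemp(), "citrus_fruit.txt")
with open(src, "w") as f:
    f.write("citrus fruit,semi-finished bread,margarine,ready soups\n")
    f.write("tropical fruit,yogurt,coffee\n")
    f.write("whole milk\n")

# Split each non-empty line on ',' and sum the per-line counts
# (mirrors F.size(F.split(...)) followed by F.sum).
with open(src) as f:
    total = sum(len(line.rstrip("\n").split(","))
                for line in f if line.strip())

# Write the total out, like df3.write.format('csv').
out = os.path.join(os.path.dirname(src), "word_count_sum.csv")
with open(out, "w") as f:
    f.write(f"{total}\n")

print(total)  # 8
```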