How to print the total count of elements in a file using pyspark?
I have a text file that contains the names of many products separated by commas (','). I want to save the total number of products present in this file to a file. How can I do this using pyspark?
So far I have tried the code below, but the count I get does not seem correct.
import pyspark
import random

if 'sc' not in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("anytext.txt")
counts = text_file.flatMap(lambda line: line.split(",")) \
                  .map(lambda word: (word))
counts.count()
Can anyone help me with this? Thanks in advance.
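A likely reason the count above looks off is that split(",") also emits empty or whitespace-padded tokens for blank lines and stray commas. A minimal sketch of the same RDD approach with those filtered out, assuming the file described in the question:

import pyspark

sc = pyspark.SparkContext.getOrCreate()

text_file = sc.textFile("anytext.txt")

# Split each line on commas, trim whitespace, and drop empty tokens
# before counting, so blank lines and trailing commas are ignored.
products = (text_file
            .flatMap(lambda line: line.split(","))
            .map(lambda token: token.strip())
            .filter(lambda token: token != ""))

print(products.count())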
Try it like this -
Initial data -
df = spark.read.text("/FileStore/word_count/citrus_fruit.txt")
df.show(5, False)
+-------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------+
|citrus fruit,semi-finished bread,margarine,ready soups |
|tropical fruit,yogurt,coffee |
|whole milk |
|pip fruit,yogurt,cream cheese ,meat spreads |
|other vegetables,whole milk,condensed milk,long life bakery product|
+-------------------------------------------------------------------+
Now count the number of words in each line -
from pyspark.sql import functions as F
df2 = df.withColumn('count', F.size(F.split('value', ',')))
df2.show(5, False)
+-------------------------------------------------------------------+-----+
|value |count|
+-------------------------------------------------------------------+-----+
|citrus fruit,semi-finished bread,margarine,ready soups |4 |
|tropical fruit,yogurt,coffee |3 |
|whole milk |1 |
|pip fruit,yogurt,cream cheese ,meat spreads |4 |
|other vegetables,whole milk,condensed milk,long life bakery product|4 |
+-------------------------------------------------------------------+-----+
Once you have the counts, you just need to sum them up as below -
df3 = df2.groupBy().agg(F.sum('count').alias('word_count_sum'))
df3.show()
+--------------+
|word_count_sum|
+--------------+
|43368 |
+--------------+
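If you only need the total, the intermediate count column is optional; a minimal one-step sketch of the same aggregation over the raw lines -

from pyspark.sql import functions as F

# Split, measure, and sum in a single aggregation.
df.select(F.sum(F.size(F.split('value', ','))).alias('word_count_sum')).show()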
Now you can write this dataframe to a file as below (I am using csv here) -
df3.write.format('csv').mode('overwrite').save('/FileStore/tmp/')
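Note that save() writes a directory of part files rather than a single CSV. If one file is wanted, a common option, sketched here, is to coalesce to a single partition first -

# Collapse to one partition so the output directory contains
# a single part file (fine for a tiny result like this one).
df3.coalesce(1).write.format('csv').mode('overwrite').save('/FileStore/tmp/')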