Python - Polars - value counts on string column
How to apply a word count on a Polars DataFrame
I have a string column and I want to count how often each word occurs across all of the text.
Thanks
Example DataFrame:
0 Would never order again.
1 I'm not sure it gives me any type of glow and ...
2 Goes on smoothly a bit sticky and color is clo...
3 Preferisco altri prodotti della stessa marca.
4 The moisturizing advertised is non-existent.
If I were using pandas, it would look like this:
df.Description.str.split(expand=True).stack().value_counts().reset_index()
Result:
index 0
0 the 2
1 and 2
2 brown 2
3 is 2
4 any 1
5 The 1
6 moisturizing 1
7 like 1
8 I'm 1
9 not 1
10 closer 1
11 stessa 1
12 prodotti 1
13 non-existent. 1
14 advertised 1
15 I 1
16 of 1
17 order 1
...
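You can do it like this:
import polars as pl

csv = """
0, Would never order again.
1, I'm not sure it gives me any type of glow and ...
2, Goes on smoothly a bit sticky and color is clo...
3, Preferisco altri prodotti della stessa marca.
4, The moisturizing advertised is non-existent.
""".encode()
# Split each line on spaces, flatten the lists into a single "words" column,
# count the rows per word, sort by count, and drop empty strings
# (the space after each comma yields an empty first token).
# This uses an older Polars API; newer releases rename flatten -> explode,
# groupby -> group_by, reverse -> descending, str.lengths -> str.len_chars,
# and pl.count -> pl.len.
(pl.read_csv(csv, has_header=False, new_columns=["idx", "lines"])
.select(pl.col("lines").str.split(" ").flatten().alias("words"))
.groupby("words").agg(pl.count())
.sort("count", reverse=True)
.filter(pl.col("words").str.lengths() > 0)
)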
Or like this:
(pl.read_csv(csv, has_header=False, new_columns=["idx", "lines"])
.select(pl.col("lines").str.split(" ").flatten().alias("words"))
.to_series()
.value_counts()
.filter(pl.col("words").str.lengths() > 0)
)
Both output:
shape: (35, 2)
┌────────┬───────┐
│ words ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════╪═══════╡
│ is ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ and ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ order ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ it ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Goes ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ The ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ stessa ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ altri ┆ 1 │
└────────┴───────┘