How to remove the string None from a comma-separated string and count the most common words in a row?

Question

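I have a df in which each row is a single string containing several comma-separated elements.

Each row contains words of interest (e.g. Car, Bus) and the word None. Some rows contain only the word None.

Here is an example of the df: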

Col
Car, None, None, Car, Bus, None
None
Bus, Bus, None, Car, Car, None
None, None, None

Here is an example of the expected result:

Col	Most common words
Car, Car, Bus	Car (2)
Bus, Bus, Car, Car	Bus (2), Car (2)

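In short: I need to remove None from the rows that contain words of interest, drop the rows that contain only None, and finally count the remaining words of interest in each row.

Is there a way to do this in Python?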

Answer 1

You can split the "Col" column with .str.split(", "), filter out the None values and the resulting empty lists, and count the unique items with .value_counts():

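import pandas as pd

# Split each row on ", " and drop the literal string "None"
df.Col = df.Col.str.split(", ").apply(lambda x: [v for v in x if v != "None"])
# Drop rows whose list is now empty (they contained only "None");
# .copy() avoids a SettingWithCopyWarning on the assignments below
df = df[df.Col.str.len() > 0].copy()

# Count each remaining word and format the counts as "word (count)"
df["Most common words"] = df.Col.apply(
    lambda x: ", ".join(
        f"{a} ({b})" for a, b in pd.Series(x).value_counts().to_dict().items()
    )
)
# Join the filtered words back into a comma-separated string
df.Col = df.Col.apply(", ".join)
print(df)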

Prints:

                  Col Most common words
0       Car, Car, Bus  Car (2), Bus (1)
2  Bus, Bus, Car, Car  Bus (2), Car (2)

The df used:

                               Col
0  Car, None, None, Car, Bus, None
1                             None
2   Bus, Bus, None, Car, Car, None
3                 None, None, None
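
The output above lists every word with its count, while the expected result in the question keeps only the words tied for the highest count (Car (2) for the first row). A minimal sketch of that variant follows, assuming the same sample df; it swaps in collections.Counter for .value_counts(), and the helper name most_common_words and the variable names are hypothetical, not part of the original answer.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Col": [
            "Car, None, None, Car, Bus, None",
            "None",
            "Bus, Bus, None, Car, Car, None",
            "None, None, None",
        ]
    }
)


def most_common_words(words):
    """Return 'word (count)' for every word tied for the highest count."""
    counts = Counter(words)
    top = max(counts.values())
    # Counter keeps first-seen order, so ties appear in the row's own order.
    return ", ".join(f"{w} ({c})" for w, c in counts.items() if c == top)


# Split each row and drop the literal string "None".
words = df["Col"].str.split(", ").apply(lambda xs: [x for x in xs if x != "None"])
# Keep only rows that still contain at least one word.
words = words[words.map(len) > 0]

result = pd.DataFrame(
    {
        "Col": words.apply(", ".join),
        "Most common words": words.apply(most_common_words),
    }
)
print(result)
```

With the sample data, rows 0 and 2 survive, and the second column should read Car (2) and Bus (2), Car (2), matching the expected result in the question.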

python

dataframe

pandas

python-re