How to count rows of text based on certain endings of words in the text

Question

I am trying to count the rows in my dataframe that contain the letters 'red' in some form, either as a separate word or as part of a word.

import pandas as pd

df = pd.DataFrame({'id': [10, 46, 75, 12, 99],
                   'text': [['The blurred vision is no good'],
                            ['start', '15', 'tag', '#redding'],
                            [],
                            ['The books were blue instead'],
                            ['Red is the new Black ']
                            ]
                   })

The output should count rows 0, 1, and 4, i.e. count=3.

I tried the following code:

df['text'].str.contains(r'[a-zA-Z]red+', na=False).sum()

But it did not work. I would appreciate it if anyone could help me fix it.

Answer 1

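Because the 'text' column holds lists of strings, I would first join the individual strings of each list with a space.

Then I would lowercase the joined string for case-insensitive matching, and finally use contains on the whole joined string. Any occurrence of 'red', in whatever form, can easily be filtered this way: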

>>> df['text'].str.join(" ").str.lower().str.contains('red')
0     True
1     True
2    False
3    False
4     True
Name: text, dtype: bool

And the row count:

>>> df['text'].str.join(" ").str.lower().str.contains('red').sum()
3
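
To see why the pattern from the question misses some rows even after joining, it may help to try it with the standard re module. This is only a sketch and not part of the original answer: the rows list is a plain-Python copy of the 'text' column, and the names joined, strict and loose are made up for illustration.

```python
import re

# Plain-Python copy of the 'text' column from the question (illustrative only).
rows = [
    ['The blurred vision is no good'],
    ['start', '15', 'tag', '#redding'],
    [],
    ['The books were blue instead'],
    ['Red is the new Black '],
]

# Same first step as above: join each list into one string.
joined = [" ".join(lst) for lst in rows]

# Pattern from the question: requires a letter directly before a lowercase "red".
strict = [bool(re.search(r'[a-zA-Z]red+', s)) for s in joined]

# Plain case-insensitive search for "red" anywhere in the string.
loose = [bool(re.search(r'red', s, flags=re.IGNORECASE)) for s in joined]

print(strict)      # [True, False, False, False, False]
print(loose)       # [True, True, False, False, True]
print(sum(loose))  # 3
```

The stricter pattern skips '#redding' (the character before 'red' is '#') and 'Red' (capital R at the start of the string), which is why a plain, case-insensitive search for 'red' fits the expected count better.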

Answer 2

One option is to use any inside a generator expression to check whether the string "red" appears in any of the strings of each sublist:

out = sum(any(True for x in lst if 'red' in x.lower()) for lst in df['text'])

Output:

3
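
The same idea can be wrapped in a small helper so the search term is not hard-coded. This is a minimal sketch, assuming the data is available as an iterable of lists of strings; count_rows_containing is a hypothetical name, not a pandas function.

```python
def count_rows_containing(rows, needle):
    """Count the sublists in which at least one string contains `needle`, ignoring case."""
    needle = needle.lower()
    # any() is False for an empty sublist, so empty rows are never counted.
    return sum(any(needle in s.lower() for s in lst) for lst in rows)


# Plain-Python copy of the 'text' column from the question (illustrative only).
rows = [
    ['The blurred vision is no good'],
    ['start', '15', 'tag', '#redding'],
    [],
    ['The books were blue instead'],
    ['Red is the new Black '],
]

count = count_rows_containing(rows, 'red')
print(count)  # 3
```

Since iterating over a pandas Series yields its elements, passing df['text'] instead of rows should work the same way.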
