Filtering web articles by keywords inside of a loop

Question

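I wrote a function that scrapes web articles, but I want to adapt it so that it checks whether an article is relevant to me (based on a list of keywords) and ignores it if it is not. I have found several ways to check whether one string is contained in another, but somehow I cannot get them to work inside a for loop. Here is a simplified example of the function: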

combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
caps = []

for i in combos:
    
    for j in my_favorites:
        if j not in i:
            continue
    
    caps.append(i.upper())
    
print(caps)

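If a string contains none of my favorite fruits, I want to skip to the next iteration of the outer loop. But every string in the list passes the filter: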

['APPLE AND PEAR', 'PEAR AND BANANA', 'APPLE AND PEACH', 'BANANA AND KIWI', 'PEACH AND ORANGE']

Can someone explain what I am misunderstanding here?

Answer 1

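You need to move caps.append(i.upper()) into an else branch, and break out of the inner loop once a keyword has matched, so that a string containing several keywords is appended only once.

combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
caps = []

for i in combos:

    for j in my_favorites:
        if j not in i:
            continue  # this keyword is absent, try the next one
        else:
            caps.append(i.upper())
            break  # stop after the first match to avoid duplicates

print(caps)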
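
Python's for loop can also carry its own else clause, which runs only when the loop finishes without hitting break. As an alternative sketch (not the code from the answer above), this lets the outer loop skip a string when none of the keywords is found, which is close to what the question was trying to do with continue:

```python
combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
caps = []

for i in combos:
    for j in my_favorites:
        if j in i:
            break      # a favorite was found, no need to check the rest
    else:
        continue       # inner loop ended without break: no favorite, skip this string
    caps.append(i.upper())

print(caps)  # ['APPLE AND PEAR', 'APPLE AND PEACH', 'PEACH AND ORANGE']
```

Here the continue sits in the body of the outer loop (inside the inner loop's else clause), so it really does move on to the next item of combos.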

Answer 2

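You append the uppercased combos item whether or not a keyword is present.

The continue statement only affects the inner loop. So you iterate over the whole my_favorites list and, once that loop has finished, the uppercase version of i is appended to caps regardless of what was found.

The code below does what you want: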

combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
caps = []

for i in combos:
    if any(fav in i for fav in my_favorites):
        caps.append(i.upper())

print(caps)
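
The same check can be wrapped in a small helper, which also makes it easy to switch between "contains at least one keyword" and "contains every keyword". This is only a sketch: the function name filter_by_keywords and the require_all parameter are made up for illustration and do not come from the answer above.

```python
def filter_by_keywords(texts, keywords, require_all=False):
    """Return the texts that contain any (or, if require_all is set, all) of the keywords."""
    check = all if require_all else any
    return [text for text in texts if check(kw in text for kw in keywords)]


combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']

at_least_one = filter_by_keywords(combos, my_favorites)
every_keyword = filter_by_keywords(combos, my_favorites, require_all=True)

print(at_least_one)   # ['apple and pear', 'apple and peach', 'peach and orange']
print(every_keyword)  # ['apple and peach']
```

Both any and all stop at the first result that decides the outcome, so passing a generator expression avoids checking keywords that no longer matter.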

Answer 3

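I find regular expressions to be the best way to filter text, especially when the input is a large dataset. Below, I use Python's built-in re module to compile the required pattern and then use the compiled pattern's search method to keep the strings that contain any of the keywords. Note that match would only find a keyword at the very start of a string, and that re.escape protects against keywords containing regex metacharacters.

import re

combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']

my_favorites = ['apple', 'peach']

# Build an alternation such as "apple|peach", escaping each keyword
regex_pattern = "|".join(re.escape(fav) for fav in my_favorites)

r = re.compile(regex_pattern)

# search looks for the pattern anywhere in the string
filtered_list = filter(r.search, combos)

caps = [item.upper() for item in filtered_list]

print(caps)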
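
For real article text it may be useful to match whole words only and to ignore case, so that a keyword like apple does not match pineapple. The following is a sketch under those assumptions; the helper name build_keyword_pattern and the sample texts are hypothetical and not part of the answer above.

```python
import re


def build_keyword_pattern(keywords):
    """Compile a case-insensitive pattern that matches any keyword as a whole word."""
    escaped = (re.escape(kw) for kw in keywords)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


pattern = build_keyword_pattern(['apple', 'peach'])

# Made-up sample texts for illustration
texts = ['Apple and pear', 'pineapple and kiwi', 'kiwi and PEACH']
matches = [text for text in texts if pattern.search(text)]

print(matches)  # ['Apple and pear', 'kiwi and PEACH']
```

The \b word boundaries also mean that plural forms such as apples are not matched, so whether this is the right trade-off depends on the keyword list.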

python

filter

statements