Python: How to use list of keywords to search for a string in a text
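I'm writing a program that loops through multiple .txt files and searches each one for any number of pre-specified keywords. I'm having trouble finding a way to pass in the list of keywords to search for.
The code below currently returns the following error:
TypeError: 'in <string>' requires string as left operand, not list
I know the error is caused by the keyword list, but I don't know how to supply a large number of keywords without triggering it.
Current code: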
from os import listdir
keywords = ['Example', 'Use', 'Of', 'Keywords']
with open("/home/user/folder/project/result.txt", "w") as f:
    for filename in listdir("/home/user/folder/project/data"):
        with open('/home/user/folder/project/data/' + filename) as currentFile:
            text = currentFile.read()
            #Error Below
            if (keywords in text):
                f.write('Keyword found in ' + filename[:-4] + '\n')
            else:
                f.write('No keyword in ' + filename[:-4] + '\n')
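The "#Error Below" comment in the code above marks the line that raises the error. I'm not sure why I can't use the list directly to search for the keywords. Any help is appreciated, thanks!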
Try iterating over the list and checking whether each element is in the text:
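for keyword in keywords:
    if keyword in text:
        f.write('Keyword found in ' + filename[:-4] + '\n')
        break
else:
    # This else belongs to the for loop: it runs only when no keyword
    # matched, i.e. the loop finished without hitting break.
    f.write('No keyword in ' + filename[:-4] + '\n')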
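You can't use in to check whether a list is contained in a string. With a string on the right-hand side, in performs a substring test, so the left operand must itself be a string.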
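To make the failure concrete, here is a small sketch. The sample `text` is made up for illustration; the keyword list is the one from the question. It shows that passing the whole list raises the TypeError, while testing one keyword at a time works.

```python
keywords = ['Example', 'Use', 'Of', 'Keywords']
text = 'An Example sentence.'  # made-up sample text, not from the question

# With a str on the right, `in` needs a str on the left,
# so handing it the whole list raises TypeError.
try:
    keywords in text
    raised = False
except TypeError:
    raised = True

# Testing each keyword separately is a plain substring check.
matches = [keyword for keyword in keywords if keyword in text]

print(raised, matches)
```

Note that the substring check is case-sensitive, so 'Use' would not match 'use'.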
I would use regular expressions, since they are designed specifically for searching text for substrings.
You only need the re.search block. I added findall and finditer examples to demystify them.
import re
# Let's pretend the 4 sentences in `text` are 4 different files
text = '''Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum'''.split(sep='. ')
# add more keywords here
keywords = [r'publishing', r'industry']
# join the keywords into one alternation pattern: publishing|industry
regex = '|'.join(keywords)
for t in text:
    lst = re.findall(regex, t, re.I)  # re.I makes the match case-insensitive
    for el in lst:
        print(el)
    iterator = re.finditer(regex, t, re.I)
    for el in iterator:
        print(el.span())
    if re.search(regex, t, re.I):
        print('Keyword found in `' + t + '`\n')
    else:
        print('No keyword in `' + t + '`\n')
Output:
industry
(65, 73)
Keyword found in `Lorem Ipsum is simply dummy text of the printing and typesetting industry`
industry
(25, 33)
Keyword found in `Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book`
No keyword in `It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged`
publishing
(132, 142)
Keyword found in `It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum`
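One caveat with joining raw keywords using '|': a keyword that contains regex metacharacters (such as '.') is interpreted as a pattern, and a keyword can also match inside a longer word. The following is a hedged sketch of one way to tighten this, not part of the original answer. The helper name `build_keyword_pattern` and the sample strings are hypothetical; it assumes every keyword starts and ends with a word character, because it relies on `\b` word boundaries.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word.

    Hypothetical helper for illustration. Assumes each keyword starts and
    ends with a letter, digit, or underscore so that \\b behaves as expected.
    """
    # re.escape makes characters like '.' match literally.
    escaped = (re.escape(keyword) for keyword in keywords)
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b', re.IGNORECASE)

pattern = build_keyword_pattern(['publishing', 'industry', '3.5'])

found = pattern.findall('Desktop Publishing for the industry, release 3.5')
print(found)

# 'industrywide' is a longer word and '315' only matches an unescaped '.',
# so neither counts as a keyword hit here.
print(pattern.search('industrywide release 315'))
```

Compiling the pattern once also avoids rebuilding it for every file when the same keywords are reused.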
You can replace
if (keywords in text):
    ...
with
if any(keyword in text for keyword in keywords):
    ...
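Putting that replacement into the question's loop might look like the sketch below. The function name `report_keywords` and its parameters are assumptions made for illustration. It also swaps `listdir` plus `filename[:-4]` for `pathlib`, which is a substitution on my part rather than something the question uses, and it only looks at files ending in .txt.

```python
from pathlib import Path

def report_keywords(data_dir, result_path, keywords):
    """Write one line per .txt file saying whether any keyword occurs in it.

    Hypothetical helper: names and signature are illustrative only.
    """
    with open(result_path, 'w') as result:
        # sorted() gives a stable order; glob() order is otherwise arbitrary.
        for path in sorted(Path(data_dir).glob('*.txt')):
            text = path.read_text()
            # any() stops at the first keyword that is a substring of text.
            if any(keyword in text for keyword in keywords):
                result.write('Keyword found in ' + path.stem + '\n')
            else:
                result.write('No keyword in ' + path.stem + '\n')
```

With the paths from the question, the call would be report_keywords('/home/user/folder/project/data', '/home/user/folder/project/result.txt', keywords). The result file should live outside the data folder so it is not scanned itself.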