Splitting a text file into sections, then searching for key phrases in those sections

I'm new to Python and already a fan of the language. I have a program that does the following:

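  1. Opens a text file whose sections are separated by lines of asterisks (***)

  2. Uses split() to break the text file into the sections delimited by those asterisk lines. The asterisk line is identical throughout the text file.

  3. I want my code to go through each section and do the following:

    • I have a dictionary whose keys are "key phrases". The value of every key in the dictionary is 0.

    • The code needs to go through each section produced by the split and check whether each of the dictionary's keys appears in that section. If a key phrase is found, the value for that key is increased by 1.

    • Once the code has gone through a section, counted how many of the keys appear in it, and updated the values accordingly, it should print the dictionary keys and the counts (values) for that section, reset the values to 0, and move on to the next section of text, starting again from #3.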

My code is:

from bs4 import BeautifulSoup
import re
import time
import random
import glob, os
import string


termz = {'does not exceed' : 0, 'shall not exceed' : 0, 'not exceeding' : 0,
  'do not exceed' : 0, 'not to exceed' : 0, 'shall at no time exceed' : 0,
  'shall not be less than' : 0, 'not less than' : 0}
with open('Q:/hello/place/textfile.txt', 'r') as f:
  sections = f.read().split('**************************************************')
  for p in sections[1:]:
      for eachKey in termz.keys():
        if eachKey in p:
          termz[eachKey] = termz.get(eachKey) + 1
          print(termz)  


#print(len(sections))  #there are thirty sections      

        #should be if code encounters ***** then it resets the counters and just moves on....
        #so far only can count the phrases over the entire text file....

#GO BACK TO .SPLIT()
# termz = dict.fromkeys(termz,0) #resets the counter

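It spits out what it is counting, but the numbers don't match the first section, the last section, or even the whole file. I don't know what it is actually tracking.

The print statement at the end is misplaced. The line termz = dict.fromkeys(termz,0) is a way I found to reset the dictionary's values to 0, but it is commented out because I'm not sure what to do with it. Essentially, I'm struggling with Python control structures. If someone could point me in the right direction, that would be great.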

if eachKey in p:
    termz[eachKey] += 1  # might do it
    print(termz)

Your code is very close. See the comments below:

termz = {
    'does not exceed': 0,
    'shall not exceed': 0,
    'not exceeding': 0,
    'do not exceed': 0,
    'not to exceed': 0,
    'shall at no time exceed': 0,
    'shall not be less than': 0,
    'not less than': 0
}

with open('Q:/hello/place/textfile.txt', 'r') as f:
    sections = f.read().split('**************************************************')

    # Skip the first section. (I assume this is on purpose?)
    for p in sections[1:]:
        for eachKey in termz:
            if eachKey in p:
                # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
                termz[eachKey] += 1

        # Move this outside of the inner loop
        print(termz)

        # After printing the results for that section, reset the counts
        termz = dict.fromkeys(termz, 0)
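
For comparison, the same per-section logic can be sketched as a function that builds a fresh dictionary for every section, so no reset step is needed. This is only one possible arrangement, not the original program: the names count_phrases_per_section, text, phrases, and separator are hypothetical, and the function keeps the original behaviour of skipping the first section.

```python
def count_phrases_per_section(text, phrases, separator):
    """Return one {phrase: 0 or 1} dict for each section after the first.

    Hypothetical helper: a phrase scores 1 if it appears anywhere in the
    section and 0 otherwise, mirroring the `if eachKey in p` check.
    """
    results = []
    # [1:] skips the text before the first separator, as the original code does
    for section in text.split(separator)[1:]:
        # A new dict per section, so counts never leak into the next section
        results.append({phrase: int(phrase in section) for phrase in phrases})
    return results


if __name__ == '__main__':
    separator = '**************************************************'
    sample = ('intro, ignored' + separator +
              'shall not exceed\nnot to exceed' + separator +
              'not less than')
    phrases = ['shall not exceed', 'not to exceed', 'not less than']
    for counts in count_phrases_per_section(sample, phrases, separator):
        print(counts)
```

Returning a list rather than printing inside the loop also makes the results easy to inspect or write out later.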

Edit

Sample input and output:

input = '''
Section 1:

This section is ignored.
does not exceed
**************************************************
Section 2:

shall not exceed
not to exceed
**************************************************
Section 3:

not less than'''

termz = {
    'does not exceed': 0,
    'shall not exceed': 0,
    'not exceeding': 0,
    'do not exceed': 0,
    'not to exceed': 0,
    'shall at no time exceed': 0,
    'shall not be less than': 0,
    'not less than': 0
}

sections = input.split('**************************************************')

# Skip the first section. (I assume this is on purpose?)
for p in sections[1:]:
    for eachKey in termz:
        if eachKey in p:
            # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
            termz[eachKey] += 1

    # Move this outside of the inner loop
    print(termz)

    # After printing the results for that section, reset the counts
    termz = dict.fromkeys(termz, 0)

# OUTPUT:
# {'not exceeding': 0, 'shall not exceed': 1, 'not less than': 0, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 1, 'do not exceed': 0, 'does not exceed': 0}
# {'not exceeding': 0, 'shall not exceed': 0, 'not less than': 1, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 0, 'do not exceed': 0, 'does not exceed': 0}
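
One thing to be aware of: the `in` test only records whether a phrase is present in a section, so each value is either 0 or 1 no matter how often the phrase appears. If the number of occurrences is what you want instead, str.count is one option. The sketch below assumes that reading of the requirement; count_occurrences and the sample sentence are made up for illustration.

```python
def count_occurrences(section, phrases):
    """Count how many times each phrase occurs in one section.

    str.count counts non-overlapping, case-sensitive matches.
    """
    return {phrase: section.count(phrase) for phrase in phrases}


if __name__ == '__main__':
    section = 'It shall not exceed 5 and shall not exceed 10, not to exceed 20.'
    print(count_occurrences(section, ['shall not exceed', 'not to exceed']))
```

With this sample sentence, 'shall not exceed' is counted twice and 'not to exceed' once.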