Separate blocks of text python

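I'd like to know how to separate blocks of text that live in the same text file. An example is below. Basically I have 2 items: one runs from the "Channel 9" line to the "Brief:.." line, and the other runs from the "Southern ..." line to its "Brief" line. How can I split them into 2 text files using Python? I think the common dividing line is "(female 16+)". Many thanks!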


Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left 
,100 out
hosted by Peter Hitchener
A woman selling her caravan near Bendigo has been left ,100 out of 
pocket after an elderly couple made the purchase with counterfeit money. 
The wildlife worker tried to use the notes to pay for a house deposit, but an 
agent noticed the notes were missing the Coat of Arms on one side. 


Brief: Radio & TV
Demographics: 153,000 (male 16+) • 177,000 (female 
16+)

Southern Cross Victoria Bendigo (1 item)


Heathcote Police are warning the residents to be on the 
lookout a
hosted by Jo Hall
Heathcote Police are warning the residents to be on the lookout after a large 
dash of fake  note was discovered. Victim Marianne Thomas was given 
counterfeit notes from a caravan. The Heathcote resident tried to pay the 
house deposit and that's when the counterfeit notes were spotted. Thomas 
says the caravan is in town for the Spanish Festival.


Brief: Radio & TV
Demographics: 4,000 (male 16+) • 3,000 (female 16+)

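Actually, I suspect you really want to break either after the line that starts with Demographics:, or before a line that ends with (1 item), (2 items), or similar.

But whatever you want to break on, it takes two steps:

  1. Come up with a rule that you can turn into a function you call on each line.
  2. Write some code that groups things based on the result of that function.

Let's use your rule. A function could be: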

def is_last_line(line):
    return line.strip().endswith('(female 16+)')

Now you can use that function to group things:

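i = 1
infile = open('infile.txt')  # assumed name of the input file
outfile = open(f'outfile{i}.txt', 'w')
for line in infile:
    outfile.write(line)  # write the line as-is; strip() would drop the newline
    if is_last_line(line):
        # close the finished file before starting the next one
        outfile.close()
        i += 1
        outfile = open(f'outfile{i}.txt', 'w')
outfile.close()
infile.close()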

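You can make this more concise with things like itertools.groupby, itertools.takewhile, iter, or other functions. Or you can write a generator function that still does the work manually but yields groups of lines, which makes creating the new files much simpler (and lets us use with blocks). But being explicit like this is probably easier for a novice to understand (and to debug, and to extend later), at the cost of being a bit verbose.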
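As a rough illustration of the generator idea, here is a minimal sketch. The names line_groups and split_file and the prefix parameter are made up for this example, and the rule is the same is_last_line function as above.

```python
def is_last_line(line):
    return line.strip().endswith('(female 16+)')

def line_groups(lines):
    """Yield lists of lines, ending each list at a line matching is_last_line."""
    group = []
    for line in lines:
        group.append(line)
        if is_last_line(line):
            yield group
            group = []
    if group:  # trailing lines that were never closed by a divider
        yield group

def split_file(infile, prefix='outfile'):
    """Write each group to prefix1.txt, prefix2.txt, ... and return the names."""
    names = []
    for i, group in enumerate(line_groups(infile), start=1):
        name = f'{prefix}{i}.txt'
        with open(name, 'w') as outfile:
            outfile.writelines(group)
        names.append(name)
    return names
```

With this, splitting a file would look something like `with open('infile.txt') as f: split_file(f)`, where infile.txt is again an assumed filename. Because each output file is opened only once its group is complete, no empty trailing file is created.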


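For example, from the way you phrased the question, it's not really clear whether you want the Demographics: line to appear in the output files. If you don't, how to change it should be obvious: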

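    if not is_last_line(line):
        outfile.write(line)  # write the line as-is; strip() would drop the newline
    else:
        outfile.close()  # close the finished file before starting the next one
        i += 1
        outfile = open(f'outfile{i}.txt', 'w')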
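The other rule suggested at the top of this answer, breaking before a line that ends with (1 item), (2 items), or similar, could be sketched as follows. The regular expression and the names HEADER, is_first_line and split_before_headers are assumptions for illustration. One advantage of this rule is that it does not care that the first Demographics: line in the posted sample wraps onto a second line.

```python
import re

# Assumption: a section header ends with "(1 item)", "(2 items)", and so on.
HEADER = re.compile(r'\(\d+ items?\)\s*$')

def is_first_line(line):
    return HEADER.search(line) is not None

def split_before_headers(lines):
    """Return a list of sections, starting a new section at each header line."""
    sections = []
    for line in lines:
        # Start a new section at a header, or at the very first line
        if is_first_line(line) or not sections:
            sections.append([])
        sections[-1].append(line)
    return sections
```

Each section is a list of lines that can be passed to writelines on its own output file.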

Here is something hard-coded that does the job:

s = """Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left ,100 out hosted by Peter Hitchener A woman selling her caravan near Bendigo has been left ,100 out of pocket after an elderly couple made the purchase with counterfeit money. The wildlife worker tried to use the notes to pay for a house deposit, but an agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo Hall Heathcote Police are warning the residents to be on the lookout after a large dash of fake  note was discovered. Victim Marianne Thomas was given counterfeit notes from a caravan. The Heathcote resident tried to pay the house deposit and that's when the counterfeit notes were spotted. Thomas says the caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)"""

part_1 = s[s.index("Channel 9"):s.index("Southern Cross")]

part_2 = s[s.index("Southern Cross"):]

Then save them to files.
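One possible way to do the saving step is sketched below. The helper name save_parts, its parameters, and the part_1.txt / part_2.txt naming scheme are all assumptions, not something the question specifies.

```python
from pathlib import Path

def save_parts(parts, directory='.', prefix='part_'):
    """Write each string in parts to prefix1.txt, prefix2.txt, ...; return the paths."""
    paths = []
    for i, part in enumerate(parts, start=1):
        path = Path(directory) / f'{prefix}{i}.txt'
        path.write_text(part, encoding='utf-8')
        paths.append(path)
    return paths
```

With the variables from the snippet above, the call would be `save_parts([part_1, part_2])`.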

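It looks like the lines starting with "Demographics:" act as the real dividers. I would use regular expressions in two ways: first, to split the text on those lines; second, to extract the lines themselves. The results can then be combined to reconstruct the blocks: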

import re
# text is assumed to hold the whole contents of the input file as one string
DIVIDER = r'Demographics: .+' # Make it tunable, in case you change your mind
blocks_1 = re.split(DIVIDER, text)    # the text between the divider lines
blocks_2 = re.findall(DIVIDER, text)  # the divider lines themselves
blocks = ['\n\n'.join(pair) for pair in zip(blocks_1, blocks_2)]
blocks[0]
#Channel 9 (1 item)\n\nA woman selling her caravan near ... 
#... Demographics: 153,000 (male 16+) • 177,000 (female 16+)
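To see how the pieces line up, here is a small self-contained run of the same idea on a shortened stand-in for the file contents (the sample string is made up for illustration). It also joins with a single newline after strip() instead of '\n\n', which is a small variation on the code above, not a correction of it.

```python
import re

DIVIDER = r'Demographics: .+'

# Tiny stand-in for the real file contents (an assumption for illustration)
text = (
    "Channel 9 (1 item)\n\nfirst story\n\n"
    "Brief: Radio & TV\nDemographics: 1 (male 16+) • 2 (female 16+)\n\n"
    "Southern Cross Victoria Bendigo (1 item)\n\nsecond story\n\n"
    "Brief: Radio & TV\nDemographics: 3 (male 16+) • 4 (female 16+)\n"
)

bodies = re.split(DIVIDER, text)      # one more piece than there are dividers
dividers = re.findall(DIVIDER, text)  # the divider lines themselves
# strip() tidies the blank lines left around each piece;
# zip() stops at the shorter list, so whatever follows the last divider is dropped
blocks = [body.strip() + '\n' + div for body, div in zip(bodies, dividers)]
```

Note that `.+` stops at the end of a line, so a Demographics: line that wraps onto a second line would only be matched up to the line break.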

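Here is a modified example of something similar I did recently, which basically copies your text line by line. The core logic appends each line to the file named by the current filename, and that filename is reset once a section ends. The first line of the next section is then used as the new filename.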

#!/usr/bin/env python
import re

data = """
Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left ,100 out hosted by
Peter Hitchener A woman selling her caravan near Bendigo has been left ,100
out of pocket after an elderly couple made the purchase with counterfeit money.
The wildlife worker tried to use the notes to pay for a house deposit, but an
agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo
Hall Heathcote Police are warning the residents to be on the lookout after a
large dash of fake  note was discovered. Victim Marianne Thomas was given
counterfeit notes from a caravan. The Heathcote resident tried to pay the house
deposit and that's when the counterfeit notes were spotted. Thomas says the
caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)
"""



current_file = None
for line in data.split('\n'):

    # Set initial filename
    if current_file is None and line != '':
        current_file = line + '.txt'

    # This is to handle the blank lines between sections
    if current_file is None:
        continue

    # Append, so each section accumulates in its own file
    with open(current_file, "a") as text_file:
        text_file.write(line + "\n")

    # Reset filename if we have finished this section
    # which is identified by:
    #    starts with Brief: - ^Brief:
    #    contains some arbitrary amount of text - .*
    #    ends with ) - \)$
    if re.match(r'^Brief:.*\)$', line) is not None:
        current_file = None

This will output the following files:

Channel 9 (1 item).txt
Southern Cross Victoria Bendigo (1 item).txt