How to get the start and end offsets of paragraphs from a text file in Python 3

Question

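I am trying to get the start and end offsets of the paragraphs in a text file in Python. I tried the code below. It gives start and end offsets, but if a paragraph begins with white space or a tab, it is not treated as a paragraph.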

  import re

  # textFile is assumed to hold the file contents as a single string
  paraStartOffset = []
  paraEndOffset = []

  for match in re.finditer(r'(?s)((?:[^\n]?)+)', textFile):
      paraStartOffset.append(match.start())
      paraEndOffset.append(match.end())

  print("start Offset --> ", paraStartOffset)
  print("end Offset --> ", paraEndOffset)

Can anyone point out what I am missing? Thanks.

Answer 1

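I think this question / answer covers almost exactly what you are looking for. The code (taken from that answer) works fine when I test it on text whose paragraphs also begin with leading whitespace.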

import re

# DATA holds the whole text as one string; each match is one paragraph
for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())

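It returns the following when I run it on my test text (taken from Bram Stoker's Dracula). The first paragraph is a standard one. The second begins with spaces. The third begins with a tab.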

Result: (start and end offset of each paragraph)

0 630
631 1029
1030 1125

Test text: (I could not reproduce the original formatting exactly, but anyway...)

_3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at
Vienna early next morning; should have arrived at 6:46, but train was an
hour late. Buda-Pesth seems a wonderful place, from the glimpse which I
got of it from the train and the little I could walk through the
streets. I feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. The
impression I had was that we were leaving the West and entering the
East; the most western of splendid bridges over the Danube, which is
here of noble width and depth, took us among the traditions of Turkish
rule.

  "My Friend.--Welcome to the Carpathians. I am anxiously expecting
you. Sleep well to-night. At three to-morrow the diligence will
start for Bukovina; a place on it is kept for you. At the Borgo
Pass my carriage will await you and will bring you to me. I trust
that your journey from London has been a happy one, and that you
will enjoy your stay in my beautiful land.

    Just before I was leaving, the old lady came up to my room and said in a
very hysterical way:
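
For reference, here is a minimal Python 3 sketch of the same approach wrapped in a function. The function name `paragraph_offsets` and the `sample` string are hypothetical and only serve as illustration. The pattern is the one from the code above, and it assumes paragraphs are separated by a completely empty line.

```python
import re

# Same pattern as above: a paragraph is a run of consecutive non-empty lines.
PARAGRAPH_RE = re.compile(r'(?s)((?:[^\n][\n]?)+)')


def paragraph_offsets(text):
    """Return a list of (start, end) character offsets, one pair per paragraph."""
    return [(m.start(), m.end()) for m in PARAGRAPH_RE.finditer(text)]


# Hypothetical sample: a plain paragraph, one starting with spaces, one with a tab.
sample = "ab\ncd\n\n  ef\n\n\tgh"

for start, end in paragraph_offsets(sample):
    print(start, end)
# prints:
# 0 6
# 7 12
# 13 16
```

Two things to keep in mind. The end offset includes the newline that terminates the paragraph's last line, so use `text[start:end].rstrip('\n')` if you want the paragraph without it. Also, these are character offsets into the string, which can differ from byte offsets in the file when the text contains non-ASCII characters.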


python

text-processing

python-2.7

python-3.x