Python: Split text by keyword into Excel rows
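I'm new to programming and have found plenty of helpful threads, but none that covers quite what I need.
I have a text file that looks like this: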
1 of 5000 DOCUMENTS
Copyright 2010 The Deal, L.L.C.
All Rights Reserved
Daily Deal/The Deal
January 12, 2010 Tuesday
HEADLINE: Cadbury slams Kraft bid
BODY:
On cue .....
......
body of article here
......
DEAL SIZE
$ 10-50 Billion
2 of 5000 DOCUMENTS
Copyright 2015 The Deal, L.L.C.
All Rights Reserved
The Deal Pipeline
September 17, 2015 Thursday
HEADLINE: Perrigo rejects formal offer from Mylan
BODY:
(and here again the body of this article)
DEAL SIZE
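As output, I want only the body of each article, each on a new row of a single file (one cell per article body). I have about 5000 articles to process this way, so the output would be 5000 rows and 1 column.
From what I have read, 're' seems to be the best tool for this. The recurring keywords are BODY: and perhaps DOCUMENTS. For each article, how do I extract the text between these keywords into a new row in Excel?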
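import re
# re.split works on text, not on a path, so read the file contents first
with open(r'F:\text.txt', 'r') as f:
    inputtext = f.read()
sections = re.split(r'\n(?=BODY:)', inputtext)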
Or something like this?
section = []
for line in open_file_object:
    if line.startswith('BODY:'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)
I'm a bit lost as to where to look, so thanks in advance!
Edit: thanks to ewwink, this is where I am now:
import re
articlesBody = None
# raw string so that backslashes in the Windows path (e.g. \t) are not treated as escapes
with open(r'F:\CloudStation\Bocconi University\MSc. Thesis\test folder\majortest.txt', 'r') as txt:
    inputtext = txt.read()
    articlesBody = re.findall(r'BODY:(.+?)\d\sDOCUMENTS', inputtext, re.S)
#print(articlesBody)
#print(type(articlesBody))
with open('result.csv', 'w') as csv:
    for item in articlesBody:
        item = item.replace('\n', ' ')
        csv.write('"%s",' % item)
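To work with the file, use with open(r'F:\text.txt', mode), where mode is 'r' for reading and 'w' for writing. To extract the content, use re.findall. Finally, you need to escape newlines (\n), double quotes (") and other special characters.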
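import re
articlesBody = None
with open('text.txt', 'r') as txt:
    inputtext = txt.read()
    # \d+ consumes the whole article number, so no stray digit is left at the end of a body
    articlesBody = re.findall(r'BODY:(.+?)\d+\sof\s5000', inputtext, re.S)
#print(articlesBody)
with open('result.csv', 'w') as csv:
    for item in articlesBody:
        # escape newlines and double the quotes so each body stays in one cell
        item = item.replace('\n', '\\n').replace('"', '""')
        # end each cell with a newline so every article lands in its own row
        csv.write('"%s"\n' % item)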
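As an alternative to escaping by hand, Python's standard csv module can do the quoting itself. The following is only a minimal sketch and not part of the original answer: write_bodies and the sample strings are hypothetical names and data chosen for illustration, and an in-memory buffer stands in for result.csv.

```python
import csv
import io

def write_bodies(bodies, fileobj):
    """Write each article body as one quoted cell on its own row."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL)
    for body in bodies:
        # A one-element list means one column; the csv module doubles
        # embedded quotes and keeps embedded newlines inside the cell.
        writer.writerow([body.strip()])

# Hypothetical sample data: one body with quotes and a newline, one plain.
buf = io.StringIO()
write_bodies(['He said "no"\nand left', 'plain'], buf)
print(buf.getvalue())
```

When writing to a real file instead of the buffer, the csv documentation recommends opening it with newline='', for example open('result.csv', 'w', newline='').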
Side note: try it on a small sample first.
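A small in-memory sample is enough to check the pattern before running it on all 5000 articles. The sketch below is an assumption-laden variant rather than the answer's exact code: extract_bodies and SAMPLE are hypothetical names, and the pattern assumes every article starts with a line of the form "N of M DOCUMENTS". It stops each body at the next such header or at the end of the text, so the last article is captured as well. Trailing fields such as DEAL SIZE stay part of the captured text.

```python
import re

# Hypothetical, shortened sample in the same layout as the input file.
SAMPLE = (
    "1 of 5000 DOCUMENTS\n"
    "HEADLINE: Cadbury slams Kraft bid\n"
    "BODY:\n"
    "On cue\n"
    "DEAL SIZE\n"
    "$ 10-50 Billion\n"
    "2 of 5000 DOCUMENTS\n"
    "HEADLINE: Perrigo rejects formal offer from Mylan\n"
    "BODY:\n"
    "(second body)\n"
    "DEAL SIZE\n"
)

# Lazy match after BODY:, ended by a lookahead for the next article
# header or the end of the text (so the final article is not lost).
BODY_RE = re.compile(r'BODY:(.+?)(?=\n\s*\d+ of \d+ DOCUMENTS|\Z)', re.S)

def extract_bodies(text):
    """Return the stripped text that follows each BODY: marker."""
    return [body.strip() for body in BODY_RE.findall(text)]

bodies = extract_bodies(SAMPLE)
print(len(bodies))
for body in bodies:
    print(repr(body))
```

Because the lookahead does not consume the header, the total count (5000 here) is not hard-coded and the same pattern works for files of any size.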