Python Readline Loop and Subloop
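I am trying to loop through some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I'm just trying to get the relevant data into an array and understand how line iteration and readline() work in Python.
This is what the text looks like: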
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
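This format repeats for a large number of text articles in the same file. So far I've figured out how to pull out the lines that include certain text. For example, I can loop through the file and put all of the article titles in a list like this: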
    a = "Title:"
    titleList = []
    sample = 'sample.txt'
    with open(sample,encoding="utf8") as unstr:
        for line in unstr:
            if a in line:
                titleList.append(line)
Now I want to do the following:
    a = "Title:"
    b = "Full text:"
    d = "Subject:"
    list = []
    sample = 'sample.txt'
    with open(sample,encoding="utf8") as unstr:
        for line in unstr:
            if a in line:
                list.append(line)
            if b in line:
                1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
                2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching Google on this topic. Any pointers would be much appreciated.
If you want to stick with your for loop, you're probably going to need something like this:
    titles = []
    texts = []
    subjects = []
    with open('sample.txt', encoding="utf8") as f:
        inside_fulltext = False
        for line in f:
            if line.startswith("Title:"):
                inside_fulltext = False
                titles.append(line)
            elif line.startswith("Full text:"):
                inside_fulltext = True
                full_text = line
            elif line.startswith("Subject:"):
                inside_fulltext = False
                texts.append(full_text)
                subjects.append(line)
            elif inside_fulltext:
                full_text += line
            else:
                # Possibly throw a format error here?
                pass
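(A couple of things: Python is weird about names, and when you write list = [], you're actually rebinding the label for the built-in list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, given your description of the data, the startswith method is a bit more precise here than a substring test.)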
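To make both points concrete, here is a small sketch. The function name `demo_shadowing` and the example line are made up for illustration; nothing here comes from your file.

```python
def demo_shadowing():
    """Rebind the name `list` locally and show what breaks afterwards."""
    list = []                      # `list` now names an empty list object
    try:
        list("abc")                # normally this would build ['a', 'b', 'c']
    except TypeError as exc:
        return str(exc)            # calling a list instance raises TypeError


message = demo_shadowing()
print(message)

# Substring test vs. prefix test on a hypothetical continuation line
# that happens to mention "Title:" in the middle of the article text.
line = "as the heading said, Title: is just a label\n"
contains_title = "Title:" in line              # matches anywhere in the line
starts_with_title = line.startswith("Title:")  # matches only at the start
print(contains_title, starts_with_title)
```

With `in`, a line of body text that merely mentions "Title:" would be treated as a new title; `startswith` only reacts to lines that actually begin with the label.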
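Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions, though it would let you use a more classic while loop for the whole thing. Myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge cases.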
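For comparison, here is a rough sketch of the iterator variant with a real sub-loop, closer to what the question describes. It sidesteps the StopIteration handling by passing a default to next(). The function name `parse_articles`, the dict layout, and joining the text lines with spaces are my own choices, and io.StringIO stands in for the opened file so the sketch runs on its own.

```python
import io

SAMPLE = """Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
"""


def parse_articles(lines):
    """Collect one dict per article, pulling lines with explicit next() calls."""
    it = iter(lines)
    articles = []
    title = None
    line = next(it, None)              # None signals end of input
    while line is not None:
        if line.startswith("Title:"):
            title = line[len("Title:"):].strip()
        elif line.startswith("Full text:"):
            parts = [line[len("Full text:"):].strip()]
            # Sub-loop: consume lines until "Subject:" or the end of input.
            line = next(it, None)
            while line is not None and not line.startswith("Subject:"):
                parts.append(line.strip())
                line = next(it, None)
            articles.append({"Title": title, "Full text": " ".join(parts)})
            # `line` is now the "Subject:" line (ignored) or None.
        line = next(it, None)
    return articles


articles = parse_articles(io.StringIO(SAMPLE))
for article in articles:
    print(article)
```

With a real file you would call parse_articles(f) inside the with block. The outer loop and the sub-loop share the same iterator, which is what lets the outer loop resume right after the "Subject:" line.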
As your goal is to construct a DataFrame, here is a re + numpy + pandas solution:
    import re
    import pandas as pd
    import numpy as np

    # read the whole file into one string
    with open('sample.txt', encoding="utf8") as f:
        text = f.read()

    keys = ['Subject', 'Title', 'Full text']
    regex = '(?:^|\n)(%s): ' % '|'.join(keys)

    # split the text on the keys; the capture group keeps each key in the result
    chunks = re.split(regex, text)[1:]
    # reshape the flat list into (article, key, [key, value]) so that each
    # article's key/value pairs are grouped together
    df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
       Title                     Full text                                                                                                                                               Subject
    0  title of an article       unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three..  Python
    1  title of another article  again unfortunately the full text of each article,\nis on numerous lines.                                                                               Python
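To see why the reshape works, it helps to look at what re.split returns when the pattern contains a capture group: the captured keys are kept in the result, so the list alternates key, value, key, value. The sketch below uses only re on an in-memory copy of the sample and groups the pairs with plain slicing instead of numpy. The names `SAMPLE`, `pairs`, and `records` are mine, and it assumes every article has exactly the three keys.

```python
import re

SAMPLE = """Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
"""

keys = ["Subject", "Title", "Full text"]
regex = r"(?:^|\n)(%s): " % "|".join(keys)

# The first element is the (empty) text before the first key, hence [1:].
chunks = re.split(regex, SAMPLE)[1:]
print(chunks[:4])

# Pair each key with the value that follows it.
pairs = list(zip(chunks[0::2], chunks[1::2]))

# Group the pairs per article: len(keys) consecutive pairs form one record.
step = len(keys)
records = [
    {key: value.strip() for key, value in pairs[i:i + step]}
    for i in range(0, len(pairs), step)
]
for record in records:
    print(record)
```

Passing `records` to pd.DataFrame should give the same table as above, with the trailing newline of the last value stripped. One caveat that applies to both versions: a line of article text that itself begins with "Title: ", "Subject: " or "Full text: " would be treated as a new key.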