For loop writing rows into variables
I am trying to write a loop in Python that extracts information from the sentence on each line. The input looks like this:
[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
##repost from january 13 , 2004 with a better fit title .
i/p button[+2]##im a more happier person after discovering the i/p button !
dvd player[+1][p]##it practically plays almost everything you give it .
player[+2],sound[-1]##i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .
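I want to extract only the sentence, use the information before ## as the tag, and collect everything into variables that together hold all the information. Expected output: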
Variable: title
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
So this variable should keep its value until a new [t] appears in a line.
Variable: sentence_only
repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button !
it practically plays almost everything you give it .
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .
Variable: tag
i/p button[+2]
dvd player[+1][p]
player[+2],sound[-1]
The current output keeps only the last line instead of the full list in each variable.
Here is my attempt at solving this:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = "Data/Customer_review_data"
filelists = PlaintextCorpusReader(corpus_root, '.*')
filelists.fileids()
rawlist = filelists.raw('Apex AD2600 Progressive-scan DVD player.txt')
sentence = rawlist.split("\n")[:]
a_line = ""
sentence_only = ""
content = ""
title = ""
tag = ""
for b_line in sentence:
    if title != '' or content != '' or sentence_only != '':
        content = title, tag, sentence_only
    if re.match(r"^\*", b_line):
        continue
    if re.match(r"^\[t\][ ]", b_line):
        title = b_line[4:]
        continue
    if re.match(r"^\[t\]", b_line):
        title = b_line[3:]
        continue
    if re.match(r"^##", b_line):
        sentence_only = b_line[2:]
        continue
    if re.match(r".*##", b_line):
        i = len(b_line.split('##')[0])+2
        sentence_only = b_line[i:]
        tag = b_line[:i-2]
        continue
    if re.match(r".*#", b_line):
        sentence_only = b_line[2:]
        continue
print(test)
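The symptom described above (only the last line surviving) is what happens when a name is rebound on every pass of a loop instead of being appended to a list. A minimal sketch of the difference, using made-up sample lines rather than the real review file:

```python
# Hypothetical sample lines in the same "tag##sentence" layout as the question.
lines = [
    "[t] some title",
    "##first sentence",
    "tag[+1]##second sentence",
]

# Rebinding: each pass replaces the previous value, so only the last one survives.
sentence_only = ""
for line in lines:
    if "##" in line:
        sentence_only = line.split("##", 1)[1]

# Accumulating: each pass appends to a list, so every value is kept.
sentences = []
for line in lines:
    if "##" in line:
        sentences.append(line.split("##", 1)[1])

print(sentence_only)  # second sentence
print(sentences)      # ['first sentence', 'second sentence']
```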
Actually, I re-read your question, and it seems each file contains only one item. If so, this becomes much simpler.
with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # this seems to work OS agnostic
item = {
    "title": data[0][4:],
    "contents": [{"tag": line.split("##")[0], "sentence": line.split("##")[1]} for line in data[1:]]
}
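This results in a dictionary item identical to the ones in the old answer below...
Old answer
I would use a list of dictionary items to hold the data, but you can easily adjust which variables the resulting data goes into.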
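The single-item version indexes `split("##")[1]` directly, which raises an IndexError on any line without `##` (a trailing blank line, for instance). As a hedged variation, here is a sketch using `str.partition` that skips such lines; the function name `parse_item` is made up for illustration, and unlike the snippet above it also strips surrounding whitespace:

```python
def parse_item(lines):
    """Parse one '[t] title' line followed by 'tag##sentence' lines."""
    # Drop the '[t]' marker; strip() covers both '[t] title' and '[t]title'.
    title = lines[0][len("[t]"):].strip()
    contents = []
    for line in lines[1:]:
        tag, sep, sentence = line.partition("##")
        if not sep:
            continue  # no '##' on this line (e.g. a blank line): skip it
        contents.append({"tag": tag, "sentence": sentence.strip()})
    return {"title": title, "contents": contents}


# Sample lines taken from the question, plus a blank line to show the skip.
sample = [
    "[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . ",
    "##repost from january 13 , 2004 with a better fit title . ",
    "i/p button[+2]##im a more happier person after discovering the i/p button ! ",
    "",
]
item = parse_item(sample)
print(item["title"])
for content in item["contents"]:
    print(repr(content["tag"]), "->", content["sentence"])
```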
from pprint import pprint
with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # this seems to work OS agnostic
result = []
current_item = None
for line in data:
    if line.startswith('[t]'):
        # add everything stored so far to result
        # check is needed for the first loop
        if current_item:
            result.append(current_item)
        current_item = {
            "title": line[4:],  # strip the [t] part
            "contents": []  # reset the contents list
        }
    else:
        current_item["contents"].append({
            "tag": line.split("##")[0],  # the first element of the split
            "sentence": line.split("##")[1]  # the second element of the split
        })
# finally, add last item
result.append(current_item)
# usage:
for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: sentence_only")
    for content in item["contents"]:
        print(content["sentence"])
for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: tag")
    for content in item["contents"]:
        print(content["tag"])
# pprint:
pprint(result)
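The output is shown below.
Note that I simply copied the sample input and added a very imaginative NR2 to the lines to distinguish the two "items" in the source file...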
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
Variable: sentence_only
repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button !
it practically plays almost everything you give it .
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: sentence_only
repost from january 13 , 2004 with a better fit title . NR2
im a more happier person after discovering the i/p button ! NR2
it practically plays almost everything you give it . NR2
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor . NR2
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
Variable: tag
i/p button[+2]
dvd player[+1][p]
player[+2],sound[-1]
TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: tag
i/p button[+2] NR2
dvd player[+1][p] NR2
player[+2],sound[-1] NR2
[{'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title .',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button ! ',
                'tag': 'i/p button[+2]'},
               {'sentence': 'it practically plays almost everything you give '
                            'it . ',
                'tag': 'dvd player[+1][p]'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor . ',
                'tag': 'player[+2],sound[-1]'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '},
 {'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title . NR2',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button ! NR2',
                'tag': 'i/p button[+2] NR2'},
               {'sentence': 'it practically plays almost everything you give '
                            'it . NR2',
                'tag': 'dvd player[+1][p] NR2'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor . NR2',
                'tag': 'player[+2],sound[-1] NR2'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '
           'NR2'}]
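If you would rather have the three flat variables from the question (title, sentence_only, tag) than a nested structure, the list of dictionaries can be flattened afterwards. A rough sketch, assuming `result` has the shape shown above; the small literal here is trimmed sample data, and the title is repeated once per sentence so the three lists stay aligned:

```python
# Trimmed stand-in for the parsed data, in the same shape as the output above.
result = [
    {
        "title": "troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .",
        "contents": [
            {"tag": "", "sentence": "repost from january 13 , 2004 with a better fit title ."},
            {"tag": "i/p button[+2]", "sentence": "im a more happier person after discovering the i/p button !"},
        ],
    },
]

title, tag, sentence_only = [], [], []
for item in result:
    for content in item["contents"]:
        title.append(item["title"])  # same title until the next [t] item starts
        tag.append(content["tag"])
        sentence_only.append(content["sentence"])

# Row i of each list describes the same input line.
for row in zip(title, tag, sentence_only):
    print(row)
```

Keeping the lists the same length means an empty tag stays as an empty string rather than being dropped; filter those out afterwards if you only want non-empty tags.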