For loop writing rows into variables

Question

I am trying to write a loop in Python that extracts information from the sentence on each line. The input sentences look like this:

[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
##repost from january 13 , 2004 with a better fit title .
i/p button[+2]##im a more happier person after discovering the i/p button ! 
dvd player[+1][p]##it practically plays almost everything you give it . 
player[+2],sound[-1]##i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .

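I want to extract only the sentence, use the information before ## as the tag, and write all of it into variables that then hold the complete information. Expected output: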

Variable: title
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .

So this variable should keep its value until a new [t] appears in a line.

Variable: sentence_only

repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button ! 
it practically plays almost everything you give it . 
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .

Variable: tag


i/p button[+2]
dvd player[+1][p]
player[+2],sound[-1]

The current output keeps only the last line instead of the full list in each variable.

This is my attempt at solving the problem:

import re
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = "Data/Customer_review_data"

filelists = PlaintextCorpusReader(corpus_root, '.*')

filelists.fileids()

rawlist = filelists.raw('Apex AD2600 Progressive-scan DVD player.txt')

sentence = rawlist.split("\n")[:]

a_line = ""
sentence_only = ""
content = ""
title = ""
tag = ""

for b_line in sentence:
    if title != '' or content != '' or sentence_only != '':
        content = title, tag, sentence_only
    if re.match(r"^\*", b_line):
        continue
    if re.match(r"^\[t\][ ]", b_line):
        title = b_line[4:]
        continue
    if re.match(r"^\[t\]", b_line):
        title = b_line[3:]
        continue
    if re.match(r"^##", b_line):
        sentence_only = b_line[2:]
        continue
    if re.match(r".*##", b_line):
        i = len(b_line.split('##')[0])+2
        sentence_only = b_line[i:]
        tag = b_line[:i-2]
        continue
    if re.match(r".*#", b_line):
        sentence_only = b_line[2:]
        continue
print(content)

Answer 1

Actually, I re-read your question, and it seems that each file contains only one item. If that is the case, you can make this a lot easier.

with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # splitlines() handles \n and \r\n line endings alike

item = {
    "title": data[0][4:],
    "contents": [{"tag": line.split("##")[0], "sentence": line.split("##")[1]} for line in data[1:]]
}

This results in a dictionary item with the same shape as the ones in the old answer below...
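
If some lines might be blank or might lack the ## separator, indexing `line.split("##")[1]` would raise an IndexError. A minimal sketch of a more forgiving variant follows, assuming the first non-blank line is the [t] title line. The function name `parse_single_item` and the inline sample list are hypothetical, and unlike the snippet above it strips surrounding whitespace from titles and sentences.

```python
def parse_single_item(lines):
    """Parse one '[t]' title line followed by 'tag##sentence' lines."""
    # Drop blank lines so an empty line in the file does not break parsing.
    lines = [line for line in lines if line.strip()]

    title_line = lines[0]
    if title_line.startswith("[t]"):
        title_line = title_line[3:]
    title = title_line.strip()

    contents = []
    for line in lines[1:]:
        # partition() returns (before, separator, after). If '##' is missing,
        # the separator is '' and the whole line ends up in `before`.
        tag, sep, sentence = line.partition("##")
        if not sep:
            # No separator: treat the whole line as an untagged sentence.
            tag, sentence = "", tag
        contents.append({"tag": tag, "sentence": sentence.strip()})
    return {"title": title, "contents": contents}


# Hypothetical sample input, shortened from the question.
sample = [
    "[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . ",
    "##repost from january 13 , 2004 with a better fit title .",
    "i/p button[+2]##im a more happier person after discovering the i/p button ! ",
    "",
]
item = parse_single_item(sample)
print(item["title"])
for content in item["contents"]:
    print(repr(content["tag"]), "->", content["sentence"])
```

Using `partition` instead of `split` also means a sentence that itself contains ## is cut only at the first occurrence.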

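Old answer

I would use a list of dictionary items to hold the data, but you can easily adapt which variables the resulting data ends up in.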

from pprint import pprint


with open("somefile.txt") as infile:
    data = infile.read().splitlines()  # splitlines() handles \n and \r\n line endings alike

result = []
current_item = None
for line in data:
    if line.startswith('[t]'):
        # add everything stored so far to result
        # (the check is needed on the first iteration, when nothing is stored yet)
        if current_item:
            result.append(current_item)
        current_item = {
            "title": line[4:],    # strip the [t] part
            "contents": []        # reset the contents list
            } 
    else:
        current_item["contents"].append({
            "tag": line.split("##")[0],     # the first element of the split
            "sentence": line.split("##")[1] # the second element of the split
        })
# finally, add last item
result.append(current_item) 


# usage:
for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: sentence_only")
    for content in item["contents"]:
        print(content["sentence"])

for item in result:
    print(f"\nTITLE: {item['title']}")
    print("Variable: tag")
    for content in item["contents"]:
        print(content["tag"])

# pprint:
pprint(result)

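The output is shown below.
Note that I simply copied the sample input and added a very imaginative NR2 to the lines to distinguish the two "items" in the source file...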

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
Variable: sentence_only
repost from january 13 , 2004 with a better fit title .
im a more happier person after discovering the i/p button !
it practically plays almost everything you give it .
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .     

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: sentence_only
repost from january 13 , 2004 with a better fit title . NR2
im a more happier person after discovering the i/p button !  NR2
it practically plays almost everything you give it .  NR2
i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .  NR2

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
Variable: tag

i/p button[+2]
dvd player[+1][p]
player[+2],sound[-1]

TITLE: troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . NR2
Variable: tag

i/p button[+2] NR2
dvd player[+1][p] NR2
player[+2],sound[-1] NR2
[{'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title .',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button ! ',
                'tag': 'i/p button[+2]'},
               {'sentence': 'it practically plays almost everything you give '
                            'it . ',
                'tag': 'dvd player[+1][p]'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor . ',
                'tag': 'player[+2],sound[-1]'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '},
 {'contents': [{'sentence': 'repost from january 13 , 2004 with a better fit '
                            'title . NR2',
                'tag': ''},
               {'sentence': 'im a more happier person after discovering the '
                            'i/p button !  NR2',
                'tag': 'i/p button[+2] NR2'},
               {'sentence': 'it practically plays almost everything you give '
                            'it .  NR2',
                'tag': 'dvd player[+1][p] NR2'},
               {'sentence': "i 've had the player for about 2 years now and it "
                            'still performs nicely with the exception of an '
                            'occasional wwhhhrrr sound from the motor .  NR2',
                'tag': 'player[+2],sound[-1] NR2'}],
  'title': 'troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . '
           'NR2'}]
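
If you would rather have the three flat variables from the question (title, sentence_only, tag) than nested dictionaries, the list can be flattened afterwards. This is only a sketch: the small `result` literal below is a shortened stand-in for the list built by the loop above, and the key point is that each variable is a list that gets appended to, so earlier lines are not overwritten.

```python
# Shortened stand-in for the `result` list built by the loop above.
result = [
    {
        "title": "troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .",
        "contents": [
            {"tag": "",
             "sentence": "repost from january 13 , 2004 with a better fit title ."},
            {"tag": "i/p button[+2]",
             "sentence": "im a more happier person after discovering the i/p button !"},
        ],
    },
]

# Three parallel lists: index i in each list refers to the same input line.
title = []
sentence_only = []
tag = []

for item in result:
    for content in item["contents"]:
        # Repeat the current title once per sentence so the lists stay aligned.
        title.append(item["title"])
        sentence_only.append(content["sentence"])
        tag.append(content["tag"])

for row in zip(title, tag, sentence_only):
    print(row)
```

Assigning to a plain string inside the loop (as in the question) replaces the previous value on every iteration, which is why only the last line survives there; appending to lists keeps every line.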


python

string

nltk