Disassemble and reassemble strings based on a list
I have four speakers like this:
Team_A=[Fred,Bob]
Team_B=[John,Jake]
They are having a conversation, and all of it is represented by a single string, i.e. convo =
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine
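How can I disassemble and reassemble the string so that it is split into 2 strings: 1 string of what Team_A said and 1 string of what Team_B said?
Output: team_A_said="hello how is it going?", team_B_said="hi we are doing fine"
The lines don't matter.
I have this awful find ... then slice code that is not scalable. Can someone suggest something else? Are there any libraries that can help with this?
I didn't find anything in the nltk library.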
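This is a language parsing problem.
Answer in progress
Finite state machine
A conversation transcript can be understood by picturing it being parsed by an automaton with the following states: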
[start] ---> [Name]----> [Text]-+----->[end]
              ^                 |
              |                 | (whitespaces)
              +-----------------+
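You can parse your conversation by making it follow that state machine. If the parse succeeds (i.e. it follows the states through to the end of the text), you can browse the "conversation tree" to derive meaning.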
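The state machine above can be sketched in a few lines of Python. This is a minimal illustration, not the answerer's implementation: the names parse_conversation and team_said, the state labels, and the choice to treat each line as one token (exactly one line of text per turn) are all assumptions made for this example.

```python
TEAM_A = {'Fred', 'Bob'}
TEAM_B = {'John', 'Jake'}
SPEAKERS = TEAM_A | TEAM_B


def parse_conversation(convo):
    """Follow start -> Name -> Text -> (Name -> Text)* -> end, one line per token.

    Returns a list of (speaker, text) turns. Raises ValueError if the
    transcript does not follow the state machine.
    """
    state = 'start'
    speaker = None
    turns = []
    for line in convo.splitlines():
        line = line.strip()
        if not line:
            continue  # whitespace between turns does not change the state
        if state in ('start', 'text'):
            # Only a speaker name may open a turn.
            if line not in SPEAKERS:
                raise ValueError('expected a speaker name, got %r' % line)
            speaker, state = line, 'name'
        else:
            # state == 'name': the next non-blank line is what was said.
            turns.append((speaker, line))
            state = 'text'
    if state != 'text':
        raise ValueError('transcript ended in state %r' % state)
    return turns


def team_said(turns, team):
    """Join everything said by members of `team` into one string."""
    return ' '.join(text for speaker, text in turns if speaker in team)


convo = """Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine"""

turns = parse_conversation(convo)
team_A_said = team_said(turns, TEAM_A)
team_B_said = team_said(turns, TEAM_B)
print(team_A_said)  # hello how is it going?
print(team_B_said)  # hi we are doing fine
```

Because the parser only requires that a name be followed by text, the same speaker may appear twice in a row without causing an error.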
Tokenizing your conversation (lexer)
You need a function to recognize the name state. This is easy:
name = (Team_A | Team_B) + '\n'
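One way to write such a recognizer is to tag every non-blank line as either a NAME or a TEXT token. The sketch below assumes one token per line; tokenize and the token labels are hypothetical names chosen for illustration.

```python
TEAM_A = ['Fred', 'Bob']
TEAM_B = ['John', 'Jake']
NAMES = set(TEAM_A) | set(TEAM_B)


def tokenize(convo):
    """Yield ('NAME', line) or ('TEXT', line) for each non-blank line."""
    for line in convo.splitlines():
        line = line.strip()
        if not line:
            continue  # skip whitespace-only lines
        kind = 'NAME' if line in NAMES else 'TEXT'
        yield kind, line


tokens = list(tokenize("Fred\nhello\nJohn\nhi\n"))
print(tokens)
# [('NAME', 'Fred'), ('TEXT', 'hello'), ('NAME', 'John'), ('TEXT', 'hi')]
```

Note that a spoken line consisting of nothing but a speaker's name would be tagged as NAME; telling the two apart needs context from the parser, not the lexer.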
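Conversation alternation
In this answer, I did not assume that the conversation alternates between the people speaking, as in this conversation: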
Fred # author 1
hello
John # author 2
hi
Bob # author 3
how is it going ?
Bob # ERROR : author 3 again !
are we still on for saturday, Fred ?
This might be problematic if your transcript concatenates answers from the same author.
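If repeated turns such as the two consecutive Bob entries should count as a single turn, they can be merged after parsing. A minimal sketch, assuming the transcript has already been parsed into (speaker, text) pairs; merge_consecutive is a hypothetical helper and not part of the answer itself.

```python
from itertools import groupby


def merge_consecutive(turns):
    """Join adjacent turns that share a speaker into a single turn."""
    merged = []
    # groupby only groups *adjacent* items with the same key, which is
    # exactly what is needed here.
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        merged.append((speaker, ' '.join(text for _, text in group)))
    return merged


turns = [
    ('Fred', 'hello'),
    ('John', 'hi'),
    ('Bob', 'how is it going ?'),
    ('Bob', 'are we still on for saturday, Fred ?'),
]
merged = merge_consecutive(turns)
print(merged[-1])
# ('Bob', 'how is it going ? are we still on for saturday, Fred ?')
```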
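You can use a regular expression to split up each entry. The built-in filter (itertools.ifilter in Python 2) can then be used to extract the required entries for each conversation.
import re

def get_team_conversation(entries, team):
    # Keep the entries whose first line (the speaker's name) belongs to the team
    return list(filter(lambda e: e.split('\n')[0] in team, entries))

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

convo = """
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine"""

# Matches a line that consists solely of one of the speakers' names
find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
# Each entry runs from a name line up to the next name line or the end of the string
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S | re.M)]

print('Team-A', get_team_conversation(entries, Team_A))
print('Team-B', get_team_conversation(entries, Team_B))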
This gives the following output:
Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
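The entries in this output still carry the speaker's name on their first line. To get the single string per team that the question asks for, the name can be dropped and the remaining text joined. This is a sketch under the assumption that every entry has the form name\ntext, as in the output above; team_said is a hypothetical helper.

```python
def team_said(entries, team):
    """Join the text of every entry whose first line is a member of `team`."""
    parts = []
    for entry in entries:
        # Split off the first line (the name) from the rest (the text).
        name, _, text = entry.partition('\n')
        if name in team:
            # Collapse internal line breaks so the result is a single line.
            parts.append(' '.join(text.split()))
    return ' '.join(parts)


Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']
entries = ['Fred\nhello', 'John\nhi', 'Bob\nhow is it going?', 'Jake\nwe are doing fine']

team_A_said = team_said(entries, Team_A)
team_B_said = team_said(entries, Team_B)
print(team_A_said)  # hello how is it going?
print(team_B_said)  # hi we are doing fine
```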
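This code assumes that the contents of convo strictly conform to the name\nstuff they said\n\n pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which groups the lines list into triplets of strings. For a discussion of this technique and alternatives, see How do you split a list into evenly sized chunks in Python?.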
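To see why zip(*[iter(lines)]*3) groups the lines in threes: the list [iter(lines)]*3 holds three references to the same iterator, so each tuple that zip builds consumes three consecutive items. A small illustration with made-up data:

```python
lines = ['Fred', 'hello', '', 'John', 'hi', '']

it = iter(lines)
# The same iterator object passed three times; zip pulls from it in turn.
chunks = list(zip(it, it, it))  # equivalent to zip(*[iter(lines)]*3)
print(chunks)
# [('Fred', 'hello', ''), ('John', 'hi', '')]

# A trailing incomplete chunk is silently dropped.
short = ['Fred', 'hello', '', 'John']
short_chunks = list(zip(*[iter(short)]*3))
print(short_chunks)
# [('Fred', 'hello', '')]
```

The second case shows why a missing final blank line matters: an incomplete last group is discarded without any error.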
#!/usr/bin/env python

team_ids = ('A', 'B')
team_names = (
    ('Fred', 'Bob'),
    ('John', 'Jake'),
)

# Build a dict to get the team ID from a person's name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id

# Each block in convo MUST consist of <name>\n<one line of text>\n\n
# Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

# Group lines into <name><text><empty> chunks
# and append the text to the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    print('Team %s said: %r' % (team_id, ' '.join(said[team_id])))
Output
Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'