Remove timestamps in parentheses from text in Python
I want to remove all the timestamps in parentheses from the sample text data below.
Input:
Agent: Can I help you? ( 3s ) Customer: Thank you( 40s ) Customer: I
have a question about X. ( 8m 1s ) Agent: I can help here. Log in this
website (remember to use your new password) ( 11m 31s )
Expected output:
Agent: Can I help you? Customer: Thank you Customer: I have a question
about X. Agent: I can help here. Log in this website (remember to use
your new password)
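I tried re.sub(r'\(.*?\)', '', data)
but it didn't work, because it removes everything in parentheses. I want to keep the content in parentheses when it is not a timestamp; for example, I want to keep "(remember to use your new password)" in the output.
I'm still new to regular expressions, so I hope I can get some guidance here. Thanks!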
\(\s(\d{1,2}[smh]\s)+\)
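FYI: .* matches any characters except line terminators, which is why your pattern also removes parentheses that do not contain a timestamp.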
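For reference, here is a minimal sketch of how the pattern above could be plugged into re.sub. The names TIMESTAMP and strip_timestamps are made up for illustration, and the extra whitespace-collapsing step is an assumption about the desired output (removing "( 3s )" otherwise leaves two spaces behind).

```python
import re

# "(" + one whitespace + one or more "<1-2 digits><s|m|h><whitespace>" groups + ")"
# (hypothetical constant name; the pattern itself is the one from this answer)
TIMESTAMP = re.compile(r'\(\s(\d{1,2}[smh]\s)+\)')

data = ("Agent: Can I help you? ( 3s ) Customer: Thank you( 40s ) Customer: I "
        "have a question about X. ( 8m 1s ) Agent: I can help here. Log in this "
        "website (remember to use your new password) ( 11m 31s )")


def strip_timestamps(text):
    """Remove timestamp parentheses, then collapse leftover whitespace runs."""
    without_timestamps = TIMESTAMP.sub('', text)
    return re.sub(r'\s+', ' ', without_timestamps).strip()


print(strip_timestamps(data))
```

"(remember to use your new password)" survives because the pattern requires whitespace right after the opening parenthesis, followed by digits and a unit letter.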
Not a regex, and maybe not efficient, but string methods will do:
spam = "Agent: Can I help you? ( 3s ) Customer: Thank you( 40s ) Customer: I have a question about X. ( 8m 1s ) Agent: I can help here. Log in this website (remember to use your new password) ( 11m 31s )"

def cleanup(text):
    # Put each speaker turn on its own line.
    for word in ('Agent', 'Customer'):
        text = text.replace(word, f'\n{word}').strip()
    # Cut each line at its last '(' and drop the trailing space left behind.
    clean_text = [line[:line.rindex('(')].rstrip() for line in text.splitlines()]
    # or in slow-motion
    # clean_text = []
    # for line in text.splitlines():
    #     idx = line.rindex('(')
    #     line = line[:idx].rstrip()
    #     clean_text.append(line)
    return ' '.join(clean_text)

print(cleanup(spam))
Output:
Agent: Can I help you? Customer: Thank you Customer: I have a question about X. Agent: I can help here. Log in this website (remember to use your new password)
Edit: as @DRPK suggested, it can be condensed into a one-liner, which may make a difference on a large corpus:
clean_text = ' '.join([line[:line.rindex('(')].rstrip() for line in text.replace("Agent", '\nAgent').replace("Customer", '\nCustomer').strip().splitlines()])
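One thing to watch with this approach: str.rindex raises ValueError when a line contains no '(' at all. Below is a rough sketch of a more defensive variant, assuming every speaker turn that has a timestamp ends with it. The name cleanup_safe and the speakers parameter are hypothetical. It still cuts a trailing parenthesis that is not a timestamp, so it does not replace the regex answers.

```python
def cleanup_safe(text, speakers=('Agent', 'Customer')):
    """String-method cleanup that tolerates turns without any parentheses."""
    # Put each speaker turn on its own line.
    for word in speakers:
        text = text.replace(word, f'\n{word}')
    cleaned = []
    for line in text.strip().splitlines():
        idx = line.rfind('(')  # rfind returns -1 instead of raising ValueError
        # Only cut when the turn actually ends with a parenthesised chunk.
        if idx != -1 and line.rstrip().endswith(')'):
            line = line[:idx]
        cleaned.append(line.strip())
    return ' '.join(cleaned)


spam = ("Agent: Can I help you? ( 3s ) Customer: Thank you( 40s ) Customer: I "
        "have a question about X. ( 8m 1s ) Agent: I can help here. Log in this "
        "website (remember to use your new password) ( 11m 31s )")
print(cleanup_safe(spam))
print(cleanup_safe("Agent: Hello Customer: Hi"))
```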
\( [^\)]++\)
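You can use this regular expression in your code and replace each match with "".
I generated it with http://www.amazingregex.xyz/. You can generate one yourself from text examples.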
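A hedged note on using this one from Python: the possessive quantifier ++ is only accepted by the built-in re module from Python 3.11 onward. The sketch below therefore uses the plain-greedy form \( [^)]+\), which matches the same text here because [^)]+ can never consume a closing parenthesis. The constant name is made up for illustration.

```python
import re

# Non-possessive equivalent of \( [^\)]++\): "(" + a space + anything up to ")".
# (hypothetical constant name)
PAREN_WITH_LEADING_SPACE = re.compile(r'\( [^)]+\)')

sample = "Thank you( 40s ) Log in (remember to use your new password) ( 11m 31s )"
result = PAREN_WITH_LEADING_SPACE.sub('', sample)
print(repr(result))

# Caveat: any parenthesis that starts with a space is removed, timestamp or not.
print(repr(PAREN_WITH_LEADING_SPACE.sub('', "a ( see note ) b")))
```

This pattern relies only on the space after the opening parenthesis, so it is looser than one that checks for digits and a unit letter.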