How do I make my code remove the sender names and the tags found in messages saved in a txt file, using regex?
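In a conversation between a sender and a receiver over Discord, I need to remove the tags and the names of the participants. In this case it would help to remove everything before the colon (:), so that the sender's name does not matter and the sender is always removed.
This is the content of the generic_discord_talk.txt file: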
Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me
import collections
import re  # needed for the regexes below
import pandas as pd
import matplotlib.pyplot as plt  # to then graph the words that are repeated the most

with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()

with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
print(st)

stops = set(st)
stopwords = stops.union(set(['you', 'for', 'the']))  # OPTIONAL
#print(stopwords)
I created a regex to detect the tags:
regex = re.compile(r"^(<@!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<@!.+>){,1}\s{,}$")
regex_tag = re.compile(r"^<@!.+>")
I need print(st) to give me back the words, but without the senders and the tags.
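You can use an alternation with | that matches either from the start of the string up to the first colon, or from <@! up to the first closing >.
^[^:\n]+:\s*|\s*<@![^<>]+>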
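The pattern matches:
^    Start of the string (with re.M, the start of each line)
[^:\n]+:\s*    Match 1+ occurrences of any character except : or a newline, then match : and optional whitespace characters
|    Or
\s*<@!    Match <@! literally, preceded by optional whitespace characters
[^<>]+    Negated character class, match 1+ occurrences of any character except < and >
>    Match > literally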
If only digits can follow <@!, the pattern can be:
^[^:\n]+:|<@!\d+>
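To make the difference between the `[^<>]+` form and the digits-only form concrete, here is a minimal sketch that applies both to one line from the question. The names `BROAD`, `DIGITS_ONLY` and `sample` are mine and only serve the illustration.

```python
import re

# Sender prefix up to the first colon, or any <@!...> tag,
# each together with the adjacent optional whitespace.
BROAD = re.compile(r"^[^:\n]+:\s*|\s*<@![^<>]+>", re.M)

# Stricter variant: the tag may contain digits only, and no whitespace is consumed.
DIGITS_ONLY = re.compile(r"^[^:\n]+:|<@!\d+>", re.M)

sample = "Customer: Hi <@!808947310809317385>, I need you to help me"

broad_result = BROAD.sub("", sample)
digits_result = DIGITS_ONLY.sub("", sample)

print(repr(broad_result))   # 'Hi, I need you to help me'
print(repr(digits_result))  # ' Hi , I need you to help me'
```

The digits-only form leaves the spaces around the removed parts in place, so it usually needs a trimming step afterwards.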
For example:
import re

with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()
st = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", a, flags=re.M)
If you also want to strip the leading and trailing spaces of each line, you can add this line:
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
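Since the question ultimately wants the most repeated words, the two substitutions can be chained with a word count. The following is only a sketch under assumptions: `text` stands in for the file contents (two lines from the question), the stopword set is a tiny hypothetical one instead of stopwords-es.txt, and the word-splitting regex is my choice, not part of the answer.

```python
import re
from collections import Counter

# Stand-in for the contents of generic_discord_talk.txt (assumption).
text = (
    "Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made\n"
    "Client: <@!808947310809317387> Yes, I have it in front of me"
)

# Hypothetical stopword set; the question loads its own from stopwords-es.txt.
stopwords = {"you", "for", "the"}

# Step 1: drop the sender prefix and the tags on every line.
cleaned = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", text, flags=re.M)
# Step 2: trim leftover horizontal whitespace at line starts and ends.
cleaned = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", cleaned, flags=re.M)

# Step 3: lowercase, extract runs of letters, and count the non-stopwords.
words = [w for w in re.findall(r"[^\W\d_]+", cleaned.lower()) if w not in stopwords]
counts = Counter(words)

print(cleaned)
print(counts.most_common(3))
```

`counts.most_common(n)` returns a list of (word, count) pairs that could then be passed on to pandas or matplotlib for plotting.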
I think this should work:
import re
data = """Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me"""
def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<@!\d+>", "", line)  # remove tags
        print(line)

run()
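One caveat with `^\w+: `: `\w` matches neither spaces nor hyphens, so a sender name made of several words would be left in place. A possible per-line variant, as a sketch only: `strip_line`, `SENDER` and `TAG` are hypothetical names, and the second sample sender is invented to show the case.

```python
import re

SENDER = re.compile(r"^[^:\n]+:\s*")  # anything up to the first colon
TAG = re.compile(r"\s*<@!\d+>")       # a numeric tag plus the whitespace before it

def strip_line(line: str) -> str:
    """Return the line without its sender prefix and without tags."""
    line = SENDER.sub("", line)
    line = TAG.sub("", line)
    return line.strip()

lines = [
    "Client: <@!808947310809317387> Yes, I have it in front of me",
    "X-company support: Of course <@!808947310809317385>",  # invented sender name
]
cleaned_lines = [strip_line(line) for line in lines]
print(cleaned_lines)
# ['Yes, I have it in front of me', 'Of course']
```

Note that this assumes every line starts with a sender; a line without one that happens to contain a colon would lose everything up to that colon.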