How do I make my code remove the sender names found in the messages saved in a txt file and the tags using regex

Question

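In a conversation between a sender and a receiver over Discord, I need to remove the tags and the names of the participants. Removing everything before the colon (:) would do it: that way the sender's name doesn't matter, and the sender is always removed.

This is the content of the generic_discord_talk.txt file: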

Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me

import collections
import re  # needed for the re.compile calls below
import pandas as pd
import matplotlib.pyplot as plt  # to then graph the words that are repeated the most

# Read the whole conversation; the with block closes the file afterwards
with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()

# Load the stopwords (whitespace-separated, any number per line)
with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
    print(st)

stops = set(st)
stopwords = stops.union({'you', 'for', 'the'})  # OPTIONAL
#print(stopwords)

I created a regex to detect the tags:

regex = re.compile(r"^(<@!.+>)?\s*(messegeA|messegeB|messegeC)(<@!.+>)?\s*$")
regex_tag = re.compile(r"^<@!.+>")

I need print(st) to give me the words, but without the sender names and the tags.

Answer 1

You can use an alternation | to match either from the start of the string up to the first colon, or from <@! up to the first closing >.

^[^:\n]+:\s*|\s*<@![^<>]+>

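The pattern matches:

^ Start of string (start of each line when re.M is used)
[^:\n]+:\s* Match 1+ occurrences of any character except : or a newline, then match : and optional whitespace characters
| Or
\s*<@! Match <@! literally, preceded by optional whitespace characters
[^<>]+ Negated character class, match 1+ occurrences of any character except < and >
> Match > literally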
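To make the breakdown above concrete, here is a small sketch that lists what each branch of the alternation picks up. The two sample lines are shortened from the file in the question; the variable names `sample` and `matches` are only illustrative.

```python
import re

# Two shortened lines from generic_discord_talk.txt
sample = (
    "Company: <@!808947310809317387> Good morning\n"
    "Customer: I add the product <@!808947310809317387>"
)

# re.M lets ^ match at the start of every line, not just the first one
pattern = re.compile(r"^[^:\n]+:\s*|\s*<@![^<>]+>", re.M)

# Collect every piece of text the pattern would remove
matches = [m.group() for m in pattern.finditer(sample)]
for text in matches:
    print(repr(text))
```

The first branch consumes the sender name, the colon and the whitespace after it; the second branch consumes a tag together with any whitespace in front of it.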


If only digits can follow <@!, the pattern can be:

^[^:\n]+:|<@!\d+>
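The difference between the two tag patterns only shows up when something other than digits sits between <@! and >. The sketch below contrasts them; the string `<@!not-an-id>` is a made-up input for illustration and is not something the file in the question contains.

```python
import re

generic_tag = re.compile(r"<@![^<>]+>")  # anything except < and > inside the tag
digits_tag = re.compile(r"<@!\d+>")      # only digits inside the tag

# The second "tag" is hypothetical, added to show where the patterns differ
text = "Hi <@!808947310809317385>, see <@!not-an-id>"

generic_result = generic_tag.sub("", text)
digits_result = digits_tag.sub("", text)

print(repr(generic_result))
print(repr(digits_result))
```

The digits-only version is stricter, so it is less likely to delete ordinary text that happens to be wrapped in <@! and >.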

For example:

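import re

with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()
# re.M makes ^ match at the start of every line, not only at the start of the file
st = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", a, flags=re.M)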

If you also want to strip the leading and trailing spaces of each line, you can add this line:

st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
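Since the question goes on to count the most repeated words, the two substitutions can be put in front of a `collections.Counter`. This is only a sketch under some assumptions: the helper names `clean_messages` and `count_words` are invented here, words are split with a plain `\w+`, and a short inline sample stands in for the file.

```python
import collections
import re


def clean_messages(text):
    """Remove 'Sender:' prefixes and <@!...> tags, then trim each line."""
    text = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", text, flags=re.M)
    return re.sub(r"^[^\S\n]*|[^\S\n]*$", "", text, flags=re.M)


def count_words(text, stopwords):
    """Count lowercased words in text, skipping anything in stopwords."""
    words = re.findall(r"\w+", text.lower())
    return collections.Counter(w for w in words if w not in stopwords)


# Shortened sample lines; in the question this text comes from the txt file
sample = (
    "Company: <@!808947310809317387> Good morning, how can we help you?\n"
    "Customer: Hi <@!808947310809317385>, I need you to help me"
)

cleaned = clean_messages(sample)
counts = count_words(cleaned, stopwords={"you", "for", "the"})

print(cleaned)
print(counts.most_common(1))
```

With the real data, `stopwords` would be the set built from stopwords-es.txt in the question.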

Answer 2

I think this should work:

import re


data = """Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me"""


def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<@!\d+>", "", line)  # remove tags
        print(line)


run()  # the function has to be called for anything to be printed
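One thing to be aware of with the line-by-line approach above: removing a tag leaves the spaces around it behind, so a line can start with a space or end up as "Hi , I need ...". A possible variation, assuming single-word sender names as in the sample data, lets the tag pattern take the preceding whitespace with it and strips each line afterwards. The names `SENDER`, `TAG` and `clean_line` are illustrative only.

```python
import re

SENDER = re.compile(r"^\w+:\s*")  # single-word sender name, colon, spaces
TAG = re.compile(r"\s*<@!\d+>")   # a tag plus any whitespace in front of it


def clean_line(line):
    """Drop the sender prefix and all tags from one message line."""
    line = SENDER.sub("", line)
    line = TAG.sub("", line)
    return line.strip()


# Shortened lines from the conversation in the question
lines = [
    "Company: <@!808947310809317387> Good morning",
    "Customer: Hi <@!808947310809317385>, I need you to help me",
    "Customer: nothing happens <@!808947310809317387>",
]

cleaned_lines = [clean_line(line) for line in lines]
for line in cleaned_lines:
    print(line)
```

If a sender name could contain spaces, `^[^:\n]+:` from Answer 1 would be the safer prefix pattern.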

python

regex

string

python-3.x

txt