Removing chars/signs from string

Question

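I am preparing text for a word cloud, but I am stuck.

I need to remove all digits and all signs such as . , - ? = / ! @ etc., but I don't know how. I don't want to call replace over and over again. Is there a method for that?

Here is my concept and what I have to do: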

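Concatenate the texts into one string
Set the characters to lowercase <--- I am here
Now I want to remove specific signs and split the text into words (a list)
Calculate word frequencies
Next, run the stop-words script...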

abstracts_list = open('new','r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
Lower = allab.lower()

Sample text:

MicroRNAs (miRNAs) are a class of noncoding RNA molecules approximately 19 to 25 nucleotides in length that downregulate the expression of target genes at the post-transcriptional level by binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus (EBV) generates at least 44 miRNAs, but the functions of most of these miRNAs have not yet been identified. Previously, we reported BRUCE as a target of miR-BART15-3p, a miRNA produced by EBV, but our data suggested that there might be other apoptosis-associated target genes of miR-BART15-3p. Thus, in this study, we searched for new target genes of miR-BART15-3p using in silico analyses. We found a possible seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The luciferase activity of a reporter vector including the 3'-UTR of TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated the expression of TAX1BP1 mRNA and protein in AGS cells, while an inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1 mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB activity in gastric cancer cell lines. Moreover, miR-BART15-3p strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1 gene in cancer cells, causing increased apoptosis and chemosensitivity to 5-FU.

Answer 1

To convert uppercase characters to lowercase, store your text in a string variable, for example STRING, and then use the command

STRING = STRING.lower()

Now your string contains no uppercase letters.

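To remove the digits and special characters, the sub function of the re module can help:

import re
# replace every run of characters that are not ASCII letters with one space
STRING = re.sub(r'[^a-zA-Z]+', ' ', STRING)

With this command, your string will contain no digits or special characters.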
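As a minimal sketch of the lowercase, strip, and split steps taken together: the helper name `clean_text` is made up for illustration, and the pattern assumes that only the ASCII letters a-z should survive, so a non-ASCII letter such as the κ in "NF-κB" is dropped as well.

```python
import re

def clean_text(text):
    """Lowercase the text, turn every run of non-letters into a space, split into words."""
    lowered = text.lower()
    # Anything that is not a-z (digits, punctuation, whitespace) becomes one space.
    letters_only = re.sub(r'[^a-z]+', ' ', lowered)
    return letters_only.split()

sample = "miR-BART15-3p targets 5-FU."
words = clean_text(sample)
print(words)  # ['mir', 'bart', 'p', 'targets', 'fu']
```

Replacing with a space instead of an empty string keeps hyphenated tokens from being glued into one word. Whether fragments such as "p" are useful for the word cloud is a separate decision, for instance dropping words shorter than some length.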

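To determine word frequencies, you can use Counter, which has to be imported from the collections module.

Then use the following command to find out how often each word occurs:

from collections import Counter
Counter(STRING.split()).most_common()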
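The counting step can be combined with stop-word filtering roughly as follows. This is only a sketch: `STOP_WORDS` is a tiny hypothetical set and `word_frequencies` is an illustrative helper name; a real stop-word list would be much longer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"the", "of", "a", "to", "in"}

def word_frequencies(words, stop_words=STOP_WORDS):
    """Count the words, skipping any that appear in stop_words."""
    return Counter(w for w in words if w not in stop_words)

words = "the expression of target genes and the expression of protein".split()
freq = word_frequencies(words)
print(freq.most_common(1))  # [('expression', 2)]
```

`most_common()` without an argument returns every (word, count) pair sorted by descending count, which is usually the shape a word-cloud library expects after converting it to a dict.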

Answer 2

I would probably try using str.isalpha():

abstracts = []
with open('new', 'r') as abstracts_list:
    for ab in abstracts_list:  # each iteration yields one line of text
        if not ab.isalpha():
            # replace every non-letter (digits, signs, whitespace) with a space
            # so that neighbouring words are not glued together
            ab = ''.join(c if c.isalpha() else ' ' for c in ab)
        abstracts.append(ab.lower())
# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)
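Since `str.isalpha()` is Unicode-aware, it keeps letters such as the κ in "NF-κB". A regex-based variant with the same property might look like the sketch below; the helper name `unicode_words` is an assumption for illustration. The character class `[^\W\d_]` matches any word character that is neither a digit nor an underscore, which for `str` patterns in Python 3 means Unicode letters.

```python
import re

def unicode_words(text):
    """Return lowercase runs of Unicode letters, dropping digits and signs."""
    # [^\W\d_] = word characters minus digits and underscore, i.e. letters.
    return re.findall(r'[^\W\d_]+', text.lower())

tokens = unicode_words("NF-κB activity in 5-FU")
print(tokens)  # ['nf', 'κb', 'activity', 'in', 'fu']
```

Because `findall` returns the words directly, no separate split step is needed afterwards.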

python

text

word-cloud