Removing characters/symbols from a string
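I am preparing text for a word cloud, but I am stuck.
I need to remove all digits and all symbols such as . , - ? = / ! @ etc., but I don't know how. I don't want to call replace again and again. Is there a way to do this?
Here is my plan and what I have to do:
- Concatenate the text into one string
- Convert the characters to lowercase <--- I'm here
- Now I want to remove specific symbols and split the text into words (a list)
- Count word frequencies
- Next, run a stop-word script...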
abstracts_list = open('new','r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
Lower = allab.lower()
Sample text:
MicroRNAs (miRNAs) are a class of noncoding RNA molecules
approximately 19 to 25 nucleotides in length that downregulate the
expression of target genes at the post-transcriptional level by
binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus
(EBV) generates at least 44 miRNAs, but the functions of most of these
miRNAs have not yet been identified. Previously, we reported BRUCE as
a target of miR-BART15-3p, a miRNA produced by EBV, but our data
suggested that there might be other apoptosis-associated target genes
of miR-BART15-3p. Thus, in this study, we searched for new target
genes of miR-BART15-3p using in silico analyses. We found a possible
seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The
luciferase activity of a reporter vector including the 3'-UTR of
TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated
the expression of TAX1BP1 mRNA and protein in AGS cells, while an
inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1
mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB
activity in gastric cancer cell lines. Moreover, miR-BART15-3p
strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our
results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1
gene in cancer cells, causing increased apoptosis and chemosensitivity
to 5-FU.
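To convert uppercase characters to lowercase, store your text in a string variable, for example STRING, and call:
STRING = STRING.lower()
Your string will then contain no uppercase letters. No regular expression is needed for this step: substituting '[A-Z]' with an empty string would delete the capital letters instead of lowercasing them.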
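To remove the digits and special characters, the sub function of the re module can help (it needs import re):
STRING = re.sub('[^a-zA-Z]+', ' ', STRING)
After this call, every run of digits, punctuation, or other non-letter characters has been replaced by a single space, so only ASCII letters and spaces remain.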
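As a rough sketch of the regex approach on one sentence from the sample text: the helper name strip_symbols is made up for illustration, and it assumes that only the ASCII letters a-z should count as word characters.

```python
import re

def strip_symbols(text):
    """Lowercase text and replace every run of non-letters with one space."""
    # Assumption: only ASCII letters a-z count as word characters here.
    lowered = text.lower()
    # [^a-z]+ matches digits, punctuation, whitespace and non-ASCII symbols alike.
    return re.sub(r"[^a-z]+", " ", lowered).strip()

sample = "Epstein-Barr virus (EBV) generates at least 44 miRNAs."
print(strip_symbols(sample))
# prints: epstein barr virus ebv generates at least mirnas
```

Note that hyphenated terms are split into separate words this way, and a name like miR-BART15-3p loses its digits, which may or may not be what you want in a word cloud.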
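To determine the word frequencies, you can use Counter, which has to be imported from the collections module (from collections import Counter).
The following call then returns the words together with their counts, most frequent first:
Counter(STRING.split()).most_common()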
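The counting step and the stop-word step from the question could be combined along these lines. This is only a sketch: STOP_WORDS is a tiny hypothetical list and word_frequencies is an illustrative name, not part of any library.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only; a real one would be much longer.
STOP_WORDS = {"the", "of", "a", "and", "in", "to"}

def word_frequencies(cleaned_text, stop_words=STOP_WORDS):
    """Count words in already-cleaned, lowercase text, skipping stop words."""
    words = [w for w in cleaned_text.split() if w not in stop_words]
    return Counter(words)

freq = word_frequencies("the expression of target genes and the target genes of the virus")
print(freq.most_common(2))  # the two most frequent remaining words with their counts
```

Filtering before counting keeps the Counter small; the resulting mapping of word to count is the kind of input a word-cloud tool typically needs.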
I would probably try str.isalpha():
abstracts = []
with open('new', 'r') as abstracts_list:
    for ab in abstracts_list:  # this gives one line of text.
        if not ab.isalpha():
            # keep letters and whitespace so the words stay separated
            ab = ''.join(c for c in ab if c.isalpha() or c.isspace())
        abstracts.append(ab.lower())

# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)
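To try the isalpha() idea without a file on disk, the same filtering can be sketched on a list of lines. The function name clean_lines is an assumption made for this example, and it keeps whitespace in addition to letters so that the result can still be split into words.

```python
def clean_lines(lines):
    """Join lines into one lowercase string, keeping only letters and whitespace."""
    kept = []
    for line in lines:
        # str.isalpha() is Unicode-aware, so letters such as the Greek kappa survive.
        kept.append("".join(c for c in line if c.isalpha() or c.isspace()))
    return "".join(kept).lower()

lines = ["NF-κB activity in gastric cancer cell lines.\n", "Chemosensitivity to 5-FU.\n"]
words = clean_lines(lines).split()
print(words)
```

Unlike the regex variant that replaces symbols with a space, this one simply drops them, so "NF-κB" becomes the single word "nfκb" rather than two fragments.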