How to remove special characters except space from a file in Python?
I have a huge text corpus (line by line), and I want to remove the special characters while keeping the spaces and the structure of the strings.
hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.
should become
hello there A Z R T world welcome to python
this should be the next line followed by another million like this
You can also use this regex pattern:
import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)
Output:
hello there A Z R T world welcome to python
this should the next line followed by an other million like this
Edit:
Alternatively, you can store the resulting lines in a list:
final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)
Output:
['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
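Since the question is about a file rather than an in-memory string, the same substitution can be applied while streaming the file line by line. What follows is a minimal sketch, assuming ASCII letters and digits are the only characters worth keeping. The name clean_lines is hypothetical, and the demo uses io.StringIO in place of a real file. It also strips the trailing space that the pattern leaves when a line ends in punctuation.

```python
import io
import re

# Compile once; the pattern is reused for every line of a large corpus.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")

def clean_lines(lines):
    """Yield each line with runs of non-alphanumeric characters collapsed to one space."""
    for line in lines:
        # strip() removes the space left behind by punctuation at either end.
        yield NON_ALNUM.sub(" ", line).strip()

# Demo on an in-memory file object; a file opened with open() iterates the same way.
sample = io.StringIO(
    "hello? there A-Z-R_T(,**), world, welcome to python.\n"
    "this **should? the next line#followed- by@ an#other %million^ %%like $this.\n"
)
cleaned = list(clean_lines(sample))
print(cleaned)
```

For a real corpus, passing the object returned by open() to clean_lines and writing each result followed by "\n" keeps one output line per input line without loading the whole file into memory.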
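I think nfn neil's answer is great, but I would like to add a simpler regex that removes all non-word characters. Note that it treats the underscore as part of a word:
print(re.sub(r'\W+', ' ', string))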
>>> hello there A Z R_T world welcome to python
Create a dictionary that maps each special character to None:
d = {c:None for c in special_characters}
Build a translation table from the dictionary, read the entire text into a variable, and call str.translate on it.
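The dictionary approach might look like the sketch below. It assumes Python 3, where str.translate expects a table keyed by code points, so the dictionary is passed through str.maketrans first. Taking special_characters to be string.punctuation is an assumption; the answer does not say what it contains. Because the characters are deleted rather than replaced, neighbours such as A-Z-R_T end up joined together.

```python
import string

# Assumption: "special characters" means ASCII punctuation.
special_characters = string.punctuation

# Map every special character to None, i.e. delete it.
d = {c: None for c in special_characters}

# str.maketrans converts the single-character keys to code points.
table = str.maketrans(d)

text = "hello? there A-Z-R_T(,**), world, welcome to python."
result = text.translate(table)
print(result)  # hello there AZRT world welcome to python
```

Mapping each character to " " instead of None keeps the words apart, at the cost of leaving runs of several spaces where several special characters were adjacent.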
A more elegant solution would be
print(re.sub(r"\W+|_", " ", string))
>>> hello there A Z R T world welcome to python this should the next line followed by an other million like this
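Here,
re is the regex module in Python
re.sub replaces every match of the pattern with a space, i.e. " "
r'' marks the pattern as a raw string, so backslashes such as the one in \W reach the regex engine unchanged
\W matches any non-word character, i.e. special characters such as *&^%$ as well as whitespace (including \n); it does not match the underscore _
+ matches one or more repetitions (unlike *, which matches zero or more)
| is a logical OR
_ stands for a literal underscore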
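To make the difference between these patterns concrete, here is a small comparison. The sample strings and variable names are illustrative only, and the [\W_]+ variant is an alternative spelling not taken from the answer.

```python
import re

s = "A-Z-R_T(,**), world"

# \W+ matches runs of non-word characters; the underscore counts as a word character.
keep_underscore = re.sub(r"\W+", " ", s)

# Adding |_ also replaces each underscore.
drop_underscore = re.sub(r"\W+|_", " ", s)

# With \W+|_ every underscore is its own match, so mixed runs give several spaces.
separate = re.sub(r"\W+|_", " ", "a_-_b")

# A single character class merges underscores and punctuation into one run.
merged = re.sub(r"[\W_]+", " ", "a_-_b")

print(keep_underscore)  # A Z R_T world
print(drop_underscore)  # A Z R T world
print(repr(separate))   # 'a   b'
print(repr(merged))     # 'a b'
```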
You can try this
import re
sentence = '''hello? there A-Z-R_T(,**), world, welcome to python. this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
res = re.sub('[!,*)@#%(&$_?.^]', '', sentence)
print(res)
re.sub('["]') -> here, inside the brackets, you can add the symbols you want to remove
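As the list of symbols grows, hand-writing the character class becomes error-prone, because characters such as ], ^ and - have a special meaning inside brackets. One way around this is to let re.escape build the class, as sketched below. The symbols string is a hypothetical example, not the exact set from the answer.

```python
import re

# Hypothetical set of symbols to delete; extend as needed.
symbols = "!,*)@#%(&$_?.^-]"

# re.escape backslash-escapes characters that are special in a regex,
# so the string can be placed safely inside [...].
pattern = re.compile("[" + re.escape(symbols) + "]")

sentence = "hello? there A-Z-R_T(,**), world, welcome to python."
res = pattern.sub("", sentence)
print(res)  # hello there AZRT world welcome to python
```

Like the original snippet, this deletes the symbols instead of replacing them with a space, so A-Z-R_T collapses to AZRT.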