How to remove every word with non-alphabetic characters
I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law.
For example:
asdf@gmail.com said: I've taken 2 reports to the boss
should become:
taken reports to the boss
How should I proceed?
Using a regular expression that matches only letters (and underscores), you could do this:
import re
s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()
tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
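The same filter can be wrapped in a small helper and applied line by line to a file (a sketch; `text.txt` stands in for your input file). Using `re.fullmatch` with `+` makes the "whole token, at least one character" requirement explicit:

```python
import re

# letters (and underscore) only; + requires at least one character
pattern = re.compile(r'[^\W\d]+')

def clean_line(line):
    return ' '.join(t for t in line.split() if pattern.fullmatch(t))

print(clean_line("asdf@gmail.com said: I've taken 2 reports to the boss"))
# taken reports to the boss

# hypothetical file usage:
# with open('text.txt') as f:
#     for line in f:
#         print(clean_line(line))
```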
This might help:
array = string.split()
result = []
for word in array:
    if word.isalpha():
        result.append(word)
string = ' '.join(result)
Try this:
sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']
result = ' '.join(words)
# taken reports to the boss
You can use a regular expression, or Python's built-in string methods such as isalpha().
An example using isalpha():
with open('file path') as f:
    for line in f:
        for word in line.split():
            if word.isalpha():
                print(word + ' ', end='')
str.join() plus a comprehension gives you a one-line solution:
sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'
You can use split() and isalpha() to get a list of words that contain only alphabetic characters and have at least one character.
>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']
You can then use join() to turn the list back into a string:
>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
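Since the goal is to test Zipf's law, the filtered words can go straight into a frequency count. A sketch using collections.Counter (lowercasing is my addition, so that "The" and "the" count as one word):

```python
from collections import Counter

def word_frequencies(text):
    # keep purely alphabetic words, lowercased, then count them
    words = [w.lower() for w in text.split() if w.isalpha()]
    return Counter(words)

freqs = word_frequencies("the boss said the reports reached the boss")
print(freqs.most_common(2))
# [('the', 3), ('boss', 2)]
```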
The nltk package is designed specifically for working with text and has several functions you can use to 'tokenize' text into words.
You can use RegexpTokenizer, or word_tokenize with a slight adjustment.
The easiest and simplest is RegexpTokenizer:
import nltk
text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
result = nltk.RegexpTokenizer(r'\w+').tokenize(text)
which returns:
`['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']`
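If you only want alphabetic tokens, a letters-only pattern drops the digits as well. RegexpTokenizer(r'\w+') behaves much like re.findall with the same pattern, so the idea can be sketched without the nltk dependency:

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss."
# [A-Za-z]+ keeps only runs of ASCII letters, so '2' disappears
tokens = re.findall(r'[A-Za-z]+', text)
print(tokens)
# ['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', 'reports', 'to', 'the', 'boss']
```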
Or you can use the smarter word_tokenize, which is able to split most contractions, such as didn't into did and n't.
import re
import nltk

nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]
which returns:
['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
I ended up writing my own function for this, because regular expressions and isalpha() didn't work for my test cases.
letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    for char in word.lower():
        if char not in letters:
            return False
    return True
# only 'asdf' is valid here
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']
print([x for x in hard_words if only_letters(x)])
# prints ['asdf']
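On Python 3.7+, str.isascii() combined with isalpha() gives the same result for these test cases, without a hand-rolled set:

```python
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']
# isascii() rejects the accented and CJK entries; isalpha() rejects the markup token
ascii_alpha = [w for w in hard_words if w.isascii() and w.isalpha()]
print(ascii_alpha)
# ['asdf']
```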