Extract text counts from a list of elements
I have a list of text elements.
text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
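I need to count the text that appears before the "=". I used CountVectorizer with a token pattern as shown below, but it does not give the expected result.
from sklearn.feature_extraction.text import CountVectorizer

print(text)
vectorizer = CountVectorizer(token_pattern="^[^=]+")
vectorizer.fit(text)
print(vectorizer.vocabulary_)
The output is: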
{'a for': 2, 'b for': 3, 'd for': 4, 'e for': 5, '1.': 0, '2.': 1}
But the expected output is:
{'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1.': 1, '2.': 1}
I also need to remove the "." from "1." so that my output becomes:
{'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}
Is there any way to do this?
import re

dictionary = {}

def remove_special_characters(value):
    # For items such as '1.=one', drop the '.=' and everything after it
    if '.' in value:
        return re.sub(r'\.=\w+', '', value)
    # Otherwise keep only the part before '='
    return value.split('=')[0]

# `text` is the list from the question
for value in text:
    new_value = remove_special_characters(value)
    if new_value in dictionary:
        dictionary[new_value] += 1
    else:
        dictionary[new_value] = 1

print(dictionary)
>>> {'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}
from collections import Counter

text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
text = [i.split('=')[0] for i in text]  # consider only the first part of the split
text = [i.split('.')[0] for i in text]  # keep only what precedes the first '.'

frequency = {}
for each in text:
    if each in frequency:
        frequency[each] += 1
    else:
        frequency[each] = 1

print(frequency)  # if you want to use a dict
counts = list(Counter(text).items())  # if you want to use the collections module
print(counts)
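Note that this only works for a list shaped like your text, where each item contains a single =. For anything else you will need to adjust it a little.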
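As a rough sketch of the kind of adjustment that might be needed, the version below makes two assumptions that are not in the question: items without any = are skipped, and only trailing dots are stripped from the key, so a dot in the middle survives. The function name count_keys and the sample list are made up for illustration.

```python
from collections import Counter

def count_keys(items):
    """Count the text before the first '=' in each item (hypothetical helper)."""
    counts = Counter()
    for item in items:
        # partition splits at the first '=' only; sep is '' when no '=' exists
        key, sep, _ = item.partition('=')
        if not sep:
            continue  # assumption: items without '=' are ignored
        # assumption: strip trailing dots only, so 'v1.2.' becomes 'v1.2'
        counts[key.rstrip('.')] += 1
    return dict(counts)

sample = ['a for=apple', 'a for=x=y', '1.=one', 'v1.2.=release', 'no separator']
print(count_keys(sample))
# {'a for': 2, '1': 1, 'v1.2': 1}
```

By contrast, split('.')[0] would cut 'v1.2.' down to 'v1', which may or may not be what you want.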
You can do this without CountVectorizer:
text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
left_sides = [pair.split('=')[0].replace('.','') for pair in text]
uniques = set(left_sides)
counts = {i:left_sides.count(i) for i in uniques}
print(counts)
This produces:
{'d for': 2, 'b for': 1, '1': 1, 'a for': 2, '2': 1, 'e for': 1}
A simple way is to use collections.Counter():
>>> from collections import Counter
>>> text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
>>> Counter(x.split('=')[0].replace('.', '') for x in text)
Counter({'a for': 2, 'd for': 2, 'b for': 1, 'e for': 1, '1': 1, '2': 1})
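This first splits each string in text on "=" into a list and takes the first element. It then calls replace() to replace every instance of "." with "". Finally, it returns a Counter() object holding the counts.
Note: if you want a plain dictionary at the end, you can wrap the last line in dict().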
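For illustration, wrapping the same expression in dict() might look like the sketch below. It reuses the list from the question; the variable name counts is just a placeholder.

```python
from collections import Counter

text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

# Same expression as above, converted from a Counter to a plain dict
counts = dict(Counter(x.split('=')[0].replace('.', '') for x in text))
print(counts)
# {'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}
```

The keys come out in first-seen order, whereas the Counter repr shown earlier sorts them by count.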