Finding the most frequent strings in a list, ignoring case sensitivity

I have a list of Twitter hashtags named li. I want to create a new list, top_10, containing the most frequent hashtags from it. Here is what I have so far:

li = ['COVID19', 'Covid19', 'covid19', 'coronavirus', 'Coronavirus',...]
tag_counter = dict()
for tag in li:
    if tag in tag_counter:
        tag_counter[tag] += 1
    else:
        tag_counter[tag] = 1
 
popular_tags = sorted(tag_counter, key = tag_counter.get, reverse = True)

top_10 = popular_tags[:10]

print('\nList of the top 10 popular hashtags are :\n',top_10)

Since hashtags are case-insensitive, I want the counting to ignore case when I build tag_counter.

Use collections.Counter from the standard library:

from collections import Counter

list_of_words = ['hello', 'hello', 'world']
lowercase_words = [w.lower() for w in list_of_words]

Counter(lowercase_words).most_common(1)

Returns:

[('hello', 2)]

Normalize the data first, using either lower or upper.

li = ['COVID19', 'Covid19', 'covid19', 'coronavirus', 'Coronavirus']
# Normalize case first so that every spelling of a tag shares one key.
li = [x.upper() for x in li]  # or: li = [x.lower() for x in li]
tag_counter = dict()
for tag in li:
    if tag in tag_counter:
        tag_counter[tag] += 1
    else:
        tag_counter[tag] = 1
 
popular_tags = sorted(tag_counter, key = tag_counter.get, reverse = True)

top_10 = popular_tags[:10]

print('\nList of the top 10 popular hashtags are :\n',top_10)
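
If the hashtags may contain non-ASCII text, str.casefold() is a more aggressive alternative to lower() for caseless matching. A minimal sketch, assuming made-up sample data (the tags list below is hypothetical and not from the question):

```python
from collections import Counter

# Hypothetical sample data: three spellings of the same German word.
tags = ['Straße', 'STRASSE', 'strasse']

# lower() keeps 'ß', so 'straße' is counted separately from 'strasse'.
lower_counts = Counter(t.lower() for t in tags)

# casefold() maps 'ß' to 'ss', so all three spellings collapse into one key.
folded_counts = Counter(t.casefold() for t in tags)

print(lower_counts)
print(folded_counts)
```

For plain ASCII hashtags such as the ones in the question, lower(), upper() and casefold() all group the tags the same way.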

You can use Counter from the collections module:

from collections import Counter

li = ['COVID19', 'Covid19', 'covid19', 'coronavirus', 'Coronavirus']

print(Counter([i.lower() for i in li]).most_common(10))

Output:

[('covid19', 3), ('coronavirus', 2)]
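
Note that most_common returns (tag, count) pairs. If only the tag names are wanted, as with the top_10 list in the question, they can be unpacked from the pairs. A short sketch using the same sample list:

```python
from collections import Counter

li = ['COVID19', 'Covid19', 'covid19', 'coronavirus', 'Coronavirus']

# Count the lowercase forms, then keep only the tag from each (tag, count) pair.
counts = Counter(tag.lower() for tag in li)
top_10 = [tag for tag, _ in counts.most_common(10)]

print(top_10)  # ['covid19', 'coronavirus']
```

most_common(10) simply returns fewer than ten entries when the list has fewer than ten distinct tags.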

See below: lowercase each string first, then count with Counter.

from collections import Counter

lst = ['Ab','aa','ab','Aa','Cct','aA']
lower_lst = [x.lower() for x in lst ]
counter = Counter(lower_lst)
print(counter.most_common(1))