How to count the length of words in a dictionary

Question

I have a list of dictionaries like this:

myList = [
    {
        'id':1,
        'text':[
            'I like cheese.', 
            'I love cheese.',
            'oh!'
        ],
        'text_2': [
            ('david', 'david', 'I do not like cheese.'),
            ('david', 'david', 'cheese is good.')
        ]
    },
    {
        'id':2,
        'text':[
            'I like strawberry.',
            'I love strawberry'
        ],
        'text_2':[
            ('alice', 'alice', 'strawberry is good.'),
            ('alice', 'alice', ' strawberry is so so.')
        ]
    }
]

For each 'id', I want to count the number of elements and the number of words in 'text' and 'text_2'. The ideal output is:

myList = [
    {
        'id':1,
        'text':(3,7),
        'text_2': (2,8)   
    },
    {
        'id':2,
        'text':(2,6),
        'text_2':(2,7)    
    }
]

'text':(3,7) means: 3 elements ('I like cheese.', 'I love cheese.', 'oh!') and 7 words (I, like, cheese, I, love, cheese, oh).

'text_2':(2,8) means: 2 elements (('david', 'david', 'I do not like cheese.'), ('david', 'david', 'cheese is good.')) and 8 words (I, do, not, like, cheese, cheese, is, good).

Any suggestions?

Answer 1

For example, like this:

from itertools import chain
from string import punctuation

def remove_punctuation(text):
    return "".join(filter(lambda x: x not in punctuation, text))

def count_items_and_words(items, label):
    items_cnt = len(items)
    
    if label == "text":
        total_text = " ".join(items)
    elif label == "text_2":
        total_text = " ".join(chain(*[it[2:] for it in items]))
    total_text_clean = remove_punctuation(total_text)
    
    words_cnt = len(total_text_clean.split())
    return (items_cnt, words_cnt)

def count_all(my_list):
    results = list()
    for it in my_list:
        if not isinstance(it, dict):
            continue
        res = {"id": it["id"]}
        for label in "text", "text_2":
            res[label] = count_items_and_words(it[label], label)
        results.append(res)
    return results

results = count_all(myList)
results

Output:

[{'id': 1, 'text': (3, 7), 'text_2': (2, 8)},
 {'id': 2, 'text': (2, 6), 'text_2': (2, 7)}]
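As a side note, the punctuation step mostly matters when a punctuation mark stands alone as its own token. Here is a small sketch of the same cleaning done with str.translate instead of filter; strip_punctuation and sample are hypothetical names used only for illustration, not part of the code above.

```python
from string import punctuation

def strip_punctuation(text):
    # Delete every ASCII punctuation character in one pass.
    return text.translate(str.maketrans('', '', punctuation))

sample = 'oh ! I like cheese.'
print(len(sample.split()))                     # 5, the lone '!' counts as a word
print(len(strip_punctuation(sample).split()))  # 4
```

For the data in the question the two counts happen to agree, because every punctuation mark is attached to a word.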

Answer 2

See below:

lst = [
    {
        'id': 1,
        'text': [
            'I like cheese.',
            'I love cheese.',
            'oh!'
        ],
        'text_2': [
            ('david', 'david', 'I do not like cheese.'),
            ('david', 'david', 'cheese is good.')
        ]
    },
    {
        'id': 2,
        'text': [
            'I like strawberry.',
            'I love strawberry'
        ],
        'text_2': [
            ('alice', 'alice', 'strawberry is good.'),
            ('alice', 'alice', ' strawberry is so so.')
        ]
    }
]
out = []
for entry in lst:
    out.append({})
    for k, v in entry.items():
        if k == 'id':
            out[-1][k] = v
        elif k == 'text':
            out[-1][k] = (len(v), sum(len(x.split()) for x in v))
        else:
            # text_2 holds tuples; the sentence is the third element of each tuple
            out[-1][k] = (len(v), sum(len(x[2].split()) for x in v))
print(out)

Output

[{'id': 1, 'text': (3, 7), 'text_2': (2, 8)}, {'id': 2, 'text': (2, 6), 'text_2': (2, 7)}]

Answer 3

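If you are a beginner, my answer may be hard to digest, but I hope you will find some combinations of tools in it that are useful for the future... also because you did not show any attempt of your own.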

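' '.join(my_list) joins the elements of a list into a single string separated by spaces
my_string.split() splits a string on whitespace to create a list (-> so you can count the items)
set(my_list) removes repeated occurrences of an element
itertools.chain chains iterables together; here it merges the tuples in a list into a single iterable
list comprehension, e.g. [i for i in range(10) if i > 5]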
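A quick illustration of these building blocks on sample data from the question. The variable names are chosen just for this sketch.

```python
from itertools import chain

sentences = ['I like cheese.', 'I love cheese.', 'oh!']
pairs = [('david', 'david', 'I do not like cheese.'),
         ('david', 'david', 'cheese is good.')]

joined = ' '.join(sentences)    # 'I like cheese. I love cheese. oh!'
words = joined.split()          # 7 whitespace-separated tokens
flat = list(chain(*pairs))      # 6 strings, the tuples are flattened
unique = set(flat)              # 'david' is kept once, 3 strings remain
long_words = [w for w in words if len(w) > 4]  # comprehension with a filter

print(len(words), len(flat), len(unique), long_words)
```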

Since you did not specify any rule for handling repeated occurrences of the same element, I count each one only once (so 'david', 'david' is counted as 1).

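My answer to your request for suggestions is divide and conquer: split a big problem into small ones, solve them, and glue the results together.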

import itertools as it

# myList is the list of dictionaries from the question

for d in myList:
    for k, v in d.items():
        if isinstance(v, list):
            # (number of elements, number of words)
            if isinstance(v[0], str):
                pair = len(v), len(' '.join(v).split())
            else:
                # flatten the tuples and drop repeated strings before counting
                pair = len(v), len(' '.join(set(it.chain(*v))).split())
            print(pair)
        else:
            print(k, v)

Output

id 1
(3, 7)
(2, 9)
id 2
(2, 6)
(2, 8)
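To make the divide-and-conquer idea more concrete, the same logic can be split into small named steps. This is only a sketch: count_words, unique_strings and summarize are hypothetical helper names, and it keeps the assumption that repeated strings inside the tuples are counted once.

```python
from itertools import chain

def count_words(strings):
    """Count whitespace-separated words across an iterable of strings."""
    return len(' '.join(strings).split())

def unique_strings(tuples):
    """Flatten a list of tuples and drop repeated strings."""
    return set(chain.from_iterable(tuples))

def summarize(values):
    """Return (number of elements, number of words) for one list."""
    if values and isinstance(values[0], str):
        return len(values), count_words(values)
    return len(values), count_words(unique_strings(values))

entry = {
    'id': 1,
    'text': ['I like cheese.', 'I love cheese.', 'oh!'],
    'text_2': [('david', 'david', 'I do not like cheese.'),
               ('david', 'david', 'cheese is good.')],
}
summary = {k: summarize(v) if isinstance(v, list) else v for k, v in entry.items()}
print(summary)  # {'id': 1, 'text': (3, 7), 'text_2': (2, 9)}
```

Each helper can be tested on its own, and changing the rule for duplicates only touches unique_strings.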

python

dictionary

list

word-count