How do I merge a list of dictionaries by an identical field and sum another field in the process?
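I'm trying to merge a list of dictionaries by the url field: if the list contains entries with the same url, they should be merged into a single entry, and another field should be summed across them.
I tried using 'setdefault', but it doesn't always work as expected. After running the loop I still get duplicate results.
Here is the list I'm trying to condense, summing the second field wherever the same url occurs: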
[
['https://www.website.com/directory/link-1',
21,
'Long Text Field 1',
'String 1',
{'url': 'https://www.website.com/images/image-1.jpg'},
255],
['https://www.website.com/directory/link-1',
185,
'Long Text Field 1',
'String 1',
{'url': 'https://www.website.com/images/image-1.jpg'},
255],
['https://www.website.com/directory/link-2',
296,
'Long Text Field 2',
'String 2',
{'url': 'https://www.website.com/images/image-2.jpg'},
303],
['https://www.website.com/directory/link-3',
354,
'Long Text Field 3',
'String 3',
{'url': 'https://www.website.com/images/image-3.jpg'},
388],
['https://www.website.com/directory/link-4',
606,
'Long Text Field 4',
'String 4',
{'url': 'https://www.website.com/images/image-4.jpg'},
624]
]
This is the result I want to get:
[
['https://www.website.com/directory/link-1',
206,
'Long Text Field 1',
'String 1',
{'url': 'https://www.website.com/images/image-1.jpg'},
255],
['https://www.website.com/directory/link-2',
296,
'Long Text Field 2',
'String 2',
{'url': 'https://www.website.com/images/image-2.jpg'},
303],
['https://www.website.com/directory/link-3',
354,
'Long Text Field 3',
'String 3',
{'url': 'https://www.website.com/images/image-3.jpg'},
388],
['https://www.website.com/directory/link-4',
606,
'Long Text Field 4',
'String 4',
{'url': 'https://www.website.com/images/image-4.jpg'},
624]
]
This is what I'm trying:
for url, long_text, number_to_count, another_field, ..., ... in list:
    d = {}
    d.setdefault(url, {}).setdefault("long text", []).append(long_text)
    d[url].setdefault("number_to_count",[]).append(number_to_count)
    d[url].setdefault("another_field",[]).append(another_field)
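You can try the following. It groups the sublists from lst by their first item (the URL) into a defaultdict of lists, then builds the new result by summing only the second item (the number) within each group.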
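from collections import defaultdict
from pprint import pprint

lst = ...

# Group the sublists by their first element (the URL).
d = defaultdict(list)
for item in lst:
    d[item[0]].append(item)

# Keep the first sublist of each group, replacing its second item with the group's sum.
result = [[v[0][0]] + [sum(x[1] for x in v)] + v[0][2:] for v in d.values()]
pprint(result)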
Output:
[['https://www.website.com/directory/link-1',
206,
'Long Text Field 1',
'String 1',
{'url': 'https://www.website.com/images/image-1.jpg'},
255],
['https://www.website.com/directory/link-2',
296,
'Long Text Field 2',
{'url': 'https://www.website.com/images/image-2.jpg'},
303],
['https://www.website.com/directory/link-3',
354,
'Long Text Field 3',
{'url': 'https://www.website.com/images/image-3.jpg'},
388],
['https://www.website.com/directory/link-4',
606,
'Long Text Field 4',
{'url': 'https://www.website.com/images/image-4.jpg'},
624]]
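For comparison, a variant closer to the 'setdefault' attempt in the question can do the grouping and the summing in a single pass. This is only a sketch: the function name merge_by_url and the shortened sample rows are illustrative assumptions, not part of the original answer. The point it tries to show is that the dictionary is created once, before the loop, so entries accumulate instead of being reset on every iteration.

```python
def merge_by_url(rows):
    """Merge rows that share the same first item, summing the second item."""
    merged = {}  # created once, outside the loop
    for row in rows:
        url = row[0]
        # First time this URL is seen: store a copy of the row with the count at 0.
        entry = merged.setdefault(url, [url, 0] + list(row[2:]))
        entry[1] += row[1]
    return list(merged.values())


# Shortened, hypothetical sample rows in the same shape as the question's data.
rows = [
    ['https://www.website.com/directory/link-1', 21, 'Long Text Field 1', 255],
    ['https://www.website.com/directory/link-1', 185, 'Long Text Field 1', 255],
    ['https://www.website.com/directory/link-2', 296, 'Long Text Field 2', 303],
]

result = merge_by_url(rows)
print(result)
```

Because each stored entry is a new list, the input rows are left unmodified. As with the defaultdict version, the fields other than the count are taken from the first row seen for each URL.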
If you want to use pandas, you can get something like the following:
   Page                                      Count  Text               String    Url                                         Magic
0  https://www.website.com/directory/link-1  21     Long Text Field 1  String 1  https://www.website.com/images/image-1.jpg  255
1  https://www.website.com/directory/link-1  185    Long Text Field 1  String 1  https://www.website.com/images/image-1.jpg  255
2  https://www.website.com/directory/link-2  296    Long Text Field 2  None      https://www.website.com/images/image-2.jpg  303
3  https://www.website.com/directory/link-3  354    Long Text Field 3  None      https://www.website.com/images/image-3.jpg  388
4  https://www.website.com/directory/link-4  606    Long Text Field 4  None      https://www.website.com/images/image-4.jpg  624
----
   Page                                      Count  Magic  String    Url                                         Text
0  https://www.website.com/directory/link-1  206    255    String 1  https://www.website.com/images/image-1.jpg  Long Text Field 1
1  https://www.website.com/directory/link-2  296    303    None      https://www.website.com/images/image-2.jpg  Long Text Field 2
2  https://www.website.com/directory/link-3  354    388    None      https://www.website.com/images/image-3.jpg  Long Text Field 3
3  https://www.website.com/directory/link-4  606    624    None      https://www.website.com/images/image-4.jpg  Long Text Field 4
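This is produced by running the code below. Note that I had to add dummy values for the missing strings, because your data format is somewhat inconsistent.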
import pandas as pd
data = [
['https://www.website.com/directory/link-1',
21,
'Long Text Field 1',
'String 1',
{'url': 'https://www.website.com/images/image-1.jpg'},
255],
['https://www.website.com/directory/link-1',
185,
'Long Text Field 1',
'String 1',
{'url': 'https://www.website.com/images/image-1.jpg'},
255],
['https://www.website.com/directory/link-2',
296,
'Long Text Field 2',
{'url': 'https://www.website.com/images/image-2.jpg'},
303],
['https://www.website.com/directory/link-3',
354,
'Long Text Field 3',
{'url': 'https://www.website.com/images/image-3.jpg'},
388],
['https://www.website.com/directory/link-4',
606,
'Long Text Field 4',
{'url': 'https://www.website.com/images/image-4.jpg'},
624]
]
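columns = ['Page', 'Count', 'Text', 'String', 'Url', 'Magic']

# Normalize each row: pad the missing 'String' field and unwrap the image URL.
for d in data:
    if len(d) != 6:
        d.insert(3, None)
    d[4] = d[4]['url']

df = pd.DataFrame(data, columns=columns)

# Take the first value of every column per page, except 'Count', which is summed.
agg = dict.fromkeys(columns, 'first')
agg.update({'Count': 'sum'})
del agg['Page']  # 'Page' is the grouping key, not an aggregated column
# Column order in df2 follows the agg dict, so it can differ between Python versions.
df2 = df.groupby(['Page'], as_index=False).agg(agg)

pd.options.display.width = 0  # let pandas detect the terminal width
print(df)
print('\n----\n')
print(df2)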
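The padding step can also be written as a small helper that returns new rows instead of modifying data in place. That makes it safe to run more than once (the in-place loop above would fail on a second run, because d[4] would already be a string). A minimal sketch, assuming rows have either five or six items as in the data above; the helper name normalize_row and its parameters are hypothetical:

```python
def normalize_row(row, width=6, string_index=3, url_index=4):
    """Return a copy of row padded to `width` items, with the image URL unwrapped."""
    new_row = list(row)
    if len(new_row) != width:
        new_row.insert(string_index, None)  # dummy value for the missing string
    if isinstance(new_row[url_index], dict):
        new_row[url_index] = new_row[url_index]['url']
    return new_row


# Two hypothetical rows: one complete, one missing the 'String' field.
sample = [
    ['https://www.website.com/directory/link-1', 21, 'Long Text Field 1',
     'String 1', {'url': 'https://www.website.com/images/image-1.jpg'}, 255],
    ['https://www.website.com/directory/link-2', 296, 'Long Text Field 2',
     {'url': 'https://www.website.com/images/image-2.jpg'}, 303],
]

normalized = [normalize_row(r) for r in sample]
print(normalized)
```

The resulting list can be passed to pd.DataFrame(normalized, columns=columns) in place of the mutated data.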