Pandas Data Summarization
I have some messy data, shown below. Note that the first item (the name) repeats across entries; this is important and needs to be accounted for.
('Alex', ['String1', 'String34'])
('Piper', ['String5', 'String64', 'String12'])
('Nicky', ['String3', 'String21', 'String42', 'String51'])
('Linda', ['String14'])
('Suzzane', ['String11', 'String36', 'String16'])
('Alex', ['String64', 'String34', 'String12', 'String5'])
('Linda', ['String3', 'String77'])
('Piper', ['String41', 'String64', 'String11', 'String34'])
('Suzzane', ['String12'])
('Nicky', ['String11', 'String51'])
('Alex', ['String77', 'String64', 'String3', 'String5'])
('Linda', ['String51'])
('Nicky', ['String77', 'String12', 'String34'])
('Suzzane', ['String51', 'String3'])
('Piper', ['String11', 'String64', 'String5'])
If the above data is in a file named "output.txt", how can I import it and summarize it as shown below?
[Keep only the unique names, and for each name, collect only the unique strings from all of its duplicate entries.]
('Alex', ['String1', 'String34', 'String64', 'String12', 'String5', 'String77', 'String3'])
('Piper', ['String5', 'String64', 'String12', 'String11', 'String41', 'String34'])
('Nicky', ['String3', 'String21', 'String42', 'String51', 'String11', 'String77', 'String12', 'String34'])
('Linda', ['String14', 'String3', 'String77', 'String51'])
('Suzzane', ['String11', 'String36', 'String16', 'String12', 'String51', 'String3'])
You can load the data into a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(data=[('Alex', ['String1', 'String34']),
                        ('Alex', ['String64', 'String34', 'String12', 'String5']),
                        ('Nicky', ['String11', 'String51']),
                        ('Nicky', ['String77', 'String12', 'String34'])])
df = df.rename(columns={0:'name', 1:'strings'})
Then create a function to concatenate the lists in a pandas column:
def concatenate(strings):
    strings_agg = []
    for string in strings:
        strings_agg.extend(string)
    return strings_agg
Finally, apply the function to the grouped column:
df.groupby('name').apply(lambda x: concatenate(x['strings'])).to_frame()
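Note that this only concatenates the lists; it does not remove duplicate strings within a group. Here is a minimal sketch of one way to deduplicate while preserving first-seen order (the dict.fromkeys step is my addition, not part of the original answer):

import pandas as pd

df = pd.DataFrame(data=[('Alex', ['String1', 'String34']),
                        ('Alex', ['String64', 'String34', 'String12', 'String5'])],
                  columns=['name', 'strings'])

# Flatten each group's lists, then use dict.fromkeys to drop duplicates
# while keeping the order in which the strings first appeared.
result = df.groupby('name')['strings'].apply(
    lambda lists: list(dict.fromkeys(s for strings in lists for s in strings)))
print(result['Alex'])
# ['String1', 'String34', 'String64', 'String12', 'String5']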
import ast
import csv
import pandas as pd

# load the data from the txt file; it doesn't have to be a csv, read_csv can read a txt file too!
# "/n" never occurs in the data, so each whole line is read as a single field
df = pd.read_csv(r"D:\test\output.txt", sep="/n", header=None, names=["data"], engine='python')
# convert the text data to tuples and lists
df["data"] = df["data"].map(lambda x: ast.literal_eval(x))
# extract the name
df["surename"] = df["data"].map(lambda x: x[0])
# extract the list of strings
df["strings"] = df["data"].map(lambda x: x[1])
# create one row for each string in the list of strings
df = df.explode("strings")
# remove duplicate entries
df = df.drop_duplicates(subset=["surename", "strings"], keep="first")
# group the data by name to get a list of unique strings
# (unique because we removed the duplicates; the original order is kept)
df_result = df.groupby(["surename"]).aggregate({"strings": list}).reset_index()
# combine the extracted name and the deduplicated list of strings again
df_result["result"] = df_result.apply(lambda x: (x["surename"], x["strings"]), axis=1)
# output the data to a file of your choice
df_result[["result"]].to_csv(r"D:\test\result.txt", index=False, header=None, quoting=csv.QUOTE_NONE, escapechar='')
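To see what the intermediate steps produce, here is a minimal sketch of the explode → drop_duplicates → groupby pipeline on a made-up two-row frame (no file I/O):

import pandas as pd

df = pd.DataFrame({"surename": ["Alex", "Alex"],
                   "strings": [["String1", "String34"], ["String34", "String5"]]})

df = df.explode("strings")                               # one row per string
df = df.drop_duplicates(subset=["surename", "strings"])  # drop repeated (name, string) pairs
out = df.groupby("surename").aggregate({"strings": list})
print(out.loc["Alex", "strings"])
# ['String1', 'String34', 'String5']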
I agree that pandas is a great library, but this kind of thing can easily be done with Python's built-in packages alone.¹ You can simply use a Python defaultdict with sets, and use regex finditer for the parsing.
¹ This makes particular sense here since neither your input nor your output is any pandas data type (pd.Series, pd.DataFrame, ...) or even a standard .csv/tabular format.
Code
from collections import defaultdict
import re

dataset = defaultdict(set)

with open('output.txt') as f:
    for line in f:
        itr = re.finditer(r"'(\S+?)'", line)
        name = next(itr).groups()[0]
        strings = {x.groups()[0] for x in itr}
        dataset[name] |= strings

with open('results.txt', 'w') as f:
    for name, strings in dataset.items():
        print(f"('{name}', {list(strings)})", file=f)
Example output
('Alex', ['String1', 'String5', 'String77', 'String64', 'String34', 'String12', 'String3'])
('Piper', ['String5', 'String11', 'String64', 'String34', 'String12', 'String41'])
('Nicky', ['String21', 'String77', 'String34', 'String11', 'String51', 'String3', 'String12', 'String42'])
('Linda', ['String77', 'String14', 'String51', 'String3'])
('Suzzane', ['String11', 'String36', 'String12', 'String16', 'String51', 'String3'])
How the code works
- Read line by line. We can use a regular expression to capture any non-whitespace characters (\S) between two single quotes ('). The regex pattern is therefore '(\S+?)'. The plus sign + means match one or more characters, and the ? makes the match non-greedy (matching as few characters as possible), so we parse each individual string instead of one big string spanning the whole line.
- re.finditer matches multiple groups with the same pattern. It is used here instead of re.findall because re.findall creates a list whereas re.finditer creates an iterator. (A small optimization: no list is created, because none is needed.)
- Then we capture the name by calling next() on itr, which returns the first element of the iterator.
- Then groups() is called and the first item of its return value is taken. This is how you access groups captured with parentheses (()) in the pattern.
- Then, for the rest of the iterator itr, only the strings remain; from these we create Python sets, which guarantee unique elements. The syntax shown is a set comprehension.
- On the same line we save the resulting set into the dataset variable, which is a defaultdict. defaultdicts are nice because accessing a non-existing item automatically creates an entry of the default type; we used defaultdict(set) to have set as that default. The operation d[key] |= val is the same as d[key] = d[key] | val, and | creates a new set that is the union of the incoming set and whatever set may already be stored in dataset.
- The last part just writes the output line by line to results.txt. Converting strings to a list is optional, but it makes the output resemble the one in the question.
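As a quick sanity check of the parsing step, here is a standalone sketch run on one sample line from the question:

import re

# One sample line from the input file
line = "('Alex', ['String1', 'String34'])"

itr = re.finditer(r"'(\S+?)'", line)
print(next(itr).groups()[0])         # Alex -- the first match is the name
print({m.groups()[0] for m in itr})  # {'String1', 'String34'} -- the remaining matches, as a set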
data = []
a_dict = {}
unique = []

# Assuming the file is named a.txt here.
# After opening the file, eval turns each line's string into a Python object,
# so the list data will hold all the file's data; every element is a tuple.
# (ast.literal_eval would be a safer alternative to eval for untrusted input.)
with open('a.txt', 'r') as file:
    for i in file.readlines():
        a = eval(i)
        data.append(a)

# Collect all the unique names in a list.
for i in data:
    if i[0] not in unique:
        unique.append(i[0])

# Iterate over all the unique names, then over the full data set:
# whenever a name matches, extend a_list with that entry's strings.
# When the inner loop finishes, store the deduplicated list in a_dict
# under that name; a_list is reset for the next unique name.
for i in unique:
    a_list = []
    for j in data:
        if i == j[0]:
            a_list.extend(j[1])
    # dict.fromkeys removes duplicates while preserving insertion order
    a_dict[i] = list(dict.fromkeys(a_list))

# Print out the data in the requested format.
for i, j in a_dict.items():
    print("('{}', {})".format(i, j))