Pandas Data Summarization
I have some messy data, shown below. Note that the first item (the name) repeats across entries; this is important and needs to be accounted for.
('Alex', ['String1', 'String34'])
('Piper', ['String5', 'String64', 'String12'])
('Nicky', ['String3', 'String21', 'String42', 'String51'])
('Linda', ['String14'])
('Suzzane', ['String11', 'String36', 'String16'])
('Alex', ['String64', 'String34', 'String12', 'String5'])
('Linda', ['String3', 'String77'])
('Piper', ['String41', 'String64', 'String11', 'String34'])
('Suzzane', ['String12'])
('Nicky', ['String11', 'String51'])
('Alex', ['String77', 'String64', 'String3', 'String5'])
('Linda', ['String51'])
('Nicky', ['String77', 'String12', 'String34'])
('Suzzane', ['String51', 'String3'])
('Piper', ['String11', 'String64', 'String5'])
If the above data is in a file named "output.txt", how can I import it and summarize it as shown below?
[Keep only the unique names, and for each name, collect only the unique strings from all of its duplicate entries.]
('Alex', ['String1', 'String34', 'String64', 'String12', 'String5', 'String77', 'String3'])
('Piper', ['String5', 'String64', 'String12', 'String11', 'String41', 'String34'])
('Nicky', ['String3', 'String21', 'String42', 'String51', 'String11', 'String77', 'String12', 'String34'])
('Linda', ['String14', 'String3', 'String77', 'String51'])
('Suzzane', ['String11', 'String36', 'String16', 'String12', 'String51', 'String3'])
You can load the data into a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(data=[('Alex', ['String1', 'String34']),
                        ('Alex', ['String64', 'String34', 'String12', 'String5']),
                        ('Nicky', ['String11', 'String51']),
                        ('Nicky', ['String77', 'String12', 'String34'])])
df = df.rename(columns={0:'name', 1:'strings'})
Then create a function to concatenate the lists in a pandas column:
def concatenate(strings):
    strings_agg = []
    for string in strings:
        strings_agg.extend(string)
    return strings_agg
Finally, apply the function to the grouped column:
df.groupby('name').apply(lambda x: concatenate(x['strings'])).to_frame()
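Note that this only concatenates the lists; it does not remove duplicate strings within a group. Here is a minimal sketch of one way to deduplicate while preserving first-seen order (the dict.fromkeys step is my addition, not part of the original answer):

import pandas as pd

df = pd.DataFrame(data=[('Alex', ['String1', 'String34']),
                        ('Alex', ['String64', 'String34', 'String12', 'String5'])],
                  columns=['name', 'strings'])

# Flatten each group's lists, then use dict.fromkeys to drop duplicates
# while keeping the order in which the strings first appeared.
result = df.groupby('name')['strings'].apply(
    lambda lists: list(dict.fromkeys(s for strings in lists for s in strings)))
print(result['Alex'])
# ['String1', 'String34', 'String64', 'String12', 'String5']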
import ast
import csv
import pandas as pd

# load the data from the txt file; it doesn't have to be a csv, read_csv can read a txt file too!
# "/n" never occurs in the data, so each whole line is read as a single field
df = pd.read_csv(r"D:\test\output.txt", sep="/n", header=None, names=["data"], engine='python')
# convert the text data to tuples and lists
df["data"] = df["data"].map(lambda x: ast.literal_eval(x))
# extract the name
df["surename"] = df["data"].map(lambda x: x[0])
# extract the list of strings
df["strings"] = df["data"].map(lambda x: x[1])
# create one row for each string in the list of strings
df = df.explode("strings")
# remove duplicate entries
df = df.drop_duplicates(subset=["surename", "strings"], keep="first")
# group the data by name to get a list of unique strings
# (unique because we removed the duplicates; the original order is kept)
df_result = df.groupby(["surename"]).aggregate({"strings": list}).reset_index()
# combine the extracted name and the deduplicated list of strings again
df_result["result"] = df_result.apply(lambda x: (x["surename"], x["strings"]), axis=1)
# output the data to a file of your choice
df_result[["result"]].to_csv(r"D:\test\result.txt", index=False, header=None, quoting=csv.QUOTE_NONE, escapechar='')
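To see what the intermediate steps produce, here is a minimal sketch of the explode → drop_duplicates → groupby pipeline on a made-up two-row frame (no file I/O):

import pandas as pd

df = pd.DataFrame({"surename": ["Alex", "Alex"],
                   "strings": [["String1", "String34"], ["String34", "String5"]]})

df = df.explode("strings")                               # one row per string
df = df.drop_duplicates(subset=["surename", "strings"])  # drop repeated (name, string) pairs
out = df.groupby("surename").aggregate({"strings": list})
print(out.loc["Alex", "strings"])
# ['String1', 'String34', 'String5']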
I agree that pandas is a great library, but this kind of thing can easily be done with Python's built-in packages alone.¹ You can simply use a Python defaultdict with sets, and use regex finditer for the parsing.
¹ This makes particular sense here since neither your input nor your output is any pandas data type (pd.Series, pd.DataFrame, ...) or even a standard .csv/tabular format.
Code
from collections import defaultdict
import re

dataset = defaultdict(set)

with open('output.txt') as f:
    for line in f:
        itr = re.finditer(r"'(\S+?)'", line)
        name = next(itr).groups()[0]
        strings = {x.groups()[0] for x in itr}
        dataset[name] |= strings

with open('results.txt', 'w') as f:
    for name, strings in dataset.items():
        print(f"('{name}', {list(strings)})", file=f)
Example output
('Alex', ['String1', 'String5', 'String77', 'String64', 'String34', 'String12', 'String3'])
('Piper', ['String5', 'String11', 'String64', 'String34', 'String12', 'String41'])
('Nicky', ['String21', 'String77', 'String34', 'String11', 'String51', 'String3', 'String12', 'String42'])
('Linda', ['String77', 'String14', 'String51', 'String3'])
('Suzzane', ['String11', 'String36', 'String12', 'String16', 'String51', 'String3'])
How the code works
- Read line by line. We can use a regular expression to capture any non-whitespace characters (\S) between two single quotes ('). The regex pattern is therefore '(\S+?)'. The plus sign + means match one or more characters, and the ? makes the match non-greedy (matching as few characters as possible), so we parse each individual string instead of one big string spanning the whole line.
- re.finditer matches multiple groups with the same pattern. It is used here instead of re.findall because re.findall creates a list whereas re.finditer creates an iterator. (A small optimization: no list is created, because none is needed.)
- Then we capture the name by calling next() on itr, which returns the first element of the iterator.
- Then groups() is called and the first item of its return value is taken. This is how you access groups captured with parentheses (()) in the pattern.
- Then, for the rest of the iterator itr, only the strings remain; from these we create Python sets, which guarantee unique elements. The syntax shown is a set comprehension.
- On the same line we save the resulting set into the dataset variable, which is a defaultdict. defaultdicts are nice because accessing a non-existing item automatically creates an entry of the default type; we used defaultdict(set) to have set as that default. The operation d[key] |= val is the same as d[key] = d[key] | val, and | creates a new set that is the union of the incoming set and whatever set may already be stored in dataset.
- The last part just writes the output line by line to results.txt. Converting strings to a list is optional, but it makes the output resemble the one in the question.
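As a quick sanity check of the parsing step, here is a standalone sketch run on one sample line from the question:

import re

# One sample line from the input file
line = "('Alex', ['String1', 'String34'])"

itr = re.finditer(r"'(\S+?)'", line)
print(next(itr).groups()[0])         # Alex -- the first match is the name
print({m.groups()[0] for m in itr})  # {'String1', 'String34'} -- the remaining matches, as a set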
data = []
a_dict = {}
unique = []

# Assuming the file is named a.txt here.
# After opening the file, eval turns each line's string into a Python object,
# so the list data will hold all the file's data; every element is a tuple.
# (ast.literal_eval would be a safer alternative to eval for untrusted input.)
with open('a.txt', 'r') as file:
    for i in file.readlines():
        a = eval(i)
        data.append(a)

# Collect all the unique names in a list.
for i in data:
    if i[0] not in unique:
        unique.append(i[0])

# Iterate over all the unique names, then over the full data set:
# whenever a name matches, extend a_list with that entry's strings.
# When the inner loop finishes, store the deduplicated list in a_dict
# under that name; a_list is reset for the next unique name.
for i in unique:
    a_list = []
    for j in data:
        if i == j[0]:
            a_list.extend(j[1])
    # dict.fromkeys removes duplicates while preserving insertion order
    a_dict[i] = list(dict.fromkeys(a_list))

# Print out the data in the requested format.
for i, j in a_dict.items():
    print("('{}', {})".format(i, j))