Checking the type of relationship between columns in python/pandas? (one-to-one, one-to-many, or many-to-many)
Let's say I have 5 columns.
pd.DataFrame({
'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})
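Is there a function to determine the type of relationship each pair of columns has? (one-to-one, one-to-many, many-to-one, many-to-many)
An output like: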
Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column3 many-to-many
...
Column4 Column5 one-to-many
This may not be a perfect answer, but it should work with some further modification:
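import numpy as np
# Distinct values per column, as a NumPy array so that [:, None]
# broadcasting works (recent pandas rejects it on a Series)
a = df.nunique().to_numpy()
is9, is1 = a == 9, a == 1
one_one = is9[:, None] & is9
one_many = is1[:, None]
many_one = is1[None, :]
many_many = (~is9[:, None]) & (~is9)
pd.DataFrame(np.select([one_one, one_many, many_one],
                       ['one-to-one', 'one-to-many', 'many-to-one'],
                       'many-to-many'),
             df.columns, df.columns)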
Output:
Column1 Column2 Column3 Column4 Column5
Column1 one-to-one many-to-many many-to-many one-to-one many-to-one
Column2 many-to-many many-to-many many-to-many many-to-many many-to-one
Column3 many-to-many many-to-many many-to-many many-to-many many-to-one
Column4 one-to-one many-to-many many-to-many one-to-one many-to-one
Column5 one-to-many one-to-many one-to-many one-to-many one-to-many
This should work for you:
import pandas as pd
from itertools import product

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

def get_relation(df, col1, col2):
    # Largest number of rows sharing one value of col1 (resp. col2)
    first_max = df[[col1, col2]].groupby(col1).count().max().iloc[0]
    second_max = df[[col1, col2]].groupby(col2).count().max().iloc[0]
    if first_max == 1:
        if second_max == 1:
            return 'one-to-one'
        else:
            return 'one-to-many'
    else:
        if second_max == 1:
            return 'many-to-one'
        else:
            return 'many-to-many'

for col_i, col_j in product(df.columns, df.columns):
    if col_i == col_j:
        continue
    print(col_i, col_j, get_relation(df, col_i, col_j))
Output:
Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column1 many-to-one
Column2 Column3 many-to-many
Column2 Column4 many-to-one
Column2 Column5 many-to-many
Column3 Column1 many-to-one
Column3 Column2 many-to-many
Column3 Column4 many-to-one
Column3 Column5 many-to-many
Column4 Column1 one-to-one
Column4 Column2 one-to-many
Column4 Column3 one-to-many
Column4 Column5 one-to-many
Column5 Column1 many-to-one
Column5 Column2 many-to-many
Column5 Column3 many-to-many
Column5 Column4 many-to-one
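As a small aside on the loop above: the `product` call followed by the `col_i == col_j` check can probably be replaced by `itertools.permutations`, which yields ordered pairs of distinct columns directly. The sketch below uses a made-up three-column frame (`demo`, an assumption for illustration) and only compares the two ways of generating pairs.

```python
from itertools import permutations, product

import pandas as pd

# Hypothetical frame; only the column names matter here.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Pairs as generated in the answer: full product, skipping identical columns.
pairs_product = [(i, j) for i, j in product(demo.columns, demo.columns) if i != j]

# Same ordered pairs, without the explicit skip.
pairs_perm = list(permutations(demo.columns, 2))

print(pairs_perm)
```

Both lists contain the same ordered pairs in the same order, so either form can feed `get_relation`.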
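First we get all combinations of columns with itertools.product.
Then we use pd.merge with the validate argument, wrapped in try/except, to check which relationship passes the test.
Note that we leave out many_to_many, because that relationship is not checked. Quoting the documentation:
"many_to_many" or "m:m": allowed, but does not result in checks.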
from itertools import product
import pandas as pd

def check_cardinality(df):
    combinations_lst = list(product(df.columns, df.columns))
    relations = ['one_to_one', 'one_to_many', 'many_to_one']
    output = []
    for col1, col2 in combinations_lst:
        for relation in relations:
            try:
                pd.merge(df[[col1]], df[[col2]], left_on=col1, right_on=col2, validate=relation)
                output.append([col1, col2, relation])
            except pd.errors.MergeError:
                # validate rejected this relationship; try the next one
                continue
    return output

# Keep only the first (strictest) relationship that passed for each pair
cardinality = (pd.DataFrame(check_cardinality(df), columns=['first_column', 'second_column', 'cardinality'])
               .drop_duplicates(['first_column', 'second_column'])
               .reset_index(drop=True))
Output:
first_column second_column cardinality
0 Column1 Column1 one_to_one
1 Column1 Column2 one_to_many
2 Column1 Column3 one_to_many
3 Column1 Column4 one_to_one
4 Column1 Column5 one_to_many
5 Column2 Column1 many_to_one
6 Column2 Column4 many_to_one
7 Column3 Column1 many_to_one
8 Column3 Column4 many_to_one
9 Column4 Column1 one_to_one
10 Column4 Column2 one_to_many
11 Column4 Column3 one_to_many
12 Column4 Column4 one_to_one
13 Column4 Column5 one_to_many
14 Column5 Column1 many_to_one
15 Column5 Column4 many_to_one
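To make the validate check easier to follow in isolation, here is a minimal sketch on two tiny frames. The names `left`, `right`, and `passes` are made up for illustration; the only pandas behaviour relied on is that `pd.merge` raises `pd.errors.MergeError` when the `validate` rule is violated.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3]})   # keys are unique
right = pd.DataFrame({"key": [1, 1, 2]})  # key 1 appears twice

def passes(rule):
    """Return True if merging left and right satisfies the validate rule."""
    try:
        pd.merge(left, right, on="key", validate=rule)
        return True
    except pd.errors.MergeError:
        return False

results = {rule: passes(rule) for rule in ["one_to_one", "one_to_many", "many_to_one"]}
print(results)
# {'one_to_one': False, 'one_to_many': True, 'many_to_one': False}
```

Only one_to_many passes, because the left keys are unique while the right keys are not.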
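I tried using Andrea's answer to investigate some huge CSV files, and almost everything came out as many-to-many, even columns I was sure were 1-1. The problem was duplicates.
Here is a slightly modified version, with a demo, formatted to match database terminology (plus a description to remove ambiguity).
First, a clearer example
A doctor writes many prescriptions, and each prescription can be for several drugs, but each drug is made by one producer and each producer makes only one drug.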
doctor prescription drug producer
0 Doctor Who 1 aspirin Bayer
1 Dr Welby 2 aspirin Bayer
2 Dr Oz 3 aspirin Bayer
3 Doctor Who 4 paracetamol Tylenol
4 Dr Welby 5 paracetamol Tylenol
5 Dr Oz 6 antibiotics Merck
6 Doctor Who 7 aspirin Bayer
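Correct results from the function below
Main changes from Andrea's version:
- drop_duplicates on each pair of columns, so that 1-1 is not reported as many-to-many
- I put the results in a dataframe (see report_df in the function) so they are easier to read
- I inverted the logic to match UML terminology (I'm not getting into the set vs UML debate; this is just how I wanted it)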
column 1 column 2 cardinality description
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 2 drugs, and drugs up to 3 d...
2 doctor producer many-to-many doctors had up to 2 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 3 doctors, and doctors up to 2...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer 1-to-1 1 drug has 1 producer and vice versa
9 producer doctor many-to-many producers had up to 3 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug 1-to-1 1 producer has 1 drug and vice versa
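Incorrect results without dropping duplicates
These are based on my modified copy of Andrea's algorithm, without removing duplicates.
You can see that the last row, producer to drug, is many-to-many when it should be 1-1. This explains my initial results (which were hard to debug with 1000 records).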
column 1 column 2 cardinality description
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 3 drugs, and drugs up to 4 d...
2 doctor producer many-to-many doctors had up to 3 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 4 doctors, and doctors up to 3...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer many-to-many drugs had up to 4 producers, and producers up ...
9 producer doctor many-to-many producers had up to 4 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug many-to-many producers had up to 4 drugs, and drugs up to 4...
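The effect of duplicates can be reproduced on a three-row frame. This is only a sketch with made-up data (`pairs` is not from the answer): it compares the per-group row count with and without `drop_duplicates`.

```python
import pandas as pd

# Hypothetical data: the (aspirin, Bayer) pair appears twice.
pairs = pd.DataFrame({
    "drug": ["aspirin", "aspirin", "paracetamol"],
    "producer": ["Bayer", "Bayer", "Tylenol"],
})

# Without de-duplication, the repeated row is counted twice for aspirin.
raw_max = pairs.groupby("drug")["producer"].count().max()

# After de-duplication, each drug is paired with exactly one producer.
dedup_max = pairs.drop_duplicates().groupby("drug")["producer"].count().max()

print(raw_max, dedup_max)  # 2 1
```

A maximum of 2 would be read as "many", while the de-duplicated maximum of 1 gives the expected 1-to-1 reading.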
New function
from itertools import product
import pandas as pd

def get_relation(df, col1, col2):
    # pair columns, drop duplicates (for proper 1-1), group by each column with
    # the count of entries from the other column associated with each group
    first_max = df[[col1, col2]].drop_duplicates().groupby(col1).count().max().iloc[0]
    second_max = df[[col1, col2]].drop_duplicates().groupby(col2).count().max().iloc[0]
    if first_max == 1:
        if second_max == 1:
            return '1-to-1', f'1 {col1} has 1 {col2} and vice versa'
        else:
            return 'many-to-1', f'many {col1}s (max {second_max}) to 1 {col2}'
    else:
        if second_max == 1:
            return '1-to-many', f'each {col1} has many {col2}s (some had {first_max})'
        else:
            return 'many-to-many', f'{col1}s had up to {first_max} {col2}s, and {col2}s up to {second_max} {col1}s'

def report_relations(df):
    report = []
    for col_i, col_j in product(df.columns, df.columns):
        if col_i == col_j:
            continue
        relation = get_relation(df, col_i, col_j)
        report.append([col_i, col_j, *relation])
    report_df = pd.DataFrame(report, columns=["column 1", "column 2", "cardinality", "description"])
    # formatting
    pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows', 1000)
    # comment one of these two out depending on where you're using it
    display(report_df)  # for jupyter
    print(report_df)  # SO

test_df = pd.DataFrame({
    'doctor': ['Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who'],
    'prescription': [1, 2, 3, 4, 5, 6, 7],
    'drug': ['aspirin', 'aspirin', 'aspirin', 'paracetamol', 'paracetamol', 'antibiotics', 'aspirin'],
    'producer': ['Bayer', 'Bayer', 'Bayer', 'Tylenol', 'Tylenol', 'Merck', 'Bayer']
})
display(test_df)
print(test_df)
report_relations(test_df)
Thanks Andrea - this helped me a lot.
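One possible variation, offered only as a sketch: counting distinct partners with `groupby(...).nunique()` should give the same maxima as `drop_duplicates()` followed by `count()`, at least on data without missing values. The helper name `max_partners` and the `demo` frame are assumptions made for this illustration.

```python
import pandas as pd

def max_partners(df, col1, col2):
    """Largest number of distinct col2 values per col1 value, and the reverse."""
    forward = df.groupby(col1)[col2].nunique().max()
    backward = df.groupby(col2)[col1].nunique().max()
    return int(forward), int(backward)

# Hypothetical data: Doctor Who wrote two prescriptions, Dr Welby one.
demo = pd.DataFrame({
    "doctor": ["Doctor Who", "Dr Welby", "Doctor Who"],
    "prescription": [1, 2, 3],
})

print(max_partners(demo, "doctor", "prescription"))  # (2, 1)
```

Here (2, 1) corresponds to the 1-to-many case in the function above: a doctor can have several prescriptions, but each prescription has one doctor.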