How to find the list of columns with the same values in a dataframe in Python
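I am trying to find the list of columns in a dataframe that hold the same values. R has whichAreInDouble in a package for this, and I am trying to implement the same thing in Python.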
df =
a b c d e f g h i
1 2 3 4 1 2 3 4 5
2 3 4 5 2 3 4 5 6
3 4 5 6 3 4 5 6 7
It should give me the list of columns that have the same values, like:
a, e are equal
b,f are equal
c,g are equal
Let's try using combinations from itertools:
from itertools import combinations
[(i, j) for i,j in combinations(df, 2) if df[i].equals(df[j])]
Output:
[('a', 'e'), ('b', 'f'), ('c', 'g'), ('d', 'h')]
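For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the table in the question; the variable name equal_pairs is only an illustrative choice.

```python
from itertools import combinations

import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5],
    "d": [4, 5, 6], "e": [1, 2, 3], "f": [2, 3, 4],
    "g": [3, 4, 5], "h": [4, 5, 6], "i": [5, 6, 7],
})

# Compare every unordered pair of column names exactly once
equal_pairs = [
    (i, j) for i, j in combinations(df.columns, 2) if df[i].equals(df[j])
]
print(equal_pairs)
```

Note that this compares every pair of columns, so the work grows quadratically with the number of columns.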
The solution above works well. However, two columns can hold essentially the same values under different encodings. For example:
b c d e f
1 1 3 4 1 a
2 3 4 5 2 c
3 2 5 6 3 b
4 3 4 5 2 c
5 4 5 6 3 d
6 2 4 5 2 b
7 4 5 6 3 d
In the example above, column f has the same values as column b once both are label encoded. So how do you catch duplicate columns like this?
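To make "the same after label encoding" concrete, here is a small sketch that encodes just columns b and f from the table with pd.factorize, which assigns integer codes in order of first appearance. The variable names are illustrative.

```python
import pandas as pd

# Columns b and f copied from the example table
b = pd.Series([1, 3, 2, 3, 4, 2, 4], name="b")
f = pd.Series(["a", "c", "b", "c", "d", "b", "d"], name="f")

# factorize returns (codes, uniques); only the codes matter here
b_codes, _ = pd.factorize(b)
f_codes, _ = pd.factorize(f)

print(b_codes)
print(f_codes)

# True when both columns follow the same pattern of repeated values
same = (b_codes == f_codes).all()
print(same)
```

Both columns produce the codes 0, 1, 2, 1, 3, 2, 3, so they carry the same information even though the raw values differ.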
Here you go:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

# Create an empty dataframe with the same index as your dataframe (call it train_df);
# it will be filled with the factorized version of the original data.
train_enc = pd.DataFrame(index=train_df.index)

# Encode every feature as integer codes
for col in tqdm(train_df.columns):
    train_enc[col] = train_df[col].factorize()[0]

# Find duplicated columns
dup_cols = {}

# Start with one feature
for i, c1 in enumerate(tqdm(train_enc.columns)):
    # Compare it with all the remaining features
    for c2 in train_enc.columns[i + 1:]:
        # Record c2 as a duplicate of c1 if the encoded columns match
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

# dup_cols maps each duplicate column (key) to the column it is identical to (value)
print(dup_cols)
The names of the columns that match another column after encoding are printed to standard output.
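As a runnable variation on the same idea, the sketch below uses the example table and drops the progress bars. Instead of comparing pairs, it turns each column's factorized codes into a tuple and looks it up in a dictionary. This is a swapped-in technique, not the code above. The function name find_encoded_duplicates and the name first_seen are hypothetical.

```python
import pandas as pd


def find_encoded_duplicates(frame: pd.DataFrame) -> dict:
    """Map each duplicate column to the first column with the same factorized codes."""
    first_seen = {}  # code signature -> first column name with that signature
    dup_cols = {}
    for col in frame.columns:
        # Hashable signature of the column's label-encoded values
        signature = tuple(frame[col].factorize()[0])
        if signature in first_seen:
            dup_cols[col] = first_seen[signature]
        else:
            first_seen[signature] = col
    return dup_cols


# Example table from above (index 1..7, columns b..f)
train_df = pd.DataFrame(
    {
        "b": [1, 3, 2, 3, 4, 2, 4],
        "c": [3, 4, 5, 4, 5, 4, 5],
        "d": [4, 5, 6, 5, 6, 5, 6],
        "e": [1, 2, 3, 2, 3, 2, 3],
        "f": ["a", "c", "b", "c", "d", "b", "d"],
    },
    index=range(1, 8),
)

dup_cols = find_encoded_duplicates(train_df)
print(dup_cols)
```

On this table the result flags f as a duplicate of b, and also d and e as duplicates of c, because c, d and e share the same pattern of repeated values once encoded. Each column is encoded and hashed once, which may matter when there are many columns.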
If you want to drop the duplicate columns, you can do:
train_df.drop(columns=list(dup_cols), inplace=True)
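Alternatively, drop exact duplicates directly with itertools:
from itertools import combinations

# chk is assumed to be the dataframe to deduplicate
cols_to_remove = set()
for i, j in combinations(chk, 2):
    # Keep the first column of each equal pair and mark the later one for removal
    if chk[i].equals(chk[j]):
        cols_to_remove.add(j)
chk = chk.drop(columns=list(cols_to_remove))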
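A shorter way to drop exact duplicate columns is to transpose the frame and use DataFrame.duplicated, sketched below. This is a different technique from the loop above. It only catches columns with identical raw values, not columns that match after encoding. Reusing the question's table as chk and the name deduped are assumptions for illustration.

```python
import pandas as pd

# The question's example frame, used here as chk
chk = pd.DataFrame({
    "a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5],
    "d": [4, 5, 6], "e": [1, 2, 3], "f": [2, 3, 4],
    "g": [3, 4, 5], "h": [4, 5, 6], "i": [5, 6, 7],
})

# After transposing, each original column is a row, so duplicated() marks
# every column that repeats an earlier one; keep the unmarked columns.
deduped = chk.loc[:, ~chk.T.duplicated()]
print(list(deduped.columns))
```

Transposing a frame with mixed dtypes converts the values to object dtype, so on large or mixed data the pairwise loop may be the safer choice.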