Changing column names based on several possibilities of what the original column names could be (pandas)

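I'm building a streamlit app where users can upload a csv or excel file and run some analysis on it.

Naturally, different users will name their columns in different ways.

I'd like the program to be smart enough to infer which column names are likely to represent the fields the analysis needs.

For example, I'd like the program to be able to read both of these dataframes: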

df1 = pd.DataFrame({'id_number':[1,2,3], 'reason_code':['TH7','JK9','PI2'], 'reason_code_description':['A','B','C'], 'name':['karen','pluto','imogen']})

df2 = pd.DataFrame({'Number (ID)':[1,2,3], 'Reason of Code':['TH7','JK9','PI2'], 'Description of Reason Code': ['A','B','C'], 'Name of User':['karen','pluto','imogen']})

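The program should understand that a column name containing the words "ID" and "Number" is the id_number column; that a column name containing "Reason" and "Code" but not "Description" is the reason_code column; and so on.

I think the best way to do this is to use str.contains to detect certain substrings and then rename the matching columns to the names the rest of the program expects.

Here is an example of what I tried (it has no effect, but it doesn't raise an error either):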

df2.columns[(df2.columns.str.contains("reason", case=False)) & (df2.columns.str.contains("code", case=False)) & (~df2.columns.str.contains("description", case=False))].rename("reason_code",inplace=True)

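Thanks in advance.

Here is one approach: write a function that maps each column name to its standard name, then apply it to every column.

Code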

import pandas as pd


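def rename_colname(name):
    """
    Map a raw column name to its standard name.

    Matching is a case-insensitive substring test. Names that match
    no rule are returned unchanged.
    """
    cname = name.lower()
    # reason + code without description -> the code column itself
    if 'reason' in cname and 'code' in cname and 'description' not in cname:
        return 'reason_code'
    elif 'number' in cname and 'id' in cname:
        return 'id_number'
    # reason + code with description -> the description column
    elif 'reason' in cname and 'code' in cname and 'description' in cname:
        return 'reason_code_description'
    elif 'name' in cname and 'user' in cname:
        return 'name'

    # No rule matched: keep the original name.
    return name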


df1 = pd.DataFrame({'id_number':[1,2,3], 'reason_code':['TH7','JK9','PI2'],
                    'reason_code_description':['A','B','C'],
                    'name':['karen','pluto','imogen']})

df2 = pd.DataFrame({'Number (ID)':[1,2,3], 'Reason of Code':['TH7','JK9','PI2'],
                   'Description of Reason Code': ['A','B','C'],
                   'Name of User':['karen','pluto','imogen']})

print(f'ideal colname frame:\n{df1}')

print(f'old colname frame:\n{df2}')

df2.columns = [rename_colname(name) for name in df2.columns]
print(f'new colname frame:\n{df2}')

Output

ideal colname frame:
   id_number reason_code reason_code_description    name
0          1         TH7                       A   karen
1          2         JK9                       B   pluto
2          3         PI2                       C  imogen
old colname frame:
   Number (ID) Reason of Code Description of Reason Code Name of User
0            1            TH7                          A        karen
1            2            JK9                          B        pluto
2            3            PI2                          C       imogen
new colname frame:
   id_number reason_code reason_code_description    name
0          1         TH7                       A   karen
1          2         JK9                       B   pluto
2          3         PI2                       C  imogen
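
As a side note on why the attempt in the question silently does nothing: indexing df2.columns with a boolean mask produces a new Index object, and Index.rename changes that Index's name attribute rather than its labels, so df2 itself is never touched. If you would rather stay with str.contains, one possible sketch is to build a mask per rule, collect an old-to-new mapping, and pass it to DataFrame.rename. The RULES table and the standardize_columns helper below are hypothetical names introduced for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical rule table (an assumption for this sketch):
# target name -> (substrings that must appear, substrings that must not appear)
RULES = {
    "id_number": (["id", "number"], []),
    "reason_code": (["reason", "code"], ["description"]),
    "reason_code_description": (["reason", "code", "description"], []),
    "name": (["name", "user"], []),
}


def standardize_columns(df):
    """Return a copy of df with recognised columns renamed to the target names."""
    cols = df.columns.astype(str)  # guard against non-string labels
    mapping = {}
    for target, (required, forbidden) in RULES.items():
        mask = np.ones(len(cols), dtype=bool)
        for word in required:
            hit = cols.str.contains(word, case=False, regex=False)
            mask = mask & np.asarray(hit, dtype=bool)
        for word in forbidden:
            hit = cols.str.contains(word, case=False, regex=False)
            mask = mask & ~np.asarray(hit, dtype=bool)
        for old in df.columns[mask]:
            # The first rule that claims a column wins.
            mapping.setdefault(old, target)
    return df.rename(columns=mapping)


df2 = pd.DataFrame({'Number (ID)': [1, 2, 3],
                    'Reason of Code': ['TH7', 'JK9', 'PI2'],
                    'Description of Reason Code': ['A', 'B', 'C'],
                    'Name of User': ['karen', 'pluto', 'imogen']})

renamed = standardize_columns(df2)
print(list(renamed.columns))
```

Keeping the rules in a table means a new column only needs a new entry, not another elif branch. Both this sketch and the function above match plain substrings, so 'id' would also match inside a name such as 'Valid Number'; splitting names into whole words first may be safer for real uploads.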