String search on dataframe using key/value from dict

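I am trying to match the strings in the 'Disease' column of the dataframe below against the keys of a dictionary. If a key occurs in the string, the value in the 'category' column should be changed to the dictionary value for that key.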

df=

Year  category  Pollutant  Disease           DiseaseCaseCount  Industry
2016  null      Pb         hypertension      1025              b_battery_ltd
2016  null      PM25       lung cancer       180               t_chemicals
2016  null      PM25       lung cancer       180               t_powerplant
2016  null      Cu         lung cancer       200               b_miners
2016  null      Cu         lung cancer       200               a_preservative_pvt
2016  null      PM25       acute bronchitis  367               t_chemicals
2016  null      PM25       acute bronchitis  367               t_powerplant

and the dictionary

my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis":3}

What I have tried so far is:

for x in my_dict:
    for row in df.itertuples(index=True, name='Pandas'):
        searchText = row.text
        #print(type(searchText))
        if (searchText.str.lower().str.contains(x).any()):
            row.class = my_dict[x]
        else:
             row.class = None
  
display(df)

It throws an error:

AttributeError: 'str' object has no attribute 'str'
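
The error arises because each field of a tuple produced by itertuples() is a plain Python str, while .str is an accessor that exists only on a pandas Series (or Index). A minimal sketch of the difference, using a small made-up Series named `diseases` that is not part of the question's data:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
diseases = pd.Series(["lung cancer", "hypertension"])

# On a Series, the .str accessor applies the string method element-wise
mask = diseases.str.lower().str.contains("cancer", regex=False)

# A single element is a plain Python str and has no .str attribute;
# use ordinary string methods and the `in` operator instead
single = diseases.iloc[0]
has_accessor = hasattr(single, "str")
found = "cancer" in single.lower()
```

So inside a row-by-row loop the check would be written with `in`, and the `.str` form is only appropriate when operating on the whole column at once.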

The final dataframe I am looking for is:

df=

+----+--------+---------+----------------+----------------+------------------+
|Year|category|Pollutant|         Disease|DiseaseCaseCount|          Industry|
+----+--------+---------+----------------+----------------+------------------+
|2016|    null|       Pb|    hypertension|            1025|     b_battery_ltd|
|2016|       2|     PM25|     lung cancer|             180|       t_chemicals|
|2016|       2|     PM25|     lung cancer|             180|      t_powerplant|
|2016|       2|       Cu|     lung cancer|             200|          b_miners|
|2016|       2|       Cu|     lung cancer|             200|a_preservative_pvt|
|2016|       3|     PM25|acute bronchitis|             367|       t_chemicals|
|2016|       3|     PM25|acute bronchitis|             367|      t_powerplant|
+----+--------+---------+----------------+----------------+------------------+

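Try using pandas apply(). It is usually more readable and concise. I am sure there is a more efficient way to do this with vectorized functions, but this approach is more intuitive.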

def change_class(row, my_dict={"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}):
    # Return the value of the first key that occurs as a substring of the disease name
    for key, value in my_dict.items():
        if key in row['Disease']:
            return value
    # No key matched: keep the existing category
    return row['category']

df['category'] = df.apply(change_class, axis=1)
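
For the vectorized route mentioned above, one possible sketch (an assumption on my part, not something tested against the full dataset) builds a single regex from the dictionary keys, extracts the matching key per row with str.extract, and maps it to its value. The three-row `df` here is an abbreviated stand-in for the question's data, and `pattern` and `matched_key` are names introduced only for this example:

```python
import re
import pandas as pd

# Abbreviated stand-in for the question's dataframe
df = pd.DataFrame({
    "Disease": ["hypertension", "lung cancer", "acute bronchitis"],
    "category": [None, None, None],
})
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}

# Build one alternation pattern from the escaped keys
pattern = "(" + "|".join(re.escape(k) for k in my_dict) + ")"

# extract() returns the first matching key per row (missing if none),
# and map() translates that key into its dictionary value
matched_key = df["Disease"].str.lower().str.extract(pattern, expand=False)
df["category"] = matched_key.map(my_dict)
```

Note that if a disease name contained more than one key, the regex would pick the leftmost match in the string, which may differ from the dictionary-order behaviour of the loop.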

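Here is an approach using a list comprehension. It iterates over the values in the Disease column and uses next with a generator expression to fetch the dictionary value of the first key found in each string, falling back to NaN when nothing matches: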

df['category'] = [next((v for k,v in my_dict.items() if k in x), float('nan')) for x in df['Disease'].tolist()]

Output:

   Year  category Pollutant           Disease  DiseaseCaseCount              Industry
0  2016       NaN        Pb      hypertension              1025         b_battery_ltd
1  2016       2.0      PM25       lung cancer               180           t_chemicals
2  2016       2.0      PM25       lung cancer               180          t_powerplant
3  2016       2.0        Cu       lung cancer               200              b_miners
4  2016       2.0        Cu       lung cancer               200    a_preservative_pvt
5  2016       3.0      PM25  acute bronchitis               367           t_chemicals
6  2016       3.0      PM25  acute bronchitis               367          t_powerplant
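
The category values print as 2.0 and 3.0 because NaN is a float, which makes the whole column float64. If whole numbers are preferred, one option is pandas' nullable integer dtype "Int64". A minimal sketch on a shortened copy of the data:

```python
import pandas as pd

# Abbreviated stand-in for the question's dataframe
df = pd.DataFrame({"Disease": ["hypertension", "lung cancer", "acute bronchitis"]})
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}

# Same list comprehension as in the answer; NaN forces the column to float64
df["category"] = [
    next((v for k, v in my_dict.items() if k in x), float("nan"))
    for x in df["Disease"].tolist()
]

# Cast to the nullable integer dtype so missing values become <NA>
df["category"] = df["category"].astype("Int64")
```

After the cast, unmatched rows show <NA> and matched rows show 2 and 3 rather than 2.0 and 3.0.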