String search on dataframe using key/value from dict
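I'm trying to match the strings in the 'Disease' column of the dataframe below against the keys of a dictionary. If a key occurs in the string, the 'category' column should be set to that key's value.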
df=
Year | category | Pollutant | Disease | DiseaseCaseCount | Industry |
---|---|---|---|---|---|
2016 | null | Pb | hypertension | 1025 | b_battery_ltd |
2016 | null | PM25 | lung cancer | 180 | t_chemicals |
2016 | null | PM25 | lung cancer | 180 | t_powerplant |
2016 | null | Cu | lung cancer | 200 | b_miners |
2016 | null | Cu | lung cancer | 200 | a_preservative_pvt |
2016 | null | PM25 | acute bronchitis | 367 | t_chemicals |
2016 | null | PM25 | acute bronchitis | 367 | t_powerplant |
and the dictionary
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis":3}
What I have tried so far is
for x in my_dict:
    for row in df.itertuples(index=True, name='Pandas'):
        searchText = row.text
        #print(type(searchText))
        if (searchText.str.lower().str.contains(x).any()):
            row.class = my_dict[x]
        else:
            row.class = None
display(df)
It throws an error:
AttributeError: 'str' object has no attribute 'str'
The final dataframe I'm looking for is
df=
+----+--------+---------+----------------+----------------+------------------+
|Year|category|Pollutant|         Disease|DiseaseCaseCount|          Industry|
+----+--------+---------+----------------+----------------+------------------+
|2016|    null|       Pb|    hypertension|            1025|     b_battery_ltd|
|2016|       2|     PM25|     lung cancer|             180|       t_chemicals|
|2016|       2|     PM25|     lung cancer|             180|      t_powerplant|
|2016|       2|       Cu|     lung cancer|             200|          b_miners|
|2016|       2|       Cu|     lung cancer|             200|a_preservative_pvt|
|2016|       3|     PM25|acute bronchitis|             367|       t_chemicals|
|2016|       3|     PM25|acute bronchitis|             367|      t_powerplant|
+----+--------+---------+----------------+----------------+------------------+
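Try using pandas apply(). It's generally more readable and concise. I'm sure there's a more efficient way to do this with vectorized functions, but this approach is more intuitive.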
def change_class(row, my_dict={"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}):
    # Return the value of the first key that occurs inside the disease name.
    for key, value in my_dict.items():
        if key in row['Disease'].lower():
            return value
    # No key matched: keep the existing category.
    return row['category']

df['category'] = df.apply(change_class, axis=1)
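The vectorized route mentioned above might look something like the following. This is only a sketch, not a benchmarked solution: it assumes the keys are plain substrings (they are escaped with `re.escape`), and the small sample frame and the names `pattern` and `matched` are made up for illustration.

```python
import re

import pandas as pd

# Illustrative sample with only the columns that matter here.
df = pd.DataFrame({
    "Disease": ["hypertension", "lung cancer", "lung cancer", "acute bronchitis"],
    "category": [None, None, None, None],
})
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}

# One alternation pattern built from the keys, wrapped in a single capture group.
pattern = "(" + "|".join(re.escape(key) for key in my_dict) + ")"

# str.extract yields the matched key per row (NaN when nothing matches);
# map then looks that key up in the dictionary.
matched = df["Disease"].str.lower().str.extract(pattern, expand=False)
df["category"] = matched.map(my_dict)
```

One caveat: when several keys occur in the same string, the regex picks the match that starts earliest in the text, which is not necessarily the first key in dictionary order.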
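Here's an approach using a list comprehension. It iterates over the values in the Disease column and uses next with a generator expression to fetch the dictionary value of the first matching key, falling back to NaN when nothing matches: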
df['category'] = [next((v for k,v in my_dict.items() if k in x), float('nan')) for x in df['Disease'].tolist()]
Output:
   Year  category  Pollutant           Disease  DiseaseCaseCount            Industry
0  2016       NaN         Pb      hypertension              1025       b_battery_ltd
1  2016       2.0       PM25       lung cancer               180         t_chemicals
2  2016       2.0       PM25       lung cancer               180        t_powerplant
3  2016       2.0         Cu       lung cancer               200            b_miners
4  2016       2.0         Cu       lung cancer               200  a_preservative_pvt
5  2016       3.0       PM25  acute bronchitis               367         t_chemicals
6  2016       3.0       PM25  acute bronchitis               367        t_powerplant
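The categories show up as 2.0 and 3.0 because the NaN in the first row forces the column to a float dtype. If whole numbers are wanted, as in the expected table in the question, one option is pandas' nullable integer dtype. A minimal sketch, assuming a pandas version that supports "Int64"; the `diseases` and `category` names are only for illustration:

```python
import pandas as pd

my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}
diseases = pd.Series(["hypertension", "lung cancer", "acute bronchitis"])

# Same lookup as the list comprehension above, applied to a small sample.
category = pd.Series(
    [next((v for k, v in my_dict.items() if k in x), float("nan")) for x in diseases]
)

# Convert float64 with NaN to nullable integers; missing values become <NA>.
category = category.astype("Int64")
print(category)
```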