Insert values into specific columns of a pandas DataFrame
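I am using Python and pandas to analyze a large data set. I have several arrays of different lengths, and I need to insert their values into specific columns. If a column has no value for a given row, the cell should contain 'Not defined'. Each input record looks like a row whose values land in different positions of the data frame.
Expected output:
Example input data: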
# Example 1
{'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
# Example 2
{'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}
# Example 3
{'Water Solubility': '1E+006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C', 'logP': '1.1', 'pKa': '6.78'}
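I tried adding 'Not defined' to the arrays, but I could not find the right way to do it.
I think the best approach is to create a data frame for each dictionary and then concatenate the data frames.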
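import pandas as pd

d_1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
d_2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}
# One single-row frame per dictionary; the keys become the column names.
df_1 = pd.DataFrame([d_1], columns=d_1.keys())
df_2 = pd.DataFrame([d_2], columns=d_2.keys())
# concat aligns on column names; cells missing from a dictionary come out as NaN,
# which fillna then replaces with the 'Not defined' marker the question asks for.
final_df = pd.concat([df_1, df_2]).fillna('Not defined')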
You could build a function that accepts a list of dictionaries and returns the final data frame.
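A minimal sketch of such a function, under some assumptions: the name `records_to_frame` and its `columns` and `fill` parameters are made up for illustration and do not come from the answers here. It relies on `pd.DataFrame` aligning a list of dictionaries by key, on `reindex` to force a fixed column set, and on `fillna` to write the marker into empty cells.

```python
import pandas as pd


def records_to_frame(records, columns=None, fill="Not defined"):
    """Build one DataFrame from a list of dicts, filling missing cells.

    Hypothetical helper: `columns` optionally fixes the final column set
    (columns not listed are dropped, listed-but-absent ones are added).
    """
    frame = pd.DataFrame(records)  # keys are aligned into columns, gaps become NaN
    if columns is not None:
        frame = frame.reindex(columns=columns)
    return frame.fillna(fill)


records = [
    {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'},
    {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529'},
]

# Keep every column that appears in any record.
full = records_to_frame(records)

# Or force a fixed layout, including a column ('pKa') that no record provides.
result = records_to_frame(records, columns=['Melting Point', 'logP', 'pKa'])
print(result.to_string())
```

Passing `columns` is useful when the target table has a known layout, as in the question. Leaving it out keeps the union of all keys in order of first appearance.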
This should do what you want:
import pandas as pd
# Example 1
ex1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
# Example 2
ex2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}
# Example 3
ex3 = {'Water Solubility': '1E+006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C', 'logP': '1.1', 'pKa': '6.78'}
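# Existing frame that the new records are appended to (placeholder values).
df = pd.DataFrame({
# NOTE: 162-165 is unquoted, so Python evaluates it as integer subtraction (-3).
# Write '162-165' if the range is meant to be stored as text.
'Boiling Point':[162-165, 'Not defined'],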
'Hydrophobicity':[-0.5227, -0.427],
'Isoelectric Point':[9.02, 12.02],
'Melting Point':[1000.0, 'Not defined'],
'Molecular Formula':['C1970H3848N50O947S4', 'Not defined'],
'Molecular Weight':[9.23, 7.13],
'Radioactivity':['Practically insoluble', 'Not defined'],
'Water Solubility':[1.23, 2.87],
'caco2 Permeability':['63.6±55.0', 901],
'logP':[14, 14],
'logS':[0.618, 0.238],
'pKa':['Not defined', 'Not defined']
})
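# Append the three records; columns missing from a record are filled with NaN.
df = pd.concat([df, pd.DataFrame([ex1, ex2, ex3])], ignore_index=True)
# Cells mix text and numbers, so use object dtype throughout; newer pandas versions
# warn about or reject writing a string into a float64 column such as logS.
df = df.astype(object)
# In the three appended rows only, replace NaN with the 'Not defined' marker.
df.iloc[-3:] = df.iloc[-3:].apply(lambda x: ['Not defined' if pd.isna(v) else v for v in x])
print(df.to_string())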
Output:
   Boiling Point  Hydrophobicity  Isoelectric Point      Melting Point      Molecular Formula  Molecular Weight          Radioactivity        Water Solubility  caco2 Permeability         logP         logS          pKa
0             -3         -0.5227               9.02             1000.0    C1970H3848N50O947S4              9.23  Practically insoluble                    1.23           63.6±55.0           14        0.618  Not defined
1    Not defined          -0.427              12.02        Not defined            Not defined              7.13            Not defined                    2.87                 901           14        0.238  Not defined
2    Not defined     Not defined        Not defined         135-138 °C            Not defined       Not defined            Not defined               Insoluble         Not defined         4.68  Not defined  Not defined
3    Not defined          -0.529               7.89  71 °C (whole mAb)  C2224H3475N621O698S36           51234.9            Not defined             Not defined         Not defined  Not defined  Not defined  Not defined
4    Not defined     Not defined        Not defined         204-205 °C            Not defined       Not defined            Not defined  1E+006 mg/L (at 25 °C)         Not defined          1.1  Not defined         6.78