在数据框中查找值

Looking up values in dataframe

我觉得基本的数据框用途是查找 'key' 到 return 和 'value' 的等价物,但我已经搜索并尝试了很多天没有成功。所以我认为我没有尝试正确的事情,希望得到任何帮助。

我已经尝试过 .to_dict() 并且无法弄清楚如何将值塑造成我可以查找的内容。制作由 XML 制作的数据框的字典似乎也很低效。所以我又开始尝试 .loc[].

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import pandas as pd

#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Append rows, create and index the dataframe
thing_rows.append({"Thing Name": "thing 1 name",
                   "Thing ID": "thing_1_id"})
thing_rows.append({"Thing Name": "thing 2 name",
                   "Thing ID": "thing_2_id"})
thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
thing_df = thing_df.set_index(list(thing_df.keys())[0])
        
print(thing_df.loc["thing 1 name"])

输出:

Thing ID    thing_1_id
Name: thing 1 name, dtype: object

期望的输出:

thing_1_id

虽然上面只关注这一个问题,但下面是我正在尝试做的事情的大图,以防您看到一种更简单或更好的方法来从 XML.

最终期望输出:

,Collection item,RELATED-THING-IDs
0,name of Item 1,"thing_1_id, thing_2_id"

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import lxml.etree as Xet
import pandas as pd

#Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "RELATED-THING-IDs"]
coll_rows = []
#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for row in root:
    # Create thing lookup dataframe
    if (row.findtext('type') == "THING"):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
        thing_df = thing_df.set_index(list(thing_df.keys())[0])
    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue
    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    relat_thing_items = thing_df.loc[[row.xpath(
        "./RELATED-THING/result/row/name/text()")],["THING-ID"]]
    if len(relat_thing_items) > 0:
        relat_thing_id = ', '.join(relat_thing_items)
    else:
        relat_thing_id = ""

    coll_rows.append({"Collection item": name,
                 "RELATED-THING-IDs": relat_thing_id
})

coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')

XML:

<?xml version="1.0" encoding="UTF-8"?>

<result size="4321">
  <row>
    <type>CONTEXT</type>
    <name>collections</name>
  </row>
  <row>
    <type>COLLECTION-ITEM</type>
    <name>name of Item 1</name>
    <ITEM-ID>item_000001</ITEM-ID>
    <RELATED-THING>
      <result size="2">
        <row>
          <type>THING</type>
          <name>thing name 1</name>
          <no>1</no>
        </row>
        <row>
          <type>THING</type>
          <name>thing name 2</name>
          <no>1</no>
        </row>
      </result>
    </RELATED-THING>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 1</name>
    <THING-ID>thing_000783</THING-ID>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 2</name>
    <THING-ID>thing_000803</THING-ID>
  </row>
</result>

使用自定义索引创建数据框

简化版

# define columns
col = []
for i in range(10):
    col.append(f'col{i}')

# define rows
row = []
for i in range(20):
    row.append(f'row{i}')
    
# create dataframe
df = pd.DataFrame(index=row, columns=col)

# fill dataframe with datas
for i in range(len(col)):
    df[f'{col[i]}'] = np.random.rand(len(row))

# this is how it looks like
print(f'this is df with row as index\n{df}')

# define another column as index, [0] in the code line below means 'col0'
df = df.set_index(list(df.keys())[0])

# this is how it looks like
print(f'\nthis is df with col0 as index\n{df}')

输出:

this is df with row as index
           col0      col1      col2      col3      col4      col5      col6      col7      col8      col9
row0   0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
row1   0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
row2   0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
row3   0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
row4   0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
row5   0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
row6   0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
row7   0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
row8   0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
row9   0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
row10  0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
row11  0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
row12  0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
row13  0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
row14  0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
row15  0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
row16  0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
row17  0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
row18  0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
row19  0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

this is df with col0 as index
              col1      col2      col3      col4      col5      col6      col7      col8      col9
col0                                                                                              
0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

希望它能帮助您更多地了解 pandas DataFrame

--- 编辑 --- 在以 row 作为索引的 df 中:

print(df['col1']['row0']) 会给你 0.469264.

在 df 中以 'col0' 作为索引:

print(df['col1'][0.767971]) 会给你 0.469264.

所以基本上是df[column][row]。如果你的行名是字符串,就把它写成字符串,如果是浮点数或整型,就把它写成浮点数或整数。

--- 编辑 ---

如何查找数据框

我会尝试用另一个例子让它更清楚。

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', 'e'],
                   'col2': ['f', 'g', 'h', 'i', 'j'],
                   'col3': ['k', 'l', 'm', 'n', 'o'],
                   'col4': ['p', 'q', 'r', 's', 't'],
                   'col5': ['u', 'v', 'w', 'x', 'y']})
print('1. df is like this :\n', df)
print('-----------------------------------')
print("2. df['col1'][0] =", df['col1'][0])
print('-----------------------------------')
print("3. df['col1'] :\n", df['col1'])
print('-----------------------------------')
print("4. list(df['col1']) =", list(df['col1']))
print('-----------------------------------')
print("5. df.loc[0] :\n", df.loc[0])
print('-----------------------------------')
print("6. df.loc[[0]] :\n", df.loc[[0]])
print('-----------------------------------')
print("7. list(df.loc[0]) = ", list(df.loc[0]))
print('-----------------------------------')
print("8. df.iloc[3, 3] =", df.iloc[3, 3])
print('-----------------------------------')
print("9. df.iloc[3, :] :\n", df.iloc[3, :])
print('-----------------------------------')
print("10. list(df.iloc[3, :]) =", list(df.iloc[3, :]))
print('-----------------------------------')
print("Bonus : *list(df.iloc[3, :]) =", *list(df.iloc[3, :]))

这是输出:

1. df is like this :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
1    b    g    l    q    v
2    c    h    m    r    w
3    d    i    n    s    x
4    e    j    o    t    y
-----------------------------------
2. df['col1'][0] = a
-----------------------------------
3. df['col1'] :
 0    a
1    b
2    c
3    d
4    e
Name: col1, dtype: object
-----------------------------------
4. list(df['col1']) = ['a', 'b', 'c', 'd', 'e']
-----------------------------------
5. df.loc[0] :
 col1    a
col2    f
col3    k
col4    p
col5    u
Name: 0, dtype: object
-----------------------------------
6. df.loc[[0]] :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
-----------------------------------
7. list(df.loc[0]) =  ['a', 'f', 'k', 'p', 'u']
-----------------------------------
8. df.iloc[3, 3] = s
-----------------------------------
9. df.iloc[3, :] :
 col1    d
col2    i
col3    n
col4    s
col5    x
Name: 3, dtype: object
-----------------------------------
10. list(df.iloc[3, :]) = ['d', 'i', 'n', 's', 'x']
-----------------------------------
Bonus : *list(df.iloc[3, :]) = d i n s x

虽然肯定有更多方法可以查找数据框,但我上面显示的示例应该足以回答您的问题(如果我错了请纠正我)。

虽然我仍然想知道是否有更简单的方法来执行此操作,但这只是我正在寻找的单一值。

因此,解决方案是创建一个数据框并将其索引到键 0 上。这样就摆脱了自动数字行 ID(0、1、2 等)。然后使用 .loc 查找键 0 列中值(“事物 1 名称”)的行和 return 指定列(“事物 ID”)中的值。

print(thing_df.loc["thing 1 name","Thing ID"])