How to read Excel data by column name in Python using xlrd
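I am trying to read data from a large Excel file (nearly 100,000 rows).
I am using the xlrd module in Python to fetch the data from Excel.
I want to fetch data by column name (Cascade, Schedule Name, Market) instead of by column number (0, 1, 2), because my Excel columns are not fixed.
I know how to fetch data when the columns are fixed.
Here is my code for fetching data from fixed columns: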
import xlrd
file_location = r"C:\Users\Desktop\Vision.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
print(sheet.ncols, sheet.nrows, sheet.name, sheet.number)
for i in range(sheet.nrows):
    flag = 0
    for j in range(sheet.ncols):
        value = sheet.cell(i, j).value
If anyone has a solution for this, please let me know.
Thanks
Comment: still not working when the headers in fieldnames = ['Cascade', 'Market', 'Schedule', 'Name'] and Sheet(['Cascade', 'Schedule', 'Name', 'Market']) are equal.
Keeping the order of fieldnames in col_idx was not my initial goal.
Question: I want to fetch data by column name
The following OOP solution will work:
class OrderedByName():
    """
    Provides a generator method, to iterate in Column Name ordered sequence
    Provides subscription, to get columns index by name. using class[name]
    """
    def __init__(self, sheet, fieldnames, row=0):
        """
        Create an OrderedDict {name:index} from 'fieldnames'
        :param sheet: The Worksheet to use
        :param fieldnames: Ordered List of Column Names
        :param row: Default Row Index for the Header Row
        """
        from collections import OrderedDict
        self.columns = OrderedDict.fromkeys(fieldnames, None)
        for n in range(sheet.ncols):
            self.columns[sheet.cell(row, n).value] = n

    @property
    def ncols(self):
        """
        Generator, equal usage as range(xlrd.ncols),
        to iterate columns in ordered sequence
        :return: yield Column index
        """
        for idx in self.columns.values():
            yield idx

    def __getitem__(self, item):
        """
        Make class object subscriptable
        :param item: Column Name
        :return: Columns index
        """
        return self.columns[item]
Usage:
# Worksheet Data
sheet([['Schedule', 'Cascade', 'Market'],
       ['SF05UB0', 'DO Macro Upgrade', 'Upper Central Valley'],
       ['DE03HO0', 'DO Macro Upgrade', 'Toledo'],
       ['SF73XC4', 'DO Macro Upgrade', 'SF Bay']]
)
# Instantiate with Ordered List of Column Names
# NOTE the different Order of Column Names
by_name = OrderedByName(sheet, ['Cascade', 'Market', 'Schedule'])
# Iterate all Rows and all Columns Ordered as instantiated
for row in range(sheet.nrows):
    for col in by_name.ncols:
        value = sheet.cell(row, col).value
        print("cell({}).value == {}".format((row,col), value))
Output:
cell((0, 1)).value == Cascade
cell((0, 2)).value == Market
cell((0, 0)).value == Schedule
cell((1, 1)).value == DO Macro Upgrade
cell((1, 2)).value == Upper Central Valley
cell((1, 0)).value == SF05UB0
cell((2, 1)).value == DO Macro Upgrade
cell((2, 2)).value == Toledo
cell((2, 0)).value == DE03HO0
cell((3, 1)).value == DO Macro Upgrade
cell((3, 2)).value == SF Bay
cell((3, 0)).value == SF73XC4
Get Index of one Column by Name:
print("cell{}.value == {}".format((1, by_name['Schedule']),
                                  sheet.cell(1, by_name['Schedule']).value))
#>>> cell(1, 0).value == SF05UB0
Tested with Python: 3.5
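One thing the class above does not do is report requested names that are absent from the header row; such names simply stay mapped to None. A minimal sketch of a check that could run before instantiating it is shown here. The helper name `missing_columns` is hypothetical, and a plain list stands in for the header row (with xlrd it could be obtained from `sheet.row_values(0)`).

```python
def missing_columns(header, fieldnames):
    """Return the requested names that do not appear in the header row."""
    present = set(header)
    return [name for name in fieldnames if name not in present]


# Stand-in for the values of the header row of the worksheet
header = ["Schedule", "Cascade", "Market"]

all_found = missing_columns(header, ["Cascade", "Market", "Schedule"])
some_missing = missing_columns(header, ["Cascade", "Schedule Name"])

print(all_found)     # every requested name is present
print(some_missing)  # 'Schedule Name' is not a header in this sheet
```

Raising an error when the returned list is non-empty would surface a header mismatch early instead of failing later on a None column index.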
Alternatively, you can use pandas, which is a comprehensive data analysis library with built-in Excel I/O capabilities.
import pandas as pd
file_location =r"C:\Users\esatnir\Desktop\Sprint Vision.xlsx"
# Read out first sheet of excel file and return as pandas dataframe
df = pd.read_excel(file_location)
# Reduce dataframe to target columns (by filtering on column names)
df = df[['Cascade', 'Schedule Name', 'Market']]
A quick look at the resulting DataFrame df shows:
In [1]: df
Out[1]:
   Cascade     Schedule Name                Market
0  SF05UB0  DO Macro Upgrade  Upper Central Valley
1  DE03HO0  DO Macro Upgrade                Toledo
2  SF73XC4  DO Macro Upgrade                SF Bay
Your column names are in the first row of the spreadsheet, right? So read the first row and build a mapping from names to column indices.
column_pos = [ (sheet.cell(0, i).value, i) for i in range(sheet.ncols) ]
colidx = dict(column_pos)
Or as a one-liner:
colidx = dict( (sheet.cell(0, i).value, i) for i in range(sheet.ncols) )
Then you can use the mapping to translate column names into indices, for example:
print(sheet.cell(5, colidx["Schedule Name"]).value)
To get an entire column, you can use a list comprehension:
schedule = [ sheet.cell(i, colidx["Schedule Name"]).value for i in range(1, sheet.nrows) ]
If you really want to, you can create a wrapper for the cell function that handles the translation. But I think this is simple enough.
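For illustration, such a wrapper could look roughly like the sketch below. The names `ByName` and `FakeSheet` are hypothetical; `FakeSheet` is only a stand-in that mimics the parts of an xlrd sheet used here (`nrows`, `ncols`, and `cell(row, col).value`) so the example runs without an Excel file.

```python
from collections import namedtuple

# Stand-in for an xlrd sheet (assumption: only nrows, ncols and cell() are needed)
Cell = namedtuple("Cell", "value")


class FakeSheet:
    def __init__(self, rows):
        self._rows = rows
        self.nrows = len(rows)
        self.ncols = len(rows[0]) if rows else 0

    def cell(self, rowx, colx):
        return Cell(self._rows[rowx][colx])


class ByName:
    """Hypothetical wrapper: address cells by column name instead of index."""

    def __init__(self, sheet, header_row=0):
        self.sheet = sheet
        self.header_row = header_row
        # Same name -> index mapping as colidx, built from the header row
        self.colidx = {sheet.cell(header_row, i).value: i
                       for i in range(sheet.ncols)}

    def cell(self, rowx, name):
        # Translate the column name, then delegate to the real cell()
        return self.sheet.cell(rowx, self.colidx[name])

    def column(self, name):
        # All values of one column, skipping the header row
        col = self.colidx[name]
        return [self.sheet.cell(r, col).value
                for r in range(self.header_row + 1, self.sheet.nrows)]


sheet = FakeSheet([
    ["Cascade", "Schedule Name", "Market"],
    ["SF05UB0", "DO Macro Upgrade", "Upper Central Valley"],
    ["DE03HO0", "DO Macro Upgrade", "Toledo"],
])
by_name = ByName(sheet)
print(by_name.cell(1, "Market").value)
print(by_name.column("Cascade"))
```

With a real workbook, the object returned by `workbook.sheet_by_index(0)` would take the place of `FakeSheet`. An unknown column name raises a `KeyError`, which makes a changed header visible immediately.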
You can make use of pandas. Below is sample code for identifying the columns and rows in an Excel sheet.
import pandas as pd
file_location =r"Your_Excel_Path"
# Read out first sheet of excel file and return as pandas dataframe
df = pd.read_excel(file_location)
total_rows=len(df.axes[0])
total_cols=len(df.axes[1])
# Print total number of rows in an excel sheet
print("Number of Rows: "+str(total_rows))
# Print total number of columns in an excel sheet
print("Number of Columns: "+str(total_cols))
# Print column names in an excel sheet
print(df.columns.tolist())
Now, once you have the column data, you can convert it to a list of values.
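As a minimal sketch of that last step, the example below selects one column by name and converts it to a plain Python list. The DataFrame is built in memory as a stand-in for the result of `pd.read_excel`, so the column names and values here are illustrative only.

```python
import pandas as pd

# Stand-in for df = pd.read_excel(file_location)
df = pd.DataFrame({
    "Cascade": ["SF05UB0", "DE03HO0", "SF73XC4"],
    "Market": ["Upper Central Valley", "Toledo", "SF Bay"],
})

# Select the column by name, then convert the Series to a list
markets = df["Market"].tolist()
print(markets)
```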