Using Pandas and xlrd together. Ignoring absence/presence of column headers

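I hope you can help me. I'm sure this is probably a small thing for anyone who knows how to fix it.

In my workshop, neither my colleagues nor I can make 'find and replace all' changes through our database's front-end. The boss simply denies us that level of access. If we need to change dozens or hundreds of records, it all has to be done by copy-and-paste or similar. Crazy.

I am trying to work around this using Python 2, in particular libraries such as Pandas, pyautogui and xlrd.

I have studied several Stack Overflow threads and so far have managed to write some code that reads a given XL file well. In production, this will be a file exported from a data set found in the database GUI front-end, and it will be just a single column of 'Article Numbers' for the items in the computer workshop. This file will always have an Excel column header. E.g.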

ANR
51234
34567
12345
...

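All record numbers are five-digit numbers. We can also scan items with an infrared scanner into the 'Workflow' app on an iPad, which automatically generates an XL file from the list of scanned items.

The XL file in this case might look something like this.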

56788
12345
89012
...

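The difference is that there is no column header. All XL files have their data 'anchored' at cell A1 of "Sheet1", and again only a single column is used. No unnecessary complications here!

Anyway, here is the script. Once it is fully working, system arguments will be supplied to it. For now, let's assume we need to change records so that their 'RAM' value goes from "2GB" to "2 GB".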

import xlrd
import string
import re
import pandas as pd


field = "RAM"
value = "2 GB"

myFile = "/Users/me/folder/testArticles.xlsx"
df = pd.read_excel(myFile)
myRegex = "^[0-9]{5}$"


# data collection and putting into lists.
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings
# that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')


    # remove the decimal point
    # Searching for the header will cause a database front-end problem. 
    cellValue = cellValue[:-2]
    cellValue = cellValue.translate(None, string.letters)

    # making sure only valid article numbers get through
    # blank rows etc can take a hike
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)

# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)


# main code block
for i in deDuped:

    # lots going on here involving pyautogui
    # making sure of no error running searches, checking for warnings, moving/tabbing around DB front-end etc

    # if all goes to plan
    # removing that record number from the excel file and saving the change
    # so that if we run the script again for the same XL file
    # we don't needlessly update an already OK record again.

    df = df[~df['ANR'].astype(str).str.startswith(i)]
    df.to_excel(myFile, index=False)
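
The cleaning and de-duplication steps in the script above can also be written without the string-stripping tricks. What follows is only a minimal sketch, assuming (as the `[:-2]` step above implies) that xlrd hands back numeric cells as floats such as 51234.0. The function name `clean_article_numbers` and the sample `rows` are hypothetical and only stand in for the `data` list built above.

```python
import re

ARTICLE_RE = re.compile(r"^[0-9]{5}$")


def clean_article_numbers(rows):
    """Return the unique five-digit article numbers in first-seen order.

    `rows` is a list of rows, each a list of cell values, like the
    `data` list built from the sheet in the script above.
    """
    seen = set()
    cleaned = []
    for row in rows:
        if not row:
            continue
        cell = row[0]
        # Assumption: numeric cells arrive as whole floats (51234.0).
        if isinstance(cell, float) and cell.is_integer():
            cell = int(cell)
        text = str(cell).strip()
        # A header such as "ANR" or a blank cell fails the regex and is skipped.
        if ARTICLE_RE.match(text) and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


rows = [["ANR"], [51234.0], [34567.0], [""], [51234.0], ["12345"]]
print(clean_article_numbers(rows))  # ['51234', '34567', '12345']
```

Because the header is rejected by the same regex as any other non-article value, this part of the script already does not need to know whether a header is present.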

What I really want to know is how I can run the script so that it "doesn't care" whether a column header is present.

df = df[~df['ANR'].astype(str).str.startswith(i)]

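seems to be the line of code that all of this hangs on. I have made several changes to that line in different combinations, but my script always crashes.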
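
One possible way to make that line indifferent to the header is to address the column by position instead of by the name 'ANR'. The sketch below is an illustration under assumptions, not a tested fix for the script: the two in-memory frames stand in for `pd.read_excel(myFile, header=None)`, and `drop_article` is a hypothetical helper name.

```python
import pandas as pd


def drop_article(df, article):
    """Drop rows whose first column starts with `article`.

    The column is selected by position, so it does not matter
    what it is called or whether the sheet had a header row.
    """
    first_col = df.iloc[:, 0].astype(str)
    return df[~first_col.str.startswith(article)]


# Stand-ins for pd.read_excel(myFile, header=None):
# one sheet with a header cell in A1, one without.
with_header = pd.DataFrame([["ANR"], [51234], [34567], [12345]])
without_header = pd.DataFrame([[56788], [12345], [89012]])

print(drop_article(with_header, "12345").iloc[:, 0].tolist())
# ['ANR', 51234, 34567]
print(drop_article(without_header, "12345").iloc[:, 0].tolist())
# [56788, 89012]
```

With `header=None` a header cell, if present, is just another row, and it survives the filter because it never starts with an article number. Writing back with `df.to_excel(myFile, index=False, header=False)` would then keep the file in whichever layout it arrived in.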

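If the column header ("ANR" in my case) is essential for this particular 'pandas' method, is there a straightforward way to insert a column header into an XL file that lacks one in the first place, i.e. the XL files coming from the infrared scanner and the 'Workflow' app on the iPad?

Thanks, everyone!

UPDATE

I have tried implementing some code, as suggested by Patrick, to check whether cell "A1" has a header. Partial success. I can put "ANR" in cell A1 if it is missing, but I lose whatever was there in the first place.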

import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy
import openpyxl

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]



cell_a1 = sheet.cell_value(rowx=0, colx=0)

if cell_a1 == "ANR":
    print "has header"
else:
    wb = openpyxl.load_workbook(filename= myFile)
    ws = wb['Sheet1']
    ws['A1'] = "ANR"
    wb.save(myFile)
    #re-open XL file again etc etc.
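
Assigning to `ws['A1']` overwrites the first article number, which is why the original value is lost. A minimal sketch of one alternative, assuming an openpyxl version that provides `Worksheet.insert_rows`: shift the existing rows down first, then write the header. The helper name `ensure_header` is hypothetical, and the in-memory workbook stands in for `openpyxl.load_workbook(myFile)['Sheet1']`.

```python
from openpyxl import Workbook


def ensure_header(ws, header="ANR"):
    """Insert `header` as a new first row unless A1 already holds it."""
    if ws["A1"].value != header:
        ws.insert_rows(1)  # shifts every existing row down by one
        ws["A1"] = header
    return ws


# In-memory stand-in for a scanner export with no header row.
wb = Workbook()
ws = wb.active
for number in (56788, 12345, 89012):
    ws.append([number])

ensure_header(ws)
print([cell.value for cell in ws["A"]])
# ['ANR', 56788, 12345, 89012]
```

Calling `ensure_header` a second time leaves the sheet unchanged, so it should be safe to run on files that already have the header. In the real script the workbook would still need `wb.save(myFile)` afterwards.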

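I found this new block of code at "writing to existing workbook using xlwt". In that case, the contributor actually used openpyxl.

I think I've fixed it myself.

It is still a bit messy, but it seems to be working. I added an 'if/else' clause to check the value of cell A1 and take action accordingly. I found most of the code elsewhere, following the suggestion to use openpyxl.
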
import pyperclip
import xlrd
import pyautogui
import string
import re
import os
import pandas as pd
import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy


field = "RAM"
value = "2 GB"
myFile = "/Users/me/testSerials.xlsx"
df = pd.read_excel(myFile)


myRegex = "^[0-9]{5}$"

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

cell_a1 = sheet.cell_value(rowx=0, colx=0)

if cell_a1 == "ANR":
    print "has header"
else:
    headers = ['ANR']
    workbook_name = myFile  # the path variable, not the literal string 'myFile'
    wb = Workbook()
    page = wb.active
    # page.title = 'companies'
    page.append(headers)  # write the headers to the first line

    workbook = xlrd.open_workbook(workbook_name)
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

    for records in data:
        page.append(records)

    # save once, after all records have been appended
    wb.save(filename=workbook_name)

    # then load the data all over again, this time with inserted header
    workbook = xlrd.open_workbook(myFile)
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]


formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')

    # remove the decimal point
    cellValue = cellValue[:-2]
    # cellValue = cellValue.translate(None, ".0")
    cellValue = cellValue.translate(None, string.letters)

    # making sure any valid ANRs get through
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)
# ------------------------------------------

# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)


# ref - 
df = pd.read_excel(myFile)

print df


for i in deDuped:
    # pyautogui code is run here...

    # if all goes to plan update the XL file
    df = df[~df['ANR'].astype(str).str.startswith(i)]

    df.to_excel(myFile, index=False)