Using Pandas and xlrd together. Ignoring absence/presence of column headers

Question

I hope you can help me. I suspect this is a small thing to fix for someone who knows how.

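In my workshop, neither I nor my colleagues can make 'find and replace all' changes through our database's front-end. The boss simply refuses us that level of access. If we need to change dozens or hundreds of records, it all has to be done by copy-and-paste or similar. Crazy.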

I am trying to work around this with Python 2, in particular with libraries such as Pandas, pyautogui and xlrd.

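I have researched several Stack Overflow threads and have so far managed to write some code that reads a given XL file well. In production, this will be a file exported from a data set found in the database GUI front-end, and it will contain just a single column of 'Article Numbers' for the items in the computer workshop. This file will always have an Excel column header. E.g.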

ANR
51234
34567
12345
...

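All record numbers are five digits long. We can also scan items with an infrared scanner into the 'Workflow' app on an iPad, which automatically generates an XL file from the list of scanned items.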

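The XL file in this case would look similar to the one above.

The difference is that there is no column header. All XL files have their data 'anchored' in cell A1 of "Sheet1", and again only a single column is used. No unnecessary complications here!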

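Anyway, here is the script. When it is fully working, system arguments will be supplied to it. For now, let's assume we need to change records so that their 'RAM' value goes from "2GB" to "2 GB".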

import xlrd
import string
import re
import pandas as pd


field = "RAM"
value = "2 GB"

myFile = "/Users/me/folder/testArticles.xlsx"
df = pd.read_excel(myFile)
myRegex = "^[0-9]{5}$"


# data collection and putting into lists.
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings
# that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')


    # remove the decimal point
    # Searching for the header will cause a database front-end problem. 
    cellValue = cellValue[:-2]
    cellValue = cellValue.translate(None, string.letters)

    # making sure only valid article numbers get through
    # blank rows etc can take a hike
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)

# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)


#main code block
for i in deDuped:

    #lots going on here involving pyautogui
    #making sure of no error running searches, checking for warnings, moving/tabbing around DB front-end etc

    #if all goes to plan
    #removing that record number from the excel file and saving the change
    #so that if we run the script again for the same XL file 
    #we don't needlessly update an already OK record again. 

        df = df[~df['ANR'].astype(str).str.startswith(i)]
        df.to_excel(myFile, index=False)

What I really want to know is how I can run the script so that it "doesn't care" whether a column header is present.

df = df[~df['ANR'].astype(str).str.startswith(i)]

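seems to be the line of code on which all of this hangs. I have made several changes to that line in different combinations, but my script always crashes.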

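If the column header ("ANR" in my case) is essential for this particular 'pandas' method, is there a straightforward way to insert a column header into an XL file that lacks one in the first place, i.e. the XL files that come from the infrared scanner and the 'Workflow' app on the iPad?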

Thanks, all!
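
One way to make the script indifferent to the header, sketched under some assumptions: read the sheet with `header=None` so that pandas treats every row as data, then let the five-digit check discard the header text along with blank rows. The helper names `to_article_number`, `extract_article_numbers` and `read_first_column` are hypothetical, and the sketch assumes the cells hold either numbers or plain ASCII text.

```python
import re

ARTICLE_RE = re.compile(r"^[0-9]{5}$")  # same pattern as myRegex in the script


def to_article_number(cell):
    """Return the cell as a five-digit string, or None if it is not one."""
    # Excel numbers usually arrive as floats (51234.0); drop the ".0" safely.
    if isinstance(cell, float) and cell.is_integer():
        cell = int(cell)
    text = str(cell).strip()
    return text if ARTICLE_RE.match(text) else None


def extract_article_numbers(cells):
    """Keep valid article numbers; a header such as 'ANR' fails the regex."""
    numbers = (to_article_number(cell) for cell in cells)
    return [number for number in numbers if number is not None]


def read_first_column(path):
    """Read column A as plain data, whether or not row 1 is a header."""
    import pandas as pd  # imported here so the helpers above work without pandas
    frame = pd.read_excel(path, header=None)  # header=None: row 1 is data too
    return frame.iloc[:, 0].tolist()
```

With this, `extract_article_numbers(read_first_column(myFile))` should give the same list for an exported file with an `ANR` header and for a scanner file without one.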

Update

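I have tried implementing some code, as suggested by Patrick, to check whether cell "A1" has a header. Partial success. I can put "ANR" in cell A1 if it is missing, but I lose whatever was there in the first place.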

import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy
import openpyxl

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]



cell_a1 = sheet.cell_value(rowx=0, colx=0)

if cell_a1 == "ANR":
    print "has header"
else:
    wb = openpyxl.load_workbook(filename= myFile)
    ws = wb['Sheet1']
    ws['A1'] = "ANR"
    wb.save(myFile)
    #re-open XL file again etc etc.

I found this new block of code at "writing to existing workbook using xlwt". In that case, the contributor actually used openpyxl.
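
Overwriting A1 is what loses the first article number. A possible alternative, sketched here, is to shift the existing rows down by one before writing the header. It assumes an openpyxl version that provides `Worksheet.insert_rows` and the sheet name "Sheet1" described above; `needs_header` and `ensure_header` are hypothetical helper names.

```python
def needs_header(first_cell, header="ANR"):
    """True when cell A1 holds something other than the expected header."""
    return first_cell != header


def ensure_header(path, header="ANR", sheet_name="Sheet1"):
    """Insert the header above the data instead of overwriting cell A1."""
    from openpyxl import load_workbook  # local import: only needed for file I/O
    wb = load_workbook(filename=path)
    ws = wb[sheet_name]
    if not needs_header(ws["A1"].value, header):
        return False  # header already present, file left untouched
    ws.insert_rows(1)  # push every existing row down by one
    ws["A1"] = header  # the new, empty first row now takes the header
    wb.save(path)
    return True
```

Calling `ensure_header(myFile)` before `pd.read_excel(myFile)` would then give the `df['ANR']` line a column to work with in both kinds of file.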

Answer 1

I think I fixed it myself.

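Still a bit messy, but it seems to work. I added an 'if/else' clause to check the value of cell A1 and act accordingly. Most of the code was found elsewhere, following the suggestion to use openpyxl.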

import pyperclip
import xlrd
import pyautogui
import string
import re
import os
import pandas as pd
import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy


field = "RAM"
value = "2 GB"
myFile = "/Users/me/testSerials.xlsx"
df = pd.read_excel(myFile)


myRegex = "^[0-9]{5}$"

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

cell_a1 = sheet.cell_value(rowx=0, colx=0)

if cell_a1 == "ANR":
    print "has header"
else:
    headers = ['ANR']
    workbook_name = myFile  # the path itself, not the literal string 'myFile'
    wb = Workbook()
    page = wb.active
    # page.title = 'companies'
    page.append(headers)  # write the headers to the first line

    workbook = xlrd.open_workbook(workbook_name)
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

    for records in data:
        page.append(records)

    # save once, after every row has been copied across
    wb.save(filename=workbook_name)

    # then load the data all over again, this time with inserted header
    workbook = xlrd.open_workbook(myFile)
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]


formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')

    # remove the decimal point
    cellValue = cellValue[:-2]
    # cellValue = cellValue.translate(None, ".0")
    cellValue = cellValue.translate(None, string.letters)

    # making sure any valid ANRs get through
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)
# ------------------------------------------

# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)


# ref - 
df = pd.read_excel(myFile)

print df


for i in deDuped:
    #pyautogui code is run here...

    #if all goes to plan update the XL file
        df = df[~df['ANR'].astype(str).str.startswith(i)]

        df.to_excel(myFile, index=False)
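
The final step could also be done without referring to the column by name. The sketch below keeps the pending numbers in a plain list and rewrites the file from that list, so the saved file always ends up with an `ANR` header. `mark_done` and `save_pending` are hypothetical names, and matching is exact rather than by prefix, which is an assumption about what the script intends.

```python
def mark_done(pending, article_number):
    """Return the pending list without the record that was just updated."""
    return [number for number in pending if number != article_number]


def save_pending(path, pending, header="ANR"):
    """Overwrite the workbook with the records still waiting to be updated."""
    import pandas as pd  # local import: only needed when writing the file
    # Stored as integers so the cells read back as numbers on the next run.
    frame = pd.DataFrame({header: [int(number) for number in pending]})
    frame.to_excel(path, index=False)
```

Inside the loop this might look like `pending = mark_done(pending, i)` followed by `save_pending(myFile, pending)`, with `pending` starting out as a copy of `deDuped`.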

python

xlrd

python-2.7

pandas