Python: validate file headers against their metadata
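I have a file named metadata.txt that contains the metadata for all my xlsx files, including the sheet names and column header information.
I need to run some validations that compare metadata.txt against the xlsx files and throw an exception on failure. (The validations are listed below.)
I have about 30 xlsx files with different sheets (I have provided a few of them as examples).
I am new to Python and am looking for suggestions/sample code on how to implement this.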
Validations:
Check metadata.txt and compare it with emp.xlsx, dept.xlsx and locations.xlsx
(basically I need to loop over the filenames and sheet names from metadata.txt, with
directory path C://Files).
If there is a mismatch in the header (i.e. col_header in the metadata vs. the header of the
xlsx; example: dept.xlsx, where description does not match dept_name),
then throw an error.
If duplicates are found in the column header
(example: locations.xlsx, where loc_name is repeated twice compared with
metadata.txt), throw an error.
metadata.txt
filename:sheet_name:col_header
emp.xlsx:emp_details:emp_id,sal,dept_id,hiredate
dept.xlsx:dept_details:dept_id,dept_name,created_date
locations.xlsx:loc_details:loc_id,loc_name,created_date
emp.xlsx (sheet name: emp_details)
emp_id,sal,dept_id,hiredate
1,2000,10,10-jan-2018
2,4000,20,12-jan-2018
3,5000,30,13-jan-2018
dept.xlsx (sheet name: dept_details)
dept_id,description,created_date
10,HR,10-apr-2018
20,IT,20-may-2018
30,MED,12-jun-2018
locations.xlsx (sheet name: loc_details)
loc_id,loc_name,created_date,loc_name
100,BAN,10-jan-17,BAN
200,CHE,20-jan-17,CHE
Print my results to a new file:
File_name,count,systemdate,validationstatus
emp.xlsx,3,27-jan-19,success
dept.xlsx,3,27-jan-19,failed
locations.xlsx,3,27-jan-19,failed
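There are many ways to solve this; below is one of them. From your example, the tab (sheet) names do not appear to matter, so I did not include them in the analysis. Extra whitespace can also cause problems, so I strip it here as well. Enjoy, and let me know if anything is unclear. Hope this helps: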
import pandas as pd
from os import chdir
from datetime import datetime

# point to the folder where the data is (raw string so the backslash is kept literally)
chdir(r".\Data")

emp = pd.read_excel("emp.xlsx")
dept = pd.read_excel("dept.xlsx")
locations = pd.read_excel("locations.xlsx")

# build a dictionary where keys are the file names and values are
# [list of column names, number of rows]
dictDFColNames = {
    "emp.xlsx": [list(emp.columns), emp.shape[0]],
    "dept.xlsx": [list(dept.columns), dept.shape[0]],
    "locations.xlsx": [list(locations.columns), locations.shape[0]],
}
# strip stray whitespace from the column names
dictDFColNames = {k: [[str(c).strip() for c in v[0]], v[1]] for k, v in dictDFColNames.items()}

# dictionary of the metadata: file name -> expected column names
dictColNames = dict()
with open("metadata.txt") as f:
    next(f)  # skip the "filename:sheet_name:col_header" line
    for line in f:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key = line.split(":")[0]
        values = line.split(":")[-1].split(",")
        values = [c.strip() for c in values]
        dictColNames[key] = values

today = datetime.today().strftime("%Y-%m-%d")
with open("validation.csv", "w") as f:
    f.write("File_name,count,systemdate,validationstatus\n")
    for k, v in dictDFColNames.items():
        # pandas renames a repeated column to "name.1", so drop the suffix
        # before checking for duplicates
        col_names = [x.split(".")[0] for x in v[0]]
        if len(col_names) > len(set(col_names)):
            status = "failed"  # duplicate column names
        elif set(v[0]) == set(dictColNames[k]):
            status = "success"
        else:
            status = "failed"  # header does not match the metadata
        f.write(k + "," + str(v[1]) + "," + today + "," + status + "\n")
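Since the question mentions about 30 files and asks for an error to be thrown, the same checks can also be driven entirely from metadata.txt instead of hard-coding each file. The following is only a minimal sketch of that idea, not part of the answer above: `HeaderValidationError`, `parse_metadata` and `validate_header` are hypothetical names introduced for illustration, the header rows are typed in by hand from the samples in the question rather than read from the xlsx files, and the comparison here is order-sensitive (the code above compares sets, which ignores column order).

```python
from collections import Counter


class HeaderValidationError(Exception):
    """Hypothetical exception type for a header that fails validation."""


def parse_metadata(lines):
    """Map each file name to (sheet_name, expected_columns).

    `lines` is a list of text lines; the first one is the
    "filename:sheet_name:col_header" title row and is skipped.
    """
    expected = {}
    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        filename, sheet, cols = (part.strip() for part in line.split(":", 2))
        expected[filename] = (sheet, [c.strip() for c in cols.split(",")])
    return expected


def validate_header(filename, expected_cols, actual_cols):
    """Raise HeaderValidationError on duplicate or mismatched column names."""
    actual = [str(c).strip() for c in actual_cols]
    dupes = sorted(c for c, n in Counter(actual).items() if n > 1)
    if dupes:
        raise HeaderValidationError(f"{filename}: duplicate columns {dupes}")
    if actual != expected_cols:  # order-sensitive comparison (an assumption)
        raise HeaderValidationError(
            f"{filename}: expected {expected_cols}, found {actual}"
        )


# Contents of metadata.txt as shown in the question.
metadata_lines = [
    "filename:sheet_name:col_header",
    "emp.xlsx:emp_details:emp_id,sal,dept_id,hiredate",
    "dept.xlsx:dept_details:dept_id,dept_name,created_date",
    "locations.xlsx:loc_details:loc_id,loc_name,created_date",
]

# Header rows copied by hand from the sample sheets in the question.
actual_headers = {
    "emp.xlsx": ["emp_id", "sal", "dept_id", "hiredate"],
    "dept.xlsx": ["dept_id", "description", "created_date"],
    "locations.xlsx": ["loc_id", "loc_name", "created_date", "loc_name"],
}

expected = parse_metadata(metadata_lines)
status = {}
for filename, (sheet, cols) in expected.items():
    try:
        validate_header(filename, cols, actual_headers[filename])
        status[filename] = "success"
    except HeaderValidationError as err:
        status[filename] = "failed"
        print(err)
```

In a real run, `metadata_lines` could come from `open("metadata.txt").read().splitlines()`, and each header row could probably be read with something like `pd.read_excel(path, sheet_name=sheet, header=None, nrows=1).iloc[0].tolist()`, which returns the first row as plain values so a repeated name is not renamed to `loc_name.1`. Catching the exception per file, as in the loop, keeps the report going; letting it propagate instead would stop at the first bad file.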