How do I remove 'NaN' values while reading a PDF using tabula-py in Python?
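I'm using tabula-py to read my class timetable PDF in Python, and the returned value 'data' contains a lot of 'NaN' values that I can't seem to clean up. Can someone suggest a solution?
Should I be using something other than tabula-py?
I've added a link to a picture of the PDF. For privacy, I have removed some information from the PDF.
My code is as follows: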
import tabula

class ClassTimetable:
    def __init__(self, filename):
        self.filename = filename

    def read_data(self):
        data = tabula.read_pdf(self.filename, pages='all')
        # data1 = tabula.convert_into(self.filename, output_format="csv", output_path='file.csv')
        print(data)
My output is as follows:
[ Course Course Regn. ... Unnamed: 2 Room
0 Code Title Credit Type ... GCR Code No.
1 Critical and NaN ... NaN NaN
2 1 18PDM202L Creative 0 ... A- wubaing
3 Thinking Skills NaN ... ISOLATED NaN
4 Management NaN ... NaN NaN
5 2 18PDH102T Principles for 2 ... A- NaN
6 Engineers NaN ... COMBINED NaN
7 Professional Lab3 18EEC206J Analog Electronics 4 ... B boc5om
8 Generation, NaN ... NaN NaN
9 4 18EEC208T Transmission & 3 NaN ... NaN NaN
10 Distribution NaN ... C 4qjaetp
11 Numerical NaN ... NaN NaN
12 5 18MAB202T Methods for Engineers 4 ... D vvbxlqp
13 6 18EEC205J Electrical Machines II 4 ... E drcfega
14 7 18BTB101T Biology 2 ... F NaN
15 Electrical and NaN ... NaN NaN
16 Electronics NaN ... NaN NaN
17 8 18EEC207J Measurements and 4 ... G koed72
18 Instrumentation NaN ... NaN NaN
19 9 18EEC205J Electrical Machines II 4 ... P7-P8- drcfega
20 NaN NaN ... NaN NaN
21 10 18EEC206J Analog Electronics 4 ... P3-P4- boc5om
22 Electrical and NaN ... NaN NaN
23 Electronics NaN ... NaN NaN
24 11 18EEC207J Measurements 4 ... NaN NaN
25 and NaN ... P19-P20- NaN
26 Instrumentation NaN ... NaN NaN
27 Total 23 NaN ... NaN NaN
[28 rows x 8 columns]]
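Also, what does '...' mean?
I figured it out.
I realized the problem was that the library was not reading the separations between rows correctly, so I set 'lattice=True'.
That solved about 50% of my problem, and I realized the program needed more specific instructions.
I downloaded Tabula for Windows and found the coordinates of the whole table as well as of the individual columns, then passed that data to tabula-py through its built-in 'area=' and 'columns=' options.
I realize that using both of these options together may be overkill, but after converting to .csv all of my data sat neatly in separate columns, without any 'NaN' values.
My code is attached below: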
import tabula

class ClassTimetable:
    def __init__(self, filename):
        self.filename = filename

    def read_data(self):
        # area = bounds of the whole table, columns = x coordinates of the
        # column boundaries; both were measured in the Tabula desktop app.
        data = tabula.read_pdf(self.filename, pages='all', area=[162.498,141.6,546.248,538.736],
                               columns=[140.55, 172.53, 217.161, 277.400, 300.454, 339.127, 384.492, 419.446,
                                        491.585, 542.157], lattice=True)
        # convert_into writes the CSV file and returns nothing, so its result is not stored.
        tabula.convert_into(self.filename, output_format="csv", area=[162.498,141.6,546.248,538.736],
                            columns=[140.559, 172.538, 217.161, 277.400, 300.454, 339.127, 384.492, 419.446,
                                     491.585, 542.157], lattice=True, output_path='file2.csv')
        return data
The output is as follows:
[ Unnamed: 0 Course\rTitle ... Slot GCR Code
0 1.0 18PDM202L ... Mr. R. Prathap\rChandran (102275) A-\rISOLATED
1 2.0 18PDH102T ... Mr. Nizamudeen\rAnvar (102293) A-\rCOMBINED
2 3.0 18EEC206J ... Dr.T.M.Thamizh\rThentral (101436) B
3 4.0 18EEC208T ... Dr.S.Vidyasagar\r(100597) C
4 5.0 18MAB202T ... Dr. M. Suresh\r(101984) D
5 6.0 18EEC205J ... Dr. K. M, Ravi\rEswar (102699) E
6 7.0 18BTB101T ... Mr.T.Anand\r(100034) F
7 8.0 18EEC207J ... Mr.S.Raghavendran\r(102704) G
8 9.0 18EEC205J ... Dr. K. M, Ravi\rEswar (102699) P7-P8-
9 10.0 18EEC206J ... Dr.T.M.Thamizh\rThentral (101436) P3-P4-
10 11.0 18EEC207J ... Mr.S.Raghavendran\r(102704) P19-P20-
11 NaN 23 ... NaN NaN
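The output above still has '\r' inside wrapped headers and cells, so a little pandas post-processing may help. The following is only a sketch under assumptions: `clean_table` is a hypothetical helper name, the sample frame is made up to imitate the output above, and column names are assumed to be unique.

```python
import numpy as np
import pandas as pd


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Tidy one DataFrame produced by a PDF table extractor (hypothetical helper)."""
    cleaned = df.copy()
    # Header cells that wrapped in the PDF contain "\r"; join the parts with a space.
    cleaned.columns = [str(c).replace("\r", " ") for c in cleaned.columns]
    # Do the same for every string cell; NaN and numeric cells are left alone.
    for col in cleaned.columns:
        cleaned[col] = cleaned[col].map(
            lambda v: v.replace("\r", " ") if isinstance(v, str) else v
        )
    # Drop rows, then columns, in which every value is NaN.
    cleaned = cleaned.dropna(how="all").dropna(axis=1, how="all")
    return cleaned.reset_index(drop=True)


# Made-up sample that imitates the shape of the output above.
sample = pd.DataFrame(
    {
        "Course\rTitle": ["18PDM202L", "18PDH102T", np.nan],
        "GCR Code": ["A-\rISOLATED", "A-\rCOMBINED", np.nan],
    }
)
cleaned = clean_table(sample)
print(cleaned)
```

Since `read_pdf` returns a list of DataFrames here (hence the outer square brackets in the printed output), the helper would be applied to each element, for example `[clean_table(df) for df in data]`. A row like the final total row is not entirely NaN, so `dropna(how="all")` keeps it; `dropna(subset=[...])` on a column that must be filled is one way to drop it.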
I still don't know what '...' means, though.
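As far as the '...' goes, it is most likely not part of the data: pandas shortens the printed form of a wide DataFrame and puts '...' where it has hidden columns. A minimal sketch with a made-up 30-column frame (the names `wide`, `truncated` and `full` are only for illustration):

```python
import pandas as pd

# Made-up wide table: one row, 30 columns named col0 .. col29.
wide = pd.DataFrame([range(30)], columns=[f"col{i}" for i in range(30)])

# With a small column limit, the middle columns are replaced by "...".
with pd.option_context("display.max_columns", 4, "display.width", 80):
    truncated = repr(wide)

# Lifting the column limit and widening the display prints every column.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full = repr(wide)

print(truncated)
print(full)
```

To change this for a whole script rather than one block, `pd.set_option("display.max_columns", None)` can be called once before printing. Writing the table to CSV is not affected by these display options.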