如何使用 Python 将文本中的信息放入 CSV 中

How to put information from text into CSV with Python

我正在使用 re 搜索 html 文档并输出两个给定字符串之间的所有文本。返回的文本包含不同城镇的用水率,我需要将其自动转换为 csv table 格式。我怎样才能做到这一点?到目前为止,这是我的代码:

file = open(r'PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.html', 'r', encoding='utf8')
contents = file.read()
#Converts the html into a soup object
soup = BS(contents, 'html.parser')
rawText = soup.get_text(strip=True, separator='\n')

#Searches the soup object for two given strings and returns the text in between
finalText = re.search(r'19\s+domestic and stock rights(.*?)20\s+native title rights', rawText, flags=re.S|re.I).group(1)

print(finalText)  

这是需要抓取信息并放入 csv 中的输出文本:

The water requirements of persons entitled to domestic and stock rights in these water
sources are estimated to total 4,385 megalitres per year (hereafter
ML/year
),
distributed as follows:
(a)
91 ML/year in the Adjungbilly/Bombowlee/Brungle Water Source,
(b)
75 ML/year in the Billabung Water Source,
(c)
72 ML/year in the Bredbo Water Source,
Page 25
Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
(d)
82 ML/year in the Burkes/Bullenbung Water Source,
(e)
45 ML/year in the Burrinjuck Dam Catchment Water Source,
(f)
0 ML/year the Burrumbuttock Water Source,
(g)
67 ML/year in the Gilmore/Sandy Water Source,
(h)
39 ML/year in the Goobarragandra Water Source,
(i)
28 ML/year in the Goodradigbee Water Source,
(j)
113 ML/year in the Hillas Water Source,
(k)
100 ML/year in the Houlaghans Water Source,
(l)
291 ML/year in the Jugiong Water Source,
(m)  75 ML/year in the Kyeamba Water Source,
(n)
178 ML/year in the Lake George Water Source,
(o)
44 ML/year the Lower Billabong Water Source,
(p)
169 ML/year the Lower Billabong Anabranch Water Source,
(q)
156 ML/year the Middle Billabong Water Source,
(r)
103 ML/year in the Molonglo Water Source,
(s)
73 ML/year the Mountain Water Source,
(t)
2 ML/year in the Murrumbidgee (Balranald to Weimby) Water Source,
(u)
34 ML/year in the Murrumbidgee (Gogeldrie to Waldaira) Water Source,
(v)
92 ML/year in the Murrumbidgee Central (Burrinjuck to Gogeldrie) Water
Source,
(w)  218 ML/year in the Murrumbidgee I Water Source,
(x)
133 ML/year in the Murrumbidgee II Water Source,
(y)
116 ML/year in the Murrumbidgee III Water Source,
(z)
73 ML/year in the Murrumbidgee North Water Source,
(aa)  476 ML/year in the Murrumbidgee Western Water Source,
(ab)  92 ML/year in the Muttama Water Source,
(ac)  150 ML/year in the Numeralla East Water Source,

这就是 table 的样子

Town    Water Usage
Billabung 75 ML
Muttam    92 ML

等..

从您提供的输出中,您并不清楚您到底想提取什么。如果您发布的 finalText 以及我认为您实际想要的内容,您可以执行以下操作:

import re

l = re.findall(r'(\d+)\s(ML)/year\s(.*) Water', finalText)
header = [('Water', 'Usage', 'Town')]

data = header + l

with open('your.csv', 'w') as f:
    for line in data:
        f.write(f"{line[2]},{line[0]},{line[1]}\n")

your.csv 看起来像

Town,Water,Usage
Adjungbilly/Bombowlee/Brungle,91,ML
Billabung,75,ML
Bredbo,72,ML
Burkes/Bullenbung,82,ML
Burrinjuck Dam Catchment,45,ML
the Burrumbuttock,0,ML
Gilmore/Sandy,67,ML
Goobarragandra,39,ML
Goodradigbee,28,ML
Hillas,113,ML
Houlaghans,100,ML
Jugiong,291,ML
Kyeamba,75,ML
Lake George,178,ML
the Lower Billabong,44,ML
the Lower Billabong Anabranch,169,ML
the Middle Billabong,156,ML
Molonglo,103,ML
the Mountain,73,ML
Murrumbidgee (Balranald to Weimby),2,ML
Murrumbidgee (Gogeldrie to Waldaira),34,ML
Murrumbidgee Central (Burrinjuck to Gogeldrie),92,ML
Murrumbidgee I,218,ML
Murrumbidgee II,133,ML
Murrumbidgee III,116,ML
Murrumbidgee North,73,ML
Murrumbidgee Western,476,ML
Muttama,92,ML
Numeralla East,150,ML

这适用于给定的文件。

如果你打算将它用于其他文档,你必须对某些行进行位更改(在代码中标记)。

这是代码。

f = open("sample.txt", 'r').readlines()[5:] #initial unwanted lines
for i in f:
    i= [x for x in i.rstrip().split(" ") if x!=" "]
    if len(i)!=1 and i[0] not in ['Page','Water']: # list will have first word of unwanted lines in the middle of the file text
        if '(' in i[0]:
            print (i[6],i[2],i[3])
        else:
            print (i[4],i[0],i[1])

输出:

Adjungbilly/Bombowlee/Brungle 91 ML/year
Billabung 75 ML/year
Bredbo 72 ML/year
Burkes/Bullenbung 82 ML/year
Burrinjuck 45 ML/year
Water 0 ML/year
Gilmore/Sandy 67 ML/year
Goobarragandra 39 ML/year
Goodradigbee 28 ML/year
Hillas 113 ML/year
Houlaghans 100 ML/year
Jugiong 291 ML/year
Kyeamba 75 ML/year
Lake 178 ML/year
Billabong 44 ML/year
Billabong 169 ML/year
Billabong 156 ML/year
Molonglo 103 ML/year
Water 73 ML/year
Murrumbidgee 2 ML/year
Murrumbidgee 34 ML/year
Murrumbidgee 92 ML/year
Murrumbidgee 218 ML/year
Murrumbidgee 133 ML/year
Murrumbidgee 116 ML/year
Murrumbidgee 73 ML/year
Murrumbidgee 476 ML/year
Muttama 92 ML/year
Numeralla 150 ML/year

试试这个

import regex
import pandas as pd

pattern = regex.compile('\d+ ML(?=\/year)|(?<= in the|the )[\w \/\(\)]+(?=\s+Water Source|Water)')
columns = ('Water Usage', 'Town')

res = [dict(zip(columns, pattern.findall(line))) for line in finalText.splitlines() if pattern.match(line)]
df = pd.DataFrame(res)
df.to_csv('output.csv', index=False) #this will save the output to csv file
print(df)

输出:

   Water Usage                                              Town
0        91 ML                    Adjungbilly/Bombowlee/Brungle 
1        75 ML                                        Billabung 
2        72 ML                                           Bredbo 
3        82 ML                                Burkes/Bullenbung 
4        45 ML                         Burrinjuck Dam Catchment 
5         0 ML                                    Burrumbuttock 
6        67 ML                                    Gilmore/Sandy 
7        39 ML                                   Goobarragandra 
8        28 ML                                     Goodradigbee 
9       113 ML                                           Hillas 
10      100 ML                                       Houlaghans 
11      291 ML                                          Jugiong 
12      178 ML                                      Lake George 
13       44 ML                                  Lower Billabong 
14      169 ML                        Lower Billabong Anabranch 
15      156 ML                                 Middle Billabong 
16      103 ML                                         Molonglo 
17       73 ML                                         Mountain 
18        2 ML               Murrumbidgee (Balranald to Weimby) 
19       34 ML             Murrumbidgee (Gogeldrie to Waldaira) 
20       92 ML   Murrumbidgee Central (Burrinjuck to Gogeldrie) 
21      133 ML                                  Murrumbidgee II 
22      116 ML                                 Murrumbidgee III 
23       73 ML                               Murrumbidgee North