Dynamic list from a site into postgres with Python

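Good morning,

I am trying to import data from this site. For each search result I want to get the date, creator, relevance, description, subject, audience and access rights, and put them into my postgres database. The problem is that the description is sometimes missing, so some results have 6 records and others have 7.

So my question is: if the description is not there, how can I create an empty value for it? Any hints on how to do this are welcome!

This is my script so far. If every result always had 7 records, it would fill the database (I tested with three, keep that in mind).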

import urllib.parse
import urllib.request
import re
import sys
import psycopg2 as dbapi

url = 'https://easy.dans.knaw.nl/ui/'
values = {'wicket:bookmarkablePage':':nl.knaw.dans.easy.web.search.pages.PublicSearchResultPage',
          'q' : 'opgraving'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
headers = {}
headers['User-Agent'] =  'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
req = urllib.request.Request(url,data, headers =headers)
resp = urllib.request.urlopen(req)
respData = resp.read()


saveRecord= open('C:/Users/berend/Desktop/record.txt','w')
record =  re.findall(r'<dd>(.*?)</dd>',str(respData))
for item in record:
    saveRecord.write("%s\n" % item)
saveRecord.close()

fin = open("C:/Users/berend/Desktop/record.txt",'r')
fit = open("C:/Users/berend/Desktop/record_schoon.txt",'w')
delete_list = ['</em>', '[',']','<em>','</span>', '<span>', '\n']
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fit.write(line)
fin.close()
fit.close()

open_record= open('C:/Users/berend/Desktop/record_schoon.txt','r')
content = list(open_record)
print(len(content))
open_record.close()

n = 3
for i in range(0, len(content), 3):
   q= content[i:i+n]
   con = dbapi.connect(database='import', user='postgres', password='xxx')
   cur = con.cursor()
   cur.execute("INSERT into import VALUES (%s,%s,%s)",q)
   con.commit()

The first 3 results:

2000
Groenewoudt, B.J.; Deeben, J.H.C.; Velde, H.M. van der
100% relevant
Na verkennend onderzoek in 1996 en een grootschalige opgraving met uitgebreid bodemkundig
opgraving
Archaeology
Open (registered users)
2001-09
Peters, F.J.C.; Peeters, J.H.M.
100% relevant
opgraving
Archaeology
Open (registered users)
2008
Jacobs, E.; Burnier, C.Y.
100% relevant
OPGRAVING
Archaeology
Open (registered users)

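In this case I would use the pandas and sqlalchemy libraries, which handle this more efficiently. I am suggesting a solution with additional packages because you did not say that they must "not" be used.

Instead of this: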

n = 3
for i in range(0, len(content), 3):
   q= content[i:i+n]
   con = dbapi.connect(database='import', user='postgres', password='xxx')
   cur = con.cursor()
   cur.execute("INSERT into import VALUES (%s,%s,%s)",q)
   con.commit()

Use something like this:

import pandas as pd
from sqlalchemy import create_engine

# create a connection engine using sqlalchemy
engine = create_engine('postgresql+psycopg2://postgres:xxx@localhost/import', echo=False)

# read the results file into a pandas DataFrame
df = pd.read_csv('C:/Users/berend/Desktop/record_schoon.txt', delimiter='\t') # or whatever your delimiter is
dfFill = df.fillna("") # "" will be blank space when any record is missing data or 'nan'
dfFill.to_sql("tablename", engine, if_exists="append", index=False) # change tablename to the name of your table in import; index=False keeps the DataFrame index from being written as an extra column
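
Note that fillna only helps once each result is already one row. With one field per line, as in the sample output in the question, the lines have to be regrouped first. Below is a minimal sketch of one way to do that, not a tested solution for the real site. It assumes that the relevance field always looks like "100% relevant" and is always the third field of a result, and that the description is the only field that can be missing. The names FIELDS, RELEVANCE and split_records are made up for this illustration.

```python
import re

# Assumed column order, taken from the field list in the question.
FIELDS = ["date", "creator", "relevance", "description",
          "subject", "audience", "access"]
# Assumption: the relevance field always looks like "100% relevant".
RELEVANCE = re.compile(r"^\d+% relevant$")


def split_records(lines):
    """Group a flat list of field values into 7-field rows.

    A result is assumed to start two lines before each relevance line
    (date, creator, relevance). A 6-field result is padded with an
    empty description at position 3.
    """
    lines = [line.strip() for line in lines if line.strip()]
    starts = [i - 2 for i, line in enumerate(lines)
              if RELEVANCE.match(line) and i >= 2]
    rows = []
    for pos, start in enumerate(starts):
        end = starts[pos + 1] if pos + 1 < len(starts) else len(lines)
        record = lines[start:end]
        if len(record) == len(FIELDS) - 1:
            record.insert(3, "")  # description missing: empty placeholder
        if len(record) != len(FIELDS):
            raise ValueError("unexpected number of fields: %r" % (record,))
        rows.append(tuple(record))
    return rows


# The three results shown in the question, one field per line.
sample = [
    "2000",
    "Groenewoudt, B.J.; Deeben, J.H.C.; Velde, H.M. van der",
    "100% relevant",
    "Na verkennend onderzoek in 1996 en een grootschalige opgraving met uitgebreid bodemkundig",
    "opgraving",
    "Archaeology",
    "Open (registered users)",
    "2001-09",
    "Peters, F.J.C.; Peeters, J.H.M.",
    "100% relevant",
    "opgraving",
    "Archaeology",
    "Open (registered users)",
    "2008",
    "Jacobs, E.; Burnier, C.Y.",
    "100% relevant",
    "OPGRAVING",
    "Archaeology",
    "Open (registered users)",
]

rows = split_records(sample)
for row in rows:
    print(row)
```

Each tuple in rows then has 7 values, so it could be passed to pd.DataFrame(rows, columns=FIELDS) or to psycopg2's cur.executemany with seven %s placeholders, assuming the target table has seven matching columns.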

HTH