Feedparser: insert into PostgreSQL only if the row does not already exist
I have run into several problems with this.
I am trying to use feedparser together with psycopg. The issue is that I do not want duplicate data.
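import feedparser as fp
import psycopg2

def dbFeed():
    conn_string = "host='localhost' dbname='rss_feed' user='postgres' password='somepassword'"
    print("Connecting to database\n ->%s" % (conn_string))
    try:
        conn = psycopg2.connect(conn_string)
        cursor = conn.cursor()
        print("Connected!\n")
    except psycopg2.Error:
        print('Unable to connect to the database')
        return  # nothing else can run without a connection
    # Raw string so the backslashes in the Windows path are not read as escapes
    feeds_to_parse = open(r"C:\Users\Work\Desktop\feedparser_entry_tests_world.txt", "r")
    for line in feeds_to_parse:
        # Each line holds one feed URL; strip the trailing newline before parsing
        parser = fp.parse(line.strip())
        x = len(parser['entries'])
        count = 0
        while count < x: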
I have tried several approaches for the body of that loop.
At first, I tried this:
cursor.execute("INSERT INTO feed (link, title, publication_date, newspaper) VALUES (%s, %s, %s, %s)",
(parser['entries'][count]['link'], parser['entries'][count]['title'],
parser['entries'][count]['published'],parser['feed']['title']))
But of course that gives me duplicate data. So I found this post here:
Avoiding duplicated data in PostgreSQL database in Python
I tried it, but I get a "tuple index out of range" error:
cursor.execute("""INSERT INTO feed (link, title, publication_date, newspaper) SELECT %s, %s, %s, %s WHERE NOT EXISTS
(feed.title FROM feed WHERE feed.title=%s);""",
(parser['entries'][count]['link'], parser['entries'][count]['title'],
parser['entries'][count]['published'],parser['feed']['title']))
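The "tuple index out of range" error is consistent with a placeholder mismatch: the statement above contains five %s placeholders but is given only four values. A quick way to check this kind of mismatch is to count both sides before executing. The helper name count_placeholders and the sample values below are made up for illustration, and the subquery is written with SELECT 1 only so the string is well-formed SQL.

```python
import re

def count_placeholders(sql):
    """Count %s placeholders, ignoring literal %% escapes (hypothetical helper)."""
    # Drop escaped percent signs first so "%%s" is not counted as a placeholder.
    return len(re.findall(r"%s", sql.replace("%%", "")))

sql = (
    "INSERT INTO feed (link, title, publication_date, newspaper) "
    "SELECT %s, %s, %s, %s WHERE NOT EXISTS "
    "(SELECT 1 FROM feed WHERE feed.title = %s)"
)
# Made-up values standing in for the four fields taken from the parsed feed.
params = ("http://example.com/a", "Some title", "2015-01-01", "Some paper")

n_placeholders = count_placeholders(sql)
n_params = len(params)
print(n_placeholders, n_params)  # 5 4
```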
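But in any case, that is not the way I want to do it. I would like to add a condition inside my while loop that tests whether the data already exists before inserting, because I do not want to check against the whole database, only against the most recent entries. Once again, of course, it does not work, because I guess parser['entries'][count]['title'] is not what I think it is...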
        while count < x:
            if parser['entries'][count]['title'] != cursor.execute("SELECT feed.title FROM feed WHERE publication_date > current_date - 15"):
                cursor.execute("INSERT INTO feed (link, title, publication_date, newspaper) VALUES (%s, %s, %s, %s)",
                    (parser['entries'][count]['link'], parser['entries'][count]['title'],
                    parser['entries'][count]['published'],parser['feed']['title']))
    conn.commit()
    cursor.close()
    conn.close()
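A note on the loop above: in psycopg2, cursor.execute() returns None, so the != comparison is always true and every entry is inserted. The rows have to be read with fetchall() instead. Below is a minimal sketch of the intended check, fetching the recent titles once into a set and testing membership. It uses the standard-library sqlite3 module with an in-memory database as a stand-in so that it runs without a PostgreSQL server (hence ? placeholders instead of %s), computes the 15-day cutoff in Python rather than with current_date - 15, and uses a made-up entries list in place of parser['entries'].

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE feed (link TEXT, title TEXT, publication_date TEXT, newspaper TEXT)"
)

today = date.today()
# One row that is already stored, published yesterday.
cursor.execute(
    "INSERT INTO feed VALUES (?, ?, ?, ?)",
    ("http://example.com/a", "Already stored",
     (today - timedelta(days=1)).isoformat(), "Example Paper"),
)

# Made-up stand-in for parser['entries'].
entries = [
    {"link": "http://example.com/a", "title": "Already stored", "published": today.isoformat()},
    {"link": "http://example.com/b", "title": "Brand new", "published": today.isoformat()},
]

# Fetch the titles of the last 15 days once, instead of querying per entry.
cutoff = (today - timedelta(days=15)).isoformat()
cursor.execute("SELECT title FROM feed WHERE publication_date > ?", (cutoff,))
recent_titles = {row[0] for row in cursor.fetchall()}

inserted = []
for entry in entries:
    if entry["title"] not in recent_titles:
        cursor.execute(
            "INSERT INTO feed (link, title, publication_date, newspaper) VALUES (?, ?, ?, ?)",
            (entry["link"], entry["title"], entry["published"], "Example Paper"),
        )
        recent_titles.add(entry["title"])  # also guards against repeats within one feed
        inserted.append(entry["title"])
conn.commit()

cursor.execute("SELECT COUNT(*) FROM feed")
row_count = cursor.fetchone()[0]
print(inserted, row_count)
```

Iterating over the entries directly also avoids the manual count index, which the while loop never increments in the snippet as posted.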
You have to pass the title a second time for the WHERE part, and you can also add extra conditions there:
cursor.execute(
    "INSERT INTO feed (link, title, publication_date, newspaper) "
    "SELECT %s, %s, %s, %s WHERE NOT EXISTS (SELECT 1 FROM feed "
    "WHERE title = %s AND publication_date > current_date - 15);",
    (parser['entries'][count]['link'],
     parser['entries'][count]['title'],
     parser['entries'][count]['published'],
     parser['feed']['title'],
     # Fifth value feeds "WHERE title = %s": the entry title, not the feed title
     parser['entries'][count]['title']))
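To see the INSERT ... SELECT ... WHERE NOT EXISTS pattern behave as described, here is a small self-contained sketch. It is not the psycopg2 code itself: it uses the standard-library sqlite3 module with an in-memory database as a stand-in so it can run without a PostgreSQL server, which is why the placeholders are ? rather than %s. The function name insert_if_new and the sample values are made up, and the date condition is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE feed (link TEXT, title TEXT, publication_date TEXT, newspaper TEXT)"
)

# Five placeholders: four inserted values plus the title checked in the subquery.
INSERT_IF_NEW = (
    "INSERT INTO feed (link, title, publication_date, newspaper) "
    "SELECT ?, ?, ?, ? WHERE NOT EXISTS "
    "(SELECT 1 FROM feed WHERE title = ?)"
)

def insert_if_new(cursor, link, title, published, newspaper):
    """Insert one entry unless a row with the same title already exists."""
    # The entry title appears twice: once as a value, once for the WHERE clause.
    cursor.execute(INSERT_IF_NEW, (link, title, published, newspaper, title))

# Inserting the same made-up entry twice should leave a single row.
insert_if_new(cursor, "http://example.com/a", "Some title", "2015-01-01", "Example Paper")
insert_if_new(cursor, "http://example.com/a", "Some title", "2015-01-01", "Example Paper")
# A different title is inserted normally.
insert_if_new(cursor, "http://example.com/b", "Other title", "2015-01-02", "Example Paper")
conn.commit()

cursor.execute("SELECT title FROM feed ORDER BY title")
titles = [row[0] for row in cursor.fetchall()]
print(titles)  # ['Other title', 'Some title']
```

Doing the check inside the statement keeps it to one round trip per entry. It does not protect against two concurrent writers inserting the same title at the same moment; a unique constraint on the column would be the usual safeguard for that case.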