MySQL / ignore duplicates on multiple columns
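Good afternoon all,
Following on from my previous question,
I have created a small Python app that scrapes job postings from a website for me.
I save them in a small SQL database.
The columns are
job_id / job_title / job_company / job_salary / job_location / job_post_date
I plan to run my script every day and want to ignore duplicate entries.
What kind of query can check 2 columns, i.e. title and company, to make sure a posting is not inserted again? The posted date always increments by 1 each day.
Python code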
import mysql.connector
import requests
from bs4 import BeautifulSoup

my_db = mysql.connector.connect(
    host="192.168.1.1",
    user='job_user',
    password='job1',
    database='job_db'
)
my_cursor = my_db.cursor()
radius = "10"
val1 = input("Enter Job: ")
val2 = input("Enter Location: ")
val3 = input("Enter Radius default:(10): ")
url = "https://www.indeed.co.uk/jobs?q={}&l={}&radius={}".format(val1, val2, val3)
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result1 = soup.find_all(class_="jobsearch-SerpJobCard")
for results in result1:
    job = results.find('a', attrs={'data-tn-element': 'jobTitle'})
    company = results.find('span', attrs={'class': 'company'})
    location = results.find('span', attrs={'class': 'location accessible-contrast-color-location'})
    salary = results.find('span', attrs={'class': 'salaryText'})
    date_pos = results.find('span', attrs={'class': 'date'})
    i1 = job.text.strip()
    i2 = company.text.strip()
    if location is not None:
        i3 = location.text.strip()
    else:
        i3 = "N/A"
    if salary is not None:
        i4 = salary.text.strip()
    else:
        i4 = "N/A"
    # Keep only the first character of the date text (the number of days)
    i5 = date_pos.text.strip()[:1]
    print(i1)
    print(i2)
    print(i3)
    print(i4)
    print("\n")
    # Note: i3 holds the location and i4 the salary, so with this column order
    # they end up in job_salary and job_location respectively (see the output below).
    sql = "INSERT INTO job_tbl (job_title, job_company, job_salary, job_location, job_posted) \
    VALUES (%s, %s, %s, %s, %s)"
    val = (i1, i2, i3, i4, i5)
    my_cursor.execute(sql, val)
    my_db.commit()
SQL query
+--------+-------------------------------------------------------------------+-------------------------------+----------------+----------------------------+------------+
| job_id | job_title | job_company | job_salary | job_location | job_posted |
+--------+-------------------------------------------------------------------+-------------------------------+----------------+----------------------------+------------+
| 1 | IT Technician | Strathallan School | N/A | £19,000 - £23,000 a year | 3 |
| 2 | English into Romanian IT/Technical Translator (relocation to... | Alpha CRC Ltd. | N/A | £22,000 - £25,000 a year | 7 |
| 3 | IT/Trainee IT Support Analyst | CJ Lang & Son Limited | Dundee DD4 8JU | N/A | 3 |
| 4 | IT Technical Support Apprentice | GP Strategies Training Ltd | Dundee | £10,000 - £12,000 a year | 1 |
| 5 | IT Operations Manager - IRC84524 | Scottish Government | Dundee DD1 | £48,930 - £61,006 a year | 3 |
| 6 | Temporary IT Buyer | brightsolid | Dundee | N/A | 7 |
| 7 | IT Site Support Analyst | Thermo Fisher Scientific | Perth | N/A | 6 |
| 8 | Network and System Administrator | Solutions Driven | Forfar | £30,000 - £35,000 a year | 3 |
| 9 | IT Service Desk Team Leader | Cross Resourcing | Dundee | N/A | 3 |
| 10 | Senior Network Engineer | Raytheon Intelligence & Space | Glenrothes | N/A | 3 |
| 11 | Solutions Architect | NCR | Dundee | N/A | 3 |
| 12 | Technical Support Specialist | The Army | N/A | £15,985 - £20,400 a year | 3 |
| 13 | Pre-Sales Solutions Architect – 12 Month Graduate Internship... | DELL | N/A | N/A | 3 |
+--------+-------------------------------------------------------------------+-------------------------------+----------------+----------------------------+------------+
13 rows in set (0.002 sec)
If I run the same app again, it adds the same results. What I want is to match on Title & Company and check whether the posting has already been added.
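You would typically put a unique constraint on the tuple of columns and use insert ... on duplicate key update when inserting, so that duplicates are not inserted and the date of the existing row is updated instead.
So something like: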
create table mytable (
    id int primary key auto_increment,
    title varchar(50), -- adjust the size as needed
    company varchar(50),
    salary int, -- the scraped salaries are free text, so varchar may fit better
    location varchar(50),
    post_date datetime
        default current_timestamp, -- not mandatory, but maybe helpful?
    unique (title, company)
);
Then:
insert into mytable (title, company, salary, location)
values (?, ?, ?, ?)
on duplicate key update post_date = current_timestamp;
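To connect this back to the Python script in the question, here is a minimal sketch of how the statement might be issued from Python. It assumes the `mytable` schema above and a DB-API style cursor such as the one `mysql.connector` returns, which uses `%s` placeholders rather than `?`. The names `UPSERT_SQL` and `upsert_job` are hypothetical and only for illustration.

```python
# Sketch only: assumes the `mytable` schema from the answer above and a
# DB-API cursor such as the one returned by mysql.connector's cursor().
UPSERT_SQL = (
    "INSERT INTO mytable (title, company, salary, location) "
    "VALUES (%s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE post_date = CURRENT_TIMESTAMP"
)


def upsert_job(cursor, title, company, salary, location):
    """Insert a job, or refresh post_date if (title, company) already exists.

    `upsert_job` is a hypothetical helper name, not part of any library.
    The unique (title, company) constraint is what triggers the update path.
    """
    cursor.execute(UPSERT_SQL, (title, company, salary, location))
```

Inside the scraping loop this would replace the plain INSERT: call `upsert_job(my_cursor, ...)` with the values in the order title, company, salary, location, then call `my_db.commit()` as before.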