How to get the text of an <a> tag that contains a specific URL
I have a question that I don't know the answer to, and it might be an interesting one.
I am crawling for links like this one:
<a href="http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml">Prosta delovna mesta v Sandozu</a>
Now that I have found it, I also want the text of the tag: "Prosta delovna mesta v Sandozu"
How do I get the text?
With a plain string it seems easy; this works:
response.xpath('//a[@href="http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml"]/text()').extract()
But I am inside a loop and only have a variable that refers to this URL. I tried several options, such as:
response.xpath('//a[@href=url_orig]/text()').extract()
response.xpath('//a[@href='url_orig']/text()').extract()
word = "career"
response.xpath('//a[contains(@href, "%s")]/text()').extract() % word
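But none of them works. I am looking for a way to put a variable, rather than a literal string, into "@href" or the 'contains' function. Here is my code. Do you think there is a way?
Thanks,
Marko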
# Imports this method relies on (Python 3 module paths assumed; JobItem comes from the project's own items module).
from urllib.parse import urljoin, urlparse
from scrapy import Request
from scrapy.utils.response import get_base_url

def parse(self, response):
    response.selector.remove_namespaces()
    #We take all urls, they are marked by "href". These are either webpages on our website or new websites.
    urls = response.xpath('//@href').extract()
    #Base url.
    base_url = get_base_url(response)
    #Loop through all urls on the webpage.
    for url in urls:
        #If url represents a picture, a document, a compression ... we ignore it. We might have to change that because some companies provide job vacancies information in PDF.
        if url.endswith((
            #images
            '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico',
            '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO',
            #documents
            '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf',
            '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF',
            #music and video
            '.mp3', '.mp4', '.mpg', '.ai', '.avi',
            '.MP3', '.MP4', '.MPG', '.AI', '.AVI',
            #compressions and other
            '.zip', '.rar', '.css', '.flv',
            '.ZIP', '.RAR', '.CSS', '.FLV',
        )):
            continue
        #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
        #However in this case we exclude good urls like http://www.mdm.si/company#employment
        if any(x in url for x in ['?', '%', '&', '#']):
            continue
        #Ignore ftp.
        if url.startswith("ftp"):
            continue
        #Keep the href exactly as it appears in the page, so url_orig is defined for absolute urls too.
        url_orig = url
        #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
        # -- It is true, that we may get some strange urls, but it is fine for now.
        if not (url.startswith("http")):
            url = urljoin(base_url,url)
        #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
        if (urlparse(url).netloc == urlparse(base_url).netloc):
            #The main part. We look for webpages, whose urls include one of the employment words as strings.
            # -- Instruction.
            # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
            if any(x in url for x in [
                'careers',
                'Careers',
                'jobs',
                'Jobs',
                'employment',
                'Employment',
                'join_us',
                'Join_Us',
                'Join_us',
                'vacancies',
                'Vacancies',
                'work-for-us',
                'working-with-us',
            ]):
                #We found url that includes one of the magic words. We check, if we have found it before. If it is new, we add it to the list "jobs_urls".
                if url not in self.jobs_urls:
                    self.jobs_urls.append(url)
                    item = JobItem()
                    item["link"] = url
                    #item["term"] = response.xpath('//a[@href=url_orig]/text()').extract()
                    #item["term"] = response.xpath('//a[contains(@href, "career")]/text()').extract()
                    #We return the item.
                    yield item
            #We don't put "else" sentence because we want to explore the employment webpage to find possible new employment webpages.
            #We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
            yield Request(url, callback = self.parse)
You need to put the URL in quotes inside the XPath expression and fill it in with string formatting:
item["term"] = response.xpath('//a[@href="%s"]/text()' % url_orig).extract()
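To spell out why the earlier attempts fail: in '//a[@href=url_orig]' the name url_orig is just text inside the Python string, so XPath reads it as a child element name rather than as the Python variable. In the third attempt the % operator is applied to the list returned by extract() rather than to the expression string, which raises a TypeError. Below is a minimal sketch of building the expression first and only then passing it to xpath(). The helper names href_xpath and href_contains_xpath are hypothetical, and the sketch assumes the value contains no double quote.

```python
def href_xpath(url):
    """Build an XPath selecting the text of <a> tags whose href equals url."""
    # Assumption: url contains no double-quote character; such a value
    # would need different quoting inside the XPath expression.
    return '//a[@href="%s"]/text()' % url


def href_contains_xpath(word):
    """Build an XPath selecting the text of <a> tags whose href contains word."""
    # The % operator is applied to the expression string itself,
    # before the string is handed to xpath().
    return '//a[contains(@href, "%s")]/text()' % word


url_orig = "http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml"
print(href_xpath(url_orig))
print(href_contains_xpath("career"))

# Inside the spider these would be used as, for example:
#   item["term"] = response.xpath(href_xpath(url_orig)).extract()
```

Note that the expression has to be built from the href exactly as it appears in the page (url_orig), not from the absolute URL produced by urljoin, because the attribute comparison is a plain string match. Recent Scrapy versions also appear to accept XPath variables, for example response.xpath('//a[@href=$url]/text()', url=url_orig), which avoids the quoting problem altogether.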