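I want to extract salary information from job specs in a generic way, taking into account the many ways a salary can be specified (with and without the word "Salary", with and without the "0"s, and so on).
Taking several different job specs, I scraped the HTML with urllib2 and then ran an initial case-insensitive grep for the word "salary" on each one. The results vary wildly (apologies for the not-so-pretty paste from Jupyter):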
In [52]:
urllib2
Out[52]:
<module 'urllib2' from '/Users/Evan/anaconda/lib/python2.7/urllib2.pyc'>
In [82]:
# Case 1
reponse = urllib2.urlopen('http://apply.ovoenergycareers.co.uk/vacancies/453/cro-manager/london/')
In [83]:
content = reponse.read()
In [84]:
save_html('salarygrep1', content)
In [59]:
!grep -i salary salarygrep1.html
<dt class="field_salary">Salary</dt>
<dd class="value_salary">
In [86]:
with open('salarygrep1.html') as s:
    for line in s:
        if 'salary' in line.lower():
            print line
<dt class="field_salary">Salary</dt>
<dd class="value_salary">
In [79]:
# Case 2
reponse = urllib2.urlopen('http://apply.ovoenergycareers.co.uk/vacancies/475/ovo-telesales-agent/bristol/')
In [80]:
content = reponse.read()
In [81]:
save_html('salarygrep2', content)
In [63]:
!grep -i salary salarygrep2.html
<dt class="field_salary">Salary</dt>
<dd class="value_salary">
Salary: £18,000 + benefits & competitive commission scheme; OTE range: £20,500 - £30,000
In [87]:
with open('salarygrep2.html') as s:
    for line in s:
        if 'salary' in line.lower():
            print line
<dt class="field_salary">Salary</dt>
<dd class="value_salary">
Salary: £18,000 + benefits & competitive commission scheme; OTE range: £20,500 - £30,000
In [88]:
# Case 3
reponse = urllib2.urlopen('https://gs7.globalsuccessor.com/centrica02/tpl_centrica02.asp?s=4A515F4E5A565B1A&jobid=48490,2356610248&key=21798303&c=028859657862&pagestamp=dbykvxmmwfnblykbqc')
In [89]:
content = reponse.read()
In [90]:
save_html('salarygrep3', content)
In [67]:
!grep -i salary salarygrep3.html
<p id="igSoundBite"><em><div>Salary: £28-£38K depending on experience</div></em></p><h3 id="igJobDesc0">Job Description</h3><p><div>Assistant Product Development Manager </div>
In [95]:
with open('salarygrep3.html') as s:
    for line in s:
        if 'salary' in line.lower():
            print line
<p id="igSoundBite"><em><div>Salary: £28-£38K depending on experience</div></em></p><h3 id="igJobDesc0">Job Description</h3><p><div>Assistant Product Development Manager </div>
In [70]:
# Case 4
reponse = urllib2.urlopen('http://jobs.emounlimited.com/senior-digital-project-manager/')
In [71]:
content = reponse.read()
In [72]:
save_html('salarygrep4', content)
In [94]:
!grep -i salary salarygrep4.html
In [92]:
with open('salarygrep4.html') as s:
    for line in s:
        if 'salary' in line.lower():
            print line
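Case 4, where the salary information appears to sit in a <div>, is not picked up at all. Given how much page design and salary notation vary, do you think a one-size-fits-all (or one-size-fits-most) regex, or a combination of regexes, could work? If so, how would I go about building it or them? Or is there a Python approach that relies less on regex?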
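Here is an idea: parse the HTML with BeautifulSoup, take the text of the body element (we are not interested in the rest), and search that text with a regular expression for amounts and ranges. The code: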
# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

urls = [
    "http://apply.ovoenergycareers.co.uk/vacancies/453/cro-manager/london/",
    "http://apply.ovoenergycareers.co.uk/vacancies/475/ovo-telesales-agent/bristol/",
    "https://gs7.globalsuccessor.com/centrica02/tpl_centrica02.asp?s=4A515F4E5A565B1A&jobid=48490,2356610248&key=21798303&c=028859657862&pagestamp=dbykvxmmwfnblykbqc",
    "http://jobs.emounlimited.com/senior-digital-project-manager/"
]

# Currency symbol, first amount, then an optional "- second amount" for ranges.
# The dollar sign is escaped so it matches a literal "$" rather than end-of-string.
# The ur"" prefix is Python 2 only; on Python 3 use r"" instead.
money_pattern = re.compile(ur"(\$|£)([0-9.,]+K?)(?:\s*-\s*(?:\$|£)*([0-9.,]+K?)*)*")

for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    text = soup.body.text  # text of the body element only

    print("URL: " + url)
    for currency, amount1, amount2 in money_pattern.findall(text):
        if not amount1 and not amount2:
            continue

        if not amount2:
            print("Single amount found: %s, currency: %s" % (amount1, currency))
        else:
            print("Range found: %s - %s, currency: %s" % (amount1, amount2, currency))
    print("------")
Output:
URL: http://apply.ovoenergycareers.co.uk/vacancies/453/cro-manager/london/
Range found: 40,000 - 50,000, currency: £
------
URL: http://apply.ovoenergycareers.co.uk/vacancies/475/ovo-telesales-agent/bristol/
Single amount found: 18,000, currency: £
Range found: 20,500 - 30,000, currency: £
------
URL: https://gs7.globalsuccessor.com/centrica02/tpl_centrica02.asp?s=4A515F4E5A565B1A&jobid=48490,2356610248&key=21798303&c=028859657862&pagestamp=dbykvxmmwfnblykbqc
Range found: 28 - 38K, currency: £
------
URL: http://jobs.emounlimited.com/senior-digital-project-manager/
Range found: 36 - 40,000, currency: £
------
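The amounts come back as strings in mixed notations ("18,000", "38K", and a bare "28" or "36" as the lower bound of a range), so a follow-up step would probably be to turn them into numbers. The sketch below is one possible way to do that and is not part of the code above: `to_number` and `normalise_range` are hypothetical helper names. It assumes that commas are thousands separators, that a trailing "K" means thousands, and that a lower bound under 1,000 paired with an upper bound of 1,000 or more was written in thousands. Input without any digits raises a ValueError.

```python
def to_number(amount):
    """Convert a matched amount such as '18,000' or '38K' to an int."""
    amount = amount.strip()
    multiplier = 1
    if amount.upper().endswith("K"):
        multiplier = 1000
        amount = amount[:-1]
    # Assumption: commas are thousands separators; also drop a trailing
    # full stop that the pattern may have captured from the sentence.
    cleaned = amount.replace(",", "").rstrip(".")
    return int(round(float(cleaned) * multiplier))


def normalise_range(low, high):
    """Return (low, high) as ints, guessing at shorthand like '28 - 38K'."""
    low_n, high_n = to_number(low), to_number(high)
    # Assumption: a lower bound under 1,000 next to an upper bound of
    # 1,000 or more was written in thousands ('28 - 38K', '36 - 40,000').
    if low_n < 1000 <= high_n:
        low_n *= 1000
    return low_n, high_n


print(to_number("18,000"))                    # 18000
print(normalise_range("28", "38K"))           # (28000, 38000)
print(normalise_range("36", "40,000"))        # (36000, 40000)
print(normalise_range("20,500", "30,000"))    # (20500, 30000)
```

The thousands heuristic is a guess, so it may be worth logging the raw strings alongside the normalised values and spot-checking them.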
Hope this at least helps you get started.