How can I extract information from a single line of text that has no HTML classes?


阿鲁拉火焰

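I am trying to scrape my first website (https://news.ycombinator.com/jobs) using scrapy and Python. The information I need to extract is:
- the name of the company that is hiring
- the location of the company
- the position being advertised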

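These fields do not have separate tags in the page HTML, and the text does not follow a fixed pattern. For example: ZeroCater (YC W11) Is Hiring a Principal Engineer in SF: Must Love Food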

Regular expressions alone are not enough to extract this information. Is there an efficient and simple solution to this problem?

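I have tried Python regular expressions. I have also looked into NLP and text classification with nltk, but nltk would make the code more complex and is time-consuming.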

马哈茂德·艾尔沙哈特

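What I would do in this case is look for any pattern that helps extract the data. For example, I can see that these phrases occur frequently: "is hiring|is looking for|is looking to hire|hiring". The company name comes before them, and the location, when present, comes after the word "in".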

This is just a small trial; you can extend it to get what you need:

import re
text = """ZeroCater (YC W11) Is Hiring a Principal Engineer in SF: Must Love Food (zerocater.com)
OneSignal Is Hiring Full Stack Engineers in San Mateo (onesignal.com)
Faire (YC W17) Is Looking to Hire Business Operations Leads (greenhouse.io)
InsideSherpa (YC W19) Is Hiring Software Engineers in Sydney (workable.com)
Jerry (YC S17) Is Hiring Senior Software Dev, Data Engineer (Toronto/Remote) (getjerry.com)
Iris Automation Is Hiring an Account Executive for B2B Flying Vehicle Software (irisonboard.com)"""

data = text.lower().splitlines()

for i, line in enumerate(data):
    # getting company name
    data[i] = re.split(r'is hiring|is looking for|is looking to hire|hiring', line)

    # job title and location if present
    data[i][1] = re.split(r' in ', data[i][1])

print('company --- Job Title --- Location')
for c in data:
    print(f'{c[0]} --- {c[1][0]} --- {c[1][1] if len(c[1])>1 else ""}')

Output:

company --- Job Title --- Location
zerocater (yc w11)  ---  a principal engineer --- sf: must love food (zerocater.com)
onesignal  ---  full stack engineers --- san mateo (onesignal.com)
faire (yc w17)  ---  business operations leads (greenhouse.io) --- 
insidesherpa (yc w19)  ---  software engineers --- sydney (workable.com)
jerry (yc s17)  ---  senior software dev, data engineer (toronto/remote) (getjerry.com) --- 
iris automation  ---  an account executive for b2b flying vehicle software (irisonboard.com) ---

Of course, this code needs a lot of modification before it gives reliable results, but at least it is a start.
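
As one possible next step, here is a minimal sketch of how the same idea could be hardened. It is not part of the original answer, and the names `SITE_RE`, `HIRING_RE` and `parse_job_title` are made up for illustration. It assumes that each title may end with a "(site.com)" suffix, that the hiring phrases are the same ones listed above, and that the first " in " after the hiring phrase separates the job title from the location. Unlike the trial above, it keeps the original capitalisation and returns None for lines without a hiring phrase instead of raising an IndexError.

```python
import re

# Assumption: a title may end with the source site in parentheses, e.g. "(zerocater.com)".
SITE_RE = re.compile(r"\s*\((?P<site>[\w.-]+\.[a-z]{2,})\)\s*$", re.IGNORECASE)

# Company name, then one of the hiring phrases from the answer, then the rest of the title.
HIRING_RE = re.compile(
    r"^(?P<company>.+?)\s+"
    r"(?:is\s+hiring|is\s+looking\s+for|is\s+looking\s+to\s+hire|hiring)\s+"
    r"(?P<rest>.+)$",
    re.IGNORECASE,
)


def parse_job_title(line):
    """Split one job title into company, title, location and site.

    Returns None when no hiring phrase is found, so the caller can
    decide how to handle titles that do not fit the pattern.
    """
    line = line.strip()

    # Cut off the trailing "(site.com)" first so it does not end up in the location.
    site = ""
    site_match = SITE_RE.search(line)
    if site_match:
        site = site_match.group("site")
        line = line[:site_match.start()]

    match = HIRING_RE.match(line)
    if match is None:
        return None

    # Everything before the first " in " is treated as the job title,
    # everything after it as the location (empty if there is no " in ").
    title, _, location = match.group("rest").partition(" in ")
    return {
        "company": match.group("company").strip(),
        "title": title.strip(),
        "location": location.strip(),
        "site": site,
    }


if __name__ == "__main__":
    samples = [
        "ZeroCater (YC W11) Is Hiring a Principal Engineer in SF: Must Love Food (zerocater.com)",
        "Faire (YC W17) Is Looking to Hire Business Operations Leads (greenhouse.io)",
        "A title without any of the known phrases",
    ]
    for sample in samples:
        print(parse_job_title(sample))
```

This still has the limits of the original trial: a location written without "in", such as "(Toronto/Remote)", stays inside the job title, and text after the location, such as ": Must Love Food", stays inside the location. Titles that come back as None are good candidates for adding new phrases to the pattern.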
