How do I download full-text articles from PubMed?

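Kahini Wadhawan

I am working on a project that requires the GENIA corpus. According to the literature, the GENIA corpus is built from articles retrieved by searching Medline/PubMed with three MeSH terms ("transcription factor", "blood cell", and "human"). I want to retrieve from PubMed the full text (where freely available) of the articles in the GENIA corpus. I have tried many approaches but have not found a way to download the full text in text, XML, or PDF format.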

Using the Entrez utilities provided by NCBI:

I tried the method described here: http://www.hpa-bioinformatics.org.uk/bioruby-api/classes/Bio/NCBI/REST/EFetch/Methods.html#M002197

It uses the Ruby gem Bio to fetch information for a given PubMed ID, like this: Bio::NCBI::REST::EFetch.pubmed(15496913)

However, it does not return the full text for the PMID.
Internally, it makes a call like this: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=1372388&retmode=text&rettype=medline
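
The request above can be reproduced without the Ruby gem. The following is a minimal Python sketch that only assembles the same EFetch URL. `build_efetch_url` is a hypothetical helper name, and the https scheme is an assumption (the call quoted above uses http). For db=pubmed the response is a MEDLINE citation record, not the article body.

```python
from urllib.parse import urlencode

# Base address of the EFetch utility (https is assumed here).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmid, db="pubmed", rettype="medline", retmode="text"):
    """Return the EFetch URL for one record (hypothetical helper)."""
    params = {
        "db": db,
        "id": str(pmid),
        "retmode": retmode,
        "rettype": rettype,
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Same PMID as in the call quoted above.
    print(build_efetch_url(1372388))
```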

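Neither the Ruby gem nor the call above returns the full text.
After searching further online, I found that the allowed rettype and retmode values for the pubmed database do not include an option for full text, as shown in this table: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly
All the examples and scripts I have seen on the Internet only deal with extracting abstracts, authors, and so on; none of them discusses extracting the full text.
Here is another link I found that uses the Python package Bio, but it only accesses author information: https://www.biostars.org/p/172296/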

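How can I download the full text of an article in text, XML, or PDF format using the Entrez utilities provided by NCBI? Or is there an existing script or web crawler I can use?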

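Maximilian Peters

You can use biopython to find the articles that are on PubMedCentral and then get the PDF from there. For articles hosted elsewhere, it is hard to find a generic solution for getting the PDF.

It seems PubMedCentral does not want you to download articles in bulk. Requests via urllib are blocked, although the same URL works from a browser.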
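
As an alternative to the article web pages, EFetch also accepts db=pmc and returns XML by default for that database. The sketch below is not part of the original answer: `pmc_fulltext_url` and `fetch_pmc_xml` are hypothetical helper names, and it assumes the article is in PMC and that its publisher permits full-text retrieval; otherwise the XML may contain little more than metadata.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pmc_fulltext_url(pmc_id):
    """Build an EFetch URL for one PMC article.

    Accepts 'PMC212403' or '212403'; the numeric part is sent as the id.
    """
    numeric_id = str(pmc_id).strip().upper()
    if numeric_id.startswith("PMC"):
        numeric_id = numeric_id[3:]
    if not numeric_id.isdigit():
        raise ValueError("not a PMC identifier: %r" % (pmc_id,))
    params = {"db": "pmc", "id": numeric_id, "retmode": "xml"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_pmc_xml(pmc_id, timeout=30):
    """Download the XML for one article (performs a network request)."""
    with urlopen(pmc_fulltext_url(pmc_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_pmc_xml() to actually download.
    print(pmc_fulltext_url("PMC212403"))
```

When fetching many articles this way, keep the request rate low and stay within NCBI's usage guidelines.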

from Bio import Entrez

# NCBI asks for a contact address; replace this placeholder with your own.
Entrez.email = "your.name@example.com"

# id is a comma-separated string of PubMed IDs;
# not every PubMed record has a public PMC article
handle = Entrez.efetch(db="pubmed", id="19304878,19088134", retmode="xml")
records = Entrez.parse(handle)

# Check every record for a PMC identifier and
# print the URL for downloading the PDF
for record in records:
    citation = record.get('MedlineCitation')
    if citation and citation.get('OtherID'):
        for other_id in citation['OtherID']:
            if other_id.upper().startswith('PMC'):
                print('http://www.ncbi.nlm.nih.gov/pmc/articles/%s/pdf/' % other_id.upper())
handle.close()
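
The filtering logic above can also be written as a small function over plain dictionaries, which makes it easy to check without a network call. This is a sketch, not part of the original answer: `pmc_pdf_urls` is a hypothetical name, and the sample record uses made-up identifiers.

```python
PMC_PDF_URL = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/pdf/"


def pmc_pdf_urls(records):
    """Return PDF URLs for every PMC identifier found in the records."""
    urls = []
    for record in records:
        citation = record.get("MedlineCitation") or {}
        for other_id in citation.get("OtherID") or []:
            pmc_id = str(other_id).strip().upper()
            # OtherID can hold non-PMC identifiers too, so filter by prefix.
            if pmc_id.startswith("PMC"):
                urls.append(PMC_PDF_URL % pmc_id)
    return urls


if __name__ == "__main__":
    # Made-up record shaped like the parsed PubMed XML used above.
    sample = [{"MedlineCitation": {"OtherID": ["NIHMS000001", "PMC0000001"]}}]
    for url in pmc_pdf_urls(sample):
        print(url)
```

The function accepts the records produced by `Entrez.parse` as well, since those behave like dictionaries and lists of strings.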
