Recently I have taken on the task of downloading a large collection of files from the NCBI database, and I sometimes have to build several databases from them. The code below works: it downloads all the viruses from the NCBI website. My question is: is there any way to speed up the process of downloading these files?
Currently the program takes more than 5 hours to run. I have looked into multi-threading but could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling (I'm new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I sometimes get this with certain combinations of retstart and retmax. It crashes the program, and I then have to restart the download from a different position by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup
#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'
#This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type
#Loop from 0 to Count in steps of retmax. Use xrange over range
for i in xrange(0, Count, retmax):
    #Create the position string
    position = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + position + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as out:
        for line in response:
            out.write(line)
    #Gives a sense of where you are
    print Count - i - retmax
To download files using multiple threads:
#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool # use threads
from urllib2 import urlopen

def generate_urls(some, params): #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else: # success
        return (url, filename), None

def main():
    pool = Pool(20) # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))

if __name__ == "__main__":
    main()
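The answer leaves generate_urls as a stub. One hypothetical way to fill it in for the NCBI case is to yield one (url, filename) pair per retstart chunk, reusing the base/efetch/options strings built in the question's script; the parameter names and the chunk_NNNNNNNN.fasta filename scheme below are assumptions for illustration, not part of the answer:

```python
def generate_urls(count, retmax, base, efetch, options):
    """Yield one (url, filename) pair per efetch chunk.

    count/retmax play the same roles as Count/retmax in the question's
    script; each chunk gets its own file so failed chunks can be
    re-downloaded individually instead of appending to one big file.
    """
    for retstart in range(0, count, retmax):
        chunk = '&retstart=%d&retmax=%d' % (retstart, retmax)
        url = base + efetch + chunk + options
        filename = 'chunk_%08d.fasta' % retstart   # zero-padded so files sort in order
        yield url, filename
```

Writing each chunk to its own file also pairs well with the retry/error reporting in download(): a chunk that fails can simply be fetched again without corrupting the chunks that succeeded.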