Python Proxy Scraper / Checker adding multi-threading trouble

Ken_Mon

I have managed to piece together a proxy scraper/checker. It works, but it is quite slow. I have heard that adding threading can speed up the process, but that is beyond what I am capable of, and I am wondering whether anyone can show me how to implement threading in this code. I read that the threading library is included with Python, and I made an attempt at adding it, but it seemed to create a second thread doing exactly the same work, so both threads went through the same list of proxies at the same time and saved duplicates. Here is the code.

import requests
from bs4 import BeautifulSoup
from random import choice
import threading
import time
    
stop_flag = 0
    
def get_proxies():
    link = 'https://api.proxyscrape.com/?request=displayproxies&proxytype=all&timeout=5000&country=all&anonymity=all&ssl=no'
    other = 'https://www.proxy-list.download/api/v1/get?type=http'
    get_list1 = requests.get(link).text
    get_list2 = requests.get(other).text
    soup1 = BeautifulSoup(get_list1, 'lxml')
    soup2 = BeautifulSoup(get_list2, 'lxml')
    list1 = soup1.find('body').get_text().strip()
    list2 = soup2.find('body').get_text().strip()
    mix = list1+'\n'+list2+'\n'
    raw_proxies = mix.splitlines()
    t = threading.Thread(target=check_proxy, args=(raw_proxies,))
    t.start()
    time.sleep(0.5)
    return check_proxy(raw_proxies)

def check_proxy(proxies):
    check = 'http://icanhazip.com'       
    for line in proxies:

        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","Accept-Encoding": "*","Connection": "keep-alive"}
        try:    
            response = requests.get(check, proxies={'http': 'http://'+line}, headers=headers, timeout=5)
            status = response.status_code

            outfile = open('good_proxies.txt', 'a')
            if status is 200:
                print('good - '+line)
                outfile.write(line+'\n')
            else:
        
                pass
        except Exception:
            print('bad  - '+line)
    outfile.close()        

get_proxies()
Booboo

The following should run much faster. It is probably best to do all the file writing and printing in the main thread and have the worker threads simply return results:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial

def get_list(session, url):
    get_list = session.get(url).text
    soup = BeautifulSoup(get_list, 'lxml')
    return soup.find('body').get_text().strip()


def get_proxies(session, executor):
    link = 'https://api.proxyscrape.com/?request=displayproxies&proxytype=all&timeout=5000&country=all&anonymity=all&ssl=no'
    other = 'https://www.proxy-list.download/api/v1/get?type=http'
    lists = list(executor.map(partial(get_list, session), (link, other)))
    mix = lists[0] + '\n' + lists[1] + '\n'
    raw_proxies = mix.splitlines()
    with open('good_proxies.txt', 'a') as outfile:
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","Accept-Encoding": "*","Connection": "keep-alive"}
        session.headers.update(headers)
        futures = {executor.submit(partial(check_proxy, session), proxy): proxy for proxy in raw_proxies}
        for future in as_completed(futures):
            proxy = futures[future]
            is_good = future.result()
            if is_good:
                print('good -', proxy)
                outfile.write(proxy + '\n')
            else:
                print('bad -', proxy)


def check_proxy(session, proxy):
    check = 'http://icanhazip.com'
    try:
        response = session.get(check, proxies={'http': 'http://'+proxy}, timeout=5)
        status = response.status_code
        return status == 200
    except Exception:
        return False


N_THREADS = 100
with requests.Session() as session:
    with ThreadPoolExecutor(max_workers=N_THREADS) as executor:
        get_proxies(session, executor)

Partial output:

bad - 104.40.158.173:80
bad - 47.105.149.144:3128
good - 185.220.115.150:80
bad - 138.197.157.32:8080
bad - 138.197.157.32:3128
good - 116.17.102.174:3128
good - 183.238.173.226:3128
good - 119.8.44.244:8080
good - 1.174.138.125:8080
good - 116.17.102.131:3128
good - 101.133.167.140:8888
good - 118.31.225.11:8000
good - 117.131.119.116:80
good - 101.200.127.78:80
good - 1.70.67.175:9999
good - 116.196.85.150:3128
good - 1.70.64.160:9999
bad - 102.129.249.120:3128
bad - 138.68.41.90:3128
bad - 138.68.240.218:3128
good - 47.106.239.215:3328
good - 183.162.158.172:4216
bad - 138.68.240.218:8080
good - 115.219.131.244:3000
bad - 138.68.161.14:3128
good - 185.49.107.1:8080
bad - 134.209.29.120:8080

Explanation

The speed improvement comes primarily from threading and secondarily from the Session object provided by the requests package, whose main advantage is that if you make several requests to the same host, the same TCP connection is reused.
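As a minimal sketch of the Session behavior (the User-Agent string here is just a placeholder): a Session keeps a pool of open connections and also persists settings such as headers across requests, so you configure them once instead of on every call.

```python
import requests

# A Session keeps a pool of TCP connections open, so repeated requests
# to the same host skip the TCP handshake. It also persists settings
# such as headers across every request made through it.
with requests.Session() as session:
    session.headers.update({"User-Agent": "proxy-checker/1.0"})
    # Every session.get() from here on carries this header and reuses
    # the underlying connection where possible.
    print(session.headers["User-Agent"])
```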

Python provides two thread pooling mechanisms: (1) the undocumented multiprocessing.pool.ThreadPool class, which shares the same interface as the multiprocessing.pool.Pool class used to create a pool of subprocesses, and (2) the ThreadPoolExecutor class from the concurrent.futures module, which shares the same interface as the ProcessPoolExecutor class from the same module, used to create a process pool. This code uses the ThreadPoolExecutor class.
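To make the parallel concrete, here is a toy comparison of the two pools mapping the same trivial function (square is just a stand-in worker):

```python
from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

# (1) multiprocessing.pool.ThreadPool -- same interface as Pool,
# but the workers are threads rather than subprocesses.
with ThreadPool(4) as pool:
    result_pool = pool.map(square, range(5))

# (2) concurrent.futures.ThreadPoolExecutor -- same interface as
# ProcessPoolExecutor; its map returns a lazy iterator.
with ThreadPoolExecutor(max_workers=4) as executor:
    result_executor = list(executor.map(square, range(5)))

print(result_pool)      # [0, 1, 4, 9, 16]
print(result_executor)  # [0, 1, 4, 9, 16]
```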

Threads are lightweight and relatively inexpensive to create, and a typical desktop computer can support several thousand. A given application, depending on what it is doing, may not profit from creating threads beyond some maximum, however. Threading is only suitable for "jobs" that are not CPU-intensive, that is, jobs that relinquish the CPU frequently to allow other threads to run because they are, for example, waiting for an I/O operation or a URL request to complete. This is because Python bytecode cannot run in parallel in multiple threads: the interpreter acquires the Global Interpreter Lock (GIL) before executing bytecode.
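You can see the effect with a small experiment, using time.sleep as a stand-in for an I/O wait (sleep releases the GIL, just as waiting on a socket does):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(n):
    # time.sleep releases the GIL, like waiting on a network socket,
    # so the other threads run in the meantime.
    time.sleep(0.2)
    return n

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fake_request, range(10)))
elapsed = time.perf_counter() - start

# Ten 0.2-second "requests" overlap, so the total is close to 0.2 s
# rather than the 2 s a sequential loop would take.
print(f"{elapsed:.2f}s for {len(results)} tasks")
```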

A ThreadPoolExecutor instance (assigned to variable executor) is created, specifying the number of threads in the pool with the max_workers parameter. Here I rather arbitrarily specified 100 threads; you could try increasing this to see whether it improves performance. The ThreadPoolExecutor instance has two methods for submitting "jobs" or "tasks" to the thread pool for execution. See the concurrent.futures documentation. The map method is similar to the builtin map function in that it applies a function to every item of its iterable argument and returns an iterator yielding the results. The difference is that the function calls are made concurrently, by submitting each call as a "job" to the thread pool. The function that helps build the raw_proxies list is get_list, and it is responsible for retrieving a single URL:

def get_list(session, url):
    get_list = session.get(url).text
    soup = BeautifulSoup(get_list, 'lxml')
    return soup.find('body').get_text().strip()

I would now like to call this function concurrently for each URL, so I would like to use the map function with the list of URLs as the iterable argument. The problem is that map passes only a single argument to the worker function (one element of the iterable per call), but I also want to pass the session argument. I could have assigned session to a global variable, but there is another way. functools.partial(get_list, session) creates a new function that, when called, behaves as if get_list were being called with its first parameter "hard-coded" to session, and so I use this new function in the call to map:

lists = list(executor.map(partial(get_list, session), (link, other)))

I take the iterator returned by the call to map and turn it into a list that I can later index.
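The partial trick in isolation looks like this (get_list here is a toy stand-in that just echoes its arguments, not the real worker):

```python
from functools import partial

def get_list(session, url):
    # Stand-in for the real worker; it just shows which arguments arrive.
    return (session, url)

# fetch(url) now behaves like get_list("my-session", url), which is
# exactly the single-argument callable that executor.map requires.
fetch = partial(get_list, "my-session")

print(fetch("http://example.com"))
```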

The other method one can use to submit a job to a thread pool is submit. It takes as arguments the worker function and the worker function's arguments, and it immediately returns a Future instance without waiting for the job to complete. A Future has various methods; the most important is result, which blocks until the job has completed and returns the worker function's return value. I could easily have used map again, passing raw_proxies as the iterable argument and iterating over its return value, but then I would be blocking on the jobs in the order in which they were submitted (i.e. the order in which they appear in the raw_proxies list). That is probably not too bad, since the program cannot finish until all the jobs have finished anyway, but it is slightly more efficient to process each result as soon as its job completes, regardless of submission order, if you do not need the output in a specific order. The submit method, which returns a Future instance, provides that flexibility.

I submit each proxy IP individually as a separate job and store the resulting Future in a dictionary as the key, with the IP used to create the job as its value. I do it all in a single statement using a dictionary comprehension:

futures = {executor.submit(partial(check_proxy, session), proxy): proxy for proxy in raw_proxies}

I then use another function provided by concurrent.futures, as_completed, to iterate over the keys of the dictionary (the Futures), yielding each Future instance as it completes, and then query the Future for the job's return value, which will be either True or False:

    for future in as_completed(futures):
        proxy = futures[future]
        is_good = future.result()
        if is_good:
            print('good -', proxy)
            outfile.write(proxy + '\n')
        else:
            print('bad -', proxy)
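The difference between submission order and completion order can be seen in a small self-contained example (job and its sleep delays are placeholders for the real proxy checks):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def job(delay):
    # Stand-in for a proxy check whose duration varies per proxy.
    time.sleep(delay)
    return delay

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit the slowest job first.
    futures = {executor.submit(job, d): d for d in (0.3, 0.2, 0.1)}
    completed = [future.result() for future in as_completed(futures)]

# as_completed yields Futures in completion order, not submission order.
print(completed)  # [0.1, 0.2, 0.3]
```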
