I have managed to piece together a proxy scraper/checker. It works, but it is quite slow. I have heard that adding threading can speed up the process, but that is beyond what I am capable of, and I am wondering if anyone can show me how to implement threading in the code. I read that the threading library is included with Python, and I had an attempt at adding it, but it seemed to create a second thread doing exactly the same thing, so it was just going through the same list of proxies at the same time and saving duplicates. Here is the code.
import requests
from bs4 import BeautifulSoup
from random import choice
import threading
import time

stop_flag = 0

def get_proxies():
    link = 'https://api.proxyscrape.com/?request=displayproxies&proxytype=all&timeout=5000&country=all&anonymity=all&ssl=no'
    other = 'https://www.proxy-list.download/api/v1/get?type=http'
    get_list1 = requests.get(link).text
    get_list2 = requests.get(other).text
    soup1 = BeautifulSoup(get_list1, 'lxml')
    soup2 = BeautifulSoup(get_list2, 'lxml')
    list1 = soup1.find('body').get_text().strip()
    list2 = soup2.find('body').get_text().strip()
    mix = list1+'\n'+list2+'\n'
    raw_proxies = mix.splitlines()
    t = threading.Thread(target=check_proxy, args=(raw_proxies,))
    t.start()
    time.sleep(0.5)
    return check_proxy(raw_proxies)

def check_proxy(proxies):
    check = 'http://icanhazip.com'
    for line in proxies:
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding": "*", "Connection": "keep-alive"}
        try:
            response = requests.get(check, proxies={'http': 'http://'+line}, headers=headers, timeout=5)
            status = response.status_code
            outfile = open('good_proxies.txt', 'a')
            if status is 200:
                print('good - '+line)
                outfile.write(line+'\n')
            else:
                pass
        except Exception:
            print('bad - '+line)
        outfile.close()

get_proxies()
The following should run much faster. It is probably best to do all the file writing and printing in the main thread and have the worker threads simply return results:
import requests
from bs4 import BeautifulSoup
from random import choice
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial

stop_flag = 0

def get_list(session, url):
    get_list = session.get(url).text
    soup = BeautifulSoup(get_list, 'lxml')
    return soup.find('body').get_text().strip()

def get_proxies(session, executor):
    link = 'https://api.proxyscrape.com/?request=displayproxies&proxytype=all&timeout=5000&country=all&anonymity=all&ssl=no'
    other = 'https://www.proxy-list.download/api/v1/get?type=http'
    lists = list(executor.map(partial(get_list, session), (link, other)))
    mix = lists[0] + '\n' + lists[1] + '\n'
    raw_proxies = mix.splitlines()
    with open('good_proxies.txt', 'a') as outfile:
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding": "*", "Connection": "keep-alive"}
        session.headers.update(headers)
        futures = {executor.submit(partial(check_proxy, session), proxy): proxy for proxy in raw_proxies}
        for future in as_completed(futures):
            proxy = futures[future]
            is_good = future.result()
            if is_good:
                print('good -', proxy)
                outfile.write(proxy + '\n')
            else:
                print('bad -', proxy)

def check_proxy(session, proxy):
    check = 'http://icanhazip.com'
    try:
        response = session.get(check, proxies={'http': 'http://'+proxy}, timeout=5)
        status = response.status_code
        return status == 200
    except Exception:
        return False

N_THREADS = 100

with requests.Session() as session:
    with ThreadPoolExecutor(max_workers=N_THREADS) as executor:
        get_proxies(session, executor)
Partial output:
bad - 104.40.158.173:80
bad - 47.105.149.144:3128
good - 185.220.115.150:80
bad - 138.197.157.32:8080
bad - 138.197.157.32:3128
good - 116.17.102.174:3128
good - 183.238.173.226:3128
good - 119.8.44.244:8080
good - 1.174.138.125:8080
good - 116.17.102.131:3128
good - 101.133.167.140:8888
good - 118.31.225.11:8000
good - 117.131.119.116:80
good - 101.200.127.78:80
good - 1.70.67.175:9999
good - 116.196.85.150:3128
good - 1.70.64.160:9999
bad - 102.129.249.120:3128
bad - 138.68.41.90:3128
bad - 138.68.240.218:3128
good - 47.106.239.215:3328
good - 183.162.158.172:4216
bad - 138.68.240.218:8080
good - 115.219.131.244:3000
bad - 138.68.161.14:3128
good - 185.49.107.1:8080
bad - 134.209.29.120:8080
Explanation
The speed improvements are brought about primarily by using threading and secondarily by using a Session object provided by the requests package, the main advantage of which is that if you are making several requests to the same host, the same TCP connection will be reused.
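A minimal sketch of that Session behaviour (the loop and URL here are just an illustration, not part of the answer's code): repeated GETs through one Session reuse the underlying connection to the host instead of opening a new one per request.

import requests

# Each request to the same host goes over the same kept-alive TCP connection.
with requests.Session() as session:
    for _ in range(3):
        response = session.get('http://icanhazip.com', timeout=5)
        print(response.status_code, response.text.strip())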
Python provides two thread pooling mechanisms: (1) the undocumented multiprocessing.pool.ThreadPool class, which shares the same interface as the multiprocessing.pool.Pool class used to create a pool of sub-processes, and (2) the ThreadPoolExecutor class from the concurrent.futures module, which shares the same interface as the ProcessPoolExecutor class from the same module, used for creating a pool of processes. This code uses the ThreadPoolExecutor class.
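A minimal sketch of the two APIs side by side (the square worker function is purely illustrative):

from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

# (1) the undocumented ThreadPool, sharing the multiprocessing.pool.Pool interface
with ThreadPool(4) as pool:
    print(pool.map(square, range(5)))            # [0, 1, 4, 9, 16]

# (2) ThreadPoolExecutor from concurrent.futures, used by the code above
with ThreadPoolExecutor(max_workers=4) as executor:
    print(list(executor.map(square, range(5))))  # [0, 1, 4, 9, 16]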
Threads are lightweight and relatively inexpensive to create, and a typical desktop computer can support several thousand. A given application, depending on what it is doing, may not profit, however, from creating threads beyond some maximum. And threading is only suitable for "jobs" that are not CPU-intensive, that is, jobs that relinquish the CPU frequently to allow other threads to run because they are, for example, waiting for an I/O operation or a URL get request to complete. This is because Python byte code cannot run in parallel in multiple threads: the Python interpreter acquires the Global Interpreter Lock (GIL) before executing byte code.
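A small sketch of why I/O-bound work benefits (the fake_io function below just simulates a blocking wait; it is not from the answer): ten 0.2-second waits complete in roughly 0.2 seconds with a thread pool, because a thread that is sleeping or waiting on I/O releases the GIL.

import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    time.sleep(0.2)   # stands in for waiting on a URL request
    return True

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(fake_io, range(10)))
print(f'10 waits took {time.perf_counter() - start:.2f}s')  # roughly 0.2s, not 2s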
A ThreadPoolExecutor instance (assigned to variable executor) is created specifying the number of threads to be in the pool using the max_workers parameter. Here I rather arbitrarily specified 100 threads; you could try increasing this and seeing if it improves performance. The ThreadPoolExecutor instance has two methods one can use for submitting "jobs" or "tasks" to the thread pool for execution (see the concurrent.futures documentation). The map method is similar to the builtin map function in that it returns an iterator that applies a function to every item of its iterable argument, yielding the results. The difference is that the function calls are now going to be made concurrently by submitting each call as a "job" to the thread pool. The function that helps build the raw_proxies list is get_list, and it is responsible for retrieving a single URL:
def get_list(session, url):
    get_list = session.get(url).text
    soup = BeautifulSoup(get_list, 'lxml')
    return soup.find('body').get_text().strip()
I would now like to concurrently call this function for each URL, so I would like to use the map method where the iterable argument is the list of URLs. The problem is that map will only pass a single argument to the worker function (each element of the iterable for each call), but I also want to pass the session argument. I could have assigned the session variable to a global variable, but there was another way: functools.partial(get_list, session) creates another function that, when called, behaves as if get_list were being called with its first parameter "hard-coded" to be session, and so I use this new function in the call to map:
lists = list(executor.map(partial(get_list, session), (link, other)))
I take the iterable being returned by the call to map and turn it into a list that I can later index.
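As a standalone demonstration of what partial does (the add function here is purely illustrative), the returned callable behaves like the original function with its first argument fixed:

from functools import partial

def add(a, b):
    return a + b

add_five = partial(add, 5)   # behaves like add() with a permanently set to 5
print(add_five(3))           # prints 8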
The other method one can use to submit a job to a thread pool is submit. It takes as arguments the worker function and the worker function's arguments, and it immediately returns a Future instance without waiting for the job to complete. There are various methods you can apply to this Future instance; the most important one is result, which blocks until the job has completed and returns the return value from the worker function. I could have easily used the map method again, passing raw_proxies as the iterable argument, and then iterated over the return value from the call to map. But I would be blocking on the jobs in the order in which they were submitted (i.e. the order in which they appear in the raw_proxies list). That's probably not too bad, because the program will not finish until all the "jobs" have finished anyway, but it is slightly more efficient to process the result of a job as soon as it has completed, independent of its submission order, if you don't require outputting the results in a specific order. The submit method, which returns a Future instance, provides that flexibility:
I individually submit each proxy IP as a separate job and I store the resulting Future in a dictionary as the key, with its value being the IP used to create the job. I do it all with a single statement using a dictionary comprehension:
futures = {executor.submit(partial(check_proxy, session), proxy): proxy for proxy in raw_proxies}
I then use another function provided by concurrent.futures, as_completed, to iterate through all the keys of the dictionary, i.e. the Futures, returning each Future instance as it completes, and then query the Future for the job's return value, which will be either True or False:
for future in as_completed(futures):
    proxy = futures[future]
    is_good = future.result()
    if is_good:
        print('good -', proxy)
        outfile.write(proxy + '\n')
    else:
        print('bad -', proxy)
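For comparison, a sketch of the map-based alternative mentioned above (assuming session, executor, raw_proxies, check_proxy and partial exactly as set up in the answer's code); results are yielded in submission order rather than completion order:

# Results come back in the order the proxies were submitted, so a slow early
# proxy delays printing the results of faster ones that finish sooner.
results = executor.map(partial(check_proxy, session), raw_proxies)
for proxy, is_good in zip(raw_proxies, results):
    print('good -' if is_good else 'bad -', proxy)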