I have developed a parsing project for a big marketplace and I need to make 350 requests per second to cover the need for 1,000,000 requests per hour, 24/7. Right now I can't break through the 40 requests per second mark :(
I use 1,000 proxies and several million different headers with user agents and other parameters. There are no problems with blocking yet. In the future, I plan to use about 5,000 proxies.
I also use the following settings for my Scrapy spider.
BOT_NAME = "scrapy_parser"
DOWNLOADER_MIDDLEWARES = {
"amazon_scrapy_parser.middlewares.Handle503Middleware": 700,
"rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
"rotating_proxies.middlewares.BanDetectionMiddleware": 620,
"scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
"scrapy_selenium.SeleniumMiddleware": None,
"scrapy_playwright.middleware.PlaywrightMiddleware": None,
}
SPIDER_MODULES = ["amazon_scrapy_parser.spiders"]
NEWSPIDER_MODULE = "amazon_scrapy_parser.spiders"
ROTATING_PROXY_CLOSE_SPIDER = False
RETRY_TIMES = 6
ROTATING_PROXY_BACKOFF_BASE = 2
ROTATING_PROXY_BACKOFF_CAP = 4
ROTATING_PROXY_PAGE_RETRY_TIMES = 2
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1000
CONCURRENT_REQUESTS_PER_DOMAIN = 1000
CONCURRENT_ITEMS = 1000
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = False
DNSCACHE_ENABLED = False
FEED_EXPORT_ENCODING = "utf-8"
LOG_ENABLED = True
LOG_LEVEL = "INFO" # Рівень логування (DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOG_FILE = 'logs/scrapy_log.log'
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ITEM_PIPELINES = {
    'amazon_scrapy_parser.pipelines.FilePipeline': 300,
}
DNS_RESOLVER = 'scrapy.resolver.CachingHostnameResolver'
REACTOR_THREADPOOL_MAXSIZE = 300
My main question is: does something need to be changed or added here to get faster than 40 requests per second? My network has a bandwidth of up to 1000 Mb/s, but during parsing the traffic speed constantly fluctuates between 50 Mb/s and 400 Mb/s (the average stays around 200 Mb/s). What am I doing wrong? Has anyone already run into this problem? All the settings seem to allow more than 1000 requests per second, yet I only get 40.
I tried changing the CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, and CONCURRENT_ITEMS parameters, but this does not help. I also use a lot of threading and launch spiders in separate processes, but the number of requests per second still does not change...
@staticmethod
def _run_spider(proxy_list, urls_sublist, headers, settings, monitoring_params):
    """Run the Scrapy spider."""
    settings.set("ROTATING_PROXY_LIST", proxy_list, priority="cmdline")
    process = CrawlerProcess(settings)
    process.crawl(
        AmazonSpider,
        urls=urls_sublist,
        headers=headers,
        **monitoring_params,
    )
    process.start()

def run(self):
    """Main execution logic."""
    keywords = self._load_keywords()
    headers = self._load_headers()
    links_to_serp = [
        self._get_url(keyword.strip())
        for keyword in keywords[:1000]
        if keyword != "" and keyword != " " and not keyword.startswith("&")
    ]
    self.proxy_sublist = self._split_list(self._load_proxies(), self.num_threads)
    failed_links = self._load_bad_response_links()
    if not failed_links:
        print("Start 200.")
        links_to_serp_sublists = self._split_list(links_to_serp, self.num_threads)
    else:
        print("Fix 503.")
        links_to_serp_sublists = self._split_list(failed_links, self.num_threads)
        self._clear_bad_response_file()
    with multiprocessing.Pool(processes=self.num_threads) as pool:
        pool.starmap(
            AmazonScraper._run_spider,
            zip(
                self.proxy_sublist,
                links_to_serp_sublists,
                [headers] * self.num_threads,
                [self.settings] * self.num_threads,
                [self.monitoring_params] * self.num_threads,
            ),
        )

if __name__ == "__main__":
    scraper = Scraper(country="US", num_threads=NUMBER_OF_THREADS)
    scraper.run()
Let's discuss this in a purely technical way, morals aside.
First, your question is essentially not about the speed of sending requests, but about the speed of crawling. The average speed is limited by the most time-consuming step. For example, if you can send requests at 1000/sec but can only parse 5 responses per second, crawling 1000 pages will take more than 200 seconds, and the average will be less than 5 requests per second.
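To make that arithmetic concrete, here is a tiny illustration (the 1000/sec and 5/sec figures are the hypothetical numbers above, not measurements from your project):

pages = 1000
download_rate = 1000  # responses/sec the downloader could fetch (hypothetical)
parse_rate = 5        # responses/sec the parsing stage can handle (hypothetical)

# The slowest stage dictates the overall throughput.
effective_rate = min(download_rate, parse_rate)
total_seconds = pages / effective_rate
print(f"{total_seconds:.0f} s total, {pages / total_seconds:.1f} requests/sec on average")
# -> 200 s total, 5.0 requests/sec on average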
It will be more efficient to split the whole crawling process into two scripts rather than doing everything inside a single Scrapy spider: one sends the requests and stores the responses, the other parses the responses to extract the items you want (plus further steps if you like, such as writing into a DB).
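A minimal sketch of that split, assuming raw HTML is stored on disk; AmazonFetchSpider, the raw/ directory and the CSS selector are illustrative names, not parts of your project:

# fetch_spider.py: download only, persist the raw body, no parsing in the spider.
import hashlib
from pathlib import Path

import scrapy


class AmazonFetchSpider(scrapy.Spider):
    name = "amazon_fetch"

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls or []
        Path("raw").mkdir(exist_ok=True)

    def parse(self, response):
        # Write the body to disk and return immediately, so slow parsing
        # can never throttle the downloader.
        name = hashlib.md5(response.url.encode()).hexdigest()
        Path("raw", f"{name}.html").write_bytes(response.body)


# parse_worker.py: parse the stored HTML in a separate pool of processes.
from multiprocessing import Pool

from parsel import Selector


def parse_file(path):
    sel = Selector(text=path.read_text(encoding="utf-8", errors="ignore"))
    # Replace the selector below with the fields you actually need.
    return {"file": path.name, "title": sel.css("title::text").get()}


if __name__ == "__main__":
    with Pool() as pool:
        items = pool.map(parse_file, list(Path("raw").glob("*.html")))

This way the downloader and the parser scale independently, and a slow pipeline no longer caps the request rate.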
Second, you'd better use a distributed crawler like scrapy-redis instead of plain scrapy.
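For reference, the basic scrapy-redis wiring looks roughly like this (it assumes a Redis instance at localhost:6379; the spider name and redis_key are illustrative):

# settings.py additions for scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the request queue between runs
REDIS_URL = "redis://localhost:6379"

# Spider fed from Redis instead of a fixed URL list; start as many
# processes or machines as you need, they all share the same queue.
from scrapy_redis.spiders import RedisSpider

class AmazonSerpSpider(RedisSpider):
    name = "amazon_serp"
    redis_key = "amazon:start_urls"   # LPUSH your SERP URLs into this Redis list

    def parse(self, response):
        yield {"url": response.url, "status": response.status}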
Third, let's talk about your settings.py (a consolidated sketch of the suggested changes follows the list below).
× RETRY_TIMES: I suggest setting it lower, even 0, in your case.
〇 CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN raise the MAXIMUM number of concurrent (i.e. simultaneous) requests.
〇 CONCURRENT_ITEMS speeds things up by processing the items from every response in parallel.
〇 DOWNLOAD_DELAY = 0 keeps the downloader busy XD, and together with AUTOTHROTTLE_ENABLED = False it keeps requests going out as fast as possible.
× DNSCACHE_ENABLED: I suggest setting it to True if your only target is Amazon, since an in-memory DNS cache is usually faster.
× LOG_LEVEL set to ERROR or CRITICAL would reduce log messages and make the crawl faster, or even set LOG_ENABLED to False.
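Putting those points together, the changed lines in settings.py would look something like this (the values are suggestions to benchmark against your own workload, not guaranteed optima):

# Suggested adjustments; keep the rest of your settings as they are.
RETRY_TIMES = 0            # don't burn time on retries, re-queue failures instead
DNSCACHE_ENABLED = True    # single target domain, so the in-memory DNS cache helps
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = False
LOG_LEVEL = "ERROR"        # or LOG_ENABLED = False once the spider is stable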
Fourth, keep your CPU and memory from becoming fully saturated.
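A quick way to check whether CPU or memory is the ceiling is to sample both while the crawl runs, for example with psutil (assuming it is installed):

# resource_check.py: print CPU and RAM usage once per second (requires psutil).
import psutil

while True:
    cpu = psutil.cpu_percent(interval=1)   # blocks for ~1 s, returns CPU usage in %
    mem = psutil.virtual_memory().percent  # share of RAM currently in use
    print(f"CPU {cpu:5.1f}%  MEM {mem:5.1f}%")
    if cpu > 90 or mem > 90:
        print("-> saturated: lower per-process concurrency or add machines")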