I have developed a parsing project for a big marketplace and I need to make 350 requests per second to cover the need for 1,000,000 requests per hour, 24/7. Right now I can't break through the 40 requests per second mark :(
I use 1,000 proxies and several million different headers with user agents and other parameters. There are no problems with blocking yet. In the future, I plan to use about 5,000 proxies.
I also use the following settings for my Scrapy spider.
BOT_NAME = "scrapy_parser"
DOWNLOADER_MIDDLEWARES = {
"amazon_scrapy_parser.middlewares.Handle503Middleware": 700,
"rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
"rotating_proxies.middlewares.BanDetectionMiddleware": 620,
"scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
"scrapy_selenium.SeleniumMiddleware": None,
"scrapy_playwright.middleware.PlaywrightMiddleware": None,
}
SPIDER_MODULES = ["amazon_scrapy_parser.spiders"]
NEWSPIDER_MODULE = "amazon_scrapy_parser.spiders"
ROTATING_PROXY_CLOSE_SPIDER = False
RETRY_TIMES = 6
ROTATING_PROXY_BACKOFF_BASE = 2
ROTATING_PROXY_BACKOFF_CAP = 4
ROTATING_PROXY_PAGE_RETRY_TIMES = 2
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1000
CONCURRENT_REQUESTS_PER_DOMAIN = 1000
CONCURRENT_ITEMS = 1000
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = False
DNSCACHE_ENABLED = False
FEED_EXPORT_ENCODING = "utf-8"
LOG_ENABLED = True
LOG_LEVEL = "INFO" # Рівень логування (DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOG_FILE = 'logs/scrapy_log.log'
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ITEM_PIPELINES = {
    'amazon_scrapy_parser.pipelines.FilePipeline': 300,
}
DNS_RESOLVER = 'scrapy.resolver.CachingHostnameResolver'
REACTOR_THREADPOOL_MAXSIZE = 300
My main question is: does something need to be changed or added here to get faster than 40 requests per second? My network has a bandwidth of up to 1000 Mb/s, but during parsing the traffic speed constantly fluctuates between 50 Mb/s and 400 Mb/s (the average stays around 200 Mb/s). What am I doing wrong? Has anyone already run into this problem? All the settings seem to allow more than 1000 requests per second, yet I only get 40.
I tried changing the CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, and CONCURRENT_ITEMS parameters, but this does not help. I also use a lot of threading and launch spiders in separate processes, but the number of requests per second still does not change...
@staticmethod
def _run_spider(proxy_list, urls_sublist, headers, settings, monitoring_params):
    """Run the Scrapy spider."""
    settings.set("ROTATING_PROXY_LIST", proxy_list, priority="cmdline")
    process = CrawlerProcess(settings)
    process.crawl(
        AmazonSpider,
        urls=urls_sublist,
        headers=headers,
        **monitoring_params,
    )
    process.start()

def run(self):
    """Main execution logic."""
    keywords = self._load_keywords()
    headers = self._load_headers()
    links_to_serp = [
        self._get_url(keyword.strip())
        for keyword in keywords[:1000]
        if keyword != "" and keyword != " " and not keyword.startswith("&")
    ]
    self.proxy_sublist = self._split_list(self._load_proxies(), self.num_threads)
    failed_links = self._load_bad_response_links()
    if not failed_links:
        print("Start 200.")
        links_to_serp_sublists = self._split_list(links_to_serp, self.num_threads)
    else:
        print("Fix 503.")
        links_to_serp_sublists = self._split_list(failed_links, self.num_threads)
        self._clear_bad_response_file()
    with multiprocessing.Pool(processes=self.num_threads) as pool:
        pool.starmap(
            AmazonScraper._run_spider,
            zip(
                self.proxy_sublist,
                links_to_serp_sublists,
                [headers] * self.num_threads,
                [self.settings] * self.num_threads,
                [self.monitoring_params] * self.num_threads,
            ),
        )

if __name__ == "__main__":
    scraper = Scraper(country="US", num_threads=NUMBER_OF_THREADS)
    scraper.run()
Let's discuss this in a purely technical way, morals aside.
First, your question is essentially not about the speed of sending requests, but about the speed of crawling. The average speed is limited by the most time-consuming step. For example, if you can send requests at 1000/sec but can only parse 5 responses per second, crawling 1000 pages will take more than 200 seconds, and the average will be less than 5 requests per second.
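To make that arithmetic concrete, here is a tiny illustration (the 1000/sec and 5/sec figures are the hypothetical numbers above, not measurements from your project):

pages = 1000
download_rate = 1000  # responses/sec the downloader could fetch (hypothetical)
parse_rate = 5        # responses/sec the parsing stage can handle (hypothetical)

# The slowest stage dictates the overall throughput.
effective_rate = min(download_rate, parse_rate)
total_seconds = pages / effective_rate
print(f"{total_seconds:.0f} s total, {pages / total_seconds:.1f} requests/sec on average")
# -> 200 s total, 5.0 requests/sec on average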
It will be more efficient to split the whole crawling process into two scripts rather than doing everything inside a single Scrapy spider: one sends the requests and stores the responses, the other parses the responses to extract the items you want (plus further steps if you like, such as writing into a DB).
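A minimal sketch of that split, assuming raw HTML is stored on disk; AmazonFetchSpider, the raw/ directory and the CSS selector are illustrative names, not parts of your project:

# fetch_spider.py: download only, persist the raw body, no parsing in the spider.
import hashlib
from pathlib import Path

import scrapy


class AmazonFetchSpider(scrapy.Spider):
    name = "amazon_fetch"

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls or []
        Path("raw").mkdir(exist_ok=True)

    def parse(self, response):
        # Write the body to disk and return immediately, so slow parsing
        # can never throttle the downloader.
        name = hashlib.md5(response.url.encode()).hexdigest()
        Path("raw", f"{name}.html").write_bytes(response.body)


# parse_worker.py: parse the stored HTML in a separate pool of processes.
from multiprocessing import Pool

from parsel import Selector


def parse_file(path):
    sel = Selector(text=path.read_text(encoding="utf-8", errors="ignore"))
    # Replace the selector below with the fields you actually need.
    return {"file": path.name, "title": sel.css("title::text").get()}


if __name__ == "__main__":
    with Pool() as pool:
        items = pool.map(parse_file, list(Path("raw").glob("*.html")))

This way the downloader and the parser scale independently, and a slow pipeline no longer caps the request rate.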
Second, you'd better use a distributed crawler like scrapy-redis instead of plain scrapy.
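For reference, the basic scrapy-redis wiring looks roughly like this (it assumes a Redis instance at localhost:6379; the spider name and redis_key are illustrative):

# settings.py additions for scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the request queue between runs
REDIS_URL = "redis://localhost:6379"

# Spider fed from Redis instead of a fixed URL list; start as many
# processes or machines as you need, they all share the same queue.
from scrapy_redis.spiders import RedisSpider

class AmazonSerpSpider(RedisSpider):
    name = "amazon_serp"
    redis_key = "amazon:start_urls"   # LPUSH your SERP URLs into this Redis list

    def parse(self, response):
        yield {"url": response.url, "status": response.status}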
Third, let's talk about your settings.py (a consolidated sketch of the suggested changes follows the list below).
× RETRY_TIMES: I suggest setting it lower, even 0, in your case.
〇 CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN raise the MAXIMUM number of concurrent (i.e. simultaneous) requests.
〇 CONCURRENT_ITEMS speeds things up by processing the items from every response in parallel.
〇 DOWNLOAD_DELAY = 0 keeps the downloader busy XD, and together with AUTOTHROTTLE_ENABLED = False it keeps requests going out as fast as possible.
× DNSCACHE_ENABLED: I suggest setting it to True if your only target is Amazon, since an in-memory DNS cache is usually faster.
× LOG_LEVEL set to ERROR or CRITICAL would reduce log messages and make the crawl faster, or even set LOG_ENABLED to False.
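Putting those points together, the changed lines in settings.py would look something like this (the values are suggestions to benchmark against your own workload, not guaranteed optima):

# Suggested adjustments; keep the rest of your settings as they are.
RETRY_TIMES = 0            # don't burn time on retries, re-queue failures instead
DNSCACHE_ENABLED = True    # single target domain, so the in-memory DNS cache helps
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = False
LOG_LEVEL = "ERROR"        # or LOG_ENABLED = False once the spider is stable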
Fourth, keep your CPU and memory from becoming fully saturated.
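A quick way to check whether CPU or memory is the ceiling is to sample both while the crawl runs, for example with psutil (assuming it is installed):

# resource_check.py: print CPU and RAM usage once per second (requires psutil).
import psutil

while True:
    cpu = psutil.cpu_percent(interval=1)   # blocks for ~1 s, returns CPU usage in %
    mem = psutil.virtual_memory().percent  # share of RAM currently in use
    print(f"CPU {cpu:5.1f}%  MEM {mem:5.1f}%")
    if cpu > 90 or mem > 90:
        print("-> saturated: lower per-process concurrency or add machines")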