web scraping - Scraping images from Danbooru with BeautifulSoup fails with SSLEOFError: EOF occurred in violation of protocol (_ssl.c:1002)


I'm trying to make an image scraper for Danbooru. I made a version using the Selenium WebDriver and it works fine, but for a large dataset it takes too much time.

So I wanted to use bs4 (BeautifulSoup) instead, but I'm getting this error with the second version:

Error processing get_images_srcs: HTTPSConnectionPool(host='danbooru.donmai.us', port=443): Max retries exceeded with url: /posts/'my url' (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1002)')))

The 'my url' part works fine if I try it in a browser, so the problem is not on my connection's side.

Here is my simple function:

def get_image_src(self, post_id):
    image_src = []  
    search_url = f"{self.base_url}/posts/{post_id}?q={quote(self.tag)}"
    try:
        response = self.session.get(search_url) # self.session is initialized already
        if response.status_code != 200:
            print(f"{response.status_code} is the status code : not 200")
            return image_src 
        soup = BeautifulSoup(response.text, 'html.parser')
        image = soup.find("img", class_="fit-width")
        if image:
            image_src.append(image.get("src"))
    except Exception as e:
        print(f"Error processing get_images_srcs {post_id}: {str(e)}")
    return image_src

This is my session initialization function:

def _make_session(self):
    session = requests.Session()
    adapter = HTTPAdapter(
        pool_connections=25,
        pool_maxsize=25,
        max_retries=Retry(
            total=4,
            backoff_factor=1, 
            status_forcelist=[443,503,504] 
        )
    )
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': 'https://danbooru.donmai.us'
    })
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

asked Jan 2 at 16:32 by Acno_Sama; edited Jan 2 at 17:03 by JeffC

1 Answer


The error you're getting usually means that the server abruptly closed the SSL/TLS connection before it finished the handshake or data transfer.

In your case it most likely means you're making many rapid, repeated requests, and as a result you're getting rate-limited or blocked on the server side.

What can you do about it?

  • slow down and/or add delays. You're after a large dataset, so you may still get the job done if you introduce sleeps between requests or limit concurrency.
  • make sure your headers are correct (in your example the Chrome version is outdated; the latest release is 131).
  • catch transient SSL errors with try/except and retry after a pause instead of failing outright.
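The first and third points can be sketched as a small helper that pauses between requests and retries on transient SSL/connection errors. This is a minimal sketch; the names `polite_get` and `backoff_delays` are hypothetical helpers, not part of requests:

```python
import time
import random
import requests

def backoff_delays(retries, factor):
    """Exponential backoff schedule: factor * 2**attempt seconds per retry."""
    return [factor * (2 ** attempt) for attempt in range(retries)]

def polite_get(session, url, delay=1.0, retries=4, factor=2.0):
    """Fetch url, sleeping before every attempt to avoid rate limiting,
    and retrying on transient SSL/connection errors with growing backoff."""
    last_exc = None
    for backoff in [0.0] + backoff_delays(retries, factor):
        # base delay + backoff + a little jitter so requests don't look robotic
        time.sleep(delay + backoff + random.uniform(0, 0.5))
        try:
            return session.get(url, timeout=30)
        except (requests.exceptions.SSLError,
                requests.exceptions.ConnectionError) as exc:
            last_exc = exc
    raise last_exc
```

With the defaults above, a request is retried up to 4 times with waits of roughly 1s, 3s, 5s, 9s, and 17s between attempts before giving up.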

Having said that, you could adjust your session initialization code:

adapter = HTTPAdapter(
    pool_connections=5,
    pool_maxsize=5,
    max_retries=Retry(
        total=5,
        backoff_factor=2,  # increase the delay between retries
        # 443 is a TCP port, not an HTTP status, so it doesn't belong here;
        # 429 (Too Many Requests) is the status that signals rate limiting
        status_forcelist=[429, 503, 504]
    )
)
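Putting it together, a full session setup along those lines might look like this. It's a sketch, not the only correct configuration; `allowed_methods` and `respect_retry_after_header` require urllib3 >= 1.26:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Build a requests.Session with conservative retry settings."""
    session = requests.Session()
    retry = Retry(
        total=5,
        backoff_factor=2,                 # sleeps grow exponentially between retries
        status_forcelist=[429, 503, 504], # HTTP statuses worth retrying
        allowed_methods=["GET"],          # only retry idempotent GETs
        respect_retry_after_header=True,  # honor the server's Retry-After header
    )
    adapter = HTTPAdapter(pool_connections=5, pool_maxsize=5, max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/131.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;"
                  "q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://danbooru.donmai.us",
    })
    return session
```

Note that `Retry` only resends requests that failed after reaching the server; the sleep-between-requests helper above is still what keeps you under the rate limit in the first place.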