python - Why do the HTML elements all have the same class and sub-class but different information? - Stack Overflow


I am trying to scrape the house type and EPC rating from the website.

However, after inspecting the HTML, I noticed that the house type (e.g. "freehold") and the EPC rating (e.g. "D") all have the same class name and CSS selector.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import pandas as pd

# Initialize WebDriver
driver = webdriver.Chrome()

# Open URL
url = "https://www.zoopla.co.uk/house-prices/england/?new_homes=include&q=england+&orig_q=united+kingdom&view_type=list&pn=1"
driver.get(url)

# Wait for the main content to load (adjust time as needed)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "_17smgnt0"))
)

# Initialize result list to store data
result = []

# Find all house elements
houses = driver.find_elements(By.CLASS_NAME, "_1hzil3o0")

# Extract the fields for each house
for house in houses:
    try:
        item = {
            "address": house.find_element(By.XPATH, './/a/h2').text,
            "DateLast_sold": house.find_element(By.CSS_SELECTOR, "._1hzil3o9._1hzil3o8._194zg6t7").text,
            "Number of Rooms": house.find_element(By.CLASS_NAME, "_1pbf8i53").text,
            "EPC Rating": house.find_element(By.CLASS_NAME, "_14bi3x30").text,
        }

        result.append(item)  # Append to the result list
    except Exception as e:
        print(f"Error extracting house details: {e}")

# Store the result into a dataframe after the loop
df = pd.DataFrame(result)

# Show the result
print(df)

# Close the driver
driver.quit()

Here is a picture of the HTML. How can I extract the freehold and EPC rating so that each field shows the right information?

asked Feb 4 at 15:01, edited Feb 4 at 16:20 – Chioma Okoroafor
  • A picture of some HTML may not suffice. How you target the element(s) depends on how you uniquely identify the element(s) you want to target. Given the entirety of the HTML structure you're using, by what logic would you determine which element(s) you want to read? This logic can include ancestor elements, position in the DOM relative to other elements, etc. – David Commented Feb 4 at 15:10
  • Right-click on that element and, under the Copy option, look at "Copy JS path" as a place to start. – JonSG Commented Feb 4 at 15:32
  • @JonSG, I did that but it's still the same thing. It only returns "freehold" and ignores the EPC rating. I copied the "JS path" of the EPC rating. – Chioma Okoroafor Commented Feb 4 at 16:05
  • If you just try to select one div with that class, you'll get the first one that's found in the DOM. You need to select all div elements with that class. Repeated use of a class name is perfectly legal and is typically used to ensure identical CSS styling. – Adon Bilivit Commented Feb 4 at 16:52
  • @AdonBilivit You have been super helpful in answering my questions, and I really appreciate it. I am just a newbie to web scraping and HTML documents. If you don't mind, can you show me sample code to extract the "freehold" and also the "EPC rating", e.g. "D"? – Chioma Okoroafor Commented Feb 4 at 17:20

1 Answer

They all have the same classes because they are all styled the same, which is why they look the same on the page. I looked through the HTML as well and I don't see anything that indicates which is which.

I would grab the list of button-styled info tags for each house and loop through it, looking for known text. The first one seems to always be the style. The second is generally the size in sqm, which you can identify by checking whether the string contains "sqm". The final one is the EPC rating, which you can identify by checking whether the string contains "EPC rating". That should get you what you need.
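As a minimal sketch of that loop, the function below sorts the tag texts by what they contain. The name `classify_tags`, the dictionary keys, and the sample strings are made up for illustration, and the "EPC rating: D" text format is an assumption, since I can only go by the picture. The class name `_14bi3x30` in the trailing comment is the one from the question's code.

```python
import re


def classify_tags(texts):
    """Sort the text of identically styled tags into named fields.

    `texts` is a list of strings, one per tag, in page order.
    """
    info = {"House type": None, "Floor area": None, "EPC Rating": None}
    for raw in texts:
        text = raw.strip()
        if not text:
            continue  # skip empty tags
        lowered = text.lower()
        if "epc rating" in lowered:
            # Assumed format such as "EPC rating: D"; keep only the grade.
            grade = re.sub(r"epc rating", "", text, flags=re.IGNORECASE)
            info["EPC Rating"] = grade.strip(" :")
        elif "sqm" in lowered:
            info["Floor area"] = text
        elif info["House type"] is None:
            # The first tag that is neither area nor EPC is taken as the type.
            info["House type"] = text
    return info


# Hypothetical tag texts standing in for one listing:
print(classify_tags(["Freehold", "93 sqm", "EPC rating: D"]))

# With Selenium, inside the existing `for house in houses:` loop, the texts
# could be collected with find_elements (plural), which returns every match:
#     tags = house.find_elements(By.CLASS_NAME, "_14bi3x30")
#     item.update(classify_tags([tag.text for tag in tags]))
```

If you use it this way, drop the "EPC Rating" entry from the `item` dictionary in the question, since `find_element` (singular) only ever returns the first matching tag. A listing that is missing one of the tags simply leaves that field as `None` instead of raising an exception.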
