pandas - Comparing 2 data files and create a new file with the missing data using Python - Stack Overflow

admin2025-04-18 26

I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.

The aim of the app is to highlight any new products using a 'SKU' column and to then generate a new file that contains all the same columns from the master file but only the 'new' product rows listed.

Filename descriptions:

"downloaded_data.csv" - this is the master file which is downloaded from a URL.

"current-data.csv" - this is old data file to compare against to find new products.

"new-products.csv" - this will be the file generated to include only the new products that appear in the master file after comparing to the current data file (currnet-data.csv). This new file should also include all the columns that appear in the master file (downloaded_data.csv).

"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.

Example columns from "downloaded_data.csv" file which should also appear in the new file.

Title	Image URL	SKU
Product Name One	imageurl.jpg	SKU123
Product Name Two	imageurl.jpg	SKU456

I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.

Filename descriptions:

"downloaded_data.csv" - this is the master file which is downloaded from a URL.

"current-data.csv" - this is old data file to compare against to find new products.

"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.

Example columns from "downloaded_data.csv" file which should also appear in the new file.

Title	Image URL	SKU
Product Name One	imageurl.jpg	SKU123
Product Name Two	imageurl.jpg	SKU456

Example columns from "current-data.csv" file. Compare the above example CSV file with this using the "SKU" and "ID" columns:

ID	Image URL	Product Name
SKU123	imageurl.jpg	Product Name One
SKU789	imageurl.jpg	Product Name Three

Example of the "new products" file (same columns from "downloaded_data.csv" but only inputting the SKUs that are missing from the lookup file "current-data.csv"):

Title	Image URL	SKU
Product Name Three	imageurl.jpg	SKU789

Below is my Python code. After running this, the "new-products.csv" file ends up just being a copy of the downloaded master file.

import pandas as pd
import requests

pd.set_option("display.max_rows", None)

# Dowload CSV file
url = "URL GOES HERE"
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Save the content of the response to a local CSV file
    with open("downloaded_data.csv", "wb") as f:
        f.write(response.content)
    print("CSV file downloaded successfully")
else:
    print("Failed to download csv file. Status code: ", response.status_code)

# Read the CSV file into a Pandas DataFrame
master_file_compare = pd.read_csv("downloaded_data.csv", usecols=[29], names=['SKU'])
account_data = pd.read_csv("account-data.csv", usecols=[5], names=['SKU'])

# Merging both dataframe using left join
comparison_result = pd.merge(master_file_compare,account_data, on='SKU', how='left', indicator=True)

# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']

comparison_result.to_csv('comparison-file.csv', encoding='utf8')

# Compare Comparison file to master and generate data file
with open('downloaded_data.csv', 'r', encoding='utf8') as in_file, open('new-products.csv', 'w', encoding='utf8') as out_file:
    for line in in_file:
        if line.split(',')[0].strip() not in comparison_result:
            out_file.write(line)

# print(account_data.head)
print(comparison_result)

To note that the current data (comparison file) will not have the same columns or even ordering of columns. All I am aiming to do is to match on the SKU columns as those are the identifiers.

Any pointers or suggestions on where to focus next would be greatly appreciated.

Thanks!

Share Improve this question edited Jan 30 at 12:54 asked Jan 29 at 11:46 Calogero Savarino 376 bronze badges

Add a comment |

2 Answers 2

Sorted by: Reset to default 0

If I understood correctly, you want to get the SKU values that are in the new csv file (let's call it v2.csv) but not in the former one (let's call it v1.csv).

You can do it in many different ways, the way I'd do it would be by converting the list of SKU values of each file to set objects and take the difference between the two to get the ones that are in set_sku_v2 but not in set_sku_v1.

Sample code:

v1_df = pd.read_csv("v1.csv")
v2_df = pd.read_csv("v2.csv")

set_sku_v1 = set(v1_df["SKU"])  # e.g. {12, 15, 19, 22}
set_sku_v2 = set(v2_df["SKU"])  # e.g. {12, 19, 31}

new_sku = list(set_sku_v2 - set_sku_v1)  # e.g. [15, 22]
new_products_df = v2_df.query(f"SKU in {new_sku}")

Using the following example:

display(downloaded_data)
Title   Image_URL   SKU
0   Product_Name_One    imageurl1.jpg   SKU123
1   Product_Name_Two    imageurl2.jpg   SKU456
2   Product_Name_Four   imageurl4.jpg   SKU457

display(current_data)
ID  Image_URL   Product_Name
0   SKU123  imageurl1.jpg   Product_Name_One
1   SKU789  imageurl3.jpg   Product_Name_Three

And adapting your original code:

# Merging both dataframe using left join
comparison_result = pd.merge(downloaded_data, current_data, 
                             left_on='SKU', right_on='ID', 
                             how='left', indicator=True)

# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']   
new_products = downloaded_data.loc[downloaded_data['SKU'].isin(comparison_result['SKU'])]

display(new_products)
    Title   Image_URL   SKU
1   Product_Name_Two    imageurl2.jpg   SKU456
2   Product_Name_Four   imageurl4.jpg   SKU457

转载请注明原文地址:http://www.anycun.com/QandA/1744968526a90176.html

pandas - Comparing 2 data files and create a new file with the missing data using Python - Stack Overflow

2 Answers 2

pandasComparing 2 data files and create a new file with the missing data using PythonStack Overflow