I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.
The aim of the app is to highlight any new products using a 'SKU' column and to then generate a new file that contains all the same columns from the master file but only the 'new' product rows listed.
Filename descriptions:
"downloaded_data.csv" - this is the master file which is downloaded from a URL.
"current-data.csv" - this is old data file to compare against to find new products.
"new-products.csv" - this will be the file generated to include only the new products that appear in the master file after comparing to the current data file (currnet-data.csv). This new file should also include all the columns that appear in the master file (downloaded_data.csv).
"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.
Example columns from "downloaded_data.csv" file which should also appear in the new file.
Title | Image URL | SKU |
---|---|---|
Product Name One | imageurl.jpg | SKU123 |
Product Name Two | imageurl.jpg | SKU456 |
I am fairly new to Python and I am trying to build an app where I take a master file that contains product information and compares it to an older data file with the same data.
The aim of the app is to highlight any new products using a 'SKU' column and to then generate a new file that contains all the same columns from the master file but only the 'new' product rows listed.
Filename descriptions:
"downloaded_data.csv" - this is the master file which is downloaded from a URL.
"current-data.csv" - this is old data file to compare against to find new products.
"new-products.csv" - this will be the file generated to include only the new products that appear in the master file after comparing to the current data file (currnet-data.csv). This new file should also include all the columns that appear in the master file (downloaded_data.csv).
"comparison-file.csv" - this is a file that simply outputs the list of SKUs that are considered new.
Example columns from "downloaded_data.csv" file which should also appear in the new file.
Title | Image URL | SKU |
---|---|---|
Product Name One | imageurl.jpg | SKU123 |
Product Name Two | imageurl.jpg | SKU456 |
Example columns from "current-data.csv" file. Compare the above example CSV file with this using the "SKU" and "ID" columns:
ID | Image URL | Product Name |
---|---|---|
SKU123 | imageurl.jpg | Product Name One |
SKU789 | imageurl.jpg | Product Name Three |
Example of the "new products" file (same columns from "downloaded_data.csv" but only inputting the SKUs that are missing from the lookup file "current-data.csv"):
Title | Image URL | SKU |
---|---|---|
Product Name Three | imageurl.jpg | SKU789 |
Below is my Python code. After running this, the "new-products.csv" file ends up just being a copy of the downloaded master file.
import pandas as pd
import requests
pd.set_option("display.max_rows", None)
# Dowload CSV file
url = "URL GOES HERE"
response = requests.get(url)
# Check if the request was successful (status code 200)
if response.status_code == 200:
# Save the content of the response to a local CSV file
with open("downloaded_data.csv", "wb") as f:
f.write(response.content)
print("CSV file downloaded successfully")
else:
print("Failed to download csv file. Status code: ", response.status_code)
# Read the CSV file into a Pandas DataFrame
master_file_compare = pd.read_csv("downloaded_data.csv", usecols=[29], names=['SKU'])
account_data = pd.read_csv("account-data.csv", usecols=[5], names=['SKU'])
# Merging both dataframe using left join
comparison_result = pd.merge(master_file_compare,account_data, on='SKU', how='left', indicator=True)
# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']
comparison_result.to_csv('comparison-file.csv', encoding='utf8')
# Compare Comparison file to master and generate data file
with open('downloaded_data.csv', 'r', encoding='utf8') as in_file, open('new-products.csv', 'w', encoding='utf8') as out_file:
for line in in_file:
if line.split(',')[0].strip() not in comparison_result:
out_file.write(line)
# print(account_data.head)
print(comparison_result)
To note that the current data (comparison file) will not have the same columns or even ordering of columns. All I am aiming to do is to match on the SKU columns as those are the identifiers.
Any pointers or suggestions on where to focus next would be greatly appreciated.
Thanks!
If I understood correctly, you want to get the SKU
values that are in the new csv
file (let's call it v2.csv
) but not in the former one (let's call it v1.csv
).
You can do it in many different ways, the way I'd do it would be by converting the list of SKU
values of each file to set
objects and take the difference between the two to get the ones that are in set_sku_v2
but not in set_sku_v1
.
Sample code:
v1_df = pd.read_csv("v1.csv")
v2_df = pd.read_csv("v2.csv")
set_sku_v1 = set(v1_df["SKU"]) # e.g. {12, 15, 19, 22}
set_sku_v2 = set(v2_df["SKU"]) # e.g. {12, 19, 31}
new_sku = list(set_sku_v2 - set_sku_v1) # e.g. [15, 22]
new_products_df = v2_df.query(f"SKU in {new_sku}")
Using the following example:
display(downloaded_data)
Title Image_URL SKU
0 Product_Name_One imageurl1.jpg SKU123
1 Product_Name_Two imageurl2.jpg SKU456
2 Product_Name_Four imageurl4.jpg SKU457
display(current_data)
ID Image_URL Product_Name
0 SKU123 imageurl1.jpg Product_Name_One
1 SKU789 imageurl3.jpg Product_Name_Three
And adapting your original code:
# Merging both dataframe using left join
comparison_result = pd.merge(downloaded_data, current_data,
left_on='SKU', right_on='ID',
how='left', indicator=True)
# Filtering only the rows that are available in left (master_file_compare)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']
new_products = downloaded_data.loc[downloaded_data['SKU'].isin(comparison_result['SKU'])]
display(new_products)
Title Image_URL SKU
1 Product_Name_Two imageurl2.jpg SKU456
2 Product_Name_Four imageurl4.jpg SKU457