I am using sklearn to run a random forest. I am setting the seed for the random forest, as well as for the data split used in cross validation. When I re-run the code several times in a row, it gives me the same result. However, when I re-ran the same code after a month, I got slightly different feature importances. In some other similar analyses, the accuracy metrics differ too. The data has not changed. I am running on Google Colab.
Here is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# Configuration
file_path = '/content/drive/My Drive/dataset.csv'
columns_to_keep = [
'target_column', 'feature_a', 'feature_b', 'feature_c', 'feature_d', 'feature_e',
'feature_f', 'feature_g', 'feature_h', 'feature_i', 'feature_j', 'feature_k', 'feature_l',
'feature_m', 'feature_n', 'feature_o', 'feature_p', 'feature_q', 'feature_r', 'feature_s',
'feature_t', 'feature_u', 'feature_v', 'feature_w', 'feature_x', 'feature_y', 'feature_z',
'feature_aa', 'feature_ab', 'feature_ac', 'feature_ad', 'feature_ae', 'feature_af', 'feature_ag',
'feature_ah', 'feature_ai', 'feature_aj', 'feature_ak', 'feature_al', 'feature_am', 'feature_an'
]
df = pd.read_csv(file_path, usecols=columns_to_keep)
categorical_columns = ['feature_ak', 'feature_al', 'feature_am', 'feature_an']
label_encode_columns = ['feature_al', 'feature_ak']
df = df.dropna()
# Label-encode the specified columns
le = LabelEncoder()
for col in label_encode_columns:
    df[col] = le.fit_transform(df[col])
# Convert specified columns to categorical
for col in categorical_columns:
    df[col] = df[col].astype('category')
# Split into features and target
X = df.drop(columns=['target_column'])
y = df['target_column']
# Initialize the RandomForestClassifier model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Initialize k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Store results
feature_importances_list = []
all_y_true = []
all_y_pred = []
# Perform k-fold cross-validation
for fold_num, (train_index, test_index) in enumerate(kf.split(X), start=1):
    # Split the data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Train the Random Forest model
    rf_model.fit(X_train, y_train)
    # Predict on the test set
    y_pred = rf_model.predict(X_test)
    # Collect all true and predicted labels
    all_y_true.extend(y_test)
    all_y_pred.extend(y_pred)
    # Get feature importances for this fold
    feature_importances_list.append(rf_model.feature_importances_)
    # Calculate and print the accuracy for this fold
    accuracy_fold = accuracy_score(y_test, y_pred)
    print(f"Fold {fold_num} Accuracy: {accuracy_fold:.4f}")
# Calculate accuracy across all predictions
accuracy_cv = accuracy_score(all_y_true, all_y_pred)
# Generate a classification report
final_report = classification_report(all_y_true, all_y_pred, digits=3)
# Average feature importances across folds
average_importance = sum(feature_importances_list) / len(feature_importances_list)
# Create a DataFrame with feature names and their corresponding average importance
feature_names = X.columns
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': average_importance
}).sort_values(by='Importance', ascending=False)
# Print Results
print(f"Overall Accuracy of Random Forest model with k-fold CV: {accuracy_cv:.4f}")
print("\nFinal Classification Report:")
print(final_report)
print("\nRandom Forest Feature Importances (averaged across folds):")
print(importance_df.head(20))
Even with the random state set, results can still shift when anything in the process changes slightly. This is usually a result of changes in the software environment: Colab periodically updates its preinstalled packages, so the scikit-learn or numpy version you get today may not be the one you had a month ago. There is nothing, per se, wrong with your code; the best thing you can do is lock everything down, from the dataset to the package versions, so that nothing changes between runs.
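One way to confirm this is to record the library versions (and a checksum of the data) alongside your results every time you run the notebook; if they differ between the two runs, the environment is the culprit. A minimal sketch in plain Python, reusing the file_path from your question (the md5 check is just one way to verify the file is byte-for-byte unchanged):

import sys
import hashlib
import numpy as np
import pandas as pd
import sklearn

# Record the environment that produced this run's results
print("Python:      ", sys.version.split()[0])
print("numpy:       ", np.__version__)
print("pandas:      ", pd.__version__)
print("scikit-learn:", sklearn.__version__)

# Verify the dataset really is unchanged between runs
with open(file_path, 'rb') as f:
    print("dataset md5: ", hashlib.md5(f.read()).hexdigest())

Once you know which versions produced the original results, you can pin them at the top of the notebook, e.g. !pip install scikit-learn==<version> numpy==<version> with the versions you recorded, so that a rerun a month later uses the same builds instead of whatever Colab currently ships.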
There have been some discussions on how this can be prevented; I am linking them here so you can explore other options as well: