I am using sklearn to run a random forest. I am setting the seed for the random forest, as well as for the data split used in cross validation. When I re-run the code several times in a row, it gives me the same result. However, when I re-ran the same code after a month, I got slightly different feature importances. In some other similar analyses, the accuracy metrics differ too. The data has not changed. I am running on Google Colab.
Here is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# Configuration
file_path = '/content/drive/My Drive/dataset.csv'
columns_to_keep = [
'target_column', 'feature_a', 'feature_b', 'feature_c', 'feature_d', 'feature_e',
'feature_f', 'feature_g', 'feature_h', 'feature_i', 'feature_j', 'feature_k', 'feature_l',
'feature_m', 'feature_n', 'feature_o', 'feature_p', 'feature_q', 'feature_r', 'feature_s',
'feature_t', 'feature_u', 'feature_v', 'feature_w', 'feature_x', 'feature_y', 'feature_z',
'feature_aa', 'feature_ab', 'feature_ac', 'feature_ad', 'feature_ae', 'feature_af', 'feature_ag',
'feature_ah', 'feature_ai', 'feature_aj', 'feature_ak', 'feature_al', 'feature_am', 'feature_an'
]
df = pd.read_csv(file_path, usecols=columns_to_keep)
categorical_columns = ['feature_ak', 'feature_al', 'feature_am', 'feature_an']
label_encode_columns = ['feature_al', 'feature_ak']
df = df.dropna()
# Label-encode the specified columns
le = LabelEncoder()
for col in label_encode_columns:
    df[col] = le.fit_transform(df[col])
# Convert specified columns to categorical
for col in categorical_columns:
    df[col] = df[col].astype('category')
# Split into features and target
X = df.drop(columns=['target_column'])
y = df['target_column']
# Initialize the RandomForestClassifier model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Initialize k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Store results
feature_importances_list = []
all_y_true = []
all_y_pred = []
# Perform k-fold cross-validation
for fold_num, (train_index, test_index) in enumerate(kf.split(X), start=1):
    # Split the data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Train the Random Forest model
    rf_model.fit(X_train, y_train)
    # Predict on the test set
    y_pred = rf_model.predict(X_test)
    # Collect all true and predicted labels
    all_y_true.extend(y_test)
    all_y_pred.extend(y_pred)
    # Get feature importances for this fold
    feature_importances_list.append(rf_model.feature_importances_)
    # Calculate and print the accuracy for this fold
    accuracy_fold = accuracy_score(y_test, y_pred)
    print(f"Fold {fold_num} Accuracy: {accuracy_fold:.4f}")
# Calculate accuracy across all predictions
accuracy_cv = accuracy_score(all_y_true, all_y_pred)
# Generate a classification report
final_report = classification_report(all_y_true, all_y_pred, digits=3)
# Average feature importances across folds
average_importance = sum(feature_importances_list) / len(feature_importances_list)
# Create a DataFrame with feature names and their corresponding average importance
feature_names = X.columns
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': average_importance
}).sort_values(by='Importance', ascending=False)
# Print Results
print(f"Overall Accuracy of Random Forest model with k-fold CV: {accuracy_cv:.4f}")
print("\nFinal Classification Report:")
print(final_report)
print("\nRandom Forest Feature Importances (averaged across folds):")
print(importance_df.head(20))
Even with the random state set, results can still shift when anything in the process changes slightly. This is usually a result of changes in the software environment: Colab periodically updates its preinstalled packages, so the scikit-learn or numpy version you get today may not be the one you had a month ago. There is nothing, per se, wrong with your code; the best thing you can do is lock everything down, from the dataset to the package versions, so that nothing changes between runs.
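One way to confirm this is to record the library versions (and a checksum of the data) alongside your results every time you run the notebook; if they differ between the two runs, the environment is the culprit. A minimal sketch in plain Python, reusing the file_path from your question (the md5 check is just one way to verify the file is byte-for-byte unchanged):

import sys
import hashlib
import numpy as np
import pandas as pd
import sklearn

# Record the environment that produced this run's results
print("Python:      ", sys.version.split()[0])
print("numpy:       ", np.__version__)
print("pandas:      ", pd.__version__)
print("scikit-learn:", sklearn.__version__)

# Verify the dataset really is unchanged between runs
with open(file_path, 'rb') as f:
    print("dataset md5: ", hashlib.md5(f.read()).hexdigest())

Once you know which versions produced the original results, you can pin them at the top of the notebook, e.g. !pip install scikit-learn==<version> numpy==<version> with the versions you recorded, so that a rerun a month later uses the same builds instead of whatever Colab currently ships.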
There have been some discussions on how this can be prevented; I am linking them here so you can explore other options as well: