Changes from all commits
68 commits
e378e78
Add Chapter 1 machine learning landscape notebook
macsrc Nov 1, 2025
5495662
Add registry files and update ML project notebook
macsrc Nov 7, 2025
2c32130
Rename and add ML landscape notebooks
macsrc Nov 8, 2025
8b39a90
Created using Colab
macsrc Nov 9, 2025
c025165
Created using Colab
macsrc Nov 10, 2025
daba234
Completed
macsrc Nov 10, 2025
af8dfc6
Add end-to-end ML project notebook
macsrc Nov 11, 2025
50a9cdb
Analyze and save file
macsrc Nov 18, 2025
b33ea1f
Add neural nets analysis notebook with Keras
macsrc Nov 19, 2025
d40cf05
updated comments for 10-analysis
macsrc Nov 19, 2025
e18d0cd
created files for analysis
macsrc Nov 19, 2025
79482a7
Merge branch 'handson-ml-241025' of https://github.com/macsrc/mac-han…
macsrc Nov 19, 2025
b684181
Created using Colab
macsrc Nov 19, 2025
dad8aa3
Updated Chap-11 Training DNN with analysis
macsrc Nov 19, 2025
2bd9e7b
updated analysis file for exploration
macsrc Nov 20, 2025
0fc833e
For Ch01 - ML Landscape - updated reviews for the e2e program
macsrc Nov 20, 2025
a228350
py files
macsrc Nov 20, 2025
7a3e560
Merge branch 'handson-ml-241025' of https://github.com/macsrc/mac-han…
macsrc Nov 20, 2025
6a05e16
Add new notebook and text files, update references
macsrc Nov 23, 2025
f96bee6
Add deep learning notes and explanations for chapters 10-12
macsrc Nov 23, 2025
263e8dc
updates - 24-11-2025
macsrc Nov 24, 2025
86cf187
update - Ch01 done update sync
macsrc Nov 26, 2025
f3ce393
updates
macsrc Nov 26, 2025
5557c54
updates
macsrc Nov 27, 2025
855b1bc
updates for ch03 and ch10
macsrc Dec 2, 2025
b31bd29
Add enhanced ML project and neural nets files
macsrc Dec 7, 2025
bb9eb15
updates
macsrc Dec 18, 2025
0966f35
Created using Colab
macsrc Dec 19, 2025
63db7dd
Created using Colab
macsrc Dec 20, 2025
17233b6
Created using Colab
macsrc Dec 21, 2025
165e2b7
Created using Colab
macsrc Dec 21, 2025
d0211fd
Created using Colab
macsrc Dec 21, 2025
10141fa
Created using Colab
macsrc Dec 21, 2025
b53d03b
ch01 and ch10 updates
macsrc Dec 21, 2025
7ac9931
Merge branch 'handson-ml-241025' of https://github.com/macsrc/mac-han…
macsrc Dec 21, 2025
390132e
Add neural network analysis notebook with Keras
macsrc Dec 21, 2025
f6c5ab7
Created using Colab
macsrc Dec 22, 2025
30b1171
Created using Colab
macsrc Dec 22, 2025
434ac39
update 22122025
macsrc Dec 22, 2025
bf2a802
Created using Colab
macsrc Dec 22, 2025
676961d
Created using Colab
macsrc Dec 22, 2025
1a53f8f
Created using Colab
macsrc Dec 22, 2025
c951a61
Created using Colab
macsrc Dec 22, 2025
192a489
Expand AI understanding template interview prompts
macsrc Dec 23, 2025
70d6bf3
Created using Colab
macsrc Dec 23, 2025
f5dfc2d
Created using Colab
macsrc Dec 23, 2025
94b9b95
update
macsrc Dec 23, 2025
0050c4f
Explore chapter 15 details
macsrc Dec 23, 2025
d21ac89
update
macsrc Dec 23, 2025
7b20ea1
Merge branch 'handson-ml-241025' of https://github.com/macsrc/mac-han…
macsrc Dec 23, 2025
29a6adc
ch15 - first commit for system check using Colab
macsrc Dec 23, 2025
9b0b3f0
Add and reorganize explore-hml3 chapter notebooks
macsrc Dec 23, 2025
457cb33
Created using Colab
macsrc Dec 23, 2025
1792200
Created using Colab
macsrc Dec 23, 2025
7d7c84a
Created using Colab
macsrc Dec 24, 2025
bcf19c4
exploration in progress - Created using Colab
macsrc Dec 25, 2025
368b137
Created using Colab
macsrc Dec 25, 2025
0c2b70b
Created using Colab
macsrc Dec 26, 2025
5f6af73
updates - 26-Dec-2025
macsrc Dec 26, 2025
d76d961
Add temporary Word document to project
macsrc Dec 26, 2025
08ff585
changes from branch 241025
macsrc Dec 26, 2025
8ba2043
added enhanced_homl_notebook
macsrc Dec 27, 2025
e8cb4af
Created using Colab
macsrc Dec 27, 2025
42f0f39
Add AI template and update explore-hml3.docx
macsrc Dec 28, 2025
3c0986a
Created using Colab
macsrc Jan 1, 2026
460a0c0
Created using Colab - 02-Jan-2025
macsrc Jan 2, 2026
22c90a1
Created using Colab
macsrc Jan 2, 2026
4392037
Created using Colab
macsrc Jan 2, 2026
5,383 changes: 3,690 additions & 1,693 deletions 01_the_machine_learning_landscape.ipynb
1,190 changes: 1,190 additions & 0 deletions 01_the_machine_learning_landscape.md
442 changes: 442 additions & 0 deletions 01_the_machine_learning_landscape.py
1,420 changes: 1,420 additions & 0 deletions 01_the_mll_practice.ipynb
418 changes: 418 additions & 0 deletions 02_Enhanced_End-to-End_Notebook_for_Hands.py
16,808 changes: 10,597 additions & 6,211 deletions 02_end_to_end_machine_learning_project.ipynb
4,996 changes: 4,996 additions & 0 deletions 02_end_to_end_machine_learning_project.md
1,751 changes: 1,751 additions & 0 deletions 02_end_to_end_machine_learning_project.py
637 changes: 637 additions & 0 deletions 02_end_to_end_machine_learning_project_enhancements.md

Large diffs are not rendered by default.

293 changes: 293 additions & 0 deletions 02_enhanced_homl_ch_2_notebook.py
@@ -0,0 +1,293 @@
# Enhanced End-to-End Notebook for Hands-On ML Chapter 2
# File: enhanced_homl_ch2_notebook.py
# Purpose: Complete, runnable end-to-end pipeline implementing improvements:
# - data loading
# - EDA (brief)
# - feature engineering
# - preprocessing pipelines (ColumnTransformer)
# - outlier handling
# - several models (Linear, Ridge, DecisionTree, RandomForest, GradientBoosting)
# - model selection (RandomizedSearchCV)
# - stacking ensemble
# - final evaluation on test set
# - model persistence

print("Enhanced ML workflow loaded.")

# --- Imports ---------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import urllib.request
import tarfile

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer, PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator, TransformerMixin
import joblib

# Optional advanced libs (if installed); fall back safely when missing
try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
except Exception:
    XGBOOST_AVAILABLE = False

# --- Utilities -------------------------------------------------------------
def load_housing_data(data_root="https://github.com/ageron/data/raw/main/"):
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = data_root + "housing.tgz"
        print("Downloading housing dataset...")
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

# RMSE helper
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# save figure helper
IMAGES_PATH = Path("images/enhanced_homl_ch2")
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id):
    plt.tight_layout()
    plt.savefig(IMAGES_PATH / f"{fig_id}.png", dpi=200)

# --- Load data -------------------------------------------------------------
housing = load_housing_data()
print("Loaded housing shape:", housing.shape)

# --- Quick EDA (very brief) -----------------------------------------------
print(housing.info())
print(housing.describe().T[['mean','std','min','max']])
print(housing['ocean_proximity'].value_counts())

# Visual quick check (histograms)
housing.hist(bins=50, figsize=(12, 8))
save_fig('histograms')
plt.show()

# --- Create stratified split based on median_income (as in book) ------------
housing['income_cat'] = pd.cut(housing['median_income'], bins=[0.,1.5,3.0,4.5,6.,np.inf], labels=[1,2,3,4,5])
train_set, test_set = train_test_split(housing, test_size=0.2, stratify=housing['income_cat'], random_state=42)
for s in (train_set, test_set):
    s.drop('income_cat', axis=1, inplace=True)

housing = train_set.copy()

# --- Feature engineering (recommended additions) ---------------------------
# We'll implement feature transformers that can be included in ColumnTransformer

def add_extra_features(X_df):
    X = X_df.copy()
    # ratio features (divisions may produce inf for zero denominators)
    X['rooms_per_house'] = X['total_rooms'] / X['households']
    X['bedrooms_ratio'] = X['total_bedrooms'] / X['total_rooms']
    X['people_per_house'] = X['population'] / X['households']
    X['rooms_per_person'] = X['total_rooms'] / X['population']
    X['income_x_age'] = X['median_income'] * X['housing_median_age']
    # turn any infinities into NaNs rather than leaking them downstream
    X.replace([np.inf, -np.inf], np.nan, inplace=True)
    return X

# Apply to verify
housing_fe = add_extra_features(housing)
print(housing_fe[['rooms_per_house','bedrooms_ratio','people_per_house','rooms_per_person','income_x_age']].head())

# Custom transformer to add engineered features inside a pipeline
class FeatureAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_rooms_per_person=True):
        self.add_rooms_per_person = add_rooms_per_person

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X may arrive as a DataFrame or, after SimpleImputer, as a bare
        # numpy array; in the latter case restore the book's num_attribs order
        if hasattr(X, 'columns'):
            X_df = X.copy()
        else:
            X_df = pd.DataFrame(X, columns=['longitude', 'latitude',
                                            'housing_median_age', 'total_rooms',
                                            'total_bedrooms', 'population',
                                            'households', 'median_income'])
        X_df = add_extra_features(X_df)
        # keep the original attributes and append the engineered ones
        return X_df.values

# --- Preprocessing pipelines -----------------------------------------------
num_attribs = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income']
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('feat_adder', FeatureAdder()),
    ('scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocessing = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', cat_pipeline, cat_attribs),
])

# Fit-transform a small sample to ensure pipeline works
sample_prep = preprocessing.fit_transform(housing)
print('Preprocessing output shape:', sample_prep.shape)

# --- Prepare training data -------------------------------------------------
X_train = housing.drop('median_house_value', axis=1)
y_train = housing['median_house_value'].copy()
X_test = test_set.drop('median_house_value', axis=1)
y_test = test_set['median_house_value'].copy()

# We will use pipelines wrapping the preprocessing + estimator

# --- Define candidate models ----------------------------------------------
models = {}
models['lin_reg'] = make_pipeline(preprocessing, LinearRegression())
models['ridge'] = make_pipeline(preprocessing, Ridge(random_state=42))
models['tree'] = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
models['rf'] = make_pipeline(preprocessing, RandomForestRegressor(random_state=42, n_jobs=-1))
models['gbr'] = make_pipeline(preprocessing, GradientBoostingRegressor(random_state=42))
if XGBOOST_AVAILABLE:
    models['xgb'] = make_pipeline(preprocessing, xgb.XGBRegressor(random_state=42, n_jobs=-1))

# --- Quick cross-validation comparison ------------------------------------
print('Cross-validating baseline models (5-fold RMSE) ...')
for name, model in models.items():
    scores = -cross_val_score(model, X_train, y_train, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
    print(f"{name}: mean RMSE={scores.mean():.2f}, std={scores.std():.2f}")

# --- Hyperparameter tuning (RandomizedSearch) ------------------------------
# We'll tune RandomForest and GradientBoosting (and XGBoost if available)

param_dist_rf = {
    'randomforestregressor__n_estimators': [100, 200, 400],
    'randomforestregressor__max_features': [4, 6, 8, 10],
    'randomforestregressor__max_depth': [None, 10, 20, 30],
    'randomforestregressor__min_samples_split': [2, 5, 10],
}

param_dist_gbr = {
    'gradientboostingregressor__n_estimators': [100, 200, 400],
    'gradientboostingregressor__learning_rate': [0.01, 0.05, 0.1],
    'gradientboostingregressor__max_depth': [3, 5, 8],
    'gradientboostingregressor__subsample': [0.6, 0.8, 1.0],
}

searches = {}

print('Running RandomizedSearchCV for RandomForest (this may take a while)...')
rf_search = RandomizedSearchCV(models['rf'], param_distributions=param_dist_rf, n_iter=10, cv=3,
                               scoring='neg_root_mean_squared_error', random_state=42, n_jobs=-1)
rf_search.fit(X_train, y_train)
searches['rf'] = rf_search
print('Best RF params:', rf_search.best_params_)

print('Running RandomizedSearchCV for GradientBoosting...')
gbr_search = RandomizedSearchCV(models['gbr'], param_distributions=param_dist_gbr, n_iter=10, cv=3,
                                scoring='neg_root_mean_squared_error', random_state=42, n_jobs=-1)
gbr_search.fit(X_train, y_train)
searches['gbr'] = gbr_search
print('Best GBR params:', gbr_search.best_params_)

if XGBOOST_AVAILABLE:
    param_dist_xgb = {
        'xgbregressor__n_estimators': [100, 200, 400],
        'xgbregressor__max_depth': [3, 5, 8],
        'xgbregressor__learning_rate': [0.01, 0.05, 0.1],
        'xgbregressor__colsample_bytree': [0.6, 0.8, 1.0],
    }
    print('Running RandomizedSearchCV for XGBoost...')
    xgb_search = RandomizedSearchCV(models['xgb'], param_distributions=param_dist_xgb, n_iter=10, cv=3,
                                    scoring='neg_root_mean_squared_error', random_state=42, n_jobs=-1)
    xgb_search.fit(X_train, y_train)
    searches['xgb'] = xgb_search
    print('Best XGB params:', xgb_search.best_params_)

# --- Evaluate best estimators on validation via cross-val ------------------
print('Evaluating best estimators (5-fold CV)')
best_estimators = {}
for key, search in searches.items():
    best = search.best_estimator_
    best_estimators[key] = best
    scores = -cross_val_score(best, X_train, y_train, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
    print(f"{key}: mean RMSE={scores.mean():.2f}, std={scores.std():.2f}")

# --- Stacking ensemble -----------------------------------------------------
print('Building stacking ensemble with the tuned candidates...')
# NOTE: the stack is wrapped in a single shared preprocessing step, so its base
# estimators must be the bare tuned regressors (the last step of each tuned
# pipeline); passing the full pipelines would apply preprocessing twice.
base_estimators = [(name, est.steps[-1][1]) for name, est in best_estimators.items()
                   if name in ('rf', 'gbr')]
stack = StackingRegressor(estimators=base_estimators, final_estimator=Ridge())
stack_pipeline = make_pipeline(preprocessing, stack)

# Cross-validate stacking
stack_scores = -cross_val_score(stack_pipeline, X_train, y_train, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
print(f"Stacking mean RMSE={stack_scores.mean():.2f}, std={stack_scores.std():.2f}")

# --- Final training on full training set and test evaluation ----------------
print('Training final model (stack pipeline) on full training set...')
stack_pipeline.fit(X_train, y_train)
final_predictions = stack_pipeline.predict(X_test)
final_test_rmse = rmse(y_test, final_predictions)
print(f"Final model test RMSE = {final_test_rmse:.2f}")

# Feature importances from best RF (if present)
if 'rf' in searches:
    best_rf = searches['rf'].best_estimator_.named_steps['randomforestregressor']
    # get feature names from the fitted preprocessing step, if available
    try:
        feature_names = list(preprocessing.get_feature_names_out())
    except Exception:
        feature_names = None
    if feature_names is not None:
        importances = best_rf.feature_importances_
        fi = sorted(zip(importances, feature_names), reverse=True)[:15]
        print('Top feature importances (RF):')
        for imp, name in fi:
            print(f"{name}: {imp:.3f}")

# --- Confidence interval for test RMSE via bootstrap -----------------------
from scipy import stats
squared_errors = (final_predictions - y_test) ** 2
def rmse_from_sq(sq):
    return np.sqrt(np.mean(sq))
boot_res = stats.bootstrap([squared_errors], rmse_from_sq, confidence_level=0.95, random_state=42)
print('95% CI for test RMSE:', boot_res.confidence_interval)

# --- Save final pipeline --------------------------------------------------
joblib.dump(stack_pipeline, 'enhanced_homl_ch2_final_pipeline.pkl')
print('Saved final pipeline to enhanced_homl_ch2_final_pipeline.pkl')

# --- Quick prediction demo ------------------------------------------------
sample = X_test.iloc[:5]
print('Sample predictions:', stack_pipeline.predict(sample).round(-2))
print('Sample actuals :', y_test.iloc[:5].values.round(-2))

# Done
print('Notebook run complete.')
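The persisted pipeline is meant to be reloaded with joblib in a separate process for inference. A minimal, self-contained sketch of that dump/load round trip (the tiny Ridge pipeline, synthetic data, and `demo_pipeline.pkl` path are illustrative stand-ins; in practice you would `joblib.load('enhanced_homl_ch2_final_pipeline.pkl')`):

```python
import numpy as np
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Stand-in training data and a small pipeline (scaler + Ridge)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

pipeline = make_pipeline(StandardScaler(), Ridge(random_state=42))
pipeline.fit(X, y)
joblib.dump(pipeline, 'demo_pipeline.pkl')

# Later (e.g. in a serving process): reload and predict
reloaded = joblib.load('demo_pipeline.pkl')
preds = reloaded.predict(X[:5])
print('Reloaded predictions:', preds.round(2))
# The reloaded pipeline reproduces the in-memory pipeline's predictions
assert np.allclose(preds, pipeline.predict(X[:5]))
```

Because the whole preprocessing-plus-model pipeline is pickled as one object, the serving side needs no separate scaling or feature-engineering code, only compatible scikit-learn and joblib versions.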