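The two main types of supervised learning
| Task | Output | Examples |
|---|---|---|
| Regression | Continuous value | House prices, sales volume, temperature |
| Classification | Discrete class | Spam / not spam; cat / dog / bird |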
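Linear regression (regression)
Linear regression fits a linear function y = w₁x₁ + w₂x₂ + ... + b to the data (a straight line when there is only one feature):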
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, preds):.3f}")
print(f"R² : {r2_score(y_test, preds):.3f}") # the closer R² is to 1, the better
print("Coefficients:", model.coef_) # weight of each feature
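To connect the formula y = w₁x₁ + w₂x₂ + ... + b to the fitted model, here is a minimal sketch that recomputes the predictions by hand from `coef_` and `intercept_`. It uses `load_diabetes` as a stand-in dataset only because it ships with scikit-learn and needs no download; that choice is an assumption of this example, not part of the walkthrough above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Stand-in dataset bundled with scikit-learn (10 numeric features)
X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# Apply the formula directly: y_hat = X @ w + b
manual_preds = X @ model.coef_ + model.intercept_

# The hand-computed values should agree with model.predict()
matches = np.allclose(manual_preds, model.predict(X))
print("manual formula matches predict():", matches)
```

So `coef_` holds the weights w and `intercept_` holds b; nothing else is hidden inside the model.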
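Logistic regression (classification)
Despite the name "regression", it is really a classifier: it outputs a probability, which is then thresholded into 0/1: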
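from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
# Logistic regression needs class labels, so switch to a binary classification dataset
# (the housing target above is continuous); train_test_split is imported earlier
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)
model = LogisticRegression(max_iter=1000) # raise max_iter or scale the features if you see a ConvergenceWarning
model.fit(X_train, y_train)
preds = model.predict(X_test) # hard 0/1 labels
probs = model.predict_proba(X_test)[:, 1] # probability of the positive class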
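The "probability, then 0/1" step can be made explicit. The sketch below assumes a binary problem (the bundled `load_breast_cancer` dataset) and a `StandardScaler` in front of the model so the solver converges quickly; both are choices made for this example. It passes the linear score through a sigmoid and thresholds at 0.5, then checks the result against scikit-learn's own output.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# decision_function returns the linear score w·x + b for each sample
scores = clf.decision_function(X)

# The sigmoid squashes the score into a probability between 0 and 1
manual_probs = 1.0 / (1.0 + np.exp(-scores))
probs = clf.predict_proba(X)[:, 1]
probs_match = np.allclose(manual_probs, probs)

# Thresholding the probability at 0.5 gives the hard 0/1 label
manual_labels = (probs > 0.5).astype(int)
labels_match = np.array_equal(manual_labels, clf.predict(X))

print("sigmoid(score) == predict_proba:", probs_match)
print("(prob > 0.5) == predict:", labels_match)
```

If false positives and false negatives have different costs, you can threshold `probs` at a value other than 0.5 instead of calling `predict`.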
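Decision trees (work for both classification and regression)
from sklearn.tree import DecisionTreeClassifier
# X_train / y_train must hold class labels here (a classification split), not house prices
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test)) # mean accuracy on the test set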
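Visualize the tree:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# feature_names and class_names are lists of strings naming your columns and labels,
# e.g. list(data.feature_names) and list(data.target_names) for a scikit-learn classification dataset
plt.figure(figsize=(20, 8))
plot_tree(model, feature_names=feature_names, class_names=class_names, filled=True)
plt.show()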
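The heading says trees handle regression too, but the snippet only shows the classifier. Here is a minimal sketch of the regression variant, assuming the bundled `load_diabetes` dataset as a stand-in. It also uses `export_text`, which prints the learned rules as plain text when a plot is inconvenient.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42
)

# Same API as the classifier; a shallow tree keeps the printed rules readable
reg = DecisionTreeRegressor(max_depth=3, random_state=42)
reg.fit(X_train, y_train)

# For regressors, score() returns R² rather than accuracy
r2 = reg.score(X_test, y_test)
print(f"R²: {r2:.3f}")

# Text dump of the decision path: one line per split or leaf
rules = export_text(reg, feature_names=list(data.feature_names))
print(rules)
```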
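Random forest: many heads are better than one
A random forest lets many decision trees vote on the answer, which is usually noticeably more accurate than a single tree: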
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
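To see the "voting" concretely: in scikit-learn the forest averages the class probabilities of its trees (a soft vote) rather than counting hard votes. The sketch below checks this on the bundled `load_breast_cancer` dataset, which is an assumption of this example, as is the smaller `n_estimators=50`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_train, y_train)

# The fitted trees are stored in forest.estimators_;
# average their per-class probabilities by hand
tree_probs = np.mean(
    [tree.predict_proba(X_test) for tree in forest.estimators_], axis=0
)

# The hand-computed average should equal the forest's own probabilities
same = np.allclose(tree_probs, forest.predict_proba(X_test))
print("forest probabilities == mean over trees:", same)
```

Each tree is trained on a different bootstrap sample and considers random feature subsets, so their individual mistakes tend to cancel out in the average.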
XGBoost / LightGBM: the ceiling for tabular data
As of 2026 these are the industry default for tabular ML, and they often outperform random forests:
pip install xgboost lightgbm
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=200, max_depth=5)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
# Feature importance (plotted with matplotlib)
xgb.plot_importance(model)
On structured tabular data, XGBoost / LightGBM often beat deep learning, so don't reach for PyTorch first.
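Feature importances can also be read as numbers instead of a plot. The sketch below swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in so it runs without installing anything extra; it is not XGBoost itself. XGBoost's scikit-learn wrapper exposes a `feature_importances_` attribute as well, but its values are computed differently and will not match these.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()

# Stand-in gradient-boosted trees model (not XGBoost)
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
gb.fit(data.data, data.target)

# One non-negative score per feature; scikit-learn normalizes them to sum to 1
importances = gb.feature_importances_

# Indices of the five most important features, highest first
top5 = np.argsort(importances)[::-1][:5]
for i in top5:
    print(f"{data.feature_names[i]:<25} {importances[i]:.3f}")
```

Treat these scores as a rough guide: they describe what the model relied on, not causal effects.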
Evaluation metrics
Classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
accuracy_score(y_test, preds)
precision_score(y_test, preds, average="macro")
recall_score(y_test, preds, average="macro")
f1_score(y_test, preds, average="macro")
confusion_matrix(y_test, preds)
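To make these metrics less of a black box, here is a small worked example that derives precision, recall, and F1 from the confusion matrix by hand. The label arrays are invented for illustration. Note that it checks against the default `average="binary"` (metrics for the positive class), whereas `average="macro"` as used above averages the per-class values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels, made up for this example
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of all real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Cross-check against scikit-learn (default average="binary")
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75.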
Regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mean_squared_error(y_test, preds)
mean_absolute_error(y_test, preds)
r2_score(y_test, preds)
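The regression metrics are simple enough to compute by hand, which helps when reading them. A minimal sketch with invented numbers, cross-checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values, made up for this example
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)     # mean squared error: punishes large misses more
mae = np.mean(np.abs(errors))  # mean absolute error: same unit as the target
rmse = np.sqrt(mse)            # root of MSE, back in the target's unit

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.4f} MAE={mae:.4f} RMSE={rmse:.4f} R²={r2:.4f}")

# Cross-check against scikit-learn
sk_mse = mean_squared_error(y_true, y_pred)
sk_mae = mean_absolute_error(y_true, y_pred)
sk_r2 = r2_score(y_true, y_pred)
```

An R² of 0 means the model does no better than always predicting the mean, and it can go negative for a model that does worse than that.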
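Rules of thumb for choosing a model
| Data situation | Try first |
|---|---|
| Simple relationships / interpretability matters | Linear regression / logistic regression |
| Small-to-medium data + you want to see the decision path | Decision tree |
| Tabular data / you want a solid baseline | Random forest |
| Tabular data / chasing the best score | XGBoost / LightGBM |
| Text / images | Neural networks (covered after the next post) |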
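In practice, the table above gives you a shortlist, and you then compare the candidates on your own data. A minimal sketch using 5-fold cross-validation follows. The dataset (`load_breast_cancer`), the scaler in front of logistic regression, and scikit-learn's `GradientBoostingClassifier` standing in for XGBoost / LightGBM are all assumptions of this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# One candidate per row of the table (gradient boosting is a stand-in)
candidates = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # 5-fold cross-validation is less noisy than a single train/test split
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:<20} accuracy = {scores.mean():.3f} ± {scores.std():.3f}")
```

Which model wins depends on the dataset; on small, clean data the simple models are often within noise of the complex ones.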
The next post covers clustering and dimensionality reduction, i.e. unsupervised learning.