Installation

pip install scikit-learn

Core API: the four-step pattern

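from sklearn.X import SomeModel      # placeholders: X is a submodule (e.g. linear_model), SomeModel an estimator class

model = SomeModel()                  # 1. instantiate
model.fit(X_train, y_train)          # 2. train
preds = model.predict(X_test)        # 3. predict
score = model.score(X_test, y_test)  # 4. evaluate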

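Every supervised model in sklearn follows this same pattern. That consistency is sklearn's strongest design decision: swapping one algorithm for another usually means changing a single line.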
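To make the uniform interface concrete, here is a minimal sketch that runs three different classifiers through the same fit / predict / score loop. The choice of models, the iris dataset, and the hyperparameters (max_iter=1000, n_neighbors=5, random_state=0) are assumptions picked for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Three unrelated algorithms, all exposing the same methods
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # train
    preds = model.predict(X_test)               # predict
    scores[name] = model.score(X_test, y_test)  # evaluate (mean accuracy for classifiers)
    print(f"{name}: {scores[name]:.3f}")
```

The loop body never needs to know which algorithm it is handling, which is what makes quick model comparisons cheap.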

Full example: iris classification

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load the data
iris = load_iris()
X, y = iris.data, iris.target           # X: features (150, 4), y: labels (150,)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
print(classification_report(y_test, preds, target_names=iris.target_names))

Standard workflow

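1. Load the data (pandas / sklearn.datasets)
2. Explore (df.describe / visualization)
3. Split (train_test_split)
4. Preprocess (standardization / missing values / encoding)
5. Train (model.fit)
6. Evaluate (accuracy / precision / recall / cross-validation)
7. Tune hyperparameters (GridSearchCV)
8. Predict with the final model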
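Steps 1 and 2 are the only ones with no code elsewhere in this post, so here is a small sketch of them. It assumes pandas is installed and uses load_iris(as_frame=True) to get a DataFrame; the specific checks shown are one reasonable starting set, not a fixed recipe.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
df = iris.frame                      # 4 feature columns plus a "target" column

print(df.shape)                      # rows, columns
print(df.describe())                 # per-column count, mean, std, min, quartiles, max
print(df["target"].value_counts())   # class balance
print(df.isna().sum())               # missing values per column
```

Checking class balance and missing values before splitting tells you whether you need stratify= in train_test_split and an imputer in preprocessing.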

Preprocessing

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

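# Standardize numeric features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)         # reuse the mean/variance learned from the training set

# Encode categorical features (df is assumed to be a pandas DataFrame with a "city" column)
encoder = OneHotEncoder(sparse_output=False)     # sparse_output requires scikit-learn >= 1.2
X_cat = encoder.fit_transform(df[["city"]])

# Impute missing values (fit on the training set only, then reuse on the test set)
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)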
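Real tables usually mix numeric and categorical columns, and each kind needs different preprocessing. One way to apply the steps above per column is sklearn's ColumnTransformer. The sketch below uses a made-up four-row DataFrame; the column names (age, income, city) and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values and one categorical column
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "income": [3000.0, 5200.0, 4100.0, np.nan],
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
})

# Numeric columns: fill missing values with the mean, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    # handle_unknown="ignore" encodes unseen categories as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X_prepared = preprocess.fit_transform(df)
print(X_prepared.shape)   # 2 numeric columns + one column per distinct city
```

In a real project you would call fit_transform on the training rows only and transform on the test rows, exactly as with the standalone scaler.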

Pipeline: chaining preprocessing and the model

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))   # test data is scaled with the training statistics, then scored

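This avoids the classic pitfall of standardizing the data at training time and forgetting to do the same at prediction time.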
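A short sketch of what the pipeline does internally may help. It checks that the scaler inside the pipeline learned its statistics from the training set only, then predicts on raw, unscaled measurements. The sample in new_flower and max_iter=1000 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The scaler step was fitted on X_train only, never on the test set
scaler = pipe.named_steps["scale"]
same_mean = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(same_mean)

# Raw measurements go straight in; the pipeline scales them before predicting
new_flower = [[5.1, 3.5, 1.4, 0.2]]   # hypothetical new sample (cm)
pred = pipe.predict(new_flower)
print(pred)
```

Because the fitted pipeline is a single object, saving and loading it keeps the preprocessing and the model in sync.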

Cross-validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold mean: {scores.mean():.3f} ± {scores.std():.3f}")
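When you want more than one metric, or explicit control over how the folds are drawn, cross_validate with a StratifiedKFold splitter is one option. The sketch below is an illustration under assumptions: the metric names, the shuffle seed, and the scaler + logistic regression model are choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Passing a pipeline means the scaler is re-fitted inside every fold,
# so no statistics leak from the validation part into training
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds: each fold keeps the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(pipe, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

for metric in ("test_accuracy", "test_f1_macro"):
    vals = results[metric]
    print(f"{metric}: {vals.mean():.3f} ± {vals.std():.3f}")
```

The returned dict also contains fit_time and score_time arrays, which are handy when comparing models of very different cost.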

Hyperparameter tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}

grid = GridSearchCV(RandomForestClassifier(), params, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)

n_jobs=-1 uses all available CPU cores.
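GridSearchCV also works on a whole Pipeline: parameters are addressed as step name, two underscores, parameter name. The following sketch tunes the regularization strength C of a logistic regression inside a pipeline and then checks the winner on a held-out test set. The grid values are arbitrary assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "model__C" targets the C parameter of the step named "model"
params = {"model__C": [0.1, 1, 10]}

grid = GridSearchCV(pipe, params, cv=5)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train
best = grid.best_estimator_
test_score = best.score(X_test, y_test)

print(grid.best_params_)
print(f"CV score:   {grid.best_score_:.3f}")
print(f"Test score: {test_score:.3f}")
```

Note that best_score_ is a cross-validation average over the training data; the held-out test score is the less optimistic estimate of how the tuned model will behave on new data.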

Where sklearn stands in 2026

Classical ML on structured, tabular data is still sklearn's home turf; LLMs are not well suited to this kind of task. For deep learning, text, or images, use PyTorch + Transformers.

The next two posts cover classification and regression, then clustering and dimensionality reduction.