Installation
pip install scikit-learn
Core API: the four-step pattern
from sklearn.X import SomeModel  # placeholders: X is a submodule, SomeModel an estimator class
model = SomeModel()  # 1. instantiate
model.fit(X_train, y_train)  # 2. train
preds = model.predict(X_test)  # 3. predict
score = model.score(X_test, y_test)  # 4. evaluate
Nearly every supervised estimator follows this pattern, and that consistency is sklearn's strongest design decision.
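To make the uniformity concrete, here is a minimal sketch that drives two unrelated algorithms through the same four calls. The choice of `LogisticRegression` and `KNeighborsClassifier`, the iris data, and the split settings are illustrative assumptions, not part of the template above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Two different algorithms, driven by exactly the same calls.
results = {}
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)                 # train
    preds = model.predict(X_test)               # predict
    results[type(model).__name__] = model.score(X_test, y_test)  # evaluate

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Swapping in another classifier only means changing the tuple in the `for` line; the loop body stays the same.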
Complete example: iris classification
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# 1. Load the data
iris = load_iris()
X, y = iris.data, iris.target  # X: features (150, 4), y: labels (150,)
# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 3. Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# 4. Evaluate
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
print(classification_report(y_test, preds, target_names=iris.target_names))
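Standard workflow
1. Load the data (pandas / sklearn.datasets)
2. Explore (df.describe / visualization)
3. Split (train_test_split)
4. Preprocess (scaling / missing values / encoding)
5. Train (model.fit)
6. Evaluate (accuracy / precision / recall / cross-validation)
7. Tune hyperparameters (GridSearchCV)
8. Predict with the final model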
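Step 2, exploration, is the one step the other snippets in this post do not show. A minimal sketch, assuming pandas is installed and reusing the iris dataset in DataFrame form, might look like this:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus a "target" column

print(df.shape)                     # rows and columns
print(df.describe())                # per-column count, mean, std, quartiles
print(df["target"].value_counts())  # class balance
print(df.isna().sum())              # missing values per column
```

The class balance and missing-value counts are worth checking before step 3, since they decide whether a stratified split or an imputer is needed.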
Preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
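# Scale numeric features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the mean/variance learned from the training set
# Encode categorical features (df is assumed to be a pandas DataFrame with a "city" column)
encoder = OneHotEncoder(sparse_output=False)  # sparse_output requires scikit-learn >= 1.2
X_cat = encoder.fit_transform(df[["city"]])
# Fill missing values
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_train)  # fit on the training set only, as with the scaler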
Pipeline: chaining preprocessing and the model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
This avoids the pitfall of scaling the data at training time and forgetting to scale it at prediction time.
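One way to see what the pipeline does is to compare it with the manual route. The sketch below, which assumes the iris data and sets `max_iter=1000` purely for illustration, fits both and checks that the predictions agree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Manual route: scaling must be remembered at fit time and again at predict time.
scaler = StandardScaler().fit(X_train)
manual = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
manual_preds = manual.predict(scaler.transform(X_test))

# Pipeline route: the scaler is applied automatically inside fit() and predict().
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_test)

print(np.array_equal(manual_preds, pipe_preds))
```

Dropping either `scaler.transform` call in the manual route is exactly the mistake the pipeline rules out.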
Cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold mean: {scores.mean():.3f} ± {scores.std():.3f}")
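Cross-validation combines naturally with a pipeline: the scaler is then refit on the training folds of each split, so the held-out fold never influences the preprocessing. This is a sketch under assumed settings (iris data, an explicit shuffled `StratifiedKFold`, `max_iter=1000`), not the only reasonable configuration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# make_pipeline names the steps automatically from the class names.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Explicit splitter: shuffled, stratified, and reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipe, X, y, cv=cv)
print(scores)
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```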
Hyperparameter tuning with GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
n_jobs=-1 uses all available CPU cores.
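Once the search finishes, the tuned model still has to be checked on data it has not seen. The sketch below assumes the iris data and a deliberately small grid so that it runs quickly; it relies on the default `refit=True`, which retrains the best configuration on the full training set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A small illustrative grid: 2 x 2 = 4 candidates, each evaluated with 5-fold CV.
params = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5)
grid.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is already retrained on X_train.
final_model = grid.best_estimator_
test_acc = final_model.score(X_test, y_test)

print(grid.best_params_)
print(f"CV accuracy:   {grid.best_score_:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The cross-validated score picks the hyperparameters, and the test score is the estimate to report, because the test set played no part in the search.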
Where sklearn stands in 2026
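Classical ML on structured, tabular data is still sklearn's home turf, and LLMs are not well suited to that kind of task. For deep learning, text, or images, use PyTorch + Transformers.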
The next two posts cover classification and regression, then clustering and dimensionality reduction.