
About
Machine learning in Python with scikit-learn. Use for classification, regression, clustering, model evaluation, and ML pipelines.
name: scikit-learn description: Machine learning in Python with scikit-learn. Use for classification, regression, clustering, model evaluation, and ML pipelines. license: BSD-3-Clause license metadata: skill-author: K-Dense Inc. risk: unknown source: community
Scikit-learn
Overview
This skill provides comprehensive guidance for machine learning tasks using scikit-learn, the industry-standard Python library for classical machine learning. Use this skill for classification, regression, clustering, dimensionality reduction, preprocessing, model evaluation, and building production-ready ML pipelines.
Installation
# Install scikit-learn using uv
uv uv pip install scikit-learn
# Optional: Install visualization dependencies
uv uv pip install matplotlib seaborn
# Commonly used with
uv uv pip install pandas numpy
When to Use This Skill
Use the scikit-learn skill when:
- Building classification or regression models
- Performing clustering or dimensionality reduction
- Preprocessing and transforming data for machine learning
- Evaluating model performance with cross-validation
- Tuning hyperparameters with grid or random search
- Creating ML pipelines for production workflows
- Comparing different algorithms for a task
- Working with both structured (tabular) and text data
- Need interpretable, classical machine learning approaches
Quick Start
Classification Example
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
Complete Pipeline with Mixed Data
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Define feature types
numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']
# Create preprocessing pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
model = Pipeline([
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier(random_state=42))
])
# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Core Capabilities
1. Supervised Learning
Comprehensive algorithms for classification and regression tasks.
Key algorithms:
- Linear models: Logistic Regression, Linear Regression, Ridge, Lasso, ElasticNet
- Tree-based: Decision Trees, Random Forest, Gradient Boosting
- Support Vector Machines: SVC, SVR with various kernels
- Ensemble methods: AdaBoost, Voting, Stacking
- Neural Networks: MLPClassifier, MLPRegressor
- Others: Naive Bayes, K-Nearest Neighbors
When to use:
- Classification: Predicting discrete categories (spam detection, image classification, fraud detection)
- Regression: Predicting continuous values (price prediction, demand forecasting)
See: references/supervised_learning.md for detailed algorithm documentation, parameters, and usage examples.
2. Unsupervised Learning
Discover patterns in unlabeled data through clustering and dimensionality reduction.
Clustering algorithms:
- Partition-based: K-Means, MiniBatchKMeans
- Density-based: DBSCAN, HDBSCAN, OPTICS
- Hierarchical: AgglomerativeClustering
- Probabilistic: Gaussian Mixture Models
- Others: MeanShift, SpectralClustering, BIRCH
Dimensionality reduction:
- Linear: PCA, TruncatedSVD, NMF
- Manifold learning: t-SNE, UMAP, Isomap, LLE
- Feature extraction: FastICA, LatentDirichletAllocation
When to use:
- Customer segmentation, anomaly detection, data visualization
- Reducing feature dimensions, exploratory data analysis
- Topic modeling, image compression
See: references/unsupervised_learning.md for detailed documentation.
3. Model Evaluation and Selection
Tools for robust model evaluation, cross-validation, and hyperparameter tuning.
Cross-validation strategies:
