Class Balance Analyzer
Upload or paste your class labels to analyze dataset balance. Get class distribution, imbalance severity, and calculated class weights for PyTorch, TensorFlow, and Scikit-learn.
Understanding Class Imbalance
- Imbalance Ratio: The ratio of the largest class to the smallest class. Ratio > 2 usually indicates imbalance.
- Minority Class %: The percentage of the smallest class. Below 10% is typically problematic.
- Severity: LOW (ratio < 2), MEDIUM (ratio 2-10), HIGH (ratio > 10).
- Why it matters: Models trained on imbalanced data often ignore the minority class entirely.
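The metrics above can be computed directly from a list of labels. The following is a minimal sketch, assuming the severity cutoffs listed above with the boundary values 2 and 10 counted as MEDIUM. The function name analyze_balance is illustrative and is not part of the tool.

```python
from collections import Counter

def analyze_balance(labels):
    """Summarize class balance for a flat list of labels."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    ratio = largest / smallest                  # imbalance ratio
    minority_pct = 100 * smallest / len(labels)  # share of the smallest class
    if ratio < 2:
        severity = "LOW"
    elif ratio <= 10:
        severity = "MEDIUM"
    else:
        severity = "HIGH"
    return {
        "counts": dict(counts),
        "imbalance_ratio": ratio,
        "minority_pct": minority_pct,
        "severity": severity,
    }

labels = ["neg"] * 95 + ["pos"] * 5
report = analyze_balance(labels)
print(report)
```

For this 95/5 split the ratio is 19 and the minority share is 5%, so the sketch reports HIGH severity.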
Solutions for Class Imbalance
EASY Class Weights
Assign higher weights to minority classes during training. Fastest solution, works for most cases. Built into most frameworks.
👎 Con: May not work for extreme imbalance
MEDIUM Oversampling
Duplicate minority class samples or use SMOTE to generate synthetic samples. Increases training data.
👎 Con: Risk of overfitting
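Random oversampling by duplication can be sketched in a few lines of plain Python. This is an illustrative sketch, not the tool's implementation; the function name random_oversample and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement to fill the gap up to the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

X = list(range(100))
y = [0] * 95 + [1] * 5
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversample only the training split, after the train/test split has been made. Otherwise copies of the same sample can land on both sides and inflate the test score.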
MEDIUM Undersampling
Remove majority class samples to balance classes. Shrinks the training set, which also shortens training time.
👎 Con: Loses information
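Random undersampling is the mirror image: every class is cut down to the size of the smallest one. A minimal sketch, with random_undersample as a hypothetical helper name:

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Sample without replacement so no example is repeated.
        for x in rng.sample(xs, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

X = list(range(100))
y = [0] * 95 + [1] * 5
X_small, y_small = random_undersample(X, y)
print(Counter(y_small))
```

On a 95/5 split this keeps only 10 of the 100 samples, which makes the information loss concrete.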
HARD SMOTE / Advanced Techniques
Synthetic Minority Over-sampling Technique. Creates synthetic samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors.
👎 Con: Complex, requires tuning
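The core interpolation step of SMOTE can be sketched without any library. This is a simplified illustration, not a full implementation. It assumes numeric feature tuples, Euclidean distance, and at least two minority samples, and the name smote_like is hypothetical. In practice a maintained implementation such as the SMOTE class in imbalanced-learn is usually the better choice.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=10)
print(new_points[:3])
```

The tuning mentioned above mostly concerns k (how many neighbours to consider) and n_new (how many synthetic samples to create).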
EASY Threshold Adjustment
Move the decision threshold away from the default of 0.5. Lowering it catches more minority samples (higher recall) at the cost of more false positives (lower precision).
👎 Con: Simplest for binary classification; multiclass problems need per-class thresholds
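The precision/recall trade-off can be seen by scoring the same predictions at two thresholds. A small sketch with made-up scores; the data and the helper name precision_recall_at are illustrative only.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for the positive class (label 1) at a given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores for 10 samples, 3 of them positive.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60, 0.90]

for threshold in (0.5, 0.4):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Here lowering the threshold from 0.5 to 0.4 raises recall from 0.67 to 1.00 while precision drops from 1.00 to 0.75.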
EASY Different Metrics
Use F1 or precision-recall AUC instead of accuracy, which can look high even when the minority class is never predicted.
👎 Con: Only changes how the model is evaluated; does not fix the imbalance itself
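A quick way to see why accuracy misleads on imbalanced data is to score a model that always predicts the majority class. A minimal sketch in plain Python; the helper names accuracy and f1_for_class are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, positive=1):
    """F1 score for one class, returning 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts the majority class

acc = accuracy(y_true, y_pred)
f1 = f1_for_class(y_true, y_pred)
print(f"accuracy={acc:.2f}, minority F1={f1:.2f}")
```

The always-majority model scores 95% accuracy but an F1 of 0 on the minority class.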
Class Weights Explained
For a binary classification problem with 95 negatives and 5 positives:
- Balanced formula: weight = total_samples / (num_classes × class_count)
- Example: Negative weight = 100 / (2 × 95) = 0.53, Positive weight = 100 / (2 × 5) = 10
- Interpretation: Each positive sample carries a weight of 10, about 19× the weight of a negative sample (10 / 0.53), so the 5 positives contribute as much to the loss as the 95 negatives
- Alternative: Some prefer weight = 1 / class_count, others use log scaling
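The balanced formula and the 1 / class_count alternative can be sketched as follows. The function names are illustrative. Log scaling is left out because the text does not specify which form is meant.

```python
from collections import Counter

def balanced_weights(labels):
    """weight = total_samples / (num_classes * class_count)"""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}

def inverse_count_weights(labels):
    """weight = 1 / class_count"""
    counts = Counter(labels)
    return {c: 1 / n for c, n in counts.items()}

labels = [0] * 95 + [1] * 5  # a 95/5 split
weights = balanced_weights(labels)
print({c: round(w, 2) for c, w in weights.items()})
```

The resulting dictionary can typically be passed as class_weight to scikit-learn estimators or to Keras model.fit, and converted to a tensor for the weight argument of torch.nn.CrossEntropyLoss. Scikit-learn's class_weight="balanced" option applies the same balanced formula.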