
Agastya — Contract Understanding on CUAD (Phase 1 ML)
Agastya targets clause-level prediction on legal agreements using CUAD v1. The data is held in long format, with one row per contract × clause category, and train/validation splits are grouped by contract to prevent document leakage. Features are sparse TF-IDF vectors with an optional log-scaled clause length. The primary model is `LinearSVC` with `class_weight='balanced'`; Multinomial Naive Bayes provides a fast baseline. Evaluation emphasizes macro-F1 under class imbalance, alongside precision, recall, and confusion-matrix views. The work is notebook-auditable (`notebooks/Phase_1/` covers literature, EDA, feature engineering, and theory), with methodology captured in `progress.md`, `agent.md`, and `project.md`. The `src/` package layout reserves modules for future OCR, segmentation, models, reasoning, and reporting, while Phase 1 stays intentionally sklearn-centric.
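The setup described above can be sketched roughly as follows. This is a minimal illustration and not the project's notebook code: the toy rows, the `build_features` helper, the use of `GroupShuffleSplit`, and the choice of token count as the "clause length" are all assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in for long-format rows (one row per contract x clause category).
# The templates and contract ids are invented for illustration, not CUAD data.
templates = {
    "Governing Law": "this agreement is governed by the laws of the state",
    "Termination": "either party may terminate this agreement upon written notice",
    "Non-Compete": "the licensee shall not compete with the licensor",
}
rows = [
    (f"contract_{i}", f"{text} under section {i}", label)
    for i in range(6)
    for label, text in templates.items()
]
groups = np.array([r[0] for r in rows])
texts = np.array([r[1] for r in rows])
labels = np.array([r[2] for r in rows])

# Grouped split: every row of a given contract lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, val_idx = next(splitter.split(texts, labels, groups=groups))


def build_features(batch, vectorizer, fit=False):
    """Sparse TF-IDF plus one log-scaled token-count column (hypothetical helper)."""
    tfidf = vectorizer.fit_transform(batch) if fit else vectorizer.transform(batch)
    log_len = np.log1p([len(t.split()) for t in batch]).reshape(-1, 1)
    return hstack([tfidf, csr_matrix(log_len)]).tocsr()


vectorizer = TfidfVectorizer()
X_train = build_features(texts[train_idx], vectorizer, fit=True)  # fit on train only
X_val = build_features(texts[val_idx], vectorizer)

models = {
    "LinearSVC": LinearSVC(class_weight="balanced"),
    "MultinomialNB": MultinomialNB(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, labels[train_idx])
    preds = model.predict(X_val)
    scores[name] = f1_score(labels[val_idx], preds, average="macro")
    print(f"{name}: macro-F1 = {scores[name]:.3f}")
```

Both TF-IDF weights and `log1p` of a count are non-negative, which is why the same feature matrix can be fed to `MultinomialNB` as well as `LinearSVC`. Scores on a toy set like this say nothing about real CUAD performance.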
Timeline
Semester-scale
Role
ML Research & Engineering
Team
Team of 2
Status
Technology Stack
Key Features
Key Learnings
- Legal NLP baselines: TF-IDF + linear models still teach a lot about data defects
- Grouped splits as a default for multi-row-per-document problems
- Communicating limitations of small validation sets for rare labels
- Team research workflow with clear phase boundaries and documentation
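One way to communicate the small-validation-set caveat is to report per-class support next to each metric and flag thin classes. The sketch below uses made-up labels and predictions, and the `MIN_SUPPORT` threshold is an arbitrary assumption for the example.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical validation labels and predictions; not results from the project.
categories = ["Governing Law", "Termination", "Non-Compete"]
y_true = ["Governing Law"] * 8 + ["Termination"] * 3 + ["Non-Compete"] * 1
y_pred = (
    ["Governing Law"] * 7 + ["Termination"]      # 8 true Governing Law rows
    + ["Termination"] * 2 + ["Governing Law"]    # 3 true Termination rows
    + ["Governing Law"]                          # 1 true Non-Compete row
)

MIN_SUPPORT = 5  # assumed cut-off below which a per-class score is flagged

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=categories, zero_division=0
)

low_support = [c for c, n in zip(categories, support) if n < MIN_SUPPORT]

for c, p, r, f, n in zip(categories, precision, recall, f1, support):
    flag = " (low support)" if n < MIN_SUPPORT else ""
    print(f"{c:<15} P={p:.2f} R={r:.2f} F1={f:.2f} n={n}{flag}")
```

With a single validation example, a class's F1 can only be 0 or 1, so macro-F1 swings heavily on one prediction; printing `n` makes that visible to the reader.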
Key Challenges
- Severe class imbalance and rare categories on small validation folds
- Heavy-tailed clause lengths and Yes/No vs free-form field heterogeneity
- Keeping claims honest: illustrative macro-F1 on a single split vs report-grade cross-validation
- Separating graded notebook work from unfinished pipeline code under `src/`
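The gap between a single-split macro-F1 and a cross-validated figure can be illustrated with grouped k-fold scoring. This is a sketch under assumptions: the toy rows are invented, and `GroupKFold` with three folds is one possible choice, not necessarily what the project uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented long-format rows: six toy contracts, three clause categories each.
templates = {
    "Governing Law": "this agreement is governed by the laws of the state",
    "Termination": "either party may terminate this agreement upon written notice",
    "Non-Compete": "the licensee shall not compete with the licensor",
}
texts, labels, groups = [], [], []
for i in range(6):
    for label, text in templates.items():
        texts.append(f"{text} under section {i}")
        labels.append(label)
        groups.append(f"contract_{i}")

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so validation text never influences the vocabulary or IDF.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))

fold_scores = cross_val_score(
    pipeline,
    texts,
    labels,
    groups=groups,               # contracts are never split across folds
    cv=GroupKFold(n_splits=3),
    scoring="f1_macro",
)
print(f"macro-F1 per fold: {np.round(fold_scores, 3)}")
print(f"mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

Reporting the mean together with the spread across folds is what separates a report-grade number from an illustrative one. For heavily imbalanced labels, scikit-learn's `StratifiedGroupKFold` may be a better fit, since it also tries to preserve class proportions per fold.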