Agastya — Contract Understanding on CUAD (Phase 1 ML)

Agastya predicts clause-level labels on legal agreements using CUAD v1. The data is reshaped into long format (one row per contract × clause category), train/validation splits are grouped by contract to prevent document leakage, and features are sparse TF-IDF vectors with an optional log-scaled clause-length feature. The primary model is `LinearSVC` with `class_weight='balanced'`; Multinomial Naive Bayes provides a fast baseline. Evaluation emphasizes macro-F1 under class imbalance, alongside precision, recall, and confusion-matrix views. The work is notebook-auditable (`notebooks/Phase_1/`: literature, EDA, feature engineering, theory), with methodology captured in `progress.md`, `agent.md`, and `project.md`. The `src/` package layout reserves modules for future OCR, segmentation, models, reasoning, and reporting, while Phase 1 stays intentionally sklearn-centric.
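As a rough illustration of the model comparison described above, the following sketch fits both classifiers on TF-IDF features and reports macro-F1. The toy sentences, the three category labels, the vectorizer settings, and the helper name `macro_f1_for` are all assumptions made for this example; they are not taken from the project notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical); real rows would come from CUAD v1.
train_texts = [
    "This Agreement shall be governed by the laws of the State of New York.",
    "The governing law of this contract is the law of England and Wales.",
    "Either party may terminate this Agreement for convenience on thirty days notice.",
    "Customer may terminate for convenience at any time upon written notice.",
    "Licensee shall not assign this Agreement without prior written consent.",
    "Neither party may assign its rights hereunder without consent.",
]
train_labels = [
    "Governing Law", "Governing Law",
    "Termination For Convenience", "Termination For Convenience",
    "Anti-Assignment", "Anti-Assignment",
]
val_texts = [
    "This contract is governed by the laws of the State of Delaware.",
    "Customer may terminate this Agreement for convenience at any time.",
    "This Agreement may not be assigned without written consent.",
]
val_labels = ["Governing Law", "Termination For Convenience", "Anti-Assignment"]


def macro_f1_for(model):
    """Fit TF-IDF + model on the toy training rows; return validation macro-F1."""
    pipe = make_pipeline(TfidfVectorizer(), model)
    pipe.fit(train_texts, train_labels)
    return f1_score(val_labels, pipe.predict(val_texts), average="macro")


scores = {
    # class_weight='balanced' reweights classes inversely to their frequency.
    "LinearSVC": macro_f1_for(LinearSVC(class_weight="balanced", random_state=0)),
    "MultinomialNB": macro_f1_for(MultinomialNB()),
}
for name, score in scores.items():
    print(f"{name}: macro-F1 = {score:.3f}")
```

Macro-F1 averages the per-class F1 scores with equal weight, so a rare category counts as much as a frequent one. That is the reason it is preferred over accuracy when classes are imbalanced.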

Timeline

Semester-scale

Role

ML Research & Engineering

Team

Team of 2

Status

Completed

Technology Stack

Python, Jupyter, scikit-learn, pandas, NumPy, CUAD v1

Key Features

510 contracts × 41 categories = 20,910 long-format rows, with documented EDA on class imbalance and heavy-tailed clause lengths
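The long-format layout can be sketched with `pandas.DataFrame.melt`, assuming a wide table with one row per contract and one column per clause category. The column names `contract_id`, `category`, `clause_text`, and `has_clause` are hypothetical and chosen for this example only.

```python
import pandas as pd

# Hypothetical wide table: one row per contract, one column per category.
wide = pd.DataFrame(
    {
        "contract_id": ["c1", "c2", "c3"],
        "Governing Law": ["laws of New York", "", "laws of Delaware"],
        "Anti-Assignment": ["", "may not assign without consent", ""],
    }
)

# Melt to long format: one row per (contract, category) pair.
long = wide.melt(
    id_vars="contract_id", var_name="category", value_name="clause_text"
)
# Flag whether the contract contains any text for that category.
long["has_clause"] = long["clause_text"].str.len() > 0

n_categories = wide.shape[1] - 1
print(len(long), "rows =", len(wide), "contracts x", n_categories, "categories")
```

The same arithmetic gives the figure quoted above: 510 contracts × 41 categories = 20,910 rows.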

Grouped split strategy aligned to real deployment constraints (no contract appears in both train and validation)
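One way to get such a split is scikit-learn's `GroupShuffleSplit`, using the contract identifier as the group key. This is a minimal sketch on synthetic identifiers; the split ratio, the seed, and the variable names are assumptions, and the project may use a different splitter.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 10 contracts with 4 long-format rows each.
contract_ids = np.repeat(np.arange(10), 4)
X = np.arange(len(contract_ids)).reshape(-1, 1)  # placeholder features

# All rows of a contract land on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, groups=contract_ids))

train_contracts = set(contract_ids[train_idx])
val_contracts = set(contract_ids[val_idx])
print("train contracts:", len(train_contracts))
print("validation contracts:", len(val_contracts))
print("overlap:", train_contracts & val_contracts)
```

A plain row-level shuffle would scatter rows of the same contract across both sets, letting the model see validation documents during training and inflating the reported scores.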

Ablation: TF-IDF vs TF-IDF + log-length; coefficient and sparsity inspection
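The two feature sets in the ablation can be sketched by appending a single log-length column to the sparse TF-IDF matrix. Measuring length in whitespace tokens and using `log1p` are assumptions for this example; the source states only that the length is log-scaled.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy clause texts (hypothetical).
texts = [
    "This Agreement shall be governed by the laws of New York.",
    "Neither party may assign this Agreement.",
    "Either party may terminate upon thirty days written notice to the other party.",
]

# Variant A: TF-IDF only.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(texts)

# Variant B: TF-IDF plus one log-scaled length column.
# log1p compresses the heavy tail and is defined for zero-length text.
log_len = np.log1p([len(t.split()) for t in texts]).reshape(-1, 1)
X_combined = hstack([X_tfidf, csr_matrix(log_len)]).tocsr()

# Sparsity check: fraction of non-zero entries in the TF-IDF matrix.
density = X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1])
print("TF-IDF shape:", X_tfidf.shape, "density:", round(density, 3))
print("combined shape:", X_combined.shape)
```

Keeping the combined matrix sparse lets the same linear model be trained on either variant, so a difference in validation score can be attributed to the extra column.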

Reproducible environment via `requirements.txt` and pinned random seeds where applicable

CUAD data vendored under `data/CUAD_v1/` for runnable clones

Roadmap for Phase 2+ (transformers, probabilistic reasoning, reporting UI) without compromising Phase 1 rigor

Key Learnings

Legal NLP baselines: TF-IDF + linear models still teach a lot about data defects
Grouped splits as a default for multi-row-per-document problems
Communicating limitations of small validation sets for rare labels
Team research workflow with clear phase boundaries and documentation

Key Challenges

Severe class imbalance and rare categories on small validation folds
Heavy-tailed clause lengths and Yes/No vs free-form field heterogeneity
Keeping claims honest: illustrative macro-F1 on a single split vs report-grade cross-validation
Separating graded notebook work from unfinished pipeline code under `src/`

Impact & Results

Reproducible Phase 1 baseline for CUAD clause identification

Public fork continuing Atticus CUAD lineage with citation expectations documented

Structured path from classical baselines toward hybrid AI described in roadmap

Future Enhancements

Grouped k-fold or multi-seed stability studies
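A grouped k-fold stability study could look roughly like the sketch below, which reports the mean and spread of macro-F1 across folds instead of a single-split number. The synthetic templates, the fold count, and the category labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic data (hypothetical): 12 contracts, one row per category each.
templates = {
    "Governing Law": "this agreement is governed by the laws of state {i}",
    "Anti-Assignment": "party {i} may not assign this agreement without consent",
    "Audit Rights": "party {i} may audit the books and records on notice",
}
texts, labels, groups = [], [], []
for contract in range(12):
    for label, template in templates.items():
        texts.append(template.format(i=contract))
        labels.append(label)
        groups.append(contract)
texts, labels, groups = np.array(texts), np.array(labels), np.array(groups)

fold_scores = []
# GroupKFold keeps every contract's rows inside a single fold.
for train_idx, val_idx in GroupKFold(n_splits=4).split(texts, labels, groups):
    pipe = make_pipeline(
        TfidfVectorizer(),
        LinearSVC(class_weight="balanced", random_state=0),
    )
    pipe.fit(texts[train_idx], labels[train_idx])
    preds = pipe.predict(texts[val_idx])
    fold_scores.append(f1_score(labels[val_idx], preds, average="macro"))

print(f"macro-F1: mean={np.mean(fold_scores):.3f}, std={np.std(fold_scores):.3f}")
```

On real CUAD data the per-fold spread is the quantity of interest: a large standard deviation would suggest that a single-split macro-F1 is mostly reflecting which rare categories happened to land in validation.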

Transformer encoders and calibration on held-out contracts

OCR-backed ingestion and clause segmentation pipeline in `src/`

End-user report generation UI for risk and clause presence summaries