DevxSubh
I develop 3D visuals, user interfaces and web applications.
Agastya — Contract Understanding on CUAD (Phase 1 ML)


Agastya targets clause-level prediction on legal agreements using CUAD v1. The data is reshaped to long format, one row per contract × clause category, and train/validation splits are grouped by contract to prevent document leakage. Features are sparse TF-IDF vectors with an optional log-scaled clause-length column. The primary model is `LinearSVC` with `class_weight='balanced'`; Multinomial Naive Bayes provides a fast baseline. Evaluation emphasizes macro-F1 under class imbalance, alongside precision, recall, and confusion matrices. The work is notebook-auditable (`notebooks/Phase_1/` covers literature, EDA, feature engineering, and theory), with methodology captured in `progress.md`, `agent.md`, and `project.md`. The `src/` package layout reserves modules for future OCR, segmentation, models, reasoning, and reporting, while Phase 1 stays intentionally sklearn-centric.
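The modelling setup described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the project's notebook code: the clause texts are invented toy sentences rather than CUAD data, the vectorizer uses default settings, and the helper name `build_models` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for clause text and category labels (assumption: not CUAD data).
train_texts = [
    "This Agreement shall be governed by the laws of the State of New York.",
    "The laws of Delaware govern this Agreement.",
    "Either party may terminate this Agreement upon thirty days written notice.",
    "This Agreement may be terminated for convenience by the Customer.",
    "Neither party may assign this Agreement without prior written consent.",
    "Licensee shall not assign or transfer its rights hereunder.",
]
train_labels = [
    "Governing Law", "Governing Law",
    "Termination For Convenience", "Termination For Convenience",
    "Anti-Assignment", "Anti-Assignment",
]
val_texts = [
    "This Agreement is governed by the laws of California.",
    "Customer may terminate this Agreement for convenience upon written notice.",
    "Supplier shall not assign this Agreement without consent.",
]
val_labels = ["Governing Law", "Termination For Convenience", "Anti-Assignment"]


def build_models(seed=0):
    """Return the primary model and the baseline, each as a TF-IDF pipeline."""
    primary = make_pipeline(
        TfidfVectorizer(),  # default settings; the real hyperparameters are not stated
        LinearSVC(class_weight="balanced", random_state=seed),
    )
    baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
    return {"linear_svc": primary, "multinomial_nb": baseline}


scores = {}
for name, model in build_models().items():
    model.fit(train_texts, train_labels)
    preds = model.predict(val_texts)
    # Macro-F1 averages per-class F1, so every category counts equally.
    scores[name] = f1_score(val_labels, preds, average="macro")

for name, score in scores.items():
    print(f"{name}: macro-F1 = {score:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is fit on training rows only, which keeps validation text out of the feature statistics.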

Timeline

Semester-scale

Role

ML Research & Engineering

Team

Team of 2

Status

Completed

Technology Stack

Python, Jupyter, scikit-learn, pandas, NumPy, CUAD v1

Key Features

  • 510 contracts, 41 categories, 20,910 long-format rows, with documented EDA on class imbalance and heavy-tailed text lengths
  • Grouped split strategy aligned to real deployment constraints (no contract appears in both train and validation)
  • Ablation: TF-IDF vs TF-IDF + log-length; coefficient and sparsity inspection
  • Reproducible environment via `requirements.txt` and pinned random seeds where applicable
  • CUAD data vendored under `data/CUAD_v1/` for runnable clones
  • Roadmap for Phase 2+ (transformers, probabilistic reasoning, reporting UI) without compromising Phase 1 rigor
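The grouped split and the log-length ablation listed above might look roughly like the sketch below. It assumes `GroupShuffleSplit` as the splitter, character count as the length measure, and `log1p` as the log scaling; none of these specifics are confirmed by the project description. The contract IDs and texts are toy values, and `log_length_column` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupShuffleSplit

# Toy long-format rows: several clause rows per contract (assumption: not CUAD data).
contract_ids = np.array(["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4"])
texts = np.array([
    "Governed by the laws of New York.",
    "Neither party may assign this Agreement.",
    "Governed by the laws of Delaware.",
    "Assignment requires prior written consent of the other party.",
    "This Agreement is governed by English law.",
    "No assignment without consent.",
    "The laws of Texas apply to this Agreement and any dispute arising from it.",
    "Licensee shall not assign its rights.",
])

# Split by contract, not by row, so no contract lands on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(splitter.split(texts, groups=contract_ids))
assert set(contract_ids[train_idx]).isdisjoint(contract_ids[val_idx])

# Variant A: TF-IDF only, with the vocabulary fit on training rows.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(texts[train_idx])
X_val_tfidf = vectorizer.transform(texts[val_idx])


def log_length_column(batch):
    """Return log(1 + character count) as a one-column sparse matrix."""
    lengths = np.array([len(t) for t in batch], dtype=float)
    return csr_matrix(np.log1p(lengths).reshape(-1, 1))


# Variant B: TF-IDF plus one extra log-length column.
X_train_aug = hstack([X_train_tfidf, log_length_column(texts[train_idx])]).tocsr()
X_val_aug = hstack([X_val_tfidf, log_length_column(texts[val_idx])]).tocsr()

print("train contracts:", sorted(set(contract_ids[train_idx])))
print("val contracts:  ", sorted(set(contract_ids[val_idx])))
print("feature columns (A, B):", X_train_tfidf.shape[1], X_train_aug.shape[1])
```

Training the same classifier on variant A and variant B, then comparing validation macro-F1, is one way to run the ablation. The coefficient for the final column shows how much weight the model gives to length.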

Key Learnings

  • Legal NLP baselines: TF-IDF + linear models still teach a lot about data defects
  • Grouped splits as a default for multi-row-per-document problems
  • Communicating limitations of small validation sets for rare labels
  • Team research workflow with clear phase boundaries and documentation

Key Challenges

  • Severe class imbalance and rare categories on small validation folds
  • Heavy-tailed clause lengths and Yes/No vs free-form field heterogeneity
  • Keeping claims honest: illustrative macro-F1 on a single split vs report-grade cross-validation
  • Separating graded notebook work from unfinished pipeline code under `src/`
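A small worked example shows why rare labels on a small validation fold make single-split macro-F1 fragile. The labels and predictions below are invented for illustration; the functions compute per-class F1 directly from true-positive, false-positive, and false-negative counts, then average the classes with equal weight.

```python
def per_class_f1(y_true, y_pred):
    """Compute F1 for each label from TP/FP/FN counts."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical fold: 18 rows of a common category, only 2 of a rare one.
y_true = ["common"] * 18 + ["rare"] * 2
pred_miss = ["common"] * 20                 # both rare rows missed
pred_hit = ["common"] * 18 + ["rare", "common"]  # one rare row found

for name, preds in [("miss both", pred_miss), ("hit one", pred_hit)]:
    print(f"{name}: accuracy={accuracy(y_true, preds):.3f} "
          f"macro-F1={macro_f1(y_true, preds):.3f}")
# miss both: accuracy=0.900 macro-F1=0.474
# hit one: accuracy=0.950 macro-F1=0.820
```

A single rare-row prediction moves accuracy by 0.05 but macro-F1 by about 0.35. With so few rare rows, a macro-F1 figure from one split is better read as illustrative than as a stable estimate.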

Impact & Results

  • Reproducible Phase 1 baseline for CUAD clause identification
  • Public fork continuing Atticus CUAD lineage with citation expectations documented
  • Structured path from classical baselines toward hybrid AI described in roadmap

Future Enhancements

  • Grouped k-fold or multi-seed stability studies
  • Transformer encoders and calibration on held-out contracts
  • OCR-backed ingestion and clause segmentation pipeline in `src/`
  • End-user report generation UI for risk and clause presence summaries
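As a rough idea of what a grouped k-fold stability study could look like, the sketch below runs `GroupKFold` over toy rows and reports the spread of macro-F1 across folds. It is only an assumption about how such a study might be set up: the data is synthetic, the fold count is arbitrary, and the model settings simply mirror the Phase 1 description.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic long-format rows: 6 contracts x 2 categories (assumption: not CUAD data).
templates = {
    "Governing Law": "This Agreement is governed by the laws of state number {i}.",
    "Anti-Assignment": "Neither party may assign this Agreement without consent, section {i}.",
}
rows = [
    (f"contract_{i}", label, text.format(i=i))
    for i in range(6)
    for label, text in templates.items()
]
groups = np.array([r[0] for r in rows])
labels = np.array([r[1] for r in rows])
texts = np.array([r[2] for r in rows])

fold_scores = []
for train_idx, val_idx in GroupKFold(n_splits=3).split(texts, labels, groups):
    # Each fold holds out whole contracts, never individual rows.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
    model = make_pipeline(
        TfidfVectorizer(),
        LinearSVC(class_weight="balanced", random_state=0),
    )
    model.fit(texts[train_idx], labels[train_idx])
    preds = model.predict(texts[val_idx])
    fold_scores.append(f1_score(labels[val_idx], preds, average="macro"))

mean_f1 = float(np.mean(fold_scores))
std_f1 = float(np.std(fold_scores))
print(f"macro-F1 over {len(fold_scores)} grouped folds: {mean_f1:.3f} ± {std_f1:.3f}")
```

Reporting the mean together with the fold-to-fold spread gives a more defensible number than a single split, especially for categories with few positive rows.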