Overview
The Document Extraction feature uses Large Language Models (LLMs) to extract structured, CDM-compliant data from unstructured credit agreements. It accepts several input modes: PDF upload, text input, audio transcription, and image OCR.
Code Reference: client/src/apps/docu-digitizer/DocumentParser.tsx, app/chains/extraction_chain.py, app/api/routes.py (extract endpoints)
Key Features
Multi-Modal Input
- PDF Upload: Direct PDF file upload and processing
- Text Input: Paste or type document text
- Audio Transcription: Speech-to-text transcription using Gradio Spaces
- Image OCR: Optical character recognition from images
- Document Library: Select from previously extracted documents
Code Reference: client/src/apps/docu-digitizer/MultimodalInputTabs.tsx
LLM Extraction
- Simple Extraction: For documents under 50k characters
- Map-Reduce Extraction: For larger documents (>50k characters)
- Multiple Providers: OpenAI GPT-4o, vLLM, HuggingFace
- Structured Output: CDM-compliant structured data extraction
Code Reference: app/chains/extraction_chain.py, app/chains/map_reduce_chain.py
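The choice between the two extraction modes can be sketched as a simple length check. This is a minimal illustration, not the code in app/chains: the function name and constant are hypothetical, and because the text does not say which mode handles a document of exactly 50,000 characters, the sketch assumes that case goes to map-reduce.

```python
# Threshold taken from the feature list above (50k characters).
SIMPLE_EXTRACTION_LIMIT = 50_000


def choose_extraction_strategy(document_text: str) -> str:
    """Return "simple" for short documents and "map_reduce" for longer ones.

    Hypothetical helper: the boundary case (exactly 50,000 characters) is
    an assumption, since the documentation leaves it unspecified.
    """
    if len(document_text) < SIMPLE_EXTRACTION_LIMIT:
        return "simple"
    return "map_reduce"
```

Keeping the threshold in one named constant makes it easy to adjust if the model's context window changes.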
CDM Compliance
- Automatic Validation: Policy compliance validation at extraction point
- CDM Models: CreditAgreement, Party, LoanFacility models
- Event Generation: Automatic CDM event generation
- Policy Enforcement: Real-time policy evaluation
Code Reference: app/models/cdm.py, app/models/cdm_events.py
Workflow
1. Input Document
- Choose Input Method: PDF upload, text paste, audio, or image
- Upload/Enter Content: Provide document content
- Review Input: Verify content is correct
- Proceed to Extraction: Click the "Extract" button
2. Extraction Process
- LLM Processing: System processes document with selected LLM
- Structured Extraction: Extracts CDM-compliant data
- Policy Evaluation: Evaluates against policy rules
- Validation: Validates extracted data
- Result Display: Shows extracted data with edit capability
3. Review and Save
- Review Data: Review extracted CDM data
- Edit if Needed: Use clause editor or CDM accordion editor
- Save to Library: Save document to library
- Broadcast FDC3: Broadcast the extracted data to desktop applications over FDC3
- Generate Documents: Option to generate LMA templates
API Endpoints
Extract Document
Extract structured data from document text.
Request Body: see app/api/routes.py (extract endpoint)
Upload Document
Upload a document file and extract data from it.
Request: Multipart form data with file
Response: Document record with extracted CDM data
Extraction Chain
Simple Extraction Chain
Used for documents under 50k characters.
Code Reference: app/chains/extraction_chain.py
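As a rough sketch of what a simple extraction might look like, the example below sends the whole document to the model in one call and parses a JSON reply. It assumes a single-pass design and does not reproduce extraction_chain.py: the prompt wording, the `simple_extract` function, and the `fake_llm` stub are all hypothetical, and the model is passed in as a plain callable so the sketch does not depend on any provider SDK.

```python
import json
from typing import Callable

# Hypothetical prompt; the real chain's prompt is not shown in this document.
PROMPT_TEMPLATE = (
    "Extract the parties and loan facilities from the credit agreement below.\n"
    "Respond with a single JSON object.\n\n"
    "{document}"
)


def simple_extract(document_text: str, llm: Callable[[str], str]) -> dict:
    """Run one LLM call over the whole document and parse the JSON reply."""
    prompt = PROMPT_TEMPLATE.format(document=document_text)
    raw_reply = llm(prompt)
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError("LLM reply was not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("LLM reply was JSON but not an object")
    return data


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the sketch."""
    return '{"parties": [{"name": "Acme Corp", "role": "Borrower"}]}'


result = simple_extract("This Credit Agreement is made between ...", fake_llm)
```

Injecting the model as a callable keeps the parsing and error handling testable without network access.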
Map-Reduce Chain
For larger documents:
- Map Phase: Split document into chunks
- Extract Phase: Extract from each chunk
- Reduce Phase: Combine extractions into final result
Code Reference: app/chains/map_reduce_chain.py
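The three phases above can be sketched as follows. This is an illustration under stated assumptions rather than the contents of map_reduce_chain.py: the function names, the fixed-size character chunking, the `parties`/`facilities` keys, and the rule of de-duplicating parties by name are all hypothetical. The per-chunk extractor is passed in as a callable, so any LLM call can be substituted.

```python
from typing import Callable


def split_into_chunks(text: str, chunk_size: int = 50_000) -> list[str]:
    """Map phase: cut the document into consecutive fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def merge_extractions(partials: list[dict]) -> dict:
    """Reduce phase: combine per-chunk results into one record.

    Assumed merge rule: parties are de-duplicated by case-insensitive name,
    facilities are simply concatenated.
    """
    merged: dict = {"parties": [], "facilities": []}
    seen_parties: set[str] = set()
    for partial in partials:
        for party in partial.get("parties", []):
            key = party["name"].strip().lower()
            if key not in seen_parties:
                seen_parties.add(key)
                merged["parties"].append(party)
        merged["facilities"].extend(partial.get("facilities", []))
    return merged


def map_reduce_extract(
    text: str,
    extract_chunk: Callable[[str], dict],
    chunk_size: int = 50_000,
) -> dict:
    """Split the text, extract from each chunk, then merge the results."""
    chunks = split_into_chunks(text, chunk_size)
    partials = [extract_chunk(chunk) for chunk in chunks]  # extract phase
    return merge_extractions(partials)
```

Splitting on raw character offsets can cut a clause in half; a production chain would more likely split on section or paragraph boundaries, which this sketch leaves out for brevity.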
CDM Data Structure
Extracted data follows the CDM structure.
Code Reference: app/models/cdm.py (CreditAgreement model)
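To give a feel for the shape of the extracted data, here is a minimal sketch built on the three model names mentioned in this document (CreditAgreement, Party, LoanFacility). Every field name and the `total_commitment` helper are assumptions made for illustration; the actual models in app/models/cdm.py may use a different library and different fields.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Party:
    name: str
    role: str  # assumed field, e.g. "Borrower" or "Lender"


@dataclass
class LoanFacility:
    facility_name: str          # assumed field
    commitment_amount: Decimal  # assumed field
    currency: str               # assumed field, ISO 4217 code such as "USD"


@dataclass
class CreditAgreement:
    parties: list[Party] = field(default_factory=list)
    facilities: list[LoanFacility] = field(default_factory=list)

    def total_commitment(self, currency: str) -> Decimal:
        """Sum commitments across facilities denominated in one currency."""
        return sum(
            (f.commitment_amount for f in self.facilities if f.currency == currency),
            Decimal("0"),
        )


agreement = CreditAgreement(
    parties=[Party("Acme Corp", "Borrower"), Party("Example Bank", "Lender")],
    facilities=[
        LoanFacility("Term Loan A", Decimal("100"), "USD"),
        LoanFacility("Revolver", Decimal("50"), "USD"),
    ],
)
```

Amounts are held as `Decimal` rather than `float` to avoid rounding drift in monetary values.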
Policy Integration
All extractions are evaluated against policy rules:
- Sanctions Screening: Check for sanctioned parties
- Credit Risk: Evaluate creditworthiness
- ESG Compliance: Check ESG requirements
- Regulatory Filings: Identify filing requirements
Code Reference: app/services/policy_service.py
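One way to picture rule-based evaluation of an extraction is a list of small rule functions, each returning a decision. The sketch below is hypothetical throughout: the `PolicyResult` type, the ALLOW/FLAG/BLOCK labels, the rule logic, and the placeholder sanctions list are assumptions, and nothing here reflects the actual interface of policy_service.py.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PolicyResult:
    rule: str
    decision: str  # assumed labels: "ALLOW", "FLAG", or "BLOCK"
    reason: str


# Placeholder entry for illustration only, not real sanctions data.
SANCTIONED_NAMES = {"blocked trading ltd"}


def sanctions_screening(agreement: dict) -> PolicyResult:
    """Block the agreement if any party name is on the placeholder list."""
    for party in agreement.get("parties", []):
        if party["name"].strip().lower() in SANCTIONED_NAMES:
            return PolicyResult(
                "sanctions_screening", "BLOCK", f"{party['name']} is sanctioned"
            )
    return PolicyResult("sanctions_screening", "ALLOW", "no sanctioned parties")


def esg_compliance(agreement: dict) -> PolicyResult:
    """Flag sustainability-linked loans that list no ESG KPIs (assumed rule)."""
    if agreement.get("sustainability_linked") and not agreement.get("esg_kpis"):
        return PolicyResult("esg_compliance", "FLAG", "no ESG KPIs defined")
    return PolicyResult("esg_compliance", "ALLOW", "ESG check passed")


Rule = Callable[[dict], PolicyResult]


def evaluate_policies(agreement: dict, rules: list[Rule]) -> list[PolicyResult]:
    """Run every rule against the extracted agreement."""
    return [rule(agreement) for rule in rules]


def overall_decision(results: list[PolicyResult]) -> str:
    """Return the most severe decision across all rule results."""
    severity = {"ALLOW": 0, "FLAG": 1, "BLOCK": 2}
    if not results:
        return "ALLOW"
    return max(results, key=lambda r: severity[r.decision]).decision
```

Taking the most severe result as the overall decision means a single BLOCK overrides any number of ALLOW results, which is a conservative default for compliance checks.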
User Interface
Document Parser
Location: client/src/apps/docu-digitizer/DocumentParser.tsx
Features:
- Multimodal Input Tabs: PDF, text, audio, image input
- Extraction Results: Display extracted CDM data
- Clause Editor: Edit individual clauses
- CDM Accordion Editor: Edit CDM structure
- FDC3 Broadcast: Broadcast to desktop apps
- Save to Library: Save extracted documents
Best Practices
- Quality Input: Ensure document text is clear and complete
- Review Results: Always review extracted data for accuracy
- Edit as Needed: Use editors to correct any errors
- Policy Compliance: Verify policy evaluation results
- Save Documents: Save to library for future reference
Additional Resources
Last Updated: 2026-01-14
Code Reference: client/src/apps/docu-digitizer/DocumentParser.tsx, app/chains/extraction_chain.py, app/api/routes.py