Extract: Structured data
Unstructured documents into structured data
When the team uploads a customer agreement, cap table, and monthly financial pack, they do not want three unread files. They want counterparties, change-of-control language, ownership positions, and concentration signals surfaced fast enough to decide what deserves human attention first.
Colabra does just that. A contract is read for clauses and terms. A financial statement is parsed for line items, ratios, and QoE metrics. A cap table is analysed for ownership positions.
Contracts and clauses

For each contract, Colabra extracts individual clauses along with the following details:
- Parties — who is bound by the contract, including counterparties and guarantors.
- Dates — effective date, expiry date, renewal terms.
- Change of control — whether the contract requires consent, provides termination rights, or is silent on change of control.
- Termination rights — termination for convenience, termination for cause, notice periods.
- Indemnity provisions — caps, baskets, survival periods, carve-outs.
- Non-compete and non-solicit — scope, duration, geography.
- Assignment clauses — whether assignment is permitted, restricted, or requires consent.
This clause-level detail powers the downstream risk register. A change-of-control clause that requires consent becomes a finding when your guardrails flag assignment issues. An indemnity cap becomes a finding when it falls below the minimum percentage of deal size set in your guardrails.
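As a rough illustration of how clause-level fields could be turned into findings, here is a minimal Python sketch. It is not Colabra's implementation: the `ExtractedContract` and `Guardrails` classes, their field names, and the 10% default are all hypothetical and exist only to show the shape of the check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractedContract:
    # Hypothetical shape of one extracted contract; not Colabra's schema.
    title: str
    change_of_control: str  # "consent_required", "termination_right", or "silent"
    indemnity_cap_pct: Optional[float]  # cap as a % of deal size, None if not stated


@dataclass
class Guardrails:
    # Hypothetical guardrail settings, chosen for illustration only.
    flag_assignment_issues: bool = True
    min_indemnity_cap_pct: float = 10.0


def findings_for(contract: ExtractedContract, guardrails: Guardrails) -> List[str]:
    """Return the findings that this contract raises under the given guardrails."""
    findings = []
    if guardrails.flag_assignment_issues and contract.change_of_control == "consent_required":
        findings.append(f"{contract.title}: change of control requires consent")
    cap = contract.indemnity_cap_pct
    if cap is not None and cap < guardrails.min_indemnity_cap_pct:
        findings.append(
            f"{contract.title}: indemnity cap of {cap}% is below the "
            f"{guardrails.min_indemnity_cap_pct}% minimum"
        )
    return findings


msa = ExtractedContract("Master services agreement", "consent_required", 5.0)
nda = ExtractedContract("Mutual NDA", "silent", None)

msa_findings = findings_for(msa, Guardrails())  # two findings
nda_findings = findings_for(nda, Guardrails())  # no findings
```

The same extracted clause produces a finding only when a guardrail asks for it, so with `flag_assignment_issues=False` the consent requirement stays an observation.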
Cross-border contracts are part of this same workflow, not a separate manual lane. Colabra can normalise non-English document titles into English or Latin script for the file list, translate non-English contract clauses into English for review, and compare monetary facts against the deal base currency. See Cross-border deals for the full workflow.
Financial statements, QoE, and working capital

Financial extraction covers multiple document types:
- Financial statements — income statement, balance sheet, and cash flow data. Revenue, EBITDA, gross margin, and working capital are extracted and structured.
- EBITDA bridges and adjustments — the quality of earnings view surfaces adjustments, normalizations, and one-time items that affect the true earnings picture.
- Working capital analysis — short-term assets vs. short-term liabilities, with trend analysis across periods.
- Cash conversion — operating cash flow relative to EBITDA, highlighting whether reported earnings translate to actual cash.
- AR/AP aging — receivables and payables broken down by aging bucket (30, 60, 90+ days), with concentration and trend analysis.
- Bank statements — account activity extraction for cash verification.
- Tax filings — tax return data for cross-checking reported performance, tax position, and filing history.
- Tieout support — reconciliation documents that connect management accounts to audited statements.
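The cash conversion metric in the list above is a simple ratio. The sketch below is an assumption about how it could be computed, not Colabra's code; the function name and the figures are invented for illustration.

```python
from typing import Optional


def cash_conversion(operating_cash_flow: float, ebitda: float) -> Optional[float]:
    """Operating cash flow divided by EBITDA.

    Returns None when EBITDA is zero or negative, because the ratio
    is not meaningful in that case.
    """
    if ebitda <= 0:
        return None
    return operating_cash_flow / ebitda


# Hypothetical figures: (operating cash flow, EBITDA) per period.
periods = {
    "FY2022": (9.0, 10.0),
    "FY2023": (6.0, 12.0),
}

ratios = {name: cash_conversion(ocf, ebitda) for name, (ocf, ebitda) in periods.items()}
```

In this made-up example EBITDA grows while the ratio falls, which is the kind of pattern that suggests reported earnings are not turning into cash.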
Financial thresholds in your diligence guardrails determine which observations become findings. If your guardrails set a 15% threshold for receivables aged 90+ days and the target's share exceeds it, the observation surfaces as a flag.
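The AR aging example can be worked through in a few lines of Python. This is a sketch under stated assumptions: the bucket labels, the balances, and the function name are hypothetical, and only the 15% threshold comes from the example above.

```python
from typing import Dict

AR_90_PLUS_THRESHOLD = 0.15  # 15% guardrail from the example above


def share_over_90_days(aging_buckets: Dict[str, float]) -> float:
    """Share of total receivables that sits in the 90+ day bucket."""
    total = sum(aging_buckets.values())
    if total == 0:
        return 0.0
    return aging_buckets.get("90+", 0.0) / total


# Hypothetical receivables balances by aging bucket.
buckets = {"0-30": 600_000, "31-60": 150_000, "61-90": 70_000, "90+": 180_000}

share = share_over_90_days(buckets)        # 180,000 of 1,000,000 in total
is_flagged = share > AR_90_PLUS_THRESHOLD  # True: 18% exceeds the 15% threshold
```

A target with the same total receivables but only 100,000 in the 90+ bucket would sit at 10% and would not be flagged.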
Cap tables and equity

Cap table extraction produces:
- Ownership snapshots — who holds what percentage, across common, preferred, and other security classes.
- Security classes — the structure of the capitalisation, including conversion terms and preferences.
- Vesting schedules — employee equity vesting details, cliff dates, and acceleration triggers.
- Equity grant ledgers — individual grant records with dates, amounts, and vesting status.
Colabra distinguishes between additive rows (positions that count toward total ownership) and informational rows (subtotals, notes, or representative entries). This distinction matters for accurate rollup calculations — informational rows are displayed but never included in ownership totals.
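To see why the distinction matters, consider a minimal rollup sketch. The `CapTableRow` class, its `additive` field, and the share counts are hypothetical and are not Colabra's data model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CapTableRow:
    # Hypothetical row shape for illustration.
    label: str
    shares: int
    additive: bool  # False for subtotals, notes, and other informational rows


def ownership_percentages(rows: List[CapTableRow]) -> Dict[str, float]:
    """Ownership percentage per holder, computed from additive rows only."""
    total = sum(row.shares for row in rows if row.additive)
    if total == 0:
        return {}
    return {row.label: 100 * row.shares / total for row in rows if row.additive}


rows = [
    CapTableRow("Founder A", 4_000_000, additive=True),
    CapTableRow("Founder B", 2_000_000, additive=True),
    CapTableRow("Founders subtotal", 6_000_000, additive=False),
    CapTableRow("Series A investor", 4_000_000, additive=True),
]

percentages = ownership_percentages(rows)
# Founder A: 40.0, Founder B: 20.0, Series A investor: 40.0
```

If the subtotal row were counted, the total would be 16,000,000 shares instead of 10,000,000 and every holder's percentage would be understated.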
Entity structure

Entity-structure extraction turns corporate records into a usable graph:
- Parent and subsidiary mapping — how legal entities connect across the group.
- Ownership chains — intermediate holdings, minority stakes, and control paths.
- Jurisdictions — where each legal entity sits and which territories matter.
- Structure summaries — perimeter maps, subsidiary trees, and legal-entity overviews.
This is what powers the entity map. Instead of rebuilding the structure manually from filings, org charts, and schedules, your team starts from a graph that already links entities back to the source evidence.
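One thing a graph representation makes easy is tracing ownership through intermediate holdings. The sketch below multiplies stakes along each path and sums across paths, a common way to estimate effective ownership. It is an illustration only: the entity names, the stakes, and the dictionary layout are invented, and it assumes the structure has no circular holdings.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: parent -> list of (child, directly held fraction).
Holdings = Dict[str, List[Tuple[str, float]]]


def effective_ownership(holdings: Holdings, owner: str, target: str) -> float:
    """Fraction of `target` held by `owner`, directly and through subsidiaries.

    Multiplies stakes along each ownership path and sums over all paths.
    Assumes the structure is acyclic (no cross-holdings).
    """
    if owner == target:
        return 1.0
    return sum(
        stake * effective_ownership(holdings, child, target)
        for child, stake in holdings.get(owner, [])
    )


holdings: Holdings = {
    "TopCo": [("HoldCo", 1.0), ("OpCo UK", 0.2)],
    "HoldCo": [("OpCo UK", 0.5), ("OpCo DE", 1.0)],
}

# 0.2 held directly plus 1.0 * 0.5 held through HoldCo
topco_in_opco_uk = effective_ownership(holdings, "TopCo", "OpCo UK")
```

Effective ownership is an economic measure. Control paths can differ from it, for example where voting rights are not proportional to shareholding, so the source documents still need to be read for control questions.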
IP, compliance, communications, and dependencies

Beyond contracts and financials, extraction covers:
- IP portfolio — patents, trademarks, registered copyrights, trade secrets, and licensing agreements.
- Compliance findings — regulatory issues, permit gaps, and compliance observations extracted from filings and correspondence.
- Communication threads — email thread analysis including participants, topics, and key decisions.
- Meeting materials — structured notes from board packs, management updates, and committee discussions.
- Code dependencies — third-party libraries, package risk, vendor SDKs, and supply chain concentration.
Each of these families supports a different diligence question:
- IP extraction helps the team understand ownership, registrations, encumbrances, and licensing exposure.
- Compliance extraction surfaces policy gaps, permit issues, and regulatory exceptions.
- Communication extraction captures decisions, commitments, and red flags buried in email and meeting materials.
- Dependency extraction helps software diligence teams identify third-party exposure and package concentration.
Main extraction targets
| Data type | What you get |
|---|---|
| Clauses and terms | Key provisions, change-of-control rights, termination, indemnities, non-competes |
| Financial statements | Revenue, EBITDA, cash flow, balance sheet line items, management adjustments |
| Quality of earnings | Adjusted EBITDA bridges, working capital, cash conversion, concentration metrics |
| Cap tables | Share classes, holder positions, dilution, ownership percentages |
| Equity and vesting | Grant ledgers, vesting schedules, option pools |
| Entity structure | Subsidiaries, parents, ownership chains, jurisdictions |
| AR/AP aging | Aging buckets, concentration, trend analysis |
| Bank and tax data | Account summaries, tax filing details, reconciliation support |
| IP assets | Patents, trademarks, licences, registration status |
| Compliance issues | Regulatory gaps, policy violations, permit status |
| Communications | Meeting notes, email threads, key discussion points |
| Code dependencies | Third-party libraries, package risk, and critical vendor exposure |
When to trust extraction vs. when to verify
For most documents, the structured output is accurate enough to drive triage and prioritisation. But for deal-critical findings — a change-of-control provision in a material contract, an ownership percentage that affects control, a financial adjustment that moves EBITDA — always verify against the source.
The file preview lets you open the original document alongside the extracted data. Use it. Every finding links back to the specific clause or data point that produced it, so verification is one click away.