Entity Taxonomy & Classification: Single-Intent Routing for Annual Filing Automation

Corporate entity portfolios rarely conform to a single regulatory template. Legal operations and compliance teams managing multi-jurisdictional portfolios face compounding complexity when annual reporting obligations diverge by entity type, domicile, fiscal structure, and statutory jurisdiction. The ingestion and classification pipeline serves as the deterministic control plane for corporate compliance automation. Under a strict single-intent execution model, each entity record is routed through a standardized taxonomy before any downstream filing workflow is triggered, which removes ambiguous state assignments and reduces the risk of cascading penalty events. The pipeline runs as a stateless, idempotent microservice within the broader Core Architecture & Regulatory Mapping framework, so that every compliance action is traceable, auditable, and reproducible.

Deterministic Normalization Pipeline

Raw entity data arriving from ERP systems, HRIS platforms, or manual intake forms often contains inconsistent casing, malformed jurisdiction codes, and legacy entity aliases. The normalization layer executes a strict, deterministic transformation sequence before any classification logic evaluates the payload. Field-level validation strips control characters, standardizes jurisdiction codes to ISO 3166-2, and resolves entity type aliases against a canonical registry.

from __future__ import annotations
import re
import logging
from enum import Enum
from typing import Optional
from pydantic import BaseModel, field_validator

logger = logging.getLogger("compliance.taxonomy")

class JurisdictionCode(str, Enum):
    DE = "US-DE"
    CA = "US-CA"
    NY = "US-NY"
    # Extend with full ISO 3166-2 registry in production

class EntityType(str, Enum):
    DOMESTIC_CORP = "domestic_c_corp"
    FOREIGN_QUALIFIED = "foreign_qualified"
    LLC = "limited_liability_company"
    PARTNERSHIP = "partnership"

class RawEntityPayload(BaseModel):
    entity_name: str
    formation_state: str
    entity_type_raw: str
    ein_prefix: Optional[str] = None
    fiscal_year_end_month: Optional[int] = None

class NormalizedEntity(BaseModel):
    entity_id: str
    canonical_name: str
    jurisdiction_iso: JurisdictionCode
    entity_type: EntityType
    ein_prefix: Optional[str] = None
    fiscal_year_end_month: Optional[int] = None

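    # Validators must target fields declared on this model. The raw values
    # (entity_name, formation_state, entity_type_raw) are passed in under the
    # normalized field names and converted by the "before" validators below.
    @field_validator("canonical_name")
    @classmethod
    def normalize_name(cls, v: str) -> str:
        # Replace control characters, then collapse whitespace and upper-case.
        v = re.sub(r"[\x00-\x1f\x7f]", " ", v)
        return re.sub(r"\s+", " ", v).strip().upper()

    @field_validator("jurisdiction_iso", mode="before")
    @classmethod
    def resolve_jurisdiction(cls, v: object) -> JurisdictionCode:
        if isinstance(v, JurisdictionCode):
            return v
        mapping = {"DE": JurisdictionCode.DE, "CA": JurisdictionCode.CA, "NY": JurisdictionCode.NY}
        clean = str(v).strip().upper()
        if clean.startswith("US-"):
            clean = clean[3:]  # accept values that are already ISO 3166-2
        if clean not in mapping:
            raise ValueError(f"Unsupported jurisdiction code: {clean}")
        return mapping[clean]

    @field_validator("entity_type", mode="before")
    @classmethod
    def resolve_entity_type(cls, v: object) -> EntityType:
        if isinstance(v, EntityType):
            return v
        alias_map = {
            "c-corp": EntityType.DOMESTIC_CORP, "llc": EntityType.LLC,
            "foreign": EntityType.FOREIGN_QUALIFIED, "lp": EntityType.PARTNERSHIP
        }
        clean = str(v).strip().lower()
        resolved = alias_map.get(clean)
        if resolved is None:
            raise ValueError(f"Unrecognized entity type alias: {clean}")
        return resolved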

Normalization failures are immediately captured and routed to a validation dead-letter queue. This guarantees that downstream classification engines only process structurally valid, standardized payloads.
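The dead-letter routing described above is not shown in the normalization block. A minimal sketch of how such a gate might look, assuming an in-memory list as a stand-in for the real queue. `NormalizationGate`, `DeadLetterRecord`, and `demo_normalize` are hypothetical names, and the stand-in normalizer only mimics the jurisdiction check:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class DeadLetterRecord:
    """A rejected payload plus the reason it failed (hypothetical shape)."""
    payload: dict[str, Any]
    error: str


@dataclass
class NormalizationGate:
    """Runs a normalizer and diverts failures instead of propagating them."""
    normalize: Callable[[dict[str, Any]], dict[str, Any]]
    dead_letters: list[DeadLetterRecord] = field(default_factory=list)

    def process(self, payload: dict[str, Any]) -> Optional[dict[str, Any]]:
        try:
            return self.normalize(payload)
        except ValueError as exc:
            # Keep an untouched copy of the input so reviewers see what arrived.
            self.dead_letters.append(DeadLetterRecord(dict(payload), str(exc)))
            return None


def demo_normalize(payload: dict[str, Any]) -> dict[str, Any]:
    """Stand-in normalizer: accepts only the three states used above."""
    state = str(payload.get("formation_state", "")).strip().upper()
    if state not in {"DE", "CA", "NY"}:
        raise ValueError(f"Unsupported jurisdiction code: {state}")
    return {**payload, "jurisdiction_iso": f"US-{state}"}


gate = NormalizationGate(normalize=demo_normalize)
ok = gate.process({"entity_name": "Acme", "formation_state": " de "})
rejected = gate.process({"entity_name": "Beta", "formation_state": "TX"})
```

Returning `None` for rejected payloads keeps the caller's control flow simple: only non-`None` results continue to classification, and the rejected input is preserved alongside its error for review.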

Single-Intent Classification Engine

The taxonomy schema maps structural attributes to regulatory obligations. Domestic corporations, foreign-qualified entities, limited liability companies, and hybrid pass-through structures each trigger distinct compliance metadata profiles. The classification logic uses a rule-based decision tree augmented by a lightweight probabilistic classifier for edge-case descriptions. The system evaluates formation documents, EIN/TIN prefixes (aligned with IRS employer identification number guidelines), registered agent jurisdictions, and fiscal year-end declarations to assign a definitive entity class.

The single-intent execution model mandates that each record resolve to exactly one primary classification vector before advancing. Conflicting attributes, such as a Delaware formation paired with a California LLC operating agreement, trigger an immediate pre-classification halt.

from dataclasses import dataclass
from enum import Enum
from .normalization import EntityType, NormalizedEntity  # see preceding block

class ClassificationIntent(str, Enum):
    ANNUAL_REPORT = "annual_report"
    FRANCHISE_TAX = "franchise_tax"
    STATEMENT_OF_INFO = "statement_of_info"
    FOREIGN_QUALIFICATION = "foreign_qualification"

@dataclass(frozen=True)
class ClassificationResult:
    intent: ClassificationIntent
    confidence: float
    rule_applied: str
    metadata: dict

class SingleIntentClassifier:
    def __init__(self, conflict_threshold: float = 0.65):
        self.conflict_threshold = conflict_threshold

    def evaluate(self, entity: NormalizedEntity) -> ClassificationResult:
        # Rule-based deterministic evaluation
        if entity.entity_type == EntityType.DOMESTIC_CORP:
            return ClassificationResult(
                intent=ClassificationIntent.ANNUAL_REPORT,
                confidence=1.0,
                rule_applied="DOMESTIC_CORP_DEFAULT",
                metadata={"filing_template": "corp_annual_report_v3"}
            )
        if entity.entity_type == EntityType.LLC:
            return ClassificationResult(
                intent=ClassificationIntent.STATEMENT_OF_INFO,
                confidence=0.95,
                rule_applied="LLC_SOS_DEFAULT",
                metadata={"filing_template": "llc_soi_v2"}
            )
        if entity.entity_type == EntityType.FOREIGN_QUALIFIED:
            return ClassificationResult(
                intent=ClassificationIntent.FOREIGN_QUALIFICATION,
                confidence=0.90,
                rule_applied="FOREIGN_QUAL_RULE",
                metadata={"requires_registered_agent": True}
            )
        
        # Fallback to probabilistic heuristic for hybrid/edge cases
        return self._probabilistic_fallback(entity)

    def _probabilistic_fallback(self, entity: NormalizedEntity) -> ClassificationResult:
        # Placeholder for lightweight ML/heuristic scoring in production.
        # Returns confidence < 0.70 so the router sends the record to Tier 3.
        return ClassificationResult(
            intent=ClassificationIntent.ANNUAL_REPORT,
            confidence=0.55,
            rule_applied="HEURISTIC_FALLBACK",
            metadata={"requires_manual_review": True}
        )
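The classifier above does not include the pre-classification halt for conflicting attributes. One way it could be sketched, assuming each source document has already been reduced to the ISO 3166-2 code it implies. `AttributeConflict`, `require_single_jurisdiction`, and the source names are hypothetical:

```python
from __future__ import annotations


class AttributeConflict(Exception):
    """Raised when jurisdiction signals disagree (hypothetical error type)."""

    def __init__(self, signals: dict[str, str]):
        self.signals = dict(signals)
        super().__init__(f"Conflicting jurisdiction signals: {sorted(signals.items())}")


def require_single_jurisdiction(signals: dict[str, str]) -> str:
    """Return the one jurisdiction every source agrees on, or halt.

    `signals` maps a source document name to the ISO 3166-2 code it implies.
    """
    if not signals:
        raise ValueError("No jurisdiction signals supplied")
    distinct = set(signals.values())
    if len(distinct) > 1:
        raise AttributeConflict(signals)
    return distinct.pop()


agreed = require_single_jurisdiction(
    {"certificate_of_formation": "US-DE", "operating_agreement": "US-DE"}
)

try:
    require_single_jurisdiction(
        {"certificate_of_formation": "US-DE", "operating_agreement": "US-CA"}
    )
    halted = False
except AttributeConflict:
    halted = True
```

Raising an exception instead of returning a low confidence score makes the halt explicit, so a conflicting record cannot slip into Tier 1 or Tier 2 on confidence alone.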

For teams navigating jurisdictional variance, the companion guide "How to map LLC vs C-Corp filing requirements across 50 states" provides the foundational logic required to parameterize state-specific rule engines without hardcoding brittle conditional statements.
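As a rough illustration of parameterizing rules instead of hardcoding conditionals, the sketch below keys a lookup table on jurisdiction and entity type. The entries are illustrative only and are not filing guidance. `FilingRule`, `FILING_RULES`, `lookup_rule`, and the `de_franchise_tax_v1` template name are assumptions:

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class FilingRule(NamedTuple):
    intent: str
    filing_template: str


# Illustrative entries only. A production table would more likely be loaded
# from reviewed configuration than defined inline.
FILING_RULES: dict[tuple[str, str], FilingRule] = {
    ("US-CA", "limited_liability_company"): FilingRule("statement_of_info", "llc_soi_v2"),
    ("US-DE", "domestic_c_corp"): FilingRule("franchise_tax", "de_franchise_tax_v1"),
}


def lookup_rule(jurisdiction_iso: str, entity_type: str) -> Optional[FilingRule]:
    """Return the configured rule, or None so the caller can fall back."""
    return FILING_RULES.get((jurisdiction_iso, entity_type))


ca_llc = lookup_rule("US-CA", "limited_liability_company")
unknown = lookup_rule("US-NY", "partnership")
```

Adding a state then becomes a data change instead of a new branch in the classifier, and a missing entry surfaces as `None` so it can be routed to review.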

Tiered Fallback & Error Categorization Strategy

Ambiguity in entity classification is a common driver of late filings and administrative penalties. The routing engine implements a tiered fallback mechanism to preserve compliance continuity while maintaining strict audit boundaries.

| Tier | Trigger Condition | Routing Action | Audit Requirement |
| --- | --- | --- | --- |
| Tier 1 | Confidence ≥ 0.95, zero attribute conflicts | Direct pipeline execution | Log rule ID, timestamp, hash of payload |
| Tier 2 | 0.70 ≤ confidence < 0.95, minor heuristic gaps | Async validation queue (cross-reference SOS DB) | Store confidence delta, retry count, validation source |
| Tier 3 | Confidence < 0.70 OR direct attribute contradiction | Dead-letter queue for legal ops review | Full diagnostic payload, conflict vector, SLA timer |

from __future__ import annotations  # allows "X | None" hints before Python 3.10
import logging
from enum import Enum
from .normalization import NormalizedEntity  # see normalization block above
from .classifier import (  # see classification block above
    ClassificationResult,
    SingleIntentClassifier,
)

logger = logging.getLogger("compliance.taxonomy")

class RoutingErrorType(str, Enum):
    CONFLICTING_JURISDICTION = "CONFLICTING_JURISDICTION"
    MISSING_FISCAL_DECLARATION = "MISSING_FISCAL_DECLARATION"
    LOW_CONFIDENCE_THRESHOLD = "LOW_CONFIDENCE_THRESHOLD"
    SCHEMA_MUTATION = "SCHEMA_MUTATION"

class ClassificationRouter:
    def __init__(self, classifier: SingleIntentClassifier):
        self.classifier = classifier

    def route(self, entity: NormalizedEntity) -> ClassificationResult | None:
        result = self.classifier.evaluate(entity)
        
        if result.confidence >= 0.95:
            logger.info("TIER_1_MATCH", extra={"entity_id": entity.entity_id, "intent": result.intent})
            return result
            
        if 0.70 <= result.confidence < 0.95:
            logger.warning("TIER_2_HEURISTIC", extra={"entity_id": entity.entity_id, "confidence": result.confidence})
            self._enqueue_async_validation(entity, result)
            return result
            
        # Tier 3: Halt and flag
        self._raise_compliance_alert(entity, RoutingErrorType.LOW_CONFIDENCE_THRESHOLD)
        return None

    def _enqueue_async_validation(self, entity: NormalizedEntity, result: ClassificationResult) -> None:
        # Production implementation: publish to SQS/Kafka with idempotency key
        logger.info("ASYNC_VALIDATION_QUEUED", extra={"entity_id": entity.entity_id})

    def _raise_compliance_alert(self, entity: NormalizedEntity, error_type: RoutingErrorType) -> None:
        logger.critical(
            "TIER_3_BLOCKED",
            extra={
                "entity_id": entity.entity_id,
                "error_type": error_type.value,
                "requires_legal_review": True
            }
        )

This error taxonomy maps directly to statutory audit requirements. Every classification halt generates an immutable event log, ensuring regulators can trace exactly why a filing was delayed and what remediation steps were initiated.
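The audit columns in the tier table call for a rule ID, a timestamp, and a hash of the payload. A minimal sketch of building such an event, assuming SHA-256 over a canonical JSON encoding. `payload_hash` and `build_audit_event` are hypothetical helpers, and the event shape is an assumption:

```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Optional


def payload_hash(payload: dict[str, Any]) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_event(
    entity_id: str,
    rule_id: str,
    payload: dict[str, Any],
    now: Optional[datetime] = None,
) -> dict[str, str]:
    """Assemble the fields listed under Tier 1 in the table above."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "entity_id": entity_id,
        "rule_id": rule_id,
        "timestamp": timestamp,
        "payload_sha256": payload_hash(payload),
    }


event = build_audit_event(
    "ent-001",
    "DOMESTIC_CORP_DEFAULT",
    {"canonical_name": "ACME INC", "jurisdiction_iso": "US-DE"},
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Because the hash is stable for identical payloads, the same value could also serve as the idempotency key mentioned in the `_enqueue_async_validation` comment.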

Downstream Routing & Compliance Metadata Integration

Once a single-intent classification vector is resolved, the payload synchronizes with State Filing Deadline Calendars to compute jurisdiction-specific due dates, penalty grace periods, and fee schedules. The classification metadata drives template selection, ensuring that Delaware franchise tax calculations, California Statement of Information submissions, and New York biennial reports are generated against the correct statutory schema.
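To make the due-date step concrete, here is a small sketch of a calendar lookup. The jurisdiction code `US-XX` and its dates are placeholders, not statutory values. `DeadlineRule`, `CALENDAR`, and `compute_deadlines` are hypothetical names:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DeadlineRule:
    due_month: int
    due_day: int
    grace_days: int = 0


# Placeholder entry for illustration; real values belong in a maintained
# deadline calendar, not in code.
CALENDAR: dict[tuple[str, str], DeadlineRule] = {
    ("US-XX", "annual_report"): DeadlineRule(due_month=3, due_day=1, grace_days=30),
}


def compute_deadlines(jurisdiction_iso: str, intent: str, filing_year: int) -> tuple[date, date]:
    """Return (due date, end of grace period) for a resolved classification."""
    rule = CALENDAR[(jurisdiction_iso, intent)]
    due = date(filing_year, rule.due_month, rule.due_day)
    return due, due + timedelta(days=rule.grace_days)


due, grace_end = compute_deadlines("US-XX", "annual_report", 2025)
```

A missing calendar entry raises `KeyError` here, which fits the halt-on-ambiguity approach used elsewhere in the pipeline. Fee schedules could be attached to the same rule record.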

All classification payloads are cryptographically signed and stored within strict Security & Data Boundaries to prevent unauthorized schema mutation or regulatory data leakage. By decoupling classification from execution, engineering teams can iterate on rule sets, update jurisdictional aliases, and patch probabilistic models without disrupting active filing pipelines. This architecture maintains continuous compliance across evolving statutory landscapes while providing legal operations teams with deterministic visibility into every entity’s filing trajectory.
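The text does not say which signing scheme is used. As a stand-in, the sketch below uses HMAC-SHA256 from the Python standard library, which detects tampering with a shared secret but is not an asymmetric signature. `sign_payload`, `verify_payload`, and the demo key are assumptions:

```python
from __future__ import annotations

import hashlib
import hmac
import json
from typing import Any


def sign_payload(payload: dict[str, Any], key: bytes) -> str:
    """HMAC-SHA256 over canonical JSON; the key would come from a secret store."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def verify_payload(payload: dict[str, Any], key: bytes, signature: str) -> bool:
    """Constant-time comparison avoids leaking how many characters matched."""
    return hmac.compare_digest(sign_payload(payload, key), signature)


demo_key = b"demo-key-not-for-production"
record = {"entity_id": "ent-001", "intent": "annual_report"}
signature = sign_payload(record, demo_key)

untouched_ok = verify_payload(record, demo_key, signature)
tampered_ok = verify_payload({**record, "intent": "franchise_tax"}, demo_key, signature)
```

Verifying the stored signature before a payload is handed to a filing workflow is one way to catch an unauthorized schema mutation.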