The Signal Extraction Layer is the first point at which volatile acquisition data is transformed into structured intelligence. Its role is to capture every potentially linkable artifact, normalize it, and classify it into well-defined types. Decisions about meaning are deferred; the emphasis here is on precision and repeatability. Errors at this stage propagate downstream, making resilience and platform-agnostic handling essential.
4.1 Structural Layout Signatures
Purpose
Represent page structure in a way that remains stable under cosmetic changes yet sensitive to template-level shifts.

Method
The rendered DOM is reduced to a hierarchical structural signature, from which a layout hash is derived. This abstraction captures spatial hierarchy while discarding superficial text or image changes.

Invariants
- Cosmetic changes do not affect the signature
- Platform template updates may trigger controlled invalidation
- Collision risk is mitigated by combining the layout signature with other signal families
Crawlers that capture only static HTML without executing JavaScript miss dynamically injected nodes, leading to fragmented signatures.
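The reduction described above can be sketched as follows. This is a minimal illustration, assuming the signature is the nested tag skeleton of the rendered DOM hashed with SHA-256; the class and function names are illustrative, not part of the system.

```python
import hashlib
from html.parser import HTMLParser

class LayoutSkeleton(HTMLParser):
    """Reduces markup to its nested tag skeleton, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.skeleton = []
    def handle_starttag(self, tag, attrs):
        self.skeleton.append(f"<{tag}")
    def handle_endtag(self, tag):
        self.skeleton.append(">")

def layout_hash(html: str) -> str:
    """Derives a layout hash from the structural signature of a page."""
    parser = LayoutSkeleton()
    parser.feed(html)
    return hashlib.sha256("".join(parser.skeleton).encode()).hexdigest()

# Cosmetic text changes leave the hash unchanged:
a = layout_hash("<div><p>Hello</p><p>World</p></div>")
b = layout_hash("<div><p>Goodbye</p><p>Moon</p></div>")
assert a == b
# A template-level change (a node removed) alters it:
c = layout_hash("<div><p>Hello</p></div>")
assert a != c
```

Because only the tag hierarchy feeds the hash, text edits and attribute churn cannot perturb it, which is the cosmetic-stability invariant listed above.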
4.2 Asset-Level Fingerprints
Purpose
Static assets such as JavaScript, CSS, and fonts often persist across multiple domains operated by the same entity. Identical assets act as durable pivots even when URLs differ.

Method
Assets are normalized to remove cache-busting parameters and then hashed by content. This ensures byte-level uniqueness while tolerating delivery variance across CDNs.

Invariants
- Large vendor libraries are treated as low-entropy and down-weighted
- Bespoke or uncommon scripts act as strong pivots
- CDN variance does not affect correlation because matching is based on content rather than hostnames
Baseline URL-based matching misses persistent overlaps. Deep asset fetching is often skipped by scanners due to bandwidth cost, causing loss of high-value pivots.
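The normalization and content-hashing steps can be sketched as below. The list of cache-busting parameter names is an illustrative assumption; a real deployment would maintain its own.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed cache-busting parameter names (illustrative, not exhaustive).
CACHE_BUST_PARAMS = {"v", "ver", "version", "cb", "cachebust", "t", "ts", "_"}

def normalize_asset_url(url: str) -> str:
    """Strips cache-busting query parameters so URL variants collapse."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in CACHE_BUST_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def asset_fingerprint(content: bytes) -> str:
    """Content hash: identical bytes match regardless of hostname or CDN."""
    return hashlib.sha256(content).hexdigest()

assert normalize_asset_url("https://cdn-a.example/app.js?v=123") == \
    "https://cdn-a.example/app.js"
# The same bytes served from two CDNs yield the same fingerprint:
assert asset_fingerprint(b"console.log(1);") == asset_fingerprint(b"console.log(1);")
```

Matching on the content digest rather than the URL is what makes CDN variance irrelevant, per the invariants above.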
4.3 Endpoint Extraction
Purpose
Infrastructure endpoints such as APIs, storage buckets, and webhooks are stable operator-level artifacts. They often persist across multiple domains within the same network.

Method
Pattern recognition routines extract candidate endpoints and normalize them into canonical form (protocol, host, base path). Resolution of hosts occurs in enrichment rather than at this stage.

Invariants
- Storage identifiers are reduced to root containers for stability
- Ephemeral testing endpoints are flagged and deprioritized
- Endpoints embedded in runtime scripts are captured only when full JavaScript execution is performed
Without runtime resource capture, many integration points are invisible and correlation coverage collapses.
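A minimal sketch of the extraction and canonicalization step follows. The regular expression and the choice of the first path segment as the "base path" are assumptions for illustration; production patterns would be broader and platform-aware.

```python
import re
from urllib.parse import urlsplit

# Illustrative candidate-endpoint pattern (assumption, not the system's regex).
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")

def extract_endpoints(text: str) -> set[str]:
    """Reduces candidate URLs to canonical scheme://host/base-path form."""
    endpoints = set()
    for match in URL_RE.finditer(text):
        parts = urlsplit(match.group())
        segments = [s for s in parts.path.split("/") if s]
        base = "/" + segments[0] if segments else ""
        endpoints.add(f"{parts.scheme}://{parts.netloc.lower()}{base}")
    return endpoints

script = ('fetch("https://api.example.com/v1/orders/42"); '
          'const b = "https://bucket.s3.example.com/assets/logo.png";')
assert extract_endpoints(script) == {
    "https://api.example.com/v1",
    "https://bucket.s3.example.com/assets",
}
```

Note that truncating to the base path is what collapses per-object storage URLs into their root container, matching the first invariant above.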
4.4 Analytics and Tracker Identifiers
Purpose
Analytics and tracker IDs are persistent, account-level identifiers. They remain stable even when a domain undergoes significant change, making them powerful linkage signals.

Method
Identifiers are extracted using format-aware pattern matching combined with context validation to avoid false positives. IDs are normalized and deduplicated per observation.

Invariants
- Most identifiers persist across domain churn
- Collisions are rare but handled with rarity-aware scoring
- Extraction from deferred scripts ensures coverage of trackers loaded indirectly
Simple scrapers miss IDs delivered through tag managers or runtime injection, creating blind spots in correlation.
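Format-aware matching can be sketched as below, using a few well-known public identifier formats (Universal Analytics, GA4, Google Tag Manager). The pattern set is illustrative; the surrounding-context validation mentioned above is omitted for brevity.

```python
import re

# Illustrative format-aware patterns for a few well-known identifier families.
TRACKER_PATTERNS = {
    "ga_universal": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "ga4": re.compile(r"\bG-[A-Z0-9]{6,12}\b"),
    "gtm": re.compile(r"\bGTM-[A-Z0-9]{4,8}\b"),
}

def extract_tracker_ids(text: str) -> dict[str, set[str]]:
    """Returns normalized, deduplicated IDs keyed by identifier family."""
    found: dict[str, set[str]] = {}
    for family, pattern in TRACKER_PATTERNS.items():
        ids = {m.upper() for m in pattern.findall(text)}
        if ids:
            found[family] = ids
    return found

page = ("gtag('config', 'UA-12345678-1'); gtag('config', 'G-AB12CD34'); "
        "duplicate: UA-12345678-1")
ids = extract_tracker_ids(page)
assert ids["ga_universal"] == {"UA-12345678-1"}  # deduplicated
assert ids["ga4"] == {"G-AB12CD34"}
```

Anchoring each pattern to its family-specific prefix and length range is what keeps false positives low before any context validation is applied.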
4.5 Platform Indicators
Purpose
Early identification of platform affiliation accelerates downstream analysis and enables selective application of extraction logic.

Method
Indicator patterns are scored, and the platform with the strongest aggregated signal strength is assigned. If multiple platforms show comparable evidence, the system records a composite label rather than forcing a single classification.

Invariants
- False positives are controlled by requiring minimum confidence
- Hybrid detections are preserved for downstream reconciliation
- Baselines prevent common markers from dominating
4.6 Typed Signal Schema
Purpose
To prevent misclassification, every artifact is assigned a type from a fixed schema. This guards against corruption in later stages of processing.

Method
A strict enumeration defines valid signal types, such as layout hashes, asset hashes, endpoints, and identifiers. Extraction output must conform to this schema.

Invariants
- Each signal is explicitly typed before enrichment
- Validation occurs at the schema level, preventing accidental cross-use
- Downstream modules operate only on typed input
Without enforced typing, signals of different kinds can be conflated, leading to false linkages and unstable graphs.
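A strict enumeration plus construction-time validation can be sketched as below; the type names mirror the signal families above, but the field names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class SignalType(Enum):
    """Fixed enumeration of valid signal types (names assumed for illustration)."""
    LAYOUT_HASH = "layout_hash"
    ASSET_HASH = "asset_hash"
    ENDPOINT = "endpoint"
    IDENTIFIER = "identifier"

@dataclass(frozen=True)
class Signal:
    type: SignalType
    value: str

    def __post_init__(self):
        # Schema-level validation: reject anything outside the enumeration,
        # so downstream modules only ever see explicitly typed input.
        if not isinstance(self.type, SignalType):
            raise TypeError(f"untyped signal: {self.type!r}")

sig = Signal(SignalType.ENDPOINT, "https://api.example.com/v1")
assert sig.type is SignalType.ENDPOINT
try:
    Signal("endpoint", "https://api.example.com/v1")  # raw string is rejected
except TypeError:
    pass
```

Rejecting untyped values at construction means signals of different kinds cannot be conflated later, which is the cross-use protection named above.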
4.7 Output Integrity
Purpose
Extraction outputs must be machine-ingestible, auditable, and resistant to tampering.

Method
Signals are emitted in structured arrays with associated metadata. A verification hash is computed to preserve fidelity across pipelines.

Invariants
- Metadata includes acquisition context and runtime environment
- Integrity validation ensures reproducibility and trust
- Outputs are suitable for direct ingestion into enrichment and correlation workflows
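The emission and verification steps can be sketched as follows, assuming canonical JSON serialization with a SHA-256 digest over the payload; the key names are illustrative.

```python
import hashlib
import json

def emit(signals: list[dict], context: dict) -> dict:
    """Wraps signals with metadata and a verification hash over the payload."""
    payload = {"signals": signals, "metadata": context}
    # Canonical form: sorted keys, no whitespace, so re-serialization is stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return {**payload, "verification_hash": digest}

def verify(record: dict) -> bool:
    """Recomputes the hash to confirm the record was not altered in transit."""
    claimed = record["verification_hash"]
    payload = {k: v for k, v in record.items() if k != "verification_hash"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest() == claimed

record = emit([{"type": "asset_hash", "value": "deadbeef"}],
              {"acquired_at": "2024-01-01T00:00:00Z", "runtime": "headless"})
assert verify(record)
record["signals"][0]["value"] = "tampered"
assert not verify(record)  # any mutation invalidates the hash
```

Because the digest is recomputable from the canonical payload alone, any consumer in the enrichment or correlation workflow can independently confirm integrity before ingestion.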