

The Signal Extraction Layer is the first point at which volatile acquisition data is transformed into structured intelligence. Its role is to capture every potentially linkable artifact, normalize it, and classify it into well-defined types. Decisions about meaning are deferred; the emphasis here is on precision and repeatability. Errors at this stage propagate downstream, making resilience and platform-agnostic handling essential.

4.1 Structural Layout Signatures

Purpose
Represent page structure in a way that remains stable under cosmetic change yet sensitive to template-level shifts.
Method
The rendered DOM is reduced to a hierarchical structural signature, from which a layout hash is derived. This abstraction captures spatial hierarchy while discarding superficial text or image changes.
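A minimal sketch of this reduction, using Python's standard-library HTML parser and SHA-256 (both illustrative choices, not the system's actual implementation). Text and attributes are discarded, so only the tag hierarchy contributes to the hash:

```python
import hashlib
from html.parser import HTMLParser

class LayoutSignatureParser(HTMLParser):
    """Reduce a document to its tag hierarchy, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")   # attributes intentionally dropped

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

def layout_hash(html: str) -> str:
    """Derive a layout hash from the nesting structure alone."""
    parser = LayoutSignatureParser()
    parser.feed(html)
    signature = "".join(parser.parts)
    return hashlib.sha256(signature.encode()).hexdigest()

# Cosmetic change (different text) yields the same hash:
a = layout_hash("<div><h1>Hello</h1><p>old copy</p></div>")
b = layout_hash("<div><h1>Hi</h1><p>new copy</p></div>")
assert a == b
```

Because only nesting order is encoded, a template-level change (e.g. swapping a `<p>` for a `<span>`) produces a different hash, while copy edits do not.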
Invariants
  • Cosmetic changes do not affect the signature
  • Platform template updates may trigger controlled invalidation
  • Collision risk is mitigated by combining the layout signature with other signal families
Common pitfalls
Crawlers that capture only static HTML without executing JavaScript miss dynamically injected nodes, leading to fragmented signatures.

4.2 Asset-Level Fingerprints

Purpose
Static assets such as JavaScript, CSS, and fonts often persist across multiple domains operated by the same entity. Identical assets act as durable pivots even when URLs differ.
Method
Assets are normalized to remove cache-busting parameters and then hashed by content. This makes matching byte-exact while tolerating delivery variance across CDNs.
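A hedged sketch of the two steps, assuming Python; the set of cache-busting parameter names is hypothetical and would be tuned in practice:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical cache-busting parameter names; real deployments vary.
CACHE_BUST_PARAMS = {"v", "ver", "version", "cb", "t", "ts", "_"}

def normalize_asset_url(url: str) -> str:
    """Strip cache-busting query parameters; drop the fragment."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in CACHE_BUST_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

def asset_fingerprint(content: bytes) -> str:
    """Hash the asset by content, not by URL or hostname."""
    return hashlib.sha256(content).hexdigest()

# The same bytes served from two CDNs yield the same fingerprint,
# even though the hostnames and cache-busters differ.
```

Because the fingerprint is computed over content, two domains serving the identical bespoke script correlate regardless of where the asset is hosted.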
Invariants
  • Large vendor libraries are treated as low-entropy and down-weighted
  • Bespoke or uncommon scripts act as strong pivots
  • CDN variance does not affect correlation because matching is based on content rather than hostnames
Common pitfalls
URL-based matching misses persistent overlaps because the same asset is often served from different URLs. Deep asset fetching is often skipped by scanners due to bandwidth cost, causing loss of high-value pivots.

4.3 Endpoint Extraction

Purpose
Infrastructure endpoints such as APIs, storage buckets, and webhooks are stable operator-level artifacts. They often persist across multiple domains within the same network.
Method
Pattern recognition routines extract candidate endpoints and normalize them into canonical form (protocol, host, base path). Resolution of hosts occurs in enrichment rather than at this stage.
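A minimal sketch of extraction and canonicalization, assuming Python; the URL regex and the root-container reduction are simplified illustrations, not the system's actual patterns:

```python
import re
from urllib.parse import urlsplit

# Simplified candidate-endpoint pattern; a real extractor uses many.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def canonical_endpoint(url: str) -> tuple:
    """Reduce a URL to (protocol, host, base path).

    Storage-style paths are cut to their root container
    (e.g. /bucket/key/obj.json -> /bucket) for stability.
    """
    p = urlsplit(url)
    base_path = "/".join(p.path.split("/")[:2])
    return (p.scheme, p.netloc.lower(), base_path or "/")

def extract_endpoints(text: str) -> set:
    """Pull candidate endpoints out of page or script text."""
    return {canonical_endpoint(u) for u in URL_RE.findall(text)}
```

Note that no DNS resolution happens here; hosts are carried forward as strings and resolved later, during enrichment, as the text above describes.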
Invariants
  • Storage identifiers are reduced to root containers for stability
  • Ephemeral testing endpoints are flagged and deprioritized
  • Endpoints embedded in runtime scripts are captured only when full JavaScript execution is performed
Common pitfalls
Without runtime resource capture, many integration points are invisible and correlation coverage collapses.

4.4 Analytics and Tracker Identifiers

Purpose
Analytics and tracker IDs are persistent, account-level identifiers. They remain stable even when a domain undergoes significant change, making them powerful linkage signals.
Method
Identifiers are extracted using format-aware pattern matching combined with context validation to avoid false positives. IDs are normalized and deduplicated per observation.
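A sketch of format-aware matching, assuming Python. The three patterns below cover well-known Google Analytics / Tag Manager identifier formats as an illustrative subset; context validation is omitted for brevity:

```python
import re

# Illustrative subset of identifier families; a real extractor has many
# more, each with context validation to reject false positives.
ID_PATTERNS = {
    "ga_universal": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "ga4":          re.compile(r"\bG-[A-Z0-9]{8,12}\b"),
    "gtm":          re.compile(r"\bGTM-[A-Z0-9]{6,8}\b"),
}

def extract_ids(text: str) -> dict:
    """Return deduplicated, sorted identifiers grouped by family."""
    found = {}
    for family, pattern in ID_PATTERNS.items():
        hits = sorted(set(pattern.findall(text)))
        if hits:
            found[family] = hits
    return found
```

Because the same account-level ID tends to appear verbatim across every property an operator runs, even one matched identifier can link otherwise unrelated domains.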
Invariants
  • Most identifiers persist across domain churn
  • Collisions are rare but handled with rarity-aware scoring
  • Extraction from deferred scripts ensures coverage of trackers loaded indirectly
Common pitfalls
Simple scrapers miss IDs delivered through tag managers or runtime injection, creating blind spots in correlation.

4.5 Platform Indicators

Purpose
Early identification of platform affiliation accelerates downstream analysis and enables selective application of extraction logic.
Method
Indicator patterns are scored, and the platform with the strongest aggregated signal strength is assigned. If multiple platforms show comparable evidence, the system records a composite label rather than forcing a single classification.
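The scoring-and-assignment logic can be sketched as follows; the `min_confidence` and `margin` thresholds are hypothetical parameters, not values from the actual system:

```python
def classify_platform(scores: dict, min_confidence: float = 0.5,
                      margin: float = 0.1):
    """Assign the strongest platform, or a composite label on near-ties.

    scores: {platform_name: aggregated signal strength in [0, 1]}
    Returns None when no platform clears the confidence floor.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_platform, top_score = ranked[0]
    if top_score < min_confidence:
        return None  # false-positive control: require minimum confidence
    # Platforms within the margin of the leader share a composite label.
    close = [p for p, s in ranked
             if top_score - s <= margin and s >= min_confidence]
    if len(close) > 1:
        return "+".join(sorted(close))  # hybrid detection preserved
    return top_platform
```

The composite path matters for hybrid builds (e.g. a headless storefront behind a CMS), where forcing a single label would discard evidence needed downstream.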
Invariants
  • False positives are controlled by requiring minimum confidence
  • Hybrid detections are preserved for downstream reconciliation
  • Baselines prevent common markers from dominating
Common pitfalls
Simplistic string checks miss obfuscated or hybrid builds, reducing classification reliability.

4.6 Typed Signal Schema

Purpose
To prevent misclassification, every artifact is assigned a type from a fixed schema. This guards against corruption in later stages of processing.
Method
A strict enumeration defines valid signal types, such as layout hashes, asset hashes, endpoints, and identifiers. Extraction output must conform to this schema.
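One way to express such a schema in Python is an enumeration plus a validated record type; the type names below mirror the examples in the text but are otherwise illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class SignalType(Enum):
    """Closed set of valid signal types; extraction may emit nothing else."""
    LAYOUT_HASH = "layout_hash"
    ASSET_HASH = "asset_hash"
    ENDPOINT = "endpoint"
    IDENTIFIER = "identifier"

@dataclass(frozen=True)
class Signal:
    type: SignalType
    value: str

    def __post_init__(self):
        # Schema-level validation: reject anything outside the enumeration,
        # so an untyped string can never masquerade as a signal type.
        if not isinstance(self.type, SignalType):
            raise TypeError(f"invalid signal type: {self.type!r}")
```

Downstream modules can then dispatch on `Signal.type` with confidence that, say, an asset hash will never be compared against a layout hash.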
Invariants
  • Each signal is explicitly typed before enrichment
  • Validation occurs at the schema level, preventing accidental cross-use
  • Downstream modules operate only on typed input
Common pitfalls
Without enforced typing, signals of different kinds can be conflated, leading to false linkages and unstable graphs.

4.7 Output Integrity

Purpose
Extraction outputs must be machine-ingestible, auditable, and resistant to tampering.
Method
Signals are emitted in structured arrays with associated metadata. A verification hash is computed to preserve fidelity across pipelines.
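A hedged sketch of emission and verification, assuming canonical JSON plus SHA-256 as the integrity scheme (the actual hash construction is not specified in the text):

```python
import hashlib
import json

def _canonical(payload: dict) -> str:
    """Deterministic serialization so the hash is reproducible."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))

def emit_signals(signals: list, context: dict) -> dict:
    """Package typed signals with acquisition metadata and an integrity hash."""
    payload = {"signals": signals, "metadata": context}
    payload["integrity"] = hashlib.sha256(
        _canonical(payload).encode()).hexdigest()
    return payload

def verify(payload: dict) -> bool:
    """Recompute the hash over everything except the integrity field."""
    body = {k: v for k, v in payload.items() if k != "integrity"}
    expected = hashlib.sha256(_canonical(body).encode()).hexdigest()
    return expected == payload.get("integrity")
```

Any consumer in the enrichment or correlation stage can call `verify` before ingestion, giving the audit-and-replay property the pitfalls note below says basic crawlers lack.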
Invariants
  • Metadata includes acquisition context and runtime environment
  • Integrity validation ensures reproducibility and trust
  • Outputs are suitable for direct ingestion into enrichment and correlation workflows
Common pitfalls
Basic crawlers often discard intermediate artifacts, eliminating the ability to audit or replay results.