Indexing

ALOS DB indexes are fully automatic. There is nothing to create, nothing to drop, and nothing to tune. Every field you query is indexed on first access and stays indexed forever.

Zero-Config Indexing

ALOS DB is the only document database with fully automatic index management. There is no createIndex(), no ensureIndex(), and no dropIndex(). The index system is entirely self-managing.

How It Works

The first time you query a field, ALOS DB automatically builds a B-tree index on that field. Every subsequent query on the same field uses the index instantly. This happens transparently — your code never touches index management:

Go
// Just query. Indexes are created automatically.
doc, _ := users.FindOne(alosdbclient.Document{
    "guild_id":  "123456789",
    "author.id": "987654321",
})
// Both "guild_id" and "author.id" are now indexed.
// Every future query on these fields is instant.

What Gets Indexed

  • Top-level fields — guild_id, channel_id, type
  • Nested fields — author.id, author.username, author.discriminator
  • Deep nested fields — author.global_name, content, any depth
  • Every unique key in your documents — if it exists, it can be indexed

No manual index management. ALOS DB watches what you query and builds indexes automatically. You never have to think about createIndex, index ordering, compound indexes, or index maintenance. It just works.

Auto-Index vs. Traditional Databases

Feature ALOS DB MongoDB / PostgreSQL
Index creation Automatic Manual createIndex()
Index dropping Not needed Manual dropIndex()
Compound indexes Automatic multi-field Manual compound definition
Index tuning Zero config Manual explain + tuning
Nested field indexes Automatic Must define dot-notation index
Forgotten indexes Impossible Common production issue

Query Performance

ALOS DB delivers sub-15ms query latency at 60 million documents and sub-25ms at 300 million documents. These numbers hold for complex multi-condition queries with nested fields, $or, $in, $regex, $exists, and more.

  • <15ms — query latency at 60M documents
  • <25ms — query latency at 300M documents
  • 1.4 GB — RAM at 60M documents
  • 8 min — full index rebuild at 60M documents

These benchmarks were measured on a real production dataset — 60 million Discord message documents with 22+ top-level and nested keys per document. Every query was a complex, multi-condition filter. Not synthetic key-value lookups.

What "Under 15ms" Actually Means

  • Cold queries — fields never queried before still return under 15ms because auto-indexing is fast enough to build the index AND execute the query within the threshold
  • Complex queries — multi-field conditions with $and, $or, $in, $exists, $ne, $regex, and nested dot-notation fields
  • Miss queries — queries that intentionally match zero documents (worst case for many databases) are equally fast
  • Concurrent queries — 256-shard architecture means parallel queries don't contend

800-Query Test Suite

ALOS DB ships with a dedicated test suite of 800 complex queries — 400 hit queries (expected matches) and 400 miss queries (expected zero results). Every query must complete in under 15ms. These are real queries against a real dataset, not synthetic benchmarks.

Test Results Summary

Test Suite Total Pass (≤15ms) Pass Rate
Hit Queries 400 400 100%
Miss Queries 400 400 100%
Combined 800 800 100%

Query Categories Tested

The 800 queries cover every operator and pattern that a real application would use; a combined example follows the list:

  • Exact match — single and multi-field equality (guild_id, channel_id, type)
  • Nested field queries — dot-notation on author.id, author.username, author.discriminator
  • $or combinations — branching logic with 2-4 clauses, each with different operators
  • $in operators — arrays of 2-5 candidate values per field
  • $exists checks — field presence/absence on top-level and nested paths
  • $ne / $nin — exclusion filters combined with positive conditions
  • $regex patterns — prefix, suffix, and substring matching on string fields
  • $gt / $gte / $lt / $lte — range queries on numeric and string fields
  • Mixed compound queries — 3-6 conditions spanning multiple categories simultaneously
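
To make these categories concrete, here is a hedged sketch of one mixed compound query built only from the FindOne call and alosdbclient.Document literal shown at the top of this page. The collection name messages, the concrete field values, and the assumption that Document nests like a map are illustrative, not part of the client's documented API.

Go
// Sketch only: combines exact match, $in, anchored $regex, $exists, and $or
// in one filter. Field values are illustrative; Document is assumed to nest
// like a map, as the earlier FindOne example suggests.
doc, err := messages.FindOne(alosdbclient.Document{
    "guild_id":        "123456789",                                // exact match, indexed
    "type":            alosdbclient.Document{"$in": []int{0, 19}}, // $in over candidate values
    "author.username": alosdbclient.Document{"$regex": "^john"},   // anchored-prefix regex
    "content":         alosdbclient.Document{"$exists": true},     // presence check
    "$or": []alosdbclient.Document{
        {"author.discriminator": alosdbclient.Document{"$ne": "0000"}},
        {"channel_id": alosdbclient.Document{"$gte": "100000000"}},
    },
})
if err != nil {
    // handle lookup failure or zero matches, per the client's error semantics
}
_ = doc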

Zero Failures Across 4,000 Runs

All 800 queries passed in under 15ms across 5 independent iterations (4,000 total runs), with zero failures. Hit queries averaged 2.03ms (p95 4.6ms, max 7.6ms). Miss queries averaged 0.35ms (p95 1.5ms, max 3.6ms). Even the most complex cases — deeply nested $or clauses with large $in arrays and multiple $regex patterns — comfortably cleared the 15ms threshold.

Memory Efficiency

At 60 million documents with 22+ keys and sub-keys per document, ALOS DB uses only 1.4 GB of RAM while maintaining sub-15ms query performance. That is a fraction of what comparable document databases typically need for the same workload.

How 1.4 GB Is Possible

  • Snapshot-backed B-trees — non-unique index values are stored in a compact binary snapshot format (snapshotBacking) rather than as Go slice-of-strings. Each doc group is a (pos, count) pair pointing into the binary blob, not a heap-allocated list (see the sketch after this list).
  • 256 concurrent shards — write overlays are split across 256 shards so each shard holds a small slice of pending changes. Overlays are drained on each snapshot cycle, keeping steady-state memory near zero.
  • Lazy indexing — indexes are only built for fields that are actually queried. A collection with 50 keys but 5 queried fields only holds 5 indexes in memory.
  • Disk-backed documents — documents live on disk in compressed shards. Only the index keys and document IDs are in memory. The actual document payloads are read from disk on demand.
  • Zero-allocation hot paths — query execution, index lookups, and shard reads are designed for zero heap allocation. sync.Pool is used aggressively for buffers, decoders, and temporary slices.
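
As a minimal sketch of the snapshot-backed doc-group idea above: each indexed value keeps only a small (pos, count) reference into a shared binary blob, and the doc IDs are decoded on demand. The struct names and the length-prefixed encoding here are assumptions for illustration, not the actual ALOS DB on-disk format.

Go
// Illustrative only (imports: "encoding/binary"). One possible way a
// (pos, count) doc group can reference IDs packed in a shared blob.
type docGroup struct {
    pos   uint64 // byte offset into the snapshot backing
    count uint32 // number of doc IDs in this group
}

// decodeGroup reads count length-prefixed doc IDs starting at pos.
// The group itself costs 12 bytes; the IDs live once in the shared blob.
func decodeGroup(backing []byte, g docGroup) []string {
    ids := make([]string, 0, g.count)
    off := g.pos
    for i := uint32(0); i < g.count; i++ {
        n := uint64(binary.LittleEndian.Uint16(backing[off : off+2])) // 2-byte length prefix
        off += 2
        ids = append(ids, string(backing[off:off+n]))
        off += n
    }
    return ids
}

The point is the memory shape: per-value bookkeeping stays constant-size, while the variable-length ID lists live once, contiguously, in a single shared allocation.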

Memory Breakdown at Scale

Component Typical Size Notes
B-tree index per field 2-5 MB Keys + tree metadata
Snapshot backing (non-unique) 1-3 MB per field Binary-encoded doc ID lists
Shard overlays (256 shards) <1 MB total Drained on each snapshot
Offset tables ~100 MB at 60M docs Required for O(1) doc lookup
Document payloads 0 MB Stored on disk, read on demand

1.4 GB for 60 million documents means ALOS DB uses roughly 24 bytes per document of RAM. Compare that to MongoDB's typical 100-500 bytes per document for WiredTiger cache overhead alone.

Index Rebuild Speed

Rebuilding the full index for 60 million documents with 22+ keys and sub-keys takes only 8 minutes. This is currently the fastest full index rebuild of any document database at this scale.

What "Full Rebuild" Means

A full rebuild re-reads every document from disk, extracts every field value, and reconstructs every B-tree index from scratch. This is the operation that runs when:

  • Server cold start — after a restart, indexes are rebuilt from persisted snapshots or from raw data
  • Snapshot corruption — if an index snapshot is invalid, the index is rebuilt from disk
  • Manual rebuild — RebuildIndex() drops and reconstructs all indexes for a collection

Why It's Fast

  • Two-phase loading — core indexes (offset tables + value indexes) are built first in a single pass. Secondary indexes (bloom filters, field summaries, string grams) are built in the background. Queries can run as soon as phase 1 completes.
  • Parallel shard scanning — each data shard is scanned by its own goroutine. With 256 shards, you get full CPU utilization on modern hardware (see the sketch after this list).
  • Zero-allocation field hashing — field values are hashed directly into the index without intermediate string allocations using HashFieldValuesForIndex().
  • Inline LZ4 decompression — compressed shard data is decompressed with reusable per-goroutine buffers, avoiding sync.Pool overhead during the critical scan loop.
  • BulkLoad — instead of inserting key-value pairs one at a time, the sorted pairs are loaded into the B-tree in a single batch operation, building the tree bottom-up.
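
A minimal sketch of the parallel shard scan mentioned above: one goroutine per data shard, each producing a local key-to-doc-ID map that is merged after all shards finish. The scanner callback and helper names are illustrative.

Go
// Sketch only (imports: "sync"). Each scanner reads one data shard and
// extracts field-value -> doc-ID pairs; ALOS DB runs 256 of these.
func rebuildValueIndex(scanners []func() map[string][]string) map[string][]string {
    locals := make([]map[string][]string, len(scanners))
    var wg sync.WaitGroup

    for i, scan := range scanners {
        wg.Add(1)
        go func(i int, scan func() map[string][]string) {
            defer wg.Done()
            locals[i] = scan() // shards are independent, so this saturates all cores
        }(i, scan)
    }
    wg.Wait()

    // Merge once, single-threaded, after the parallel phase.
    merged := make(map[string][]string)
    for _, local := range locals {
        for key, ids := range local {
            merged[key] = append(merged[key], ids...)
        }
    }
    return merged
}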

Rebuild Timeline

Phase 1 — Core Indexes (blocks queries until complete)
├─ Scan 256 shards in parallel
├─ Build offset tables → O(1) document lookups ready
├─ Build value indexes → exact-match queries ready
└─ indexReady channel closes → queries can execute

Phase 2 — Secondary Indexes (background, queries already working)
├─ Build bloom filters → faster impossible-query rejection
├─ Build field summaries → range query optimization
├─ Build segment tables → range scan acceleration
└─ Build string gram indexes → string pattern filtering

Queries don't wait for a full rebuild. Phase 1 completes in a fraction of the total time and unlocks all query functionality. Phase 2 runs in the background and only adds optimization layers. Your application is responsive within seconds of a cold start, even with 60 million documents.

Architecture Deep Dive

B-Tree Index Structure

Every index is a B-tree with a maximum of 128 keys per node. Leaf nodes form a linked list for efficient range traversal. The tree supports both unique indexes (one doc per key) and non-unique indexes (many docs per key via docGroup snapshot references).

Internal Structure
// Each index wraps a B-tree with 256 concurrent write shards
type Index struct {
    Field           string          // "author.id", "guild_id", etc.
    Unique          bool            // unique constraint
    sortedIndex     *SortedIndex    // B-tree with linked leaf list
    shards          [256]indexShard // concurrent write buffers
    snapshotBacking []byte          // compact binary doc-ID storage
}

// Write shards prevent lock contention
type indexShard struct {
    mu         sync.RWMutex
    addOverlay map[string][]string              // pending additions
    delOverlay map[string]map[string]struct{} // pending deletions
}
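
The linked leaf list described above is what makes range operators cheap: once the first matching leaf is found, a range scan walks sibling leaves instead of re-descending the tree. A rough sketch, with illustrative field names and plain string comparison standing in for the real key ordering:

Go
// Sketch only. Up to 128 keys per leaf, with a next pointer chaining leaves
// in key order so $gt/$lt-style scans are a linear walk.
type leafNode struct {
    keys [128]string // sorted keys held by this leaf
    refs [128]uint64 // doc-group references (illustrative placeholder)
    n    int         // keys currently in use
    next *leafNode   // sibling link used for range traversal
}

// scanRange collects references for keys in [lo, hi], starting from the leaf
// that a tree descent would have located for lo.
func scanRange(start *leafNode, lo, hi string) []uint64 {
    var out []uint64
    for leaf := start; leaf != nil; leaf = leaf.next {
        for i := 0; i < leaf.n; i++ {
            switch k := leaf.keys[i]; {
            case k < lo:
                continue // only possible in the first leaf
            case k > hi:
                return out // keys are globally sorted, so we can stop
            default:
                out = append(out, leaf.refs[i])
            }
        }
    }
    return out
}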

256-Shard Concurrency

Every index is divided into 256 write shards. Each key is assigned to a shard via FastHash64String(key) & 255. This means 256 goroutines can write to 256 different keys simultaneously with zero lock contention. Read operations access the B-tree directly with no shard locking required.
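
A sketch of that routing, using the Index and indexShard fields shown above. FastHash64String's implementation isn't documented here, so FNV-1a stands in as the hash, and the addPending method name is hypothetical.

Go
// Sketch only (imports: "hash/fnv"). Route a pending write to one of the 256
// shards so unrelated keys never contend on the same lock.
func (idx *Index) addPending(key, docID string) {
    h := fnv.New64a() // stand-in for FastHash64String
    h.Write([]byte(key))
    shard := &idx.shards[h.Sum64()&255] // 256 shards -> mask with 0xFF

    shard.mu.Lock()
    defer shard.mu.Unlock()
    if shard.addOverlay == nil {
        shard.addOverlay = make(map[string][]string)
    }
    shard.addOverlay[key] = append(shard.addOverlay[key], docID)
}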

Snapshot Persistence

Indexes are periodically serialized to disk as binary snapshots. On restart, snapshots are loaded directly into memory — no full rebuild needed if the snapshot is valid. Snapshot format v4 stores: header, sorted keys, doc-ID counts, and packed doc-ID lists. This is faster than rebuilding from raw documents by an order of magnitude.
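
A hedged sketch of what serializing that layout could look like, in the section order the description gives (header, sorted keys, doc-ID counts, packed doc-ID lists). The magic string, version value, and field widths are assumptions, not the actual v4 byte format.

Go
// Illustrative only (imports: "bufio", "encoding/binary", "io").
func writeSnapshot(w io.Writer, keys []string, groups map[string][]string) error {
    buf := bufio.NewWriter(w)

    // Header: magic + version + key count (assumed values and widths).
    buf.WriteString("ALOSIDX")
    binary.Write(buf, binary.LittleEndian, uint16(4))
    binary.Write(buf, binary.LittleEndian, uint64(len(keys)))

    // Sorted keys, length-prefixed.
    for _, k := range keys {
        binary.Write(buf, binary.LittleEndian, uint32(len(k)))
        buf.WriteString(k)
    }
    // Doc-ID counts, one per key, in the same order.
    for _, k := range keys {
        binary.Write(buf, binary.LittleEndian, uint32(len(groups[k])))
    }
    // Packed doc-ID lists.
    for _, k := range keys {
        for _, id := range groups[k] {
            binary.Write(buf, binary.LittleEndian, uint32(len(id)))
            buf.WriteString(id)
        }
    }
    return buf.Flush()
}

Loading is then a sequential read into pre-sized structures, which is why restoring from a valid snapshot beats re-extracting fields from raw documents.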

Two-Phase Loading

ALOS DB uses a two-phase index loading strategy that prioritizes query availability over complete index optimization. This is what makes cold starts fast.

Phase 1: Core Indexes (Blocking)

The first phase builds the minimum set of indexes required for correct query execution:

  • Offset tables — mapping from document ID to disk position. Required for O(1) document reads.
  • Value indexes — field-value to document-ID mappings. Required for index-accelerated queries.

Phase 1 performs a single-pass scan of each shard. No MVCC pre-scan is needed. Doc count is derived from the scan itself, skipping an expensive CountDocuments full-scan. When phase 1 completes, the indexReady channel closes and all queries can execute.
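
The indexReady gate can be pictured as below; the channel name comes from the text, while the surrounding type and method names are illustrative.

Go
// Sketch only. Phase 1 blocks startup, then the channel close releases
// queries while phase 2 keeps running in the background.
type collectionIndexes struct {
    indexReady chan struct{}
}

func newCollectionIndexes() *collectionIndexes {
    return &collectionIndexes{indexReady: make(chan struct{})}
}

func (c *collectionIndexes) load() {
    c.buildCoreIndexes()         // phase 1: offset tables + value indexes, single pass
    close(c.indexReady)          // queries may execute from this point on
    go c.buildSecondaryIndexes() // phase 2: bloom filters, summaries, grams
}

func (c *collectionIndexes) waitQueryable() { <-c.indexReady }

func (c *collectionIndexes) buildCoreIndexes()      { /* scan shards, build core indexes */ }
func (c *collectionIndexes) buildSecondaryIndexes() { /* build optimization layers */ }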

Phase 2: Secondary Indexes (Background)

The second phase runs in a background goroutine after indexReady closes. It builds optimization layers that make queries faster but are not required for correctness:

  • Bloom filters — probabilistic structures for instantly rejecting impossible queries
  • Field summaries — min/max/cardinality metadata for range query optimization
  • Segment tables — shard-level field-value summaries for targeted scan acceleration
  • String gram indexes — substring indexes for fast pattern matching pre-filtering

Every query path has a nil-safe fallback for missing secondary indexes. If a bloom filter hasn't been built yet, the query falls back to a shard scan. If a field summary is missing, the range query scans all candidates. The result is always correct — secondary indexes only make it faster.
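
A minimal sketch of that nil-safe pattern, with illustrative interfaces standing in for the real bloom filter and B-tree types:

Go
// Sketch only. Secondary structures are consulted when present; the answer is
// correct either way, they only change how much work the query does.
type bloomFilter interface{ mayContain(value string) bool }
type valueTree interface{ lookup(value string) []string }

type fieldIndex struct {
    bloom bloomFilter                 // nil until phase 2 builds it
    tree  valueTree                   // nil until phase 1 builds it
    scan  func(value string) []string // always available: shard scan
}

func (idx *fieldIndex) candidates(value string) []string {
    if idx.bloom != nil && !idx.bloom.mayContain(value) {
        return nil // provably impossible: reject without reading any document
    }
    if idx.tree != nil {
        return idx.tree.lookup(value) // fast path: B-tree lookup
    }
    return idx.scan(value) // fallback: slower, still correct
}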

Query Planning

ALOS DB automatically selects the best execution strategy for every query. There is no EXPLAIN to run and no query hints to provide. The planner chooses the optimal path in microseconds.

Execution Priority

1. Bloom Filter Rejection — queryMayMatch() rejects impossible queries in nanoseconds.
2. Fully-Indexed Path — all conditions use indexes; intersect/union the index results. Fastest: no document reads are needed for filtering.
3. Best Single-Index Plan — select the index with the smallest result set, then iterate and filter. Ranking: unique exact (0) → non-unique exact (1) → unique range (2) → non-unique range (3).
4. Full Scan Fallback — only for queries with zero indexable conditions (rare).
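
One plausible reading of step 3's ranking as code; the struct and function names are illustrative, and the real planner also weighs the estimated result-set size, which this sketch omits.

Go
// Sketch only. Lower score = preferred single-index plan, mirroring the
// ranking above: unique exact (0) ... non-unique range (3).
type indexPlan struct {
    field   string
    unique  bool
    isRange bool // $gt/$gte/$lt/$lte vs. exact, $eq, $in
}

func planScore(p indexPlan) int {
    switch {
    case p.unique && !p.isRange:
        return 0 // unique exact
    case !p.unique && !p.isRange:
        return 1 // non-unique exact
    case p.unique && p.isRange:
        return 2 // unique range
    default:
        return 3 // non-unique range
    }
}

// bestPlan returns the highest-priority candidate, or false to signal the
// full-scan fallback when nothing is indexable.
func bestPlan(candidates []indexPlan) (indexPlan, bool) {
    if len(candidates) == 0 {
        return indexPlan{}, false
    }
    best := candidates[0]
    for _, c := range candidates[1:] {
        if planScore(c) < planScore(best) {
            best = c
        }
    }
    return best, true
}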

Index-Accelerated Operators

Operator Index Used Execution
field: value (exact) Yes Single B-tree lookup
$eq Yes Single B-tree lookup
$in Yes Multi-key lookup, merge results
$gt / $gte / $lt / $lte Yes Linked-leaf range traversal
$regex Prefix/Literal Anchored-prefix B-tree traversal; otherwise cached regex filter
$exists Metadata Field-summary pruning, then filtered scan only when needed
$ne / $nin Filter Only Post-filter on the reduced candidate set; full scan only if no other clause narrows work

$regex is index-accelerated when the pattern has a usable literal prefix such as "^john" or an exact literal match. Unanchored, suffix, substring, and case-insensitive regex patterns fall back to cached regex evaluation after candidate reduction. $exists is not a direct single-key lookup like $eq, but it still benefits from shard field summaries that can instantly prove a field is absent everywhere or present everywhere before any document scan starts. $ne and $nin currently do not select candidate IDs from the B-tree on their own; they become fast when paired with an indexed or metadata-pruned clause that has already made the remaining candidate set small.

Mixed Query Optimization

When a query mixes directly indexed operators, metadata-assisted operators, and fallback filters, ALOS DB uses the best indexed clause first, applies shard-summary pruning where available, and only then evaluates the remaining filters against the reduced candidate set:

Query Execution Example
// Query: {guild_id: "123", author.username: {$regex: "^john"}, profile.bio: {$exists: true}}
//
// Step 1: Use exact index on "guild_id" to cut the search space immediately
// Step 2: Use anchored-prefix regex index traversal on "author.username"
// Step 3: Use field-summary metadata to skip shards that cannot satisfy profile.bio existence
// Step 4: Evaluate any remaining filters on the already-reduced candidate set
//
// "Instant" queries often come from this combination of index traversal + metadata pruning,
// not necessarily from every operator being a standalone exact lookup.