Configuration System Review And Redesign

1. Goal

This document reviews the current configuration system and proposes a practical redesign for long-term maintainability.

The target is a configuration system that is:

  • unified in loading and ownership
  • clear in boundaries and precedence
  • visible in effective behavior
  • easy to evolve across development, deployment, and operations

This review is based on the current implementation, not only on the intended architecture described in the docs.

2. Project Context

The repo already defines the right architectural direction:

  • config/config.yaml should be the main configuration source for search behavior and service wiring
  • .env should mainly carry deployment-specific values and secrets
  • provider/backend expansion should stay centralized instead of spreading through business code

That direction is described in:

The problem is not the architectural intent. The problem is that the current implementation only partially follows it.

3. Current-State Review

3.1 What exists today

The current system effectively has several configuration channels:

  • config/config.yaml
    • search behavior
    • rerank behavior
    • services registry
    • tenant config
  • config/config_loader.py
    • parses search behavior and tenant config into SearchConfig
    • also injects some defaults from code
  • config/services_config.py
    • parses config/config.yaml a second time, independently of config_loader.py
    • resolves translation, embedding, rerank service config
    • also applies env overrides
  • config/env_config.py
    • loads .env
    • defines ES, Redis, DB, host/port, service URLs, namespace, model path defaults
  • service-local config modules
  • startup scripts
    • derive defaults from shell env, Python config, and YAML in different combinations
  • inline fallbacks in business logic
    • query parsing
    • indexing
    • service startup

3.2 Main findings

Finding A: there is no single loader for the full effective configuration

ConfigLoader and services_config both parse config/config.yaml, but they do so separately and with different responsibilities.

Impact:

  • the same file is loaded twice through different code paths
  • search config and services config can drift in interpretation
  • alternative config paths are hard to support cleanly
  • tests and tools cannot ask one place for the full effective config tree

Finding B: precedence is not explicit, stable, or globally enforced

Current precedence differs by subsystem:

  • search behavior mostly comes from YAML plus code defaults
  • embedding and rerank allow env overrides for provider/backend/url
  • translation intentionally blocks some env overrides
  • startup scripts still choose host/port and mode via env
  • some values are reconstructed from other env vars

Examples:

Impact:

  • operators cannot reliably predict the effective configuration by reading one file
  • the same setting category behaves differently across services
  • incidents become harder to debug because source-of-truth depends on the code path

Finding C: defaults are duplicated across YAML and code

There are several layers of default values:

  • dataclass defaults in QueryConfig
  • fallback defaults in ConfigLoader._parse_config
  • defaults in config.yaml
  • defaults in env_config.py
  • defaults in embeddings/config.py
  • defaults in reranker/config.py
  • defaults in startup scripts

Examples:

Impact:

  • changing a default is risky because there may be multiple hidden copies
  • code review cannot easily tell whether a value is authoritative or dead legacy
  • “same config” may behave differently across processes

Finding D: config is still embedded in runtime logic

Some important behavior remains encoded as inline fallback logic rather than declared config.

Examples:

Impact:

  • configuration is not fully visible in config files
  • behavior can silently change when tenant config is missing or malformed
  • “default behavior” is spread across business modules

Finding E: some configuration assets are not managed as first-class config

Query rewrite is configured through an external file, but the file path is hardcoded and currently inconsistent with the repository content.

There is also an admin API that mutates rewrite rules in memory only:

Impact:

  • rewrite rules are neither cleanly file-backed nor fully runtime-managed
  • restart behavior is unclear
  • configuration visibility and persistence are weak

Finding F: visibility is limited

The system exposes only a small sanitized subset at /admin/config.

At the same time, the true effective config includes:

  • tenant overlays
  • env overrides
  • service backend selections
  • script-selected modes
  • hidden defaults in code

Impact:

  • there is no authoritative “effective config” view
  • debugging configuration mismatches requires source reading
  • operators cannot easily verify what each process actually started with

Finding G: the indexer does not really consume the unified config as a first-class dependency

The indexer startup code explicitly states that config is loaded only for parity/logging and that routes do not depend on it.

Impact:

  • configuration is not truly system-wide
  • search-side and indexer-side behavior can drift
  • the current “unified config” is only partially unified

Finding H: docs still carry legacy and mixed mental models

Most high-level docs describe the desired centralized model, but some implementation code and docs still expose legacy concepts such as translate_to_en and translate_to_zh.

Impact:

  • new developers may follow old mental models
  • cleanup work keeps getting deferred because both the old and the new system appear to be “supported”

4. Design Principles For The Redesign

The redesign should follow these rules.

4.1 One logical configuration system

It is acceptable to have multiple files, but not multiple loaders with overlapping ownership.

There must be one loader pipeline that produces one typed AppConfig.

4.2 Configuration files declare, parser code interprets, env provides runtime injection

Responsibilities should be:

  • configuration files
    • declare non-secret desired behavior and non-secret deployable settings
  • parsing logic
    • load, merge, validate, normalize, and expose typed config
    • never invent hidden business behavior
  • environment variables
    • carry secrets and a small set of runtime/process values
    • do not redefine business behavior casually

4.3 One precedence rule for the whole system

Every config category should follow the same merge model unless explicitly exempted.

4.4 No silent implicit fallback for business behavior

Fail fast at startup when required config is missing or invalid.

Do not silently fall back to legacy behavior such as hardcoded language lists.
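
A minimal sketch of what fail-fast could look like for one tenant field. The `index_languages` key is taken from the tenant fields recommended in section 5.8; `ConfigError` and `require_index_languages` are hypothetical names used only for illustration.

```python
class ConfigError(ValueError):
    """Raised at startup when required configuration is missing or invalid."""


def require_index_languages(tenant_id: str, tenant_cfg: dict) -> list[str]:
    """Return the tenant's index languages or fail; never substitute a default."""
    languages = tenant_cfg.get("index_languages")
    if not isinstance(languages, list) or not languages:
        raise ConfigError(
            f"tenant {tenant_id!r}: 'index_languages' must be a non-empty list"
        )
    if not all(isinstance(lang, str) and lang for lang in languages):
        raise ConfigError(
            f"tenant {tenant_id!r}: 'index_languages' must contain non-empty strings"
        )
    return list(languages)
```

The point of the sketch is the absence of an `else` branch that returns a hardcoded list: a missing value stops the process at startup instead of changing behavior silently.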

4.5 Effective configuration must be observable

Every service should be able to show:

  • config version or hash
  • source files loaded
  • environment name
  • sanitized effective configuration

5. Proposed Redesign

5.1 Boundary model

Use three clear layers.

Layer 1: repository-managed static config

Purpose:

  • search behavior
  • tenant behavior
  • provider/backend registry
  • non-secret service topology defaults
  • feature switches

Examples:

  • field boosts
  • query strategy
  • rerank fusion parameters
  • tenant language plans
  • translation capability registry
  • embedding backend selection default

Layer 2: environment-specific overlays

Purpose:

  • per-environment non-secret differences
  • service endpoints by environment
  • resource sizing defaults by environment
  • dev/test/prod operational differences

Examples:

  • local embedding URL vs production URL
  • dev rerank backend vs prod rerank backend
  • lower concurrency in local development

Layer 3: environment variables

Purpose:

  • secrets
  • bind host/port
  • external infrastructure credentials
  • container-orchestrator last-mile injection

Examples:

  • ES_HOST, ES_USERNAME, ES_PASSWORD
  • DB_HOST, DB_USERNAME, DB_PASSWORD
  • REDIS_HOST, REDIS_PASSWORD
  • DASHSCOPE_API_KEY, DEEPL_AUTH_KEY
  • API_HOST, API_PORT, INDEXER_PORT, TRANSLATION_PORT

Rule:

  • environment variables should not be the normal path for choosing business behavior such as translation model, embedding backend, or tenant language policy
  • if an env override is allowed for a non-secret field, it must be explicitly listed and documented as an operational override, not a hidden convention
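
One way to make the allowlist explicit is a single mapping from environment variable to config key. This is a sketch only: the dotted key names and the `collect_env_overrides` helper are assumptions, and the three variables are a subset of the examples listed above.

```python
import os
from typing import Mapping, Optional

# Hypothetical allowlist: env var name -> dotted config key it may override.
ALLOWED_ENV_OVERRIDES = {
    "API_HOST": "runtime.api.host",
    "API_PORT": "runtime.api.port",
    "ES_PASSWORD": "infrastructure.es.password",
}


def collect_env_overrides(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return {dotted_key: value} for allowlisted variables that are set."""
    env = os.environ if environ is None else environ
    return {
        key: env[name]
        for name, key in ALLOWED_ENV_OVERRIDES.items()
        if name in env
    }
```

Any variable that is not in the mapping is ignored by construction, so adding a new operational override requires a reviewed change to one table.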

5.2 Unified precedence

Recommended precedence:

  1. schema defaults in code
  2. config/base.yaml
  3. config/environments/<env>.yaml
  4. tenant overlay from config/tenants/
  5. environment variables for the explicitly allowed runtime keys
  6. CLI flags for the current process only

Important rule:

  • only one module may implement this merge logic
  • no business module may call os.getenv() directly for configuration
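
The merge itself can be small. The sketch below assumes each layer has already been parsed into a plain dict and that lists are replaced rather than concatenated (a design choice this document does not specify). The function names are illustrative; the sample values reuse the translation settings shown in section 5.7.

```python
from copy import deepcopy
from functools import reduce


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay wins; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def merge_layers(*layers: dict) -> dict:
    """Merge layers given in precedence order, lowest precedence first."""
    return reduce(deep_merge, layers, {})


# Illustrative layers: schema defaults, base.yaml, environments/<env>.yaml.
schema_defaults = {"services": {"translation": {"timeout_sec": 10, "default_model": "llm"}}}
base_yaml = {"services": {"translation": {"endpoint": "http://translator:6006"}}}
env_yaml = {"services": {"translation": {"timeout_sec": 3}}}

effective = merge_layers(schema_defaults, base_yaml, env_yaml)
```

Here `effective` keeps `default_model` from the schema defaults and `endpoint` from the base file, while `timeout_sec` comes from the environment overlay because it is the highest layer that sets it. The inputs are not mutated.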

5.3 Recommended file layout

config/
  schema.py
  loader.py
  sources.py
  base.yaml
  environments/
    dev.yaml
    test.yaml
    prod.yaml
  tenants/
    _default.yaml
    1.yaml
    162.yaml
    170.yaml
  dictionaries/
    query_rewrite.dict
  README.md
.env.example

Notes:

  • base.yaml contains shared defaults and feature behavior
  • environments/*.yaml contains environment-specific non-secret overrides
  • tenants/*.yaml contains tenant-specific overrides only
  • dictionaries/ stores first-class config assets such as rewrite dictionaries
  • schema.py defines the typed config model
  • loader.py is the only entry point that loads and merges config

If the team prefers fewer files, a single tenants.yaml instead of the tenants/ directory is also acceptable. The key requirement is not “one file”, but “one loading model with clear ownership”.

5.4 Typed configuration model

Introduce one root object, for example:

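from pydantic import BaseModel  # assumption: BaseModel refers to pydantic

# The sub-models below are assumed to be defined alongside AppConfig in config/schema.py.
class AppConfig(BaseModel):
    runtime: RuntimeConfig
    infrastructure: InfrastructureConfig
    search: SearchConfig
    services: ServicesConfig
    tenants: TenantCatalogConfig
    assets: ConfigAssets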

Suggested subtrees:

  • runtime
    • environment name
    • config revision/hash
    • bind addresses/ports
  • infrastructure
    • ES
    • DB
    • Redis
    • index namespace
  • search
    • field boosts
    • query config
    • function score
    • rerank behavior
    • spu config
  • services
    • translation
    • embedding
    • rerank
  • tenants
    • default tenant config
    • tenant overrides
  • assets
    • rewrite dictionary path

Benefits:

  • one validated object shared by backend, indexer, translator, embedding, reranker
  • one place for defaults
  • one place for schema evolution

5.5 Loading flow

Recommended loading flow:

  1. determine the environment name from APP_ENV or RUNTIME_ENV
  2. load schema defaults
  3. load config/base.yaml
  4. load config/environments/<env>.yaml if present
  5. load tenant files
  6. inject first-class assets such as rewrite dictionary
  7. apply allowed env overrides
  8. validate the final AppConfig
  9. freeze and cache the config object
  10. expose a sanitized effective-config view

Important:

  • every process should call the same loader
  • services should receive a resolved AppConfig, not re-open YAML independently

5.6 Clear responsibility split

Configuration files are responsible for:

  • what the system should do
  • what providers/backends are available
  • which features are enabled
  • tenant language/index policies
  • non-secret service topology

Parser/loader code is responsible for:

  • locating sources
  • merge precedence
  • type validation
  • normalization
  • deprecation warnings
  • producing the final immutable config object

Environment variables are responsible for:

  • secrets
  • bind addresses/ports
  • infrastructure endpoints when the deployment platform injects them
  • a very small set of documented operational overrides

Business code is not responsible for:

  • inventing defaults for missing config
  • loading YAML directly
  • calling os.getenv() for normal application behavior

5.7 How to handle service config

Unify all service-facing config under one structure:

services:
  translation:
    endpoint: "http://translator:6006"
    timeout_sec: 10
    default_model: "llm"
    default_scene: "general"
    capabilities: ...
  embedding:
    endpoint:
      text: "http://embedding:6005"
      image: "http://embedding-image:6008"
    backend: "tei"
    backends: ...
  rerank:
    endpoint: "http://reranker:6007/rerank"
    backend: "qwen3_vllm"
    backends: ...

Rules:

  • endpoint is how callers reach the service
  • backend is how the service itself is implemented
  • only the service process cares about backend
  • only callers care about endpoint
  • both still belong to the same config tree, because they are part of one system

5.8 How to handle tenant config

Tenant config should become explicit policy, not translation-era leftovers.

Recommended tenant fields:

  • primary_language
  • index_languages
  • search_languages
  • translation_policy
  • facet_policy
  • optional tenant-specific ranking overrides

Avoid keeping translate_to_en and translate_to_zh as active concepts in the long-term model.

If compatibility is needed, support them only in the loader as deprecated aliases and emit warnings.
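
A sketch of loader-side alias handling follows. The target shape is an assumption: this document does not define how the legacy flags map onto the new fields, so the example folds them into a hypothetical `translation_policy.targets` list. `normalize_tenant` is an illustrative name.

```python
import warnings

# Assumed mapping from legacy boolean flags to a target language code.
LEGACY_TRANSLATE_FLAGS = {"translate_to_en": "en", "translate_to_zh": "zh"}


def normalize_tenant(raw: dict) -> tuple[dict, list[str]]:
    """Return (normalized config, deprecated keys seen) without mutating raw."""
    cfg = {k: v for k, v in raw.items() if k not in LEGACY_TRANSLATE_FLAGS}
    policy = dict(cfg.get("translation_policy") or {})
    targets = list(policy.get("targets", []))
    deprecated = []
    for flag, lang in LEGACY_TRANSLATE_FLAGS.items():
        if flag not in raw:
            continue
        deprecated.append(flag)
        warnings.warn(
            f"tenant key {flag!r} is deprecated; use translation_policy.targets",
            DeprecationWarning,
            stacklevel=2,
        )
        if raw[flag] and lang not in targets:
            targets.append(lang)
    if deprecated:
        policy["targets"] = targets
        cfg["translation_policy"] = policy
    return cfg, deprecated
```

Returning the list of deprecated keys alongside the normalized config lets the loader report them later, for example as the "deprecated keys in use" entry proposed in section 5.10.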

5.9 How to handle rewrite rules and similar assets

Treat them as declared config assets.

Recommended rules:

  • file path declared in config
  • one canonical location under config/dictionaries/
  • loader validates presence and format
  • admin runtime updates either:
    • are removed, or
    • write back through a controlled persistence path

Do not keep a hybrid model where startup loads one file and admin mutates only in memory.
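
Loader-side validation of such an asset might look like the sketch below. The file format is an assumption made for illustration (one tab-separated `source<TAB>target` rule per line, `#` for comments); the actual format of the rewrite dictionary is not described in this document.

```python
from pathlib import Path


def load_rewrite_dictionary(path: Path) -> dict[str, str]:
    """Load rewrite rules, failing on a missing file or a malformed line."""
    if not path.is_file():
        raise FileNotFoundError(f"rewrite dictionary not found: {path}")
    rules: dict[str, str] = {}
    lines = path.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.split("\t")
        if len(parts) != 2 or not all(p.strip() for p in parts):
            raise ValueError(f"{path}:{lineno}: expected 'source<TAB>target'")
        rules[parts[0].strip()] = parts[1].strip()
    return rules
```

Because both a missing file and a malformed line raise at load time, a path mismatch like the one described in Finding E would surface at startup rather than as silently absent rewrite rules.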

5.10 Observability improvements

Add the following:

  • config dump CLI that prints sanitized effective config
  • startup log with config hash, environment, and config file list
  • /admin/config/effective endpoint returning sanitized effective config
  • /admin/config/meta endpoint returning:
    • environment
    • config hash
    • loaded source files
    • deprecated keys in use

This is important for operations and for multi-service debugging.
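
The sanitized view and the config hash can share one helper. The sketch assumes string keys and a naming convention in which secret fields contain one of the listed markers; both the marker list and the function names are hypothetical.

```python
import hashlib
import json

# Assumed convention: keys whose name contains one of these markers are secrets.
SECRET_MARKERS = ("password", "secret", "api_key", "auth_key", "token")


def sanitize(value):
    """Return a copy of the config tree with secret values masked."""
    if isinstance(value, dict):
        return {
            k: "***" if any(m in str(k).lower() for m in SECRET_MARKERS) else sanitize(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value


def config_hash(tree: dict) -> str:
    """Stable short hash of the sanitized tree, independent of key order."""
    payload = json.dumps(sanitize(tree), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Hashing the sanitized tree is a deliberate choice in this sketch: the hash can be logged and compared across services without depending on secret values, at the cost of not detecting a rotated secret.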

6. Practical Refactor Plan

The refactor should be incremental.

Phase 1: establish the new config core without changing behavior

  • create config/schema.py
  • create config/loader.py
  • move all current defaults into schema models
  • make loader read current config/config.yaml
  • make loader read .env only for approved keys
  • expose one get_app_config()

Result:

  • same behavior, but one typed root config becomes available

Phase 2: remove duplicate readers

  • make services_config.py a thin adapter over get_app_config()
  • make tenant_config_loader.py read from get_app_config()
  • stop reparsing YAML in services_config.py
  • stop service modules from depending on legacy local config modules for behavior

Result:

  • one parsing path
  • fewer divergence risks
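
A thin adapter in this phase could be as small as the sketch below. `get_app_config` is stubbed with a plain dict here, and `get_rerank_service_config` is a hypothetical accessor name; the existing function names in services_config.py are not listed in this document.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_app_config() -> dict:
    # Stand-in for the unified loader; the real one would return a typed AppConfig.
    return {
        "services": {
            "rerank": {
                "endpoint": "http://reranker:6007/rerank",
                "backend": "qwen3_vllm",
            },
        }
    }


def get_rerank_service_config() -> dict:
    """Legacy-style accessor kept for callers; it no longer opens YAML itself."""
    return dict(get_app_config()["services"]["rerank"])
```

Existing callers keep their import path while the parsing logic moves behind the single loader, which keeps this phase behavior-preserving.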

Phase 3: move hidden defaults out of business logic

  • remove hardcoded fallback language lists from query/indexer modules
  • require tenant defaults to come from config schema only
  • remove duplicate behavior defaults from service code

Result:

  • behavior becomes visible and reviewable

Phase 4: clean service startup configuration

  • make startup scripts ask the unified loader for resolved values
  • keep only bind host/port and secret injection in shell env
  • retire or reduce embeddings/config.py and reranker/config.py

Result:

  • startup behavior matches runtime config model

Phase 5: split config files by responsibility

  • keep a single root loader
  • split the current monolithic config.yaml into:
    • base.yaml
    • environments/<env>.yaml
    • tenants/*.yaml
    • dictionaries/query_rewrite.dict

Result:

  • config remains unified logically, but is easier to read and maintain physically

Phase 6: deprecate legacy compatibility

  • deprecate translate_to_en and translate_to_zh
  • deprecate env-based backend/provider selection except for explicitly approved keys
  • remove old code paths after one or two release cycles

Result:

  • the system becomes simpler instead of carrying two generations forever

7. Concrete Rules To Adopt

These rules should be documented and enforced in code review.

Rule 1

Only config/loader.py may load config files or .env.

Rule 2

Only config/loader.py may read os.getenv() for application config.

Rule 3

Business modules receive typed config objects and do not read files or env directly.
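
Rules 2 and 3 could be backed by an automated check in addition to code review. The sketch below uses the standard `ast` module to find direct `os.getenv` and `os.environ` usage in a source string; `find_env_reads` is a hypothetical helper, and it does not catch aliased imports such as `from os import getenv`.

```python
import ast


def find_env_reads(source: str) -> list[int]:
    """Return line numbers of os.getenv and os.environ references."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "os"
            and node.attr in {"getenv", "environ"}
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

A CI step could run this over every module except config/loader.py and fail the build when the returned list is not empty.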

Rule 4

Each config key has one owner.

Examples:

  • search.query.knn_boost belongs to search behavior config
  • services.embedding.backend belongs to service implementation config
  • infrastructure.redis.password belongs to env/secrets

Rule 5

Every fallback must be either:

  • declared in schema defaults, or
  • rejected at startup

No hidden fallback in runtime logic.

Rule 6

Every configuration asset must be visible in exactly one of these places:

  • config file
  • env var
  • generated runtime metadata

It must not live inside parser code as an implicit constant.

8. Naming Conventions

Suggested conventions:

  • config keys use noun-based hierarchical names
  • avoid mixing transport and implementation concepts in one field
  • use endpoint for caller-facing addresses
  • use backend for service-internal implementation choice
  • use enabled only for true feature toggles
  • use default_* only when a real selection happens at runtime

Examples:

  • good: services.rerank.endpoint
  • good: services.rerank.backend
  • good: tenants.default.index_languages
  • avoid: service_url, base_url, provider, backend, and script env all meaning slightly different things without a common model

9. Highest-Priority Cleanup Items

If the team wants the shortest path to improvement, start here:

  1. build one root AppConfig
  2. make services_config.py stop reparsing YAML
  3. declare rewrite dictionary path explicitly and fix the current mismatch
  4. remove hardcoded ["en", "zh"] fallbacks from query/indexer logic
  5. replace /admin/config with an effective-config endpoint
  6. retire embeddings/config.py and reranker/config.py as behavior sources
  7. deprecate legacy tenant translation flags

10. Expected Outcome

After the redesign:

  • developers can answer “where does this setting come from?” in one step
  • operators can see effective config without reading source code
  • backend, indexer, translator, embedding, and reranker all share one model
  • tenant behavior is explicit instead of partially implicit
  • migration becomes safer because defaults and precedence are centralized
  • adding a new provider/backend becomes configuration extension, not configuration archaeology

11. Summary

The current system has the right intent but not yet the right implementation shape.

Today the main problems are:

  • duplicate config loaders
  • inconsistent precedence
  • duplicated defaults
  • config hidden in runtime logic
  • weak effective-config visibility
  • leftover legacy concepts

The recommended direction is:

  • one root typed config
  • one loader pipeline
  • explicit layered sources
  • narrow env responsibility
  • no hidden business fallbacks
  • observable effective config

That design is practical to implement incrementally in this repository and aligns well with the project's multi-tenant, multi-service, provider/backend-based architecture.