Configuration System Review And Redesign
1. Goal
This document reviews the current configuration system and proposes a practical redesign for long-term maintainability.
The target is a configuration system that is:
- unified in loading and ownership
- clear in boundaries and precedence
- visible in effective behavior
- easy to evolve across development, deployment, and operations
This review is based on the current implementation, not only on the architecture the docs describe.
2. Project Context
The repo already defines the right architectural direction:
- config/config.yaml should be the main configuration source for search behavior and service wiring
- .env should mainly carry deployment-specific values and secrets
- provider/backend expansion should stay centralized instead of spreading through business code
That direction is described in:
- README.md
- docs/DEVELOPER_GUIDE.md
- docs/QUICKSTART.md
- translation/README.md
The problem is not the architectural intent. The problem is that the current implementation only partially follows it.
3. Current-State Review
3.1 What exists today
The current system effectively has several configuration channels:
- config/config.yaml
  - search behavior
  - rerank behavior
  - services registry
  - tenant config
- config/config_loader.py
  - parses search behavior and tenant config into SearchConfig
  - also injects some defaults from code
- config/services_config.py
  - reparses config/config.yaml again, independently
  - resolves translation, embedding, rerank service config
  - also applies env overrides
- config/env_config.py
  - loads .env
  - defines ES, Redis, DB, host/port, service URLs, namespace, model path defaults
- service-local config modules
- startup scripts
  - derive defaults from shell env, Python config, and YAML in different combinations
- inline fallbacks in business logic
  - query parsing
  - indexing
  - service startup
3.2 Main findings
Finding A: there is no single loader for the full effective configuration
ConfigLoader and services_config both parse config/config.yaml, but they do so separately and with different responsibilities.
Impact:
- the same file is loaded twice through different code paths
- search config and services config can drift in interpretation
- alternative config paths are hard to support cleanly
- tests and tools cannot ask one place for the full effective config tree
Finding B: precedence is not explicit, stable, or globally enforced
Current precedence differs by subsystem:
- search behavior mostly comes from YAML plus code defaults
- embedding and rerank allow env overrides for provider/backend/url
- translation intentionally blocks some env overrides
- startup scripts still choose host/port and mode via env
- some values are reconstructed from other env vars
Examples:
- env override for embedding provider/url/backend:
- host/port and service URL reconstruction:
- translator host/port still driven by startup env:
Impact:
- operators cannot reliably predict the effective configuration by reading one file
- the same setting category behaves differently across services
- incidents become harder to debug because source-of-truth depends on the code path
Finding C: defaults are duplicated across YAML and code
There are several layers of default values:
- dataclass defaults in QueryConfig
- fallback defaults in ConfigLoader._parse_config
- defaults in config.yaml
- defaults in env_config.py
- defaults in embeddings/config.py
- defaults in reranker/config.py
- defaults in startup scripts
Examples:
- query defaults duplicated in dataclass and parser:
- embedding defaults duplicated in YAML, services_config, embeddings/config.py, and startup script:
- reranker defaults duplicated in YAML and reranker/config.py:
Impact:
- changing a default is risky because there may be multiple hidden copies
- code review cannot easily tell whether a value is authoritative or dead legacy
- “same config” may behave differently across processes
Finding D: config is still embedded in runtime logic
Some important behavior remains encoded as inline fallback logic rather than declared config.
Examples:
- query-time translation target languages fall back to ["en", "zh"]:
- indexer text handling and LLM enrichment also fall back to ["en", "zh"]:
Impact:
- configuration is not fully visible in config files
- behavior can silently change when tenant config is missing or malformed
- “default behavior” is spread across business modules
Finding E: some configuration assets are not managed as first-class config
Query rewrite is configured through an external file, but the file path is hardcoded and currently inconsistent with the repository content.
- loader expects:
- repo currently contains:
There is also an admin API that mutates rewrite rules in memory only:
Impact:
- rewrite rules are neither cleanly file-backed nor fully runtime-managed
- restart behavior is unclear
- configuration visibility and persistence are weak
Finding F: visibility is limited
The system exposes only a small sanitized subset at /admin/config.
At the same time, the true effective config includes:
- tenant overlays
- env overrides
- service backend selections
- script-selected modes
- hidden defaults in code
Impact:
- there is no authoritative “effective config” view
- debugging configuration mismatches requires source reading
- operators cannot easily verify what each process actually started with
Finding G: the indexer does not really consume the unified config as a first-class dependency
Indexer startup explicitly says config is loaded only for parity/logging and routes do not depend on it.
Impact:
- configuration is not truly system-wide
- search-side and indexer-side behavior can drift
- the current “unified config” is only partially unified
Finding H: docs still carry legacy and mixed mental models
Most high-level docs describe the desired centralized model, but some implementation/docs still expose legacy concepts such as translate_to_en and translate_to_zh.
- desired model:
- legacy tenant translation flags still documented:
Impact:
- new developers may follow old mental models
- cleanup work keeps getting deferred because both the old and the new system appear to be “supported”
4. Design Principles For The Redesign
The redesign should follow these rules.
4.1 One logical configuration system
It is acceptable to have multiple files, but not multiple loaders with overlapping ownership.
There must be one loader pipeline that produces one typed AppConfig.
4.2 Configuration files declare, parser code interprets, env provides runtime injection
Responsibilities should be:
- configuration files
- declare non-secret desired behavior and non-secret deployable settings
- parsing logic
- load, merge, validate, normalize, and expose typed config
- never invent hidden business behavior
- environment variables
- carry secrets and a small set of runtime/process values
- do not redefine business behavior casually
4.3 One precedence rule for the whole system
Every config category should follow the same merge model unless explicitly exempted.
4.4 No silent implicit fallback for business behavior
Fail fast at startup when required config is missing or invalid.
Do not silently fall back to legacy behavior such as hardcoded language lists.
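A minimal sketch of the fail-fast rule, assuming tenant config reaches the check as a plain dict. ConfigError and require_languages are hypothetical names used for illustration, not existing code in the repository.

```python
class ConfigError(ValueError):
    """Raised at startup when required configuration is missing or invalid."""


def require_languages(tenant_cfg: dict, key: str = "index_languages") -> list[str]:
    """Return the configured language list, or fail instead of guessing one."""
    value = tenant_cfg.get(key)
    if not isinstance(value, list) or not value:
        raise ConfigError(f"tenant config key {key!r} must be a non-empty list")
    if not all(isinstance(item, str) and item for item in value):
        raise ConfigError(f"tenant config key {key!r} must contain only language codes")
    return list(value)
```

Called once during startup, a check like this turns a missing tenant language list into a visible startup error rather than a silent ["en", "zh"] default at query time.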
4.5 Effective configuration must be observable
Every service should be able to show:
- config version or hash
- source files loaded
- environment name
- sanitized effective configuration
5. Recommended Target Design
5.1 Boundary model
Use three clear layers.
Layer 1: repository-managed static config
Purpose:
- search behavior
- tenant behavior
- provider/backend registry
- non-secret service topology defaults
- feature switches
Examples:
- field boosts
- query strategy
- rerank fusion parameters
- tenant language plans
- translation capability registry
- embedding backend selection default
Layer 2: environment-specific overlays
Purpose:
- per-environment non-secret differences
- service endpoints by environment
- resource sizing defaults by environment
- dev/test/prod operational differences
Examples:
- local embedding URL vs production URL
- dev rerank backend vs prod rerank backend
- lower concurrency in local development
Layer 3: environment variables
Purpose:
- secrets
- bind host/port
- external infrastructure credentials
- container-orchestrator last-mile injection
Examples:
- ES_HOST, ES_USERNAME, ES_PASSWORD
- DB_HOST, DB_USERNAME, DB_PASSWORD
- REDIS_HOST, REDIS_PASSWORD
- DASHSCOPE_API_KEY, DEEPL_AUTH_KEY
- API_HOST, API_PORT, INDEXER_PORT, TRANSLATION_PORT
Rule:
- environment variables should not be the normal path for choosing business behavior such as translation model, embedding backend, or tenant language policy
- if an env override is allowed for a non-secret field, it must be explicitly listed and documented as an operational override, not a hidden convention
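One possible shape for the "explicitly listed" rule is an allowlist that maps each permitted environment variable to a path in the config tree. This is a sketch under assumptions: ALLOWED_ENV_OVERRIDES, apply_env_overrides, and the runtime.* paths are hypothetical, and values are left as strings on the assumption that schema validation coerces types later.

```python
from typing import Mapping

# Hypothetical allowlist: env var name -> dotted path in the config tree.
ALLOWED_ENV_OVERRIDES = {
    "REDIS_PASSWORD": "infrastructure.redis.password",
    "API_HOST": "runtime.api_host",
    "API_PORT": "runtime.api_port",
}


def apply_env_overrides(config: dict, environ: Mapping[str, str]) -> list[str]:
    """Write allowlisted env values into config in place; return the names applied."""
    applied = []
    for name, path in ALLOWED_ENV_OVERRIDES.items():
        if name not in environ:
            continue
        *parents, leaf = path.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate levels as needed
        node[leaf] = environ[name]
        applied.append(name)
    return applied
```

Any variable that is not in the allowlist is ignored, so an env var cannot quietly change a business setting such as the embedding backend. The returned list can be logged at startup to show which overrides were active.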
5.2 Unified precedence
Recommended precedence, from lowest to highest (later sources override earlier ones):
- schema defaults in code
- config/base.yaml
- config/environments/<env>.yaml
- tenant overlay from config/tenants/
- environment variables for the explicitly allowed runtime keys
- CLI flags for the current process only
Important rule:
- only one module may implement this merge logic
- no business module may call os.getenv() directly for configuration
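The single merge module could be as small as a recursive dictionary merge. The sketch below assumes the precedence list is ordered from lowest to highest and that every source has already been parsed into a plain dict; deep_merge and merge_layers are illustrative names.

```python
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced, not combined
    return merged


def merge_layers(*layers: dict) -> dict:
    """Merge layers given from lowest to highest precedence."""
    return reduce(deep_merge, layers, {})
```

A call such as merge_layers(schema_defaults, base, env_overlay, tenant_overlay, env_overrides, cli_flags) then encodes the whole precedence rule in one argument order. Replacing lists wholesale rather than concatenating them is a design choice made here for predictability; the team may prefer a different list policy.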
5.3 Recommended directory structure
config/
  schema.py
  loader.py
  sources.py
  base.yaml
  environments/
    dev.yaml
    test.yaml
    prod.yaml
  tenants/
    _default.yaml
    1.yaml
    162.yaml
    170.yaml
  dictionaries/
    query_rewrite.dict
  README.md
.env.example
Notes:
- base.yaml contains shared defaults and feature behavior
- environments/*.yaml contains environment-specific non-secret overrides
- tenants/*.yaml contains tenant-specific overrides only
- dictionaries/ stores first-class config assets such as rewrite dictionaries
- schema.py defines the typed config model
- loader.py is the only entry point that loads and merges config
If the team prefers fewer files, tenants.yaml is also acceptable. The key requirement is not “one file”, but “one loading model with clear ownership”.
5.4 Typed configuration model
Introduce one root object, for example:
class AppConfig(BaseModel):
    runtime: RuntimeConfig
    infrastructure: InfrastructureConfig
    search: SearchConfig
    services: ServicesConfig
    tenants: TenantCatalogConfig
    assets: ConfigAssets
Suggested subtrees:
- runtime
  - environment name
  - config revision/hash
  - bind addresses/ports
- infrastructure
  - ES
  - DB
  - Redis
  - index namespace
- search
  - field boosts
  - query config
  - function score
  - rerank behavior
  - spu config
- services
  - translation
  - embedding
  - rerank
- tenants
  - default tenant config
  - tenant overrides
- assets
  - rewrite dictionary path
Benefits:
- one validated object shared by backend, indexer, translator, embedding, reranker
- one place for defaults
- one place for schema evolution
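To illustrate what one validated, immutable subtree might look like, here is a dependency-free sketch. It uses the standard-library dataclasses module instead of the BaseModel style of the root object so that it runs as-is; RerankServiceConfig and build are hypothetical names, and the two fields mirror the rerank keys proposed elsewhere in this document.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RerankServiceConfig:
    """Typed, read-only view of the rerank service settings."""
    endpoint: str
    backend: str


def build(cls, raw: dict):
    """Construct a frozen config dataclass, rejecting keys the schema does not know."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"unknown keys for {cls.__name__}: {unknown}")
    return cls(**raw)  # a missing required key raises TypeError here
```

Rejecting unknown keys is what makes legacy or misspelled settings surface at startup instead of being ignored. With frozen=True, any later assignment to a field raises dataclasses.FrozenInstanceError, which keeps business code from patching config at runtime.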
5.5 Loading flow
Recommended loading flow:
- determine APP_ENV or RUNTIME_ENV
- load schema defaults
- load config/base.yaml
- load config/environments/<env>.yaml if present
- load tenant files
- inject first-class assets such as rewrite dictionary
- apply allowed env overrides
- validate the final AppConfig
- freeze and cache the config object
- expose a sanitized effective-config view
Important:
- every process should call the same loader
- services should receive a resolved AppConfig, not re-open YAML independently
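The first steps of the flow (pick the environment, then locate sources in a fixed order) can be sketched without any YAML dependency. The assumptions here are labeled: resolve_sources is a hypothetical helper, APP_ENV is checked before RUNTIME_ENV, a missing environment name is treated as a startup error, and tenant files are ordered by filename.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_sources(
    config_dir: Path, environ: Mapping[str, str] = os.environ
) -> tuple[str, list[Path]]:
    """Return the environment name and the config files to load, in merge order."""
    env_name = environ.get("APP_ENV") or environ.get("RUNTIME_ENV")
    if not env_name:
        raise RuntimeError("APP_ENV or RUNTIME_ENV must be set")
    base = config_dir / "base.yaml"
    if not base.is_file():
        raise FileNotFoundError(base)
    sources = [base]
    overlay = config_dir / "environments" / f"{env_name}.yaml"
    if overlay.is_file():  # the environment overlay is optional
        sources.append(overlay)
    sources.extend(sorted((config_dir / "tenants").glob("*.yaml")))
    return env_name, sources
```

Returning the source list explicitly means the same value can feed the merge step, the startup log, and the metadata that the observability section asks for.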
5.6 Clear responsibility split
Configuration files are responsible for
- what the system should do
- what providers/backends are available
- which features are enabled
- tenant language/index policies
- non-secret service topology
Parser/loader code is responsible for
- locating sources
- merge precedence
- type validation
- normalization
- deprecation warnings
- producing the final immutable config object
Environment variables are responsible for
- secrets
- bind addresses/ports
- infrastructure endpoints when the deployment platform injects them
- a very small set of documented operational overrides
Business code is not responsible for
- inventing defaults for missing config
- loading YAML directly
- calling os.getenv() for normal application behavior
5.7 How to handle service config
Unify all service-facing config under one structure:
services:
  translation:
    endpoint: "http://translator:6006"
    timeout_sec: 10
    default_model: "llm"
    default_scene: "general"
    capabilities: ...
  embedding:
    endpoint:
      text: "http://embedding:6005"
      image: "http://embedding-image:6008"
    backend: "tei"
    backends: ...
  rerank:
    endpoint: "http://reranker:6007/rerank"
    backend: "qwen3_vllm"
    backends: ...
Rules:
- endpoint is how callers reach the service
- backend is how the service itself is implemented
- only the service process cares about backend
- only callers care about endpoint
- both still belong to the same config tree, because they are part of one system
5.8 How to handle tenant config
Tenant config should become explicit policy, not translation-era leftovers.
Recommended tenant fields:
- primary_language
- index_languages
- search_languages
- translation_policy
- facet_policy
- optional tenant-specific ranking overrides
Avoid keeping translate_to_en and translate_to_zh as active concepts in the long-term model.
If compatibility is needed, support them only in the loader as deprecated aliases and emit warnings.
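A loader-side compatibility shim might look like the sketch below. The mapping from the legacy flags to index_languages is an assumption made purely for illustration (the real target could just as well be translation_policy), and normalize_tenant and LEGACY_FLAGS are hypothetical names.

```python
import warnings

# Assumed mapping from each legacy flag to the language it implies.
LEGACY_FLAGS = {"translate_to_en": "en", "translate_to_zh": "zh"}


def normalize_tenant(raw: dict) -> dict:
    """Fold deprecated translation flags into index_languages and warn about them."""
    tenant = {k: v for k, v in raw.items() if k not in LEGACY_FLAGS}
    languages = list(tenant.get("index_languages", []))
    for flag, lang in LEGACY_FLAGS.items():
        if flag not in raw:
            continue
        warnings.warn(
            f"tenant key {flag!r} is deprecated; use index_languages",
            DeprecationWarning,
            stacklevel=2,
        )
        if raw[flag] and lang not in languages:
            languages.append(lang)
    tenant["index_languages"] = languages
    return tenant
```

Because the legacy keys are removed from the returned dict, business code downstream of the loader never sees them, which is what confines the old model to one place.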
5.9 How to handle rewrite rules and similar assets
Treat them as declared config assets.
Recommended rules:
- file path declared in config
- one canonical location under config/dictionaries/
- loader validates presence and format
- admin runtime updates either:
  - are removed, or
  - write back through a controlled persistence path
Do not keep a hybrid model where startup loads one file and admin mutates only in memory.
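Validating presence and format at load time could follow the pattern below. The on-disk format of the rewrite dictionary is not specified in this document, so the sketch assumes one rule per line in the form source, TAB, target, with # for comments; load_rewrite_rules is a hypothetical name.

```python
from pathlib import Path


def load_rewrite_rules(path: Path) -> dict[str, str]:
    """Load rewrite rules, failing loudly on a missing file or malformed line."""
    if not path.is_file():
        raise FileNotFoundError(f"rewrite dictionary not found: {path}")
    rules = {}
    lines = path.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 2 or not all(part.strip() for part in parts):
            raise ValueError(f"{path}:{lineno}: expected 'source<TAB>target'")
        rules[parts[0].strip()] = parts[1].strip()
    return rules
```

With a check of this kind, a mismatch between the declared path and the repository content fails at startup with the offending path in the message.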
5.10 Observability improvements
Add the following:
- config dump CLI that prints sanitized effective config
- startup log with config hash, environment, and config file list
- /admin/config/effective endpoint returning sanitized effective config
- /admin/config/meta endpoint returning:
  - environment
  - config hash
  - loaded source files
  - deprecated keys in use
This is important for operations and for multi-service debugging.
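Both the sanitized view and the config hash can be derived from the resolved config tree. The sketch below assumes the tree is a JSON-serializable dict with string keys; SECRET_MARKERS is an assumed heuristic, and a real implementation would more likely mark secret fields in the schema.

```python
import hashlib
import json

# Assumed markers for secret-bearing keys; a real list would come from the schema.
SECRET_MARKERS = ("password", "secret", "api_key", "auth_key", "token")


def sanitize(node):
    """Return a copy of the config tree with secret-looking values masked."""
    if isinstance(node, dict):
        return {
            key: "***" if any(m in key.lower() for m in SECRET_MARKERS) else sanitize(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [sanitize(item) for item in node]
    return node


def config_hash(config: dict) -> str:
    """Stable short hash of a config tree (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Hashing the sanitized tree rather than the raw one keeps the hash stable across secret rotation, at the cost of not detecting a changed credential. Which of the two the team wants is a policy decision.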
6. Practical Refactor Plan
The refactor should be incremental.
Phase 1: establish the new config core without changing behavior
- create config/schema.py
- create config/loader.py
- move all current defaults into schema models
- make loader read current config/config.yaml
- make loader read .env only for approved keys
- expose one get_app_config()
Result:
- same behavior, but one typed root config becomes available
Phase 2: remove duplicate readers
- make services_config.py a thin adapter over get_app_config()
- make tenant_config_loader.py read from get_app_config()
- stop reparsing YAML in services_config.py
- stop service modules from depending on legacy local config modules for behavior
Result:
- one parsing path
- fewer divergence risks
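A thin adapter in this sense keeps the old call shape but delegates to the cached root config. In the sketch below, get_app_config is a stand-in that returns a hard-coded tree (the real loader would read and merge files), and get_rerank_service_config is a hypothetical accessor name, not a function known to exist in services_config.py.

```python
from functools import lru_cache
from types import MappingProxyType


@lru_cache(maxsize=1)
def get_app_config() -> MappingProxyType:
    """Stand-in for the unified loader: build once, then return the cached view."""
    resolved = {
        "services": {
            "rerank": {"endpoint": "http://reranker:6007/rerank", "backend": "qwen3_vllm"},
        }
    }
    return MappingProxyType(resolved)  # read-only at the top level only


def get_rerank_service_config() -> dict:
    """Hypothetical adapter: same call shape as before, no YAML parsing of its own."""
    return dict(get_app_config()["services"]["rerank"])
```

Existing callers keep working during the migration, while the file is parsed exactly once per process.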
Phase 3: move hidden defaults out of business logic
- remove hardcoded fallback language lists from query/indexer modules
- require tenant defaults to come from config schema only
- remove duplicate behavior defaults from service code
Result:
- behavior becomes visible and reviewable
Phase 4: clean service startup configuration
- make startup scripts ask the unified loader for resolved values
- keep only bind host/port and secret injection in shell env
- retire or reduce embeddings/config.py and reranker/config.py
Result:
- startup behavior matches runtime config model
Phase 5: split config files by responsibility
- keep a single root loader
- split current giant config.yaml into:
  - base.yaml
  - environments/<env>.yaml
  - tenants/*.yaml
  - dictionaries/query_rewrite.dict
Result:
- config remains unified logically, but is easier to read and maintain physically
Phase 6: deprecate legacy compatibility
- deprecate translate_to_en and translate_to_zh
- deprecate env-based backend/provider selection except for explicitly approved keys
- remove old code paths after one or two release cycles
Result:
- the system becomes simpler instead of carrying two generations forever
7. Concrete Rules To Adopt
These rules should be documented and enforced in code review.
Rule 1
Only config/loader.py may load config files or .env.
Rule 2
Only config/loader.py may read os.getenv() for application config.
Rule 3
Business modules receive typed config objects and do not read files or env directly.
Rule 4
Each config key has one owner.
Examples:
- search.query.knn_boost belongs to search behavior config
- services.embedding.backend belongs to service implementation config
- infrastructure.redis.password belongs to env/secrets
Rule 5
Every fallback must be either:
- declared in schema defaults, or
- rejected at startup
No hidden fallback in runtime logic.
Rule 6
Every configuration asset must be visible in one of these places only:
- config file
- env var
- generated runtime metadata
Not inside parser code as an implicit constant.
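Rules 2 and 3 lend themselves to an automated check alongside code review. The following sketch uses the standard ast module to flag direct os.getenv and os.environ access in a source string; find_env_reads is a hypothetical helper, and it deliberately ignores forms such as from os import getenv, which a fuller check would also need to cover.

```python
import ast


def find_env_reads(source: str) -> list[int]:
    """Return line numbers where code accesses os.getenv or os.environ."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "os"
            and node.attr in {"getenv", "environ"}
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

Run in CI over every module except config/loader.py, a non-empty result would fail the build and point reviewers at the exact lines.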
8. Recommended Naming Conventions
Suggested conventions:
- config keys use noun-based hierarchical names
- avoid mixing transport and implementation concepts in one field
- use endpoint for caller-facing addresses
- use backend for service-internal implementation choice
- use enabled only for true feature toggles
- use default_* only when a real selection happens at runtime
Examples:
- good: services.rerank.endpoint
- good: services.rerank.backend
- good: tenants.default.index_languages
- avoid: service_url, base_url, provider, backend, and script env all meaning slightly different things without a common model
9. Highest-Priority Cleanup Items
If the team wants the shortest path to improvement, start here:
- build one root AppConfig
- make services_config.py stop reparsing YAML
- declare rewrite dictionary path explicitly and fix the current mismatch
- remove hardcoded ["en", "zh"] fallbacks from query/indexer logic
- replace /admin/config with an effective-config endpoint
- retire embeddings/config.py and reranker/config.py as behavior sources
- deprecate legacy tenant translation flags
10. Expected Outcome
After the redesign:
- developers can answer “where does this setting come from?” in one step
- operators can see effective config without reading source code
- backend, indexer, translator, embedding, and reranker all share one model
- tenant behavior is explicit instead of partially implicit
- migration becomes safer because defaults and precedence are centralized
- adding a new provider/backend becomes configuration extension, not configuration archaeology
11. Summary
The current system has the right intent but not yet the right implementation shape.
Today the main problems are:
- duplicate config loaders
- inconsistent precedence
- duplicated defaults
- config hidden in runtime logic
- weak effective-config visibility
- leftover legacy concepts
The recommended direction is:
- one root typed config
- one loader pipeline
- explicit layered sources
- narrow env responsibility
- no hidden business fallbacks
- observable effective config
That design is practical to implement incrementally in this repository and aligns well with the project's multi-tenant, multi-service, provider/backend-based architecture.