# Configuration System Review And Redesign

## 1. Goal

This document reviews the current configuration system and proposes a practical redesign for long-term maintainability.

The target is a configuration system that is:

- unified in loading and ownership
- clear in boundaries and precedence
- visible in effective behavior
- easy to evolve across development, deployment, and operations

This review is based on the current implementation, not only on the intended architecture in docs.

## 2. Project Context

The repo already defines the right architectural direction:

- `config/config.yaml` should be the main configuration source for search behavior and service wiring
- `.env` should mainly carry deployment-specific values and secrets
- provider/backend expansion should stay centralized instead of spreading through business code

That direction is described in:

- [`README.md`](/data/saas-search/README.md)
- [`docs/DEVELOPER_GUIDE.md`](/data/saas-search/docs/DEVELOPER_GUIDE.md)
- [`docs/QUICKSTART.md`](/data/saas-search/docs/QUICKSTART.md)
- [`translation/README.md`](/data/saas-search/translation/README.md)

The problem is not the architectural intent. The problem is that the current implementation only partially follows it.

## 3. Current-State Review

### 3.1 What exists today

The current system effectively has several configuration channels:

- `config/config.yaml`
  - search behavior
  - rerank behavior
  - services registry
  - tenant config
- `config/config_loader.py`
  - parses search behavior and tenant config into `SearchConfig`
  - also injects some defaults from code
- `config/services_config.py`
  - reparses `config/config.yaml` again, independently
  - resolves translation, embedding, rerank service config
  - also applies env overrides
- `config/env_config.py`
  - loads `.env`
  - defines ES, Redis, DB, host/port, service URLs, namespace, model path defaults
- service-local config modules
  - [`embeddings/config.py`](/data/saas-search/embeddings/config.py)
  - [`reranker/config.py`](/data/saas-search/reranker/config.py)
- startup scripts
  - derive defaults from shell env, Python config, and YAML in different combinations
- inline fallbacks in business logic
  - query parsing
  - indexing
  - service startup

### 3.2 Main findings

#### Finding A: there is no single loader for the full effective configuration

`ConfigLoader` and `services_config` both parse `config/config.yaml`, but they do so separately and with different responsibilities.
- [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L148)
- [`config/services_config.py`](/data/saas-search/config/services_config.py#L33)

Impact:

- the same file is loaded twice through different code paths
- search config and services config can drift in interpretation
- alternative config paths are hard to support cleanly
- tests and tools cannot ask one place for the full effective config tree

#### Finding B: precedence is not explicit, stable, or globally enforced

Current precedence differs by subsystem:

- search behavior mostly comes from YAML plus code defaults
- embedding and rerank allow env overrides for provider/backend/url
- translation intentionally blocks some env overrides
- startup scripts still choose host/port and mode via env
- some values are reconstructed from other env vars

Examples:

- env override for embedding provider/url/backend:
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L52)
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L68)
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L139)
- host/port and service URL reconstruction:
  - [`config/env_config.py`](/data/saas-search/config/env_config.py#L55)
  - [`config/env_config.py`](/data/saas-search/config/env_config.py#L75)
- translator host/port still driven by startup env:
  - [`scripts/start_translator.sh`](/data/saas-search/scripts/start_translator.sh#L28)

Impact:

- operators cannot reliably predict the effective configuration by reading one file
- the same setting category behaves differently across services
- incidents become harder to debug because source-of-truth depends on the code path

#### Finding C: defaults are duplicated across YAML and code

There are several layers of default values:

- dataclass defaults in `QueryConfig`
- fallback defaults in `ConfigLoader._parse_config`
- defaults in `config.yaml`
- defaults in `env_config.py`
- defaults in `embeddings/config.py`
- defaults in `reranker/config.py`
- defaults in startup scripts

Examples:

- query defaults duplicated in dataclass and parser:
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L24)
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L240)
- embedding defaults duplicated in YAML, `services_config`, `embeddings/config.py`, and startup script:
  - [`config/config.yaml`](/data/saas-search/config/config.yaml#L196)
  - [`embeddings/config.py`](/data/saas-search/embeddings/config.py#L14)
  - [`scripts/start_embedding_service.sh`](/data/saas-search/scripts/start_embedding_service.sh#L29)
- reranker defaults duplicated in YAML and `reranker/config.py`:
  - [`config/config.yaml`](/data/saas-search/config/config.yaml#L214)
  - [`reranker/config.py`](/data/saas-search/reranker/config.py#L6)

Impact:

- changing a default is risky because there may be multiple hidden copies
- code review cannot easily tell whether a value is authoritative or dead legacy
- “same config” may behave differently across processes

#### Finding D: config is still embedded in runtime logic

Some important behavior remains encoded as inline fallback logic rather than declared config.

Examples:

- query-time translation target languages fall back to `["en", "zh"]`:
  - [`query/query_parser.py`](/data/saas-search/query/query_parser.py#L339)
- indexer text handling and LLM enrichment also fall back to `["en", "zh"]`:
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L216)
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L310)
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L649)

Impact:

- configuration is not fully visible in config files
- behavior can silently change when tenant config is missing or malformed
- “default behavior” is spread across business modules

#### Finding E: some configuration assets are not managed as first-class config

Query rewrite is configured through an external file, but the file path is hardcoded and currently inconsistent with the repository content.

- loader expects:
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L162)
- repo currently contains:
  - [`config/query_rewrite.dict`](/data/saas-search/config/query_rewrite.dict)

There is also an admin API that mutates rewrite rules in memory only:

- [`api/routes/admin.py`](/data/saas-search/api/routes/admin.py#L68)
- [`query/query_parser.py`](/data/saas-search/query/query_parser.py#L622)

Impact:

- rewrite rules are neither cleanly file-backed nor fully runtime-managed
- restart behavior is unclear
- configuration visibility and persistence are weak

#### Finding F: visibility is limited

The system exposes only a small sanitized subset at `/admin/config`.
- [`api/routes/admin.py`](/data/saas-search/api/routes/admin.py#L42)

At the same time, the true effective config includes:

- tenant overlays
- env overrides
- service backend selections
- script-selected modes
- hidden defaults in code

Impact:

- there is no authoritative “effective config” view
- debugging configuration mismatches requires source reading
- operators cannot easily verify what each process actually started with

#### Finding G: the indexer does not really consume the unified config as a first-class dependency

Indexer startup explicitly says config is loaded only for parity/logging and routes do not depend on it.

- [`api/indexer_app.py`](/data/saas-search/api/indexer_app.py#L76)

Impact:

- configuration is not truly system-wide
- search-side and indexer-side behavior can drift
- the current “unified config” is only partially unified

#### Finding H: docs still carry legacy and mixed mental models

Most high-level docs describe the desired centralized model, but some implementation code and docs still expose legacy concepts such as `translate_to_en` and `translate_to_zh`.

- desired model:
  - [`README.md`](/data/saas-search/README.md#L78)
  - [`docs/DEVELOPER_GUIDE.md`](/data/saas-search/docs/DEVELOPER_GUIDE.md#L207)
  - [`translation/README.md`](/data/saas-search/translation/README.md#L161)
- legacy tenant translation flags still documented:
  - [`indexer/README.md`](/data/saas-search/indexer/README.md#L39)

Impact:

- new developers may follow old mental models
- cleanup work keeps getting deferred because the old and new systems both appear “supported”

## 4. Design Principles For The Redesign

The redesign should follow these rules.

### 4.1 One logical configuration system

It is acceptable to have multiple files, but not multiple loaders with overlapping ownership.

There must be one loader pipeline that produces one typed `AppConfig`.
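As a rough illustration of "one loader pipeline, one typed root object", the sketch below merges layered sources in a fixed order and validates the result once. It is a minimal sketch under stated assumptions, not code from the repository: the layer dictionaries stand in for already-parsed YAML files, and `deep_merge`, `build_app_config`, the three-field `AppConfig`, the `API_PORT` allow-list entry, and the port numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict:
    """Return a new dict where `override` wins; nested mappings are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical, heavily reduced root object; a real schema would be much larger."""
    environment: str
    embedding_backend: str
    api_port: int


# Hypothetical allow-list: env var name -> (section, key) it is permitted to override.
ALLOWED_ENV_OVERRIDES = {"API_PORT": ("runtime", "api_port")}


def build_app_config(layers: list[Mapping[str, Any]], env: Mapping[str, str]) -> AppConfig:
    """Merge layers in order (later wins), apply allow-listed env keys, then validate."""
    merged: dict = {}
    for layer in layers:
        merged = deep_merge(merged, layer)
    for name, (section, key) in ALLOWED_ENV_OVERRIDES.items():
        if name in env:
            merged = deep_merge(merged, {section: {key: env[name]}})
    try:
        return AppConfig(
            environment=merged["runtime"]["environment"],
            embedding_backend=merged["services"]["embedding"]["backend"],
            api_port=int(merged["runtime"]["api_port"]),
        )
    except (KeyError, ValueError) as exc:
        # Fail at startup rather than inventing a fallback value.
        raise ValueError(f"invalid or missing config value: {exc!r}") from exc


# Illustrative layers; the values are made up for the example.
base = {
    "runtime": {"environment": "dev", "api_port": 6002},
    "services": {"embedding": {"backend": "tei"}},
}
prod = {"runtime": {"environment": "prod"}}

# EMBEDDING_BACKEND is not on the allow-list, so it is ignored.
config = build_app_config([base, prod], env={"API_PORT": "8080", "EMBEDDING_BACKEND": "other"})
print(config)
```

The design choice worth noting is that every source, including environment variables, passes through the same merge function, so there is exactly one place where precedence is decided and one place where a missing value is rejected.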
### 4.2 Configuration files declare, parser code interprets, env provides runtime injection

Responsibilities should be:

- configuration files
  - declare non-secret desired behavior and non-secret deployable settings
- parsing logic
  - load, merge, validate, normalize, and expose typed config
  - never invent hidden business behavior
- environment variables
  - carry secrets and a small set of runtime/process values
  - do not redefine business behavior casually

### 4.3 One precedence rule for the whole system

Every config category should follow the same merge model unless explicitly exempted.

### 4.4 No silent implicit fallback for business behavior

Fail fast at startup when required config is missing or invalid.

Do not silently fall back to legacy behavior such as hardcoded language lists.

### 4.5 Effective configuration must be observable

Every service should be able to show:

- config version or hash
- source files loaded
- environment name
- sanitized effective configuration

## 5. Recommended Target Design

## 5.1 Boundary model

Use three clear layers.
### Layer 1: repository-managed static config

Purpose:

- search behavior
- tenant behavior
- provider/backend registry
- non-secret service topology defaults
- feature switches

Examples:

- field boosts
- query strategy
- rerank fusion parameters
- tenant language plans
- translation capability registry
- embedding backend selection default

### Layer 2: environment-specific overlays

Purpose:

- per-environment non-secret differences
- service endpoints by environment
- resource sizing defaults by environment
- dev/test/prod operational differences

Examples:

- local embedding URL vs production URL
- dev rerank backend vs prod rerank backend
- lower concurrency in local development

### Layer 3: environment variables

Purpose:

- secrets
- bind host/port
- external infrastructure credentials
- container-orchestrator last-mile injection

Examples:

- `ES_HOST`, `ES_USERNAME`, `ES_PASSWORD`
- `DB_HOST`, `DB_USERNAME`, `DB_PASSWORD`
- `REDIS_HOST`, `REDIS_PASSWORD`
- `DASHSCOPE_API_KEY`, `DEEPL_AUTH_KEY`
- `API_HOST`, `API_PORT`, `INDEXER_PORT`, `TRANSLATION_PORT`

Rule:

- environment variables should not be the normal path for choosing business behavior such as translation model, embedding backend, or tenant language policy
- if an env override is allowed for a non-secret field, it must be explicitly listed and documented as an operational override, not a hidden convention

## 5.2 Unified precedence

Recommended precedence, from lowest to highest (later sources override earlier ones):

1. schema defaults in code
2. `config/base.yaml`
3. `config/environments/.yaml`
4. tenant overlay from `config/tenants/`
5. environment variables for the explicitly allowed runtime keys
6. CLI flags for the current process only

Important rule:

- only one module may implement this merge logic
- no business module may call `os.getenv()` directly for configuration

## 5.3 Recommended directory structure

```text
config/
  schema.py
  loader.py
  sources.py
  base.yaml
  environments/
    dev.yaml
    test.yaml
    prod.yaml
  tenants/
    _default.yaml
    1.yaml
    162.yaml
    170.yaml
  dictionaries/
    query_rewrite.dict
  README.md
.env.example
```

Notes:

- `base.yaml` contains shared defaults and feature behavior
- `environments/*.yaml` contains environment-specific non-secret overrides
- `tenants/*.yaml` contains tenant-specific overrides only
- `dictionaries/` stores first-class config assets such as rewrite dictionaries
- `schema.py` defines the typed config model
- `loader.py` is the only entry point that loads and merges config

If the team prefers fewer files, `tenants.yaml` is also acceptable. The key requirement is not “one file”, but “one loading model with clear ownership”.

## 5.4 Typed configuration model

Introduce one root object, for example:

```python
# Assumes pydantic for BaseModel; any typed-model base would serve the same role.
from pydantic import BaseModel


# The referenced sub-models would be defined alongside this class in config/schema.py.
class AppConfig(BaseModel):
    runtime: RuntimeConfig
    infrastructure: InfrastructureConfig
    search: SearchConfig
    services: ServicesConfig
    tenants: TenantCatalogConfig
    assets: ConfigAssets
```

Suggested subtrees:

- `runtime`
  - environment name
  - config revision/hash
  - bind addresses/ports
- `infrastructure`
  - ES
  - DB
  - Redis
  - index namespace
- `search`
  - field boosts
  - query config
  - function score
  - rerank behavior
  - spu config
- `services`
  - translation
  - embedding
  - rerank
- `tenants`
  - default tenant config
  - tenant overrides
- `assets`
  - rewrite dictionary path

Benefits:

- one validated object shared by backend, indexer, translator, embedding, reranker
- one place for defaults
- one place for schema evolution

## 5.5 Loading flow

Recommended loading flow:

1. determine `APP_ENV` or `RUNTIME_ENV`
2. load schema defaults
3. load `config/base.yaml`
4. load `config/environments/.yaml` if present
5. load tenant files
6. inject first-class assets such as rewrite dictionary
7. apply allowed env overrides
8. validate the final `AppConfig`
9. freeze and cache the config object
10. expose a sanitized effective-config view

Important:

- every process should call the same loader
- services should receive a resolved `AppConfig`, not re-open YAML independently

## 5.6 Clear responsibility split

### Configuration files are responsible for

- what the system should do
- what providers/backends are available
- which features are enabled
- tenant language/index policies
- non-secret service topology

### Parser/loader code is responsible for

- locating sources
- merge precedence
- type validation
- normalization
- deprecation warnings
- producing the final immutable config object

### Environment variables are responsible for

- secrets
- bind addresses/ports
- infrastructure endpoints when the deployment platform injects them
- a very small set of documented operational overrides

### Business code is not responsible for

- inventing defaults for missing config
- loading YAML directly
- calling `os.getenv()` for normal application behavior

## 5.7 How to handle service config

Unify all service-facing config under one structure:

```yaml
services:
  translation:
    endpoint: "http://translator:6006"
    timeout_sec: 10
    default_model: "llm"
    default_scene: "general"
    capabilities: ...
  embedding:
    endpoint:
      text: "http://embedding:6005"
      image: "http://embedding-image:6008"
    backend: "tei"
    backends: ...
  rerank:
    endpoint: "http://reranker:6007/rerank"
    backend: "qwen3_vllm"
    backends: ...
```

Rules:

- `endpoint` is how callers reach the service
- `backend` is how the service itself is implemented
- only the service process cares about `backend`
- only callers care about `endpoint`
- both still belong to the same config tree, because they are part of one system

## 5.8 How to handle tenant config

Tenant config should become explicit policy, not translation-era leftovers.

Recommended tenant fields:

- `primary_language`
- `index_languages`
- `search_languages`
- `translation_policy`
- `facet_policy`
- optional tenant-specific ranking overrides

Avoid keeping `translate_to_en` and `translate_to_zh` as active concepts in the long-term model.

If compatibility is needed, support them only in the loader as deprecated aliases and emit warnings.

## 5.9 How to handle rewrite rules and similar assets

Treat them as declared config assets.

Recommended rules:

- file path declared in config
- one canonical location under `config/dictionaries/`
- loader validates presence and format
- admin runtime updates either:
  - are removed, or
  - write back through a controlled persistence path

Do not keep a hybrid model where startup loads one file and admin mutates only in memory.

## 5.10 Observability improvements

Add the following:

- `config dump` CLI that prints sanitized effective config
- startup log with config hash, environment, and config file list
- `/admin/config/effective` endpoint returning sanitized effective config
- `/admin/config/meta` endpoint returning:
  - environment
  - config hash
  - loaded source files
  - deprecated keys in use

This is important for operations and for multi-service debugging.

## 6. Practical Refactor Plan

The refactor should be incremental.
### Phase 1: establish the new config core without changing behavior

- create `config/schema.py`
- create `config/loader.py`
- move all current defaults into schema models
- make loader read current `config/config.yaml`
- make loader read `.env` only for approved keys
- expose one `get_app_config()`

Result:

- same behavior, but one typed root config becomes available

### Phase 2: remove duplicate readers

- make `services_config.py` a thin adapter over `get_app_config()`
- make `tenant_config_loader.py` read from `get_app_config()`
- stop reparsing YAML in `services_config.py`
- stop service modules from depending on legacy local config modules for behavior

Result:

- one parsing path
- fewer divergence risks

### Phase 3: move hidden defaults out of business logic

- remove hardcoded fallback language lists from query/indexer modules
- require tenant defaults to come from config schema only
- remove duplicate behavior defaults from service code

Result:

- behavior becomes visible and reviewable

### Phase 4: clean service startup configuration

- make startup scripts ask the unified loader for resolved values
- keep only bind host/port and secret injection in shell env
- retire or reduce `embeddings/config.py` and `reranker/config.py`

Result:

- startup behavior matches runtime config model

### Phase 5: split config files by responsibility

- keep a single root loader
- split current giant `config.yaml` into:
  - `base.yaml`
  - `environments/.yaml`
  - `tenants/*.yaml`
  - `dictionaries/query_rewrite.dict`

Result:

- config remains unified logically, but is easier to read and maintain physically

### Phase 6: deprecate legacy compatibility

- deprecate `translate_to_en` and `translate_to_zh`
- deprecate env-based backend/provider selection except for explicitly approved keys
- remove old code paths after one or two release cycles

Result:

- the system becomes simpler instead of carrying two generations forever

## 7. Concrete Rules To Adopt

These rules should be documented and enforced in code review.

### Rule 1

Only `config/loader.py` may load config files or `.env`.

### Rule 2

Only `config/loader.py` may read `os.getenv()` for application config.

### Rule 3

Business modules receive typed config objects and do not read files or env directly.

### Rule 4

Each config key has one owner.

Examples:

- `search.query.knn_boost` belongs to search behavior config
- `services.embedding.backend` belongs to service implementation config
- `infrastructure.redis.password` belongs to env/secrets

### Rule 5

Every fallback must be either:

- declared in schema defaults, or
- rejected at startup

No hidden fallback in runtime logic.

### Rule 6

Every configuration asset must be visible in one of these places only:

- config file
- env var
- generated runtime metadata

Not inside parser code as an implicit constant.

## 8. Recommended Naming Conventions

Suggested conventions:

- config keys use noun-based hierarchical names
- avoid mixing transport and implementation concepts in one field
- use `endpoint` for caller-facing addresses
- use `backend` for service-internal implementation choice
- use `enabled` only for true feature toggles
- use `default_*` only when a real selection happens at runtime

Examples:

- good: `services.rerank.endpoint`
- good: `services.rerank.backend`
- good: `tenants.default.index_languages`
- avoid: `service_url`, `base_url`, `provider`, `backend`, and script env all meaning slightly different things without a common model

## 9. Highest-Priority Cleanup Items

If the team wants the shortest path to improvement, start here:

1. build one root `AppConfig`
2. make `services_config.py` stop reparsing YAML
3. declare rewrite dictionary path explicitly and fix the current mismatch
4. remove hardcoded `["en", "zh"]` fallbacks from query/indexer logic
5. replace `/admin/config` with an effective-config endpoint
6. retire `embeddings/config.py` and `reranker/config.py` as behavior sources
7. deprecate legacy tenant translation flags

## 10. Expected Outcome

After the redesign:

- developers can answer “where does this setting come from?” in one step
- operators can see effective config without reading source code
- backend, indexer, translator, embedding, and reranker all share one model
- tenant behavior is explicit instead of partially implicit
- migration becomes safer because defaults and precedence are centralized
- adding a new provider/backend becomes configuration extension, not configuration archaeology

## 11. Summary

The current system has the right intent but not yet the right implementation shape.

Today the main problems are:

- duplicate config loaders
- inconsistent precedence
- duplicated defaults
- config hidden in runtime logic
- weak effective-config visibility
- leftover legacy concepts

The recommended direction is:

- one root typed config
- one loader pipeline
- explicit layered sources
- narrow env responsibility
- no hidden business fallbacks
- observable effective config

That design is practical to implement incrementally in this repository and aligns well with the project's multi-tenant, multi-service, provider/backend-based architecture.
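
As a closing illustration of the "observable effective config" item, here is a minimal sketch of how a sanitized view and a config hash might be produced from a resolved config tree. The helper names `sanitize` and `config_hash`, the list of secret markers, the `"***"` mask, and the sample values are all assumptions made for the example; nothing here reflects existing code in the repository.

```python
import hashlib
import json
from typing import Any

# Hypothetical rule: any key containing one of these fragments is treated as a secret.
SECRET_MARKERS = ("password", "api_key", "auth_key", "secret")


def sanitize(node: Any) -> Any:
    """Recursively copy a config tree, masking values stored under secret-looking keys."""
    if isinstance(node, dict):
        return {
            key: "***" if any(marker in key.lower() for marker in SECRET_MARKERS) else sanitize(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [sanitize(item) for item in node]
    return node


def config_hash(config: dict) -> str:
    """Short, stable hash of the full config; key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Made-up effective config for demonstration only.
effective = {
    "runtime": {"environment": "dev"},
    "infrastructure": {"redis": {"host": "localhost", "password": "hunter2"}},
    "services": {"rerank": {"backend": "qwen3_vllm"}},
}

# Roughly what a `config dump` command or a meta endpoint could return.
print(json.dumps({"hash": config_hash(effective), "config": sanitize(effective)}, indent=2))
```

Comparing the hash across processes gives a quick way to check whether two services resolved the same configuration, while the sanitized tree can be logged or returned from an admin endpoint without exposing credentials.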