# Configuration System Review And Redesign
## 1. Goal
This document reviews the current configuration system and proposes a practical redesign for long-term maintainability.
The target is a configuration system that is:
- unified in loading and ownership
- clear in boundaries and precedence
- visible in effective behavior
- easy to evolve across development, deployment, and operations
This review is based on the current implementation, not only on the intended architecture described in the docs.
## 2. Project Context
The repo already defines the right architectural direction:
- `config/config.yaml` should be the main configuration source for search behavior and service wiring
- `.env` should mainly carry deployment-specific values and secrets
- provider/backend expansion should stay centralized instead of spreading through business code
That direction is described in:
- [`README.md`](/data/saas-search/README.md)
- [`docs/DEVELOPER_GUIDE.md`](/data/saas-search/docs/DEVELOPER_GUIDE.md)
- [`docs/QUICKSTART.md`](/data/saas-search/docs/QUICKSTART.md)
- [`translation/README.md`](/data/saas-search/translation/README.md)
The problem is not the architectural intent. The problem is that the current implementation only partially follows it.
## 3. Current-State Review
### 3.1 What exists today
The current system effectively has several configuration channels:
- `config/config.yaml`
  - search behavior
  - rerank behavior
  - services registry
  - tenant config
- `config/config_loader.py`
  - parses search behavior and tenant config into `SearchConfig`
  - also injects some defaults from code
- `config/services_config.py`
  - parses `config/config.yaml` a second time, independently of `config_loader.py`
  - resolves translation, embedding, rerank service config
  - also applies env overrides
- `config/env_config.py`
  - loads `.env`
  - defines ES, Redis, DB, host/port, service URLs, namespace, model path defaults
- service-local config modules
  - [`embeddings/config.py`](/data/saas-search/embeddings/config.py)
  - [`reranker/config.py`](/data/saas-search/reranker/config.py)
- startup scripts
  - derive defaults from shell env, Python config, and YAML in different combinations
- inline fallbacks in business logic
  - query parsing
  - indexing
  - service startup
### 3.2 Main findings
#### Finding A: there is no single loader for the full effective configuration
`ConfigLoader` and `services_config` both parse `config/config.yaml`, but they do so separately and with different responsibilities.
- [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L148)
- [`config/services_config.py`](/data/saas-search/config/services_config.py#L33)
Impact:
- the same file is loaded twice through different code paths
- search config and services config can drift in interpretation
- alternative config paths are hard to support cleanly
- tests and tools cannot ask one place for the full effective config tree
#### Finding B: precedence is not explicit, stable, or globally enforced
Current precedence differs by subsystem:
- search behavior mostly comes from YAML plus code defaults
- embedding and rerank allow env overrides for provider/backend/url
- translation intentionally blocks some env overrides
- startup scripts still choose host/port and mode via env
- some values are reconstructed from other env vars
Examples:
- env override for embedding provider/url/backend:
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L52)
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L68)
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L139)
- host/port and service URL reconstruction:
  - [`config/env_config.py`](/data/saas-search/config/env_config.py#L55)
  - [`config/env_config.py`](/data/saas-search/config/env_config.py#L75)
- translator host/port still driven by startup env:
  - [`scripts/start_translator.sh`](/data/saas-search/scripts/start_translator.sh#L28)
Impact:
- operators cannot reliably predict the effective configuration by reading one file
- the same setting category behaves differently across services
- incidents become harder to debug because source-of-truth depends on the code path
#### Finding C: defaults are duplicated across YAML and code
There are several layers of default values:
- dataclass defaults in `QueryConfig`
- fallback defaults in `ConfigLoader._parse_config`
- defaults in `config.yaml`
- defaults in `env_config.py`
- defaults in `embeddings/config.py`
- defaults in `reranker/config.py`
- defaults in startup scripts
Examples:
- query defaults duplicated in dataclass and parser:
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L24)
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L240)
- embedding defaults duplicated in YAML, `services_config`, `embeddings/config.py`, and startup script:
  - [`config/config.yaml`](/data/saas-search/config/config.yaml#L196)
  - [`embeddings/config.py`](/data/saas-search/embeddings/config.py#L14)
  - [`scripts/start_embedding_service.sh`](/data/saas-search/scripts/start_embedding_service.sh#L29)
- reranker defaults duplicated in YAML and `reranker/config.py`:
  - [`config/config.yaml`](/data/saas-search/config/config.yaml#L214)
  - [`reranker/config.py`](/data/saas-search/reranker/config.py#L6)
Impact:
- changing a default is risky because there may be multiple hidden copies
- code review cannot easily tell whether a value is authoritative or dead legacy
- “same config” may behave differently across processes
#### Finding D: config is still embedded in runtime logic
Some important behavior remains encoded as inline fallback logic rather than declared config.
Examples:
- query-time translation target languages fall back to `["en", "zh"]`:
  - [`query/query_parser.py`](/data/saas-search/query/query_parser.py#L339)
- indexer text handling and LLM enrichment also fall back to `["en", "zh"]`:
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L216)
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L310)
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L649)
Impact:
- configuration is not fully visible in config files
- behavior can silently change when tenant config is missing or malformed
- “default behavior” is spread across business modules
#### Finding E: some configuration assets are not managed as first-class config
Query rewrite is configured through an external file, but the file path is hardcoded and currently inconsistent with the repository content.
- loader expects:
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L162)
- repo currently contains:
  - [`config/query_rewrite.dict`](/data/saas-search/config/query_rewrite.dict)
There is also an admin API that mutates rewrite rules in memory only:
- [`api/routes/admin.py`](/data/saas-search/api/routes/admin.py#L68)
- [`query/query_parser.py`](/data/saas-search/query/query_parser.py#L622)
Impact:
- rewrite rules are neither cleanly file-backed nor fully runtime-managed
- restart behavior is unclear
- configuration visibility and persistence are weak
#### Finding F: visibility is limited
The system exposes only a small sanitized subset at `/admin/config`.
- [`api/routes/admin.py`](/data/saas-search/api/routes/admin.py#L42)
At the same time, the true effective config includes:
- tenant overlays
- env overrides
- service backend selections
- script-selected modes
- hidden defaults in code
Impact:
- there is no authoritative “effective config” view
- debugging configuration mismatches requires source reading
- operators cannot easily verify what each process actually started with
#### Finding G: the indexer does not really consume the unified config as a first-class dependency
The indexer startup code explicitly states that config is loaded only for parity/logging and that routes do not depend on it.
- [`api/indexer_app.py`](/data/saas-search/api/indexer_app.py#L76)
Impact:
- configuration is not truly system-wide
- search-side and indexer-side behavior can drift
- the current “unified config” is only partially unified
#### Finding H: docs still carry legacy and mixed mental models
Most high-level docs describe the desired centralized model, but some implementation code and docs still expose legacy concepts such as `translate_to_en` and `translate_to_zh`.
- desired model:
  - [`README.md`](/data/saas-search/README.md#L78)
  - [`docs/DEVELOPER_GUIDE.md`](/data/saas-search/docs/DEVELOPER_GUIDE.md#L207)
  - [`translation/README.md`](/data/saas-search/translation/README.md#L161)
- legacy tenant translation flags still documented:
  - [`indexer/README.md`](/data/saas-search/indexer/README.md#L39)
Impact:
- new developers may follow old mental models
- cleanup work keeps getting deferred because the old and new systems both appear “supported”
## 4. Design Principles For The Redesign
The redesign should follow these rules.
### 4.1 One logical configuration system
It is acceptable to have multiple files, but not multiple loaders with overlapping ownership.
There must be one loader pipeline that produces one typed `AppConfig`.
### 4.2 Configuration files declare, parser code interprets, env provides runtime injection
Responsibilities should be:
- configuration files
  - declare non-secret desired behavior and non-secret deployable settings
- parsing logic
  - load, merge, validate, normalize, and expose typed config
  - never invent hidden business behavior
- environment variables
  - carry secrets and a small set of runtime/process values
  - do not redefine business behavior casually
### 4.3 One precedence rule for the whole system
Every config category should follow the same merge model unless explicitly exempted.
### 4.4 No silent implicit fallback for business behavior
Fail fast at startup when required config is missing or invalid.
Do not silently fall back to legacy behavior such as hardcoded language lists.
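A minimal sketch of this fail-fast rule, under stated assumptions: `ConfigError`, `TenantPolicy`, and `parse_tenant_policy` are hypothetical names, and `index_languages` is borrowed from the tenant fields recommended in section 5.8. The point is only that a missing or malformed list raises at load time instead of being replaced by a hardcoded one.

```python
from __future__ import annotations

from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised at startup when required config is missing or invalid."""


@dataclass(frozen=True)
class TenantPolicy:
    tenant_id: str
    index_languages: tuple[str, ...]


def parse_tenant_policy(tenant_id: str, raw: dict) -> TenantPolicy:
    """Build a tenant policy, rejecting missing or malformed language lists.

    Hypothetical helper: note that there is no `or ["en", "zh"]` fallback.
    """
    languages = raw.get("index_languages")
    if not isinstance(languages, list) or not languages:
        raise ConfigError(
            f"tenant {tenant_id}: 'index_languages' must be a non-empty list"
        )
    if not all(isinstance(lang, str) and lang for lang in languages):
        raise ConfigError(
            f"tenant {tenant_id}: 'index_languages' entries must be non-empty strings"
        )
    return TenantPolicy(tenant_id=tenant_id, index_languages=tuple(languages))
```

If a default language list is genuinely wanted, it would be declared once in the schema or in the default tenant config, not recreated in query or indexer code.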
### 4.5 Effective configuration must be observable
Every service should be able to show:
- config version or hash
- source files loaded
- environment name
- sanitized effective configuration
## 5. Recommended Target Design
### 5.1 Boundary model
Use three clear layers.
#### Layer 1: repository-managed static config
Purpose:
- search behavior
- tenant behavior
- provider/backend registry
- non-secret service topology defaults
- feature switches
Examples:
- field boosts
- query strategy
- rerank fusion parameters
- tenant language plans
- translation capability registry
- embedding backend selection default
#### Layer 2: environment-specific overlays
Purpose:
- per-environment non-secret differences
- service endpoints by environment
- resource sizing defaults by environment
- dev/test/prod operational differences
Examples:
- local embedding URL vs production URL
- dev rerank backend vs prod rerank backend
- lower concurrency in local development
#### Layer 3: environment variables
Purpose:
- secrets
- bind host/port
- external infrastructure credentials
- container-orchestrator last-mile injection
Examples:
- `ES_HOST`, `ES_USERNAME`, `ES_PASSWORD`
- `DB_HOST`, `DB_USERNAME`, `DB_PASSWORD`
- `REDIS_HOST`, `REDIS_PASSWORD`
- `DASHSCOPE_API_KEY`, `DEEPL_AUTH_KEY`
- `API_HOST`, `API_PORT`, `INDEXER_PORT`, `TRANSLATION_PORT`
Rule:
- environment variables should not be the normal path for choosing business behavior such as translation model, embedding backend, or tenant language policy
- if an env override is allowed for a non-secret field, it must be explicitly listed and documented as an operational override, not a hidden convention
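The allowlist rule can be sketched as follows. The variable names come from the examples in this section; the dotted config paths and the `collect_env_overrides` helper are hypothetical and only illustrate that an unlisted variable has no effect.

```python
from __future__ import annotations

import os
from typing import Mapping

# Hypothetical allowlist: env var name -> dotted path in the config tree.
ALLOWED_ENV_OVERRIDES: dict[str, str] = {
    "ES_HOST": "infrastructure.es.host",
    "REDIS_PASSWORD": "infrastructure.redis.password",
    "API_PORT": "runtime.api.port",
}


def collect_env_overrides(environ: Mapping[str, str] | None = None) -> dict[str, str]:
    """Return {dotted_path: value} for allowlisted variables that are set.

    Variables outside the allowlist are ignored, so an undocumented name
    cannot change behavior by convention.
    """
    env = os.environ if environ is None else environ
    return {
        path: env[name]
        for name, path in ALLOWED_ENV_OVERRIDES.items()
        if name in env
    }
```

Because the allowlist is plain data, it can also be rendered into the config README so that the documented overrides and the enforced overrides cannot drift apart.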
### 5.2 Unified precedence
Recommended precedence, from lowest to highest (each later source overrides the earlier ones):
1. schema defaults in code
2. `config/base.yaml`
3. `config/environments/<env>.yaml`
4. tenant overlay from `config/tenants/`
5. environment variables for the explicitly allowed runtime keys
6. CLI flags for the current process only
Important rules:
- only one module may implement this merge logic
- no business module may call `os.getenv()` directly for configuration
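A minimal sketch of the single merge implementation, assuming each source has already been parsed into a plain dict. `deep_merge` and `resolve_layers` are hypothetical names; the one design decision shown is that nested mappings merge key by key while any other value (including lists) is replaced wholesale by the higher-precedence layer.

```python
from __future__ import annotations

import copy
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict where `override` wins; nested mappings merge recursively."""
    merged = copy.deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced, not combined.
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_layers(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge layers given from lowest to highest precedence."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result
```

A call would pass the layers in the order listed above, for example `resolve_layers(schema_defaults, base, env_overlay, tenant_overlay, env_vars, cli_flags)`, where each argument is a hypothetical dict produced by the corresponding source.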
### 5.3 Recommended directory structure
```text
config/
  schema.py
  loader.py
  sources.py
  base.yaml
  environments/
    dev.yaml
    test.yaml
    prod.yaml
  tenants/
    _default.yaml
    1.yaml
    162.yaml
    170.yaml
  dictionaries/
    query_rewrite.dict
  README.md
.env.example
```
Notes:
- `base.yaml` contains shared defaults and feature behavior
- `environments/*.yaml` contains environment-specific non-secret overrides
- `tenants/*.yaml` contains tenant-specific overrides only
- `dictionaries/` stores first-class config assets such as rewrite dictionaries
- `schema.py` defines the typed config model
- `loader.py` is the only entry point that loads and merges config
If the team prefers fewer files, `tenants.yaml` is also acceptable. The key requirement is not “one file”, but “one loading model with clear ownership”.
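As an illustration of how `sources.py` might enumerate this layout, here is a sketch that lists the files in merge order. The function name `config_sources` and the choice to treat `base.yaml` as required and everything else as optional are assumptions, not decisions made by this document.

```python
from __future__ import annotations

from pathlib import Path


def config_sources(config_dir: Path, env: str) -> list[Path]:
    """List config files in merge order (lowest precedence first).

    `base.yaml` is required; the environment overlay and tenant files are
    optional. `_default.yaml` is placed before the other tenant files.
    """
    base = config_dir / "base.yaml"
    if not base.is_file():
        raise FileNotFoundError(f"missing required config file: {base}")
    sources = [base]

    overlay = config_dir / "environments" / f"{env}.yaml"
    if overlay.is_file():
        sources.append(overlay)

    tenants_dir = config_dir / "tenants"
    tenant_files = list(tenants_dir.glob("*.yaml")) if tenants_dir.is_dir() else []
    # `_default.yaml` first, then the remaining tenant files by name.
    tenant_files.sort(key=lambda p: (p.name != "_default.yaml", p.name))
    sources.extend(tenant_files)
    return sources
```

The returned list doubles as the "loaded source files" record that the observability section asks every service to expose.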
### 5.4 Typed configuration model
Introduce one root object, for example:
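```python
from pydantic import BaseModel  # assumption: `BaseModel` refers to pydantic


class AppConfig(BaseModel):
    """Root of the effective configuration tree (subtree models omitted here)."""

    runtime: RuntimeConfig
    infrastructure: InfrastructureConfig
    search: SearchConfig
    services: ServicesConfig
    tenants: TenantCatalogConfig
    assets: ConfigAssets
```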
Suggested subtrees:
- `runtime`
  - environment name
  - config revision/hash
  - bind addresses/ports
- `infrastructure`
  - ES
  - DB
  - Redis
  - index namespace
- `search`
  - field boosts
  - query config
  - function score
  - rerank behavior
  - spu config
- `services`
  - translation
  - embedding
  - rerank
- `tenants`
  - default tenant config
  - tenant overrides
- `assets`
  - rewrite dictionary path
Benefits:
- one validated object shared by backend, indexer, translator, embedding, reranker
- one place for defaults
- one place for schema evolution
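To make "one place for defaults" concrete, here is a sketch of a single typed subtree. It uses stdlib frozen dataclasses rather than `BaseModel` only to stay dependency-free; `RerankServiceConfig`, its `timeout_sec` default, and `build_section` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class RerankServiceConfig:
    endpoint: str
    backend: str
    timeout_sec: float = 10.0  # hypothetical field; the only place its default lives


def build_section(cls, raw: Mapping[str, Any]):
    """Instantiate a frozen config dataclass, rejecting unknown keys.

    Missing required keys surface as a TypeError from the constructor.
    """
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"{cls.__name__}: unknown keys {unknown}")
    return cls(**raw)
```

Rejecting unknown keys is one way to surface typos and stale legacy keys at startup instead of letting them be silently ignored.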
### 5.5 Loading flow
Recommended loading flow:
1. determine the environment name from `APP_ENV` or `RUNTIME_ENV`
2. load schema defaults
3. load `config/base.yaml`
4. load `config/environments/<env>.yaml` if present
5. load tenant files
6. inject first-class assets such as rewrite dictionary
7. apply allowed env overrides
8. validate the final `AppConfig`
9. freeze and cache the config object
10. expose a sanitized effective-config view
Important:
- every process should call the same loader
- services should receive a resolved `AppConfig`, not re-open YAML independently
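The flow above can be sketched end to end as follows. To keep the example free of third-party dependencies, `read_base` and `read_environment` are stand-ins that return in-memory dicts instead of parsing YAML, the values are illustrative, and tenant files, assets, and env overrides are left out. All function names except `get_app_config` are hypothetical.

```python
from __future__ import annotations

import os
from functools import lru_cache
from types import MappingProxyType
from typing import Any, Mapping


def read_base() -> dict[str, Any]:
    # Stand-in for parsing config/base.yaml (value is illustrative).
    return {"search": {"query": {"knn_boost": 0.25}}}


def read_environment(env: str) -> dict[str, Any]:
    # Stand-in for parsing config/environments/<env>.yaml, if present.
    overlays = {"dev": {"search": {"query": {"knn_boost": 0.5}}}}
    return overlays.get(env, {})


def merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Recursive merge where `override` wins."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(out.get(key), Mapping):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out


def validate(tree: Mapping[str, Any]) -> None:
    """Placeholder for schema validation: reject a tree without `search`."""
    if "search" not in tree:
        raise ValueError("config is missing required section 'search'")


def freeze(node: Any) -> Any:
    """Recursively wrap dicts in read-only proxies and lists in tuples."""
    if isinstance(node, Mapping):
        return MappingProxyType({k: freeze(v) for k, v in node.items()})
    if isinstance(node, list):
        return tuple(freeze(v) for v in node)
    return node


def load_app_config(env: str) -> Mapping[str, Any]:
    tree = merge(read_base(), read_environment(env))
    validate(tree)
    return freeze(tree)


@lru_cache(maxsize=1)
def get_app_config() -> Mapping[str, Any]:
    """Process-wide entry point: load once, then return the cached object."""
    # Assumption: default to "dev" when APP_ENV is unset.
    return load_app_config(os.environ.get("APP_ENV", "dev"))
```

Keeping `load_app_config(env)` separate from the cached `get_app_config()` lets tests and tools load any environment explicitly without touching process state.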
### 5.6 Clear responsibility split
#### Configuration files are responsible for
- what the system should do
- what providers/backends are available
- which features are enabled
- tenant language/index policies
- non-secret service topology
#### Parser/loader code is responsible for
- locating sources
- merge precedence
- type validation
- normalization
- deprecation warnings
- producing the final immutable config object
#### Environment variables are responsible for
- secrets
- bind addresses/ports
- infrastructure endpoints when the deployment platform injects them
- a very small set of documented operational overrides
#### Business code is not responsible for
- inventing defaults for missing config
- loading YAML directly
- calling `os.getenv()` for normal application behavior
### 5.7 How to handle service config
Unify all service-facing config under one structure:
```yaml
services:
  translation:
    endpoint: "http://translator:6006"
    timeout_sec: 10
    default_model: "llm"
    default_scene: "general"
    capabilities: ...
  embedding:
    endpoint:
      text: "http://embedding:6005"
      image: "http://embedding-image:6008"
    backend: "tei"
    backends: ...
  rerank:
    endpoint: "http://reranker:6007/rerank"
    backend: "qwen3_vllm"
    backends: ...
```
Rules:
- `endpoint` is how callers reach the service
- `backend` is how the service itself is implemented
- only the service process cares about `backend`
- only callers care about `endpoint`
- both still belong to the same config tree, because they are part of one system
### 5.8 How to handle tenant config
Tenant config should become explicit policy, not translation-era leftovers.
Recommended tenant fields:
- `primary_language`
- `index_languages`
- `search_languages`
- `translation_policy`
- `facet_policy`
- optional tenant-specific ranking overrides
Avoid keeping `translate_to_en` and `translate_to_zh` as active concepts in the long-term model.
If compatibility is needed, support them only in the loader as deprecated aliases and emit warnings.
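A sketch of such a loader-side alias, under a loudly stated assumption: it treats a truthy `translate_to_en` / `translate_to_zh` as "add that language to `index_languages`". The real semantics of the legacy flags are not specified in this document and would need to be confirmed before adopting any mapping like this. `normalize_tenant` is a hypothetical name.

```python
from __future__ import annotations

import warnings
from typing import Any, Mapping

# Assumed mapping from each legacy flag to the language it implies.
LEGACY_TRANSLATION_FLAGS = {"translate_to_en": "en", "translate_to_zh": "zh"}


def normalize_tenant(raw: Mapping[str, Any]) -> dict[str, Any]:
    """Fold legacy translation flags into `index_languages` and warn.

    Assumption: a truthy legacy flag means the target language should be
    indexed. Legacy keys are dropped from the returned dict.
    """
    tenant = {k: v for k, v in raw.items() if k not in LEGACY_TRANSLATION_FLAGS}
    languages = list(tenant.get("index_languages", []))
    for flag, lang in LEGACY_TRANSLATION_FLAGS.items():
        if flag not in raw:
            continue
        warnings.warn(
            f"tenant key '{flag}' is deprecated; use 'index_languages'",
            DeprecationWarning,
            stacklevel=2,
        )
        if raw[flag] and lang not in languages:
            languages.append(lang)
    tenant["index_languages"] = languages
    return tenant
```

Because the translation happens only here, business code never sees the legacy keys, and the collected warnings can feed a "deprecated keys in use" report.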
### 5.9 How to handle rewrite rules and similar assets
Treat them as declared config assets.
Recommended rules:
- file path declared in config
- one canonical location under `config/dictionaries/`
- loader validates presence and format
- admin runtime updates either:
  - are removed, or
  - write back through a controlled persistence path
Do not keep a hybrid model where startup loads one file and admin mutates only in memory.
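A sketch of "loader validates presence and format". The on-disk format of `query_rewrite.dict` is not described in this document, so the format below (one `source<TAB>replacement` rule per line, `#` comments allowed) is purely an assumption, as is the function name.

```python
from __future__ import annotations

from pathlib import Path


def load_rewrite_dictionary(path: Path) -> dict[str, str]:
    """Load rewrite rules, failing loudly on a missing file or malformed line.

    Assumed format (not confirmed by the source): one rule per line,
    `source<TAB>replacement`, with blank lines and `#` comments ignored.
    """
    if not path.is_file():
        raise FileNotFoundError(f"rewrite dictionary not found: {path}")
    rules: dict[str, str] = {}
    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split("\t")
        if len(parts) != 2 or not all(part.strip() for part in parts):
            raise ValueError(f"{path}:{lineno}: expected 'source<TAB>replacement'")
        rules[parts[0].strip()] = parts[1].strip()
    return rules
```

A missing file raising at startup would have exposed the current path mismatch immediately instead of leaving rewrite rules silently empty.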
### 5.10 Observability improvements
Add the following:
- `config dump` CLI that prints sanitized effective config
- startup log with config hash, environment, and config file list
- `/admin/config/effective` endpoint returning sanitized effective config
- `/admin/config/meta` endpoint returning:
  - environment
  - config hash
  - loaded source files
  - deprecated keys in use
This is important for operations and for multi-service debugging.
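The sanitized view and the config hash could be produced by two small helpers like the ones below. This is a sketch: the list of secret markers is a hypothetical heuristic, and it assumes the effective config is a JSON-serializable tree of dicts and lists with string keys.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any

# Hypothetical key-name fragments that mark a value as secret.
SECRET_MARKERS = ("password", "secret", "api_key", "auth_key", "token")


def sanitize(node: Any) -> Any:
    """Return a copy of the config tree with secret-looking values masked."""
    if isinstance(node, dict):
        return {
            key: "***" if any(m in key.lower() for m in SECRET_MARKERS) else sanitize(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [sanitize(item) for item in node]
    return node


def config_hash(node: Any) -> str:
    """Stable short hash of the config tree (key order does not matter)."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Hashing the sanitized tree keeps the hash safe to log, at the cost that rotating a secret does not change it; hashing the unsanitized tree has the opposite trade-off. Either way, two processes reporting the same hash were started from the same effective config under that definition.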
## 6. Practical Refactor Plan
The refactor should be incremental.
### Phase 1: establish the new config core without changing behavior
- create `config/schema.py`
- create `config/loader.py`
- move all current defaults into schema models
- make loader read current `config/config.yaml`
- make loader read `.env` only for approved keys
- expose one `get_app_config()`
Result:
- same behavior, but one typed root config becomes available
### Phase 2: remove duplicate readers
- make `services_config.py` a thin adapter over `get_app_config()`
- make `tenant_config_loader.py` read from `get_app_config()`
- stop reparsing YAML in `services_config.py`
- stop service modules from depending on legacy local config modules for behavior
Result:
- one parsing path
- fewer divergence risks
### Phase 3: move hidden defaults out of business logic
- remove hardcoded fallback language lists from query/indexer modules
- require tenant defaults to come from config schema only
- remove duplicate behavior defaults from service code
Result:
- behavior becomes visible and reviewable
### Phase 4: clean service startup configuration
- make startup scripts ask the unified loader for resolved values
- keep only bind host/port and secret injection in shell env
- retire or reduce `embeddings/config.py` and `reranker/config.py`
Result:
- startup behavior matches runtime config model
### Phase 5: split config files by responsibility
- keep a single root loader
- split current giant `config.yaml` into:
  - `base.yaml`
  - `environments/<env>.yaml`
  - `tenants/*.yaml`
  - `dictionaries/query_rewrite.dict`
Result:
- config remains unified logically, but is easier to read and maintain physically
### Phase 6: deprecate legacy compatibility
- deprecate `translate_to_en` and `translate_to_zh`
- deprecate env-based backend/provider selection except for explicitly approved keys
- remove old code paths after one or two release cycles
Result:
- the system becomes simpler instead of carrying two generations forever
## 7. Concrete Rules To Adopt
These rules should be documented and enforced in code review.
### Rule 1
Only `config/loader.py` may load config files or `.env`.
### Rule 2
Only `config/loader.py` may read `os.getenv()` for application config.
### Rule 3
Business modules receive typed config objects and do not read files or env directly.
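Rules 1 to 3 are easier to keep when code review is backed by an automated check. A minimal sketch using the standard `ast` module follows; `find_env_reads` is a hypothetical helper that a test or pre-commit hook could run over every module except `config/loader.py`.

```python
from __future__ import annotations

import ast


def find_env_reads(source: str) -> list[int]:
    """Return line numbers where `os.getenv` or `os.environ` is referenced.

    Only the `os.<name>` spelling is caught; `from os import getenv` and
    aliased imports are not detected by this sketch.
    """
    hits: list[int] = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "os"
            and node.attr in {"getenv", "environ"}
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

A non-empty result for any business module would fail the check, which turns the rule from a review convention into something a build can enforce.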
### Rule 4
Each config key has one owner.
Examples:
- `search.query.knn_boost` belongs to search behavior config
- `services.embedding.backend` belongs to service implementation config
- `infrastructure.redis.password` belongs to env/secrets
### Rule 5
Every fallback must be either:
- declared in schema defaults, or
- rejected at startup
No hidden fallback in runtime logic.
### Rule 6
Every configuration asset must be visible in one of these places only:
- config file
- env var
- generated runtime metadata
Not inside parser code as an implicit constant.
## 8. Recommended Naming Conventions
Suggested conventions:
- config keys use noun-based hierarchical names
- avoid mixing transport and implementation concepts in one field
- use `endpoint` for caller-facing addresses
- use `backend` for service-internal implementation choice
- use `enabled` only for true feature toggles
- use `default_*` only when a real selection happens at runtime
Examples:
- good: `services.rerank.endpoint`
- good: `services.rerank.backend`
- good: `tenants.default.index_languages`
- avoid: `service_url`, `base_url`, `provider`, `backend`, and script env all meaning slightly different things without a common model
## 9. Highest-Priority Cleanup Items
If the team wants the shortest path to improvement, start here:
1. build one root `AppConfig`
2. make `services_config.py` stop reparsing YAML
3. declare rewrite dictionary path explicitly and fix the current mismatch
4. remove hardcoded `["en", "zh"]` fallbacks from query/indexer logic
5. replace the limited `/admin/config` view with an effective-config endpoint (see 5.10)
6. retire `embeddings/config.py` and `reranker/config.py` as behavior sources
7. deprecate legacy tenant translation flags
## 10. Expected Outcome
After the redesign:
- developers can answer “where does this setting come from?” in one step
- operators can see effective config without reading source code
- backend, indexer, translator, embedding, and reranker all share one model
- tenant behavior is explicit instead of partially implicit
- migration becomes safer because defaults and precedence are centralized
- adding a new provider/backend becomes configuration extension, not configuration archaeology
## 11. Summary
The current system has the right intent but not yet the right implementation shape.
Today the main problems are:
- duplicate config loaders
- inconsistent precedence
- duplicated defaults
- config hidden in runtime logic
- weak effective-config visibility
- leftover legacy concepts
The recommended direction is:
- one root typed config
- one loader pipeline
- explicit layered sources
- narrow env responsibility
- no hidden business fallbacks
- observable effective config
That design is practical to implement incrementally in this repository and aligns well with the project's multi-tenant, multi-service, provider/backend-based architecture.