# Configuration System Review And Redesign

## 1. Goal

This document reviews the current configuration system and proposes a practical redesign for long-term maintainability.

The target is a configuration system that is:

- unified in loading and ownership
- clear in boundaries and precedence
- visible in effective behavior
- easy to evolve across development, deployment, and operations

This review is based on the current implementation, not only on the architecture described in the docs.

## 2. Project Context

The repo already defines the right architectural direction:

- `config/config.yaml` should be the main configuration source for search behavior and service wiring
- `.env` should mainly carry deployment-specific values and secrets
- provider/backend expansion should stay centralized instead of spreading through business code

That direction is described in:

- [`README.md`](/data/saas-search/README.md)
- [`docs/DEVELOPER_GUIDE.md`](/data/saas-search/docs/DEVELOPER_GUIDE.md)
- [`docs/QUICKSTART.md`](/data/saas-search/docs/QUICKSTART.md)
- [`translation/README.md`](/data/saas-search/translation/README.md)

The problem is not the architectural intent. The problem is that the current implementation only partially follows it.

## 3. Current-State Review

### 3.1 What exists today

The current system effectively has several configuration channels:

- `config/config.yaml`
  - search behavior
  - rerank behavior
  - services registry
  - tenant config
- `config/config_loader.py`
  - parses search behavior and tenant config into `SearchConfig`
  - also injects some defaults from code
- `config/services_config.py`
  - parses `config/config.yaml` a second time, independently of `config_loader.py`
  - resolves translation, embedding, rerank service config
  - also applies env overrides
- `config/env_config.py`
  - loads `.env`
  - defines ES, Redis, DB, host/port, service URLs, namespace, model path defaults
- service-local config modules
  - [`embeddings/config.py`](/data/saas-search/embeddings/config.py)
  - [`reranker/config.py`](/data/saas-search/reranker/config.py)
- startup scripts
  - derive defaults from shell env, Python config, and YAML in different combinations
- inline fallbacks in business logic
  - query parsing
  - indexing
  - service startup

### 3.2 Main findings

#### Finding A: there is no single loader for the full effective configuration

`ConfigLoader` and `services_config` both parse `config/config.yaml`, but they do so separately and with different responsibilities.

- [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L148)
- [`config/services_config.py`](/data/saas-search/config/services_config.py#L33)

Impact:

- the same file is loaded twice through different code paths
- search config and services config can drift in interpretation
- alternative config paths are hard to support cleanly
- tests and tools cannot ask one place for the full effective config tree

#### Finding B: precedence is not explicit, stable, or globally enforced

Current precedence differs by subsystem:

- search behavior mostly comes from YAML plus code defaults
- embedding and rerank allow env overrides for provider/backend/url
- translation intentionally blocks some env overrides
- startup scripts still choose host/port and mode via env
- some values are reconstructed from other env vars

Examples:

- env override for embedding provider/url/backend:
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L52)
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L68)
  - [`config/services_config.py`](/data/saas-search/config/services_config.py#L139)
- host/port and service URL reconstruction:
  - [`config/env_config.py`](/data/saas-search/config/env_config.py#L55)
  - [`config/env_config.py`](/data/saas-search/config/env_config.py#L75)
- translator host/port still driven by startup env:
  - [`scripts/start_translator.sh`](/data/saas-search/scripts/start_translator.sh#L28)

Impact:

- operators cannot reliably predict the effective configuration by reading one file
- the same setting category behaves differently across services
- incidents become harder to debug because the source of truth depends on the code path

#### Finding C: defaults are duplicated across YAML and code

There are several layers of default values:

- dataclass defaults in `QueryConfig`
- fallback defaults in `ConfigLoader._parse_config`
- defaults in `config.yaml`
- defaults in `env_config.py`
- defaults in `embeddings/config.py`
- defaults in `reranker/config.py`
- defaults in startup scripts

Examples:

- query defaults duplicated in dataclass and parser:
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L24)
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L240)
- embedding defaults duplicated in YAML, `services_config`, `embeddings/config.py`, and startup script:
  - [`config/config.yaml`](/data/saas-search/config/config.yaml#L196)
  - [`embeddings/config.py`](/data/saas-search/embeddings/config.py#L14)
  - [`scripts/start_embedding_service.sh`](/data/saas-search/scripts/start_embedding_service.sh#L29)
- reranker defaults duplicated in YAML and `reranker/config.py`:
  - [`config/config.yaml`](/data/saas-search/config/config.yaml#L214)
  - [`reranker/config.py`](/data/saas-search/reranker/config.py#L6)

Impact:

- changing a default is risky because there may be multiple hidden copies
- code review cannot easily tell whether a value is authoritative or dead legacy
- the “same config” may behave differently across processes

#### Finding D: config is still embedded in runtime logic

Some important behavior remains encoded as inline fallback logic rather than declared config.

Examples:

- query-time translation target languages fall back to `["en", "zh"]`:
  - [`query/query_parser.py`](/data/saas-search/query/query_parser.py#L339)
- indexer text handling and LLM enrichment also fall back to `["en", "zh"]`:
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L216)
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L310)
  - [`indexer/document_transformer.py`](/data/saas-search/indexer/document_transformer.py#L649)

Impact:

- configuration is not fully visible in config files
- behavior can silently change when tenant config is missing or malformed
- “default behavior” is spread across business modules

#### Finding E: some configuration assets are not managed as first-class config

Query rewrite is configured through an external file, but the file path is hardcoded and currently inconsistent with the repository content.

- loader expects:
  - [`config/config_loader.py`](/data/saas-search/config/config_loader.py#L162)
- repo currently contains:
  - [`config/query_rewrite.dict`](/data/saas-search/config/query_rewrite.dict)

There is also an admin API that mutates rewrite rules in memory only:

- [`api/routes/admin.py`](/data/saas-search/api/routes/admin.py#L68)
- [`query/query_parser.py`](/data/saas-search/query/query_parser.py#L622)

Impact:

- rewrite rules are neither cleanly file-backed nor fully runtime-managed
- restart behavior is unclear
- configuration visibility and persistence are weak

#### Finding F: visibility is limited

The system exposes only a small sanitized subset at `/admin/config`.

- [`api/routes/admin.py`](/data/saas-search/api/routes/admin.py#L42)

At the same time, the true effective config includes:

- tenant overlays
- env overrides
- service backend selections
- script-selected modes
- hidden defaults in code

Impact:

- there is no authoritative “effective config” view
- debugging configuration mismatches requires source reading
- operators cannot easily verify what each process actually started with

#### Finding G: the indexer does not really consume the unified config as a first-class dependency

The indexer startup code explicitly states that config is loaded only for parity/logging and that routes do not depend on it.

- [`api/indexer_app.py`](/data/saas-search/api/indexer_app.py#L76)

Impact:

- configuration is not truly system-wide
- search-side and indexer-side behavior can drift
- the current “unified config” is only partially unified

#### Finding H: docs still carry legacy and mixed mental models

Most high-level docs describe the desired centralized model, but some implementation code and docs still expose legacy concepts such as `translate_to_en` and `translate_to_zh`.

- desired model:
  - [`README.md`](/data/saas-search/README.md#L78)
  - [`docs/DEVELOPER_GUIDE.md`](/data/saas-search/docs/DEVELOPER_GUIDE.md#L207)
  - [`translation/README.md`](/data/saas-search/translation/README.md#L161)
- legacy tenant translation flags still documented:
  - [`indexer/README.md`](/data/saas-search/indexer/README.md#L39)

Impact:

- new developers may follow old mental models
- cleanup work keeps getting deferred because both the old and the new system appear to be “supported”

## 4. Design Principles For The Redesign

The redesign should follow these rules.

### 4.1 One logical configuration system

It is acceptable to have multiple files, but not multiple loaders with overlapping ownership.

There must be one loader pipeline that produces one typed `AppConfig`.

### 4.2 Configuration files declare, parser code interprets, env provides runtime injection

Responsibilities should be:

- configuration files
  - declare non-secret desired behavior and non-secret deployable settings
- parsing logic
  - load, merge, validate, normalize, and expose typed config
  - never invent hidden business behavior
- environment variables
  - carry secrets and a small set of runtime/process values
  - do not redefine business behavior casually

### 4.3 One precedence rule for the whole system

Every config category should follow the same merge model unless explicitly exempted.

### 4.4 No silent implicit fallback for business behavior

Fail fast at startup when required config is missing or invalid.

Do not silently fall back to legacy behavior such as hardcoded language lists.
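A minimal sketch of this rule, assuming the tenant config has already been parsed into a plain dict. `ConfigError` and `require_languages` are hypothetical names used for illustration only; they do not exist in the repo.

```python
class ConfigError(ValueError):
    """Raised at startup when required configuration is missing or invalid."""


def require_languages(tenant_id: str, tenant_cfg: dict) -> list[str]:
    """Return the tenant's index languages or raise; never substitute a default."""
    languages = tenant_cfg.get("index_languages")
    if not isinstance(languages, list) or not languages:
        raise ConfigError(
            f"tenant {tenant_id}: 'index_languages' must be a non-empty list"
        )
    if not all(isinstance(lang, str) and lang for lang in languages):
        raise ConfigError(
            f"tenant {tenant_id}: language codes must be non-empty strings"
        )
    return list(languages)
```

The point of the sketch is the absence of an `else` branch that returns a hardcoded list: a missing or malformed value stops startup instead of changing behavior quietly.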

### 4.5 Effective configuration must be observable

Every service should be able to show:

- config version or hash
- source files loaded
- environment name
- sanitized effective configuration

## 5. Recommended Target Design

### 5.1 Boundary model

Use three clear layers.

#### Layer 1: repository-managed static config

Purpose:

- search behavior
- tenant behavior
- provider/backend registry
- non-secret service topology defaults
- feature switches

Examples:

- field boosts
- query strategy
- rerank fusion parameters
- tenant language plans
- translation capability registry
- embedding backend selection default

#### Layer 2: environment-specific overlays

Purpose:

- per-environment non-secret differences
- service endpoints by environment
- resource sizing defaults by environment
- dev/test/prod operational differences

Examples:

- local embedding URL vs production URL
- dev rerank backend vs prod rerank backend
- lower concurrency in local development

#### Layer 3: environment variables

Purpose:

- secrets
- bind host/port
- external infrastructure credentials
- container-orchestrator last-mile injection

Examples:

- `ES_HOST`, `ES_USERNAME`, `ES_PASSWORD`
- `DB_HOST`, `DB_USERNAME`, `DB_PASSWORD`
- `REDIS_HOST`, `REDIS_PASSWORD`
- `DASHSCOPE_API_KEY`, `DEEPL_AUTH_KEY`
- `API_HOST`, `API_PORT`, `INDEXER_PORT`, `TRANSLATION_PORT`

Rule:

- environment variables should not be the normal path for choosing business behavior such as translation model, embedding backend, or tenant language policy
- if an env override is allowed for a non-secret field, it must be explicitly listed and documented as an operational override, not a hidden convention
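One way to make the allowed overrides explicit is a single allowlist that maps each variable to a path in the config tree. The sketch below assumes the config is a nested dict; the dotted paths and the `apply_env_overrides` helper are hypothetical, and only the variable names `API_PORT` and `ES_PASSWORD` come from the examples above.

```python
import copy

# Hypothetical allowlist: env var name -> dotted path in the config tree.
ALLOWED_ENV_OVERRIDES = {
    "API_PORT": "runtime.api.port",
    "ES_PASSWORD": "infrastructure.es.password",
}


def apply_env_overrides(config: dict, environ: dict) -> dict:
    """Copy allowlisted env values into a deep copy of the config tree.

    Variables that are not in the allowlist are ignored.
    """
    result = copy.deepcopy(config)
    for name, path in ALLOWED_ENV_OVERRIDES.items():
        if name not in environ:
            continue
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        # Values stay strings here; type coercion is left to schema validation.
        node[leaf] = environ[name]
    return result
```

In a real loader the second argument would be `os.environ`. Passing it in as a parameter keeps the function testable and keeps the only environment read in one place.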

### 5.2 Unified precedence

Recommended precedence, from lowest to highest (each later source overrides the earlier ones):

1. schema defaults in code
2. `config/base.yaml`
3. `config/environments/<env>.yaml`
4. tenant overlay from `config/tenants/`
5. environment variables for the explicitly allowed runtime keys
6. CLI flags for the current process only

Important rule:

- only one module may implement this merge logic
- no business module may call `os.getenv()` directly for configuration
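The merge itself can be small. The following is a sketch under the assumption that every source has already been parsed into a nested dict and that layers are passed in from lowest to highest precedence; `deep_merge` and `resolve` are illustrative names, not existing functions.

```python
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = value
    return merged


def resolve(layers: list[dict]) -> dict:
    """Merge layers listed from lowest to highest precedence."""
    return reduce(deep_merge, layers, {})
```

Replacing lists wholesale is a deliberate choice in this sketch: it keeps the result predictable, at the cost of requiring a tenant or environment file to restate a full list when it changes one entry.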

### 5.3 Recommended directory structure

```text
config/
  schema.py
  loader.py
  sources.py
  base.yaml
  environments/
    dev.yaml
    test.yaml
    prod.yaml
  tenants/
    _default.yaml
    1.yaml
    162.yaml
    170.yaml
  dictionaries/
    query_rewrite.dict
  README.md
.env.example
```

Notes:

- `base.yaml` contains shared defaults and feature behavior
- `environments/*.yaml` contains environment-specific non-secret overrides
- `tenants/*.yaml` contains tenant-specific overrides only
- `dictionaries/` stores first-class config assets such as rewrite dictionaries
- `schema.py` defines the typed config model
- `loader.py` is the only entry point that loads and merges config

If the team prefers fewer files, a single `tenants.yaml` is also acceptable. The key requirement is not “one file”, but “one loading model with clear ownership”.

### 5.4 Typed configuration model

Introduce one root object, for example:

```python
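from pydantic import BaseModel  # assumption: pydantic, as implied by `BaseModel`


# The sub-models (RuntimeConfig, SearchConfig, ...) would be defined alongside in schema.py.
class AppConfig(BaseModel):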
    runtime: RuntimeConfig
    infrastructure: InfrastructureConfig
    search: SearchConfig
    services: ServicesConfig
    tenants: TenantCatalogConfig
    assets: ConfigAssets
```

Suggested subtrees:

- `runtime`
  - environment name
  - config revision/hash
  - bind addresses/ports
- `infrastructure`
  - ES
  - DB
  - Redis
  - index namespace
- `search`
  - field boosts
  - query config
  - function score
  - rerank behavior
  - spu config
- `services`
  - translation
  - embedding
  - rerank
- `tenants`
  - default tenant config
  - tenant overrides
- `assets`
  - rewrite dictionary path

Benefits:

- one validated object shared by backend, indexer, translator, embedding, reranker
- one place for defaults
- one place for schema evolution

### 5.5 Loading flow

Recommended loading flow:

1. determine the environment name from `APP_ENV` or `RUNTIME_ENV`
2. load schema defaults
3. load `config/base.yaml`
4. load `config/environments/<env>.yaml` if present
5. load tenant files
6. inject first-class assets such as rewrite dictionary
7. apply allowed env overrides
8. validate the final `AppConfig`
9. freeze and cache the config object
10. expose a sanitized effective-config view

Important:

- every process should call the same loader
- services should receive a resolved `AppConfig`, not re-open YAML independently
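The validate, freeze, and cache steps could look roughly like the sketch below. It uses only the standard library and a deliberately simplified `AppConfig`; `_load_merged_tree` is a hypothetical placeholder for the file and env loading steps, and the returned values are made up for the example.

```python
from dataclasses import dataclass
from functools import lru_cache
from types import MappingProxyType


@dataclass(frozen=True)
class AppConfig:
    """Simplified stand-in for the typed root object."""
    environment: str
    values: MappingProxyType  # read-only view of the merged top-level keys


def _load_merged_tree() -> dict:
    """Placeholder for steps 1-7; a real loader would read YAML and env here."""
    return {"runtime": {"environment": "dev"}, "search": {"query": {}}}


@lru_cache(maxsize=1)
def get_app_config() -> AppConfig:
    """Build the config once per process and return the same object afterwards."""
    tree = _load_merged_tree()
    environment = tree.get("runtime", {}).get("environment")
    if not environment:
        raise ValueError("runtime.environment is required")
    return AppConfig(environment=environment, values=MappingProxyType(tree))
```

Note that `MappingProxyType` only protects the top level; nested dicts remain mutable. A typed model with frozen sub-models would close that gap, which is one reason to prefer a schema library over raw dicts in the real implementation.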

### 5.6 Clear responsibility split

#### Configuration files are responsible for

- what the system should do
- what providers/backends are available
- which features are enabled
- tenant language/index policies
- non-secret service topology

#### Parser/loader code is responsible for

- locating sources
- merge precedence
- type validation
- normalization
- deprecation warnings
- producing the final immutable config object

#### Environment variables are responsible for

- secrets
- bind addresses/ports
- infrastructure endpoints when the deployment platform injects them
- a very small set of documented operational overrides

#### Business code is not responsible for

- inventing defaults for missing config
- loading YAML directly
- calling `os.getenv()` for normal application behavior

### 5.7 How to handle service config

Unify all service-facing config under one structure:

```yaml
services:
  translation:
    endpoint: "http://translator:6006"
    timeout_sec: 10
    default_model: "llm"
    default_scene: "general"
    capabilities: ...
  embedding:
    endpoint:
      text: "http://embedding:6005"
      image: "http://embedding-image:6008"
    backend: "tei"
    backends: ...
  rerank:
    endpoint: "http://reranker:6007/rerank"
    backend: "qwen3_vllm"
    backends: ...
```

Rules:

- `endpoint` is how callers reach the service
- `backend` is how the service itself is implemented
- only the service process cares about `backend`
- only callers care about `endpoint`
- both still belong to the same config tree, because they are part of one system
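As an illustration of the `endpoint`/`backend` split, the sketch below parses the `rerank` subtree into one typed object. `RerankServiceConfig`, `parse_rerank`, and `active_backend_settings` are hypothetical names, and the shape of `backends` (a mapping keyed by backend name) is an assumption, since the YAML above elides it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankServiceConfig:
    endpoint: str   # how callers reach the service
    backend: str    # implementation choice, read only by the reranker process
    backends: dict  # assumed shape: per-backend settings keyed by backend name

    def active_backend_settings(self) -> dict:
        """Settings for the selected backend; fails if it is not registered."""
        if self.backend not in self.backends:
            raise ValueError(
                f"rerank backend {self.backend!r} is not listed under 'backends'"
            )
        return self.backends[self.backend]


def parse_rerank(services: dict) -> RerankServiceConfig:
    """Build the typed view from the already-merged `services` subtree."""
    raw = services["rerank"]
    return RerankServiceConfig(
        endpoint=raw["endpoint"],
        backend=raw["backend"],
        backends=raw.get("backends", {}),
    )
```

A caller would read only `endpoint`, while the reranker process would call `active_backend_settings()` at startup, so a backend name that is not registered is caught before the service accepts traffic.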

### 5.8 How to handle tenant config

Tenant config should become explicit policy, not translation-era leftovers.

Recommended tenant fields:

- `primary_language`
- `index_languages`
- `search_languages`
- `translation_policy`
- `facet_policy`
- optional tenant-specific ranking overrides

Avoid keeping `translate_to_en` and `translate_to_zh` as active concepts in the long-term model.

If compatibility is needed, support them only in the loader as deprecated aliases and emit warnings.
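A possible shape for that compatibility shim is sketched below. The mapping from each legacy flag to a language code in `index_languages` is an assumption made for the example; the real semantics of `translate_to_en` and `translate_to_zh` should be checked against the current tenant loader. `normalize_tenant` is a hypothetical name.

```python
import warnings

# Assumed mapping from legacy flags to language codes; the real semantics may differ.
LEGACY_FLAGS = {"translate_to_en": "en", "translate_to_zh": "zh"}


def normalize_tenant(raw: dict) -> dict:
    """Translate legacy flags into `index_languages` and warn about each one."""
    tenant = {k: v for k, v in raw.items() if k not in LEGACY_FLAGS}
    languages = list(tenant.get("index_languages", []))
    for flag, lang in LEGACY_FLAGS.items():
        if flag not in raw:
            continue
        warnings.warn(
            f"tenant key '{flag}' is deprecated; use 'index_languages'",
            DeprecationWarning,
            stacklevel=2,
        )
        if raw[flag] and lang not in languages:
            languages.append(lang)
    tenant["index_languages"] = languages
    return tenant
```

Because the legacy keys are dropped from the returned dict, code downstream of the loader never sees them, which keeps the old concepts from leaking back into business logic.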

### 5.9 How to handle rewrite rules and similar assets

Treat them as declared config assets.

Recommended rules:

- file path declared in config
- one canonical location under `config/dictionaries/`
- loader validates presence and format
- admin runtime updates either:
  - are removed, or
  - write back through a controlled persistence path

Do not keep a hybrid model where startup loads one file and admin mutates only in memory.
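The "loader validates presence and format" rule might be implemented along these lines. The file format is not specified in this document, so the sketch assumes one `source<TAB>target` pair per line with `#` comments; adjust the parsing to the actual dictionary format. `load_rewrite_dictionary` is a hypothetical name.

```python
from pathlib import Path


def load_rewrite_dictionary(path: Path) -> dict[str, str]:
    """Parse an assumed 'source<TAB>target' file; raise on any problem."""
    if not path.is_file():
        raise FileNotFoundError(f"rewrite dictionary not found: {path}")
    rules: dict[str, str] = {}
    lines = path.read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, 1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 2 or not all(p.strip() for p in parts):
            raise ValueError(f"{path}:{number}: expected 'source<TAB>target'")
        source, target = (p.strip() for p in parts)
        if source in rules:
            raise ValueError(f"{path}:{number}: duplicate rule for {source!r}")
        rules[source] = target
    return rules
```

Failing on a missing file at startup would surface a path mismatch like the one described in Finding E immediately, rather than leaving query rewriting silently disabled.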

### 5.10 Observability improvements

Add the following:

- `config dump` CLI that prints sanitized effective config
- startup log with config hash, environment, and config file list
- `/admin/config/effective` endpoint returning sanitized effective config
- `/admin/config/meta` endpoint returning:
  - environment
  - config hash
  - loaded source files
  - deprecated keys in use

This is important for operations and for multi-service debugging.
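Both the CLI dump and the endpoints need the same two primitives: a sanitizer and a stable hash. The sketch below assumes the effective config is a JSON-serializable nested dict with string keys; `SECRET_MARKERS`, `sanitize`, and `config_hash` are illustrative names, and the marker list is a starting point rather than a complete policy.

```python
import hashlib
import json

# Key-name fragments treated as secret; an assumption, extend as needed.
SECRET_MARKERS = ("password", "secret", "api_key", "auth_key", "token")


def sanitize(node):
    """Return a copy of the tree with secret-looking values masked."""
    if isinstance(node, dict):
        return {
            key: "***" if any(m in key.lower() for m in SECRET_MARKERS) else sanitize(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [sanitize(item) for item in node]
    return node


def config_hash(tree: dict) -> str:
    """Short, order-independent hash of the full (unsanitized) tree."""
    canonical = json.dumps(tree, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Hashing the unsanitized tree means a rotated secret changes the hash, which helps detect processes that started with stale credentials. If exposing a digest derived from secrets is a concern, hashing the sanitized tree is the safer alternative.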

## 6. Practical Refactor Plan

The refactor should be incremental.

### Phase 1: establish the new config core without changing behavior

- create `config/schema.py`
- create `config/loader.py`
- move all current defaults into schema models
- make loader read current `config/config.yaml`
- make loader read `.env` only for approved keys
- expose one `get_app_config()`

Result:

- same behavior, but one typed root config becomes available

### Phase 2: remove duplicate readers

- make `services_config.py` a thin adapter over `get_app_config()`
- make `tenant_config_loader.py` read from `get_app_config()`
- stop reparsing YAML in `services_config.py`
- stop service modules from depending on legacy local config modules for behavior

Result:

- one parsing path
- fewer divergence risks

### Phase 3: move hidden defaults out of business logic

- remove hardcoded fallback language lists from query/indexer modules
- require tenant defaults to come from config schema only
- remove duplicate behavior defaults from service code

Result:

- behavior becomes visible and reviewable

### Phase 4: clean service startup configuration

- make startup scripts ask the unified loader for resolved values
- keep only bind host/port and secret injection in shell env
- retire or reduce `embeddings/config.py` and `reranker/config.py`

Result:

- startup behavior matches runtime config model

### Phase 5: split config files by responsibility

- keep a single root loader
- split current giant `config.yaml` into:
  - `base.yaml`
  - `environments/<env>.yaml`
  - `tenants/*.yaml`
  - `dictionaries/query_rewrite.dict`

Result:

- config remains unified logically, but is easier to read and maintain physically

### Phase 6: deprecate legacy compatibility

- deprecate `translate_to_en` and `translate_to_zh`
- deprecate env-based backend/provider selection except for explicitly approved keys
- remove old code paths after one or two release cycles

Result:

- the system becomes simpler instead of carrying two generations forever

## 7. Concrete Rules To Adopt

These rules should be documented and enforced in code review.

### Rule 1

Only `config/loader.py` may load config files or `.env`.

### Rule 2

Only `config/loader.py` may read `os.getenv()` for application config.

### Rule 3

Business modules receive typed config objects and do not read files or env directly.
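Rules 1 to 3 are easier to keep if a check runs in CI instead of relying on reviewers alone. The sketch below is one possible approach using the standard `ast` module; `find_env_reads` is a hypothetical helper, and a real check would walk the repository and skip `config/loader.py`.

```python
import ast


def find_env_reads(source: str) -> list[int]:
    """Return line numbers where the source touches os.getenv or os.environ."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and node.attr in {"getenv", "environ"}
            and isinstance(node.value, ast.Name)
            and node.value.id == "os"
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

This only catches the `os.getenv(...)` and `os.environ[...]` spellings. Forms such as `from os import getenv` or an aliased `import os as _os` would need extra handling, so treat the check as a guardrail rather than a proof.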

### Rule 4

Each config key has one owner.

Examples:

- `search.query.knn_boost` belongs to search behavior config
- `services.embedding.backend` belongs to service implementation config
- `infrastructure.redis.password` belongs to env/secrets

### Rule 5

Every fallback must be either:

- declared in schema defaults, or
- rejected at startup

No hidden fallback in runtime logic.

### Rule 6

Every configuration asset must be visible in one of these places only:

- config file
- env var
- generated runtime metadata

Not inside parser code as an implicit constant.

## 8. Recommended Naming Conventions

Suggested conventions:

- config keys use noun-based hierarchical names
- avoid mixing transport and implementation concepts in one field
- use `endpoint` for caller-facing addresses
- use `backend` for service-internal implementation choice
- use `enabled` only for true feature toggles
- use `default_*` only when a real selection happens at runtime

Examples:

- good: `services.rerank.endpoint`
- good: `services.rerank.backend`
- good: `tenants.default.index_languages`
- avoid: `service_url`, `base_url`, `provider`, `backend`, and script env all meaning slightly different things without a common model

## 9. Highest-Priority Cleanup Items

If the team wants the shortest path to improvement, start here:

1. build one root `AppConfig`
2. make `services_config.py` stop reparsing YAML
3. declare rewrite dictionary path explicitly and fix the current mismatch
4. remove hardcoded `["en", "zh"]` fallbacks from query/indexer logic
5. replace `/admin/config` with an effective-config endpoint
6. retire `embeddings/config.py` and `reranker/config.py` as behavior sources
7. deprecate legacy tenant translation flags

## 10. Expected Outcome

After the redesign:

- developers can answer “where does this setting come from?” in one step
- operators can see effective config without reading source code
- backend, indexer, translator, embedding, and reranker all share one model
- tenant behavior is explicit instead of partially implicit
- migration becomes safer because defaults and precedence are centralized
- adding a new provider/backend becomes configuration extension, not configuration archaeology

## 11. Summary

The current system has the right intent but not yet the right implementation shape.

Today the main problems are:

- duplicate config loaders
- inconsistent precedence
- duplicated defaults
- config hidden in runtime logic
- weak effective-config visibility
- leftover legacy concepts

The recommended direction is:

- one root typed config
- one loader pipeline
- explicit layered sources
- narrow env responsibility
- no hidden business fallbacks
- observable effective config

That design is practical to implement incrementally in this repository and aligns well with the project's multi-tenant, multi-service, provider/backend-based architecture.