

Configuration System Review And Redesign
1. Goal
This document reviews the current configuration system and proposes a practical redesign for long-term maintainability.

The target is a configuration system that is:


unified in loading and ownership
clear in boundaries and precedence
visible in effective behavior
easy to evolve across development, deployment, and operations


This review is based on the current implementation, not just on the intended architecture described in the docs.
2. Project Context
The repo already defines the right architectural direction:


config/config.yaml should be the main configuration source for search behavior and service wiring
.env should mainly carry deployment-specific values and secrets
provider/backend expansion should stay centralized instead of spreading through business code


That direction is described in:


<code>README.md</code>
<code>docs/DEVELOPER_GUIDE.md</code>
<code>docs/QUICKSTART.md</code>
<code>translation/README.md</code>


The problem is not the architectural intent. The problem is that the current implementation only partially follows it.
3. Current-State Review
3.1 What exists today
The current system effectively has several configuration channels:


config/config.yaml: search behavior, rerank behavior, services registry, tenant config
config/config_loader.py: parses search behavior and tenant config into SearchConfig; also injects some defaults from code
config/services_config.py: independently reparses config/config.yaml; resolves translation, embedding, and rerank service config; also applies env overrides
config/env_config.py: loads .env; defines ES, Redis, DB, host/port, service URLs, namespace, and model path defaults
service-local config modules: <code>embeddings/config.py</code>, <code>reranker/config.py</code>
startup scripts: derive defaults from shell env, Python config, and YAML in different combinations
inline fallbacks in business logic: query parsing, indexing, service startup


3.2 Main findings
Finding A: there is no single loader for the full effective configuration
ConfigLoader and services_config both parse config/config.yaml, but they do so separately and with different responsibilities.


<code>config/config_loader.py</code>
<code>config/services_config.py</code>


Impact:


the same file is loaded twice through different code paths
search config and services config can drift in interpretation
alternative config paths are hard to support cleanly
tests and tools cannot ask one place for the full effective config tree

Finding B: precedence is not explicit, stable, or globally enforced
Current precedence differs by subsystem:


search behavior mostly comes from YAML plus code defaults
embedding and rerank allow env overrides for provider/backend/url
translation intentionally blocks some env overrides
startup scripts still choose host/port and mode via env
some values are reconstructed from other env vars


Examples:


env override for embedding provider/url/backend:


<code>config/services_config.py</code>

host/port and service URL reconstruction:


<code>config/env_config.py</code>

translator host/port still driven by startup env:


<code>scripts/start_translator.sh</code>


Impact:


operators cannot reliably predict the effective configuration by reading one file
the same setting category behaves differently across services
incidents become harder to debug because source-of-truth depends on the code path

Finding C: defaults are duplicated across YAML and code
There are several layers of default values:


dataclass defaults in QueryConfig
fallback defaults in ConfigLoader._parse_config
defaults in config.yaml
defaults in env_config.py
defaults in embeddings/config.py
defaults in reranker/config.py
defaults in startup scripts


Examples:


query defaults duplicated in dataclass and parser:


<code>config/config_loader.py</code>

embedding defaults duplicated in YAML, services_config, embeddings/config.py, and startup script:


<code>config/config.yaml</code>
<code>embeddings/config.py</code>
<code>scripts/start_embedding_service.sh</code>

reranker defaults duplicated in YAML and reranker/config.py:


<code>config/config.yaml</code>
<code>reranker/config.py</code>


Impact:


changing a default is risky because there may be multiple hidden copies
code review cannot easily tell whether a value is authoritative or dead legacy
“same config” may behave differently across processes

Finding D: config is still embedded in runtime logic
Some important behavior remains encoded as inline fallback logic rather than declared config.

Examples:


query-time translation target languages fall back to ["en", "zh"]:


<code>query/query_parser.py</code>

indexer text handling and LLM enrichment also fall back to ["en", "zh"]:


<code>indexer/document_transformer.py</code>


Impact:


configuration is not fully visible in config files
behavior can silently change when tenant config is missing or malformed
“default behavior” is spread across business modules

Finding E: some configuration assets are not managed as first-class config
Query rewrite is configured through an external file, but the file path is hardcoded and currently inconsistent with the repository content.


loader expects:


<code>config/config_loader.py</code>

repo currently contains:


<code>config/query_rewrite.dict</code>


There is also an admin API that mutates rewrite rules in memory only:


<code>api/routes/admin.py</code>
<code>query/query_parser.py</code>


Impact:


rewrite rules are neither cleanly file-backed nor fully runtime-managed
restart behavior is unclear
configuration visibility and persistence are weak

Finding F: visibility is limited
The system exposes only a small sanitized subset at /admin/config.


<code>api/routes/admin.py</code>


At the same time, the true effective config includes:


tenant overlays
env overrides
service backend selections
script-selected modes
hidden defaults in code


Impact:


there is no authoritative “effective config” view
debugging configuration mismatches requires source reading
operators cannot easily verify what each process actually started with

Finding G: the indexer does not really consume the unified config as a first-class dependency
The indexer startup code explicitly states that config is loaded only for parity/logging and that routes do not depend on it.


<code>api/indexer_app.py</code>


Impact:


configuration is not truly system-wide
search-side and indexer-side behavior can drift
the current “unified config” is only partially unified

Finding H: docs still carry legacy and mixed mental models
Most high-level docs describe the desired centralized model, but some implementation code and docs still expose legacy concepts such as translate_to_en and translate_to_zh.


desired model:


<code>README.md</code>
<code>docs/DEVELOPER_GUIDE.md</code>
<code>translation/README.md</code>

legacy tenant translation flags still documented:


<code>indexer/README.md</code>


Impact:


new developers may follow old mental models
cleanup work keeps getting deferred because the old and new systems both appear “supported”

4. Design Principles For The Redesign
The redesign should follow these rules.
4.1 One logical configuration system
It is acceptable to have multiple files, but not multiple loaders with overlapping ownership.

There must be one loader pipeline that produces one typed AppConfig.
4.2 Configuration files declare, parser code interprets, env provides runtime injection
Responsibilities should be:


configuration files: declare non-secret desired behavior and non-secret deployable settings
parsing logic: loads, merges, validates, normalizes, and exposes typed config; never invents hidden business behavior
environment variables: carry secrets and a small set of runtime/process values; do not casually redefine business behavior


4.3 One precedence rule for the whole system
Every config category should follow the same merge model unless explicitly exempted.
4.4 No silent implicit fallback for business behavior
Fail fast at startup when required config is missing or invalid.

Do not silently fall back to legacy behavior such as hardcoded language lists.
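As an illustration of the fail-fast rule, the sketch below validates a tenant's language list at startup and raises instead of substituting a default. The names `ConfigError` and `require_languages` are hypothetical, and the `index_languages` key is borrowed from the tenant fields recommended later in this document; the real schema may differ.

```python
class ConfigError(ValueError):
    """Raised at startup when required configuration is missing or invalid."""


def require_languages(tenant_id: str, tenant_cfg: dict) -> list:
    """Return the tenant's index languages, or fail; never substitute a default."""
    languages = tenant_cfg.get("index_languages")
    if not isinstance(languages, list) or not languages:
        raise ConfigError(
            f"tenant {tenant_id!r}: 'index_languages' must be a non-empty list"
        )
    if not all(isinstance(lang, str) and lang for lang in languages):
        raise ConfigError(
            f"tenant {tenant_id!r}: 'index_languages' must contain non-empty strings"
        )
    return list(languages)
```

With a check like this in the loader, a missing or malformed tenant entry stops the process at startup rather than silently producing a hardcoded list at query time.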
4.5 Effective configuration must be observable
Every service should be able to show:


config version or hash
source files loaded
environment name
sanitized effective configuration

5. Recommended Target Design
5.1 Boundary model
Use three clear layers.
Layer 1: repository-managed static config
Purpose:


search behavior
tenant behavior
provider/backend registry
non-secret service topology defaults
feature switches


Examples:


field boosts
query strategy
rerank fusion parameters
tenant language plans
translation capability registry
embedding backend selection default

Layer 2: environment-specific overlays
Purpose:


per-environment non-secret differences
service endpoints by environment
resource sizing defaults by environment
dev/test/prod operational differences


Examples:


local embedding URL vs production URL
dev rerank backend vs prod rerank backend
lower concurrency in local development

Layer 3: environment variables
Purpose:


secrets
bind host/port
external infrastructure credentials
container-orchestrator last-mile injection


Examples:


ES_HOST, ES_USERNAME, ES_PASSWORD
DB_HOST, DB_USERNAME, DB_PASSWORD
REDIS_HOST, REDIS_PASSWORD
DASHSCOPE_API_KEY, DEEPL_AUTH_KEY
API_HOST, API_PORT, INDEXER_PORT, TRANSLATION_PORT


Rule:


environment variables should not be the normal path for choosing business behavior such as translation model, embedding backend, or tenant language policy
if an env override is allowed for a non-secret field, it must be explicitly listed and documented as an operational override, not a hidden convention
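One way to make the allowed overrides explicit is a single allow-list that maps each permitted variable to the config key it may set. The sketch below is an assumption about how that could look: the variable names come from the examples above, while the dotted config paths and the function name `collect_env_overrides` are hypothetical.

```python
import os
from typing import Mapping, Optional

# Hypothetical allow-list: env var name -> dotted config path it may set.
ALLOWED_ENV_OVERRIDES = {
    "API_HOST": "runtime.api.host",
    "API_PORT": "runtime.api.port",
    "ES_PASSWORD": "infrastructure.es.password",
}


def collect_env_overrides(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return {config_path: raw_value} for allow-listed variables that are set."""
    env = os.environ if environ is None else environ
    return {
        path: env[name]
        for name, path in ALLOWED_ENV_OVERRIDES.items()
        if name in env
    }
```

Any variable that is not in the table is ignored, so an env var cannot change business behavior unless someone adds it to the list in a reviewable change.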

5.2 Unified precedence
Recommended precedence, from lowest to highest (each later source overrides the ones before it):


schema defaults in code
config/base.yaml
config/environments/<env>.yaml
tenant overlay from config/tenants/
environment variables for the explicitly allowed runtime keys
CLI flags for the current process only


Important rule:


only one module may implement this merge logic
no business module may call os.getenv() directly for configuration
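The merge itself can be small. The following is a minimal sketch, assuming each source has already been parsed into a plain dict; `deep_merge` and `merge_layers` are hypothetical names. Replacing lists wholesale instead of concatenating them is a design choice made for this sketch, not something the current code is known to do.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced as a whole by the higher layer.
            merged[key] = deepcopy(value)
    return merged


def merge_layers(*layers: dict) -> dict:
    """Merge layers given in order from lowest to highest precedence."""
    result: Any = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result
```

A caller would pass the layers in precedence order, for example `merge_layers(schema_defaults, base_yaml, env_yaml, env_overrides, cli_flags)`, where each argument is the dict produced from that source.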

5.3 Recommended directory structure
config/
  schema.py
  loader.py
  sources.py
  base.yaml
  environments/
    dev.yaml
    test.yaml
    prod.yaml
  tenants/
    _default.yaml
    1.yaml
    162.yaml
    170.yaml
  dictionaries/
    query_rewrite.dict
  README.md
.env.example


Notes:


base.yaml contains shared defaults and feature behavior
environments/*.yaml contains environment-specific non-secret overrides
tenants/*.yaml contains tenant-specific overrides only
dictionaries/ stores first-class config assets such as rewrite dictionaries
schema.py defines the typed config model
loader.py is the only entry point that loads and merges config


If the team prefers fewer files, a single tenants.yaml is also acceptable. The key requirement is not “one file”, but “one loading model with clear ownership”.
5.4 Typed configuration model
Introduce one root object, for example:
class AppConfig(BaseModel):
    runtime: RuntimeConfig
    infrastructure: InfrastructureConfig
    search: SearchConfig
    services: ServicesConfig
    tenants: TenantCatalogConfig
    assets: ConfigAssets


Suggested subtrees:


runtime: environment name, config revision/hash, bind addresses/ports
infrastructure: ES, DB, Redis, index namespace
search: field boosts, query config, function score, rerank behavior, spu config
services: translation, embedding, rerank
tenants: default tenant config, tenant overrides
assets: rewrite dictionary path


Benefits:


one validated object shared by backend, indexer, translator, embedding, reranker
one place for defaults
one place for schema evolution
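To make the shape concrete, here is a reduced sketch of such a tree. It uses standard-library frozen dataclasses as a stand-in for the `BaseModel` classes in the example above so that it runs without extra dependencies. Only two subtrees are shown, and the field names and `build_app_config` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankServiceConfig:
    endpoint: str  # how callers reach the service
    backend: str   # how the service itself is implemented


@dataclass(frozen=True)
class ServicesConfig:
    rerank: RerankServiceConfig


@dataclass(frozen=True)
class RuntimeConfig:
    environment: str
    config_hash: str = ""


@dataclass(frozen=True)
class AppConfig:
    runtime: RuntimeConfig
    services: ServicesConfig


def build_app_config(raw: dict) -> AppConfig:
    """Construct the typed tree; a missing key raises KeyError at startup."""
    rerank = raw["services"]["rerank"]
    return AppConfig(
        runtime=RuntimeConfig(environment=raw["runtime"]["environment"]),
        services=ServicesConfig(
            rerank=RerankServiceConfig(
                endpoint=rerank["endpoint"], backend=rerank["backend"]
            )
        ),
    )
```

Because every class is frozen, a service that receives the object cannot mutate it, and an attempt such as `cfg.runtime.environment = "prod"` raises `dataclasses.FrozenInstanceError`.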

5.5 Loading flow
Recommended loading flow:


determine the environment name from APP_ENV (or RUNTIME_ENV)
load schema defaults
load config/base.yaml
load config/environments/<env>.yaml if present
load tenant files
inject first-class assets such as rewrite dictionary
apply allowed env overrides
validate the final AppConfig
freeze and cache the config object
expose a sanitized effective-config view


Important:


every process should call the same loader
services should receive a resolved AppConfig, not re-open YAML independently
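The "freeze and cache" step of the flow can be sketched as follows. `get_app_config` matches the entry point name proposed later in the refactor plan; `freeze` and `_load_raw_config` are hypothetical helpers, and the returned dict is placeholder data standing in for the merged and validated result of the earlier steps.

```python
from functools import lru_cache
from types import MappingProxyType
from typing import Any


def freeze(value: Any) -> Any:
    """Recursively turn dicts into read-only mappings and lists into tuples."""
    if isinstance(value, dict):
        return MappingProxyType({key: freeze(item) for key, item in value.items()})
    if isinstance(value, list):
        return tuple(freeze(item) for item in value)
    return value


def _load_raw_config() -> dict:
    """Placeholder for the earlier steps: read sources, merge, validate."""
    return {"runtime": {"environment": "dev"}, "search": {"languages": ["en"]}}


@lru_cache(maxsize=1)
def get_app_config() -> Any:
    """Load once per process; every later caller gets the same frozen object."""
    return freeze(_load_raw_config())
```

Caching at the entry point means YAML is opened once per process, and the read-only view makes accidental mutation by a service raise `TypeError` instead of silently diverging from what was loaded.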

5.6 Clear responsibility split
Configuration files are responsible for

what the system should do
what providers/backends are available
which features are enabled
tenant language/index policies
non-secret service topology

Parser/loader code is responsible for

locating sources
merge precedence
type validation
normalization
deprecation warnings
producing the final immutable config object

Environment variables are responsible for

secrets
bind addresses/ports
infrastructure endpoints when the deployment platform injects them
a very small set of documented operational overrides

Business code is not responsible for

inventing defaults for missing config
loading YAML directly
calling os.getenv() for normal application behavior

5.7 How to handle service config
Unify all service-facing config under one structure:
services:
  translation:
    endpoint: "http://translator:6006"
    timeout_sec: 10
    default_model: "llm"
    default_scene: "general"
    capabilities: ...
  embedding:
    endpoint:
      text: "http://embedding:6005"
      image: "http://embedding-image:6008"
    backend: "tei"
    backends: ...
  rerank:
    endpoint: "http://reranker:6007/rerank"
    backend: "qwen3_vllm"
    backends: ...


Rules:


endpoint is how callers reach the service
backend is how the service itself is implemented
only the service process cares about backend
only callers care about endpoint
both still belong to the same config tree, because they are part of one system

5.8 How to handle tenant config
Tenant config should become explicit policy, not translation-era leftovers.

Recommended tenant fields:


primary_language
index_languages
search_languages
translation_policy
facet_policy
optional tenant-specific ranking overrides


Avoid keeping translate_to_en and translate_to_zh as active concepts in the long-term model.

If compatibility is needed, support them only in the loader as deprecated aliases and emit warnings.
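A loader-side alias could look like the sketch below. The mapping is an assumption made for illustration: it treats a truthy `translate_to_en` or `translate_to_zh` as "add that language to `index_languages`", which may not match the flags' actual semantics in the current code. `normalize_tenant` is a hypothetical name.

```python
import warnings

# Assumed mapping from legacy flag to the language it implies.
LEGACY_FLAGS = {"translate_to_en": "en", "translate_to_zh": "zh"}


def normalize_tenant(raw: dict) -> dict:
    """Translate deprecated flags into index_languages and warn about them."""
    cfg = dict(raw)
    used = [key for key in LEGACY_FLAGS if key in cfg]
    if not used:
        return cfg
    languages = list(cfg.get("index_languages", []))
    for key in used:
        enabled = cfg.pop(key)  # legacy keys never reach business code
        if enabled and LEGACY_FLAGS[key] not in languages:
            languages.append(LEGACY_FLAGS[key])
    cfg["index_languages"] = languages
    warnings.warn(
        f"deprecated tenant keys in use: {', '.join(used)}; use index_languages",
        DeprecationWarning,
        stacklevel=2,
    )
    return cfg
```

Keeping the translation in the loader means the rest of the system only ever sees the new fields, and the list of keys that triggered the warning can be collected and reported as "deprecated keys in use".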
5.9 How to handle rewrite rules and similar assets
Treat them as declared config assets.

Recommended rules:


file path declared in config
one canonical location under config/dictionaries/
loader validates presence and format
admin runtime updates are either removed or written back through a controlled persistence path


Do not keep a hybrid model where startup loads one file and admin mutates only in memory.
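The "loader validates presence and format" rule can be sketched as below. The file format is an assumption: this example treats the dictionary as UTF-8 text with one `source<TAB>target` rule per line and `#` comments, which may not match the real `query_rewrite.dict`. `load_rewrite_rules` is a hypothetical name.

```python
from pathlib import Path


def load_rewrite_rules(path: Path) -> dict:
    """Parse the dictionary or raise; assumes one 'source<TAB>target' per line."""
    if not path.is_file():
        raise FileNotFoundError(f"rewrite dictionary not found: {path}")
    rules = {}
    lines = path.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, 1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(f"{path}:{lineno}: expected 'source<TAB>target'")
        rules[parts[0].strip()] = parts[1].strip()
    return rules
```

Run at startup with the path declared in config, a check like this turns a path mismatch of the kind described in Finding E into an immediate, explicit error.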
5.10 Observability improvements
Add the following:


config dump CLI that prints sanitized effective config
startup log with config hash, environment, and config file list
/admin/config/effective endpoint returning sanitized effective config
/admin/config/meta endpoint returning the environment, config hash, loaded source files, and deprecated keys in use


This is important for operations and for multi-service debugging.
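Both the sanitized view and the config hash can be derived from the merged config dict. The sketch below is one possible approach; the marker list, the `***` mask, and the 12-character hash length are arbitrary choices for illustration, and `sanitize` and `config_hash` are hypothetical names.

```python
import hashlib
import json
from typing import Any

# Hypothetical substrings that mark a key as secret.
SECRET_MARKERS = ("password", "secret", "token", "api_key", "auth_key")


def _is_secret(key: str) -> bool:
    return any(marker in key.lower() for marker in SECRET_MARKERS)


def sanitize(value: Any) -> Any:
    """Return a copy with the values of secret-looking keys replaced by '***'."""
    if isinstance(value, dict):
        return {
            key: "***" if _is_secret(key) else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


def config_hash(config: dict) -> str:
    """Short stable hash of a JSON-serializable config; key order is ignored."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Logging `config_hash(...)` at startup and returning `sanitize(...)` from the effective-config endpoint lets operators compare two processes by hash first and inspect the full tree only when the hashes differ.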
6. Practical Refactor Plan
The refactor should be incremental.
Phase 1: establish the new config core without changing behavior

create config/schema.py
create config/loader.py
move all current defaults into schema models
make loader read current config/config.yaml
make loader read .env only for approved keys
expose one get_app_config()


Result:


same behavior, but one typed root config becomes available

Phase 2: remove duplicate readers

make services_config.py a thin adapter over get_app_config()
make tenant_config_loader.py read from get_app_config()
stop reparsing YAML in services_config.py
stop service modules from depending on legacy local config modules for behavior


Result:


one parsing path
fewer divergence risks

Phase 3: move hidden defaults out of business logic

remove hardcoded fallback language lists from query/indexer modules
require tenant defaults to come from config schema only
remove duplicate behavior defaults from service code


Result:


behavior becomes visible and reviewable

Phase 4: clean service startup configuration

make startup scripts ask the unified loader for resolved values
keep only bind host/port and secret injection in shell env
retire or reduce embeddings/config.py and reranker/config.py


Result:


startup behavior matches runtime config model

Phase 5: split config files by responsibility

keep a single root loader
split current giant config.yaml into:


base.yaml
environments/<env>.yaml
tenants/*.yaml
dictionaries/query_rewrite.dict


Result:


config remains unified logically, but is easier to read and maintain physically

Phase 6: deprecate legacy compatibility

deprecate translate_to_en and translate_to_zh
deprecate env-based backend/provider selection except for explicitly approved keys
remove old code paths after one or two release cycles


Result:


the system becomes simpler instead of carrying two generations forever

7. Concrete Rules To Adopt
These rules should be documented and enforced in code review.
Rule 1
Only config/loader.py may load config files or .env.
Rule 2
Only config/loader.py may read os.getenv() for application config.
Rule 3
Business modules receive typed config objects and do not read files or env directly.
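Rules 1 to 3 are easier to keep if a check can flag violations automatically. The sketch below uses the standard `ast` module to find `os.getenv` and `os.environ` references in a source string; `find_env_reads` is a hypothetical helper, and it deliberately does not catch aliased forms such as `from os import getenv`.

```python
import ast


def find_env_reads(source: str) -> list:
    """Return sorted line numbers where os.getenv or os.environ is referenced."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "os"
            and node.attr in {"getenv", "environ"}
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

A CI step could run this over every module except config/loader.py and fail the build when the returned list is non-empty.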
Rule 4
Each config key has one owner.

Examples:


search.query.knn_boost belongs to search behavior config
services.embedding.backend belongs to service implementation config
infrastructure.redis.password belongs to env/secrets

Rule 5
Every fallback must be either:


declared in schema defaults, or
rejected at startup


No hidden fallback in runtime logic.
Rule 6
Every configuration asset must be visible in one of these places only:


config file
env var
generated runtime metadata


Not inside parser code as an implicit constant.
8. Recommended Naming Conventions
Suggested conventions:


config keys use noun-based hierarchical names
avoid mixing transport and implementation concepts in one field
use endpoint for caller-facing addresses
use backend for service-internal implementation choice
use enabled only for true feature toggles
use default_* only when a real selection happens at runtime


Examples:


good: services.rerank.endpoint
good: services.rerank.backend
good: tenants.default.index_languages
avoid: service_url, base_url, provider, backend, and script env all meaning slightly different things without a common model

9. Highest-Priority Cleanup Items
If the team wants the shortest path to improvement, start here:


build one root AppConfig
make services_config.py stop reparsing YAML
declare rewrite dictionary path explicitly and fix the current mismatch
remove hardcoded ["en", "zh"] fallbacks from query/indexer logic
replace /admin/config with an effective-config endpoint
retire embeddings/config.py and reranker/config.py as behavior sources
deprecate legacy tenant translation flags

10. Expected Outcome
After the redesign:


developers can answer “where does this setting come from?” in one step
operators can see effective config without reading source code
backend, indexer, translator, embedding, and reranker all share one model
tenant behavior is explicit instead of partially implicit
migration becomes safer because defaults and precedence are centralized
adding a new provider/backend becomes configuration extension, not configuration archaeology

11. Summary
The current system has the right intent but not yet the right implementation shape.

Today the main problems are:


duplicate config loaders
inconsistent precedence
duplicated defaults
config hidden in runtime logic
weak effective-config visibility
leftover legacy concepts


The recommended direction is:


one root typed config
one loader pipeline
explicit layered sources
narrow env responsibility
no hidden business fallbacks
observable effective config


That design is practical to implement incrementally in this repository and aligns well with the project's multi-tenant, multi-service, provider/backend-based architecture.