Commit 14e67b717157040d906b1e33098f59e1f9d66ba3 (1 parent: 294c3d0a)
Post-segmentation batching now works as "segment all inputs first, then run inference over the total segment count in model `batch_size` chunks", instead of chunking by the original number of input items. In other words, if 100 requests expand to 150 segments after sentence splitting, then with batch_size=64 inference runs as three batches of 64 + 64 + 22; after inference, results are merged according to the original segmentation plan and restored to 100 outputs. This change is in local_seq2seq.py (line 241) and local_ctranslate2.py (line 391).

Logging also gains the two levels of key information requested:

- Segmentation summary log: `Translation segmentation summary` prints the input count, non-empty count, number of inputs that were split, total segment count, current batch_size, and per-input segment-count statistics; see local_seq2seq.py (line 216) and local_ctranslate2.py (line 366).
- Per-inference-batch log: `Translation inference batch` prints the batch index, total batch count, the segment count of that batch, length statistics, and a preview of the first item. The CTranslate2 backend additionally prints `Translation model batch detail` with token lengths and max_decoding_length; see local_ctranslate2.py (line 294).

Tests were also added covering both "batching after segmentation" and "logs contain the segmentation summary and per-batch inference entries", in test_translation_local_backends.py (line 358).
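The segment-first flow described above can be sketched in isolation. This is a minimal illustration, not the repo's actual implementation: `split_fn` and `translate_batch_fn` are hypothetical stand-ins for the real splitter and backend call.

```python
def translate_segment_first(texts, split_fn, translate_batch_fn, batch_size):
    """Segment everything first, batch by total segment count, merge back."""
    # 1) Build the segmentation plan for every input up front.
    plans = [split_fn(text) for text in texts]
    flat = [seg for plan in plans for seg in plan]

    # 2) Run inference over the flattened segments in batch_size chunks,
    #    regardless of which original input each segment came from.
    outputs = []
    for start in range(0, len(flat), batch_size):
        outputs.extend(translate_batch_fn(flat[start:start + batch_size]))

    # 3) Restore one result per original input by consuming the plan.
    results, cursor = [], 0
    for plan in plans:
        pieces = outputs[cursor:cursor + len(plan)]
        cursor += len(plan)
        results.append("".join(pieces))
    return results
```

With 100 inputs that split into 150 segments and batch_size=64, step 2 issues batches of 64 + 64 + 22, and step 3 still returns exactly 100 results.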
Showing 10 changed files with 610 additions and 113 deletions
CLAUDE.md
| ... | ... | @@ -23,12 +23,24 @@ This is a **production-ready Multi-Tenant E-Commerce Search SaaS** platform spec |
| 23 | 23 | |
| 24 | 24 | ## Development Environment |
| 25 | 25 | |
| 26 | -**Required Environment Setup:** Use project root `activate.sh` (activates conda env `searchengine` and loads `.env`). On a new machine, set `CONDA_ROOT` if conda is not at default path. | |
| 26 | +**Required Environment Setup:** Default to the project venv via root `activate.sh` (activates `./.venv` and loads `.env`). Do not default to system `python3`/`pip3` for repo work. | |
| 27 | 27 | ```bash |
| 28 | -# Optional on new machine: if conda is ~/anaconda3/bin/conda → export CONDA_ROOT=$HOME/anaconda3 | |
| 29 | 28 | source activate.sh |
| 30 | 29 | ``` |
| 31 | -See `docs/QUICKSTART.md` §1.4–1.8 for first-time env creation and production credentials (venv: `./scripts/create_venv.sh`; conda: `conda env create -f environment.yml` or `pip install -r requirements.txt`). | |
| 30 | +See `docs/QUICKSTART.md` §1.4–1.8 for first-time env creation and production credentials (`./scripts/create_venv.sh` for the main venv). | |
| 31 | + | |
| 32 | +**Environment Resolution Rules:** | |
| 33 | +- Main app, backend, indexer, frontend, generic scripts, and most tests: `./.venv` | |
| 34 | +- Translator service runtime and local translation model tooling: `./.venv-translator` | |
| 35 | +- Embedding service runtime: `./.venv-embedding` | |
| 36 | +- Reranker service runtime: `./.venv-reranker` | |
| 37 | +- CN-CLIP service runtime: `./.venv-cnclip` | |
| 38 | +- Never assume the system interpreter reflects project dependencies; prefer `source activate.sh` or invoke the exact venv binary directly. | |
| 39 | + | |
| 40 | +**Operational Rule For Commands:** | |
| 41 | +- For repo-wide `pytest`, ad hoc Python scripts, and lightweight development commands, use the matching venv interpreter first. | |
| 42 | +- For isolated services, prefer the service scripts (`./scripts/start_translator.sh`, `./scripts/start_embedding_service.sh`, `./scripts/start_reranker.sh`, `./scripts/start_cnclip_service.sh`) because they already select the correct environment. | |
| 43 | +- If a dependency appears “missing”, check whether the command was run under the wrong venv before proposing installs. | |
| 32 | 44 | |
| 33 | 45 | **Database Configuration:** |
| 34 | 46 | ```yaml |
| ... | ... | @@ -49,12 +61,14 @@ password: P89cZHS5d7dFyc9R |
| 49 | 61 | |
| 50 | 62 | ### Environment Setup |
| 51 | 63 | ```bash |
| 52 | -# Activate environment (canonical: use activate.sh) | |
| 64 | +# Activate main project environment (canonical) | |
| 53 | 65 | source activate.sh |
| 54 | 66 | |
| 55 | 67 | # First-time / new machine: create env and install deps |
| 56 | -./setup.sh # or: conda env create -f environment.yml | |
| 57 | -# If pip-only: pip install -r requirements.txt | |
| 68 | +./setup.sh | |
| 69 | +# or: | |
| 70 | +./scripts/create_venv.sh | |
| 71 | +source activate.sh | |
| 58 | 72 | ``` |
| 59 | 73 | |
| 60 | 74 | ### Data Management |
| ... | ... | @@ -83,12 +97,12 @@ python main.py serve --host 0.0.0.0 --port 6002 --reload |
| 83 | 97 | ### Testing |
| 84 | 98 | ```bash |
| 85 | 99 | # Run all tests |
| 86 | -python -m pytest tests/ | |
| 100 | +pytest tests/ | |
| 87 | 101 | |
| 88 | 102 | # Run specific test types |
| 89 | -python -m pytest tests/unit/ # Unit tests | |
| 90 | -python -m pytest tests/integration/ # Integration tests | |
| 91 | -python -m pytest -m "api" # API tests only | |
| 103 | +pytest tests/unit/ # Unit tests | |
| 104 | +pytest tests/integration/ # Integration tests | |
| 105 | +pytest -m "api" # API tests only | |
| 92 | 106 | |
| 93 | 107 | # Test search from command line |
| 94 | 108 | python main.py search "query" --tenant-id 1 --size 10 |
| ... | ... | @@ -602,4 +616,3 @@ python main.py search "query" --tenant-id 1 # Quick search test |
| 602 | 616 | 8. **Multi-tenant Architecture**: Single index with `tenant_id` isolation |
| 603 | 617 | 9. **Hybrid Search**: BM25 + vector similarity with configurable weighting |
| 604 | 618 | 10. **Production Ready**: Health checks, monitoring, graceful degradation |
| 605 | - | ... | ... |
api/translator_app.py
| ... | ... | @@ -5,18 +5,24 @@ import logging |
| 5 | 5 | import os |
| 6 | 6 | import pathlib |
| 7 | 7 | import time |
| 8 | +import uuid | |
| 8 | 9 | from contextlib import asynccontextmanager |
| 9 | 10 | from functools import lru_cache |
| 10 | 11 | from logging.handlers import TimedRotatingFileHandler |
| 11 | 12 | from typing import List, Optional, Union |
| 12 | 13 | |
| 13 | 14 | import uvicorn |
| 14 | -from fastapi import FastAPI, HTTPException | |
| 15 | +from fastapi import FastAPI, HTTPException, Request | |
| 15 | 16 | from fastapi.middleware.cors import CORSMiddleware |
| 16 | 17 | from fastapi.responses import JSONResponse |
| 17 | 18 | from pydantic import BaseModel, ConfigDict, Field |
| 18 | 19 | |
| 19 | 20 | from config.services_config import get_translation_config |
| 21 | +from translation.logging_utils import ( | |
| 22 | + TranslationRequestFilter, | |
| 23 | + bind_translation_request_id, | |
| 24 | + reset_translation_request_id, | |
| 25 | +) | |
| 20 | 26 | from translation.service import TranslationService |
| 21 | 27 | from translation.settings import ( |
| 22 | 28 | get_enabled_translation_models, |
| ... | ... | @@ -33,7 +39,8 @@ def configure_translator_logging() -> None: |
| 33 | 39 | |
| 34 | 40 | log_level = os.getenv("LOG_LEVEL", "INFO").upper() |
| 35 | 41 | numeric_level = getattr(logging, log_level, logging.INFO) |
| 36 | - formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s") | |
| 42 | + formatter = logging.Formatter("%(asctime)s | reqid:%(reqid)s | %(name)s | %(levelname)s | %(message)s") | |
| 43 | + request_filter = TranslationRequestFilter() | |
| 37 | 44 | |
| 38 | 45 | root_logger = logging.getLogger() |
| 39 | 46 | root_logger.setLevel(numeric_level) |
| ... | ... | @@ -42,6 +49,7 @@ def configure_translator_logging() -> None: |
| 42 | 49 | console_handler = logging.StreamHandler() |
| 43 | 50 | console_handler.setLevel(numeric_level) |
| 44 | 51 | console_handler.setFormatter(formatter) |
| 52 | + console_handler.addFilter(request_filter) | |
| 45 | 53 | root_logger.addHandler(console_handler) |
| 46 | 54 | |
| 47 | 55 | file_handler = TimedRotatingFileHandler( |
| ... | ... | @@ -53,6 +61,7 @@ def configure_translator_logging() -> None: |
| 53 | 61 | ) |
| 54 | 62 | file_handler.setLevel(numeric_level) |
| 55 | 63 | file_handler.setFormatter(formatter) |
| 64 | + file_handler.addFilter(request_filter) | |
| 56 | 65 | root_logger.addHandler(file_handler) |
| 57 | 66 | |
| 58 | 67 | verbose_logger = logging.getLogger("translator.verbose") |
| ... | ... | @@ -69,6 +78,7 @@ def configure_translator_logging() -> None: |
| 69 | 78 | ) |
| 70 | 79 | verbose_handler.setLevel(numeric_level) |
| 71 | 80 | verbose_handler.setFormatter(formatter) |
| 81 | + verbose_handler.addFilter(request_filter) | |
| 72 | 82 | verbose_logger.addHandler(verbose_handler) |
| 73 | 83 | |
| 74 | 84 | |
| ... | ... | @@ -178,6 +188,13 @@ def _result_preview(translated: Union[str, List[Optional[str]], None]) -> str: |
| 178 | 188 | return _text_preview(str(translated)) |
| 179 | 189 | |
| 180 | 190 | |
| 191 | +def _resolve_request_id(http_request: Request) -> str: | |
| 192 | + header_value = http_request.headers.get("X-Request-ID") | |
| 193 | + if header_value and header_value.strip(): | |
| 194 | + return header_value.strip()[:32] | |
| 195 | + return str(uuid.uuid4())[:8] | |
| 196 | + | |
| 197 | + | |
| 181 | 198 | def _translate_batch( |
| 182 | 199 | service: TranslationService, |
| 183 | 200 | raw_text: List[str], |
| ... | ... | @@ -189,15 +206,11 @@ def _translate_batch( |
| 189 | 206 | ) -> List[Optional[str]]: |
| 190 | 207 | backend = service.get_backend(model) |
| 191 | 208 | logger.info( |
| 192 | - "Translation batch dispatch | model=%s scene=%s target_lang=%s source_lang=%s count=%s lengths=%s first_preview=%s supports_batch=%s", | |
| 193 | - model, | |
| 194 | - scene, | |
| 195 | - target_lang, | |
| 196 | - source_lang or "auto", | |
| 209 | + "Translation batch dispatch | execution=%s count=%s lengths=%s first_preview=%s", | |
| 210 | + "backend-batch" if getattr(backend, "supports_batch", False) else "per-item", | |
| 197 | 211 | len(raw_text), |
| 198 | 212 | [len(str(item or "")) for item in raw_text], |
| 199 | 213 | _text_preview(raw_text[0] if raw_text else ""), |
| 200 | - bool(getattr(backend, "supports_batch", False)), | |
| 201 | 214 | ) |
| 202 | 215 | if getattr(backend, "supports_batch", False): |
| 203 | 216 | try: |
| ... | ... | @@ -330,12 +343,13 @@ async def health_check(): |
| 330 | 343 | |
| 331 | 344 | |
| 332 | 345 | @app.post("/translate", response_model=TranslationResponse) |
| 333 | -async def translate(request: TranslationRequest): | |
| 346 | +async def translate(request: TranslationRequest, http_request: Request): | |
| 334 | 347 | _ensure_valid_text(request.text) |
| 335 | 348 | |
| 336 | 349 | if not request.target_lang: |
| 337 | 350 | raise HTTPException(status_code=400, detail="target_lang is required") |
| 338 | 351 | |
| 352 | + _, request_token = bind_translation_request_id(_resolve_request_id(http_request)) | |
| 339 | 353 | request_started = time.perf_counter() |
| 340 | 354 | try: |
| 341 | 355 | service = get_translation_service() |
| ... | ... | @@ -447,12 +461,14 @@ async def translate(request: TranslationRequest): |
| 447 | 461 | raise |
| 448 | 462 | except ValueError as e: |
| 449 | 463 | latency_ms = (time.perf_counter() - request_started) * 1000 |
| 450 | - logger.warning("Translation validation error | error=%s latency_ms=%.2f", e, latency_ms, exc_info=True) | |
| 464 | + logger.warning("Translation validation error | error=%s latency_ms=%.2f", e, latency_ms) | |
| 451 | 465 | raise HTTPException(status_code=400, detail=str(e)) from e |
| 452 | 466 | except Exception as e: |
| 453 | 467 | latency_ms = (time.perf_counter() - request_started) * 1000 |
| 454 | 468 | logger.error("Translation error | error=%s latency_ms=%.2f", e, latency_ms, exc_info=True) |
| 455 | 469 | raise HTTPException(status_code=500, detail=f"Translation error: {str(e)}") |
| 470 | + finally: | |
| 471 | + reset_translation_request_id(request_token) | |
| 456 | 472 | |
| 457 | 473 | |
| 458 | 474 | @app.get("/") | ... | ... |
tests/test_translation_local_backends.py
| 1 | +import logging | |
| 2 | + | |
| 3 | +import pytest | |
| 1 | 4 | import torch |
| 2 | 5 | |
| 3 | 6 | from translation.backends.local_seq2seq import MarianMTTranslationBackend, NLLBTranslationBackend |
| 7 | +from translation.backends.local_ctranslate2 import NLLBCTranslate2TranslationBackend | |
| 4 | 8 | from translation.service import TranslationService |
| 5 | 9 | from translation.text_splitter import compute_safe_input_token_limit, split_text_for_translation |
| 6 | 10 | |
| ... | ... | @@ -44,11 +48,59 @@ class _FakeModel: |
| 44 | 48 | return [[42]] |
| 45 | 49 | |
| 46 | 50 | |
| 51 | +class _FakeCT2Tokenizer: | |
| 52 | + def __init__(self, src_lang=None): | |
| 53 | + self.src_lang = src_lang | |
| 54 | + self.pad_token = "</s>" | |
| 55 | + self.eos_token = "</s>" | |
| 56 | + self.last_call = None | |
| 57 | + | |
| 58 | + def __call__(self, texts, **kwargs): | |
| 59 | + self.last_call = {"texts": list(texts), **kwargs} | |
| 60 | + return {"input_ids": [[1, 2, 3] for _ in texts]} | |
| 61 | + | |
| 62 | + def convert_ids_to_tokens(self, ids): | |
| 63 | + del ids | |
| 64 | + return ["tok_a", "tok_b", "tok_c"] | |
| 65 | + | |
| 66 | + def convert_tokens_to_ids(self, tokens): | |
| 67 | + if isinstance(tokens, list): | |
| 68 | + return [1 for _ in tokens] | |
| 69 | + return 1 | |
| 70 | + | |
| 71 | + def decode(self, token_ids, skip_special_tokens=True): | |
| 72 | + del token_ids, skip_special_tokens | |
| 73 | + return "translated" | |
| 74 | + | |
| 75 | + | |
| 76 | +class _FakeCT2Result: | |
| 77 | + def __init__(self, tokens): | |
| 78 | + self.hypotheses = [tokens] | |
| 79 | + | |
| 80 | + | |
| 81 | +class _FakeCT2Translator: | |
| 82 | + def __init__(self): | |
| 83 | + self.last_translate_batch_kwargs = None | |
| 84 | + | |
| 85 | + def translate_batch(self, source_tokens, **kwargs): | |
| 86 | + self.last_translate_batch_kwargs = {"source_tokens": source_tokens, **kwargs} | |
| 87 | + target_prefix = kwargs.get("target_prefix") or [] | |
| 88 | + return [ | |
| 89 | + _FakeCT2Result((target_prefix[idx] or []) + ["translated_token"]) | |
| 90 | + for idx, _ in enumerate(source_tokens) | |
| 91 | + ] | |
| 92 | + | |
| 93 | + | |
| 47 | 94 | def _stub_load_model(self): |
| 48 | 95 | self.tokenizer = _FakeTokenizer() |
| 49 | 96 | self.seq2seq_model = _FakeModel() |
| 50 | 97 | |
| 51 | 98 | |
| 99 | +def _stub_load_ct2_runtime(self): | |
| 100 | + self.tokenizer = _FakeCT2Tokenizer() | |
| 101 | + self.translator = _FakeCT2Translator() | |
| 102 | + | |
| 103 | + | |
| 52 | 104 | def test_marian_language_validation(monkeypatch): |
| 53 | 105 | monkeypatch.setattr(MarianMTTranslationBackend, "_load_model", _stub_load_model) |
| 54 | 106 | backend = MarianMTTranslationBackend( |
| ... | ... | @@ -68,12 +120,8 @@ def test_marian_language_validation(monkeypatch): |
| 68 | 120 | result = backend.translate("测试", source_lang="zh", target_lang="en") |
| 69 | 121 | assert result == "translated" |
| 70 | 122 | |
| 71 | - try: | |
| 123 | + with pytest.raises(ValueError, match="source languages"): | |
| 72 | 124 | backend.translate("test", source_lang="en", target_lang="zh") |
| 73 | - except ValueError as exc: | |
| 74 | - assert "source languages" in str(exc) | |
| 75 | - else: | |
| 76 | - raise AssertionError("Expected unsupported source language to raise") | |
| 77 | 125 | |
| 78 | 126 | |
| 79 | 127 | def test_nllb_uses_src_lang_and_forced_bos(monkeypatch): |
| ... | ... | @@ -97,6 +145,61 @@ def test_nllb_uses_src_lang_and_forced_bos(monkeypatch): |
| 97 | 145 | assert backend.seq2seq_model.last_generate_kwargs["forced_bos_token_id"] == 202 |
| 98 | 146 | |
| 99 | 147 | |
| 148 | +def test_nllb_accepts_finnish_short_code(monkeypatch): | |
| 149 | + monkeypatch.setattr(NLLBTranslationBackend, "_load_model", _stub_load_model) | |
| 150 | + backend = NLLBTranslationBackend( | |
| 151 | + name="nllb-200-distilled-600m", | |
| 152 | + model_id="facebook/nllb-200-distilled-600M", | |
| 153 | + model_dir="./models/translation/facebook/nllb-200-distilled-600M", | |
| 154 | + device="cpu", | |
| 155 | + torch_dtype="float32", | |
| 156 | + batch_size=1, | |
| 157 | + max_input_length=16, | |
| 158 | + max_new_tokens=16, | |
| 159 | + num_beams=1, | |
| 160 | + ) | |
| 161 | + | |
| 162 | + result = backend.translate("test", source_lang="fi", target_lang="zh") | |
| 163 | + | |
| 164 | + assert result == "translated" | |
| 165 | + assert backend.tokenizer.src_lang == "fin_Latn" | |
| 166 | + assert backend.seq2seq_model.last_generate_kwargs["forced_bos_token_id"] == 202 | |
| 167 | + | |
| 168 | + | |
| 169 | +def test_nllb_ctranslate2_accepts_finnish_short_code(monkeypatch): | |
| 170 | + created_tokenizers = [] | |
| 171 | + | |
| 172 | + def _fake_from_pretrained(source, src_lang=None, **kwargs): | |
| 173 | + del source, kwargs | |
| 174 | + tokenizer = _FakeCT2Tokenizer(src_lang=src_lang) | |
| 175 | + created_tokenizers.append(tokenizer) | |
| 176 | + return tokenizer | |
| 177 | + | |
| 178 | + monkeypatch.setattr(NLLBCTranslate2TranslationBackend, "_load_runtime", _stub_load_ct2_runtime) | |
| 179 | + monkeypatch.setattr( | |
| 180 | + "translation.backends.local_ctranslate2.AutoTokenizer.from_pretrained", | |
| 181 | + _fake_from_pretrained, | |
| 182 | + ) | |
| 183 | + backend = NLLBCTranslate2TranslationBackend( | |
| 184 | + name="nllb-200-distilled-600m", | |
| 185 | + model_id="facebook/nllb-200-distilled-600M", | |
| 186 | + model_dir="./models/translation/facebook/nllb-200-distilled-600M", | |
| 187 | + device="cpu", | |
| 188 | + torch_dtype="float32", | |
| 189 | + batch_size=1, | |
| 190 | + max_input_length=16, | |
| 191 | + max_new_tokens=16, | |
| 192 | + num_beams=1, | |
| 193 | + ) | |
| 194 | + | |
| 195 | + result = backend.translate("test", source_lang="fi", target_lang="zh") | |
| 196 | + | |
| 197 | + assert result == "translated" | |
| 198 | + assert len(created_tokenizers) == 1 | |
| 199 | + assert created_tokenizers[0].src_lang == "fin_Latn" | |
| 200 | + assert backend.translator.last_translate_batch_kwargs["target_prefix"] == [["zho_Hans"]] | |
| 201 | + | |
| 202 | + | |
| 100 | 203 | def test_translation_service_preloads_enabled_backends(monkeypatch): |
| 101 | 204 | created = [] |
| 102 | 205 | |
| ... | ... | @@ -245,7 +348,71 @@ def test_local_backend_splits_oversized_text_before_translation(): |
| 245 | 348 | result = backend.translate(text, source_lang="en", target_lang="zh") |
| 246 | 349 | |
| 247 | 350 | assert result is not None |
| 248 | - assert len(backend.translated_batches) == 1 | |
| 249 | - assert len(backend.translated_batches[0]) >= 2 | |
| 250 | - assert all(len(piece) <= 16 for piece in backend.translated_batches[0]) | |
| 251 | - assert result == "".join(f"<{piece.strip()}>" for piece in backend.translated_batches[0]) | |
| 351 | + all_segments = [piece for batch in backend.translated_batches for piece in batch] | |
| 352 | + assert len(all_segments) >= 2 | |
| 353 | + assert all(len(batch) <= backend.batch_size for batch in backend.translated_batches) | |
| 354 | + assert all(len(piece) <= 16 for piece in all_segments) | |
| 355 | + assert result == "".join(f"<{piece.strip()}>" for piece in all_segments) | |
| 356 | + | |
| 357 | + | |
| 358 | +def test_local_backend_batches_after_segmentation(): | |
| 359 | + backend = _SegmentingMarianBackend( | |
| 360 | + name="opus-mt-en-zh", | |
| 361 | + model_id="Helsinki-NLP/opus-mt-en-zh", | |
| 362 | + model_dir="./models/translation/Helsinki-NLP/opus-mt-en-zh", | |
| 363 | + device="cpu", | |
| 364 | + torch_dtype="float32", | |
| 365 | + batch_size=4, | |
| 366 | + max_input_length=24, | |
| 367 | + max_new_tokens=24, | |
| 368 | + num_beams=1, | |
| 369 | + source_langs=["en"], | |
| 370 | + target_langs=["zh"], | |
| 371 | + ) | |
| 372 | + | |
| 373 | + texts = [ | |
| 374 | + "alpha beta gamma delta, epsilon zeta eta theta, iota kappa lambda mu.", | |
| 375 | + "nu xi omicron pi, rho sigma tau upsilon, phi chi psi omega.", | |
| 376 | + "dress shirt coat pants, socks shoes belt scarf, hat gloves bag watch.", | |
| 377 | + ] | |
| 378 | + | |
| 379 | + result = backend.translate(texts, source_lang="en", target_lang="zh") | |
| 380 | + | |
| 381 | + assert isinstance(result, list) | |
| 382 | + assert len(result) == 3 | |
| 383 | + assert len(backend.translated_batches) >= 2 | |
| 384 | + assert all(len(batch) <= backend.batch_size for batch in backend.translated_batches) | |
| 385 | + assert sum(len(batch) for batch in backend.translated_batches) > backend.batch_size | |
| 386 | + assert all(item is not None for item in result) | |
| 387 | + | |
| 388 | + | |
| 389 | +def test_local_backend_logs_segmentation_and_inference_batches(caplog): | |
| 390 | + backend = _SegmentingMarianBackend( | |
| 391 | + name="opus-mt-en-zh", | |
| 392 | + model_id="Helsinki-NLP/opus-mt-en-zh", | |
| 393 | + model_dir="./models/translation/Helsinki-NLP/opus-mt-en-zh", | |
| 394 | + device="cpu", | |
| 395 | + torch_dtype="float32", | |
| 396 | + batch_size=2, | |
| 397 | + max_input_length=24, | |
| 398 | + max_new_tokens=24, | |
| 399 | + num_beams=1, | |
| 400 | + source_langs=["en"], | |
| 401 | + target_langs=["zh"], | |
| 402 | + ) | |
| 403 | + | |
| 404 | + texts = [ | |
| 405 | + "one two three four, five six seven eight, nine ten eleven twelve.", | |
| 406 | + "thirteen fourteen fifteen sixteen, seventeen eighteen nineteen twenty.", | |
| 407 | + ] | |
| 408 | + | |
| 409 | + with caplog.at_level(logging.INFO): | |
| 410 | + backend.translate(texts, source_lang="en", target_lang="zh") | |
| 411 | + | |
| 412 | + messages = [record.getMessage() for record in caplog.records] | |
| 413 | + | |
| 414 | + assert any(message.startswith("Translation segmentation summary |") for message in messages) | |
| 415 | + inference_logs = [ | |
| 416 | + message for message in messages if message.startswith("Translation inference batch |") | |
| 417 | + ] | |
| 418 | + assert len(inference_logs) >= 2 | ... | ... |
tests/test_translator_failure_semantics.py
| 1 | +import logging | |
| 2 | + | |
| 1 | 3 | from translation.cache import TranslationCache |
| 4 | +from translation.logging_utils import ( | |
| 5 | + TranslationRequestFilter, | |
| 6 | + bind_translation_request_id, | |
| 7 | + reset_translation_request_id, | |
| 8 | +) | |
| 2 | 9 | from translation.service import TranslationService |
| 3 | 10 | |
| 4 | 11 | |
| ... | ... | @@ -107,3 +114,80 @@ def test_service_caches_all_capabilities(monkeypatch): |
| 107 | 114 | ("opus-mt-zh-en", "en", "连衣裙", "opus-mt-zh-en:连衣裙"), |
| 108 | 115 | ("opus-mt-zh-en", "en", "衬衫", "opus-mt-zh-en:衬衫"), |
| 109 | 116 | ] |
| 117 | + | |
| 118 | + | |
| 119 | +def test_translation_request_filter_injects_reqid(): | |
| 120 | + reqid, token = bind_translation_request_id("req-test-1234567890") | |
| 121 | + try: | |
| 122 | + record = logging.LogRecord( | |
| 123 | + name="translation.service", | |
| 124 | + level=logging.INFO, | |
| 125 | + pathname=__file__, | |
| 126 | + lineno=1, | |
| 127 | + msg="hello", | |
| 128 | + args=(), | |
| 129 | + exc_info=None, | |
| 130 | + ) | |
| 131 | + TranslationRequestFilter().filter(record) | |
| 132 | + | |
| 133 | + assert reqid == "req-test-1234567890" | |
| 134 | + assert record.reqid == "req-test-1234567890" | |
| 135 | + finally: | |
| 136 | + reset_translation_request_id(token) | |
| 137 | + | |
| 138 | + | |
| 139 | +def test_translation_route_log_focuses_on_routing_decision(monkeypatch, caplog): | |
| 140 | + monkeypatch.setattr(TranslationCache, "_init_redis_client", staticmethod(lambda: None)) | |
| 141 | + | |
| 142 | + def _fake_create_backend(self, *, name, backend_type, cfg): | |
| 143 | + del self, backend_type, cfg | |
| 144 | + | |
| 145 | + class _Backend: | |
| 146 | + model = name | |
| 147 | + | |
| 148 | + @property | |
| 149 | + def supports_batch(self): | |
| 150 | + return True | |
| 151 | + | |
| 152 | + def translate(self, text, target_lang, source_lang=None, scene=None): | |
| 153 | + del target_lang, source_lang, scene | |
| 154 | + return text | |
| 155 | + | |
| 156 | + return _Backend() | |
| 157 | + | |
| 158 | + monkeypatch.setattr(TranslationService, "_create_backend", _fake_create_backend) | |
| 159 | + service = TranslationService( | |
| 160 | + { | |
| 161 | + "service_url": "http://127.0.0.1:6006", | |
| 162 | + "timeout_sec": 10.0, | |
| 163 | + "default_model": "llm", | |
| 164 | + "default_scene": "general", | |
| 165 | + "capabilities": { | |
| 166 | + "llm": { | |
| 167 | + "enabled": True, | |
| 168 | + "backend": "llm", | |
| 169 | + "model": "dummy-llm", | |
| 170 | + "base_url": "https://example.com", | |
| 171 | + "timeout_sec": 10.0, | |
| 172 | + "use_cache": True, | |
| 173 | + } | |
| 174 | + }, | |
| 175 | + "cache": { | |
| 176 | + "ttl_seconds": 60, | |
| 177 | + "sliding_expiration": True, | |
| 178 | + }, | |
| 179 | + } | |
| 180 | + ) | |
| 181 | + | |
| 182 | + with caplog.at_level(logging.INFO): | |
| 183 | + service.translate("商品标题", target_lang="en", source_lang="zh", model="llm") | |
| 184 | + | |
| 185 | + route_messages = [ | |
| 186 | + record.getMessage() | |
| 187 | + for record in caplog.records | |
| 188 | + if record.name == "translation.service" and record.getMessage().startswith("Translation route |") | |
| 189 | + ] | |
| 190 | + | |
| 191 | + assert route_messages == [ | |
| 192 | + "Translation route | backend=llm request_type=single use_cache=True cache_available=False" | |
| 193 | + ] | ... | ... |
translation/backends/local_ctranslate2.py
| ... | ... | @@ -13,7 +13,12 @@ from typing import Dict, List, Optional, Sequence, Union |
| 13 | 13 | |
| 14 | 14 | from transformers import AutoTokenizer |
| 15 | 15 | |
| 16 | -from translation.languages import MARIAN_LANGUAGE_DIRECTIONS, NLLB_LANGUAGE_CODES | |
| 16 | +from translation.languages import ( | |
| 17 | + MARIAN_LANGUAGE_DIRECTIONS, | |
| 18 | + build_nllb_language_catalog, | |
| 19 | + normalize_language_key, | |
| 20 | + resolve_nllb_language_code, | |
| 21 | +) | |
| 17 | 22 | from translation.text_splitter import ( |
| 18 | 23 | compute_safe_input_token_limit, |
| 19 | 24 | join_translated_segments, |
| ... | ... | @@ -23,6 +28,17 @@ from translation.text_splitter import ( |
| 23 | 28 | logger = logging.getLogger(__name__) |
| 24 | 29 | |
| 25 | 30 | |
| 31 | +def _text_preview(text: Optional[str], limit: int = 32) -> str: | |
| 32 | + return str(text or "").replace("\n", "\\n")[:limit] | |
| 33 | + | |
| 34 | + | |
| 35 | +def _summarize_lengths(values: Sequence[int]) -> str: | |
| 36 | + if not values: | |
| 37 | + return "[]" | |
| 38 | + total = sum(values) | |
| 39 | + return f"min={min(values)} max={max(values)} avg={total / len(values):.1f}" | |
| 40 | + | |
| 41 | + | |
| 26 | 42 | def _resolve_device(device: Optional[str]) -> str: |
| 27 | 43 | value = str(device or "auto").strip().lower() |
| 28 | 44 | if value not in {"auto", "cpu", "cuda"}: |
| ... | ... | @@ -285,6 +301,17 @@ class LocalCTranslate2TranslationBackend: |
| 285 | 301 | source_tokens = self._encode_source_tokens(texts, source_lang, target_lang) |
| 286 | 302 | target_prefix = self._target_prefixes(len(source_tokens), source_lang, target_lang) |
| 287 | 303 | max_decoding_length = self._resolve_max_decoding_length(source_tokens) |
| 304 | + logger.info( | |
| 305 | + "Translation model batch detail | model=%s segment_count=%s token_lengths=%s max_decoding_length=%s batch_type=%s beam_size=%s target_lang=%s source_lang=%s", | |
| 306 | + self.model, | |
| 307 | + len(source_tokens), | |
| 308 | + _summarize_lengths([len(tokens) for tokens in source_tokens]), | |
| 309 | + max_decoding_length, | |
| 310 | + self.ct2_batch_type, | |
| 311 | + self.num_beams, | |
| 312 | + target_lang, | |
| 313 | + source_lang or "auto", | |
| 314 | + ) | |
| 288 | 315 | results = self.translator.translate_batch( |
| 289 | 316 | source_tokens, |
| 290 | 317 | target_prefix=target_prefix, |
| ... | ... | @@ -336,6 +363,59 @@ class LocalCTranslate2TranslationBackend: |
| 336 | 363 | ), |
| 337 | 364 | ) |
| 338 | 365 | |
| 366 | + def _log_segmentation_summary( | |
| 367 | + self, | |
| 368 | + *, | |
| 369 | + texts: Sequence[str], | |
| 370 | + segment_plans: Sequence[Sequence[str]], | |
| 371 | + target_lang: str, | |
| 372 | + source_lang: Optional[str], | |
| 373 | + ) -> None: | |
| 374 | + non_empty_count = sum(1 for text in texts if text.strip()) | |
| 375 | + segment_counts = [len(segments) for segments in segment_plans if segments] | |
| 376 | + total_segments = sum(segment_counts) | |
| 377 | + segmented_inputs = sum(1 for count in segment_counts if count > 1) | |
| 378 | + logger.info( | |
| 379 | + "Translation segmentation summary | model=%s inputs=%s non_empty_inputs=%s segmented_inputs=%s total_segments=%s batch_size=%s target_lang=%s source_lang=%s segments_per_input=%s", | |
| 380 | + self.model, | |
| 381 | + len(texts), | |
| 382 | + non_empty_count, | |
| 383 | + segmented_inputs, | |
| 384 | + total_segments, | |
| 385 | + self.batch_size, | |
| 386 | + target_lang, | |
| 387 | + source_lang or "auto", | |
| 388 | + _summarize_lengths(segment_counts), | |
| 389 | + ) | |
| 390 | + | |
| 391 | + def _translate_segment_batches( | |
| 392 | + self, | |
| 393 | + segments: List[str], | |
| 394 | + target_lang: str, | |
| 395 | + source_lang: Optional[str] = None, | |
| 396 | + ) -> List[Optional[str]]: | |
| 397 | + if not segments: | |
| 398 | + return [] | |
| 399 | + outputs: List[Optional[str]] = [] | |
| 400 | + total_batches = (len(segments) + self.batch_size - 1) // self.batch_size | |
| 401 | + for batch_index, start in enumerate(range(0, len(segments), self.batch_size), start=1): | |
| 402 | + batch = segments[start:start + self.batch_size] | |
| 403 | + logger.info( | |
| 404 | + "Translation inference batch | model=%s batch_index=%s total_batches=%s segment_count=%s char_lengths=%s first_preview=%s target_lang=%s source_lang=%s", | |
| 405 | + self.model, | |
| 406 | + batch_index, | |
| 407 | + total_batches, | |
| 408 | + len(batch), | |
| 409 | + _summarize_lengths([len(segment) for segment in batch]), | |
| 410 | + _text_preview(batch[0] if batch else ""), | |
| 411 | + target_lang, | |
| 412 | + source_lang or "auto", | |
| 413 | + ) | |
| 414 | + outputs.extend( | |
| 415 | + self._translate_batch(batch, target_lang=target_lang, source_lang=source_lang) | |
| 416 | + ) | |
| 417 | + return outputs | |
| 418 | + | |
| 339 | 419 | def _translate_with_segmentation( |
| 340 | 420 | self, |
| 341 | 421 | texts: List[str], |
| ... | ... | @@ -352,8 +432,15 @@ class LocalCTranslate2TranslationBackend: |
| 352 | 432 | segment_plans.append(segments) |
| 353 | 433 | flat_segments.extend(segments) |
| 354 | 434 | |
| 435 | + self._log_segmentation_summary( | |
| 436 | + texts=texts, | |
| 437 | + segment_plans=segment_plans, | |
| 438 | + target_lang=target_lang, | |
| 439 | + source_lang=source_lang, | |
| 440 | + ) | |
| 441 | + | |
| 355 | 442 | translated_segments = ( |
| 356 | - self._translate_batch(flat_segments, target_lang=target_lang, source_lang=source_lang) | |
| 443 | + self._translate_segment_batches(flat_segments, target_lang=target_lang, source_lang=source_lang) | |
| 357 | 444 | if flat_segments |
| 358 | 445 | else [] |
| 359 | 446 | ) |
| ... | ... | @@ -387,13 +474,10 @@ class LocalCTranslate2TranslationBackend: |
| 387 | 474 | del scene |
| 388 | 475 | is_single = isinstance(text, str) |
| 389 | 476 | texts = self._normalize_texts(text) |
| 390 | - outputs: List[Optional[str]] = [] | |
| 391 | - for start in range(0, len(texts), self.batch_size): | |
| 392 | - chunk = texts[start:start + self.batch_size] | |
| 393 | - if not any(item.strip() for item in chunk): | |
| 394 | - outputs.extend([None if not item.strip() else item for item in chunk]) # type: ignore[list-item] | |
| 395 | - continue | |
| 396 | - outputs.extend(self._translate_with_segmentation(chunk, target_lang=target_lang, source_lang=source_lang)) | |
| 477 | + if not any(item.strip() for item in texts): | |
| 478 | + outputs = [None if not item.strip() else item for item in texts] # type: ignore[list-item] | |
| 479 | + return outputs[0] if is_single else outputs | |
| 480 | + outputs = self._translate_with_segmentation(texts, target_lang=target_lang, source_lang=source_lang) | |
| 397 | 481 | return outputs[0] if is_single else outputs |
| 398 | 482 | |
| 399 | 483 | |
| ... | ... | @@ -492,11 +576,7 @@ class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend): |
| 492 | 576 | ct2_decoding_length_extra: int = 0, |
| 493 | 577 | ct2_decoding_length_min: int = 1, |
| 494 | 578 | ) -> None: |
| 495 | - overrides = language_codes or {} | |
| 496 | - self.language_codes = { | |
| 497 | - **NLLB_LANGUAGE_CODES, | |
| 498 | - **{str(k).strip().lower(): str(v).strip() for k, v in overrides.items() if str(k).strip()}, | |
| 499 | - } | |
| 579 | + self.language_codes = build_nllb_language_catalog(language_codes) | |
| 500 | 580 | self._tokenizers_by_source: Dict[str, object] = {} |
| 501 | 581 | super().__init__( |
| 502 | 582 | name=name, |
| ... | ... | @@ -522,17 +602,17 @@ class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend): |
| 522 | 602 | ) |
| 523 | 603 | |
| 524 | 604 | def _validate_languages(self, source_lang: Optional[str], target_lang: str) -> None: |
| 525 | - src = str(source_lang or "").strip().lower() | |
| 526 | - tgt = str(target_lang or "").strip().lower() | |
| 527 | - if not src: | |
| 605 | + if not str(source_lang or "").strip(): | |
| 528 | 606 | raise ValueError(f"Model '{self.model}' requires source_lang") |
| 529 | - if src not in self.language_codes: | |
| 607 | + if resolve_nllb_language_code(source_lang, self.language_codes) is None: | |
| 530 | 608 | raise ValueError(f"Unsupported NLLB source language: {source_lang}") |
| 531 | - if tgt not in self.language_codes: | |
| 609 | + if resolve_nllb_language_code(target_lang, self.language_codes) is None: | |
| 532 | 610 | raise ValueError(f"Unsupported NLLB target language: {target_lang}") |
| 533 | 611 | |
| 534 | 612 | def _get_tokenizer_for_source(self, source_lang: str): |
| 535 | - src_code = self.language_codes[source_lang] | |
| 613 | + src_code = resolve_nllb_language_code(source_lang, self.language_codes) | |
| 614 | + if src_code is None: | |
| 615 | + raise ValueError(f"Unsupported NLLB source language: {source_lang}") | |
| 536 | 616 | with self._tokenizer_lock: |
| 537 | 617 | tokenizer = self._tokenizers_by_source.get(src_code) |
| 538 | 618 | if tokenizer is None: |
| ... | ... | @@ -549,7 +629,7 @@ class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend): |
| 549 | 629 | target_lang: str, |
| 550 | 630 | ) -> List[List[str]]: |
| 551 | 631 | del target_lang |
| 552 | - source_key = str(source_lang or "").strip().lower() | |
| 632 | + source_key = normalize_language_key(source_lang) | |
| 553 | 633 | tokenizer = self._get_tokenizer_for_source(source_key) |
| 554 | 634 | encoded = tokenizer( |
| 555 | 635 | texts, |
| ... | ... | @@ -567,7 +647,9 @@ class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend): |
| 567 | 647 | target_lang: str, |
| 568 | 648 | ) -> Optional[List[Optional[List[str]]]]: |
| 569 | 649 | del source_lang |
| 570 | - tgt_code = self.language_codes[str(target_lang).strip().lower()] | |
| 650 | + tgt_code = resolve_nllb_language_code(target_lang, self.language_codes) | |
| 651 | + if tgt_code is None: | |
| 652 | + raise ValueError(f"Unsupported NLLB target language: {target_lang}") | |
| 571 | 653 | return [[tgt_code] for _ in range(count)] |
| 572 | 654 | |
| 573 | 655 | def _postprocess_hypothesis( |
| ... | ... | @@ -577,7 +659,9 @@ class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend): |
| 577 | 659 | target_lang: str, |
| 578 | 660 | ) -> List[str]: |
| 579 | 661 | del source_lang |
| 580 | - tgt_code = self.language_codes[str(target_lang).strip().lower()] | |
| 662 | + tgt_code = resolve_nllb_language_code(target_lang, self.language_codes) | |
| 663 | + if tgt_code is None: | |
| 664 | + raise ValueError(f"Unsupported NLLB target language: {target_lang}") | |
| 581 | 665 | if tokens and tokens[0] == tgt_code: |
| 582 | 666 | return tokens[1:] |
| 583 | 667 | return tokens | ... | ... |
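The CTranslate2 backend above threads `ct2_decoding_length_extra` and `ct2_decoding_length_min` through its constructor, and the commit message mentions a `Translation model batch detail` log that reports `max_decoding_length`. A plausible way those knobs combine is to scale the decoding budget from the source token count; the helper below is a hypothetical sketch of that idea (the function name, the `ratio` parameter, and the exact formula are assumptions, not the backend's actual code):

```python
def compute_max_decoding_length(
    input_tokens: int,
    ratio: float = 1.5,
    extra: int = 0,
    minimum: int = 1,
) -> int:
    """Hypothetical decoding cap: scale the source token count, add a fixed
    slack, and never go below a floor (mirrors ct2_decoding_length_extra /
    ct2_decoding_length_min in spirit only)."""
    return max(minimum, int(input_tokens * ratio) + extra)
```

With `ratio=1.5` and `extra=4`, a 10-token source would get a 19-token decoding budget, while an empty input still gets the 1-token floor.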
translation/backends/local_seq2seq.py
| ... | ... | @@ -10,7 +10,11 @@ from typing import Dict, List, Optional, Sequence, Union |
| 10 | 10 | import torch |
| 11 | 11 | from transformers import AutoModelForSeq2SeqLM, AutoTokenizer |
| 12 | 12 | |
| 13 | -from translation.languages import MARIAN_LANGUAGE_DIRECTIONS, NLLB_LANGUAGE_CODES | |
| 13 | +from translation.languages import ( | |
| 14 | + MARIAN_LANGUAGE_DIRECTIONS, | |
| 15 | + build_nllb_language_catalog, | |
| 16 | + resolve_nllb_language_code, | |
| 17 | +) | |
| 14 | 18 | from translation.text_splitter import ( |
| 15 | 19 | compute_safe_input_token_limit, |
| 16 | 20 | join_translated_segments, |
| ... | ... | @@ -20,6 +24,17 @@ from translation.text_splitter import ( |
| 20 | 24 | logger = logging.getLogger(__name__) |
| 21 | 25 | |
| 22 | 26 | |
| 27 | +def _text_preview(text: Optional[str], limit: int = 32) -> str: | |
| 28 | + return str(text or "").replace("\n", "\\n")[:limit] | |
| 29 | + | |
| 30 | + | |
| 31 | +def _summarize_lengths(values: Sequence[int]) -> str: | |
| 32 | + if not values: | |
| 33 | + return "[]" | |
| 34 | + total = sum(values) | |
| 35 | + return f"min={min(values)} max={max(values)} avg={total / len(values):.1f}" | |
| 36 | + | |
| 37 | + | |
| 23 | 38 | def _resolve_device(device: Optional[str]) -> str: |
| 24 | 39 | value = str(device or "auto").strip().lower() |
| 25 | 40 | if value == "auto": |
| ... | ... | @@ -198,6 +213,59 @@ class LocalSeq2SeqTranslationBackend: |
| 198 | 213 | ), |
| 199 | 214 | ) |
| 200 | 215 | |
| 216 | + def _log_segmentation_summary( | |
| 217 | + self, | |
| 218 | + *, | |
| 219 | + texts: Sequence[str], | |
| 220 | + segment_plans: Sequence[Sequence[str]], | |
| 221 | + target_lang: str, | |
| 222 | + source_lang: Optional[str], | |
| 223 | + ) -> None: | |
| 224 | + non_empty_count = sum(1 for text in texts if text.strip()) | |
| 225 | + segment_counts = [len(segments) for segments in segment_plans if segments] | |
| 226 | + total_segments = sum(segment_counts) | |
| 227 | + segmented_inputs = sum(1 for count in segment_counts if count > 1) | |
| 228 | + logger.info( | |
| 229 | + "Translation segmentation summary | model=%s inputs=%s non_empty_inputs=%s segmented_inputs=%s total_segments=%s batch_size=%s target_lang=%s source_lang=%s segments_per_input=%s", | |
| 230 | + self.model, | |
| 231 | + len(texts), | |
| 232 | + non_empty_count, | |
| 233 | + segmented_inputs, | |
| 234 | + total_segments, | |
| 235 | + self.batch_size, | |
| 236 | + target_lang, | |
| 237 | + source_lang or "auto", | |
| 238 | + _summarize_lengths(segment_counts), | |
| 239 | + ) | |
| 240 | + | |
| 241 | + def _translate_segment_batches( | |
| 242 | + self, | |
| 243 | + segments: List[str], | |
| 244 | + target_lang: str, | |
| 245 | + source_lang: Optional[str] = None, | |
| 246 | + ) -> List[Optional[str]]: | |
| 247 | + if not segments: | |
| 248 | + return [] | |
| 249 | + outputs: List[Optional[str]] = [] | |
| 250 | + total_batches = (len(segments) + self.batch_size - 1) // self.batch_size | |
| 251 | + for batch_index, start in enumerate(range(0, len(segments), self.batch_size), start=1): | |
| 252 | + batch = segments[start:start + self.batch_size] | |
| 253 | + logger.info( | |
| 254 | + "Translation inference batch | model=%s batch_index=%s total_batches=%s segment_count=%s char_lengths=%s first_preview=%s target_lang=%s source_lang=%s", | |
| 255 | + self.model, | |
| 256 | + batch_index, | |
| 257 | + total_batches, | |
| 258 | + len(batch), | |
| 259 | + _summarize_lengths([len(segment) for segment in batch]), | |
| 260 | + _text_preview(batch[0] if batch else ""), | |
| 261 | + target_lang, | |
| 262 | + source_lang or "auto", | |
| 263 | + ) | |
| 264 | + outputs.extend( | |
| 265 | + self._translate_batch(batch, target_lang=target_lang, source_lang=source_lang) | |
| 266 | + ) | |
| 267 | + return outputs | |
| 268 | + | |
| 201 | 269 | def _translate_with_segmentation( |
| 202 | 270 | self, |
| 203 | 271 | texts: List[str], |
| ... | ... | @@ -214,8 +282,15 @@ class LocalSeq2SeqTranslationBackend: |
| 214 | 282 | segment_plans.append(segments) |
| 215 | 283 | flat_segments.extend(segments) |
| 216 | 284 | |
| 285 | + self._log_segmentation_summary( | |
| 286 | + texts=texts, | |
| 287 | + segment_plans=segment_plans, | |
| 288 | + target_lang=target_lang, | |
| 289 | + source_lang=source_lang, | |
| 290 | + ) | |
| 291 | + | |
| 217 | 292 | translated_segments = ( |
| 218 | - self._translate_batch(flat_segments, target_lang=target_lang, source_lang=source_lang) | |
| 293 | + self._translate_segment_batches(flat_segments, target_lang=target_lang, source_lang=source_lang) | |
| 219 | 294 | if flat_segments |
| 220 | 295 | else [] |
| 221 | 296 | ) |
| ... | ... | @@ -249,13 +324,10 @@ class LocalSeq2SeqTranslationBackend: |
| 249 | 324 | del scene |
| 250 | 325 | is_single = isinstance(text, str) |
| 251 | 326 | texts = self._normalize_texts(text) |
| 252 | - outputs: List[Optional[str]] = [] | |
| 253 | - for start in range(0, len(texts), self.batch_size): | |
| 254 | - chunk = texts[start:start + self.batch_size] | |
| 255 | - if not any(item.strip() for item in chunk): | |
| 256 | - outputs.extend([None if not item.strip() else item for item in chunk]) # type: ignore[list-item] | |
| 257 | - continue | |
| 258 | - outputs.extend(self._translate_with_segmentation(chunk, target_lang=target_lang, source_lang=source_lang)) | |
| 327 | + if not any(item.strip() for item in texts): | |
| 328 | + outputs = [None if not item.strip() else item for item in texts] # type: ignore[list-item] | |
| 329 | + return outputs[0] if is_single else outputs | |
| 330 | + outputs = self._translate_with_segmentation(texts, target_lang=target_lang, source_lang=source_lang) | |
| 259 | 331 | return outputs[0] if is_single else outputs |
| 260 | 332 | |
| 261 | 333 | |
| ... | ... | @@ -324,11 +396,7 @@ class NLLBTranslationBackend(LocalSeq2SeqTranslationBackend): |
| 324 | 396 | language_codes: Optional[Dict[str, str]] = None, |
| 325 | 397 | attn_implementation: Optional[str] = None, |
| 326 | 398 | ) -> None: |
| 327 | - overrides = language_codes or {} | |
| 328 | - self.language_codes = { | |
| 329 | - **NLLB_LANGUAGE_CODES, | |
| 330 | - **{str(k).strip().lower(): str(v).strip() for k, v in overrides.items() if str(k).strip()}, | |
| 331 | - } | |
| 399 | + self.language_codes = build_nllb_language_catalog(language_codes) | |
| 332 | 400 | super().__init__( |
| 333 | 401 | name=name, |
| 334 | 402 | model_id=model_id, |
| ... | ... | @@ -343,24 +411,26 @@ class NLLBTranslationBackend(LocalSeq2SeqTranslationBackend): |
| 343 | 411 | ) |
| 344 | 412 | |
| 345 | 413 | def _validate_languages(self, source_lang: Optional[str], target_lang: str) -> None: |
| 346 | - src = str(source_lang or "").strip().lower() | |
| 347 | - tgt = str(target_lang or "").strip().lower() | |
| 348 | - if not src: | |
| 414 | + if not str(source_lang or "").strip(): | |
| 349 | 415 | raise ValueError(f"Model '{self.model}' requires source_lang") |
| 350 | - if src not in self.language_codes: | |
| 416 | + if resolve_nllb_language_code(source_lang, self.language_codes) is None: | |
| 351 | 417 | raise ValueError(f"Unsupported NLLB source language: {source_lang}") |
| 352 | - if tgt not in self.language_codes: | |
| 418 | + if resolve_nllb_language_code(target_lang, self.language_codes) is None: | |
| 353 | 419 | raise ValueError(f"Unsupported NLLB target language: {target_lang}") |
| 354 | 420 | |
| 355 | 421 | def _prepare_tokenizer(self, source_lang: Optional[str], target_lang: str) -> Dict[str, object]: |
| 356 | 422 | del target_lang |
| 357 | - src_code = self.language_codes[str(source_lang).strip().lower()] | |
| 423 | + src_code = resolve_nllb_language_code(source_lang, self.language_codes) | |
| 424 | + if src_code is None: | |
| 425 | + raise ValueError(f"Unsupported NLLB source language: {source_lang}") | |
| 358 | 426 | self.tokenizer.src_lang = src_code |
| 359 | 427 | return {} |
| 360 | 428 | |
| 361 | 429 | def _build_generate_kwargs(self, source_lang: Optional[str], target_lang: str) -> Dict[str, object]: |
| 362 | 430 | del source_lang |
| 363 | - tgt_code = self.language_codes[str(target_lang).strip().lower()] | |
| 431 | + tgt_code = resolve_nllb_language_code(target_lang, self.language_codes) | |
| 432 | + if tgt_code is None: | |
| 433 | + raise ValueError(f"Unsupported NLLB target language: {target_lang}") | |
| 364 | 434 | forced_bos_token_id = None |
| 365 | 435 | if hasattr(self.tokenizer, "lang_code_to_id"): |
| 366 | 436 | forced_bos_token_id = self.tokenizer.lang_code_to_id.get(tgt_code) | ... | ... |
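The batching change in `_translate_with_segmentation` / `_translate_segment_batches` boils down to: flatten all inputs into segments, slice the flat list by `batch_size` (so 150 segments at `batch_size=64` become batches of 64 + 64 + 22), then walk the original `segment_plans` to re-assemble one output per input. A minimal standalone sketch of that flatten/batch/merge pattern (helper names here are illustrative; the real merge lives in `join_translated_segments`):

```python
from typing import List, Optional


def plan_batches(total_segments: int, batch_size: int) -> List[int]:
    """Per-batch sizes for a flat segment list, e.g. 150 @ 64 -> [64, 64, 22]."""
    return [
        min(batch_size, total_segments - start)
        for start in range(0, total_segments, batch_size)
    ]


def merge_by_plan(
    flat_results: List[str],
    segment_plans: List[List[str]],
    joiner: str = " ",
) -> List[Optional[str]]:
    """Re-assemble per-input outputs from flat segment results, using the
    original segmentation plan to know how many segments each input owns."""
    outputs: List[Optional[str]] = []
    cursor = 0
    for segments in segment_plans:
        if not segments:  # empty/blank input never reached inference
            outputs.append(None)
            continue
        chunk = flat_results[cursor:cursor + len(segments)]
        cursor += len(segments)
        outputs.append(joiner.join(chunk))
    return outputs
```

Because batching happens on the flat list rather than per original input, a short input's lone segment can share a batch with a long input's tail segments, which is what keeps every batch (except the last) full.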
translation/cache.py
| ... | ... | @@ -40,10 +40,8 @@ class TranslationCache: |
| 40 | 40 | try: |
| 41 | 41 | value = self.redis_client.get(key) |
| 42 | 42 | logger.info( |
| 43 | - "Translation cache %s | model=%s target_lang=%s text_len=%s key=%s", | |
| 43 | + "Translation cache %s | text_len=%s key=%s", | |
| 44 | 44 | "hit" if value is not None else "miss", |
| 45 | - model, | |
| 46 | - target_lang, | |
| 47 | 45 | len(str(source_text or "")), |
| 48 | 46 | key, |
| 49 | 47 | ) |
| ... | ... | @@ -61,9 +59,7 @@ class TranslationCache: |
| 61 | 59 | try: |
| 62 | 60 | self.redis_client.setex(key, self.ttl_seconds, translated_text) |
| 63 | 61 | logger.info( |
| 64 | - "Translation cache write | model=%s target_lang=%s text_len=%s result_len=%s ttl_seconds=%s key=%s", | |
| 65 | - model, | |
| 66 | - target_lang, | |
| 62 | + "Translation cache write | text_len=%s result_len=%s ttl_seconds=%s key=%s", | |
| 67 | 63 | len(str(source_text or "")), |
| 68 | 64 | len(str(translated_text or "")), |
| 69 | 65 | self.ttl_seconds, | ... | ... |
translation/languages.py
| ... | ... | @@ -2,12 +2,13 @@ |
| 2 | 2 | |
| 3 | 3 | from __future__ import annotations |
| 4 | 4 | |
| 5 | -from typing import Dict, Tuple | |
| 5 | +from typing import Dict, Mapping, Optional, Tuple | |
| 6 | 6 | |
| 7 | 7 | |
| 8 | 8 | LANGUAGE_LABELS: Dict[str, str] = { |
| 9 | 9 | "zh": "Chinese", |
| 10 | 10 | "en": "English", |
| 11 | + "fi": "Finnish", | |
| 11 | 12 | "ru": "Russian", |
| 12 | 13 | "ar": "Arabic", |
| 13 | 14 | "ja": "Japanese", |
| ... | ... | @@ -49,6 +50,7 @@ DEEPL_LANGUAGE_CODES: Dict[str, str] = { |
| 49 | 50 | |
| 50 | 51 | NLLB_LANGUAGE_CODES: Dict[str, str] = { |
| 51 | 52 | "en": "eng_Latn", |
| 53 | + "fi": "fin_Latn", | |
| 52 | 54 | "zh": "zho_Hans", |
| 53 | 55 | "ru": "rus_Cyrl", |
| 54 | 56 | "ar": "arb_Arab", |
| ... | ... | @@ -65,3 +67,56 @@ MARIAN_LANGUAGE_DIRECTIONS: Dict[str, Tuple[str, str]] = { |
| 65 | 67 | "opus-mt-zh-en": ("zh", "en"), |
| 66 | 68 | "opus-mt-en-zh": ("en", "zh"), |
| 67 | 69 | } |
| 70 | + | |
| 71 | + | |
| 72 | +NLLB_LANGUAGE_ALIASES: Dict[str, str] = { | |
| 73 | + "fi_fi": "fi", | |
| 74 | + "fin": "fi", | |
| 75 | + "fin_fin": "fi", | |
| 76 | + "zh_cn": "zh", | |
| 77 | + "zh_hans": "zh", | |
| 78 | +} | |
| 79 | + | |
| 80 | + | |
| 81 | +def normalize_language_key(language: Optional[str]) -> str: | |
| 82 | + return str(language or "").strip().lower().replace("-", "_") | |
| 83 | + | |
| 84 | + | |
| 85 | +def build_nllb_language_catalog( | |
| 86 | + overrides: Optional[Mapping[str, str]] = None, | |
| 87 | +) -> Dict[str, str]: | |
| 88 | + catalog = { | |
| 89 | + normalize_language_key(key): str(value).strip() | |
| 90 | + for key, value in NLLB_LANGUAGE_CODES.items() | |
| 91 | + if str(key).strip() | |
| 92 | + } | |
| 93 | + for key, value in (overrides or {}).items(): | |
| 94 | + normalized_key = normalize_language_key(key) | |
| 95 | + if normalized_key: | |
| 96 | + catalog[normalized_key] = str(value).strip() | |
| 97 | + return catalog | |
| 98 | + | |
| 99 | + | |
| 100 | +def resolve_nllb_language_code( | |
| 101 | + language: Optional[str], | |
| 102 | + language_codes: Optional[Mapping[str, str]] = None, | |
| 103 | +) -> Optional[str]: | |
| 104 | + normalized = normalize_language_key(language) | |
| 105 | + if not normalized: | |
| 106 | + return None | |
| 107 | + | |
| 108 | + catalog = build_nllb_language_catalog(language_codes) | |
| 109 | + direct = catalog.get(normalized) | |
| 110 | + if direct is not None: | |
| 111 | + return direct | |
| 112 | + | |
| 113 | + alias = NLLB_LANGUAGE_ALIASES.get(normalized) | |
| 114 | + if alias is not None: | |
| 115 | + aliased = catalog.get(normalize_language_key(alias)) | |
| 116 | + if aliased is not None: | |
| 117 | + return aliased | |
| 118 | + | |
| 119 | + for code in catalog.values(): | |
| 120 | + if normalize_language_key(code) == normalized: | |
| 121 | + return code | |
| 122 | + return None | ... | ... |
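`resolve_nllb_language_code` above tries three lookups in order: the normalized short key, an alias (regional or ISO-639-3 spellings like `fi-FI`/`fin`), and finally a reverse match for callers that already pass a full NLLB code such as `eng_Latn`. A condensed self-contained sketch of that resolution order (with a trimmed-down catalog for illustration):

```python
from typing import Dict, Optional

# Trimmed illustrative catalogs; the real ones live in translation/languages.py.
CODES: Dict[str, str] = {"en": "eng_Latn", "fi": "fin_Latn", "zh": "zho_Hans"}
ALIASES: Dict[str, str] = {"fi_fi": "fi", "fin": "fi", "zh_cn": "zh", "zh_hans": "zh"}


def normalize(language: Optional[str]) -> str:
    return str(language or "").strip().lower().replace("-", "_")


def resolve(language: Optional[str]) -> Optional[str]:
    key = normalize(language)
    if not key:
        return None
    if key in CODES:                     # 1. canonical short code ("fi")
        return CODES[key]
    alias = ALIASES.get(key)
    if alias is not None:                # 2. regional / ISO-639-3 alias ("fi-FI", "fin")
        return CODES.get(alias)
    for code in CODES.values():          # 3. already a full NLLB code ("fin_Latn")
        if normalize(code) == key:
            return code
    return None
```

Returning `None` instead of raising keeps the decision with the caller: the backends turn it into an "Unsupported NLLB source/target language" `ValueError` at validation time.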
| ... | ... | @@ -0,0 +1,37 @@ |
| 1 | +"""Shared translation logging context helpers.""" | |
| 2 | + | |
| 3 | +from __future__ import annotations | |
| 4 | + | |
| 5 | +import contextvars | |
| 6 | +import logging | |
| 7 | +import uuid | |
| 8 | +from typing import Optional | |
| 9 | + | |
| 10 | + | |
| 11 | +_translation_request_id_var: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar( | |
| 12 | + "translation_request_id", | |
| 13 | + default=None, | |
| 14 | +) | |
| 15 | + | |
| 16 | + | |
| 17 | +def current_translation_request_id() -> str: | |
| 18 | + return _translation_request_id_var.get() or "-1" | |
| 19 | + | |
| 20 | + | |
| 21 | +def bind_translation_request_id(request_id: Optional[str] = None) -> tuple[str, contextvars.Token]: | |
| 22 | + raw_value = str(request_id or "").strip() | |
| 23 | + normalized = raw_value[:32] if raw_value else str(uuid.uuid4())[:8] | |
| 24 | + return normalized, _translation_request_id_var.set(normalized) | |
| 25 | + | |
| 26 | + | |
| 27 | +def reset_translation_request_id(token: contextvars.Token) -> None: | |
| 28 | + _translation_request_id_var.reset(token) | |
| 29 | + | |
| 30 | + | |
| 31 | +class TranslationRequestFilter(logging.Filter): | |
| 32 | + """Inject a request id into translator logs when one is available.""" | |
| 33 | + | |
| 34 | + def filter(self, record: logging.LogRecord) -> bool: | |
| 35 | + if not hasattr(record, "reqid"): | |
| 36 | + record.reqid = current_translation_request_id() | |
| 37 | + return True | ... | ... |
translation/service.py
| ... | ... | @@ -198,15 +198,10 @@ class TranslationService: |
| 198 | 198 | active_scene = normalize_translation_scene(self.config, scene) |
| 199 | 199 | capability_cfg = self._enabled_capabilities[normalized_model] |
| 200 | 200 | use_cache = bool(capability_cfg.get("use_cache")) |
| 201 | - text_count = 1 if isinstance(text, str) else len(list(text)) | |
| 202 | 201 | logger.info( |
| 203 | - "Translation route | model=%s backend=%s scene=%s target_lang=%s source_lang=%s count=%s use_cache=%s cache_available=%s", | |
| 204 | - normalized_model, | |
| 202 | + "Translation route | backend=%s request_type=%s use_cache=%s cache_available=%s", | |
| 205 | 203 | getattr(backend, "model", normalized_model), |
| 206 | - active_scene, | |
| 207 | - target_lang, | |
| 208 | - source_lang or "auto", | |
| 209 | - text_count, | |
| 204 | + "single" if isinstance(text, str) else "batch", | |
| 210 | 205 | use_cache, |
| 211 | 206 | self._translation_cache.available, |
| 212 | 207 | ) |
| ... | ... | @@ -252,11 +247,7 @@ class TranslationService: |
| 252 | 247 | cached = self._translation_cache.get(model=model, target_lang=target_lang, source_text=text) |
| 253 | 248 | if cached is not None: |
| 254 | 249 | logger.info( |
| 255 | - "Translation cache served | model=%s scene=%s target_lang=%s source_lang=%s text_len=%s", | |
| 256 | - model, | |
| 257 | - scene, | |
| 258 | - target_lang, | |
| 259 | - source_lang or "auto", | |
| 250 | + "Translation cache served | request_type=single text_len=%s", | |
| 260 | 251 | len(text), |
| 261 | 252 | ) |
| 262 | 253 | return cached |
| ... | ... | @@ -274,21 +265,13 @@ class TranslationService: |
| 274 | 265 | translated_text=translated, |
| 275 | 266 | ) |
| 276 | 267 | logger.info( |
| 277 | - "Translation backend result cached | model=%s scene=%s target_lang=%s source_lang=%s text_len=%s result_len=%s", | |
| 278 | - model, | |
| 279 | - scene, | |
| 280 | - target_lang, | |
| 281 | - source_lang or "auto", | |
| 268 | + "Translation backend result cached | request_type=single text_len=%s result_len=%s", | |
| 282 | 269 | len(text), |
| 283 | 270 | len(str(translated)), |
| 284 | 271 | ) |
| 285 | 272 | else: |
| 286 | 273 | logger.warning( |
| 287 | - "Translation backend returned empty result | model=%s scene=%s target_lang=%s source_lang=%s text_len=%s", | |
| 288 | - model, | |
| 289 | - scene, | |
| 290 | - target_lang, | |
| 291 | - source_lang or "auto", | |
| 274 | + "Translation backend returned empty result | request_type=single text_len=%s", | |
| 292 | 275 | len(text), |
| 293 | 276 | ) |
| 294 | 277 | return translated |
| ... | ... | @@ -327,11 +310,7 @@ class TranslationService: |
| 327 | 310 | miss_indices.append(idx) |
| 328 | 311 | |
| 329 | 312 | logger.info( |
| 330 | - "Translation batch cache summary | model=%s scene=%s target_lang=%s source_lang=%s total=%s cache_hits=%s cache_misses=%s", | |
| 331 | - model, | |
| 332 | - scene, | |
| 333 | - target_lang, | |
| 334 | - source_lang or "auto", | |
| 313 | + "Translation batch cache summary | total=%s cache_hits=%s cache_misses=%s", | |
| 335 | 314 | len(texts), |
| 336 | 315 | cache_hits, |
| 337 | 316 | len(misses), |
| ... | ... | @@ -356,11 +335,7 @@ class TranslationService: |
| 356 | 335 | ) |
| 357 | 336 | else: |
| 358 | 337 | logger.warning( |
| 359 | - "Translation batch item returned empty result | model=%s scene=%s target_lang=%s source_lang=%s item_index=%s text_len=%s", | |
| 360 | - model, | |
| 361 | - scene, | |
| 362 | - target_lang, | |
| 363 | - source_lang or "auto", | |
| 338 | + "Translation batch item returned empty result | item_index=%s text_len=%s", | |
| 364 | 339 | idx, |
| 365 | 340 | len(original_text), |
| 366 | 341 | ) | ... | ... |