Commit 522a39647a24d001f7da792bc30ba54b4f01f238

Authored by tangwang
1 parent a5a3856d

Optimize multilingual search translation (add context hints for DeepL)

DEEPL_OPTIMIZATION.md 0 → 100644
... ... @@ -0,0 +1,185 @@
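  1 +# DeepL Translation Optimization Guide
  2 +
  3 +## Problem
  4 +
  5 +In an e-commerce search setting, DeepL can mistranslate polysemous words. For example:
  6 +- "车" is translated as "rook" (the chess piece) instead of "car" (the vehicle)
  7 +
  8 +## Solution
  9 +
  10 +The following optimizations improve DeepL's translation accuracy for e-commerce queries:
  11 +
  12 +### 1. Context Hints
  13 +
  14 +The system automatically adds e-commerce context to short single-word queries so that DeepL can infer the domain of the query.
  15 +
  16 +**How it works:**
  17 +- For a Chinese query of one or two characters (such as "车"), the system sends "购买 车" ("buy 车") instead of the bare term
  18 +- With that context, DeepL translates "车" as "car" rather than "rook"
  19 +- After translation, the system extracts the actual query term ("car") from the result
  20 +
  21 +**Configuration:**
  22 +Set the translation context in `config/config.yaml`; the "购买" prefix is applied only when this value contains "e-commerce":
  23 +
  24 +```yaml
  25 +query_config:
  26 +  translation_context: "e-commerce product search" # default
  27 +```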
  28 +
  29 +### 2. Glossary Support (Recommended)
  30 +
  31 +DeepL supports custom glossaries that pin specific terms to a fixed translation. This is the most reliable way to resolve polysemous words.
  32 +
  33 +#### Creating a glossary
  34 +
  35 +1. **Create the glossary through the DeepL API:**
  36 +
  37 +```python
  38 +import requests
  39 +
  40 +# Create a glossary
  41 +api_url = "https://api.deepl.com/v2/glossaries"
  42 +headers = {
  43 +    "Authorization": "DeepL-Auth-Key YOUR_API_KEY",
  44 +    "Content-Type": "application/json",
  45 +}
  46 +
  47 +# Glossary entries (TSV format: source term, tab, target term)
  48 +glossary_entries = """车\tcar
  49 +手机\tmobile phone
  50 +电脑\tcomputer"""
  51 +
  52 +payload = {
  53 +    "name": "e-commerce-glossary",
  54 +    "source_lang": "ZH",
  55 +    "target_lang": "EN",
  56 +    "entries": glossary_entries,
  57 +    "entries_format": "tsv"
  58 +}
  59 +
  60 +response = requests.post(api_url, headers=headers, json=payload)
  61 +if response.status_code == 201:
  62 +    glossary_id = response.json()["glossary_id"]
  63 +    print(f"Glossary created, ID: {glossary_id}")
  64 +```
  65 +
  66 +2. **Or create it in the DeepL web interface:**
  67 +   - Log in to your DeepL Pro account
  68 +   - Open the glossary management page
  69 +   - Create a new glossary and add mappings such as "车" -> "car"
  70 +
  71 +#### Configuring the glossary
  72 +
  73 +Set the glossary ID in `config/config.yaml`:
  74 +
  75 +```yaml
  76 +query_config:
  77 +  translation_glossary_id: "your-glossary-id-here" # DeepL glossary ID
  78 +```
  79 +
  80 +#### Glossary format
  81 +
  82 +Glossaries use TSV (tab-separated values) format, one entry per line:
  83 +
  84 +```
  85 +车 car
  86 +手机 mobile phone
  87 +电脑 computer
  88 +```
  89 +
  90 +**Note:**
  91 +- Glossaries require a DeepL Pro (paid) account
  92 +- The Free API does not support glossaries
  93 +
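  94 +### 3. Automatic Context Handling
  95 +
  96 +The system detects the following cases and applies the optimization automatically:
  97 +
  98 +- **Short single-word Chinese queries** (one or two characters): e-commerce context is added automatically
  99 +- **Longer queries**: DeepL usually has enough context, so no special handling is applied
  100 +- **Non-Chinese queries**: the context optimization is not applied
  101 +
  102 +## Usage Examples
  103 +
  104 +### Example 1: Context hints (automatic)
  105 +
  106 +For the query "车":
  107 +1. The system detects a short single-word Chinese query
  108 +2. It adds context automatically: "购买 车"
  109 +3. DeepL translates the phrase as "buy car"
  110 +4. The system extracts the actual query term: "car"
  111 +
  112 +### Example 2: Glossary (recommended)
  113 +
  114 +1. Create a glossary that maps "车" -> "car"
  115 +2. Set `translation_glossary_id` in the configuration
  116 +3. For the query "车", DeepL uses the glossary and translates it directly as "car"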
  117 +
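  118 +## Best Practices
  119 +
  120 +1. **Prefer a glossary**:
  121 +   - For common e-commerce terms, a glossary is the most reliable option
  122 +   - A glossary keeps translations consistent and accurate
  123 +
  124 +2. **Use context hints as a supplement**:
  125 +   - Context hints help with terms that are not in the glossary
  126 +   - They are enabled by default and need no extra configuration
  127 +
  128 +3. **Update the glossary regularly**:
  129 +   - Add new term mappings based on real usage
  130 +   - Pay particular attention to brand names, product categories, and other specialized terms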
  131 +
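  132 +## Implementation Details
  133 +
  134 +### Adding context
  135 +
  136 +```python
  137 +# Short single-word queries: no internal whitespace, at most 2 characters
  138 +if len(text.strip().split()) == 1 and len(text.strip()) <= 2:
  139 +    context_phrase = f"购买 {text}"  # prepend "购买" ("buy")
  140 +    return context_phrase, True  # the term must be extracted from the result
  141 +```
  142 +
  143 +### Extracting the result
  144 +
  145 +The translation result "buy car" is processed as follows:
  146 +1. Identify context words (buy, purchase, product, and so on)
  147 +2. Take the last word that is not a context word as the actual query term
  148 +3. Return "car"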
  149 +
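  150 +## FAQ
  151 +
  152 +### Q: Why is "车" translated as "rook"?
  153 +
  154 +A: A single-character query gives DeepL no context for choosing a sense. In Chinese, "车" can mean a car or the chess piece. Adding e-commerce context or using a glossary resolves the ambiguity.
  155 +
  156 +### Q: Which is better, a glossary or context hints?
  157 +
  158 +A: A glossary is more reliable because it specifies the translation mapping directly. Context hints are an automatic supplement for terms that are not in the glossary.
  159 +
  160 +### Q: Can the Free API use glossaries?
  161 +
  162 +A: No. Glossaries require a DeepL Pro (paid) account. With the Free API, only the context-hint optimization is available.
  163 +
  164 +### Q: How do I test the translation?
  165 +
  166 +A: Call the search API and inspect the `translations` field in the response:
  167 +
  168 +```bash
  169 +curl -X POST http://localhost:6002/api/search \
  170 +  -H "Content-Type: application/json" \
  171 +  -d '{"query": "车", "tenant_id": "test"}'
  172 +```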
  173 +
  174 +## Related Files
  175 +
  176 +- `query/translator.py` - translator implementation
  177 +- `query/query_parser.py` - query parser (calls the translator)
  178 +- `config/config.yaml` - configuration file
  179 +- `config/config_loader.py` - configuration loader
  180 +
  181 +## References
  182 +
  183 +- [DeepL API documentation](https://www.deepl.com/docs-api)
  184 +- [DeepL glossaries](https://www.deepl.com/docs-api/managing-glossaries/)
  185 +
... ...
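The round trip that the guide describes (prefix the term, translate, strip the prefix) can be tried without network access. The sketch below is a simplified stand-in rather than the code from `query/translator.py`: the names `add_context`, `extract_term`, and `fake_translate` are hypothetical, and `fake_translate` is a stub that only imitates a DeepL response for one phrase.

```python
from typing import Optional, Tuple

# Words that come from the added prefix, not from the user's query.
CONTEXT_WORDS = {"buy", "purchase", "product", "item", "commodity", "goods"}


def add_context(text: str, source_lang: Optional[str]) -> Tuple[str, bool]:
    """Prefix short single-word Chinese queries with "购买" ("buy")."""
    stripped = text.strip()
    if source_lang == "zh" and len(stripped.split()) == 1 and len(stripped) <= 2:
        return f"购买 {stripped}", True
    return text, False


def extract_term(translated: str) -> str:
    """Return the last word that is not a context word, lowercased."""
    words = translated.strip().split()
    for word in reversed(words):
        cleaned = word.lower().rstrip(".,!?;:")
        if cleaned not in CONTEXT_WORDS:
            return cleaned
    # Every word was a context word (or the text was empty): return it unchanged.
    return translated


def fake_translate(text: str) -> str:
    """Stub standing in for a DeepL call (assumption, not real output)."""
    return {"购买 车": "Buy car"}.get(text, text)


def translate_query(text: str, source_lang: Optional[str]) -> str:
    to_send, needs_extraction = add_context(text, source_lang)
    result = fake_translate(to_send)
    return extract_term(result) if needs_extraction else result
```

With the stub in place, `translate_query("车", "zh")` goes through the prefix-and-extract path, while a longer query such as "笔记本电脑" is passed through unchanged.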
config/config.yaml
... ... @@ -242,6 +242,8 @@ query_config:
242 242 # Translation API (DeepL)
243 243 translation_service: "deepl"
244 244 translation_api_key: null # Set via environment variable
  245 + # translation_glossary_id: null # Optional: DeepL glossary ID for custom terminology (e.g., "车" -> "car")
  246 + # translation_context: "e-commerce product search" # Context hint for better translation disambiguation
245 247  
246 248 # Ranking Configuration
247 249 ranking:
... ...
config/config_loader.py
... ... @@ -51,6 +51,8 @@ class QueryConfig:
51 51 # Translation API settings
52 52 translation_api_key: Optional[str] = None
53 53 translation_service: str = "deepl" # deepl, google, etc.
  54 + translation_glossary_id: Optional[str] = None # DeepL glossary ID for custom terminology
  55 + translation_context: str = "e-commerce product search" # Context hint for translation
54 56  
55 57 # ES source fields configuration - fields to return in search results
56 58 source_fields: List[str] = field(default_factory=lambda: [
... ... @@ -209,7 +211,9 @@ class ConfigLoader:
209 211 enable_query_rewrite=query_config_data.get("enable_query_rewrite", True),
210 212 rewrite_dictionary=rewrite_dictionary,
211 213 translation_api_key=query_config_data.get("translation_api_key"),
212   - translation_service=query_config_data.get("translation_service", "deepl")
  214 + translation_service=query_config_data.get("translation_service", "deepl"),
  215 + translation_glossary_id=query_config_data.get("translation_glossary_id"),
  216 + translation_context=query_config_data.get("translation_context", "e-commerce product search")
213 217 )
214 218  
215 219 # Parse ranking config
... ...
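One detail of the loader change is worth illustrating: `dict.get(key, default)` only falls back to the default when the key is absent, so a YAML entry written as `translation_context: null` yields `None` rather than the default string. The sketch below uses a hypothetical `QueryConfigSketch` dataclass and `load_query_config` helper (neither is in the repository) to show one way to normalize both cases; in the commit itself the `Translator` constructor covers this with its own `or` fallback.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

DEFAULT_CONTEXT = "e-commerce product search"


@dataclass
class QueryConfigSketch:
    """Hypothetical, reduced stand-in for the two new QueryConfig fields."""
    translation_glossary_id: Optional[str] = None
    translation_context: str = DEFAULT_CONTEXT


def load_query_config(data: Dict[str, Any]) -> QueryConfigSketch:
    # `or` also replaces an explicit null/empty value, unlike get(key, default).
    return QueryConfigSketch(
        translation_glossary_id=data.get("translation_glossary_id") or None,
        translation_context=data.get("translation_context") or DEFAULT_CONTEXT,
    )
```

For example, `load_query_config({"translation_context": None})` still produces the default context string, whereas `{"translation_context": None}.get("translation_context", DEFAULT_CONTEXT)` returns `None`.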
query/query_parser.py
... ... @@ -98,7 +98,9 @@ class QueryParser:
98 98 print("[QueryParser] Initializing translator...")
99 99 self._translator = Translator(
100 100 api_key=self.query_config.translation_api_key,
101   - use_cache=True
  101 + use_cache=True,
  102 + glossary_id=getattr(self.query_config, 'translation_glossary_id', None),
  103 + translation_context=getattr(self.query_config, 'translation_context', 'e-commerce product search')
102 104 )
103 105 return self._translator
104 106  
... ... @@ -195,10 +197,13 @@ class QueryParser:
195 197  
196 198 if target_langs:
197 199 log_info(f"开始翻译 | 源语言: {detected_lang} | 目标语言: {target_langs}")
  200 + # Use e-commerce context for better disambiguation
  201 + translation_context = getattr(self.query_config, 'translation_context', 'e-commerce product search')
198 202 translations = self.translator.translate_multi(
199 203 query_text,
200 204 target_langs,
201   - source_lang=detected_lang
  205 + source_lang=detected_lang,
  206 + context=translation_context
202 207 )
203 208 log_info(f"翻译完成 | 结果: {translations}")
204 209 if context:
... ...
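The parser now forwards a `context` string, which the translator folds into the text itself. As an alternative worth checking, DeepL's `/v2/translate` reference describes a `context` request field that influences the translation without being translated, and it states that `glossary_id` requires `source_lang` to be set. The sketch below only builds the request body under those assumptions; `build_translate_payload` is a hypothetical helper, not part of this commit, and the field behavior should be verified against the current DeepL API reference before relying on it.

```python
from typing import Any, Dict, Optional


def build_translate_payload(
    text: str,
    target_code: str,
    source_code: Optional[str] = None,
    context: Optional[str] = None,
    glossary_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON body for DeepL's translate endpoint (sketch, no network I/O)."""
    payload: Dict[str, Any] = {"text": [text], "target_lang": target_code}
    if source_code:
        payload["source_lang"] = source_code
    if context:
        # Assumption: sent as a separate field, so no term extraction is needed.
        payload["context"] = context
    if glossary_id:
        if not source_code:
            # Fail early instead of letting the API reject the request.
            raise ValueError("glossary_id requires an explicit source language")
        payload["glossary_id"] = glossary_id
    return payload
```

A call such as `build_translate_payload("车", "EN", "ZH", context="e-commerce product search")` leaves the query text untouched, which would sidestep the extraction step entirely if the field behaves as documented.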
query/translator.py
... ... @@ -32,7 +32,9 @@ class Translator:
32 32 self,
33 33 api_key: Optional[str] = None,
34 34 use_cache: bool = True,
35   - timeout: int = 10
  35 + timeout: int = 10,
  36 + glossary_id: Optional[str] = None,
  37 + translation_context: Optional[str] = None
36 38 ):
37 39 """
38 40 Initialize translator.
... ... @@ -41,6 +43,8 @@ class Translator:
41 43 api_key: DeepL API key (or None to use from config/env)
42 44 use_cache: Whether to cache translations
43 45 timeout: Request timeout in seconds
  46 + glossary_id: DeepL glossary ID for custom terminology (optional)
  47 + translation_context: Context hint for translation (e.g., "e-commerce", "product search")
44 48 """
45 49 # Get API key from config if not provided
46 50 if api_key is None:
... ... @@ -53,6 +57,8 @@ class Translator:
53 57 self.api_key = api_key
54 58 self.timeout = timeout
55 59 self.use_cache = use_cache
  60 + self.glossary_id = glossary_id
  61 + self.translation_context = translation_context or "e-commerce product search"
56 62  
57 63 if use_cache:
58 64 self.cache = DictCache(".cache/translations.json")
... ... @@ -63,7 +69,8 @@ class Translator:
63 69 self,
64 70 text: str,
65 71 target_lang: str,
66   - source_lang: Optional[str] = None
  72 + source_lang: Optional[str] = None,
  73 + context: Optional[str] = None
67 74 ) -> Optional[str]:
68 75 """
69 76 Translate text to target language.
... ... @@ -72,6 +79,7 @@ class Translator:
72 79 text: Text to translate
73 80 target_lang: Target language code ('zh', 'en', 'ru', etc.)
74 81 source_lang: Source language code (optional, auto-detect if None)
  82 + context: Additional context for translation (overrides default context)
75 83  
76 84 Returns:
77 85 Translated text or None if translation fails
... ... @@ -84,9 +92,12 @@ class Translator:
84 92 if source_lang:
85 93 source_lang = source_lang.lower()
86 94  
87   - # Check cache
  95 + # Use provided context or default context
  96 + translation_context = context or self.translation_context
  97 +
  98 + # Check cache (include context in cache key for accuracy)
88 99 if self.use_cache:
89   - cache_key = f"{source_lang or 'auto'}:{target_lang}:{text}"
  100 + cache_key = f"{source_lang or 'auto'}:{target_lang}:{translation_context}:{text}"
90 101 cached = self.cache.get(cache_key, category="translations")
91 102 if cached:
92 103 return cached
... ... @@ -97,12 +108,12 @@ class Translator:
97 108 return text
98 109  
99 110 # Translate using DeepL with fallback
100   - result = self._translate_deepl(text, target_lang, source_lang)
  111 + result = self._translate_deepl(text, target_lang, source_lang, translation_context)
101 112  
102 113 # If translation failed, try fallback to free API
103 114 if result is None and "api.deepl.com" in self.DEEPL_API_URL:
104 115 print(f"[Translator] Pro API failed, trying free API...")
105   - result = self._translate_deepl_free(text, target_lang, source_lang)
  116 + result = self._translate_deepl_free(text, target_lang, source_lang, translation_context)
106 117  
107 118 # If still failed, return original text with warning
108 119 if result is None:
... ... @@ -111,7 +122,7 @@ class Translator:
111 122  
112 123 # Cache result
113 124 if result and self.use_cache:
114   - cache_key = f"{source_lang or 'auto'}:{target_lang}:{text}"
  125 + cache_key = f"{source_lang or 'auto'}:{target_lang}:{translation_context}:{text}"
115 126 self.cache.set(cache_key, result, category="translations")
116 127  
117 128 return result
... ... @@ -120,9 +131,18 @@ class Translator:
120 131 self,
121 132 text: str,
122 133 target_lang: str,
123   - source_lang: Optional[str]
  134 + source_lang: Optional[str],
  135 + context: Optional[str] = None
124 136 ) -> Optional[str]:
125   - """Translate using DeepL API."""
  137 + """
  138 + Translate using DeepL API with context and glossary support.
  139 +
  140 + Args:
  141 + text: Text to translate
  142 + target_lang: Target language code
  143 + source_lang: Source language code (optional)
  144 + context: Context hint for translation (e.g., "e-commerce product search")
  145 + """
126 146 # Map to DeepL language codes
127 147 target_code = self.LANG_CODE_MAP.get(target_lang, target_lang.upper())
128 148  
... ... @@ -131,8 +151,13 @@ class Translator:
131 151 "Content-Type": "application/json",
132 152 }
133 153  
  154 + # Build text with context for better disambiguation
  155 + # For e-commerce, add context words to help DeepL understand the domain
  156 + # This is especially important for single-word ambiguous terms like "车" (car vs rook)
  157 + text_to_translate, needs_extraction = self._add_ecommerce_context(text, source_lang, context)
  158 +
134 159 payload = {
135   - "text": [text],
  160 + "text": [text_to_translate],
136 161 "target_lang": target_code,
137 162 }
138 163  
... ... @@ -140,6 +165,16 @@ class Translator:
140 165 source_code = self.LANG_CODE_MAP.get(source_lang, source_lang.upper())
141 166 payload["source_lang"] = source_code
142 167  
  168 + # Add glossary if configured
  169 + if self.glossary_id:
  170 + payload["glossary_id"] = self.glossary_id
  171 +
  172 + # Note: DeepL API v2 doesn't have a direct "context" parameter,
  173 + # but we can improve translation by:
  174 + # 1. Using glossary for domain-specific terms (best solution)
  175 + # 2. Adding context words to the text (for single-word queries) - implemented in _add_ecommerce_context
  176 + # 3. Using more specific source language detection
  177 +
143 178 try:
144 179 response = requests.post(
145 180 self.DEEPL_API_URL,
... ... @@ -151,7 +186,13 @@ class Translator:
151 186 if response.status_code == 200:
152 187 data = response.json()
153 188 if "translations" in data and len(data["translations"]) > 0:
154   - return data["translations"][0]["text"]
  189 + translated_text = data["translations"][0]["text"]
  190 + # If we added context, extract just the term from the result
  191 + if needs_extraction:
  192 + translated_text = self._extract_term_from_translation(
  193 + translated_text, text, target_code
  194 + )
  195 + return translated_text
155 196 else:
156 197 print(f"[Translator] DeepL API error: {response.status_code} - {response.text}")
157 198 return None
... ... @@ -167,9 +208,14 @@ class Translator:
167 208 self,
168 209 text: str,
169 210 target_lang: str,
170   - source_lang: Optional[str]
  211 + source_lang: Optional[str],
  212 + context: Optional[str] = None
171 213 ) -> Optional[str]:
172   - """Translate using DeepL Free API."""
  214 + """
  215 + Translate using DeepL Free API.
  216 +
  217 + Note: Free API may not support glossary_id parameter.
  218 + """
173 219 # Map to DeepL language codes
174 220 target_code = self.LANG_CODE_MAP.get(target_lang, target_lang.upper())
175 221  
... ... @@ -187,6 +233,9 @@ class Translator:
187 233 source_code = self.LANG_CODE_MAP.get(source_lang, source_lang.upper())
188 234 payload["source_lang"] = source_code
189 235  
  236 + # Note: Free API typically doesn't support glossary_id
  237 + # But we can still use context hints in the text
  238 +
190 239 try:
191 240 response = requests.post(
192 241 "https://api-free.deepl.com/v2/translate",
... ... @@ -214,7 +263,8 @@ class Translator:
214 263 self,
215 264 text: str,
216 265 target_langs: List[str],
217   - source_lang: Optional[str] = None
  266 + source_lang: Optional[str] = None,
  267 + context: Optional[str] = None
218 268 ) -> Dict[str, Optional[str]]:
219 269 """
220 270 Translate text to multiple target languages.
... ... @@ -223,15 +273,98 @@ class Translator:
223 273 text: Text to translate
224 274 target_langs: List of target language codes
225 275 source_lang: Source language code (optional)
  276 + context: Context hint for translation (optional)
226 277  
227 278 Returns:
228 279 Dictionary mapping language code to translated text
229 280 """
230 281 results = {}
231 282 for lang in target_langs:
232   - results[lang] = self.translate(text, lang, source_lang)
  283 + results[lang] = self.translate(text, lang, source_lang, context)
233 284 return results
234 285  
  286 + def _add_ecommerce_context(
  287 + self,
  288 + text: str,
  289 + source_lang: Optional[str],
  290 + context: Optional[str]
  291 + ) -> tuple:
  292 + """
  293 + Add e-commerce context to text for better disambiguation.
  294 +
  295 + For single-word ambiguous Chinese terms, we add context words that help
  296 + DeepL understand this is an e-commerce/product search context.
  297 +
  298 + Args:
  299 + text: Original text to translate
  300 + source_lang: Source language code
  301 + context: Context hint
  302 +
  303 + Returns:
  304 + Tuple of (text_with_context, needs_extraction)
  305 + - text_with_context: Text to send to DeepL
  306 + - needs_extraction: Whether we need to extract the term from the result
  307 + """
  308 + # Only apply for e-commerce context and Chinese source
  309 + if not context or "e-commerce" not in context.lower():
  310 + return text, False
  311 +
  312 + if not source_lang or source_lang.lower() != 'zh':
  313 + return text, False
  314 +
  315 + # For single-word queries, add context to help disambiguation
  316 + text_stripped = text.strip()
  317 + if len(text_stripped.split()) == 1 and len(text_stripped) <= 2:
  318 + # Common ambiguous Chinese e-commerce terms like "车" (car vs rook)
  319 + # We add a context phrase: "购买 [term]" (buy [term]) or "商品 [term]" (product [term])
  320 + # This helps DeepL understand the e-commerce context
  321 + # We'll need to extract just the term from the translation result
  322 + context_phrase = f"购买 {text_stripped}"
  323 + return context_phrase, True
  324 +
  325 + # For multi-word queries, DeepL usually has enough context
  326 + return text, False
  327 +
  328 + def _extract_term_from_translation(
  329 + self,
  330 + translated_text: str,
  331 + original_text: str,
  332 + target_lang_code: str
  333 + ) -> str:
  334 + """
  335 + Extract the actual term from a translation that included context.
  336 +
  337 + For example, if we translated "购买 车" (buy car) and got "buy car",
  338 + we want to extract just "car".
  339 +
  340 + Args:
  341 + translated_text: Full translation result
  342 + original_text: Original single-word query
  343 + target_lang_code: Target language code (EN, ZH, etc.)
  344 +
  345 + Returns:
  346 + Extracted term or original translation if extraction fails
  347 + """
  348 + # For English target, try to extract the last word (the actual term)
  349 + if target_lang_code == "EN":
  350 + words = translated_text.strip().split()
  351 + if len(words) > 1:
  352 + # Usually the last word is the term we want
  353 + # But we need to be smart - if it's "buy car", we want "car"
  354 + # Common context words to skip: buy, purchase, product, item, etc.
  355 + context_words = {"buy", "purchase", "product", "item", "commodity", "goods"}
  356 + # Try to find the term (not a context word)
  357 + for word in reversed(words):
  358 + word_lower = word.lower().rstrip('.,!?;:')
  359 + if word_lower not in context_words:
  360 + return word_lower
  361 + # If all words are context words, return the last one
  362 + return words[-1].lower().rstrip('.,!?;:')
  363 +
  364 + # For other languages or if extraction fails, return as-is
  365 + # The user can configure a glossary for better results
  366 + return translated_text
  367 +
235 368 def get_translation_needs(
236 369 self,
237 370 detected_lang: str,
... ...
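Because `_extract_term_from_translation` returns only the last non-context word, a two-character query whose translation has several words loses part of its meaning: "购买 手机" translated as "Buy mobile phone" would be reduced to "phone". The sketch below shows one possible alternative that removes only leading context words and keeps the remainder. The function name `strip_context_prefix` is hypothetical, and the sample strings are assumed DeepL outputs rather than recorded ones.

```python
CONTEXT_WORDS = frozenset({"buy", "purchase", "product", "item", "commodity", "goods"})
PUNCTUATION = ".,!?;:"


def strip_context_prefix(translated: str) -> str:
    """Drop leading context words but keep every word of the actual term."""
    words = translated.strip().split()
    idx = 0
    # Stop before the final word so the result is never empty for non-empty input.
    while idx < len(words) - 1 and words[idx].lower().strip(PUNCTUATION) in CONTEXT_WORDS:
        idx += 1
    return " ".join(words[idx:]).rstrip(PUNCTUATION).lower()
```

This keeps "mobile phone" intact while still reducing "buy car" to "car". Articles that DeepL may insert (for example "Buy a car") are not handled here and would need their own rule.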