Commit 522a39647a24d001f7da792bc30ba54b4f01f238
1 parent
a5a3856d
Optimize multilingual search translation (add context hints for DeepL)
Showing 5 changed files with 347 additions and 18 deletions
| @@ -0,0 +1,185 @@ | @@ -0,0 +1,185 @@ | ||
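| 1 | +# DeepL Translation Optimization Guide | ||
| 2 | + | ||
| 3 | +## Problem Description | ||
| 4 | + | ||
| 5 | +In e-commerce search, DeepL can mistranslate polysemous words. For example: | ||
| 6 | +- "车" is translated as "rook" (the chess piece) instead of "car" (the vehicle) | ||
| 7 | + | ||
| 8 | +## Solution | ||
| 9 | + | ||
| 10 | +We implemented the following optimizations to improve DeepL's translation accuracy in e-commerce scenarios: | ||
| 11 | + | ||
| 12 | +### 1. Context Hints | ||
| 13 | + | ||
| 14 | +The system automatically adds e-commerce context to short single-term queries so that DeepL can infer the query's domain. | ||
| 15 | + | ||
| 16 | +**How it works:** | ||
| 17 | +- For a short single-term Chinese query (e.g., "车"), the system automatically prepends context, producing "购买 车" ("buy car") | ||
| 18 | +- Given that context, DeepL translates "车" as "car" rather than "rook" | ||
| 19 | +- After translation, the system extracts the actual query term ("car") | ||
| 20 | + | ||
| 21 | +**Configuration:** | ||
| 22 | +The translation context can be set in `config/config.yaml`: | ||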
| 23 | + | ||
| 24 | +```yaml | ||
| 25 | +query_config: | ||
| 26 | + translation_context: "e-commerce product search" # default value | ||
| 27 | +``` | ||
| 28 | + | ||
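| 29 | +### 2. Glossary Support (Recommended) | ||
| 30 | + | ||
| 31 | +DeepL supports custom glossaries that pin specific terms to a fixed translation. This is the most reliable way to resolve polysemy. | ||
| 32 | + | ||
| 33 | +#### Creating a Glossary | ||
| 34 | + | ||
| 35 | +1. **Create the glossary through the DeepL API:** | ||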
| 36 | + | ||
| 37 | +```python | ||
| 38 | +import requests | ||
| 39 | + | ||
| 40 | +# Create the glossary | ||
| 41 | +api_url = "https://api.deepl.com/v2/glossaries" | ||
| 42 | +headers = { | ||
| 43 | + "Authorization": "DeepL-Auth-Key YOUR_API_KEY", | ||
| 44 | + "Content-Type": "application/json", | ||
| 45 | +} | ||
| 46 | + | ||
| 47 | +# Glossary entries (TSV format) | ||
| 48 | +glossary_entries = """车\tcar | ||
| 49 | +手机\tmobile phone | ||
| 50 | +电脑\tcomputer""" | ||
| 51 | + | ||
| 52 | +payload = { | ||
| 53 | + "name": "e-commerce-glossary", | ||
| 54 | + "source_lang": "ZH", | ||
| 55 | + "target_lang": "EN", | ||
| 56 | + "entries": glossary_entries, | ||
| 57 | + "entries_format": "tsv" | ||
| 58 | +} | ||
| 59 | + | ||
| 60 | +response = requests.post(api_url, headers=headers, json=payload) | ||
| 61 | +if response.status_code == 201: | ||
| 62 | + glossary_id = response.json()["glossary_id"] | ||
| 63 | + print(f"Glossary created, ID: {glossary_id}") | ||
| 64 | +``` | ||
| 65 | + | ||
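| 66 | +2. **Or create it in the DeepL web interface:** | ||
| 67 | + - Log in to your DeepL Pro account | ||
| 68 | + - Open the glossary management page | ||
| 69 | + - Create a new glossary and add mappings such as "车" -> "car" | ||
| 70 | + | ||
| 71 | +#### Configuring the Glossary | ||
| 72 | + | ||
| 73 | +Set the glossary ID in `config/config.yaml`: | ||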
| 74 | + | ||
| 75 | +```yaml | ||
| 76 | +query_config: | ||
| 77 | + translation_glossary_id: "your-glossary-id-here" # DeepL glossary ID | ||
| 78 | +``` | ||
| 79 | + | ||
| 80 | +#### Glossary Format | ||
| 81 | + | ||
| 82 | +Glossaries use TSV (tab-separated values) format, one entry per line: | ||
| 83 | + | ||
| 84 | +``` | ||
| 85 | +车 car | ||
| 86 | +手机 mobile phone | ||
| 87 | +电脑 computer | ||
| 88 | +``` | ||
| 89 | + | ||
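| 90 | +**Note:** | ||
| 91 | +- The glossary feature requires a DeepL Pro (paid) account | ||
| 92 | +- The Free API does not support glossaries | ||
| 93 | + | ||
| 94 | +### 3. Automatic Context Handling | ||
| 95 | + | ||
| 96 | +The system automatically detects the following cases and applies the optimization where appropriate: | ||
| 97 | + | ||
| 98 | +- **Short single-term Chinese queries**: e-commerce context is added automatically | ||
| 99 | +- **Longer or multi-term queries**: DeepL usually has enough context, so no special handling is needed | ||
| 100 | +- **Non-Chinese queries**: the context optimization is not applied | ||
| 101 | + | ||
| 102 | +## Usage Examples | ||
| 103 | + | ||
| 104 | +### Example 1: Context Hints (Automatic) | ||
| 105 | + | ||
| 106 | +When querying "车": | ||
| 107 | +1. The system detects a short single-term Chinese query | ||
| 108 | +2. It automatically adds context: "购买 车" | ||
| 109 | +3. DeepL translates the phrase as "buy car" | ||
| 110 | +4. The system extracts the actual query term: "car" | ||
| 111 | + | ||
| 112 | +### Example 2: Glossary (Recommended) | ||
| 113 | + | ||
| 114 | +1. Create a glossary containing the mapping "车" -> "car" | ||
| 115 | +2. Set `translation_glossary_id` in the configuration | ||
| 116 | +3. When querying "车", DeepL uses the glossary directly and translates it as "car" | ||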
| 117 | + | ||
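| 118 | +## Best Practices | ||
| 119 | + | ||
| 120 | +1. **Prefer the glossary**: | ||
| 121 | + - For common e-commerce terms, a glossary is the most reliable option | ||
| 122 | + - A glossary keeps translations consistent and accurate | ||
| 123 | + | ||
| 124 | +2. **Use context hints as a supplement**: | ||
| 125 | + - Context hints help with terms that are not in the glossary | ||
| 126 | + - They are enabled by default and need no extra configuration | ||
| 127 | + | ||
| 128 | +3. **Update the glossary regularly**: | ||
| 129 | + - Keep adding term mappings based on real usage | ||
| 130 | + - Pay particular attention to brand names, product categories, and other domain-specific terms | ||
| 131 | + | ||
| 132 | +## Technical Implementation Details | ||
| 133 | + | ||
| 134 | +### Context Addition Logic | ||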
| 135 | + | ||
| 136 | +```python | ||
| 137 | +# For short single-term queries (no whitespace, at most 2 characters) | ||
| 138 | +if len(text.strip().split()) == 1 and len(text.strip()) <= 2: | ||
| 139 | + context_phrase = f"购买 {text}" # Prepend "购买" ("buy") | ||
| 140 | + return context_phrase, True # The term must be extracted from the result | ||
| 141 | +``` | ||
| 142 | + | ||
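| 143 | +### Result Extraction Logic | ||
| 144 | + | ||
| 145 | +The translation result "buy car" is processed as follows: | ||
| 146 | +1. Identify context words (buy, purchase, product, etc.) | ||
| 147 | +2. Extract the non-context word as the actual query term | ||
| 148 | +3. Return "car" | ||
| 149 | + | ||
| 150 | +## FAQ | ||
| 151 | + | ||
| 152 | +### Q: Why is "车" translated as "rook"? | ||
| 153 | + | ||
| 154 | +A: When DeepL handles a single-character query, it has no context for choosing a word sense. In Chinese, "车" can mean a car or the rook in chess. Adding e-commerce context or using a glossary resolves the ambiguity. | ||
| 155 | + | ||
| 156 | +### Q: Which is better, the glossary or context hints? | ||
| 157 | + | ||
| 158 | +A: The glossary is more reliable because it specifies the translation mapping directly. Context hints are an automatic supplement for terms that are not in the glossary. | ||
| 159 | + | ||
| 160 | +### Q: Can the Free API use glossaries? | ||
| 161 | + | ||
| 162 | +A: No. The glossary feature requires a DeepL Pro (paid) account. The Free API can only use the context hint optimization. | ||
| 163 | + | ||
| 164 | +### Q: How can I test the translation results? | ||
| 165 | + | ||
| 166 | +A: Call the search API and inspect the `translations` field in the response: | ||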
| 167 | + | ||
| 168 | +```bash | ||
| 169 | +curl -X POST http://localhost:6002/api/search \ | ||
| 170 | + -H "Content-Type: application/json" \ | ||
| 171 | + -d '{"query": "车", "tenant_id": "test"}' | ||
| 172 | +``` | ||
| 173 | + | ||
| 174 | +## Related Files | ||
| 175 | + | ||
| 176 | +- `query/translator.py` - translator implementation | ||
| 177 | +- `query/query_parser.py` - query parser (calls the translator) | ||
| 178 | +- `config/config.yaml` - configuration file | ||
| 179 | +- `config/config_loader.py` - configuration loader | ||
| 180 | + | ||
| 181 | +## References | ||
| 182 | + | ||
| 183 | +- [DeepL API documentation](https://www.deepl.com/docs-api) | ||
| 184 | +- [DeepL glossary feature](https://www.deepl.com/docs-api/managing-glossaries/) | ||
| 185 | + |
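The guide describes wrapping a short query in a purchase phrase and then stripping the context words from the translation. A minimal standalone sketch of that round trip follows, without any network call. The names `add_context`, `extract_term`, and `CONTEXT_WORDS` are illustrative and are not the identifiers used in `query/translator.py`. Unlike the committed extraction logic, which returns a single word, this sketch keeps every non-context word so that multi-word results such as "mobile phone" survive; that difference is a deliberate assumption, not the author's behavior.

```python
from typing import Optional, Tuple

# Assumed English context words; mirrors the examples listed in the guide.
CONTEXT_WORDS = {"buy", "purchase", "product", "item", "commodity", "goods"}


def add_context(text: str, source_lang: Optional[str]) -> Tuple[str, bool]:
    """Wrap a short single-term Chinese query in a purchase phrase.

    Returns (text_to_translate, needs_extraction).
    """
    stripped = text.strip()
    is_chinese = (source_lang or "").lower() == "zh"
    # Same shape as the guide's check: one whitespace-separated term, at most 2 characters.
    if is_chinese and len(stripped.split()) == 1 and 0 < len(stripped) <= 2:
        return f"购买 {stripped}", True
    return text, False


def extract_term(translated: str) -> str:
    """Drop context words from an English translation and keep the rest."""
    words = [w.strip(".,!?;:") for w in translated.split()]
    kept = [w for w in words if w and w.lower() not in CONTEXT_WORDS]
    # If only context words remain, fall back to the whole translation.
    return " ".join(kept).lower() if kept else translated.strip().lower()
```

Calling `add_context("车", "zh")` produces the wrapped phrase with the extraction flag set, and `extract_term` is then applied to whatever DeepL returns for that phrase. Non-Chinese or multi-term input passes through unchanged.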
config/config.yaml
| @@ -242,6 +242,8 @@ query_config: | @@ -242,6 +242,8 @@ query_config: | ||
| 242 | # Translation API (DeepL) | 242 | # Translation API (DeepL) |
| 243 | translation_service: "deepl" | 243 | translation_service: "deepl" |
| 244 | translation_api_key: null # Set via environment variable | 244 | translation_api_key: null # Set via environment variable |
| 245 | + # translation_glossary_id: null # Optional: DeepL glossary ID for custom terminology (e.g., "车" -> "car") | ||
| 246 | + # translation_context: "e-commerce product search" # Context hint for better translation disambiguation | ||
| 245 | 247 | ||
| 246 | # Ranking Configuration | 248 | # Ranking Configuration |
| 247 | ranking: | 249 | ranking: |
config/config_loader.py
| @@ -51,6 +51,8 @@ class QueryConfig: | @@ -51,6 +51,8 @@ class QueryConfig: | ||
| 51 | # Translation API settings | 51 | # Translation API settings |
| 52 | translation_api_key: Optional[str] = None | 52 | translation_api_key: Optional[str] = None |
| 53 | translation_service: str = "deepl" # deepl, google, etc. | 53 | translation_service: str = "deepl" # deepl, google, etc. |
| 54 | + translation_glossary_id: Optional[str] = None # DeepL glossary ID for custom terminology | ||
| 55 | + translation_context: str = "e-commerce product search" # Context hint for translation | ||
| 54 | 56 | ||
| 55 | # ES source fields configuration - fields to return in search results | 57 | # ES source fields configuration - fields to return in search results |
| 56 | source_fields: List[str] = field(default_factory=lambda: [ | 58 | source_fields: List[str] = field(default_factory=lambda: [ |
| @@ -209,7 +211,9 @@ class ConfigLoader: | @@ -209,7 +211,9 @@ class ConfigLoader: | ||
| 209 | enable_query_rewrite=query_config_data.get("enable_query_rewrite", True), | 211 | enable_query_rewrite=query_config_data.get("enable_query_rewrite", True), |
| 210 | rewrite_dictionary=rewrite_dictionary, | 212 | rewrite_dictionary=rewrite_dictionary, |
| 211 | translation_api_key=query_config_data.get("translation_api_key"), | 213 | translation_api_key=query_config_data.get("translation_api_key"), |
| 212 | - translation_service=query_config_data.get("translation_service", "deepl") | 214 | + translation_service=query_config_data.get("translation_service", "deepl"), |
| 215 | + translation_glossary_id=query_config_data.get("translation_glossary_id"), | ||
| 216 | + translation_context=query_config_data.get("translation_context", "e-commerce product search") | ||
| 213 | ) | 217 | ) |
| 214 | 218 | ||
| 215 | # Parse ranking config | 219 | # Parse ranking config |
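Because both new keys are commented out in `config/config.yaml`, the effective values come from the loader's defaults. The sketch below illustrates that fallback in isolation. `TranslationSettings` and `load_translation_settings` are hypothetical stand-ins for the relevant part of `QueryConfig` and `ConfigLoader`, not the project's actual classes. One assumption goes beyond the diff: the sketch uses `or` so that a key present but set to `null` in YAML also falls back to the default, whereas `dict.get(key, default)` alone would return `None` in that case.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

DEFAULT_CONTEXT = "e-commerce product search"


@dataclass
class TranslationSettings:
    """Hypothetical stand-in for the two new QueryConfig fields."""
    translation_glossary_id: Optional[str] = None
    translation_context: str = DEFAULT_CONTEXT


def load_translation_settings(query_config_data: Dict[str, Any]) -> TranslationSettings:
    """Read the translation settings from the parsed `query_config` mapping."""
    return TranslationSettings(
        # A missing key yields None, which means no glossary is used.
        translation_glossary_id=query_config_data.get("translation_glossary_id"),
        # `or` covers both a missing key and an explicit null value.
        translation_context=query_config_data.get("translation_context") or DEFAULT_CONTEXT,
    )
```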
query/query_parser.py
| @@ -98,7 +98,9 @@ class QueryParser: | @@ -98,7 +98,9 @@ class QueryParser: | ||
| 98 | print("[QueryParser] Initializing translator...") | 98 | print("[QueryParser] Initializing translator...") |
| 99 | self._translator = Translator( | 99 | self._translator = Translator( |
| 100 | api_key=self.query_config.translation_api_key, | 100 | api_key=self.query_config.translation_api_key, |
| 101 | - use_cache=True | 101 | + use_cache=True, |
| 102 | + glossary_id=getattr(self.query_config, 'translation_glossary_id', None), | ||
| 103 | + translation_context=getattr(self.query_config, 'translation_context', 'e-commerce product search') | ||
| 102 | ) | 104 | ) |
| 103 | return self._translator | 105 | return self._translator |
| 104 | 106 | ||
| @@ -195,10 +197,13 @@ class QueryParser: | @@ -195,10 +197,13 @@ class QueryParser: | ||
| 195 | 197 | ||
| 196 | if target_langs: | 198 | if target_langs: |
| 197 | log_info(f"开始翻译 | 源语言: {detected_lang} | 目标语言: {target_langs}") | 199 | log_info(f"开始翻译 | 源语言: {detected_lang} | 目标语言: {target_langs}") |
| 200 | + # Use e-commerce context for better disambiguation | ||
| 201 | + translation_context = getattr(self.query_config, 'translation_context', 'e-commerce product search') | ||
| 198 | translations = self.translator.translate_multi( | 202 | translations = self.translator.translate_multi( |
| 199 | query_text, | 203 | query_text, |
| 200 | target_langs, | 204 | target_langs, |
| 201 | - source_lang=detected_lang | 205 | + source_lang=detected_lang, |
| 206 | + context=translation_context | ||
| 202 | ) | 207 | ) |
| 203 | log_info(f"翻译完成 | 结果: {translations}") | 208 | log_info(f"翻译完成 | 结果: {translations}") |
| 204 | if context: | 209 | if context: |
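The parser now forwards the configured hint to `translate_multi`. As a possible alternative to rewriting the query text, DeepL's API reference lists an optional `context` field on `/v2/translate` that influences the translation without being translated itself. Whether that field is available and effective for this deployment is an assumption to verify against the current DeepL documentation. The reference also states that `glossary_id` requires `source_lang` to be set, so the sketch only attaches the glossary when a source language is known. `build_translate_payload` is a hypothetical helper, not part of this commit.

```python
from typing import Any, Dict, Optional


def build_translate_payload(
    text: str,
    target_code: str,
    source_code: Optional[str] = None,
    context: Optional[str] = None,
    glossary_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON body for DeepL's /v2/translate endpoint (sketch only)."""
    payload: Dict[str, Any] = {"text": [text], "target_lang": target_code}
    if source_code:
        payload["source_lang"] = source_code
    if context:
        # Sent as a separate hint; the query text itself stays untouched.
        payload["context"] = context
    if glossary_id and source_code:
        # Glossaries are tied to a language pair, so skip them without a source language.
        payload["glossary_id"] = glossary_id
    return payload
```

With this shape, no term extraction step is needed afterwards, because the text sent for translation is exactly the user's query.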
query/translator.py
| @@ -32,7 +32,9 @@ class Translator: | @@ -32,7 +32,9 @@ class Translator: | ||
| 32 | self, | 32 | self, |
| 33 | api_key: Optional[str] = None, | 33 | api_key: Optional[str] = None, |
| 34 | use_cache: bool = True, | 34 | use_cache: bool = True, |
| 35 | - timeout: int = 10 | 35 | + timeout: int = 10, |
| 36 | + glossary_id: Optional[str] = None, | ||
| 37 | + translation_context: Optional[str] = None | ||
| 36 | ): | 38 | ): |
| 37 | """ | 39 | """ |
| 38 | Initialize translator. | 40 | Initialize translator. |
| @@ -41,6 +43,8 @@ class Translator: | @@ -41,6 +43,8 @@ class Translator: | ||
| 41 | api_key: DeepL API key (or None to use from config/env) | 43 | api_key: DeepL API key (or None to use from config/env) |
| 42 | use_cache: Whether to cache translations | 44 | use_cache: Whether to cache translations |
| 43 | timeout: Request timeout in seconds | 45 | timeout: Request timeout in seconds |
| 46 | + glossary_id: DeepL glossary ID for custom terminology (optional) | ||
| 47 | + translation_context: Context hint for translation (e.g., "e-commerce", "product search") | ||
| 44 | """ | 48 | """ |
| 45 | # Get API key from config if not provided | 49 | # Get API key from config if not provided |
| 46 | if api_key is None: | 50 | if api_key is None: |
| @@ -53,6 +57,8 @@ class Translator: | @@ -53,6 +57,8 @@ class Translator: | ||
| 53 | self.api_key = api_key | 57 | self.api_key = api_key |
| 54 | self.timeout = timeout | 58 | self.timeout = timeout |
| 55 | self.use_cache = use_cache | 59 | self.use_cache = use_cache |
| 60 | + self.glossary_id = glossary_id | ||
| 61 | + self.translation_context = translation_context or "e-commerce product search" | ||
| 56 | 62 | ||
| 57 | if use_cache: | 63 | if use_cache: |
| 58 | self.cache = DictCache(".cache/translations.json") | 64 | self.cache = DictCache(".cache/translations.json") |
| @@ -63,7 +69,8 @@ class Translator: | @@ -63,7 +69,8 @@ class Translator: | ||
| 63 | self, | 69 | self, |
| 64 | text: str, | 70 | text: str, |
| 65 | target_lang: str, | 71 | target_lang: str, |
| 66 | - source_lang: Optional[str] = None | 72 | + source_lang: Optional[str] = None, |
| 73 | + context: Optional[str] = None | ||
| 67 | ) -> Optional[str]: | 74 | ) -> Optional[str]: |
| 68 | """ | 75 | """ |
| 69 | Translate text to target language. | 76 | Translate text to target language. |
| @@ -72,6 +79,7 @@ class Translator: | @@ -72,6 +79,7 @@ class Translator: | ||
| 72 | text: Text to translate | 79 | text: Text to translate |
| 73 | target_lang: Target language code ('zh', 'en', 'ru', etc.) | 80 | target_lang: Target language code ('zh', 'en', 'ru', etc.) |
| 74 | source_lang: Source language code (optional, auto-detect if None) | 81 | source_lang: Source language code (optional, auto-detect if None) |
| 82 | + context: Additional context for translation (overrides default context) | ||
| 75 | 83 | ||
| 76 | Returns: | 84 | Returns: |
| 77 | Translated text or None if translation fails | 85 | Translated text or None if translation fails |
| @@ -84,9 +92,12 @@ class Translator: | @@ -84,9 +92,12 @@ class Translator: | ||
| 84 | if source_lang: | 92 | if source_lang: |
| 85 | source_lang = source_lang.lower() | 93 | source_lang = source_lang.lower() |
| 86 | 94 | ||
| 87 | - # Check cache | 95 | + # Use provided context or default context |
| 96 | + translation_context = context or self.translation_context | ||
| 97 | + | ||
| 98 | + # Check cache (include context in cache key for accuracy) | ||
| 88 | if self.use_cache: | 99 | if self.use_cache: |
| 89 | - cache_key = f"{source_lang or 'auto'}:{target_lang}:{text}" | 100 | + cache_key = f"{source_lang or 'auto'}:{target_lang}:{translation_context}:{text}" |
| 90 | cached = self.cache.get(cache_key, category="translations") | 101 | cached = self.cache.get(cache_key, category="translations") |
| 91 | if cached: | 102 | if cached: |
| 92 | return cached | 103 | return cached |
| @@ -97,12 +108,12 @@ class Translator: | @@ -97,12 +108,12 @@ class Translator: | ||
| 97 | return text | 108 | return text |
| 98 | 109 | ||
| 99 | # Translate using DeepL with fallback | 110 | # Translate using DeepL with fallback |
| 100 | - result = self._translate_deepl(text, target_lang, source_lang) | 111 | + result = self._translate_deepl(text, target_lang, source_lang, translation_context) |
| 101 | 112 | ||
| 102 | # If translation failed, try fallback to free API | 113 | # If translation failed, try fallback to free API |
| 103 | if result is None and "api.deepl.com" in self.DEEPL_API_URL: | 114 | if result is None and "api.deepl.com" in self.DEEPL_API_URL: |
| 104 | print(f"[Translator] Pro API failed, trying free API...") | 115 | print(f"[Translator] Pro API failed, trying free API...") |
| 105 | - result = self._translate_deepl_free(text, target_lang, source_lang) | 116 | + result = self._translate_deepl_free(text, target_lang, source_lang, translation_context) |
| 106 | 117 | ||
| 107 | # If still failed, return original text with warning | 118 | # If still failed, return original text with warning |
| 108 | if result is None: | 119 | if result is None: |
| @@ -111,7 +122,7 @@ class Translator: | @@ -111,7 +122,7 @@ class Translator: | ||
| 111 | 122 | ||
| 112 | # Cache result | 123 | # Cache result |
| 113 | if result and self.use_cache: | 124 | if result and self.use_cache: |
| 114 | - cache_key = f"{source_lang or 'auto'}:{target_lang}:{text}" | 125 | + cache_key = f"{source_lang or 'auto'}:{target_lang}:{translation_context}:{text}" |
| 115 | self.cache.set(cache_key, result, category="translations") | 126 | self.cache.set(cache_key, result, category="translations") |
| 116 | 127 | ||
| 117 | return result | 128 | return result |
| @@ -120,9 +131,18 @@ class Translator: | @@ -120,9 +131,18 @@ class Translator: | ||
| 120 | self, | 131 | self, |
| 121 | text: str, | 132 | text: str, |
| 122 | target_lang: str, | 133 | target_lang: str, |
| 123 | - source_lang: Optional[str] | 134 | + source_lang: Optional[str], |
| 135 | + context: Optional[str] = None | ||
| 124 | ) -> Optional[str]: | 136 | ) -> Optional[str]: |
| 125 | - """Translate using DeepL API.""" | 137 | + """ |
| 138 | + Translate using DeepL API with context and glossary support. | ||
| 139 | + | ||
| 140 | + Args: | ||
| 141 | + text: Text to translate | ||
| 142 | + target_lang: Target language code | ||
| 143 | + source_lang: Source language code (optional) | ||
| 144 | + context: Context hint for translation (e.g., "e-commerce product search") | ||
| 145 | + """ | ||
| 126 | # Map to DeepL language codes | 146 | # Map to DeepL language codes |
| 127 | target_code = self.LANG_CODE_MAP.get(target_lang, target_lang.upper()) | 147 | target_code = self.LANG_CODE_MAP.get(target_lang, target_lang.upper()) |
| 128 | 148 | ||
| @@ -131,8 +151,13 @@ class Translator: | @@ -131,8 +151,13 @@ class Translator: | ||
| 131 | "Content-Type": "application/json", | 151 | "Content-Type": "application/json", |
| 132 | } | 152 | } |
| 133 | 153 | ||
| 154 | + # Build text with context for better disambiguation | ||
| 155 | + # For e-commerce, add context words to help DeepL understand the domain | ||
| 156 | + # This is especially important for single-word ambiguous terms like "车" (car vs rook) | ||
| 157 | + text_to_translate, needs_extraction = self._add_ecommerce_context(text, source_lang, context) | ||
| 158 | + | ||
| 134 | payload = { | 159 | payload = { |
| 135 | - "text": [text], | 160 | + "text": [text_to_translate], |
| 136 | "target_lang": target_code, | 161 | "target_lang": target_code, |
| 137 | } | 162 | } |
| 138 | 163 | ||
| @@ -140,6 +165,16 @@ class Translator: | @@ -140,6 +165,16 @@ class Translator: | ||
| 140 | source_code = self.LANG_CODE_MAP.get(source_lang, source_lang.upper()) | 165 | source_code = self.LANG_CODE_MAP.get(source_lang, source_lang.upper()) |
| 141 | payload["source_lang"] = source_code | 166 | payload["source_lang"] = source_code |
| 142 | 167 | ||
| 168 | + # Add glossary if configured | ||
| 169 | + if self.glossary_id: | ||
| 170 | + payload["glossary_id"] = self.glossary_id | ||
| 171 | + | ||
| 172 | + # Note: DeepL API v2 doesn't have a direct "context" parameter, | ||
| 173 | + # but we can improve translation by: | ||
| 174 | + # 1. Using glossary for domain-specific terms (best solution) | ||
| 175 | + # 2. Adding context words to the text (for single-word queries) - implemented in _add_ecommerce_context | ||
| 176 | + # 3. Using more specific source language detection | ||
| 177 | + | ||
| 143 | try: | 178 | try: |
| 144 | response = requests.post( | 179 | response = requests.post( |
| 145 | self.DEEPL_API_URL, | 180 | self.DEEPL_API_URL, |
| @@ -151,7 +186,13 @@ class Translator: | @@ -151,7 +186,13 @@ class Translator: | ||
| 151 | if response.status_code == 200: | 186 | if response.status_code == 200: |
| 152 | data = response.json() | 187 | data = response.json() |
| 153 | if "translations" in data and len(data["translations"]) > 0: | 188 | if "translations" in data and len(data["translations"]) > 0: |
| 154 | - return data["translations"][0]["text"] | 189 | + translated_text = data["translations"][0]["text"] |
| 190 | + # If we added context, extract just the term from the result | ||
| 191 | + if needs_extraction: | ||
| 192 | + translated_text = self._extract_term_from_translation( | ||
| 193 | + translated_text, text, target_code | ||
| 194 | + ) | ||
| 195 | + return translated_text | ||
| 155 | else: | 196 | else: |
| 156 | print(f"[Translator] DeepL API error: {response.status_code} - {response.text}") | 197 | print(f"[Translator] DeepL API error: {response.status_code} - {response.text}") |
| 157 | return None | 198 | return None |
| @@ -167,9 +208,14 @@ class Translator: | @@ -167,9 +208,14 @@ class Translator: | ||
| 167 | self, | 208 | self, |
| 168 | text: str, | 209 | text: str, |
| 169 | target_lang: str, | 210 | target_lang: str, |
| 170 | - source_lang: Optional[str] | 211 | + source_lang: Optional[str], |
| 212 | + context: Optional[str] = None | ||
| 171 | ) -> Optional[str]: | 213 | ) -> Optional[str]: |
| 172 | - """Translate using DeepL Free API.""" | 214 | + """ |
| 215 | + Translate using DeepL Free API. | ||
| 216 | + | ||
| 217 | + Note: Free API may not support glossary_id parameter. | ||
| 218 | + """ | ||
| 173 | # Map to DeepL language codes | 219 | # Map to DeepL language codes |
| 174 | target_code = self.LANG_CODE_MAP.get(target_lang, target_lang.upper()) | 220 | target_code = self.LANG_CODE_MAP.get(target_lang, target_lang.upper()) |
| 175 | 221 | ||
| @@ -187,6 +233,9 @@ class Translator: | @@ -187,6 +233,9 @@ class Translator: | ||
| 187 | source_code = self.LANG_CODE_MAP.get(source_lang, source_lang.upper()) | 233 | source_code = self.LANG_CODE_MAP.get(source_lang, source_lang.upper()) |
| 188 | payload["source_lang"] = source_code | 234 | payload["source_lang"] = source_code |
| 189 | 235 | ||
| 236 | + # Note: Free API typically doesn't support glossary_id | ||
| 237 | + # But we can still use context hints in the text | ||
| 238 | + | ||
| 190 | try: | 239 | try: |
| 191 | response = requests.post( | 240 | response = requests.post( |
| 192 | "https://api-free.deepl.com/v2/translate", | 241 | "https://api-free.deepl.com/v2/translate", |
| @@ -214,7 +263,8 @@ class Translator: | @@ -214,7 +263,8 @@ class Translator: | ||
| 214 | self, | 263 | self, |
| 215 | text: str, | 264 | text: str, |
| 216 | target_langs: List[str], | 265 | target_langs: List[str], |
| 217 | - source_lang: Optional[str] = None | 266 | + source_lang: Optional[str] = None, |
| 267 | + context: Optional[str] = None | ||
| 218 | ) -> Dict[str, Optional[str]]: | 268 | ) -> Dict[str, Optional[str]]: |
| 219 | """ | 269 | """ |
| 220 | Translate text to multiple target languages. | 270 | Translate text to multiple target languages. |
| @@ -223,15 +273,98 @@ class Translator: | @@ -223,15 +273,98 @@ class Translator: | ||
| 223 | text: Text to translate | 273 | text: Text to translate |
| 224 | target_langs: List of target language codes | 274 | target_langs: List of target language codes |
| 225 | source_lang: Source language code (optional) | 275 | source_lang: Source language code (optional) |
| 276 | + context: Context hint for translation (optional) | ||
| 226 | 277 | ||
| 227 | Returns: | 278 | Returns: |
| 228 | Dictionary mapping language code to translated text | 279 | Dictionary mapping language code to translated text |
| 229 | """ | 280 | """ |
| 230 | results = {} | 281 | results = {} |
| 231 | for lang in target_langs: | 282 | for lang in target_langs: |
| 232 | - results[lang] = self.translate(text, lang, source_lang) | 283 | + results[lang] = self.translate(text, lang, source_lang, context) |
| 233 | return results | 284 | return results |
| 234 | 285 | ||
| 286 | + def _add_ecommerce_context( | ||
| 287 | + self, | ||
| 288 | + text: str, | ||
| 289 | + source_lang: Optional[str], | ||
| 290 | + context: Optional[str] | ||
| 291 | + ) -> tuple: | ||
| 292 | + """ | ||
| 293 | + Add e-commerce context to text for better disambiguation. | ||
| 294 | + | ||
| 295 | + For single-word ambiguous Chinese terms, we add context words that help | ||
| 296 | + DeepL understand this is an e-commerce/product search context. | ||
| 297 | + | ||
| 298 | + Args: | ||
| 299 | + text: Original text to translate | ||
| 300 | + source_lang: Source language code | ||
| 301 | + context: Context hint | ||
| 302 | + | ||
| 303 | + Returns: | ||
| 304 | + Tuple of (text_with_context, needs_extraction) | ||
| 305 | + - text_with_context: Text to send to DeepL | ||
| 306 | + - needs_extraction: Whether we need to extract the term from the result | ||
| 307 | + """ | ||
| 308 | + # Only apply for e-commerce context and Chinese source | ||
| 309 | + if not context or "e-commerce" not in context.lower(): | ||
| 310 | + return text, False | ||
| 311 | + | ||
| 312 | + if not source_lang or source_lang.lower() != 'zh': | ||
| 313 | + return text, False | ||
| 314 | + | ||
| 315 | + # For single-word queries, add context to help disambiguation | ||
| 316 | + text_stripped = text.strip() | ||
| 317 | + if len(text_stripped.split()) == 1 and len(text_stripped) <= 2: | ||
| 318 | + # Common ambiguous Chinese e-commerce terms like "车" (car vs rook) | ||
| 319 | + # We add a context phrase: "购买 [term]" (buy [term]) or "商品 [term]" (product [term]) | ||
| 320 | + # This helps DeepL understand the e-commerce context | ||
| 321 | + # We'll need to extract just the term from the translation result | ||
| 322 | + context_phrase = f"购买 {text_stripped}" | ||
| 323 | + return context_phrase, True | ||
| 324 | + | ||
| 325 | + # For multi-word queries, DeepL usually has enough context | ||
| 326 | + return text, False | ||
| 327 | + | ||
| 328 | + def _extract_term_from_translation( | ||
| 329 | + self, | ||
| 330 | + translated_text: str, | ||
| 331 | + original_text: str, | ||
| 332 | + target_lang_code: str | ||
| 333 | + ) -> str: | ||
| 334 | + """ | ||
| 335 | + Extract the actual term from a translation that included context. | ||
| 336 | + | ||
| 337 | + For example, if we translated "购买 车" (buy car) and got "buy car", | ||
| 338 | + we want to extract just "car". | ||
| 339 | + | ||
| 340 | + Args: | ||
| 341 | + translated_text: Full translation result | ||
| 342 | + original_text: Original single-word query | ||
| 343 | + target_lang_code: Target language code (EN, ZH, etc.) | ||
| 344 | + | ||
| 345 | + Returns: | ||
| 346 | + Extracted term or original translation if extraction fails | ||
| 347 | + """ | ||
| 348 | + # For English target, try to extract the last word (the actual term) | ||
| 349 | + if target_lang_code == "EN": | ||
| 350 | + words = translated_text.strip().split() | ||
| 351 | + if len(words) > 1: | ||
| 352 | + # Usually the last word is the term we want | ||
| 353 | + # But we need to be smart - if it's "buy car", we want "car" | ||
| 354 | + # Common context words to skip: buy, purchase, product, item, etc. | ||
| 355 | + context_words = {"buy", "purchase", "product", "item", "commodity", "goods"} | ||
| 356 | + # Try to find the term (not a context word) | ||
| 357 | + for word in reversed(words): | ||
| 358 | + word_lower = word.lower().rstrip('.,!?;:') | ||
| 359 | + if word_lower not in context_words: | ||
| 360 | + return word_lower | ||
| 361 | + # If all words are context words, return the last one | ||
| 362 | + return words[-1].lower().rstrip('.,!?;:') | ||
| 363 | + | ||
| 364 | + # For other languages or if extraction fails, return as-is | ||
| 365 | + # The user can configure a glossary for better results | ||
| 366 | + return translated_text | ||
| 367 | + | ||
| 235 | def get_translation_needs( | 368 | def get_translation_needs( |
| 236 | self, | 369 | self, |
| 237 | detected_lang: str, | 370 | detected_lang: str, |