nllb-200-distilled-600M performance optimization
Done (2026-03)
- CTranslate2 migration + float16 conversion
- Extended load-test report: `perf_reports/20260318/translation_local_models_ct2/README.md`
- T4-focused tuning report: `perf_reports/20260318/translation_local_models_ct2_focus/README.md`
- NLLB T4 product-title report: `perf_reports/20260318/nllb_t4_product_names_ct2/README.md`
- Current conclusions:
  - Recommended online default for NLLB: `ct2_inter_threads=4 + ct2_max_queued_batches=32 + ct2_batch_type=examples + ct2_decoding_length_mode=source(+8,min=32)`
  - `opus-mt-zh-en`: keeping the conservative defaults is more stable
  - `opus-mt-en-zh`: if offline throughput matters, profile it separately
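
A minimal sketch of how the recommended NLLB defaults could map onto CTranslate2 arguments. The mapping is an assumption: `ct2_inter_threads` and `ct2_max_queued_batches` are taken to be the `inter_threads` / `max_queued_batches` arguments of `ctranslate2.Translator`, `ct2_batch_type` the `batch_type` argument of `translate_batch`, and `source(+8,min=32)` is read as "longest source length in the batch + 8, but at least 32". The helper names are hypothetical, not the project's code.

```python
from typing import Dict, List


def max_decoding_length(batch_tokens: List[List[str]], extra: int = 8, minimum: int = 32) -> int:
    """Assumed reading of source(+8,min=32): longest source length + extra, floored at minimum."""
    longest = max((len(tokens) for tokens in batch_tokens), default=0)
    return max(longest + extra, minimum)


def translate_kwargs(batch_tokens: List[List[str]]) -> Dict[str, object]:
    """Per-call arguments for Translator.translate_batch under the recommended defaults."""
    return {
        "batch_type": "examples",  # ct2_batch_type=examples
        "max_decoding_length": max_decoding_length(batch_tokens),
    }


def load_translator(model_dir: str):
    """Build the translator; imported lazily so the helpers above work without ctranslate2."""
    import ctranslate2

    return ctranslate2.Translator(
        model_dir,
        device="cuda",
        compute_type="float16",   # matches the float16 conversion noted above
        inter_threads=4,          # ct2_inter_threads=4
        max_queued_batches=32,    # ct2_max_queued_batches=32
    )
```

Usage would look roughly like `load_translator(model_dir).translate_batch(batch_tokens, target_prefix=[[tgt_lang]] * len(batch_tokens), **translate_kwargs(batch_tokens))`, where `batch_tokens` are the tokenizer's source tokens and `tgt_lang` is an NLLB language code.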

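Research performance optimization options for seq2seq transformer models such as nllb-200-distilled-600M: how to raise throughput and lower latency for the online translation service. Also survey online inference serving solutions and find a high-performance way to serve the model.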

cnclip performance optimization

rerank performance optimization


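Timeouts
1) Hard timeout while the query-analysis stage waits for translation/embedding
Config file: config/config.yaml
Config key: query_config.async_wait_timeout_ms: 80
Where it takes effect: query/query_parser.py converts this value to seconds and passes it to wait(...)
2) Embedding HTTP call timeout (Text/Image)
Environment-variable overrides are no longer used (the previously mentioned EMBEDDING_HTTP_TIMEOUT_SEC has been dropped)
Config file: config/config.yaml
Config key: services.embedding.providers.http.timeout_sec (an example default of 60 has been added to the YAML)
Where it takes effect:
embeddings/text_encoder.py: requests.post(..., timeout=self.timeout_sec)
embeddings/image_encoder.py: requests.post(..., timeout=self.timeout_sec)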
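
A minimal sketch of the hard wait timeout from item 1, assuming `wait(...)` refers to `concurrent.futures.wait`. The function and task names are hypothetical; only the 80 ms value and the milliseconds-to-seconds conversion come from the notes above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

ASYNC_WAIT_TIMEOUT_MS = 80  # query_config.async_wait_timeout_ms


def wait_for_tasks(futures, timeout_ms=ASYNC_WAIT_TIMEOUT_MS):
    """Wait at most timeout_ms; return (done, not_done) sets of futures."""
    # concurrent.futures.wait takes seconds, so convert from milliseconds.
    return wait(futures, timeout=timeout_ms / 1000.0)


def _fake_task(delay_sec, value):
    """Stand-in for a translation or embedding call (hypothetical)."""
    time.sleep(delay_sec)
    return value


def demo():
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast = pool.submit(_fake_task, 0.0, "translation")
        slow = pool.submit(_fake_task, 0.5, "embedding")
        done, not_done = wait_for_tasks([fast, slow])
        # The fast task finishes inside the budget; the slow one is abandoned by the caller.
        return fast in done, slow in not_done
```

Tasks left in `not_done` keep running in the pool; the caller simply proceeds without their results once the budget is spent.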


product_enrich : Partial Mode   :   done
https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR
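In the messages array, set the role of the last message to assistant, put the prefix in its content, and set "partial": true on that message. The messages format is:
[
    {
        "role": "user",
        "content": "Complete this Fibonacci function; do not add anything else"
    },
    {
        "role": "assistant",
        "content": "def calculate_fibonacci(n):\n    if n <= 1:\n        return n\n    else:\n",
        "partial": true
    }
]
The model starts generating from the prefix content.
Supported in non-thinking mode.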
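
A small sketch of building the messages array described above in Python. `build_partial_messages` is a hypothetical helper, not part of any SDK; it only assembles the payload and does not send a request.

```python
import json


def build_partial_messages(user_prompt, prefix):
    """Return a messages list whose last entry is an assistant prefix with partial=True."""
    return [
        {"role": "user", "content": user_prompt},
        # The last message must be role=assistant and carry "partial": true.
        {"role": "assistant", "content": prefix, "partial": True},
    ]


messages = build_partial_messages(
    "Complete this Fibonacci function; do not add anything else",
    "def calculate_fibonacci(n):\n    if n <= 1:\n        return n\n    else:\n",
)
payload = json.dumps(messages, ensure_ascii=False)
```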


Fusion scoring (done, 2026-03)
1. `fuse_scores_and_resort` now uses multiplicative fusion and extracts the following from `matched_queries`:
   - `base_query`
   - `base_query_trans_*`
   - `fallback_original_query_*`
   - `knn_query`
2. The main text-relevance score no longer depends on `phrase_query` / `keywords_query`; both query types have been removed.
3. Current fusion strategy:
   - `text_score = primary(weighted_source, weighted_translation, weighted_fallback) + 0.25 * support`
   - `fused_score = (rerank_score + 0.00001) * (text_score + 0.1) ** 0.35 * (knn_score + 0.6) ** 0.2`
4. `track_scores` and `include_named_queries_score` are wired in; the debug fields and evaluation method have been synced to:
   - `docs/相关性检索优化说明.md`
   - `docs/搜索API对接指南.md`
   - `docs/Usage-Guide.md`
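
The fusion formula in item 3 can be sketched as plain Python. How `primary(...)` combines its three inputs is not specified in these notes, so the sketch takes its result as an input; the function names are illustrative, not the actual `fuse_scores_and_resort` code.

```python
def text_score(primary, support):
    """text_score = primary(...) + 0.25 * support, with primary already computed."""
    return primary + 0.25 * support


def fused_score(rerank_score, text, knn_score):
    """Multiplicative fusion of rerank, text, and knn scores."""
    return (rerank_score + 0.00001) * (text + 0.1) ** 0.35 * (knn_score + 0.6) ** 0.2


# With text=0.9 and knn=0.4 both offset terms equal 1.0, so only the rerank term remains.
example = fused_score(1.0, 0.9, 0.4)
```

The additive offsets keep a missing signal from zeroing the product: a document with no text or knn match is damped rather than eliminated, while the rerank score stays the dominant factor.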


suggest index: currently a full-rebuild script; to be handed over to Jin Wei (金伟)


Translation: add facebook/nllb-200-distilled-600M
https://blog.csdn.net/qq_42746084/article/details/154947534
https://huggingface.co/facebook/nllb-200-distilled-600M


Store languages: English accounts for up to 80%, so add a dedicated en-zh model
https://huggingface.co/Helsinki-NLP/opus-mt-zh-en
https://huggingface.co/Helsinki-NLP/opus-mt-en-zh


opus-mt-en-zh

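from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Local model directory for opus-mt-en-zh (English -> Chinese)
model_name = "./models/opus-mt-en-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

data = "test"  # English source text
# Tokenize a batch of one sentence into PyTorch tensors
encoded = tokenizer([data], return_tensors="pt")
# Generate with the model's default decoding settings
translation = model.generate(**encoded)
# Decode token ids back to text, dropping special tokens
result = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(result)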


Qwen3-Reranker-4B-GGUF
https://modelscope.cn/models/dengcao/Qwen3-Reranker-4B-GGUF/summary
1. Decide which quantization scheme to use
2. Settle the prompt


reranker addendum: nvidia/llama-nemotron-rerank-1b-v2
Encoder architecture.
Fairly new.
Better performance.
Beats qwen-reranker-4b on the Amazon e-commerce search dataset.
Supports vLLM.


Check the translation cache status

Embedding (vector) cache


AI - Production - MySQL
HOST: 10.200.16.14 / localhost
Port: 3316
Username: root
密码：qY8tgodLoA&KT#yQ

AI - Production - Redis
HOST: 10.200.16.14 / localhost
Port: 6479
密码：dxEkegEZ@C5SXWKv


Remote login:
# redis
redis-cli -h 43.166.252.75 -p 6479

# mysql: 3 users, all of which can log in remotely
mysql -uroot -p'qY8tgodLoA&KT#yQ'
CREATE USER 'saas'@'%' IDENTIFIED BY '6dlpco6dVGuqzt^l';
CREATE USER 'sa'@'%' IDENTIFIED BY 'C#HU!GPps7ck8tsM';


ES:
HOST: 10.200.16.14 / localhost
Port: 9200
Access example:
用户名密码：saas:4hOaLaf41y2VuI8y


Install nvidia-container-toolkit (done)
https://mirrors.aliyun.com/github/releases/NVIDIA/nvidia-container-toolkit/
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html


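qwen3-embedding, qwen3-reranker (done)
Pick an inference engine. Compared with calling sentence-transformers directly, the main gains are multi-process serving, load balancing, and continuous batching, which are quite useful.
Current conclusion: prefer TEI for embedding; vLLM is better suited to generative and rerank workloads.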
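
A hedged sketch of calling a TEI (text-embeddings-inference) server from Python using only the standard library. It relies on TEI's `POST /embed` route with an `{"inputs": [...]}` body; the base URL and helper names are placeholders, and the 60-second default only mirrors the embedding HTTP timeout noted earlier.

```python
import json
import urllib.request


def build_embed_request(base_url, texts):
    """Build a POST request for TEI's /embed route."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url, texts, timeout_sec=60.0):
    """Send the request and return one embedding vector per input text."""
    request = build_embed_request(base_url, texts)
    with urllib.request.urlopen(request, timeout=timeout_sec) as response:
        return json.loads(response.read())
```

Batching, load balancing, and multi-process serving then happen on the server side, so the client stays a thin HTTP wrapper.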


Hunyuan LLM: use hunyuan-turbos-latest
Hunyuan OpenAI-compatible API call examples: https://cloud.tencent.com/document/product/1729/111007


腾讯云 混元大模型 API_KEY：sk-mN2PiW2gp57B3ykxGs4QhvYxhPzXRZ2bcR5kPqadjboGYwiz

hunyuan translation: use the model hunyuan-translation
https://cloud.tencent.com/document/product/1729/113395#4.-.E7.A4.BA.E4.BE.8B


Google Translate, Basic edition: https://docs.cloud.google.com/translate/docs/reference/rest/v2/translate


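Alibaba Cloud Bailian (Model Studio): the API key currently in use is for the mainland China region.
Each region's Base URL is bound to its own API key.

We now run on a US server and use the US endpoint, so an API_KEY must be created or obtained in the US-region console (https://modelstudio.console.aliyun.com/us-east-1 ):

Log in to the Bailian US-region console: https://modelstudio.console.aliyun.com/us-east-1?spm=5176.2020520104.0.0.6b383a98WjpXff
Under API Key management, create or copy a key valid for the US region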