Commit 3abbc95aef12ebad7dead00036e4c0523b2d5c55

Authored by tangwang
1 parent 048631be

refactor(scripts): reorganize scripts/ to match the current architecture; move perf and manual-test scripts

Background:
- scripts/ mixed service startup, data conversion, load-testing, throwaway
  scripts, and historical backup directories
- Leftovers from intermediate iterations made the tree hard to maintain and
  hard for newcomers to understand
- Service orchestration has stabilized on the `service_ctl up all` set:
  tei / cnclip / embedding / embedding-image / translator / reranker /
  backend / indexer / frontend / eval-web; reranker-fine is no longer a
  default member

Changes:
1. The root scripts/ is narrowed to run, ops, environment, and data-processing
   scripts, with a new scripts/README.md
2. Performance/load-test/tuning scripts moved wholesale to benchmarks/;
   benchmarks/README.md updated to match
3. Manual smoke scripts moved to tests/manual/; tests/manual/README.md
   updated to match
4. Clearly obsolete content removed:
   - scripts/indexer__old_2025_11/
   - scripts/start.sh
   - scripts/install_server_deps.sh
5. Paths and stale descriptions corrected in:
   - the root README.md
   - performance-report documents
   - reranker/translation module docs

Technical notes:
- Why performance tests do not live under the regular tests/: these scripts
  depend on real services, GPUs, models, and noisy environments, so they make
  a poor stable regression gate; benchmarks/ better fits their role
- tests/manual/ holds only interface smoke scripts that require manually
  started dependencies and human inspection of results
- All migrated Python scripts pass py_compile syntax checks
- All migrated shell scripts pass bash -n syntax checks

Verification:
- py_compile: pass
- bash -n: pass
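The py_compile / bash -n pass described above can be sketched as follows. This is a minimal reproduction on sample files created inline; an actual run would point at the migrated files under benchmarks/ and tests/manual/ instead:

```python
import py_compile
import subprocess
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Sample stand-ins for the migrated scripts (hypothetical names).
py_file = tmp / "bench.py"
py_file.write_text('print("ok")\n')
sh_file = tmp / "bench.sh"
sh_file.write_text('#!/usr/bin/env bash\necho ok\n')

# py_compile raises PyCompileError on a syntax error when doraise=True.
py_compile.compile(str(py_file), doraise=True)

# `bash -n` parses the script without executing it; non-zero exit on error.
result = subprocess.run(["bash", "-n", str(sh_file)])
assert result.returncode == 0
print("syntax checks passed")
```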
Showing 53 changed files with 191 additions and 1437 deletions
CLAUDE.md
... ... @@ -77,9 +77,11 @@ source activate.sh
77 77 # Generate test data (Tenant1 Mock + Tenant2 CSV)
78 78 ./scripts/mock_data.sh
79 79  
80   -# Ingest data to Elasticsearch
81   -./scripts/ingest.sh <tenant_id> [recreate] # e.g., ./scripts/ingest.sh 1 true
82   -python main.py ingest data.csv --limit 1000 --batch-size 50
  80 +# Create tenant index structure
  81 +./scripts/create_tenant_index.sh <tenant_id>
  82 +
  83 +# Build / refresh suggestion index
  84 +./scripts/build_suggestions.sh <tenant_id> --mode incremental
83 85 ```
84 86  
85 87 ### Running Services
... ... @@ -100,10 +102,10 @@ python main.py serve --host 0.0.0.0 --port 6002 --reload
100 102 # Run all tests
101 103 pytest tests/
102 104  
103   -# Run specific test types
104   -pytest tests/unit/ # Unit tests
105   -pytest tests/integration/ # Integration tests
106   -pytest -m "api" # API tests only
  105 +# Run focused regression sets
  106 +python -m pytest tests/ci -q
  107 +pytest tests/test_rerank_client.py
  108 +pytest tests/test_query_parser_mixed_language.py
107 109  
108 110 # Test search from command line
109 111 python main.py search "query" --tenant-id 1 --size 10
... ... @@ -114,12 +116,8 @@ python main.py search "query" --tenant-id 1 --size 10
114 116 # Stop all services
115 117 ./scripts/stop.sh
116 118  
117   -# Test environment (for CI/development)
118   -./scripts/start_test_environment.sh
119   -./scripts/stop_test_environment.sh
120   -
121   -# Install server dependencies
122   -./scripts/install_server_deps.sh
  119 +# Run CI contract tests
  120 +./scripts/run_ci_tests.sh
123 121 ```
124 122  
125 123 ## Architecture Overview
... ... @@ -585,7 +583,7 @@ GET /admin/stats # Index statistics
585 583 ./scripts/start_frontend.sh # Frontend UI (port 6003)
586 584  
587 585 # Data Operations
588   -./scripts/ingest.sh <tenant_id> [recreate] # Index data
  586 +./scripts/create_tenant_index.sh <tenant_id> # Create tenant index
589 587 ./scripts/mock_data.sh # Generate test data
590 588  
591 589 # Testing
... ...
benchmarks/README.md 0 → 100644
... ... @@ -0,0 +1,17 @@
  1 +# Benchmarks
  2 +
  3 +基准压测脚本统一放在 `benchmarks/`,不再和 `scripts/` 里的服务启动/运维脚本混放。
  4 +
  5 +目录约定:
  6 +
  7 +- `benchmarks/perf_api_benchmark.py`:通用 HTTP 接口压测入口
  8 +- `benchmarks/reranker/`:reranker 定向 benchmark、smoke、手工对比脚本
  9 +- `benchmarks/translation/`:translation 本地模型 benchmark
  10 +
  11 +这些脚本默认不是 CI 测试的一部分,因为它们通常具备以下特征:
  12 +
  13 +- 依赖真实服务、GPU、模型或特定数据集
  14 +- 结果受机器配置和运行时负载影响,不适合作为稳定回归门禁
  15 +- 更多用于容量评估、调参和问题复现,而不是功能正确性判定
  16 +
  17 +如果某个性能场景需要进入自动化回归,应新增到 `tests/` 下并明确收敛输入、环境和判定阈值,而不是直接复用这里的基准脚本。
... ...
scripts/perf_api_benchmark.py renamed to benchmarks/perf_api_benchmark.py
... ... @@ -11,13 +11,13 @@ Default scenarios (aligned with docs/搜索API对接指南 分册,如 -01 / -0
11 11 - rerank POST /rerank
12 12  
13 13 Examples:
14   - python scripts/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162
15   - python scripts/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
16   - python scripts/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
17   - python scripts/perf_api_benchmark.py --scenario all --cases-file scripts/perf_cases.json.example --output perf_result.json
  14 + python benchmarks/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162
  15 + python benchmarks/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
  16 + python benchmarks/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
  17 + python benchmarks/perf_api_benchmark.py --scenario all --cases-file benchmarks/perf_cases.json.example --output perf_result.json
18 18 # Embedding admission / priority (query param `priority`; same semantics as embedding service):
19   - python scripts/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
20   - python scripts/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
  19 + python benchmarks/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
  20 + python benchmarks/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
21 21 """
22 22  
23 23 from __future__ import annotations
... ... @@ -229,7 +229,7 @@ def apply_embed_priority_params(
229 229 ) -> None:
230 230 """
231 231 Merge default `priority` query param into embed templates when absent.
232   - `scripts/perf_cases.json` may set per-request `params.priority` to override.
  232 + `benchmarks/perf_cases.json` may set per-request `params.priority` to override.
233 233 """
234 234 mapping = {
235 235 "embed_text": max(0, int(embed_text_priority)),
... ...
scripts/perf_cases.json.example renamed to benchmarks/perf_cases.json.example
scripts/benchmark_reranker_1000docs.sh renamed to benchmarks/reranker/benchmark_reranker_1000docs.sh
... ... @@ -8,7 +8,7 @@
8 8 # Outputs JSON reports under perf_reports/<date>/reranker_1000docs/
9 9 #
10 10 # Usage:
11   -# ./scripts/benchmark_reranker_1000docs.sh
  11 +# ./benchmarks/reranker/benchmark_reranker_1000docs.sh
12 12 # Optional env:
13 13 # BATCH_SIZES="24 32 48 64"
14 14 # C1_REQUESTS=4
... ... @@ -85,7 +85,7 @@ run_bench() {
85 85 local c="$2"
86 86 local req="$3"
87 87 local out="${OUT_DIR}/rerank_bs${bs}_c${c}_r${req}.json"
88   - .venv/bin/python scripts/perf_api_benchmark.py \
  88 + .venv/bin/python benchmarks/perf_api_benchmark.py \
89 89 --scenario rerank \
90 90 --tenant-id "${TENANT_ID}" \
91 91 --reranker-base "${RERANK_BASE}" \
... ...
scripts/benchmark_reranker_gguf_local.py renamed to benchmarks/reranker/benchmark_reranker_gguf_local.py
... ... @@ -8,8 +8,8 @@ Runs the backend directly in a fresh process per config to measure:
8 8 - single-request rerank latency
9 9  
10 10 Example:
11   - ./.venv-reranker-gguf/bin/python scripts/benchmark_reranker_gguf_local.py
12   - ./.venv-reranker-gguf-06b/bin/python scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
  11 + ./.venv-reranker-gguf/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py
  12 + ./.venv-reranker-gguf-06b/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
13 13 """
14 14  
15 15 from __future__ import annotations
... ...
scripts/benchmark_reranker_random_titles.py renamed to benchmarks/reranker/benchmark_reranker_random_titles.py
... ... @@ -10,10 +10,10 @@ Each invocation runs 3 warmup requests with n=400 first; those are not timed for
10 10  
11 11 Example:
12 12 source activate.sh
13   - python scripts/benchmark_reranker_random_titles.py 386
14   - python scripts/benchmark_reranker_random_titles.py 40,80,100
15   - python scripts/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42
16   - RERANK_BASE=http://127.0.0.1:6007 python scripts/benchmark_reranker_random_titles.py 200
  13 + python benchmarks/reranker/benchmark_reranker_random_titles.py 386
  14 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100
  15 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42
  16 + RERANK_BASE=http://127.0.0.1:6007 python benchmarks/reranker/benchmark_reranker_random_titles.py 200
17 17 """
18 18  
19 19 from __future__ import annotations
... ...
tests/reranker_performance/curl1.sh renamed to benchmarks/reranker/manual/curl1.sh
tests/reranker_performance/curl1_simple.sh renamed to benchmarks/reranker/manual/curl1_simple.sh
tests/reranker_performance/curl2.sh renamed to benchmarks/reranker/manual/curl2.sh
tests/reranker_performance/rerank_performance_compare.sh renamed to benchmarks/reranker/manual/rerank_performance_compare.sh
scripts/patch_rerank_vllm_benchmark_config.py renamed to benchmarks/reranker/patch_rerank_vllm_benchmark_config.py
... ... @@ -73,7 +73,7 @@ def main() -> int:
73 73 p.add_argument(
74 74 "--config",
75 75 type=Path,
76   - default=Path(__file__).resolve().parent.parent / "config" / "config.yaml",
  76 + default=Path(__file__).resolve().parents[2] / "config" / "config.yaml",
77 77 )
78 78 p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True)
79 79 p.add_argument(
... ...
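The `parents[2]` bump in the hunk above accounts for the script now sitting one directory deeper (benchmarks/reranker/ instead of scripts/). A quick illustration of the `Path.parents` indexing involved, with hypothetical paths:

```python
from pathlib import Path

# Before the move: scripts/patch_x.py -> repo root is two levels up,
# i.e. parent.parent, which is parents[1].
old = Path("/repo/scripts/patch_x.py")
assert old.parents[1] == Path("/repo")

# After the move: benchmarks/reranker/patch_x.py -> repo root is three
# levels up, which is parents[2].
new = Path("/repo/benchmarks/reranker/patch_x.py")
assert new.parents[2] == Path("/repo")
print("ok")
```

The same one-level adjustment recurs in the other renamed benchmark scripts that locate the repo root from `__file__`.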
scripts/run_reranker_vllm_instruction_benchmark.sh renamed to benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh
... ... @@ -55,13 +55,13 @@ run_one() {
55 55 local jf="${OUT_DIR}/${backend}_${fmt}.json"
56 56  
57 57 echo "========== ${tag} =========="
58   - "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  58 + "$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
59 59 --backend "$backend" --instruction-format "$fmt"
60 60  
61 61 "${ROOT}/restart.sh" reranker
62 62 wait_health "$backend" "$fmt"
63 63  
64   - if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \
  64 + if ! "$PYTHON" "${ROOT}/benchmarks/reranker/benchmark_reranker_random_titles.py" \
65 65 100,200,400,600,800,1000 \
66 66 --repeat 5 \
67 67 --seed 42 \
... ... @@ -82,7 +82,7 @@ run_one qwen3_vllm_score compact
82 82 run_one qwen3_vllm_score standard
83 83  
84 84 # Restore repo-default-style rerank settings (score + compact).
85   -"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  85 +"$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
86 86 --backend qwen3_vllm_score --instruction-format compact
87 87 "${ROOT}/restart.sh" reranker
88 88 wait_health qwen3_vllm_score compact
... ...
scripts/smoke_qwen3_vllm_score_backend.py renamed to benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
... ... @@ -3,7 +3,7 @@
3 3 Smoke test: load Qwen3VLLMScoreRerankerBackend (must run as a file, not stdin — vLLM spawn).
4 4  
5 5 Usage (from repo root, score venv):
6   - PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py
  6 + PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
7 7  
8 8 Same as production: vLLM child processes need the venv's ``bin`` on PATH (for pip's ``ninja`` when
9 9 vLLM auto-selects FLASHINFER on T4/Turing). ``start_reranker.sh`` exports that; this script prepends
... ... @@ -20,8 +20,8 @@ import sys
20 20 import sysconfig
21 21 from pathlib import Path
22 22  
23   -# Repo root on sys.path when run as scripts/smoke_*.py
24   -_ROOT = Path(__file__).resolve().parents[1]
  23 +# Repo root on sys.path when run from benchmarks/reranker/.
  24 +_ROOT = Path(__file__).resolve().parents[2]
25 25 if str(_ROOT) not in sys.path:
26 26 sys.path.insert(0, str(_ROOT))
27 27  
... ...
scripts/benchmark_nllb_t4_tuning.py renamed to benchmarks/translation/benchmark_nllb_t4_tuning.py
... ... @@ -11,12 +11,12 @@ from datetime import datetime
11 11 from pathlib import Path
12 12 from typing import Any, Dict, List, Tuple
13 13  
14   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 15 if str(PROJECT_ROOT) not in sys.path:
16 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17  
18 18 from config.services_config import get_translation_config
19   -from scripts.benchmark_translation_local_models import (
  19 +from benchmarks.translation.benchmark_translation_local_models import (
20 20 benchmark_concurrency_case,
21 21 benchmark_serial_case,
22 22 build_environment_info,
... ...
scripts/benchmark_translation_local_models.py renamed to benchmarks/translation/benchmark_translation_local_models.py
... ... @@ -22,7 +22,7 @@ from typing import Any, Dict, Iterable, List, Sequence
22 22 import torch
23 23 import transformers
24 24  
25   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  25 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
26 26 if str(PROJECT_ROOT) not in sys.path:
27 27 sys.path.insert(0, str(PROJECT_ROOT))
28 28  
... ...
scripts/benchmark_translation_local_models_focus.py renamed to benchmarks/translation/benchmark_translation_local_models_focus.py
... ... @@ -11,12 +11,12 @@ from datetime import datetime
11 11 from pathlib import Path
12 12 from typing import Any, Dict, List
13 13  
14   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 15 if str(PROJECT_ROOT) not in sys.path:
16 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17  
18 18 from config.services_config import get_translation_config
19   -from scripts.benchmark_translation_local_models import (
  19 +from benchmarks.translation.benchmark_translation_local_models import (
20 20 SCENARIOS,
21 21 benchmark_concurrency_case,
22 22 benchmark_serial_case,
... ...
scripts/benchmark_translation_longtext_single.py renamed to benchmarks/translation/benchmark_translation_longtext_single.py
... ... @@ -13,7 +13,7 @@ from pathlib import Path
13 13  
14 14 import torch
15 15  
16   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  16 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
17 17  
18 18 import sys
19 19  
... ...
config/config.yaml
... ... @@ -114,7 +114,7 @@ field_boosts:
114 114 qanchors: 1.0
115 115 enriched_tags: 1.0
116 116 enriched_attributes.value: 1.5
117   - enriched_taxonomy_attributes.value: 0.3
  117 + # enriched_taxonomy_attributes.value: 0.3
118 118 category_name_text: 2.0
119 119 category_path: 2.0
120 120 keywords: 2.0
... ... @@ -195,7 +195,7 @@ query_config:
195 195 - qanchors
196 196 - enriched_tags
197 197 - enriched_attributes.value
198   - - enriched_taxonomy_attributes.value
  198 + # - enriched_taxonomy_attributes.value
199 199 - option1_values
200 200 - option2_values
201 201 - option3_values
... ... @@ -254,7 +254,7 @@ query_config:
254 254 # - qanchors
255 255 # - enriched_tags
256 256 # - enriched_attributes
257   - # - enriched_taxonomy_attributes.value
  257 + # - enriched_taxonomy_attributes.value
258 258 - min_price
259 259 - compare_at_price
260 260 - image_url
... ...
docs/DEVELOPER_GUIDE.md
... ... @@ -389,7 +389,7 @@ services:
389 389 - **位置**:`tests/`,可按 `unit/`、`integration/` 或按模块划分子目录;公共 fixture 在 `conftest.py`。
390 390 - **标记**:使用 `@pytest.mark.unit`、`@pytest.mark.integration`、`@pytest.mark.api` 等区分用例类型,便于按需运行。
391 391 - **依赖**:单元测试通过 mock(如 `mock_es_client`、`sample_search_config`)不依赖真实 ES/DB;集成测试需在说明中注明依赖服务。
392   -- **运行**:`python -m pytest tests/`;仅单元:`python -m pytest tests/unit/` 或 `-m unit`
  392 +- **运行**:`python -m pytest tests/`;推荐最小回归:`python -m pytest tests/ci -q`;按模块聚焦可直接指定具体测试文件
393 393 - **原则**:新增逻辑应有对应测试;修改协议或配置契约时更新相关测试与 fixture。
394 394  
395 395 ### 8.3 配置与环境
... ...
docs/QUICKSTART.md
... ... @@ -69,7 +69,7 @@ source activate.sh
69 69 ./run.sh all
70 70 # 仅为薄封装:等价于 ./scripts/service_ctl.sh up all
71 71 # 说明:
72   -# - all = tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web
  72 +# - all = tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web
73 73 # - up 会同时启动 monitor daemon(运行期连续失败自动重启)
74 74 # - reranker 为 GPU 强制模式(资源不足会直接启动失败)
75 75 # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU)
... ...
docs/Usage-Guide.md
... ... @@ -126,7 +126,7 @@ cd /data/saas-search
126 126  
127 127 这个脚本会自动:
128 128 1. 创建日志目录
129   -2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web`)
  129 +2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web`)
130 130 3. 写入 PID 到 `logs/*.pid`
131 131 4. 执行健康检查
132 132 5. 启动 monitor daemon(运行期连续失败自动重启)
... ... @@ -202,7 +202,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
202 202 ./scripts/service_ctl.sh restart backend
203 203 sleep 3
204 204 ./scripts/service_ctl.sh status backend
205   -./scripts/evaluation/start_eval.sh.sh batch
  205 +./scripts/evaluation/start_eval.sh batch
206 206 ```
207 207  
208 208 离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
... ...
docs/工作总结-微服务性能优化与架构.md
... ... @@ -129,7 +129,7 @@ instruction: "Given a shopping query, rank product titles by relevance"
129 129 - 可选:embedding(text) **6005**、embedding-image **6008**、translator **6006**、reranker **6007**、tei **8080**、cnclip **51000**。
130 130 - 端口可由环境变量覆盖:`API_PORT`、`INDEXER_PORT`、`FRONTEND_PORT`、`EVAL_WEB_PORT`、`EMBEDDING_TEXT_PORT`、`EMBEDDING_IMAGE_PORT`、`TRANSLATION_PORT`、`RERANKER_PORT`、`TEI_PORT`、`CNCLIP_PORT`。
131 131 - **命令**:
132   - - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、reranker-fine、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。
  132 + - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。
133 133 - 启动时:backend/indexer/frontend/embedding/translator/reranker 会写 pid 到 `logs/<service>.pid`,并执行 `wait_for_health`(GET `http://127.0.0.1:<port>/health`);reranker 健康重试 90 次,其余 30 次;TEI 校验 Docker 容器存在且 `/health` 成功;cnclip 无 HTTP 健康则仅校验进程/端口。
134 134 - **监控常驻**:
135 135 - `./scripts/service_ctl.sh monitor-start <targets>` 启动后台监控进程,将 targets 写入 `logs/service-monitor.targets`,pid 写入 `logs/service-monitor.pid`,日志追加到 `logs/service-monitor.log`。
... ... @@ -153,12 +153,12 @@ instruction: &quot;Given a shopping query, rank product titles by relevance&quot;
153 153  
154 154 ## 三、性能测试报告摘要
155 155  
156   -以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`scripts/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。
  156 +以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`benchmarks/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。
157 157  
158 158 **复现命令(四场景×四并发)**:
159 159 ```bash
160 160 cd /data/saas-search
161   -.venv/bin/python scripts/perf_api_benchmark.py \
  161 +.venv/bin/python benchmarks/perf_api_benchmark.py \
162 162 --scenario backend_search,backend_suggest,embed_text,rerank \
163 163 --concurrency-list 1,5,10,20 \
164 164 --duration 20 \
... ... @@ -188,7 +188,7 @@ cd /data/saas-search
188 188  
189 189 口径:query 固定 `wireless mouse`,每次请求 **386 docs**,句长 15–40 词随机(从 1000 词池采样);配置 `rerank_window=384`。复现命令:
190 190 ```bash
191   -.venv/bin/python scripts/perf_api_benchmark.py \
  191 +.venv/bin/python benchmarks/perf_api_benchmark.py \
192 192 --scenario rerank --duration 20 --concurrency-list 1,5,10,20 --timeout 60 \
193 193 --rerank-dynamic-docs --rerank-doc-count 386 --rerank-vocab-size 1000 \
194 194 --rerank-sentence-min-words 15 --rerank-sentence-max-words 40 \
... ... @@ -217,7 +217,7 @@ cd /data/saas-search
217 217 | 10 | 181 | 100% | 8.78 | 1129.23| 1295.88| 1330.96|
218 218 | 20 | 161 | 100% | 7.63 | 2594.00| 4706.44| 4783.05|
219 219  
220   -**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python scripts/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。
  220 +**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python benchmarks/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。
221 221  
222 222 ---
223 223  
... ... @@ -247,5 +247,5 @@ cd /data/saas-search
247 247  
248 248 **关键文件与复现**:
249 249 - 配置:`config/config.yaml`(services、rerank、query_config)、`.env`(端口与 API Key)。
250   -- 脚本:`scripts/service_ctl.sh`(启停与监控)、`scripts/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。
  250 +- 脚本:`scripts/service_ctl.sh`(启停与监控)、`benchmarks/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。
251 251 - 完整步骤与多租户/rerank 对比见:`docs/性能测试报告.md`。
... ...
docs/性能测试报告.md
... ... @@ -18,13 +18,13 @@
18 18  
19 19 执行方式:
20 20 - 每组压测持续 `20s`
21   -- 使用统一脚本 `scripts/perf_api_benchmark.py`
  21 +- 使用统一脚本 `benchmarks/perf_api_benchmark.py`
22 22 - 通过 `--scenario` 多值 + `--concurrency-list` 一次性跑完 `场景 x 并发`
23 23  
24 24 ## 3. 压测工具优化说明(复用现有脚本)
25 25  
26 26 为了解决原脚本“一次只能跑一个场景+一个并发”的可用性问题,本次直接扩展现有脚本:
27   -- `scripts/perf_api_benchmark.py`
  27 +- `benchmarks/perf_api_benchmark.py`
28 28  
29 29 能力:
30 30 - 一条命令执行 `场景列表 x 并发列表` 全矩阵
... ... @@ -33,7 +33,7 @@
33 33 示例:
34 34  
35 35 ```bash
36   -.venv/bin/python scripts/perf_api_benchmark.py \
  36 +.venv/bin/python benchmarks/perf_api_benchmark.py \
37 37 --scenario backend_search,backend_suggest,embed_text,rerank \
38 38 --concurrency-list 1,5,10,20 \
39 39 --duration 20 \
... ... @@ -106,7 +106,7 @@ curl -sS http://127.0.0.1:6007/health
106 106  
107 107 ```bash
108 108 cd /data/saas-search
109   -.venv/bin/python scripts/perf_api_benchmark.py \
  109 +.venv/bin/python benchmarks/perf_api_benchmark.py \
110 110 --scenario backend_search,backend_suggest,embed_text,rerank \
111 111 --concurrency-list 1,5,10,20 \
112 112 --duration 20 \
... ... @@ -164,7 +164,7 @@ cd /data/saas-search
164 164 复现命令:
165 165  
166 166 ```bash
167   -.venv/bin/python scripts/perf_api_benchmark.py \
  167 +.venv/bin/python benchmarks/perf_api_benchmark.py \
168 168 --scenario rerank \
169 169 --duration 20 \
170 170 --concurrency-list 1,5,10,20 \
... ... @@ -237,7 +237,7 @@ cd /data/saas-search
237 237 - 使用项目虚拟环境执行:
238 238  
239 239 ```bash
240   -.venv/bin/python scripts/perf_api_benchmark.py -h
  240 +.venv/bin/python benchmarks/perf_api_benchmark.py -h
241 241 ```
242 242  
243 243 ### 10.3 某场景成功率下降
... ... @@ -249,7 +249,7 @@ cd /data/saas-search
249 249  
250 250 ## 11. 关联文件
251 251  
252   -- 压测脚本:`scripts/perf_api_benchmark.py`
  252 +- 压测脚本:`benchmarks/perf_api_benchmark.py`
253 253 - 本次结果:`perf_reports/2026-03-12/perf_matrix_report.json`
254 254 - Search 多租户补测:`perf_reports/2026-03-12/search_tenant_matrix/`
255 255 - Reranker 386 docs 口径补测:`perf_reports/2026-03-12/rerank_realistic/rerank_386docs.json`
... ... @@ -280,7 +280,7 @@ cd /data/saas-search
280 280 cd /data/saas-search
281 281 mkdir -p perf_reports/2026-03-12/search_tenant_matrix
282 282 for t in 0 1 2 3 4; do
283   - .venv/bin/python scripts/perf_api_benchmark.py \
  283 + .venv/bin/python benchmarks/perf_api_benchmark.py \
284 284 --scenario backend_search \
285 285 --concurrency-list 1,5,10,20 \
286 286 --duration 20 \
... ...
docs/搜索API对接指南-05-索引接口(Indexer).md
... ... @@ -498,7 +498,7 @@ curl -X GET "http://localhost:6004/indexer/health"
498 498  
499 499 #### 请求示例(完整 curl)
500 500  
501   -> 完整请求体参考 `scripts/test_build_docs_api.py` 中的 `build_sample_request()`。
  501 +> 完整请求体参考 `tests/manual/test_build_docs_api.py` 中的 `build_sample_request()`。
502 502  
503 503 ```bash
504 504 # 单条 SPU 示例(含 spu、skus、options)
... ...
docs/搜索API对接指南-10-接口级压测脚本.md
... ... @@ -4,7 +4,7 @@
4 4  
5 5 ## 10. 接口级压测脚本
6 6  
7   -仓库提供统一压测脚本:`scripts/perf_api_benchmark.py`,用于对以下接口做并发压测:
  7 +仓库提供统一压测脚本:`benchmarks/perf_api_benchmark.py`,用于对以下接口做并发压测:
8 8  
9 9 - 后端搜索:`POST /search/`
10 10 - 搜索建议:`GET /search/suggestions`
... ... @@ -18,21 +18,21 @@
18 18  
19 19 ```bash
20 20 # suggest 压测(tenant 162)
21   -python scripts/perf_api_benchmark.py \
  21 +python benchmarks/perf_api_benchmark.py \
22 22 --scenario backend_suggest \
23 23 --tenant-id 162 \
24 24 --duration 30 \
25 25 --concurrency 50
26 26  
27 27 # search 压测
28   -python scripts/perf_api_benchmark.py \
  28 +python benchmarks/perf_api_benchmark.py \
29 29 --scenario backend_search \
30 30 --tenant-id 162 \
31 31 --duration 30 \
32 32 --concurrency 20
33 33  
34 34 # 全链路压测(search + suggest + embedding + translate + rerank)
35   -python scripts/perf_api_benchmark.py \
  35 +python benchmarks/perf_api_benchmark.py \
36 36 --scenario all \
37 37 --tenant-id 162 \
38 38 --duration 60 \
... ... @@ -45,17 +45,16 @@ python scripts/perf_api_benchmark.py \
45 45 可通过 `--cases-file` 覆盖默认请求模板。示例文件:
46 46  
47 47 ```bash
48   -scripts/perf_cases.json.example
  48 +benchmarks/perf_cases.json.example
49 49 ```
50 50  
51 51 执行示例:
52 52  
53 53 ```bash
54   -python scripts/perf_api_benchmark.py \
  54 +python benchmarks/perf_api_benchmark.py \
55 55 --scenario all \
56 56 --tenant-id 162 \
57   - --cases-file scripts/perf_cases.json.example \
  57 + --cases-file benchmarks/perf_cases.json.example \
58 58 --duration 60 \
59 59 --concurrency 40
60 60 ```
61   -
... ...
docs/相关性检索优化说明.md
... ... @@ -330,7 +330,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
330 330 ./scripts/service_ctl.sh restart backend
331 331 sleep 3
332 332 ./scripts/service_ctl.sh status backend
333   -./scripts/evaluation/start_eval.sh.sh batch
  333 +./scripts/evaluation/start_eval.sh batch
334 334 ```
335 335  
336 336 评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
... ... @@ -895,4 +895,3 @@ rerank_score:0.4784
895 895 rerank_score:0.5849
896 896 "zh": "新款女士修身仿旧牛仔短裤 – 休闲性感磨边水洗牛仔短裤,时尚舒",
897 897 "en": "New Women's Slim-fit Vintage Washed Denim Shorts – Casual Sexy Frayed Hem, Fashionable & Comfortable"
898   -
... ...
embeddings/README.md
... ... @@ -98,10 +98,10 @@
98 98  
99 99 ### 性能与压测(沿用仓库脚本)
100 100  
101   -- 接口级压测(与 `perf_reports/2026-03-12/matrix_report/` 等方法一致):`scripts/perf_api_benchmark.py`
102   - - 示例:`python scripts/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
  101 +- 接口级压测(与 `perf_reports/2026-03-12/matrix_report/` 等方法一致):`benchmarks/perf_api_benchmark.py`
  102 + - 示例:`python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
103 103 - 文本/图片向量可带 `priority`(与线上 admission 语义一致):`--embed-text-priority 1`、`--embed-image-priority 1`
104   - - 自定义请求模板:`--cases-file scripts/perf_cases.json.example`
  104 + - 自定义请求模板:`--cases-file benchmarks/perf_cases.json.example`
105 105 - 历史矩阵结果与说明见 `perf_reports/2026-03-12/matrix_report/summary.md`。
106 106  
107 107 ### 启动服务
... ...
perf_reports/20260311/reranker_1000docs/report.md
... ... @@ -34,5 +34,5 @@ Workload profile:
34 34 ## Reproduce
35 35  
36 36 ```bash
37   -./scripts/benchmark_reranker_1000docs.sh
  37 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
38 38 ```
... ...
perf_reports/20260317/translation_local_models/README.md
1 1 # Local Translation Model Benchmark Report
2 2  
3   -Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  3 +Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
4 4  
5 5 Test time: `2026-03-17`
6 6  
... ... @@ -67,7 +67,7 @@ To model online search query translation, we reran NLLB with `batch_size=1`. In
67 67 Command used:
68 68  
69 69 ```bash
70   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  70 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
71 71 --single \
72 72 --model nllb-200-distilled-600m \
73 73 --source-lang zh \
... ...
perf_reports/20260318/nllb_t4_product_names_ct2/README.md
1 1 # NLLB T4 Product-Name Tuning Summary
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_nllb_t4_tuning.py`](/data/saas-search/scripts/benchmark_nllb_t4_tuning.py)
  4 +- [`benchmarks/translation/benchmark_nllb_t4_tuning.py`](/data/saas-search/benchmarks/translation/benchmark_nllb_t4_tuning.py)
5 5  
6 6 本轮报告:
7 7 - Markdown:[`nllb_t4_tuning_003608.md`](/data/saas-search/perf_reports/20260318/nllb_t4_product_names_ct2/nllb_t4_tuning_003608.md)
... ...
perf_reports/20260318/translation_local_models/README.md
1 1 # Local Translation Model Benchmark Report
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5  
6 6 完整结果:
7 7 - Markdown:[`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
... ... @@ -39,7 +39,7 @@
39 39  
40 40 ```bash
41 41 cd /data/saas-search
42   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  42 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
43 43 --suite extended \
44 44 --disable-cache \
45 45 --serial-items-per-case 256 \
... ...
perf_reports/20260318/translation_local_models_ct2/README.md
1 1 # Local Translation Model Benchmark Report (CTranslate2)
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5  
6 6 本轮 CT2 结果:
7 7 - Markdown:[`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md)
... ... @@ -46,7 +46,7 @@ from datetime import datetime
46 46 from pathlib import Path
47 47 from types import SimpleNamespace
48 48  
49   -from scripts.benchmark_translation_local_models import (
  49 +from benchmarks.translation.benchmark_translation_local_models import (
50 50 SCENARIOS,
51 51 benchmark_extended_scenario,
52 52 build_environment_info,
... ...
perf_reports/20260318/translation_local_models_ct2_focus/README.md
1 1 # Local Translation Model Focused T4 Tuning
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
5 5  
6 6 本轮聚焦结果:
7 7 - Markdown:[`translation_local_models_focus_235018.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.md)
... ...
perf_reports/README.md
... ... @@ -4,7 +4,7 @@
4 4  
5 5 | Script | Purpose |
6 6 |------|------|
7   -| `scripts/perf_api_benchmark.py` | Load test for the search-backend, embedding, translation, rerank and other HTTP APIs; supports `--embed-text-priority` / `--embed-image-priority` and `scripts/perf_cases.json.example` |
  7 +| `benchmarks/perf_api_benchmark.py` | Load test for the search-backend, embedding, translation, rerank and other HTTP APIs; supports `--embed-text-priority` / `--embed-image-priority` and `benchmarks/perf_cases.json.example` |
8 8  
9 9 Historical matrix example (concurrency sweep):
10 10  
... ... @@ -25,10 +25,10 @@
25 25  
26 26 ```bash
27 27 source activate.sh
28   -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
29   -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
30   -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
31   -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
  28 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
  29 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
  30 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
  31 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
32 32 ```
33 33  
34 34 Note: this run is an **8-second smoke**; its duration/concurrency are not directly comparable to the `2026-03-12` matrix. It only verifies that the services still return 200 and that payload validation passes under the `priority` parameters.
... ...
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md
... ... @@ -25,7 +25,7 @@ Shared across both backends for this run:
25 25  
26 26 ## Methodology
27 27  
28   -- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
  28 +- Script: `python benchmarks/reranker/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
29 29 - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
30 30 - Query: default `健身女生T恤短袖`.
31 31 - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
... ... @@ -56,9 +56,9 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
56 56 ## Tooling added / changed
57 57  
58 58 - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
59   -- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
60   -- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
61   -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
  59 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
  60 +- `benchmarks/reranker/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
  61 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
62 62  
63 63 ---
64 64  
... ... @@ -73,7 +73,7 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
73 73 | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** |
74 74 | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | **Removed** (fewer YAML/env-var branches, converged logic) |
75 75 | Code default `instruction_format` | `qwen3_vllm_score` defaulted to `standard` | Aligned with `qwen3_vllm` at **`compact`** (YAML can still set `standard`) |
76   -| Smoke / startup | — | `scripts/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` first on `PATH`** (the FLASHINFER JIT depends on the `ninja` inside the venv) |
  76 +| Smoke / startup | — | `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` first on `PATH`** (the FLASHINFER JIT depends on the `ninja` inside the venv) |
77 77  
78 78 Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** at **n=400** docs on `LLM.score()` steady state (~**28%**), after removing the forced attention path and letting vLLM pick **FLASHINFER**.
79 79  
... ...
reranker/DEPLOYMENT_AND_TUNING.md
... ... @@ -109,7 +109,7 @@ curl -sS http://127.0.0.1:6007/health
109 109 ### 5.1 Using the one-shot benchmark script
110 110  
111 111 ```bash
112   -./scripts/benchmark_reranker_1000docs.sh
  112 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
113 113 ```
114 114  
115 115 Output directory:
... ...
reranker/GGUF_0_6B_INSTALL_AND_TUNING.md
... ... @@ -144,7 +144,7 @@ qwen3_gguf_06b:
144 144  
145 145 ```bash
146 146 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
147   - scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
  147 + benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
148 148 ```
149 149  
150 150 Starting it as a service:
... ...
reranker/GGUF_INSTALL_AND_TUNING.md
... ... @@ -117,7 +117,7 @@ HF_HUB_DISABLE_XET=1
117 117  
118 118 ```bash
119 119 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
120   - scripts/benchmark_reranker_gguf_local.py --docs 64 --repeat 1
  120 + benchmarks/reranker/benchmark_reranker_gguf_local.py --docs 64 --repeat 1
121 121 ```
122 122  
123 123 It instantiates the GGUF backend directly and outputs:
... ... @@ -134,7 +134,7 @@ PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
134 134  
135 135 - Query: `白色oversized T-shirt`
136 136 - Docs: `64` product titles
137   -- Local script: `scripts/benchmark_reranker_gguf_local.py`
  137 +- Local script: `benchmarks/reranker/benchmark_reranker_gguf_local.py`
138 138 - One run per configuration, focusing on relative trends
139 139  
140 140 Results:
... ... @@ -195,7 +195,7 @@ n_gpu_layers=999
195 195  
196 196 ```bash
197 197 RERANK_BASE=http://127.0.0.1:6007 \
198   - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt'
  198 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt'
199 199 ```
200 200  
201 201 This yields:
... ... @@ -206,7 +206,7 @@ RERANK_BASE=http://127.0.0.1:6007 \
206 206  
207 207 ```bash
208 208 RERANK_BASE=http://127.0.0.1:6007 \
209   - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt'
  209 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt'
210 210 ```
211 211  
212 212 This yields:
... ... @@ -276,5 +276,5 @@ offload_kqv: true
276 276 - `config/config.yaml`
277 277 - `scripts/setup_reranker_venv.sh`
278 278 - `scripts/start_reranker.sh`
279   -- `scripts/benchmark_reranker_gguf_local.py`
  279 +- `benchmarks/reranker/benchmark_reranker_gguf_local.py`
280 280 - `reranker/GGUF_INSTALL_AND_TUNING.md`
... ...
reranker/README.md
... ... @@ -46,9 +46,9 @@ The Reranker service exposes a unified `/rerank` API with pluggable backends (BGE, Jin
46 46 - `backends/dashscope_rerank.py`: DashScope cloud rerank backend
47 47 - `scripts/setup_reranker_venv.sh`: creates a dedicated venv per backend
48 48 - `scripts/start_reranker.sh`: starts the reranker service
49   -- `scripts/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score`
50   -- `scripts/benchmark_reranker_random_titles.py`: random-title benchmark script
51   -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: historical matrix driver
  49 +- `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score`
  50 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: random-title benchmark script
  51 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: historical matrix driver
52 52  
53 53 ## Environment baseline
54 54  
... ... @@ -118,7 +118,7 @@ nvidia-smi
118 118 ### 4. Smoke
119 119  
120 120 ```bash
121   -PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2
  121 +PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2
122 122 ```
123 123  
124 124 ## `jina_reranker_v3`
... ...
scripts/README.md 0 → 100644
... ... @@ -0,0 +1,53 @@
  1 +# Scripts
  2 +
  3 +`scripts/` now keeps only the operational, maintenance, environment, and data-processing scripts that are still valid under the current architecture.
  4 +
  5 +## Current layout
  6 +
  7 +- Service orchestration
  8 + - `service_ctl.sh`
  9 + - `start_backend.sh`
  10 + - `start_indexer.sh`
  11 + - `start_frontend.sh`
  12 + - `start_eval_web.sh`
  13 + - `start_embedding_service.sh`
  14 + - `start_embedding_text_service.sh`
  15 + - `start_embedding_image_service.sh`
  16 + - `start_reranker.sh`
  17 + - `start_translator.sh`
  18 + - `start_tei_service.sh`
  19 + - `start_cnclip_service.sh`
  20 + - `stop.sh`
  21 + - `stop_tei_service.sh`
  22 + - `stop_cnclip_service.sh`
  23 +
  24 +- Environment setup
  25 + - `create_venv.sh`
  26 + - `init_env.sh`
  27 + - `setup_embedding_venv.sh`
  28 + - `setup_reranker_venv.sh`
  29 + - `setup_translator_venv.sh`
  30 + - `setup_cnclip_venv.sh`
  31 +
  32 +- Data and indexing
  33 + - `create_tenant_index.sh`
  34 + - `build_suggestions.sh`
  35 + - `mock_data.sh`
  36 +
  37 +- Evaluation and special-purpose tools
  38 + - `evaluation/`
  39 + - `redis/`
  40 + - `debug/`
  41 +
  42 +## Migrated
  43 +
  44 +- Benchmark, load-test, and smoke scripts: moved to `benchmarks/`
  45 +- Manual API trial scripts: moved to `tests/manual/`
  46 +
  47 +## Removed
  48 +
  49 +- Historical backup directory: `indexer__old_2025_11/`
  50 +- Obsolete wrapper script: `start.sh`
  51 +- Conda-era leftover: `install_server_deps.sh`
  52 +
  53 +When adding scripts in the future, put them in a clearly scoped subdirectory first; do not dump benchmark, manual, or backup material straight back into the root of `scripts/`.
... ...
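The commit's validation gates (`bash -n` for every moved shell script, `py_compile` for every moved Python script) are cheap to re-run after any further moves. A minimal sketch of both gates on throwaway sample files — the filenames here are hypothetical, and it assumes `bash` and `python3` are on `PATH`:

```shell
set -eu
tmpdir=$(mktemp -d)
printf '#!/bin/bash\necho ok\n' > "$tmpdir/sample.sh"
printf 'print("ok")\n' > "$tmpdir/sample.py"

# bash -n parses the script without executing it
bash_gate=fail
bash -n "$tmpdir/sample.sh" && bash_gate=pass

# py_compile byte-compiles without importing or running the module
py_gate=fail
python3 -m py_compile "$tmpdir/sample.py" && py_gate=pass

echo "bash -n: $bash_gate"
echo "py_compile: $py_gate"
rm -rf "$tmpdir"
```

Both gates only parse; they never import modules or start services, so they run fine on a machine without GPUs or the live stack.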
scripts/trace_indexer_calls.sh renamed to scripts/debug/trace_indexer_calls.sh
1 1 #!/bin/bash
2 2 #
3 3 # Script for tracing who is calling the indexer service
4   -# Usage: ./scripts/trace_indexer_calls.sh
  4 +# Usage: ./scripts/debug/trace_indexer_calls.sh
5 5 #
6 6  
7 7 set -euo pipefail
... ...
scripts/indexer__old_2025_11/import_tenant2_csv.py deleted
... ... @@ -1,495 +0,0 @@
1   -#!/usr/bin/env python3
2   -"""
3   -Import tenant2 CSV data into MySQL Shoplazza tables.
4   -
5   -Reads CSV file and generates SQL INSERT statements for SPU and SKU tables.
6   -Each CSV row corresponds to 1 SPU and 1 SKU.
7   -This script is for generating test data for tenant_id=2 from CSV files.
8   -"""
9   -
10   -import sys
11   -import os
12   -import csv
13   -import random
14   -import argparse
15   -import re
16   -from pathlib import Path
17   -from datetime import datetime, timedelta
18   -
19   -# Add parent directory to path
20   -sys.path.insert(0, str(Path(__file__).parent.parent))
21   -
22   -
23   -def escape_sql_string(value: str) -> str:
24   - """
25   - Escape SQL string value (replace single quotes with doubled quotes and handle special characters).
26   -
27   - Args:
28   - value: String value to escape
29   -
30   - Returns:
31   - Escaped string
32   - """
33   - if value is None:
34   - return ''
35   -
36   - # Convert to string and handle None
37   - s = str(value)
38   -
39   - # Replace single quotes with doubled quotes (SQL standard)
40   - s = s.replace("'", "''")
41   -
42   - # Replace backslashes (MySQL escape)
43   - s = s.replace("\\", "\\\\")
44   -
45   - # Remove or replace control characters that can break SQL
46   - # Replace newlines and carriage returns with spaces
47   - s = s.replace("\n", " ").replace("\r", " ")
48   -
49   - # Remove other control characters (except tab)
50   - s = re.sub(r'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]', '', s)
51   -
52   - # Remove null bytes
53   - s = s.replace('\x00', '')
54   -
55   - return s
56   -
57   -
58   -def generate_handle(title: str) -> str:
59   - """
60   - Generate URL-friendly handle from title.
61   -
62   - Args:
63   - title: Product title
64   -
65   - Returns:
66   - URL-friendly handle
67   - """
68   - # Remove special characters, convert to lowercase, replace spaces with hyphens
69   - handle = re.sub(r'[^\w\s-]', '', title.lower())
70   - handle = re.sub(r'[-\s]+', '-', handle)
71   - handle = handle.strip('-')
72   - # Limit length
73   - if len(handle) > 255:
74   - handle = handle[:255]
75   - return handle or 'product'
76   -
77   -
78   -def parse_csv_row(row: dict) -> dict:
79   - """
80   - Parse CSV row and extract fields.
81   -
82   - Args:
83   - row: CSV row dictionary
84   -
85   - Returns:
86   - Parsed data dictionary
87   - """
88   - # Remove quotes from values if present
89   - def clean_value(value):
90   - if value is None:
91   - return ''
92   - value = str(value).strip()
93   - # Remove surrounding quotes
94   - if value.startswith('"') and value.endswith('"'):
95   - value = value[1:-1]
96   - return value
97   -
98   - return {
99   - 'skuId': clean_value(row.get('skuId', '')),
100   - 'name': clean_value(row.get('name', '')),
101   - 'name_pinyin': clean_value(row.get('name_pinyin', '')),
102   - 'create_time': clean_value(row.get('create_time', '')),
103   - 'ruSkuName': clean_value(row.get('ruSkuName', '')),
104   - 'enSpuName': clean_value(row.get('enSpuName', '')),
105   - 'categoryName': clean_value(row.get('categoryName', '')),
106   - 'supplierName': clean_value(row.get('supplierName', '')),
107   - 'brandName': clean_value(row.get('brandName', '')),
108   - 'file_id': clean_value(row.get('file_id', '')),
109   - 'days_since_last_update': clean_value(row.get('days_since_last_update', '')),
110   - 'id': clean_value(row.get('id', '')),
111   - 'imageUrl': clean_value(row.get('imageUrl', ''))
112   - }
113   -
114   -
115   -def generate_spu_data(csv_data: dict, spu_id: int, tenant_id: str = "2") -> dict:
116   - """
117   - Generate SPU data from CSV row.
118   -
119   - Args:
120   - csv_data: Parsed CSV row data
121   - spu_id: SPU ID
122   - tenant_id: Tenant ID (default: "2")
123   -
124   - Returns:
125   - SPU data dictionary
126   - """
127   - # Parse create_time
128   - try:
129   - created_at = datetime.strptime(csv_data['create_time'], '%Y-%m-%d %H:%M:%S')
130   - except:
131   - created_at = datetime.now() - timedelta(days=random.randint(1, 365))
132   -
133   - updated_at = created_at + timedelta(days=random.randint(0, 30))
134   -
135   - # Generate handle from title
136   - title = csv_data['name'] or csv_data['enSpuName'] or 'Product'
137   - handle = generate_handle(title)
138   -
139   - # Generate tags from category and brand
140   - tags_parts = []
141   - if csv_data['categoryName']:
142   - tags_parts.append(csv_data['categoryName'])
143   - if csv_data['brandName']:
144   - tags_parts.append(csv_data['brandName'])
145   - tags = ','.join(tags_parts) if tags_parts else ''
146   -
147   - # Generate SEO fields
148   - seo_title = f"{title} - {csv_data['categoryName']}" if csv_data['categoryName'] else title
149   - seo_description = f"购买{csv_data['brandName']}{title}" if csv_data['brandName'] else title
150   - seo_keywords = f"{title},{csv_data['categoryName']},{csv_data['brandName']}" if csv_data['categoryName'] else title
151   -
152   - spu = {
153   - 'id': spu_id,
154   - 'shop_id': 1,
155   - 'shoplazza_id': csv_data['id'] or f"spu-{spu_id}",
156   - 'handle': handle,
157   - 'title': title,
158   - 'brief': csv_data['name'] or '',
159   - 'description': f"<p>{csv_data['name']}</p>" if csv_data['name'] else '',
160   - 'spu': '',
161   - 'vendor': csv_data['supplierName'] or '',
162   - 'vendor_url': '',
163   - 'seo_title': seo_title,
164   - 'seo_description': seo_description,
165   - 'seo_keywords': seo_keywords,
166   - 'image_src': csv_data['imageUrl'] or '',
167   - 'image_width': 800,
168   - 'image_height': 600,
169   - 'image_path': f"products/{spu_id}.jpg",
170   - 'image_alt': title,
171   - 'inventory_policy': '',
172   - 'inventory_quantity': 0,
173   - 'inventory_tracking': '0',
174   - 'published': 1,
175   - 'published_at': created_at.strftime('%Y-%m-%d %H:%M:%S'),
176   - 'requires_shipping': 1,
177   - 'taxable': 0,
178   - 'fake_sales': 0,
179   - 'display_fake_sales': 0,
180   - 'mixed_wholesale': 0,
181   - 'need_variant_image': 0,
182   - 'has_only_default_variant': 0,
183   - 'tags': tags,
184   - 'note': '',
185   - 'category': csv_data['categoryName'] or '',
186   - 'shoplazza_created_at': created_at.strftime('%Y-%m-%d %H:%M:%S'),
187   - 'shoplazza_updated_at': updated_at.strftime('%Y-%m-%d %H:%M:%S'),
188   - 'tenant_id': tenant_id,
189   - 'creator': '1',
190   - 'create_time': created_at.strftime('%Y-%m-%d %H:%M:%S'),
191   - 'updater': '1',
192   - 'update_time': updated_at.strftime('%Y-%m-%d %H:%M:%S'),
193   - 'deleted': 0
194   - }
195   -
196   - return spu
197   -
198   -
199   -def generate_sku_data(csv_data: dict, spu_id: int, sku_id: int, tenant_id: str = "2") -> dict:
200   - """
201   - Generate SKU data from CSV row.
202   -
203   - Args:
204   - csv_data: Parsed CSV row data
205   - spu_id: Associated SPU ID
206   - sku_id: SKU ID (from CSV skuId)
207   - tenant_id: Tenant ID (default: "2")
208   -
209   - Returns:
210   - SKU data dictionary
211   - """
212   - # Parse create_time
213   - try:
214   - created_at = datetime.strptime(csv_data['create_time'], '%Y-%m-%d %H:%M:%S')
215   - except:
216   - created_at = datetime.now() - timedelta(days=random.randint(1, 365))
217   -
218   - updated_at = created_at + timedelta(days=random.randint(0, 30))
219   -
220   - # Generate random price
221   - price = round(random.uniform(50, 500), 2)
222   - compare_at_price = round(price * random.uniform(1.2, 1.5), 2)
223   - cost_price = round(price * 0.6, 2)
224   -
225   - # Generate random stock
226   - inventory_quantity = random.randint(0, 100)
227   -
228   - # Generate random weight
229   - weight = round(random.uniform(0.1, 5.0), 2)
230   -
231   - # Use ruSkuName as title, fallback to name
232   - title = csv_data['ruSkuName'] or csv_data['name'] or 'SKU'
233   -
234   - # Use skuId as SKU code
235   - sku_code = csv_data['skuId'] or f"SKU-{sku_id}"
236   -
237   - sku = {
238   - 'id': sku_id,
239   - 'spu_id': spu_id,
240   - 'shop_id': 1,
241   - 'shoplazza_id': f"sku-{sku_id}",
242   - 'shoplazza_product_id': csv_data['id'] or f"spu-{spu_id}",
243   - 'shoplazza_image_id': '',
244   - 'title': title,
245   - 'sku': sku_code,
246   - 'barcode': f"BAR{sku_id:08d}",
247   - 'position': 1,
248   - 'price': price,
249   - 'compare_at_price': compare_at_price,
250   - 'cost_price': cost_price,
251   - 'option1': '',
252   - 'option2': '',
253   - 'option3': '',
254   - 'inventory_quantity': inventory_quantity,
255   - 'weight': weight,
256   - 'weight_unit': 'kg',
257   - 'image_src': csv_data['imageUrl'] or '',
258   - 'wholesale_price': f'[{{"price": {round(price * 0.8, 2)}, "minQuantity": 10}}]',
259   - 'note': '',
260   - 'extend': None, # JSON field, use NULL
261   - 'shoplazza_created_at': created_at.strftime('%Y-%m-%d %H:%M:%S'),
262   - 'shoplazza_updated_at': updated_at.strftime('%Y-%m-%d %H:%M:%S'),
263   - 'tenant_id': tenant_id,
264   - 'creator': '1',
265   - 'create_time': created_at.strftime('%Y-%m-%d %H:%M:%S'),
266   - 'updater': '1',
267   - 'update_time': updated_at.strftime('%Y-%m-%d %H:%M:%S'),
268   - 'deleted': 0
269   - }
270   -
271   - return sku
272   -
273   -
274   -def read_csv_file(csv_file: str) -> list:
275   - """
276   - Read CSV file and return list of parsed rows.
277   -
278   - Args:
279   - csv_file: Path to CSV file
280   -
281   - Returns:
282   - List of parsed CSV data dictionaries
283   - """
284   - csv_data_list = []
285   -
286   - with open(csv_file, 'r', encoding='utf-8') as f:
287   - # Use csv.DictReader to handle quoted fields properly
288   - reader = csv.DictReader(f)
289   - for row in reader:
290   - parsed = parse_csv_row(row)
291   - csv_data_list.append(parsed)
292   -
293   - return csv_data_list
294   -
295   -
296   -def generate_sql_inserts(spus: list, skus: list, output_file: str):
297   - """
298   - Generate SQL INSERT statements.
299   -
300   - Args:
301   - spus: List of SPU data
302   - skus: List of SKU data
303   - output_file: Output file path
304   - """
305   - with open(output_file, 'w', encoding='utf-8') as f:
306   - f.write("-- SPU Data from tenant2 CSV\n")
307   - f.write("INSERT INTO shoplazza_product_spu (\n")
308   - f.write(" id, shop_id, shoplazza_id, handle, title, brief, description, spu,\n")
309   - f.write(" vendor, vendor_url, seo_title, seo_description, seo_keywords,\n")
310   - f.write(" image_src, image_width, image_height, image_path, image_alt,\n")
311   - f.write(" inventory_policy, inventory_quantity, inventory_tracking,\n")
312   - f.write(" published, published_at, requires_shipping, taxable,\n")
313   - f.write(" fake_sales, display_fake_sales, mixed_wholesale, need_variant_image,\n")
314   - f.write(" has_only_default_variant, tags, note, category,\n")
315   - f.write(" shoplazza_created_at, shoplazza_updated_at, tenant_id,\n")
316   - f.write(" creator, create_time, updater, update_time, deleted\n")
317   - f.write(") VALUES\n")
318   -
319   - for i, spu in enumerate(spus):
320   - values = (
321   - f"({spu['id']}, {spu['shop_id']}, '{escape_sql_string(spu['shoplazza_id'])}', "
322   - f"'{escape_sql_string(spu['handle'])}', '{escape_sql_string(spu['title'])}', "
323   - f"'{escape_sql_string(spu['brief'])}', '{escape_sql_string(spu['description'])}', "
324   - f"'{escape_sql_string(spu['spu'])}', '{escape_sql_string(spu['vendor'])}', "
325   - f"'{escape_sql_string(spu['vendor_url'])}', '{escape_sql_string(spu['seo_title'])}', "
326   - f"'{escape_sql_string(spu['seo_description'])}', '{escape_sql_string(spu['seo_keywords'])}', "
327   - f"'{escape_sql_string(spu['image_src'])}', {spu['image_width']}, "
328   - f"{spu['image_height']}, '{escape_sql_string(spu['image_path'])}', "
329   - f"'{escape_sql_string(spu['image_alt'])}', '{escape_sql_string(spu['inventory_policy'])}', "
330   - f"{spu['inventory_quantity']}, '{escape_sql_string(spu['inventory_tracking'])}', "
331   - f"{spu['published']}, '{escape_sql_string(spu['published_at'])}', "
332   - f"{spu['requires_shipping']}, {spu['taxable']}, "
333   - f"{spu['fake_sales']}, {spu['display_fake_sales']}, {spu['mixed_wholesale']}, "
334   - f"{spu['need_variant_image']}, {spu['has_only_default_variant']}, "
335   - f"'{escape_sql_string(spu['tags'])}', '{escape_sql_string(spu['note'])}', "
336   - f"'{escape_sql_string(spu['category'])}', '{escape_sql_string(spu['shoplazza_created_at'])}', "
337   - f"'{escape_sql_string(spu['shoplazza_updated_at'])}', '{escape_sql_string(spu['tenant_id'])}', "
338   - f"'{escape_sql_string(spu['creator'])}', '{escape_sql_string(spu['create_time'])}', "
339   - f"'{escape_sql_string(spu['updater'])}', '{escape_sql_string(spu['update_time'])}', "
340   - f"{spu['deleted']})"
341   - )
342   - f.write(values)
343   - if i < len(spus) - 1:
344   - f.write(",\n")
345   - else:
346   - f.write(";\n\n")
347   -
348   - f.write("-- SKU Data from tenant2 CSV\n")
349   - f.write("INSERT INTO shoplazza_product_sku (\n")
350   - f.write(" id, spu_id, shop_id, shoplazza_id, shoplazza_product_id, shoplazza_image_id,\n")
351   - f.write(" title, sku, barcode, position, price, compare_at_price, cost_price,\n")
352   - f.write(" option1, option2, option3, inventory_quantity, weight, weight_unit,\n")
353   - f.write(" image_src, wholesale_price, note, extend,\n")
354   - f.write(" shoplazza_created_at, shoplazza_updated_at, tenant_id,\n")
355   - f.write(" creator, create_time, updater, update_time, deleted\n")
356   - f.write(") VALUES\n")
357   -
358   - for i, sku in enumerate(skus):
359   - # Handle extend field (JSON, can be NULL)
360   - extend_value = 'NULL' if sku['extend'] is None else f"'{escape_sql_string(sku['extend'])}'"
361   -
362   - values = (
363   - f"({sku['id']}, {sku['spu_id']}, {sku['shop_id']}, '{escape_sql_string(sku['shoplazza_id'])}', "
364   - f"'{escape_sql_string(sku['shoplazza_product_id'])}', '{escape_sql_string(sku['shoplazza_image_id'])}', "
365   - f"'{escape_sql_string(sku['title'])}', '{escape_sql_string(sku['sku'])}', "
366   - f"'{escape_sql_string(sku['barcode'])}', {sku['position']}, "
367   - f"{sku['price']}, {sku['compare_at_price']}, {sku['cost_price']}, "
368   - f"'{escape_sql_string(sku['option1'])}', '{escape_sql_string(sku['option2'])}', "
369   - f"'{escape_sql_string(sku['option3'])}', {sku['inventory_quantity']}, {sku['weight']}, "
370   - f"'{escape_sql_string(sku['weight_unit'])}', '{escape_sql_string(sku['image_src'])}', "
371   - f"'{escape_sql_string(sku['wholesale_price'])}', '{escape_sql_string(sku['note'])}', "
372   - f"{extend_value}, '{escape_sql_string(sku['shoplazza_created_at'])}', "
373   - f"'{escape_sql_string(sku['shoplazza_updated_at'])}', '{escape_sql_string(sku['tenant_id'])}', "
374   - f"'{escape_sql_string(sku['creator'])}', '{escape_sql_string(sku['create_time'])}', "
375   - f"'{escape_sql_string(sku['updater'])}', '{escape_sql_string(sku['update_time'])}', "
376   - f"{sku['deleted']})"
377   - )
378   - f.write(values)
379   - if i < len(skus) - 1:
380   - f.write(",\n")
381   - else:
382   - f.write(";\n")
383   -
384   -
385   -def get_max_ids_from_db(db_config=None):
386   - """
387   - Get maximum IDs from database to avoid primary key conflicts.
388   -
389   - Args:
390   - db_config: Optional database config dict with keys: host, port, database, username, password
391   -
392   - Returns:
393   - tuple: (max_spu_id, max_sku_id) or (0, 0) if cannot connect
394   - """
395   - if not db_config:
396   - return 0, 0
397   -
398   - try:
399   - from utils.db_connector import create_db_connection
400   - from sqlalchemy import text
401   -
402   - db_engine = create_db_connection(
403   - host=db_config['host'],
404   - port=db_config['port'],
405   - database=db_config['database'],
406   - username=db_config['username'],
407   - password=db_config['password']
408   - )
409   -
410   - with db_engine.connect() as conn:
411   - result = conn.execute(text('SELECT MAX(id) FROM shoplazza_product_spu'))
412   - max_spu_id = result.scalar() or 0
413   -
414   - result = conn.execute(text('SELECT MAX(id) FROM shoplazza_product_sku'))
415   - max_sku_id = result.scalar() or 0
416   -
417   - return max_spu_id, max_sku_id
418   - except Exception as e:
419   - print(f"Warning: Could not get max IDs from database: {e}")
420   - return 0, 0
421   -
422   -
423   -def main():
424   - parser = argparse.ArgumentParser(description='Import tenant2 CSV data into MySQL Shoplazza tables')
425   - parser.add_argument('--csv-file', required=True, help='CSV file path')
426   - parser.add_argument('--tenant-id', default='2', help='Tenant ID (default: 2)')
427   - parser.add_argument('--start-spu-id', type=int, default=None, help='Starting SPU ID (default: auto-calculate from DB)')
428   - parser.add_argument('--output', default='tenant2_data.sql', help='Output SQL file (default: tenant2_data.sql)')
429   - parser.add_argument('--db-host', help='Database host (for auto-calculating start IDs)')
430   - parser.add_argument('--db-port', type=int, default=3306, help='Database port (default: 3306)')
431   - parser.add_argument('--db-database', help='Database name (for auto-calculating start IDs)')
432   - parser.add_argument('--db-username', help='Database username (for auto-calculating start IDs)')
433   - parser.add_argument('--db-password', help='Database password (for auto-calculating start IDs)')
434   -
435   - args = parser.parse_args()
436   -
437   - print(f"Reading CSV file: {args.csv_file}")
438   - csv_data_list = read_csv_file(args.csv_file)
439   - print(f"Read {len(csv_data_list)} rows from CSV")
440   -
441   - # Auto-calculate start IDs if not provided and DB config available
442   - start_spu_id = args.start_spu_id
443   - if start_spu_id is None and args.db_host and args.db_database and args.db_username and args.db_password:
444   - print("Auto-calculating start IDs from database...")
445   - db_config = {
446   - 'host': args.db_host,
447   - 'port': args.db_port,
448   - 'database': args.db_database,
449   - 'username': args.db_username,
450   - 'password': args.db_password
451   - }
452   - max_spu_id, max_sku_id = get_max_ids_from_db(db_config)
453   - start_spu_id = max_spu_id + 1
454   - print(f" Max SPU ID in DB: {max_spu_id}")
455   - print(f" Using start SPU ID: {start_spu_id}")
456   - elif start_spu_id is None:
457   - start_spu_id = 1
458   - print(f"Using default start SPU ID: {start_spu_id}")
459   -
460   - # Generate SPU and SKU data
461   - print(f"Generating SPU and SKU data (tenant_id={args.tenant_id})...")
462   - spus = []
463   - skus = []
464   - spu_id = start_spu_id
465   -
466   - for csv_data in csv_data_list:
467   - # Generate SPU
468   - spu = generate_spu_data(csv_data, spu_id, args.tenant_id)
469   - spus.append(spu)
470   -
471   - # Generate SKU - use skuId from CSV as SKU ID
472   - try:
473   - sku_id = int(csv_data['skuId'])
474   - except:
475   - # If skuId is not valid, use a generated ID
476   - sku_id = 1000000 + spu_id
477   -
478   - sku = generate_sku_data(csv_data, spu_id, sku_id, args.tenant_id)
479   - skus.append(sku)
480   -
481   - spu_id += 1
482   -
483   - print(f"Generated {len(spus)} SPUs and {len(skus)} SKUs")
484   -
485   - # Generate SQL file
486   - print(f"Generating SQL file: {args.output}")
487   - generate_sql_inserts(spus, skus, args.output)
488   - print(f"SQL file generated: {args.output}")
489   - print(f" - SPUs: {len(spus)}")
490   - print(f" - SKUs: {len(skus)}")
491   -
492   -
493   -if __name__ == '__main__':
494   - main()
495   -
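The deleted importer above builds its INSERT statements by escaping string values manually (`escape_sql_string`). Should a replacement ever be written, driver-level parameter binding sidesteps quoting and control-character handling entirely. A minimal sketch using the stdlib `sqlite3` module as a stand-in for the MySQL connection (table and column names are illustrative, not the real schema):

```python
import sqlite3

# Bind values through the driver instead of splicing escaped strings
# into the SQL text; quotes and non-ASCII pass through untouched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spu (id INTEGER, title TEXT)")
rows = [(1, "Men's 白色 T-shirt"), (2, 'Plain "oversized" tee')]
conn.executemany("INSERT INTO spu (id, title) VALUES (?, ?)", rows)
titles = [t for (t,) in conn.execute("SELECT title FROM spu ORDER BY id")]
print(titles)
```

The same pattern works with `pymysql` / SQLAlchemy `text(...)` bind parameters, which the project already uses for reading max IDs.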
scripts/indexer__old_2025_11/import_test_data.py deleted
... ... @@ -1,277 +0,0 @@
1   -#!/usr/bin/env python3
2   -"""
3   -Import test data into MySQL Shoplazza tables.
4   -
5   -Reads SQL file generated by generate_test_data.py and imports into MySQL.
6   -"""
7   -
8   -import sys
9   -import os
10   -import argparse
11   -from pathlib import Path
12   -
13   -# Add parent directory to path
14   -sys.path.insert(0, str(Path(__file__).parent.parent))
15   -
16   -from utils.db_connector import create_db_connection, test_connection
17   -
18   -
19   -def import_sql_file(db_engine, sql_file: str):
20   - """
21   - Import SQL file into database using MySQL client (more reliable for large files).
22   -
23   - Args:
24   - db_engine: SQLAlchemy database engine (used to get connection info)
25   - sql_file: Path to SQL file
26   - """
27   - import subprocess
28   - import os
29   - from pathlib import Path
30   -
31   - # Get connection info from engine URL
32   - engine_url = str(db_engine.url)
33   - # Parse: mysql+pymysql://user:pass@host:port/database
34   - import re
35   - match = re.match(r'mysql\+pymysql://([^:]+):([^@]+)@([^:]+):(\d+)/(.+)', engine_url)
36   - if not match:
37   - raise ValueError(f"Cannot parse database URL: {engine_url}")
38   -
39   - username, password, host, port, database = match.groups()
40   -
41   - # Use MySQL client to execute SQL file (more reliable)
42   - sql_file_path = Path(sql_file).absolute()
43   -
44   - # Build mysql command
45   - mysql_cmd = [
46   - 'mysql',
47   - f'-h{host}',
48   - f'-P{port}',
49   - f'-u{username}',
50   - f'-p{password}',
51   - database
52   - ]
53   -
54   - print(f"Executing SQL file using MySQL client...")
55   - print(f" File: {sql_file_path}")
56   - print(f" Database: {host}:{port}/{database}")
57   -
58   - try:
59   - with open(sql_file_path, 'r', encoding='utf-8') as f:
60   - result = subprocess.run(
61   - mysql_cmd,
62   - stdin=f,
63   - capture_output=True,
64   - text=True,
65   - timeout=300 # 5 minute timeout
66   - )
67   -
68   - if result.returncode != 0:
69   - error_msg = result.stderr or result.stdout
70   - print(f"ERROR: MySQL execution failed")
71   - print(f"Error output: {error_msg[:500]}")
72   - raise Exception(f"MySQL execution failed: {error_msg[:200]}")
73   -
74   - print("SQL file executed successfully")
75   - return True
76   -
77   - except FileNotFoundError:
78   - # Fallback to SQLAlchemy if mysql client not available
79   - print("MySQL client not found, falling back to SQLAlchemy...")
80   - return import_sql_file_sqlalchemy(db_engine, sql_file)
81   - except subprocess.TimeoutExpired:
82   - raise Exception("SQL execution timed out after 5 minutes")
83   - except Exception as e:
84   - print(f"Error using MySQL client: {e}")
85   - print("Falling back to SQLAlchemy...")
86   - return import_sql_file_sqlalchemy(db_engine, sql_file)
87   -
88   -
89   -def import_sql_file_sqlalchemy(db_engine, sql_file: str):
90   - """
91   - Fallback method: Import SQL file using SQLAlchemy (for when mysql client unavailable).
92   - """
93   - from sqlalchemy import text
94   -
95   - with open(sql_file, 'r', encoding='utf-8') as f:
96   - sql_content = f.read()
97   -
98   - # Remove comment lines
99   - lines = sql_content.split('\n')
100   - cleaned_lines = []
101   - for line in lines:
102   - stripped = line.lstrip()
103   - if stripped.startswith('--'):
104   - continue
105   - cleaned_lines.append(line)
106   -
107   - sql_content = '\n'.join(cleaned_lines)
108   -
109   - # Split by semicolon - but we need to handle strings properly
110   - # Use a state machine to track string boundaries
111   - statements = []
112   - current = []
113   - in_string = False
114   - i = 0
115   -
116   - while i < len(sql_content):
117   - char = sql_content[i]
118   -
119   - if char == "'":
120   - # Check for escaped quote (two single quotes)
121   - if i + 1 < len(sql_content) and sql_content[i+1] == "'":
122   - current.append("''")
123   - i += 1 # Skip next quote
124   - elif not in_string:
125   - in_string = True
126   - current.append(char)
127   - else:
128   - in_string = False
129   - current.append(char)
130   - else:
131   - current.append(char)
132   -
133   - # Split on semicolon only if not in string
134   - if char == ';' and not in_string:
135   - stmt = ''.join(current).strip()
136   - if stmt and stmt.upper().startswith('INSERT INTO'):
137   - statements.append(stmt)
138   - current = []
139   -
140   - i += 1
141   -
142   - # Handle last statement
143   - if current:
144   - stmt = ''.join(current).strip()
145   - if stmt and stmt.upper().startswith('INSERT INTO'):
146   - statements.append(stmt)
147   -
148   - print(f"Parsed {len(statements)} SQL statements")
149   - print(f"Executing {len(statements)} SQL statements...")
150   -
151   - # Use raw connection to avoid SQLAlchemy parameter parsing
152   - raw_conn = db_engine.raw_connection()
153   - try:
154   - cursor = raw_conn.cursor()
155   - try:
156   - for i, statement in enumerate(statements, 1):
157   - try:
158   - # Execute raw SQL directly using pymysql cursor
159   - cursor.execute(statement)
160   - raw_conn.commit()
161   - if i % 1000 == 0 or i == len(statements):
162   - print(f" [{i}/{len(statements)}] Executed successfully")
163   - except Exception as e:
164   - print(f" [{i}/{len(statements)}] ERROR: {e}")
165   - error_start = max(0, statement.find('VALUES') - 100)
166   - error_end = min(len(statement), error_start + 500)
167   - print(f" Statement context: ...{statement[error_start:error_end]}...")
168   - raise
169   - finally:
170   - cursor.close()
171   - finally:
172   - raw_conn.close()
173   -
174   - return True
175   -
176   -
177   -def verify_import(db_engine, tenant_id: str):
178   - """
179   - Verify imported data.
180   -
181   - Args:
182   - db_engine: SQLAlchemy database engine
183   - tenant_id: Tenant ID to verify
184   - """
185   - from sqlalchemy import text
186   -
187   - with db_engine.connect() as conn:
188   - # Count SPUs
189   - result = conn.execute(text("SELECT COUNT(*) FROM shoplazza_product_spu WHERE tenant_id = :tenant_id"), {"tenant_id": tenant_id})
190   - spu_count = result.scalar()
191   -
192   - # Count SKUs
193   - result = conn.execute(text("SELECT COUNT(*) FROM shoplazza_product_sku WHERE tenant_id = :tenant_id"), {"tenant_id": tenant_id})
194   - sku_count = result.scalar()
195   -
196   - print(f"\nVerification:")
197   - print(f" SPUs: {spu_count}")
198   - print(f" SKUs: {sku_count}")
199   -
200   - return spu_count, sku_count
201   -
202   -
203   -def main():
204   - parser = argparse.ArgumentParser(description='Import test data into MySQL')
205   -
206   - # Database connection
207   - parser.add_argument('--db-host', required=True, help='MySQL host')
208   - parser.add_argument('--db-port', type=int, default=3306, help='MySQL port (default: 3306)')
209   - parser.add_argument('--db-database', required=True, help='MySQL database name')
210   - parser.add_argument('--db-username', required=True, help='MySQL username')
211   - parser.add_argument('--db-password', required=True, help='MySQL password')
212   -
213   - # Import options
214   - parser.add_argument('--sql-file', required=True, help='SQL file to import')
215   - parser.add_argument('--tenant-id', help='Tenant ID to verify (optional)')
216   -
217   - args = parser.parse_args()
218   -
219   - print(f"Connecting to MySQL: {args.db_host}:{args.db_port}/{args.db_database}")
220   -
221   - # Connect to database
222   - try:
223   - db_engine = create_db_connection(
224   - host=args.db_host,
225   - port=args.db_port,
226   - database=args.db_database,
227   - username=args.db_username,
228   - password=args.db_password
229   - )
230   - except Exception as e:
231   - print(f"ERROR: Failed to connect to MySQL: {e}")
232   - return 1
233   -
234   - # Test connection
235   - if not test_connection(db_engine):
236   - print("ERROR: Database connection test failed")
237   - return 1
238   -
239   - print("Database connection successful")
240   -
241   - # Clean existing data if tenant_id provided
242   - if args.tenant_id:
243   - print(f"\nCleaning existing data for tenant_id: {args.tenant_id}")
244   - from sqlalchemy import text
245   - try:
246   - with db_engine.connect() as conn:
247   - # Delete SKUs first (foreign key constraint)
248   - conn.execute(text(f"DELETE FROM shoplazza_product_sku WHERE tenant_id = '{args.tenant_id}'"))
249   - # Delete SPUs
250   - conn.execute(text(f"DELETE FROM shoplazza_product_spu WHERE tenant_id = '{args.tenant_id}'"))
251   - conn.commit()
252   - print("✓ Existing data cleaned")
253   - except Exception as e:
254   - print(f"⚠ Warning: Failed to clean existing data: {e}")
255   - # Continue anyway
256   -
257   - # Import SQL file
258   - print(f"\nImporting SQL file: {args.sql_file}")
259   - try:
260   - import_sql_file(db_engine, args.sql_file)
261   - print("Import completed successfully")
262   - except Exception as e:
263   - print(f"ERROR: Failed to import SQL file: {e}")
264   - import traceback
265   - traceback.print_exc()
266   - return 1
267   -
268   - # Verify import if tenant_id provided
269   - if args.tenant_id:
270   - verify_import(db_engine, args.tenant_id)
271   -
272   - return 0
273   -
274   -
275   -if __name__ == '__main__':
276   - sys.exit(main())
277   -
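The removed fallback importer splits the SQL dump on semicolons with a quote-tracking state machine. A minimal standalone sketch of that splitting logic (hypothetical helper name, not part of the repo; unlike the removed code, the escape check here also requires being inside a string, so an empty literal `''` is not mistaken for an escape):

```python
def split_sql_statements(sql: str) -> list:
    """Split a SQL dump on semicolons, ignoring semicolons inside
    single-quoted strings; keep only INSERT INTO statements."""
    statements = []
    current = []
    in_string = False
    i = 0
    while i < len(sql):
        ch = sql[i]
        if ch == "'":
            # Inside a string, '' is an escaped quote, not a terminator.
            if in_string and i + 1 < len(sql) and sql[i + 1] == "'":
                current.append("''")
                i += 2
                continue
            in_string = not in_string
        current.append(ch)
        # Split on semicolons only outside string literals.
        if ch == ';' and not in_string:
            stmt = ''.join(current).strip()
            if stmt.upper().startswith('INSERT INTO'):
                statements.append(stmt)
            current = []
        i += 1
    tail = ''.join(current).strip()
    if tail.upper().startswith('INSERT INTO'):
        statements.append(tail)
    return statements
```

A semicolon inside a value such as `'a;b'` therefore does not terminate the statement, which is the case the naive `split(';')` approach gets wrong.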
scripts/indexer__old_2025_11/ingest.sh deleted
... ... @@ -1,92 +0,0 @@
1   -#!/bin/bash
2   -
3   -# Unified data ingestion script for saas-search
4   -# Ingests data from MySQL to Elasticsearch
5   -#
6   -# [LEGACY] Kept only for historical compatibility; new workflows should not continue using it.
7   -# Recommended replacement:
8   -# 1) ./scripts/create_tenant_index.sh <tenant_id>
9   -# 2) POST /indexer/reindex
10   -
11   -cd "$(dirname "$0")/.."
12   -source /home/tw/miniconda3/etc/profile.d/conda.sh
13   -conda activate searchengine
14   -
15   -GREEN='\033[0;32m'
16   -YELLOW='\033[1;33m'
17   -RED='\033[0;31m'
18   -NC='\033[0m'
19   -
20   -echo -e "${GREEN}========================================${NC}"
21   -echo -e "${GREEN}Data ingestion script${NC}"
22   -echo -e "${GREEN}========================================${NC}"
23   -
24   -# Load config from .env file if it exists
25   -if [ -f .env ]; then
26   - set -a
27   - source .env
28   - set +a
29   -fi
30   -
31   -# Parameters
32   -TENANT_ID=${1:-""}
33   -RECREATE_INDEX=${2:-"false"}
34   -
35   -DB_HOST=${DB_HOST:-"120.79.247.228"}
36   -DB_PORT=${DB_PORT:-"3316"}
37   -DB_DATABASE=${DB_DATABASE:-"saas"}
38   -DB_USERNAME=${DB_USERNAME:-"saas"}
39   -DB_PASSWORD=${DB_PASSWORD:-"P89cZHS5d7dFyc9R"}
40   -ES_HOST=${ES_HOST:-"http://localhost:9200"}
41   -BATCH_SIZE=${BATCH_SIZE:-500}
42   -
43   -echo -e "\n${YELLOW}Configuration:${NC}"
44   -echo " Tenant ID: $TENANT_ID"
45   -echo " Recreate Index: $RECREATE_INDEX"
46   -echo " MySQL: $DB_HOST:$DB_PORT/$DB_DATABASE"
47   -echo " Elasticsearch: $ES_HOST"
48   -echo " Batch Size: $BATCH_SIZE"
49   -
50   -# Validate parameters
51   -if [ -z "$TENANT_ID" ]; then
52   - echo -e "${RED}ERROR: Tenant ID is required${NC}"
53   - echo "Usage: $0 <tenant_id> [recreate_index]"
54   - echo " tenant_id: Required, tenant ID"
55   - echo " recreate_index: Optional, recreate index if exists (true/false, default: false)"
56   - exit 1
57   -fi
58   -
59   -if [ -z "$DB_PASSWORD" ]; then
60   - echo -e "${RED}ERROR: DB_PASSWORD is not set; check the .env file or environment variables${NC}"
61   - exit 1
62   -fi
63   -
64   -# Build command
65   -CMD="python scripts/ingest_shoplazza.py \
66   - --db-host $DB_HOST \
67   - --db-port $DB_PORT \
68   - --db-database $DB_DATABASE \
69   - --db-username $DB_USERNAME \
70   - --db-password $DB_PASSWORD \
71   - --tenant-id $TENANT_ID \
72   - --es-host $ES_HOST \
73   - --batch-size $BATCH_SIZE"
74   -
75   -if [ "$RECREATE_INDEX" = "true" ] || [ "$RECREATE_INDEX" = "1" ]; then
76   - CMD="$CMD --recreate"
77   - echo -e "\n${YELLOW}Warning: Index will be deleted and recreated!${NC}"
78   -fi
79   -
80   -echo -e "\n${YELLOW}Starting data ingestion...${NC}"
81   -eval $CMD
82   -
83   -if [ $? -eq 0 ]; then
84   - echo -e "\n${GREEN}========================================${NC}"
85   - echo -e "${GREEN}Data ingestion complete!${NC}"
86   - echo -e "${GREEN}========================================${NC}"
87   -else
88   - echo -e "\n${RED}========================================${NC}"
89   - echo -e "${RED}Data ingestion failed!${NC}"
90   - echo -e "${RED}========================================${NC}"
91   - exit 1
92   -fi
scripts/indexer__old_2025_11/ingest_shoplazza.py deleted
... ... @@ -1,146 +0,0 @@
1   -#!/usr/bin/env python3
2   -"""
3   -Shoplazza data ingestion script.
4   -
5   -Loads SPU and SKU data from MySQL and indexes into Elasticsearch using SPU transformer.
6   -"""
7   -
8   -import sys
9   -import os
10   -import argparse
11   -from pathlib import Path
12   -
13   -# Add parent directory to path
14   -sys.path.insert(0, str(Path(__file__).parent.parent))
15   -
16   -from utils.db_connector import create_db_connection
17   -from utils.es_client import ESClient
18   -from indexer.spu_transformer import SPUTransformer
19   -from indexer.mapping_generator import load_mapping, DEFAULT_INDEX_NAME
20   -from indexer.bulk_indexer import BulkIndexer
21   -
22   -
23   -def main():
24   - parser = argparse.ArgumentParser(description='Ingest Shoplazza SPU/SKU data into Elasticsearch')
25   -
26   - # Database connection
27   - parser.add_argument('--db-host', required=True, help='MySQL host')
28   - parser.add_argument('--db-port', type=int, default=3306, help='MySQL port (default: 3306)')
29   - parser.add_argument('--db-database', required=True, help='MySQL database name')
30   - parser.add_argument('--db-username', required=True, help='MySQL username')
31   - parser.add_argument('--db-password', required=True, help='MySQL password')
32   -
33   - # Tenant and index
34   - parser.add_argument('--tenant-id', required=True, help='Tenant ID (required)')
35   - parser.add_argument('--es-host', default='http://localhost:9200', help='Elasticsearch host')
36   -
37   - # Options
38   - parser.add_argument('--recreate', action='store_true', help='Recreate index if exists')
39   - parser.add_argument('--batch-size', type=int, default=500, help='Batch size for indexing (default: 500)')
40   -
41   - args = parser.parse_args()
42   -
43   - print(f"Starting Shoplazza data ingestion for tenant: {args.tenant_id}")
44   -
45   - # Load mapping from JSON file
46   - try:
47   - mapping = load_mapping()
48   - print(f"Loaded mapping configuration")
49   - except Exception as e:
50   - print(f"ERROR: Failed to load mapping: {e}")
51   - return 1
52   -
53   - index_name = DEFAULT_INDEX_NAME
54   -
55   - # Connect to MySQL
56   - print(f"Connecting to MySQL: {args.db_host}:{args.db_port}/{args.db_database}")
57   - try:
58   - db_engine = create_db_connection(
59   - host=args.db_host,
60   - port=args.db_port,
61   - database=args.db_database,
62   - username=args.db_username,
63   - password=args.db_password
64   - )
65   - except Exception as e:
66   - print(f"ERROR: Failed to connect to MySQL: {e}")
67   - return 1
68   -
69   - # Connect to Elasticsearch
70   - es_host = args.es_host
71   - es_username = os.environ.get('ES_USERNAME')
72   - es_password = os.environ.get('ES_PASSWORD')
73   -
74   - print(f"Connecting to Elasticsearch: {es_host}")
75   - if es_username and es_password:
76   - print(f"Using authentication: {es_username}")
77   - es_client = ESClient(hosts=[es_host], username=es_username, password=es_password)
78   - else:
79   - es_client = ESClient(hosts=[es_host])
80   -
81   - if not es_client.ping():
82   - print(f"ERROR: Cannot connect to Elasticsearch at {es_host}")
83   - return 1
84   -
85   - # Create index if needed
86   - if args.recreate:
87   - if es_client.index_exists(index_name):
88   - print(f"Deleting existing index: {index_name}")
89   - if not es_client.delete_index(index_name):
90   - print(f"ERROR: Failed to delete index '{index_name}'")
91   - return 1
92   -
93   - if not es_client.index_exists(index_name):
94   - print(f"Creating index: {index_name}")
95   - if not es_client.create_index(index_name, mapping):
96   - print(f"ERROR: Failed to create index '{index_name}'")
97   - print("Please check the mapping configuration and try again.")
98   - return 1
99   - else:
100   - print(f"Using existing index: {index_name}")
101   -
102   - # Initialize SPU transformer
103   - print(f"Initializing SPU transformer for tenant: {args.tenant_id}")
104   - transformer = SPUTransformer(db_engine, args.tenant_id)
105   -
106   - # Transform data
107   - print("Transforming SPU and SKU data...")
108   - try:
109   - documents = transformer.transform_batch()
110   - print(f"Transformed {len(documents)} SPU documents")
111   - except Exception as e:
112   - print(f"ERROR: Failed to transform data: {e}")
113   - import traceback
114   - traceback.print_exc()
115   - return 1
116   -
117   - if not documents:
118   - print("WARNING: No documents to index")
119   - return 0
120   -
121   - # Bulk index
122   - print(f"Indexing {len(documents)} documents (batch size: {args.batch_size})...")
123   - indexer = BulkIndexer(es_client, index_name, batch_size=args.batch_size)
124   -
125   - try:
126   - results = indexer.index_documents(documents, id_field="spu_id", show_progress=True)
127   - print(f"\nIngestion complete:")
128   - print(f" Success: {results['success']}")
129   - print(f" Failed: {results['failed']}")
130   - print(f" Time: {results.get('elapsed_time', 0):.2f}s")
131   -
132   - if results['failed'] > 0:
133   - print(f"\nWARNING: {results['failed']} documents failed to index")
134   - return 1
135   -
136   - return 0
137   - except Exception as e:
138   - print(f"ERROR: Failed to index documents: {e}")
139   - import traceback
140   - traceback.print_exc()
141   - return 1
142   -
143   -
144   -if __name__ == '__main__':
145   - sys.exit(main())
146   -
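A related note on connection handling: the removed import_test_data.py recovered credentials from the engine URL with a regex that mis-parses passwords containing `:` or percent-encoded characters. A sketch (hypothetical helper name) of the same extraction using `urllib.parse`, which handles those cases:

```python
from urllib.parse import urlsplit, unquote

def parse_mysql_url(url: str) -> tuple:
    """Extract (user, password, host, port, database) from a
    SQLAlchemy-style MySQL URL such as mysql+pymysql://u:p@h:3306/db."""
    parts = urlsplit(url)
    if not parts.scheme.startswith("mysql"):
        raise ValueError("not a MySQL URL: %s" % url)
    return (
        unquote(parts.username or ""),
        unquote(parts.password or ""),   # percent-encoded chars decoded here
        parts.hostname,
        parts.port or 3306,              # MySQL default when the URL omits it
        parts.path.lstrip("/"),
    )
```

`urlsplit` parses the netloc for arbitrary schemes, so no hand-written regex is needed.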
scripts/indexer__old_2025_11/recreate_and_import.py deleted
... ... @@ -1,184 +0,0 @@
1   -#!/usr/bin/env python3
2   -"""
3   -Script to rebuild the index and import data.
4   -
5   -Deletes the old index, recreates it with the new mapping, then imports the data.
6   -"""
7   -
8   -import sys
9   -import os
10   -import argparse
11   -from pathlib import Path
12   -
13   -# Add parent directory to path
14   -sys.path.insert(0, str(Path(__file__).parent.parent))
15   -
16   -from utils.db_connector import create_db_connection
17   -from utils.es_client import ESClient
18   -from indexer.mapping_generator import load_mapping, delete_index_if_exists, DEFAULT_INDEX_NAME
19   -from indexer.spu_transformer import SPUTransformer
20   -from indexer.bulk_indexer import BulkIndexer
21   -
22   -
23   -def main():
24   - parser = argparse.ArgumentParser(description='Rebuild the ES index and import data')
25   -
26   - # Database connection
27   - parser.add_argument('--db-host', help='MySQL host (or env var DB_HOST)')
28   - parser.add_argument('--db-port', type=int, help='MySQL port (or env var DB_PORT, default: 3306)')
29   - parser.add_argument('--db-database', help='MySQL database (or env var DB_DATABASE)')
30   - parser.add_argument('--db-username', help='MySQL username (or env var DB_USERNAME)')
31   - parser.add_argument('--db-password', help='MySQL password (or env var DB_PASSWORD)')
32   -
33   - # Tenant and ES
34   - parser.add_argument('--tenant-id', required=True, help='Tenant ID (required)')
35   - parser.add_argument('--es-host', help='Elasticsearch host (or env var ES_HOST, default: http://localhost:9200)')
36   -
37   - # Options
38   - parser.add_argument('--batch-size', type=int, default=500, help='Bulk import batch size (default: 500)')
39   - parser.add_argument('--skip-delete', action='store_true', help='Skip deleting the old index')
40   -
41   - args = parser.parse_args()
42   -
43   - print("=" * 60)
44   - print("Rebuild ES index and import data")
45   - print("=" * 60)
46   -
47   - # Load mapping
48   - print("\n[1/6] Loading mapping configuration...")
49   - try:
50   - mapping = load_mapping()
51   - print("✓ Mapping configuration loaded")
52   - except Exception as e:
53   - print(f"✗ Failed to load mapping: {e}")
54   - return 1
55   -
56   - index_name = DEFAULT_INDEX_NAME
57   - print(f"Index name: {index_name}")
58   -
59   - # Connect to Elasticsearch
60   - print("\n[2/6] Connecting to Elasticsearch...")
61   - es_host = args.es_host or os.environ.get('ES_HOST', 'http://localhost:9200')
62   - es_username = os.environ.get('ES_USERNAME')
63   - es_password = os.environ.get('ES_PASSWORD')
64   -
65   - print(f"ES address: {es_host}")
66   - if es_username:
67   - print(f"ES username: {es_username}")
68   -
69   - try:
70   - if es_username and es_password:
71   - es_client = ESClient(hosts=[es_host], username=es_username, password=es_password)
72   - else:
73   - es_client = ESClient(hosts=[es_host])
74   -
75   - if not es_client.ping():
76   - print(f"✗ Cannot connect to Elasticsearch: {es_host}")
77   - return 1
78   - print("✓ Elasticsearch connection successful")
79   - except Exception as e:
80   - print(f"✗ Failed to connect to Elasticsearch: {e}")
81   - return 1
82   -
83   - # Delete the old index
84   - if not args.skip_delete:
85   - print("\n[3/6] Deleting old index...")
86   - if es_client.index_exists(index_name):
87   - print(f"Found existing index: {index_name}")
88   - if delete_index_if_exists(es_client, index_name):
89   - print(f"✓ Index deleted: {index_name}")
90   - else:
91   - print(f"✗ Failed to delete index: {index_name}")
92   - return 1
93   - else:
94   - print(f"Index does not exist, skipping delete: {index_name}")
95   - else:
96   - print("\n[3/6] Skipping old-index deletion")
97   -
98   - # Create the new index
99   - print("\n[4/6] Creating new index...")
100   - try:
101   - if es_client.index_exists(index_name):
102   - print(f"✓ Index already exists: {index_name}, skipping creation")
103   - else:
104   - print(f"Creating index: {index_name}")
105   - if es_client.create_index(index_name, mapping):
106   - print(f"✓ Index created: {index_name}")
107   - else:
108   - print(f"✗ Failed to create index: {index_name}")
109   - return 1
110   - except Exception as e:
111   - print(f"✗ Failed to create index: {e}")
112   - import traceback
113   - traceback.print_exc()
114   - return 1
115   -
116   - # Connect to MySQL
117   - print("\n[5/6] Connecting to MySQL...")
118   - db_host = args.db_host or os.environ.get('DB_HOST')
119   - db_port = args.db_port or int(os.environ.get('DB_PORT', 3306))
120   - db_database = args.db_database or os.environ.get('DB_DATABASE')
121   - db_username = args.db_username or os.environ.get('DB_USERNAME')
122   - db_password = args.db_password or os.environ.get('DB_PASSWORD')
123   -
124   - if not all([db_host, db_database, db_username, db_password]):
125   - print("✗ Incomplete MySQL connection parameters")
126   - print("Provide --db-host, --db-database, --db-username, --db-password")
127   - print("or set the environment variables: DB_HOST, DB_DATABASE, DB_USERNAME, DB_PASSWORD")
128   - return 1
129   -
130   - print(f"MySQL: {db_host}:{db_port}/{db_database}")
131   - try:
132   - db_engine = create_db_connection(
133   - host=db_host,
134   - port=db_port,
135   - database=db_database,
136   - username=db_username,
137   - password=db_password
138   - )
139   - print("✓ MySQL connection successful")
140   - except Exception as e:
141   - print(f"✗ Failed to connect to MySQL: {e}")
142   - return 1
143   -
144   - # Import data
145   - print("\n[6/6] Importing data...")
146   - print(f"Tenant ID: {args.tenant_id}")
147   - print(f"Batch size: {args.batch_size}")
148   -
149   - try:
150   - transformer = SPUTransformer(db_engine, args.tenant_id)
151   - print("Transforming data...")
152   - documents = transformer.transform_batch()
153   - print(f"✓ Transform complete: {len(documents)} documents")
154   -
155   - if not documents:
156   - print("⚠ No data to import")
157   - return 0
158   -
159   - print(f"Importing into ES (batch size: {args.batch_size})...")
160   - indexer = BulkIndexer(es_client, index_name, batch_size=args.batch_size)
161   - results = indexer.index_documents(documents, id_field="spu_id", show_progress=True)
162   -
163   - print(f"\n{'='*60}")
164   - print("Import complete!")
165   - print(f"{'='*60}")
166   - print(f"Success: {results['success']}")
167   - print(f"Failed: {results['failed']}")
168   - print(f"Elapsed: {results.get('elapsed_time', 0):.2f}s")
169   -
170   - if results['failed'] > 0:
171   - print(f"\n⚠ Warning: {results['failed']} documents failed to import")
172   - return 1
173   -
174   - return 0
175   - except Exception as e:
176   - print(f"✗ Failed to import data: {e}")
177   - import traceback
178   - traceback.print_exc()
179   - return 1
180   -
181   -
182   -if __name__ == '__main__':
183   - sys.exit(main())
184   -
scripts/install_server_deps.sh deleted
... ... @@ -1,14 +0,0 @@
1   -#!/bin/bash
2   -
3   -echo "Installing server security dependencies..."
4   -
5   -# Check if we're in a conda environment
6   -if [ -z "$CONDA_DEFAULT_ENV" ]; then
7   - echo "Warning: No conda environment detected. Installing with pip..."
8   - pip install slowapi>=0.1.9 anyio>=3.7.0
9   -else
10   - echo "Installing in conda environment: $CONDA_DEFAULT_ENV"
11   - pip install slowapi>=0.1.9 anyio>=3.7.0
12   -fi
13   -
14   -echo "Dependencies installed successfully!"
15 0 \ No newline at end of file
scripts/reindex_from_remote_tenant_170_to_0.sh deleted
... ... @@ -1,99 +0,0 @@
1   -#!/bin/bash
2   -#
3   -# Sync 10000 docs from search_products_tenant_170 on a remote ES into search_products_tenant_0 on this machine.
4   -# Requests go to the local ES, which pulls the remote data itself; reindex.remote.whitelist must be configured in the local elasticsearch.yml.
5   -#
6   -# Usage:
7   -#   ./scripts/reindex_from_remote_tenant_170_to_0.sh
8   -#
9   -# Environment variables (optional):
10   -#   LOCAL_ES_HOST   local ES address, used to create the index and send _reindex (defaults to ES_HOST from .env; should be local)
11   -#   REMOTE_ES_HOST  remote ES address (default http://120.76.41.98:9200)
12   -#   REMOTE_ES_USER  remote ES username (default essa)
13   -#   REMOTE_ES_PASS  remote ES password (default 4hOaLaf41y2VuI8y)
14   -#   MAX_DOCS        number of docs to sync (default 10000)
15   -#
16   -
17   -set -e
18   -
19   -cd "$(dirname "$0")/.."
20   -PROJECT_ROOT="$(pwd)"
21   -
22   -# Load .env
23   -# shellcheck source=scripts/lib/load_env.sh
24   -source "${PROJECT_ROOT}/scripts/lib/load_env.sh"
25   -load_env_file "${PROJECT_ROOT}/.env"
26   -
27   -# Local ES (target that receives the _reindex request)
28   -LOCAL_ES_HOST="${LOCAL_ES_HOST:-${ES_HOST:-http://localhost:9200}}"
29   -ES_USERNAME="${ES_USERNAME:-}"
30   -ES_PASSWORD="${ES_PASSWORD:-}"
31   -ES_INDEX_NAMESPACE="${ES_INDEX_NAMESPACE:-}"
32   -
33   -# Remote ES (data source)
34   -REMOTE_ES_HOST="${REMOTE_ES_HOST:-http://120.76.41.98:9200}"
35   -REMOTE_ES_USER="${REMOTE_ES_USER:-essa}"
36   -REMOTE_ES_PASS="${REMOTE_ES_PASS:-4hOaLaf41y2VuI8y}"
37   -
38   -MAX_DOCS="${MAX_DOCS:-10000}"
39   -SOURCE_INDEX="search_products_tenant_170"
40   -DEST_INDEX="${ES_INDEX_NAMESPACE}search_products_tenant_0"
41   -MAPPING_FILE="${PROJECT_ROOT}/mappings/search_products.json"
42   -
43   -# Local curl auth
44   -AUTH_PARAM=""
45   -if [ -n "$ES_USERNAME" ] && [ -n "$ES_PASSWORD" ]; then
46   - AUTH_PARAM="-u ${ES_USERNAME}:${ES_PASSWORD}"
47   -fi
48   -
49   -echo "Local ES: $LOCAL_ES_HOST"
50   -echo "Remote ES: $REMOTE_ES_HOST"
51   -echo "Source index: $SOURCE_INDEX"
52   -echo "Dest index: $DEST_INDEX"
53   -echo "Docs to sync: $MAX_DOCS"
54   -echo ""
55   -
56   -# 1. Create the target index if it does not exist
57   -if ! curl -s $AUTH_PARAM "${LOCAL_ES_HOST}/${DEST_INDEX}" -o /dev/null -w "%{http_code}" | grep -q 200; then
58   - echo "Creating target index: $DEST_INDEX"
59   - if [ ! -f "$MAPPING_FILE" ]; then
60   - echo "ERROR: mapping file not found: $MAPPING_FILE"
61   - exit 1
62   - fi
63   - curl -X PUT "${LOCAL_ES_HOST}/${DEST_INDEX}" \
64   - -H "Content-Type: application/json" \
65   - $AUTH_PARAM \
66   - -d @"${MAPPING_FILE}" \
67   - -w "\nHTTP: %{http_code}\n" -s | tail -1
68   - echo ""
69   -else
70   - echo "Target index already exists: $DEST_INDEX; data will be written (same-id docs may be overwritten)"
71   -fi
72   -
73   -# 2. Reindex from remote (the password is passed into the JSON via env to avoid shell escaping)
74   -echo "Running reindex from remote (at most $MAX_DOCS docs)..."
75   -export REMOTE_ES_HOST REMOTE_ES_USER REMOTE_ES_PASS SOURCE_INDEX DEST_INDEX MAX_DOCS
76   -# ES 9.x takes wait_for_completion as a query parameter, not in the body
77   -curl -X POST "${LOCAL_ES_HOST}/_reindex?wait_for_completion=true&pretty" \
78   - -H "Content-Type: application/json" \
79   - $AUTH_PARAM \
80   - -d @- <<EOF
81   -{
82   - "max_docs": ${MAX_DOCS},
83   - "source": {
84   - "remote": {
85   - "host": "${REMOTE_ES_HOST}",
86   - "username": "${REMOTE_ES_USER}",
87   - "password": "${REMOTE_ES_PASS}"
88   - },
89   - "index": "${SOURCE_INDEX}",
90   - "size": 500
91   - },
92   - "dest": {
93   - "index": "${DEST_INDEX}"
94   - }
95   -}
96   -EOF
97   -
98   -echo ""
99   -echo "Done. Verify the count: curl $AUTH_PARAM '${LOCAL_ES_HOST}/${DEST_INDEX}/_count?pretty' -H 'Content-Type: application/json' -d '{\"query\":{\"match_all\":{}}}'"
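The heredoc above interpolates credentials straight into the JSON body, which is why the deleted script routes them through environment variables. A sketch of building the same `_reindex` payload in Python (hypothetical helper name; values shown are the script's defaults), where `json.dumps` takes care of quoting and escaping:

```python
import json

def build_remote_reindex_body(remote_host, username, password,
                              source_index, dest_index,
                              max_docs=10000, batch_size=500):
    """Serialize the reindex-from-remote request body; json.dumps handles
    any special characters in the credentials."""
    return json.dumps({
        "max_docs": max_docs,
        "source": {
            "remote": {"host": remote_host, "username": username, "password": password},
            "index": source_index,
            "size": batch_size,   # per-batch scroll size, as in the heredoc
        },
        "dest": {"index": dest_index},
    })
```

The resulting string can be POSTed to `${LOCAL_ES_HOST}/_reindex?wait_for_completion=true` exactly as the shell script does.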
scripts/start.sh deleted
... ... @@ -1,10 +0,0 @@
1   -#!/bin/bash
2   -
3   -# Service start entrypoint.
4   -# Delegates to unified service controller.
5   -
6   -set -euo pipefail
7   -
8   -cd "$(dirname "$0")/.."
9   -
10   -./scripts/service_ctl.sh up "$@"
tests/manual/README.md 0 → 100644
... ... @@ -0,0 +1,5 @@
  1 +# Manual Tests
  2 +
  3 +`tests/manual/` holds trial-run scripts that need manually started dependency services, hands-on inspection of results, or a real external environment.
  4 +
  5 +These scripts are outside the automatic `pytest` regression scope and should not be lumped together with the contract tests under `tests/ci`.
... ...
scripts/test_build_docs_api.py renamed to tests/manual/test_build_docs_api.py
... ... @@ -4,9 +4,9 @@
4 4  
5 5 Usage:
6 6 1. Start the Indexer service first: ./scripts/start_indexer.sh (or uvicorn api.indexer_app:app --port 6004)
7   - 2. Run: python scripts/test_build_docs_api.py
  7 + 2. Run: python tests/manual/test_build_docs_api.py
8 8  
9   - A different address can also be given: INDEXER_URL=http://localhost:6004 python scripts/test_build_docs_api.py
  9 + A different address can also be given: INDEXER_URL=http://localhost:6004 python tests/manual/test_build_docs_api.py
10 10 """
11 11  
12 12 import json
... ... @@ -15,7 +15,7 @@ import sys
15 15 from datetime import datetime, timezone
16 16  
17 17 # Project root
18   -ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
  18 +ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
19 19 sys.path.insert(0, ROOT)
20 20  
21 21 # Use requests against the real service by default; fall back to TestClient if it is not installed
... ... @@ -122,7 +122,7 @@ def main():
122 122 print("\n[Error] Could not connect to the Indexer service:", e)
123 123 print("Start it first: ./scripts/start_indexer.sh or uvicorn api.indexer_app:app --port 6004")
124 124 if HAS_REQUESTS:
125   - print("或使用进程内测试: USE_TEST_CLIENT=1 python scripts/test_build_docs_api.py")
  125 + print("Or use in-process testing: USE_TEST_CLIENT=1 python tests/manual/test_build_docs_api.py")
126 126 sys.exit(1)
127 127 else:
128 128 if not use_http and not HAS_REQUESTS:
... ...
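The extra `os.path.dirname` in the ROOT computation above reflects the script moving one directory deeper (`tests/manual/` instead of `scripts/`). With `pathlib` the same lookup states the depth explicitly (a sketch with a hypothetical helper name, not what the script itself uses):

```python
from pathlib import Path

def project_root(script_path: str) -> Path:
    # tests/manual/<script>.py sits three levels below the project root,
    # so parents[2] is equivalent to the triple os.path.dirname chain.
    return Path(script_path).parents[2]
```

Using an index rather than nested calls makes the rename's intent (one level deeper, one more dirname) easier to audit.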
translation/README.md
... ... @@ -12,8 +12,8 @@
12 12 - Startup script: [`scripts/start_translator.sh`](/data/saas-search/scripts/start_translator.sh)
13 13 - Virtualenv setup: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh)
14 14 - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py)
15   -- Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
16   -- Focused benchmark script: [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
  15 +- Local model benchmark: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  16 +- Focused benchmark script: [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
17 17 - Baseline performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
18 18 - CT2 extended report: [`perf_reports/20260318/translation_local_models_ct2/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/README.md)
19 19 - CT2 focused tuning report: [`perf_reports/20260318/translation_local_models_ct2_focus/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/README.md)
... ... @@ -550,8 +550,8 @@ curl -X POST http://127.0.0.1:6006/translate \
550 550 - After switching to CTranslate2, a fresh benchmark round is needed, paying particular attention to single-request latency and concurrent tail latency for `nllb-200-distilled-600m` and batch throughput for `opus-mt-*`.
551 551  
552 552 Benchmark scripts:
553   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
554   -- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
  553 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  554 +- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
555 555  
556 556 Datasets:
557 557 - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
... ... @@ -601,14 +601,14 @@ curl -X POST http://127.0.0.1:6006/translate \
601 601  
602 602 ```bash
603 603 cd /data/saas-search
604   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py
  604 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py
605 605 ```
606 606  
607 607 Command to reproduce this round's extended benchmark:
608 608  
609 609 ```bash
610 610 cd /data/saas-search
611   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  611 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
612 612 --suite extended \
613 613 --disable-cache \
614 614 --serial-items-per-case 256 \
... ... @@ -620,7 +620,7 @@ cd /data/saas-search
620 620 Single-model extended benchmark example:
621 621  
622 622 ```bash
623   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  623 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
624 624 --single \
625 625 --suite extended \
626 626 --model opus-mt-zh-en \
... ... @@ -639,7 +639,7 @@ cd /data/saas-search
639 639 Reproducing single-request latency:
640 640  
641 641 ```bash
642   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  642 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
643 643 --single \
644 644 --suite extended \
645 645 --model nllb-200-distilled-600m \
... ...