Commit 7913e2fb4b6fb39cd54c61727066648c316c42cc

Authored by tangwang
1 parent 149dad2b

服务管理和监控

@@ -25,14 +25,17 @@ README 鈭賒撘遣蝡霈斤嚗**蝟餌芋器 @@ -25,14 +25,17 @@ README 鈭賒撘遣蝡霈斤嚗**蝟餌芋器
25 ./scripts/init_env.sh # 隞 .env.example .env嚗銝嚗 25 ./scripts/init_env.sh # 隞 .env.example .env嚗銝嚗
26 source activate.sh 26 source activate.sh
27 27
28 -# 敹嚗ackend/indexer/frontend嚗  
29 -./run.sh 28 +# 絲嚗摰嚗
  29 +./run.sh all # 遠鈭 ./scripts/service_ctl.sh up all
30 30
31 -# 嚗撘嚗  
32 -./scripts/service_ctl.sh start tei cnclip embedding translator reranker  
33 -  
34 -# 31 +# monitor daemon
35 ./scripts/service_ctl.sh status 32 ./scripts/service_ctl.sh status
  33 +
  34 +# 內靘嚗
  35 +./restart.sh all # 遠鈭 ./scripts/service_ctl.sh restart all
  36 +
  37 +# 銝迫嚗 monitor daemon嚗
  38 +./scripts/stop.sh all # 遠鈭 ./scripts/service_ctl.sh down all
36 ``` 39 ```
37 40
38 蝞∠秩提恕銵蛹撘 41 蝞∠秩提恕銵蛹撘
docs/QUICKSTART.md
@@ -64,17 +64,20 @@ source activate.sh @@ -64,17 +64,20 @@ source activate.sh
64 启动与停止: 64 启动与停止:
65 65
66 ```bash 66 ```bash
67 -./run.sh  
68 -# 启动全部能力  
69 -# 追加可选能力服务(显式指定)  
70 -TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip embedding translator reranker 67 +./run.sh all
  68 +# 仅为薄封装:等价于 ./scripts/service_ctl.sh up all
71 # 说明: 69 # 说明:
  70 +# - all = tei cnclip embedding translator reranker backend indexer frontend
  71 +# - up 会同时启动 monitor daemon(运行期连续失败自动重启)
72 # - reranker 为 GPU 强制模式(资源不足会直接启动失败) 72 # - reranker 为 GPU 强制模式(资源不足会直接启动失败)
73 # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU) 73 # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU)
74 # - cnclip 默认使用 cuda;若显式配置为 cuda 且 GPU 不可用会直接失败(不会自动降级到 cpu) 74 # - cnclip 默认使用 cuda;若显式配置为 cuda 且 GPU 不可用会直接失败(不会自动降级到 cpu)
75 75
  76 +./restart.sh all
  77 +# 仅为薄封装:等价于 ./scripts/service_ctl.sh restart all
  78 +
76 ./scripts/service_ctl.sh status 79 ./scripts/service_ctl.sh status
77 -./scripts/stop.sh 80 +./scripts/stop.sh all # 仅为薄封装:等价于 ./scripts/service_ctl.sh down all
78 ``` 81 ```
79 82
80 服务管理方式(入口职责、默认行为、全量拉起顺序)见: 83 服务管理方式(入口职责、默认行为、全量拉起顺序)见:
@@ -313,7 +316,8 @@ saas-search 以 MySQL 中的店匠标准表为权威数据源: @@ -313,7 +316,8 @@ saas-search 以 MySQL 中的店匠标准表为权威数据源:
313 - `scripts/mock_data.sh`:生成 Tenant1 Mock + Tenant2 CSV 并导入 MySQL 316 - `scripts/mock_data.sh`:生成 Tenant1 Mock + Tenant2 CSV 并导入 MySQL
314 - `scripts/create_tenant_index.sh <tenant_id>`:创建租户 ES 索引结构 317 - `scripts/create_tenant_index.sh <tenant_id>`:创建租户 ES 索引结构
315 - `POST /indexer/reindex`:从 MySQL 全量导入到 ES 318 - `POST /indexer/reindex`:从 MySQL 全量导入到 ES
316 -- `run.sh` / `scripts/stop.sh`:服务启停;`scripts/service_ctl.sh`:start/stop/restart/status 319 +- `run.sh` / `restart.sh` / `scripts/stop.sh`:推荐入口(对应 up/restart/down)
  320 +- `scripts/service_ctl.sh`:`up/down/start/stop/restart/status/monitor/monitor-start/monitor-stop/monitor-status`
317 321
318 更多脚本与验证命令见 `docs/Usage-Guide.md`。 322 更多脚本与验证命令见 `docs/Usage-Guide.md`。
319 323
@@ -516,6 +520,7 @@ curl http://localhost:6007/health @@ -516,6 +520,7 @@ curl http://localhost:6007/health
516 520
517 - `logs/<service>-YYYY-MM-DD.log`(`service_ctl.sh` 按天写入的真实文件) 521 - `logs/<service>-YYYY-MM-DD.log`(`service_ctl.sh` 按天写入的真实文件)
518 - `logs/<service>.log`(指向当天文件的软链,推荐 `tail -F`) 522 - `logs/<service>.log`(指向当天文件的软链,推荐 `tail -F`)
  523 +- `logs/service-monitor.log`(`service_ctl.sh monitor` 运行期健康检查、失败计数、自动重启日志)
519 - `logs/api.log`(backend 进程内日志,按天轮转) 524 - `logs/api.log`(backend 进程内日志,按天轮转)
520 - `logs/backend_verbose.log`(backend 大对象详细日志,按天轮转) 525 - `logs/backend_verbose.log`(backend 大对象详细日志,按天轮转)
521 - `logs/indexer.log`(索引结构化 JSON 日志,按天轮转) 526 - `logs/indexer.log`(索引结构化 JSON 日志,按天轮转)
docs/Usage-Guide.md
@@ -121,14 +121,15 @@ API_PORT=6002 @@ -121,14 +121,15 @@ API_PORT=6002
121 121
122 ```bash 122 ```bash
123 cd /data/saas-search 123 cd /data/saas-search
124 -./run.sh 124 +./run.sh all
125 ``` 125 ```
126 126
127 这个脚本会自动: 127 这个脚本会自动:
128 1. 创建日志目录 128 1. 创建日志目录
129 -2. 启动核心服务(backend/indexer/frontend 129 +2. 按目标启动服务(`all`:`tei cnclip embedding translator reranker backend indexer frontend`
130 3. 写入 PID 到 `logs/*.pid` 130 3. 写入 PID 到 `logs/*.pid`
131 4. 执行健康检查 131 4. 执行健康检查
  132 +5. 启动 monitor daemon(运行期连续失败自动重启)
132 133
133 启动完成后,访问: 134 启动完成后,访问:
134 - **前端界面**: http://localhost:6003 135 - **前端界面**: http://localhost:6003
@@ -136,11 +137,7 @@ cd /data/saas-search @@ -136,11 +137,7 @@ cd /data/saas-search
136 - **API文档**: http://localhost:6002/docs 137 - **API文档**: http://localhost:6002/docs
137 - **索引API**: http://localhost:6004/docs 138 - **索引API**: http://localhost:6004/docs
138 139
139 -可选:全功能模式(同时启动 embedding/translator/reranker/tei/cnclip):  
140 -  
141 -```bash  
142 -TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip embedding translator reranker  
143 -``` 140 +仅启动部分服务示例:`./run.sh backend indexer frontend`
144 141
145 ### 方式2: 统一控制脚本(推荐) 142 ### 方式2: 统一控制脚本(推荐)
146 143
@@ -148,17 +145,20 @@ TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip emb @@ -148,17 +145,20 @@ TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip emb
148 # 查看状态 145 # 查看状态
149 ./scripts/service_ctl.sh status 146 ./scripts/service_ctl.sh status
150 147
151 -# 启动核心服务(默认)  
152 -./scripts/service_ctl.sh start 148 +# 推荐:一键启动 + 监控守护
  149 +./scripts/service_ctl.sh up all
153 150
154 # 启动指定服务 151 # 启动指定服务
  152 +./scripts/service_ctl.sh up backend indexer frontend
  153 +
  154 +# 启动指定服务(不自动拉起 monitor daemon)
155 ./scripts/service_ctl.sh start backend indexer frontend translator reranker tei cnclip 155 ./scripts/service_ctl.sh start backend indexer frontend translator reranker tei cnclip
156 156
157 -# 停止全部服务(含可选服务)  
158 -./scripts/service_ctl.sh stop 157 +# 停止全部服务(含 monitor daemon)
  158 +./scripts/service_ctl.sh down all
159 159
160 -# 重启  
161 -./scripts/service_ctl.sh restart 160 +# 重启全部服务
  161 +./scripts/service_ctl.sh restart all
162 ``` 162 ```
163 163
164 ### 方式3: 分步启动(单环境) 164 ### 方式3: 分步启动(单环境)
@@ -246,34 +246,35 @@ python -m http.server 6003 @@ -246,34 +246,35 @@ python -m http.server 6003
246 246
247 ### 1) 入口脚本职责 247 ### 1) 入口脚本职责
248 248
249 -- `./run.sh`:仅启动核心服务(`backend/indexer/frontend`)。  
250 -- `./restart.sh`:重启逻辑为“先停所有已知服务,再启动核心服务”。  
251 -- `./scripts/stop.sh`:停止所有已知服务。  
252 -- `./scripts/service_ctl.sh`:统一控制器,支持 `start/stop/restart/status`,是唯一推荐入口。 249 +- `./run.sh [all|service...]`:薄封装,直接调用 `./scripts/service_ctl.sh up [all|service...]`。
  250 +- `./restart.sh [all|service...]`:薄封装,直接调用 `./scripts/service_ctl.sh restart [all|service...]`。
  251 +- `./scripts/stop.sh [all|service...]`:薄封装,直接调用 `./scripts/service_ctl.sh down [all|service...]`。
  252 +- `./scripts/service_ctl.sh`:统一控制器,支持 `up/down/start/stop/restart/status/monitor*`(带参数行为完全由此脚本定义)。
253 253
254 ### 2) `service_ctl.sh` 的默认行为 254 ### 2) `service_ctl.sh` 的默认行为
255 255
256 -- `start`(不带服务名):启动核心服务 `backend/indexer/frontend`。  
257 -- `stop`(不带服务名):停止全部已知服务(含可选服务)。  
258 -- `restart`(不带服务名):先停全部,再只启动核心服务。  
259 -- `status`(不带服务名):显示全部已知服务状态。 256 +- `up`:**必须显式指定** 服务或 `all`,并自动启动 monitor daemon。
  257 +- `down`:**必须显式指定** 服务或 `all`,并停止 monitor daemon。
  258 +- `start`:**必须显式指定** 服务或 `all`(不自动拉起 monitor daemon)。
  259 +- `stop`:**必须显式指定** 服务或 `all`;若 monitor daemon 运行会先停止它。
  260 +- `restart`:**必须显式指定** 服务或 `all`。
  261 +- `status`(不带服务名):显示全部已知服务状态 + monitor daemon 状态。
260 262
261 ### 3) 全量服务一键拉起 263 ### 3) 全量服务一键拉起
262 264
263 ```bash 265 ```bash
264 -TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip embedding translator reranker 266 +./scripts/service_ctl.sh up all
265 ``` 267 ```
266 268
267 说明: 269 说明:
268 - `TEI_DEVICE` / `CNCLIP_DEVICE` 统一使用 `cuda|cpu`。 270 - `TEI_DEVICE` / `CNCLIP_DEVICE` 统一使用 `cuda|cpu`。
269 -- 显式把 `tei`、`cnclip` 放在前面,避免 `embedding` 因依赖未就绪启动失败 271 +- `all` 内部已按依赖顺序处理(先 `tei/cnclip` 再 `embedding`)
270 272
271 ### 4) 常用运维命令 273 ### 4) 常用运维命令
272 274
273 ```bash 275 ```bash
274 -# 先重启核心,再拉起可选服务(最常用)  
275 -./restart.sh  
276 -TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip embedding translator reranker 276 +# 一键拉起完整栈(推荐)
  277 +./scripts/service_ctl.sh up all # 或 ./run.sh all
277 278
278 # 查看全量状态 279 # 查看全量状态
279 ./scripts/service_ctl.sh status 280 ./scripts/service_ctl.sh status
@@ -282,17 +283,17 @@ TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip emb @@ -282,17 +283,17 @@ TEI_DEVICE=cuda CNCLIP_DEVICE=cuda ./scripts/service_ctl.sh start tei cnclip emb
282 ./scripts/service_ctl.sh restart embedding 283 ./scripts/service_ctl.sh restart embedding
283 284
284 # 停止全部 285 # 停止全部
285 -./scripts/service_ctl.sh stop 286 +./scripts/service_ctl.sh down all # 或 ./scripts/stop.sh all
286 ``` 287 ```
287 288
288 ### 停止服务 289 ### 停止服务
289 290
290 ```bash 291 ```bash
291 -# 推荐:统一停止  
292 -./scripts/stop.sh 292 +# 推荐:统一停止(示例:全部)
  293 +./scripts/stop.sh all
293 294
294 # 或使用统一控制脚本 295 # 或使用统一控制脚本
295 -./scripts/service_ctl.sh stop 296 +./scripts/service_ctl.sh down all
296 ``` 297 ```
297 298
298 ### 服务端口 299 ### 服务端口
@@ -6,4 +6,4 @@ set -euo pipefail @@ -6,4 +6,4 @@ set -euo pipefail
6 6
7 cd "$(dirname "$0")" 7 cd "$(dirname "$0")"
8 8
9 -./scripts/service_ctl.sh restart 9 +./scripts/service_ctl.sh restart "$@"
@@ -6,4 +6,4 @@ set -euo pipefail @@ -6,4 +6,4 @@ set -euo pipefail
6 6
7 cd "$(dirname "$0")" 7 cd "$(dirname "$0")"
8 8
9 -./scripts/service_ctl.sh start 9 +./scripts/service_ctl.sh up "$@"
scripts/service_ctl.sh
1 #!/bin/bash 1 #!/bin/bash
2 # 2 #
3 # Unified service lifecycle controller for saas-search. 3 # Unified service lifecycle controller for saas-search.
4 -# Supports: start / stop / restart / status 4 +# Supports: up / down / start / stop / restart / status / monitor / monitor-start / monitor-stop / monitor-status
5 # 5 #
6 6
7 set -euo pipefail 7 set -euo pipefail
@@ -16,10 +16,12 @@ mkdir -p &quot;${LOG_DIR}&quot; @@ -16,10 +16,12 @@ mkdir -p &quot;${LOG_DIR}&quot;
16 source "${PROJECT_ROOT}/scripts/lib/load_env.sh" 16 source "${PROJECT_ROOT}/scripts/lib/load_env.sh"
17 17
18 CORE_SERVICES=("backend" "indexer" "frontend") 18 CORE_SERVICES=("backend" "indexer" "frontend")
19 -OPTIONAL_SERVICES=("embedding" "translator" "reranker" "tei" "cnclip") 19 +OPTIONAL_SERVICES=("tei" "cnclip" "embedding" "translator" "reranker")
  20 +FULL_SERVICES=("${OPTIONAL_SERVICES[@]}" "${CORE_SERVICES[@]}")
  21 +STOP_ORDER_SERVICES=("frontend" "indexer" "backend" "reranker" "translator" "embedding" "cnclip" "tei")
20 22
21 all_services() { 23 all_services() {
22 - echo "${CORE_SERVICES[@]} ${OPTIONAL_SERVICES[@]}" 24 + echo "${FULL_SERVICES[@]}"
23 } 25 }
24 26
25 get_port() { 27 get_port() {
@@ -90,24 +92,91 @@ validate_targets() { @@ -90,24 +92,91 @@ validate_targets() {
90 done 92 done
91 } 93 }
92 94
  95 +health_path_for_service() {
  96 + local service="$1"
  97 + case "${service}" in
  98 + backend|indexer|embedding|translator|reranker|tei) echo "/health" ;;
  99 + frontend) echo "/" ;;
  100 + *) echo "" ;;
  101 + esac
  102 +}
  103 +
  104 +monitor_log_file() {
  105 + echo "${LOG_DIR}/service-monitor.log"
  106 +}
  107 +
  108 +monitor_pid_file() {
  109 + echo "${LOG_DIR}/service-monitor.pid"
  110 +}
  111 +
  112 +monitor_targets_file() {
  113 + echo "${LOG_DIR}/service-monitor.targets"
  114 +}
  115 +
  116 +monitor_current_targets() {
  117 + if [ -f "$(monitor_targets_file)" ]; then
  118 + cat "$(monitor_targets_file)" 2>/dev/null || true
  119 + fi
  120 +}
  121 +
  122 +monitor_log_event() {
  123 + local service="$1"
  124 + local level="$2"
  125 + local message="$3"
  126 + local ts line
  127 + ts="$(date '+%F %T')"
  128 + line="[${ts}] [${level}] [${service}] ${message}"
  129 + echo "${line}" | tee -a "$(monitor_log_file)"
  130 +}
  131 +
  132 +require_positive_int() {
  133 + local name="$1"
  134 + local value="$2"
  135 + if ! [[ "${value}" =~ ^[0-9]+$ ]] || [ "${value}" -le 0 ]; then
  136 + echo "[error] invalid ${name}=${value}, must be a positive integer" >&2
  137 + return 1
  138 + fi
  139 +}
  140 +
  141 +service_healthy_now() {
  142 + local service="$1"
  143 + local port
  144 + local path
  145 +
  146 + if [ "${service}" = "tei" ]; then
  147 + port="$(get_port "${service}")"
  148 + is_running_tei_container &&
  149 + curl -sf "http://127.0.0.1:${port}/health" >/dev/null 2>&1
  150 + return
  151 + fi
  152 +
  153 + if [ "${service}" = "cnclip" ]; then
  154 + is_running_by_pid "${service}" || is_running_by_port "${service}"
  155 + return
  156 + fi
  157 +
  158 + port="$(get_port "${service}")"
  159 + path="$(health_path_for_service "${service}")"
  160 + if [ -z "${port}" ] || [ -z "${path}" ]; then
  161 + return 1
  162 + fi
  163 + if ! is_running_by_port "${service}"; then
  164 + return 1
  165 + fi
  166 + curl -sf "http://127.0.0.1:${port}${path}" >/dev/null 2>&1
  167 +}
  168 +
93 wait_for_health() { 169 wait_for_health() {
94 local service="$1" 170 local service="$1"
95 local max_retries="${2:-30}" 171 local max_retries="${2:-30}"
96 local interval_sec="${3:-1}" 172 local interval_sec="${3:-1}"
97 local port 173 local port
98 port="$(get_port "${service}")" 174 port="$(get_port "${service}")"
99 - local path="/health"  
100 -  
101 - case "${service}" in  
102 - backend) path="/health" ;;  
103 - indexer) path="/health" ;;  
104 - frontend) path="/" ;;  
105 - embedding) path="/health" ;;  
106 - translator) path="/health" ;;  
107 - reranker) path="/health" ;;  
108 - tei) path="/health" ;;  
109 - *) return 0 ;;  
110 - esac 175 + local path
  176 + path="$(health_path_for_service "${service}")"
  177 + if [ -z "${path}" ]; then
  178 + return 0
  179 + fi
111 180
112 local i=0 181 local i=0
113 while [ "${i}" -lt "${max_retries}" ]; do 182 while [ "${i}" -lt "${max_retries}" ]; do
@@ -126,7 +195,11 @@ wait_for_stable_health() { @@ -126,7 +195,11 @@ wait_for_stable_health() {
126 local interval_sec="${3:-1}" 195 local interval_sec="${3:-1}"
127 local port 196 local port
128 port="$(get_port "${service}")" 197 port="$(get_port "${service}")"
129 - local path="/health" 198 + local path
  199 + path="$(health_path_for_service "${service}")"
  200 + if [ -z "${path}" ]; then
  201 + return 0
  202 + fi
130 203
131 local i=0 204 local i=0
132 while [ "${i}" -lt "${checks}" ]; do 205 while [ "${i}" -lt "${checks}" ]; do
@@ -142,6 +215,161 @@ wait_for_stable_health() { @@ -142,6 +215,161 @@ wait_for_stable_health() {
142 return 0 215 return 0
143 } 216 }
144 217
  218 +monitor_services() {
  219 + local targets="$1"
  220 + local interval_sec="${MONITOR_INTERVAL_SEC:-10}"
  221 + local fail_threshold="${MONITOR_FAIL_THRESHOLD:-3}"
  222 + local restart_cooldown_sec="${MONITOR_RESTART_COOLDOWN_SEC:-30}"
  223 + local max_restarts_per_hour="${MONITOR_MAX_RESTARTS_PER_HOUR:-6}"
  224 +
  225 + require_positive_int "MONITOR_INTERVAL_SEC" "${interval_sec}"
  226 + require_positive_int "MONITOR_FAIL_THRESHOLD" "${fail_threshold}"
  227 + require_positive_int "MONITOR_RESTART_COOLDOWN_SEC" "${restart_cooldown_sec}"
  228 + require_positive_int "MONITOR_MAX_RESTARTS_PER_HOUR" "${max_restarts_per_hour}"
  229 +
  230 + touch "$(monitor_log_file)"
  231 +
  232 + declare -A fail_streak=()
  233 + declare -A last_restart_epoch=()
  234 + declare -A restart_history=()
  235 +
  236 + monitor_log_event "monitor" "info" "started targets=[${targets}] interval=${interval_sec}s fail_threshold=${fail_threshold} cooldown=${restart_cooldown_sec}s max_restarts_per_hour=${max_restarts_per_hour}"
  237 + trap 'monitor_log_event "monitor" "info" "received stop signal, exiting"; exit 0' INT TERM
  238 +
  239 + while true; do
  240 + local svc
  241 + for svc in ${targets}; do
  242 + if service_healthy_now "${svc}"; then
  243 + if [ "${fail_streak[${svc}]:-0}" -gt 0 ]; then
  244 + monitor_log_event "${svc}" "info" "health recovered after ${fail_streak[${svc}]} consecutive failures"
  245 + fi
  246 + fail_streak["${svc}"]=0
  247 + continue
  248 + fi
  249 +
  250 + fail_streak["${svc}"]=$(( ${fail_streak[${svc}]:-0} + 1 ))
  251 + monitor_log_event "${svc}" "warn" "health check failed (${fail_streak[${svc}]}/${fail_threshold})"
  252 +
  253 + if [ "${fail_streak[${svc}]}" -lt "${fail_threshold}" ]; then
  254 + continue
  255 + fi
  256 +
  257 + local now
  258 + now="$(date +%s)"
  259 + local last
  260 + last="${last_restart_epoch[${svc}]:-0}"
  261 + if [ $((now - last)) -lt "${restart_cooldown_sec}" ]; then
  262 + monitor_log_event "${svc}" "warn" "restart suppressed by cooldown (${restart_cooldown_sec}s)"
  263 + continue
  264 + fi
  265 +
  266 + local t
  267 + local recent_history=""
  268 + local recent_count=0
  269 + for t in ${restart_history[${svc}]:-}; do
  270 + if [ $((now - t)) -lt 3600 ]; then
  271 + recent_history="${recent_history} ${t}"
  272 + recent_count=$((recent_count + 1))
  273 + fi
  274 + done
  275 + restart_history["${svc}"]="${recent_history# }"
  276 +
  277 + if [ "${recent_count}" -ge "${max_restarts_per_hour}" ]; then
  278 + monitor_log_event "${svc}" "error" "restart suppressed by hourly cap (${max_restarts_per_hour}/hour)"
  279 + continue
  280 + fi
  281 +
  282 + monitor_log_event "${svc}" "error" "triggering restart after ${fail_streak[${svc}]} consecutive failures"
  283 + if stop_one "${svc}" && start_one "${svc}"; then
  284 + fail_streak["${svc}"]=0
  285 + last_restart_epoch["${svc}"]="${now}"
  286 + restart_history["${svc}"]="${restart_history[${svc}]:-} ${now}"
  287 + monitor_log_event "${svc}" "info" "restart succeeded"
  288 + else
  289 + last_restart_epoch["${svc}"]="${now}"
  290 + restart_history["${svc}"]="${restart_history[${svc}]:-} ${now}"
  291 + monitor_log_event "${svc}" "error" "restart failed, inspect $(log_file "${svc}")"
  292 + fi
  293 + done
  294 + sleep "${interval_sec}"
  295 + done
  296 +}
  297 +
  298 +is_monitor_daemon_running() {
  299 + local pf
  300 + pf="$(monitor_pid_file)"
  301 + if [ ! -f "${pf}" ]; then
  302 + return 1
  303 + fi
  304 + local pid
  305 + pid="$(cat "${pf}" 2>/dev/null || true)"
  306 + [ -n "${pid}" ] && kill -0 "${pid}" 2>/dev/null
  307 +}
  308 +
  309 +stop_monitor_daemon() {
  310 + local pf
  311 + pf="$(monitor_pid_file)"
  312 + local tf
  313 + tf="$(monitor_targets_file)"
  314 +
  315 + if ! is_monitor_daemon_running; then
  316 + rm -f "${pf}"
  317 + return 0
  318 + fi
  319 +
  320 + local pid
  321 + pid="$(cat "${pf}" 2>/dev/null || true)"
  322 + if [ -n "${pid}" ] && kill -0 "${pid}" 2>/dev/null; then
  323 + echo "[stop] monitor daemon pid=${pid}"
  324 + kill -TERM "${pid}" 2>/dev/null || true
  325 + sleep 1
  326 + if kill -0 "${pid}" 2>/dev/null; then
  327 + kill -KILL "${pid}" 2>/dev/null || true
  328 + fi
  329 + fi
  330 + rm -f "${pf}" "${tf}"
  331 +}
  332 +
  333 +start_monitor_daemon() {
  334 + local targets="$1"
  335 + local pf
  336 + pf="$(monitor_pid_file)"
  337 + local tf
  338 + tf="$(monitor_targets_file)"
  339 +
  340 + local current_targets
  341 + current_targets="$(monitor_current_targets)"
  342 + if is_monitor_daemon_running; then
  343 + if [ "${current_targets}" = "${targets}" ]; then
  344 + echo "[skip] monitor daemon already running (targets=[${targets}])"
  345 + return 0
  346 + fi
  347 + echo "[info] monitor daemon targets changed: [${current_targets}] -> [${targets}]"
  348 + stop_monitor_daemon
  349 + fi
  350 +
  351 + echo "${targets}" > "${tf}"
  352 + nohup "${PROJECT_ROOT}/scripts/service_ctl.sh" monitor ${targets} >/dev/null 2>&1 &
  353 + local pid=$!
  354 + echo "${pid}" > "${pf}"
  355 + echo "[ok] monitor daemon started (pid=${pid}, targets=[${targets}], log=$(monitor_log_file))"
  356 +}
  357 +
  358 +monitor_daemon_status() {
  359 + local running="no"
  360 + local pid="-"
  361 + local targets="-"
  362 +
  363 + if is_monitor_daemon_running; then
  364 + running="yes"
  365 + pid="$(cat "$(monitor_pid_file)" 2>/dev/null || echo "-")"
  366 + targets="$(monitor_current_targets)"
  367 + [ -z "${targets}" ] && targets="-"
  368 + fi
  369 +
  370 + printf "%-14s running=%-3s pid=%-8s targets=%s\n" "service-monitor" "${running}" "${pid}" "${targets}"
  371 +}
  372 +
145 is_running_by_pid() { 373 is_running_by_pid() {
146 local service="$1" 374 local service="$1"
147 local pf 375 local pf
@@ -379,6 +607,88 @@ status_one() { @@ -379,6 +607,88 @@ status_one() {
379 printf "%-10s running=%-3s port=%-6s pid=%s\n" "${service}" "${running}" "${port:--}" "${pid_info}" 607 printf "%-10s running=%-3s port=%-6s pid=%s\n" "${service}" "${running}" "${port:--}" "${pid_info}"
380 } 608 }
381 609
  610 +service_is_running() {
  611 + local service="$1"
  612 + case "${service}" in
  613 + tei)
  614 + is_running_tei_container
  615 + ;;
  616 + cnclip)
  617 + is_running_by_pid "${service}" || is_running_by_port "${service}"
  618 + ;;
  619 + *)
  620 + is_running_by_pid "${service}" || is_running_by_port "${service}"
  621 + ;;
  622 + esac
  623 +}
  624 +
  625 +expand_target_token() {
  626 + local token="$1"
  627 + case "${token}" in
  628 + all)
  629 + echo "$(all_services)"
  630 + ;;
  631 + *)
  632 + echo "${token}"
  633 + ;;
  634 + esac
  635 +}
  636 +
  637 +normalize_targets() {
  638 + local raw="$1"
  639 + declare -A seen=()
  640 + local out=""
  641 + local token svc
  642 + for token in ${raw}; do
  643 + for svc in $(expand_target_token "${token}"); do
  644 + if [ -z "${seen[${svc}]:-}" ]; then
  645 + seen["${svc}"]=1
  646 + out="${out} ${svc}"
  647 + fi
  648 + done
  649 + done
  650 + echo "${out# }"
  651 +}
  652 +
  653 +sort_targets_by_order() {
  654 + local targets="$1"
  655 + shift || true
  656 + local out=""
  657 + local svc
  658 + declare -A want=()
  659 + for svc in ${targets}; do
  660 + want["${svc}"]=1
  661 + done
  662 +
  663 + for svc in "$@"; do
  664 + if [ "${want[${svc}]:-0}" = "1" ]; then
  665 + out="${out} ${svc}"
  666 + unset "want[${svc}]"
  667 + fi
  668 + done
  669 +
  670 + for svc in ${targets}; do
  671 + if [ "${want[${svc}]:-0}" = "1" ]; then
  672 + out="${out} ${svc}"
  673 + unset "want[${svc}]"
  674 + fi
  675 + done
  676 + echo "${out# }"
  677 +}
  678 +
  679 +apply_target_order() {
  680 + local action="$1"
  681 + local targets="$2"
  682 + case "${action}" in
  683 + stop|down)
  684 + sort_targets_by_order "${targets}" "${STOP_ORDER_SERVICES[@]}"
  685 + ;;
  686 + *)
  687 + sort_targets_by_order "${targets}" "${FULL_SERVICES[@]}"
  688 + ;;
  689 + esac
  690 +}
  691 +
382 resolve_targets() { 692 resolve_targets() {
383 local scope="$1" 693 local scope="$1"
384 shift || true 694 shift || true
@@ -389,16 +699,12 @@ resolve_targets() { @@ -389,16 +699,12 @@ resolve_targets() {
389 fi 699 fi
390 700
391 case "${scope}" in 701 case "${scope}" in
392 - start)  
393 - echo "${CORE_SERVICES[@]}" 702 + monitor-stop|monitor-status)
  703 + echo ""
394 ;; 704 ;;
395 - stop|status) 705 + status)
396 echo "$(all_services)" 706 echo "$(all_services)"
397 ;; 707 ;;
398 - restart)  
399 - # Restart with no explicit services uses the same explicit start target order.  
400 - echo "$(resolve_targets start)"  
401 - ;;  
402 *) 708 *)
403 echo "" 709 echo ""
404 ;; 710 ;;
@@ -408,24 +714,39 @@ resolve_targets() { @@ -408,24 +714,39 @@ resolve_targets() {
408 usage() { 714 usage() {
409 cat <<'EOF' 715 cat <<'EOF'
410 Usage: 716 Usage:
  717 + ./scripts/service_ctl.sh up [all|service...]
  718 + ./scripts/service_ctl.sh down [service...]
411 ./scripts/service_ctl.sh start [service...] 719 ./scripts/service_ctl.sh start [service...]
412 ./scripts/service_ctl.sh stop [service...] 720 ./scripts/service_ctl.sh stop [service...]
413 ./scripts/service_ctl.sh restart [service...] 721 ./scripts/service_ctl.sh restart [service...]
414 ./scripts/service_ctl.sh status [service...] 722 ./scripts/service_ctl.sh status [service...]
  723 + ./scripts/service_ctl.sh monitor [service...]
  724 + ./scripts/service_ctl.sh monitor-start [service...]
  725 + ./scripts/service_ctl.sh monitor-stop
  726 + ./scripts/service_ctl.sh monitor-status
415 727
416 Default target set (when no service provided): 728 Default target set (when no service provided):
417 - start -> backend indexer frontend  
418 - stop -> all known services  
419 - restart -> stop all known services, then start with start targets  
420 status -> all known services 729 status -> all known services
  730 + up/start/stop/restart/down/monitor/monitor-start -> must specify services or all
  731 +
  732 +Special targets:
  733 + all -> all known services
421 734
422 -Optional service startup:  
423 - ./scripts/service_ctl.sh start tei cnclip embedding translator reranker  
424 - TEI_DEVICE=cuda|cpu ./scripts/service_ctl.sh start tei  
425 - CNCLIP_DEVICE=cuda|cpu ./scripts/service_ctl.sh start cnclip 735 +Examples:
  736 + ./scripts/service_ctl.sh up all
  737 + ./scripts/service_ctl.sh up backend indexer frontend
  738 + ./scripts/service_ctl.sh restart
  739 + ./scripts/service_ctl.sh monitor-start all
  740 + ./scripts/service_ctl.sh monitor-status
426 741
427 Log retention: 742 Log retention:
428 LOG_RETENTION_DAYS=30 ./scripts/service_ctl.sh start 743 LOG_RETENTION_DAYS=30 ./scripts/service_ctl.sh start
  744 +
  745 +Monitor tuning:
  746 + MONITOR_INTERVAL_SEC=10
  747 + MONITOR_FAIL_THRESHOLD=3
  748 + MONITOR_RESTART_COOLDOWN_SEC=30
  749 + MONITOR_MAX_RESTARTS_PER_HOUR=6
429 EOF 750 EOF
430 } 751 }
431 752
@@ -439,43 +760,95 @@ main() { @@ -439,43 +760,95 @@ main() {
439 shift || true 760 shift || true
440 761
441 load_env_file "${PROJECT_ROOT}/.env" 762 load_env_file "${PROJECT_ROOT}/.env"
442 - local stop_targets=""  
443 - local targets  
444 - # For restart without explicit services, stop everything first, then start  
445 - # with the default start target order.  
446 - if [ "${action}" = "restart" ] && [ "$#" -eq 0 ]; then  
447 - stop_targets="$(resolve_targets stop)"  
448 - fi  
449 - targets="$(resolve_targets "${action}" "$@")"  
450 - if [ -z "${targets}" ]; then  
451 - usage  
452 - exit 1  
453 - fi  
454 - validate_targets "${targets}" 763 + local targets=""
  764 + local monitor_was_running=0
  765 + local monitor_prev_targets=""
  766 +
  767 + case "${action}" in
  768 + monitor-stop|monitor-status)
  769 + ;;
  770 + *)
  771 + targets="$(resolve_targets "${action}" "$@")"
  772 + if [ -z "${targets}" ]; then
  773 + usage
  774 + exit 1
  775 + fi
  776 + targets="$(normalize_targets "${targets}")"
  777 + targets="$(apply_target_order "${action}" "${targets}")"
  778 + if [ -z "${targets}" ]; then
  779 + echo "[error] empty targets after expansion" >&2
  780 + exit 1
  781 + fi
  782 + validate_targets "${targets}"
  783 + ;;
  784 + esac
455 785
456 case "${action}" in 786 case "${action}" in
  787 + up)
  788 + for svc in ${targets}; do
  789 + start_one "${svc}"
  790 + done
  791 + start_monitor_daemon "${targets}"
  792 + ;;
  793 + down)
  794 + stop_monitor_daemon
  795 + for svc in ${targets}; do
  796 + stop_one "${svc}"
  797 + done
  798 + ;;
457 start) 799 start)
458 for svc in ${targets}; do 800 for svc in ${targets}; do
459 start_one "${svc}" 801 start_one "${svc}"
460 done 802 done
461 ;; 803 ;;
462 stop) 804 stop)
  805 + if is_monitor_daemon_running; then
  806 + echo "[info] stopping monitor daemon before manual stop"
  807 + stop_monitor_daemon
  808 + fi
463 for svc in ${targets}; do 809 for svc in ${targets}; do
464 stop_one "${svc}" 810 stop_one "${svc}"
465 done 811 done
466 ;; 812 ;;
467 restart) 813 restart)
468 - for svc in ${stop_targets:-${targets}}; do 814 + local restart_stop_targets
  815 + restart_stop_targets="$(apply_target_order stop "${targets}")"
  816 + if is_monitor_daemon_running; then
  817 + monitor_was_running=1
  818 + monitor_prev_targets="$(monitor_current_targets)"
  819 + [ -z "${monitor_prev_targets}" ] && monitor_prev_targets="${targets}"
  820 + stop_monitor_daemon
  821 + fi
  822 + for svc in ${restart_stop_targets}; do
469 stop_one "${svc}" 823 stop_one "${svc}"
470 done 824 done
471 for svc in ${targets}; do 825 for svc in ${targets}; do
472 start_one "${svc}" 826 start_one "${svc}"
473 done 827 done
  828 + if [ "${monitor_was_running}" -eq 1 ]; then
  829 + monitor_prev_targets="$(normalize_targets "${monitor_prev_targets}")"
  830 + monitor_prev_targets="$(apply_target_order monitor "${monitor_prev_targets}")"
  831 + [ -z "${monitor_prev_targets}" ] && monitor_prev_targets="${targets}"
  832 + start_monitor_daemon "${monitor_prev_targets}"
  833 + fi
474 ;; 834 ;;
475 status) 835 status)
476 for svc in ${targets}; do 836 for svc in ${targets}; do
477 status_one "${svc}" 837 status_one "${svc}"
478 done 838 done
  839 + monitor_daemon_status
  840 + ;;
  841 + monitor)
  842 + monitor_services "${targets}"
  843 + ;;
  844 + monitor-start)
  845 + start_monitor_daemon "${targets}"
  846 + ;;
  847 + monitor-stop)
  848 + stop_monitor_daemon
  849 + ;;
  850 + monitor-status)
  851 + monitor_daemon_status
479 ;; 852 ;;
480 *) 853 *)
481 usage 854 usage
@@ -7,16 +7,4 @@ set -euo pipefail @@ -7,16 +7,4 @@ set -euo pipefail
7 7
8 cd "$(dirname "$0")/.." 8 cd "$(dirname "$0")/.."
9 9
10 -echo "========================================"  
11 -echo "saas-search 服务启动"  
12 -echo "========================================"  
13 -echo "默认启动核心服务: backend/indexer/frontend"  
14 -echo "可选服务请显式指定:"  
15 -echo " ./scripts/service_ctl.sh start tei cnclip embedding translator reranker"  
16 -echo  
17 -  
18 -./scripts/service_ctl.sh start  
19 -  
20 -echo  
21 -echo "当前服务状态:"  
22 -./scripts/service_ctl.sh status backend indexer frontend tei cnclip embedding translator reranker 10 +./scripts/service_ctl.sh up "$@"
@@ -7,10 +7,4 @@ set -euo pipefail @@ -7,10 +7,4 @@ set -euo pipefail
7 7
8 cd "$(dirname "$0")/.." 8 cd "$(dirname "$0")/.."
9 9
10 -echo "========================================"  
11 -echo "Stopping saas-search services"  
12 -echo "========================================"  
13 -  
14 -./scripts/service_ctl.sh stop  
15 -  
16 -echo "Done." 10 +./scripts/service_ctl.sh down "$@"