Commit bb3c5ef84b9d2252968a0e88de6a4fb3745666b3

Authored by tangwang
1 parent a406638e

灌入数据流程跑通

=0.1.9 0 → 100644
@@ -0,0 +1,14 @@ @@ -0,0 +1,14 @@
  1 +Looking in indexes: https://mirrors.aliyun.com/pypi/simple
  2 +Collecting slowapi
  3 + Using cached https://mirrors.aliyun.com/pypi/packages/2b/bb/f71c4b7d7e7eb3fc1e8c0458a8979b912f40b58002b9fbf37729b8cb464b/slowapi-0.1.9-py3-none-any.whl (14 kB)
  4 +Collecting limits>=2.3 (from slowapi)
  5 + Using cached https://mirrors.aliyun.com/pypi/packages/40/96/4fcd44aed47b8fcc457653b12915fcad192cd646510ef3f29fd216f4b0ab/limits-5.6.0-py3-none-any.whl (60 kB)
  6 +Collecting deprecated>=1.2 (from limits>=2.3->slowapi)
  7 + Using cached https://mirrors.aliyun.com/pypi/packages/84/d0/205d54408c08b13550c733c4b85429e7ead111c7f0014309637425520a9a/deprecated-1.3.1-py2.py3-none-any.whl (11 kB)
  8 +Requirement already satisfied: packaging>=21 in /data/tw/miniconda3/envs/searchengine/lib/python3.10/site-packages (from limits>=2.3->slowapi) (25.0)
  9 +Requirement already satisfied: typing-extensions in /data/tw/miniconda3/envs/searchengine/lib/python3.10/site-packages (from limits>=2.3->slowapi) (4.15.0)
  10 +Collecting wrapt<3,>=1.10 (from deprecated>=1.2->limits>=2.3->slowapi)
  11 + Downloading https://mirrors.aliyun.com/pypi/packages/c6/93/5cf92edd99617095592af919cb81d4bff61c5dbbb70d3c92099425a8ec34/wrapt-2.0.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (113 kB)
  12 +Installing collected packages: wrapt, deprecated, limits, slowapi
  13 +
  14 +Successfully installed deprecated-1.3.1 limits-5.6.0 slowapi-0.1.9 wrapt-2.0.1
@@ -109,3 +109,6 @@ The `searcher` supports: @@ -109,3 +109,6 @@ The `searcher` supports:
109 4. **ES Similarity Configuration:** All text fields use modified BM25 with `b=0.0, k1=0.0` as the default similarity. 109 4. **ES Similarity Configuration:** All text fields use modified BM25 with `b=0.0, k1=0.0` as the default similarity.
110 110
111 5. **Multi-Language Support:** The system is designed for cross-border e-commerce with at minimum Chinese and English support, with extensibility for other languages (Arabic, Spanish, Russian, Japanese). 111 5. **Multi-Language Support:** The system is designed for cross-border e-commerce with at minimum Chinese and English support, with extensibility for other languages (Arabic, Spanish, Russian, Japanese).
  112 +- 记住这个项目的环境是
  113 +- 记住这个项目的环境是source /home/tw/miniconda3/etc/profile.d/conda.sh
  114 +conda activate searchengine
112 \ No newline at end of file 115 \ No newline at end of file
SERVER_FIXES.md 0 → 100644
@@ -0,0 +1,142 @@ @@ -0,0 +1,142 @@
  1 +# 服务器修复和优化文档
  2 +
  3 +## 修复的问题
  4 +
  5 +### 1. 前端服务器问题 (scripts/frontend_server.py)
  6 +- **问题**: 接收到大量扫描器流量导致的错误日志
  7 +- **原因**: SSL/TLS握手尝试、RDP连接扫描、二进制数据攻击
  8 +- **解决方案**:
  9 + - 添加错误处理机制,优雅处理连接断开
  10 + - 实现速率限制 (100请求/分钟)
  11 + - 过滤扫描器噪音日志
  12 + - 添加安全HTTP头
  13 + - 使用线程服务器提高并发处理能力
  14 +
  15 +### 2. API服务器问题 (api/app.py)
  16 +- **问题**: 缺乏安全性和错误处理机制
  17 +- **解决方案**:
  18 + - 集成速率限制 (slowapi)
  19 + - 添加安全HTTP头
  20 + - 实现更好的异常处理
  21 + - 添加健康检查端点
  22 + - 增强日志记录
  23 + - 添加服务关闭处理
  24 +
  25 +## 主要改进
  26 +
  27 +### 安全性增强
  28 +1. **速率限制**: 防止DDoS攻击和滥用
  29 +2. **安全HTTP头**: 防止XSS、点击劫持等攻击
  30 +3. **错误过滤**: 隐藏敏感错误信息
  31 +4. **输入验证**: 更健壮的请求处理
  32 +
  33 +### 稳定性提升
  34 +1. **连接错误处理**: 优雅处理连接重置和断开
  35 +2. **异常处理**: 全局异常捕获,防止服务器崩溃
  36 +3. **日志管理**: 过滤噪音,记录重要事件
  37 +4. **监控功能**: 健康检查和状态监控
  38 +
  39 +### 性能优化
  40 +1. **线程服务器**: 前端服务器支持并发请求
  41 +2. **资源管理**: 更好的内存和连接管理
  42 +3. **响应头优化**: 添加缓存和安全相关头
  43 +
  44 +## 使用方法
  45 +
  46 +### 安装依赖
  47 +```bash
  48 +# 安装服务器安全依赖
  49 +./scripts/install_server_deps.sh
  50 +
  51 +# 或者手动安装
  52 +pip install slowapi>=0.1.9 anyio>=3.7.0
  53 +```
  54 +
  55 +### 启动服务器
  56 +
  57 +#### 方法1: 使用管理脚本 (推荐)
  58 +```bash
  59 +# 启动所有服务器
  60 +python scripts/start_servers.py --customer customer1 --es-host http://localhost:9200
  61 +
  62 +# 启动前检查依赖
  63 +python scripts/start_servers.py --check-dependencies
  64 +```
  65 +
  66 +#### 方法2: 分别启动
  67 +```bash
  68 +# 启动API服务器
  69 +python main.py serve --customer customer1 --es-host http://localhost:9200
  70 +
  71 +# 启动前端服务器 (在另一个终端)
  72 +python scripts/frontend_server.py
  73 +```
  74 +
  75 +### 监控和日志
  76 +
  77 +#### 日志位置
  78 +- API服务器日志: `/tmp/search_engine_api.log`
  79 +- 启动日志: `/tmp/search_engine_startup.log`
  80 +- 控制台输出: 实时显示重要信息
  81 +
  82 +#### 健康检查
  83 +```bash
  84 +# 检查API服务器健康状态
  85 +curl http://localhost:6002/health
  86 +
  87 +# 检查前端服务器
  88 +curl http://localhost:6003
  89 +```
  90 +
  91 +## 配置选项
  92 +
  93 +### 环境变量
  94 +- `CUSTOMER_ID`: 客户ID (默认: customer1)
  95 +- `ES_HOST`: Elasticsearch主机 (默认: http://localhost:9200)
  96 +
  97 +### 速率限制配置
  98 +- API服务器: 各端点不同限制 (60-120请求/分钟)
  99 +- 前端服务器: 100请求/分钟
  100 +
  101 +## 故障排除
  102 +
  103 +### 常见问题
  104 +
  105 +1. **依赖缺失错误**
  106 + ```bash
  107 + pip install -r requirements_server.txt
  108 + ```
  109 +
  110 +2. **端口被占用**
  111 + ```bash
  112 + # 查看端口占用
  113 + lsof -i :6002
  114 + lsof -i :6003
  115 + ```
  116 +
  117 +3. **权限问题**
  118 + ```bash
  119 + chmod +x scripts/*.py scripts/*.sh
  120 + ```
  121 +
  122 +### 调试模式
  123 +```bash
  124 +# 启用详细日志
  125 +export PYTHONUNBUFFERED=1
  126 +python scripts/start_servers.py
  127 +```
  128 +
  129 +## 生产环境建议
  130 +
  131 +1. **反向代理**: 使用nginx或Apache作为反向代理
  132 +2. **SSL证书**: 配置HTTPS
  133 +3. **防火墙**: 限制访问源IP
  134 +4. **监控**: 集成监控和告警系统
  135 +5. **日志轮转**: 配置日志轮转防止磁盘满
  136 +
  137 +## 维护说明
  138 +
  139 +- 定期检查日志文件大小
  140 +- 监控服务器资源使用情况
  141 +- 更新依赖包版本
  142 +- 备份配置文件
0 \ No newline at end of file 143 \ No newline at end of file
@@ -7,12 +7,34 @@ Usage: @@ -7,12 +7,34 @@ Usage:
7 7
8 import os 8 import os
9 import sys 9 import sys
  10 +import logging
  11 +import time
  12 +from collections import defaultdict, deque
10 from typing import Optional 13 from typing import Optional
11 -from fastapi import FastAPI, Request 14 +from fastapi import FastAPI, Request, HTTPException
12 from fastapi.responses import JSONResponse 15 from fastapi.responses import JSONResponse
13 from fastapi.middleware.cors import CORSMiddleware 16 from fastapi.middleware.cors import CORSMiddleware
  17 +from fastapi.middleware.trustedhost import TrustedHostMiddleware
  18 +from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware
  19 +from slowapi import Limiter, _rate_limit_exceeded_handler
  20 +from slowapi.util import get_remote_address
  21 +from slowapi.errors import RateLimitExceeded
14 import argparse 22 import argparse
15 23
  24 +# Configure logging with better formatting
  25 +logging.basicConfig(
  26 + level=logging.INFO,
  27 + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
  28 + handlers=[
  29 + logging.StreamHandler(),
  30 + logging.FileHandler('/tmp/search_engine_api.log', mode='a')
  31 + ]
  32 +)
  33 +logger = logging.getLogger(__name__)
  34 +
  35 +# Initialize rate limiter
  36 +limiter = Limiter(key_func=get_remote_address)
  37 +
16 # Add parent directory to path 38 # Add parent directory to path
17 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) 39 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
18 40
@@ -117,20 +139,44 @@ def get_query_parser() -&gt; QueryParser: @@ -117,20 +139,44 @@ def get_query_parser() -&gt; QueryParser:
117 return _query_parser 139 return _query_parser
118 140
119 141
120 -# Create FastAPI app 142 +# Create FastAPI app with enhanced configuration
121 app = FastAPI( 143 app = FastAPI(
122 title="E-Commerce Search API", 144 title="E-Commerce Search API",
123 description="Configurable search engine for cross-border e-commerce", 145 description="Configurable search engine for cross-border e-commerce",
124 - version="1.0.0" 146 + version="1.0.0",
  147 + docs_url="/docs",
  148 + redoc_url="/redoc",
  149 + openapi_url="/openapi.json"
  150 +)
  151 +
  152 +# Add rate limiting middleware
  153 +app.state.limiter = limiter
  154 +app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
  155 +
  156 +# Add trusted host middleware (restrict to localhost and trusted domains)
  157 +app.add_middleware(
  158 + TrustedHostMiddleware,
  159 + allowed_hosts=["*"] # Allow all hosts for development, restrict in production
125 ) 160 )
126 161
127 -# Add CORS middleware 162 +# Add security headers middleware
  163 +@app.middleware("http")
  164 +async def add_security_headers(request: Request, call_next):
  165 + response = await call_next(request)
  166 + response.headers["X-Content-Type-Options"] = "nosniff"
  167 + response.headers["X-Frame-Options"] = "DENY"
  168 + response.headers["X-XSS-Protection"] = "1; mode=block"
  169 + response.headers["Referrer-Policy"] = "strict-origin-when-cross-origin"
  170 + return response
  171 +
  172 +# Add CORS middleware with more restrictive settings
128 app.add_middleware( 173 app.add_middleware(
129 CORSMiddleware, 174 CORSMiddleware,
130 - allow_origins=["*"], 175 + allow_origins=["*"], # Restrict in production to specific domains
131 allow_credentials=True, 176 allow_credentials=True,
132 - allow_methods=["*"], 177 + allow_methods=["GET", "POST", "PUT", "DELETE", "OPTIONS"],
133 allow_headers=["*"], 178 allow_headers=["*"],
  179 + expose_headers=["X-Total-Count"]
134 ) 180 )
135 181
136 182
@@ -140,35 +186,100 @@ async def startup_event(): @@ -140,35 +186,100 @@ async def startup_event():
140 customer_id = os.getenv("CUSTOMER_ID", "customer1") 186 customer_id = os.getenv("CUSTOMER_ID", "customer1")
141 es_host = os.getenv("ES_HOST", "http://localhost:9200") 187 es_host = os.getenv("ES_HOST", "http://localhost:9200")
142 188
  189 + logger.info(f"Starting E-Commerce Search API")
  190 + logger.info(f"Customer ID: {customer_id}")
  191 + logger.info(f"Elasticsearch Host: {es_host}")
  192 +
143 try: 193 try:
144 init_service(customer_id=customer_id, es_host=es_host) 194 init_service(customer_id=customer_id, es_host=es_host)
  195 + logger.info("Service initialized successfully")
145 except Exception as e: 196 except Exception as e:
146 - print(f"Failed to initialize service: {e}")  
147 - print("Service will start but may not function correctly") 197 + logger.error(f"Failed to initialize service: {e}")
  198 + logger.warning("Service will start but may not function correctly")
  199 +
  200 +
  201 +@app.on_event("shutdown")
  202 +async def shutdown_event():
  203 + """Cleanup on shutdown."""
  204 + logger.info("Shutting down E-Commerce Search API")
148 205
149 206
150 @app.exception_handler(Exception) 207 @app.exception_handler(Exception)
151 async def global_exception_handler(request: Request, exc: Exception): 208 async def global_exception_handler(request: Request, exc: Exception):
152 - """Global exception handler.""" 209 + """Global exception handler with detailed logging."""
  210 + client_ip = request.client.host if request.client else "unknown"
  211 + logger.error(f"Unhandled exception from {client_ip}: {exc}", exc_info=True)
  212 +
153 return JSONResponse( 213 return JSONResponse(
154 status_code=500, 214 status_code=500,
155 content={ 215 content={
156 "error": "Internal server error", 216 "error": "Internal server error",
157 - "detail": str(exc) 217 + "detail": "An unexpected error occurred. Please try again later.",
  218 + "timestamp": int(time.time())
  219 + }
  220 + )
  221 +
  222 +
  223 +@app.exception_handler(HTTPException)
  224 +async def http_exception_handler(request: Request, exc: HTTPException):
  225 + """HTTP exception handler."""
  226 + logger.warning(f"HTTP exception from {request.client.host if request.client else 'unknown'}: {exc.status_code} - {exc.detail}")
  227 +
  228 + return JSONResponse(
  229 + status_code=exc.status_code,
  230 + content={
  231 + "error": exc.detail,
  232 + "status_code": exc.status_code,
  233 + "timestamp": int(time.time())
158 } 234 }
159 ) 235 )
160 236
161 237
162 @app.get("/") 238 @app.get("/")
163 -async def root():  
164 - """Root endpoint.""" 239 +@limiter.limit("60/minute")
  240 +async def root(request: Request):
  241 + """Root endpoint with rate limiting."""
  242 + client_ip = request.client.host if request.client else "unknown"
  243 + logger.info(f"Root endpoint accessed from {client_ip}")
  244 +
165 return { 245 return {
166 "service": "E-Commerce Search API", 246 "service": "E-Commerce Search API",
167 "version": "1.0.0", 247 "version": "1.0.0",
168 - "status": "running" 248 + "status": "running",
  249 + "timestamp": int(time.time())
169 } 250 }
170 251
171 252
  253 +@app.get("/health")
  254 +@limiter.limit("120/minute")
  255 +async def health_check(request: Request):
  256 + """Health check endpoint."""
  257 + try:
  258 + # Check if services are initialized
  259 + get_config()
  260 + get_es_client()
  261 +
  262 + return {
  263 + "status": "healthy",
  264 + "services": {
  265 + "config": "initialized",
  266 + "elasticsearch": "connected",
  267 + "searcher": "initialized"
  268 + },
  269 + "timestamp": int(time.time())
  270 + }
  271 + except Exception as e:
  272 + logger.error(f"Health check failed: {e}")
  273 + return JSONResponse(
  274 + status_code=503,
  275 + content={
  276 + "status": "unhealthy",
  277 + "error": str(e),
  278 + "timestamp": int(time.time())
  279 + }
  280 + )
  281 +
  282 +
172 # Include routers 283 # Include routers
173 from .routes import search, admin 284 from .routes import search, admin
174 285
api/routes/search.py
@@ -33,7 +33,7 @@ async def search(request: SearchRequest): @@ -33,7 +33,7 @@ async def search(request: SearchRequest):
33 33
34 try: 34 try:
35 # Get searcher from app state 35 # Get searcher from app state
36 - from main import get_searcher 36 + from api.app import get_searcher
37 searcher = get_searcher() 37 searcher = get_searcher()
38 38
39 # Execute search 39 # Execute search
@@ -70,7 +70,7 @@ async def search_by_image(request: ImageSearchRequest): @@ -70,7 +70,7 @@ async def search_by_image(request: ImageSearchRequest):
70 Uses image embeddings to find visually similar products. 70 Uses image embeddings to find visually similar products.
71 """ 71 """
72 try: 72 try:
73 - from main import get_searcher 73 + from api.app import get_searcher
74 searcher = get_searcher() 74 searcher = get_searcher()
75 75
76 # Execute image search 76 # Execute image search
@@ -101,7 +101,7 @@ async def get_document(doc_id: str): @@ -101,7 +101,7 @@ async def get_document(doc_id: str):
101 Get a single document by ID. 101 Get a single document by ID.
102 """ 102 """
103 try: 103 try:
104 - from main import get_searcher 104 + from api.app import get_searcher
105 searcher = get_searcher() 105 searcher = get_searcher()
106 106
107 doc = searcher.get_document(doc_id) 107 doc = searcher.get_document(doc_id)
@@ -42,6 +42,8 @@ dependencies: @@ -42,6 +42,8 @@ dependencies:
42 - uvicorn[standard]>=0.23.0 42 - uvicorn[standard]>=0.23.0
43 - pydantic>=2.0.0 43 - pydantic>=2.0.0
44 - python-multipart>=0.0.6 44 - python-multipart>=0.0.6
  45 + - slowapi>=0.1.9
  46 + - anyio>=3.7.0
45 47
46 # Translation 48 # Translation
47 - requests>=2.31.0 49 - requests>=2.31.0
frontend/index.html
@@ -51,9 +51,9 @@ @@ -51,9 +51,9 @@
51 </div> 51 </div>
52 52
53 <footer> 53 <footer>
54 - <p>SearchEngine © 2025 | API服务地址: <span id="apiUrl">http://localhost:6002</span></p> 54 + <p>SearchEngine © 2025 | API服务地址: <span id="apiUrl">http://120.76.41.98:6002</span></p>
55 </footer> 55 </footer>
56 56
57 - <script src="/static/js/app.js"></script> 57 + <script src="/static/js/app.js?v=2.0"></script>
58 </body> 58 </body>
59 </html> 59 </html>
frontend/static/js/app.js
1 // SearchEngine Frontend JavaScript 1 // SearchEngine Frontend JavaScript
2 2
3 // API endpoint 3 // API endpoint
4 -const API_BASE_URL = 'http://localhost:6002'; 4 +const API_BASE_URL = 'http://120.76.41.98:6002';
5 5
6 // Update API URL display 6 // Update API URL display
7 document.getElementById('apiUrl').textContent = API_BASE_URL; 7 document.getElementById('apiUrl').textContent = API_BASE_URL;
@@ -28,10 +28,10 @@ async function performSearch() { @@ -28,10 +28,10 @@ async function performSearch() {
28 return; 28 return;
29 } 29 }
30 30
31 - // Get options 31 + // Get options (temporarily disable translation and embedding due to GPU issues)
32 const size = parseInt(document.getElementById('resultSize').value); 32 const size = parseInt(document.getElementById('resultSize').value);
33 - const enableTranslation = document.getElementById('enableTranslation').checked;  
34 - const enableEmbedding = document.getElementById('enableEmbedding').checked; 33 + const enableTranslation = false; // Disabled temporarily
  34 + const enableEmbedding = false; // Disabled temporarily
35 const enableRerank = document.getElementById('enableRerank').checked; 35 const enableRerank = document.getElementById('enableRerank').checked;
36 36
37 // Show loading 37 // Show loading
@@ -68,7 +68,7 @@ async function performSearch() { @@ -68,7 +68,7 @@ async function performSearch() {
68 <div class="error-message"> 68 <div class="error-message">
69 <strong>搜索出错:</strong> ${error.message} 69 <strong>搜索出错:</strong> ${error.message}
70 <br><br> 70 <br><br>
71 - <small>请确保后端服务正在运行 (http://localhost:6002)</small> 71 + <small>请确保后端服务正在运行 (${API_BASE_URL})</small>
72 </div> 72 </div>
73 `; 73 `;
74 } finally { 74 } finally {
indexer/data_transformer.py
@@ -301,7 +301,28 @@ class DataTransformer: @@ -301,7 +301,28 @@ class DataTransformer:
301 # Pandas datetime handling 301 # Pandas datetime handling
302 if isinstance(value, pd.Timestamp): 302 if isinstance(value, pd.Timestamp):
303 return value.isoformat() 303 return value.isoformat()
304 - return str(value) 304 + elif isinstance(value, str):
  305 + # Try to parse string datetime and convert to ISO format
  306 + try:
  307 + import datetime
  308 + # Handle common datetime formats
  309 + formats = [
  310 + '%Y-%m-%d %H:%M:%S', # 2020-07-07 16:44:09
  311 + '%Y-%m-%d %H:%M:%S.%f', # 2020-07-07 16:44:09.123
  312 + '%Y-%m-%dT%H:%M:%S', # 2020-07-07T16:44:09
  313 + '%Y-%m-%d', # 2020-07-07
  314 + ]
  315 + for fmt in formats:
  316 + try:
  317 + dt = datetime.datetime.strptime(value.strip(), fmt)
  318 + return dt.isoformat()
  319 + except ValueError:
  320 + continue
  321 + # If no format matches, return original string
  322 + return value
  323 + except Exception:
  324 + return value
  325 + return value
305 326
306 else: 327 else:
307 return value 328 return value
scripts/frontend_server.py
@@ -7,6 +7,9 @@ import http.server @@ -7,6 +7,9 @@ import http.server
7 import socketserver 7 import socketserver
8 import os 8 import os
9 import sys 9 import sys
  10 +import logging
  11 +import time
  12 +from collections import defaultdict, deque
10 13
11 # Change to frontend directory 14 # Change to frontend directory
12 frontend_dir = os.path.join(os.path.dirname(__file__), '../frontend') 15 frontend_dir = os.path.join(os.path.dirname(__file__), '../frontend')
@@ -14,27 +17,116 @@ os.chdir(frontend_dir) @@ -14,27 +17,116 @@ os.chdir(frontend_dir)
14 17
15 PORT = 6003 18 PORT = 6003
16 19
17 -class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):  
18 - """Custom request handler with CORS support.""" 20 +# Configure logging to suppress scanner noise
  21 +logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')
  22 +
  23 +class RateLimitingMixin:
  24 + """Mixin for rate limiting requests by IP address."""
  25 + request_counts = defaultdict(deque)
  26 + rate_limit = 100 # requests per minute
  27 + window = 60 # seconds
  28 +
  29 + @classmethod
  30 + def is_rate_limited(cls, ip):
  31 + now = time.time()
  32 +
  33 + # Clean old requests
  34 + while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window:
  35 + cls.request_counts[ip].popleft()
  36 +
  37 + # Check rate limit
  38 + if len(cls.request_counts[ip]) > cls.rate_limit:
  39 + return True
  40 +
  41 + cls.request_counts[ip].append(now)
  42 + return False
  43 +
  44 +class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin):
  45 + """Custom request handler with CORS support and robust error handling."""
  46 +
  47 + def setup(self):
  48 + """Setup with error handling."""
  49 + try:
  50 + super().setup()
  51 + except Exception:
  52 + pass # Silently handle setup errors from scanners
  53 +
  54 + def handle_one_request(self):
  55 + """Handle single request with error catching."""
  56 + try:
  57 + # Check rate limiting
  58 + client_ip = self.client_address[0]
  59 + if self.is_rate_limited(client_ip):
  60 + logging.warning(f"Rate limiting IP: {client_ip}")
  61 + self.send_error(429, "Too Many Requests")
  62 + return
  63 +
  64 + super().handle_one_request()
  65 + except (ConnectionResetError, BrokenPipeError):
  66 + # Client disconnected prematurely - common with scanners
  67 + pass
  68 + except UnicodeDecodeError:
  69 + # Binary data received - not HTTP
  70 + pass
  71 + except Exception as e:
  72 + # Log unexpected errors but don't crash
  73 + logging.debug(f"Request handling error: {e}")
  74 +
  75 + def log_message(self, format, *args):
  76 + """Suppress logging for malformed requests from scanners."""
  77 + message = format % args
  78 + # Filter out scanner noise
  79 + noise_patterns = [
  80 + "code 400",
  81 + "Bad request",
  82 + "Bad request version",
  83 + "Bad HTTP/0.9 request type",
  84 + "Bad request syntax"
  85 + ]
  86 + if any(pattern in message for pattern in noise_patterns):
  87 + return
  88 + # Only log legitimate requests
  89 + if message and not message.startswith(" ") and len(message) > 10:
  90 + super().log_message(format, *args)
19 91
20 def end_headers(self): 92 def end_headers(self):
21 # Add CORS headers 93 # Add CORS headers
22 self.send_header('Access-Control-Allow-Origin', '*') 94 self.send_header('Access-Control-Allow-Origin', '*')
23 self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS') 95 self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
24 self.send_header('Access-Control-Allow-Headers', 'Content-Type') 96 self.send_header('Access-Control-Allow-Headers', 'Content-Type')
  97 + # Add security headers
  98 + self.send_header('X-Content-Type-Options', 'nosniff')
  99 + self.send_header('X-Frame-Options', 'DENY')
  100 + self.send_header('X-XSS-Protection', '1; mode=block')
25 super().end_headers() 101 super().end_headers()
26 102
27 def do_OPTIONS(self): 103 def do_OPTIONS(self):
28 - self.send_response(200)  
29 - self.end_headers() 104 + """Handle OPTIONS requests."""
  105 + try:
  106 + self.send_response(200)
  107 + self.end_headers()
  108 + except Exception:
  109 + pass
  110 +
  111 +class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
  112 + """Threaded TCP server with better error handling."""
  113 + allow_reuse_address = True
  114 + daemon_threads = True
30 115
31 if __name__ == '__main__': 116 if __name__ == '__main__':
32 - with socketserver.TCPServer(("", PORT), MyHTTPRequestHandler) as httpd: 117 + # Create threaded server for better concurrency
  118 + with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd:
33 print(f"Frontend server started at http://localhost:{PORT}") 119 print(f"Frontend server started at http://localhost:{PORT}")
34 print(f"Serving files from: {os.getcwd()}") 120 print(f"Serving files from: {os.getcwd()}")
35 print("\nPress Ctrl+C to stop the server") 121 print("\nPress Ctrl+C to stop the server")
  122 +
36 try: 123 try:
37 httpd.serve_forever() 124 httpd.serve_forever()
38 except KeyboardInterrupt: 125 except KeyboardInterrupt:
39 - print("\nServer stopped") 126 + print("\nShutting down server...")
  127 + httpd.shutdown()
  128 + print("Server stopped")
40 sys.exit(0) 129 sys.exit(0)
  130 + except Exception as e:
  131 + print(f"Server error: {e}")
  132 + sys.exit(1)
scripts/install_server_deps.sh 0 → 100755
@@ -0,0 +1,14 @@ @@ -0,0 +1,14 @@
  1 +#!/bin/bash
  2 +
  3 +echo "Installing server security dependencies..."
  4 +
  5 +# Check if we're in a conda environment
  6 +if [ -z "$CONDA_DEFAULT_ENV" ]; then
  7 + echo "Warning: No conda environment detected. Installing with pip..."
  8 + pip install slowapi>=0.1.9 anyio>=3.7.0
  9 +else
  10 + echo "Installing in conda environment: $CONDA_DEFAULT_ENV"
  11 + pip install slowapi>=0.1.9 anyio>=3.7.0
  12 +fi
  13 +
  14 +echo "Dependencies installed successfully!"
0 \ No newline at end of file 15 \ No newline at end of file
scripts/start_servers.py 0 → 100755
@@ -0,0 +1,247 @@ @@ -0,0 +1,247 @@
  1 +#!/usr/bin/env python3
  2 +"""
  3 +Production-ready server startup script with proper error handling and monitoring.
  4 +"""
  5 +
  6 +import os
  7 +import sys
  8 +import signal
  9 +import time
  10 +import subprocess
  11 +import logging
  12 +from typing import Dict, List, Optional
  13 +import multiprocessing
  14 +import threading
  15 +
  16 +# Configure logging
  17 +logging.basicConfig(
  18 + level=logging.INFO,
  19 + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
  20 + handlers=[
  21 + logging.StreamHandler(),
  22 + logging.FileHandler('/tmp/search_engine_startup.log', mode='a')
  23 + ]
  24 +)
  25 +logger = logging.getLogger(__name__)
  26 +
  27 +class ServerManager:
  28 + """Manages frontend and API server processes."""
  29 +
  30 + def __init__(self):
  31 + self.processes: Dict[str, subprocess.Popen] = {}
  32 + self.running = True
  33 +
  34 + def start_frontend_server(self) -> bool:
  35 + """Start the frontend server."""
  36 + try:
  37 + frontend_script = os.path.join(os.path.dirname(__file__), 'frontend_server.py')
  38 +
  39 + cmd = [sys.executable, frontend_script]
  40 + env = os.environ.copy()
  41 + env['PYTHONUNBUFFERED'] = '1'
  42 +
  43 + process = subprocess.Popen(
  44 + cmd,
  45 + env=env,
  46 + stdout=subprocess.PIPE,
  47 + stderr=subprocess.STDOUT,
  48 + universal_newlines=True,
  49 + bufsize=1
  50 + )
  51 +
  52 + self.processes['frontend'] = process
  53 + logger.info(f"Frontend server started with PID: {process.pid}")
  54 +
  55 + # Start monitoring thread
  56 + threading.Thread(
  57 + target=self._monitor_output,
  58 + args=('frontend', process),
  59 + daemon=True
  60 + ).start()
  61 +
  62 + return True
  63 +
  64 + except Exception as e:
  65 + logger.error(f"Failed to start frontend server: {e}")
  66 + return False
  67 +
  68 + def start_api_server(self, customer: str = "customer1", es_host: str = "http://localhost:9200") -> bool:
  69 + """Start the API server."""
  70 + try:
  71 + cmd = [
  72 + sys.executable, 'main.py', 'serve',
  73 + '--customer', customer,
  74 + '--es-host', es_host,
  75 + '--host', '0.0.0.0',
  76 + '--port', '6002'
  77 + ]
  78 +
  79 + env = os.environ.copy()
  80 + env['PYTHONUNBUFFERED'] = '1'
  81 + env['CUSTOMER_ID'] = customer
  82 + env['ES_HOST'] = es_host
  83 +
  84 + process = subprocess.Popen(
  85 + cmd,
  86 + env=env,
  87 + stdout=subprocess.PIPE,
  88 + stderr=subprocess.STDOUT,
  89 + universal_newlines=True,
  90 + bufsize=1
  91 + )
  92 +
  93 + self.processes['api'] = process
  94 + logger.info(f"API server started with PID: {process.pid}")
  95 +
  96 + # Start monitoring thread
  97 + threading.Thread(
  98 + target=self._monitor_output,
  99 + args=('api', process),
  100 + daemon=True
  101 + ).start()
  102 +
  103 + return True
  104 +
  105 + except Exception as e:
  106 + logger.error(f"Failed to start API server: {e}")
  107 + return False
  108 +
  109 + def _monitor_output(self, name: str, process: subprocess.Popen):
  110 + """Monitor process output and log appropriately."""
  111 + try:
  112 + for line in iter(process.stdout.readline, ''):
  113 + if line.strip() and self.running:
  114 + # Filter out scanner noise for frontend server
  115 + if name == 'frontend':
  116 + noise_patterns = [
  117 + 'code 400',
  118 + 'Bad request version',
  119 + 'Bad request syntax',
  120 + 'Bad HTTP/0.9 request type'
  121 + ]
  122 + if any(pattern in line for pattern in noise_patterns):
  123 + continue
  124 +
  125 + logger.info(f"[{name}] {line.strip()}")
  126 +
  127 + except Exception as e:
  128 + if self.running:
  129 + logger.error(f"Error monitoring {name} output: {e}")
  130 +
  131 + def check_servers(self) -> bool:
  132 + """Check if all servers are still running."""
  133 + all_running = True
  134 +
  135 + for name, process in self.processes.items():
  136 + if process.poll() is not None:
  137 + logger.error(f"{name} server has stopped with exit code: {process.returncode}")
  138 + all_running = False
  139 +
  140 + return all_running
  141 +
  142 + def stop_all(self):
  143 + """Stop all servers gracefully."""
  144 + logger.info("Stopping all servers...")
  145 + self.running = False
  146 +
  147 + for name, process in self.processes.items():
  148 + try:
  149 + logger.info(f"Stopping {name} server (PID: {process.pid})...")
  150 +
  151 + # Try graceful shutdown first
  152 + process.terminate()
  153 +
  154 + # Wait up to 10 seconds for graceful shutdown
  155 + try:
  156 + process.wait(timeout=10)
  157 + logger.info(f"{name} server stopped gracefully")
  158 + except subprocess.TimeoutExpired:
  159 + # Force kill if graceful shutdown fails
  160 + logger.warning(f"{name} server didn't stop gracefully, forcing...")
  161 + process.kill()
  162 + process.wait()
  163 + logger.info(f"{name} server stopped forcefully")
  164 +
  165 + except Exception as e:
  166 + logger.error(f"Error stopping {name} server: {e}")
  167 +
  168 + self.processes.clear()
  169 + logger.info("All servers stopped")
  170 +
  171 +def signal_handler(signum, frame):
  172 + """Handle shutdown signals."""
  173 + logger.info(f"Received signal {signum}, shutting down...")
  174 + if 'manager' in globals():
  175 + manager.stop_all()
  176 + sys.exit(0)
  177 +
  178 +def main():
  179 + """Main function to start all servers."""
  180 + global manager
  181 +
  182 + parser = argparse.ArgumentParser(description='Start SearchEngine servers')
  183 + parser.add_argument('--customer', default='customer1', help='Customer ID')
  184 + parser.add_argument('--es-host', default='http://localhost:9200', help='Elasticsearch host')
  185 + parser.add_argument('--check-dependencies', action='store_true', help='Check dependencies before starting')
  186 + args = parser.parse_args()
  187 +
  188 + logger.info("Starting SearchEngine servers...")
  189 + logger.info(f"Customer: {args.customer}")
  190 + logger.info(f"Elasticsearch: {args.es_host}")
  191 +
  192 + # Check dependencies if requested
  193 + if args.check_dependencies:
  194 + logger.info("Checking dependencies...")
  195 + try:
  196 + import slowapi
  197 + import anyio
  198 + logger.info("✓ All dependencies available")
  199 + except ImportError as e:
  200 + logger.error(f"✗ Missing dependency: {e}")
  201 + logger.info("Please run: pip install -r requirements_server.txt")
  202 + sys.exit(1)
  203 +
  204 + manager = ServerManager()
  205 +
  206 + # Set up signal handlers
  207 + signal.signal(signal.SIGINT, signal_handler)
  208 + signal.signal(signal.SIGTERM, signal_handler)
  209 +
  210 + try:
  211 + # Start servers
  212 + if not manager.start_api_server(args.customer, args.es_host):
  213 + logger.error("Failed to start API server")
  214 + sys.exit(1)
  215 +
  216 + # Wait a moment before starting frontend server
  217 + time.sleep(2)
  218 +
  219 + if not manager.start_frontend_server():
  220 + logger.error("Failed to start frontend server")
  221 + manager.stop_all()
  222 + sys.exit(1)
  223 +
  224 + logger.info("All servers started successfully!")
  225 + logger.info("Frontend: http://localhost:6003")
  226 + logger.info("API: http://localhost:6002")
  227 + logger.info("API Docs: http://localhost:6002/docs")
  228 + logger.info("Press Ctrl+C to stop all servers")
  229 +
  230 + # Monitor servers
  231 + while manager.running:
  232 + if not manager.check_servers():
  233 + logger.error("One or more servers have stopped unexpectedly")
  234 + manager.stop_all()
  235 + sys.exit(1)
  236 +
  237 + time.sleep(5) # Check every 5 seconds
  238 +
  239 + except KeyboardInterrupt:
  240 + logger.info("Received interrupt signal")
  241 + except Exception as e:
  242 + logger.error(f"Unexpected error: {e}")
  243 + finally:
  244 + manager.stop_all()
  245 +
  246 +if __name__ == '__main__':
  247 + main()
0 \ No newline at end of file 248 \ No newline at end of file
scripts/stop.sh 0 → 100755
@@ -0,0 +1,68 @@ @@ -0,0 +1,68 @@
  1 +#!/bin/bash
  2 +
  3 +# Stop script for Search Engine services
  4 +# This script stops both backend and frontend servers
  5 +
  6 +echo "========================================"
  7 +echo "Stopping Search Engine Services"
  8 +echo "========================================"
  9 +
  10 +# Kill processes on port 6002 (backend)
  11 +BACKEND_PIDS=$(lsof -ti:6002 2>/dev/null)
  12 +if [ ! -z "$BACKEND_PIDS" ]; then
  13 + echo "Stopping backend server(s) on port 6002..."
  14 + for PID in $BACKEND_PIDS; do
  15 + echo " Killing PID: $PID"
  16 + kill -TERM $PID 2>/dev/null || true
  17 + done
  18 + sleep 2
  19 + # Force kill if still running
  20 + REMAINING_PIDS=$(lsof -ti:6002 2>/dev/null)
  21 + if [ ! -z "$REMAINING_PIDS" ]; then
  22 + echo " Force killing remaining processes..."
  23 + for PID in $REMAINING_PIDS; do
  24 + kill -KILL $PID 2>/dev/null || true
  25 + done
  26 + fi
  27 + echo "Backend server stopped."
  28 +else
  29 + echo "No backend server found running on port 6002."
  30 +fi
  31 +
  32 +# Kill processes on port 6003 (frontend)
  33 +FRONTEND_PIDS=$(lsof -ti:6003 2>/dev/null)
  34 +if [ ! -z "$FRONTEND_PIDS" ]; then
  35 + echo "Stopping frontend server(s) on port 6003..."
  36 + for PID in $FRONTEND_PIDS; do
  37 + echo " Killing PID: $PID"
  38 + kill -TERM $PID 2>/dev/null || true
  39 + done
  40 + sleep 2
  41 + # Force kill if still running
  42 + REMAINING_PIDS=$(lsof -ti:6003 2>/dev/null)
  43 + if [ ! -z "$REMAINING_PIDS" ]; then
  44 + echo " Force killing remaining processes..."
  45 + for PID in $REMAINING_PIDS; do
  46 + kill -KILL $PID 2>/dev/null || true
  47 + done
  48 + fi
  49 + echo "Frontend server stopped."
  50 +else
  51 + echo "No frontend server found running on port 6003."
  52 +fi
  53 +
  54 +# Also stop any processes using PID files
  55 +if [ -f "logs/backend.pid" ]; then
  56 + BACKEND_PID=$(cat logs/backend.pid 2>/dev/null)
  57 + if [ ! -z "$BACKEND_PID" ] && kill -0 $BACKEND_PID 2>/dev/null; then
  58 + echo "Stopping backend server via PID file (PID: $BACKEND_PID)..."
  59 + kill -TERM $BACKEND_PID 2>/dev/null || true
  60 + sleep 2
  61 + kill -KILL $BACKEND_PID 2>/dev/null || true
  62 + fi
  63 + rm -f logs/backend.pid
  64 +fi
  65 +
  66 +echo "========================================"
  67 +echo "All services stopped successfully!"
  68 +echo "========================================"
0 \ No newline at end of file 69 \ No newline at end of file
search/boolean_parser.py
@@ -82,7 +82,7 @@ class BooleanParser: @@ -82,7 +82,7 @@ class BooleanParser:
82 List of tokens 82 List of tokens
83 """ 83 """
84 # Pattern to match: operators, parentheses, or terms (with domain prefix support) 84 # Pattern to match: operators, parentheses, or terms (with domain prefix support)
85 - pattern = r'\b(AND|OR|RANK|ANDNOT)\b|[()]|(?:\w+:)?[^\s()]++' 85 + pattern = r'\b(AND|OR|RANK|ANDNOT)\b|[()]|(?:\w+:)?[^\s()]+'
86 86
87 tokens = [] 87 tokens = []
88 for match in re.finditer(pattern, expression): 88 for match in re.finditer(pattern, expression):
@@ -17,12 +17,25 @@ echo -e &quot;${GREEN}========================================${NC}&quot; @@ -17,12 +17,25 @@ echo -e &quot;${GREEN}========================================${NC}&quot;
17 echo -e "${GREEN}SearchEngine一键启动脚本${NC}" 17 echo -e "${GREEN}SearchEngine一键启动脚本${NC}"
18 echo -e "${GREEN}========================================${NC}" 18 echo -e "${GREEN}========================================${NC}"
19 19
  20 +# Step 0: Stop existing services first
  21 +echo -e "\n${YELLOW}Step 0/5: 停止现有服务${NC}"
  22 +if [ -f "./scripts/stop.sh" ]; then
  23 + ./scripts/stop.sh
  24 + sleep 2 # Wait for services to fully stop
  25 +else
  26 + echo -e "${YELLOW}停止脚本不存在,手动检查端口...${NC}"
  27 + # Kill any existing processes on our ports
  28 + fuser -k 6002/tcp 2>/dev/null || true
  29 + fuser -k 6003/tcp 2>/dev/null || true
  30 + sleep 2
  31 +fi
  32 +
20 # Step 1: Setup environment 33 # Step 1: Setup environment
21 -echo -e "\n${YELLOW}Step 1/4: 设置环境${NC}" 34 +echo -e "\n${YELLOW}Step 1/5: 设置环境${NC}"
22 ./setup.sh 35 ./setup.sh
23 36
24 # Step 2: Check if data is already ingested 37 # Step 2: Check if data is already ingested
25 -echo -e "\n${YELLOW}Step 2/4: 检查数据${NC}" 38 +echo -e "\n${YELLOW}Step 2/5: 检查数据${NC}"
26 source /home/tw/miniconda3/etc/profile.d/conda.sh 39 source /home/tw/miniconda3/etc/profile.d/conda.sh
27 conda activate searchengine 40 conda activate searchengine
28 41
@@ -55,7 +68,7 @@ else @@ -55,7 +68,7 @@ else
55 fi 68 fi
56 69
57 # Step 3: Start backend in background 70 # Step 3: Start backend in background
58 -echo -e "\n${YELLOW}Step 3/4: 启动后端服务${NC}" 71 +echo -e "\n${YELLOW}Step 3/5: 启动后端服务${NC}"
59 echo -e "${YELLOW}后端服务将在后台运行...${NC}" 72 echo -e "${YELLOW}后端服务将在后台运行...${NC}"
60 nohup ./scripts/start_backend.sh > logs/backend.log 2>&1 & 73 nohup ./scripts/start_backend.sh > logs/backend.log 2>&1 &
61 BACKEND_PID=$! 74 BACKEND_PID=$!
@@ -95,7 +108,7 @@ else @@ -95,7 +108,7 @@ else
95 fi 108 fi
96 109
97 # Step 4: Start frontend 110 # Step 4: Start frontend
98 -echo -e "\n${YELLOW}Step 4/4: 启动前端服务${NC}" 111 +echo -e "\n${YELLOW}Step 4/5: 启动前端服务${NC}"
99 echo -e "${GREEN}========================================${NC}" 112 echo -e "${GREEN}========================================${NC}"
100 echo -e "${GREEN}所有服务启动完成!${NC}" 113 echo -e "${GREEN}所有服务启动完成!${NC}"
101 echo -e "${GREEN}========================================${NC}" 114 echo -e "${GREEN}========================================${NC}"