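How do I deploy an AI model on a cloud server?

Six steps: 1) pick a GPU instance (an L40S is a good starting point); 2) install the NVIDIA driver and CUDA 12.4; 3) install PyTorch; 4) download the model weights; 5) start the inference service with vLLM; 6) set up an Nginx reverse proxy with HTTPS.

What hardware does a large model need?

A 7B model can run on a T4 (16GB), but its FP16 weights alone take about 15GB, so in practice that means a quantized build or a very short context. A 13B-class model wants an L40S (48GB), and a 70B-class model needs an H20 (96GB) plus quantization. If you only want to experiment, calling a hosted API is less work.

vLLM or Ollama: which one?

vLLM suits production: high throughput and good handling of many concurrent requests. Ollama suits local development: one-command install, but weak under concurrency. This tutorial uses vLLM.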

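Deploying an AI application from scratch: buy a server → set up CUDA → run the model → bind a domain, the full walkthrough

2026-05-03 · AI Cloud Services Field Notes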

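The first time I deployed an AI model last year, I spent ages staring at the console with no idea which configuration to pick. This post lays out the whole process, from buying the server to calling the API over HTTPS, with commands for every step.

⏱️ Estimated time: about 45-60 minutes the first time, under 20 minutes once you know the steps.

Step 1: Choose a server configuration

Hardware requirements vary by model. A quick reference table: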

Model size	Recommended GPU	VRAM needed	Monthly cost (reference)
Qwen2.5-1.5B	T4 / CPU	~4GB	¥300-800
Qwen2.5-7B	T4	~16GB	¥1,500-2,500
Qwen2.5-14B	L40S	~28GB	¥3,000-5,000
Qwen2.5-32B	A100 / H20	~60GB	¥6,000-12,000
Qwen2.5-72B (quantized)	H20 + AWQ	~38GB	¥8,000-12,000

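Beginners can start with an L40S: its 48GB of VRAM runs a 14B model in FP16, or a 32B model with quantization (the ~60GB in the table is the unquantized figure), and the price is reasonable. You can pick one under Tencent Cloud's GPU instances. Choose Ubuntu 22.04 as the OS image.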
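
The VRAM column can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes nominal parameter counts (7, 14, 32 and 72 billion) and counts weights only. KV cache and runtime overhead come on top, so treat the results as lower bounds. The function name is illustrative, not part of any library.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


# Nominal sizes from the table; real parameter counts differ slightly.
for name, params, bits in [
    ("7B FP16", 7, 16),
    ("14B FP16", 14, 16),
    ("32B FP16", 32, 16),
    ("72B 4-bit (AWQ)", 72, 4),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB of weights")
```

By this estimate a 32B model in FP16 needs about 64 GB for weights, which is why it does not fit a 48GB card without quantization.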

Step 2: Install CUDA 12.4

After logging in over SSH, first check whether the NVIDIA driver is present:

nvidia-smi

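If this prints a GPU table, the driver is already installed; Tencent Cloud GPU instances usually ship with it. If it errors, install the driver first, because the cuda-toolkit-12-4 package below installs only the toolkit, not the driver: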

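# Add the NVIDIA package repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install the CUDA 12.4 toolkit (broadly compatible)
sudo apt install -y cuda-toolkit-12-4

# The apt package does not put nvcc on PATH; add it for this shell and future ones
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
export PATH=/usr/local/cuda-12.4/bin:$PATH

# Verify
nvcc --version
nvidia-smi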

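⚠️ Note: the CUDA version shown by nvidia-smi is the highest version the driver supports; the version shown by nvcc is the toolkit that is actually installed and on your PATH.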
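
To make the difference between the two version numbers concrete, here is a minimal sketch that parses both outputs and compares them. It assumes the usual output formats (`release X.Y` in `nvcc --version`, `CUDA Version: X.Y` in the nvidia-smi header). The function names are hypothetical, and the "toolkit no newer than driver" rule is deliberately conservative, since CUDA's minor-version compatibility can relax it.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def toolkit_version(nvcc_output: str) -> Optional[Version]:
    """Extract (major, minor) from `nvcc --version` output, e.g. 'release 12.4'."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def driver_max_cuda(smi_output: str) -> Optional[Version]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def toolkit_supported(nvcc_output: str, smi_output: str) -> bool:
    """True when the installed toolkit is no newer than what the driver supports."""
    toolkit = toolkit_version(nvcc_output)
    driver = driver_max_cuda(smi_output)
    return toolkit is not None and driver is not None and toolkit <= driver
```

On the server, the two input strings could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` and the equivalent call for `nvidia-smi`.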

Step 3: Set up the Python environment

# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
~/miniconda3/bin/conda init bash
source ~/.bashrc

# Create the environment
conda create -n ai python=3.11 -y
conda activate ai

# Install PyTorch (CUDA 12.4 build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Verify the GPU is visible
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.get_device_name(0))"
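
The two one-liners above can be extended into a slightly fuller check. The sketch below is one way to do it, not the only one: it reports the PyTorch version, the CUDA version the wheel was built for, and the first GPU's name and total VRAM. The function name `gpu_report` is made up for illustration. It returns early if PyTorch is not installed, so it is safe to run in any environment.

```python
def gpu_report() -> dict:
    """Collect basic facts about the PyTorch install and the first GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu_name"] = props.name
        report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return report


print(gpu_report())
```

If `cuda_available` is False while `built_for_cuda` is set, the wheel is fine and the problem is more likely the driver.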

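Step 4: Deploy the model with vLLM

vLLM is one of the most widely used production inference frameworks and exposes an OpenAI-compatible API. Note that vLLM pins its own PyTorch version, so pip may replace the build installed in step 3. Using Qwen2.5-7B as the example: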

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

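The first run downloads the model weights automatically (about 15GB). Once the log shows Application startup complete, the service is up. Because --host 0.0.0.0 listens on all interfaces, keep port 8000 closed in the security group; the Nginx proxy in step 5 only needs to reach 127.0.0.1:8000.

Test the API: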

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Write a quicksort"}],"max_tokens":500}'
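
The same request can be sent from application code. Here is a minimal sketch using only the Python standard library. It assumes the server from this step is reachable at the base URL you pass in and that the response follows the OpenAI chat-completions shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 500) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send one prompt and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "Write a quicksort"))` should print the model's answer. The official `openai` client also works against this endpoint if you prefer a library.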

Step 5: Nginx reverse proxy + HTTPS

# your-domain.com must already resolve to this server (DNS A record), with ports 80 and 443 open
sudo apt install -y nginx certbot python3-certbot-nginx

sudo tee /etc/nginx/sites-available/ai-api << EOF
server {
    listen 80;
    server_name your-domain.com;
    client_max_body_size 100M;
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Host \$host;
        proxy_read_timeout 300s;
    }
}
EOF

sudo ln -s /etc/nginx/sites-available/ai-api /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
sudo certbot --nginx -d your-domain.com

Step 6: Start on boot with systemd

sudo tee /etc/systemd/system/vllm.service << EOF
[Unit]
Description=vLLM API Server
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
ExecStart=/home/ubuntu/miniconda3/envs/ai/bin/python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000 --max-model-len 8192 --gpu-memory-utilization 0.90
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm

✅ Post-deployment checklist:

systemctl status vllm: the service is running
curl http://localhost:8000/health: the health check passes
curl https://your-domain.com/v1/models: HTTPS works
nvidia-smi: VRAM usage is in line with expectations
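
The two HTTP checks in the list can be scripted so they are easy to rerun after a restart. Below is a sketch using only the standard library. The function names and the result labels are invented for this example, and `your-domain.com` is the same placeholder used throughout. The `fetch` parameter exists so the logic can be exercised without a live server.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def run_checks(local_base: str, public_base: str,
               fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Probe the local health endpoint and the public HTTPS model list."""
    targets = {
        "local health": f"{local_base}/health",
        "public models": f"{public_base}/v1/models",
    }
    return {name: fetch(url) == 200 for name, url in targets.items()}
```

On the server, `print(run_checks("http://localhost:8000", "https://your-domain.com"))` reports each check as True or False. A failing public check with a passing local one usually points at Nginx, DNS or the certificate rather than vLLM.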

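FAQ

Q: Model download too slow?

Set a HuggingFace mirror before starting vLLM: export HF_ENDPOINT=https://hf-mirror.com (for the systemd service, add Environment=HF_ENDPOINT=https://hf-mirror.com under [Service]).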

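Q: CUDA out of memory?

Reduce --max-model-len first, since a shorter context needs less KV cache. If other processes share the GPU, also lower --gpu-memory-utilization (for example to 0.75) so vLLM claims a smaller share of VRAM.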
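
To see why context length matters, the KV cache for one sequence can be estimated as 2 (key and value) × layers × KV heads × head dimension × bytes per value × tokens. The sketch below applies that formula. The architecture numbers for Qwen2.5-7B-Instruct are assumptions from memory, so check the model's config.json before relying on them, and note that vLLM's real allocation is block-based and will differ somewhat.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `tokens` tokens: one key and one value vector
    per layer per KV head, each `head_dim` values wide."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed architecture values for Qwen2.5-7B-Instruct; verify against config.json.
QWEN25_7B = {"layers": 28, "kv_heads": 4, "head_dim": 128}

for context in (8192, 4096):
    mib = kv_cache_bytes(context, **QWEN25_7B) / 2**20
    print(f"{context} tokens -> {mib:.0f} MiB of KV cache per sequence")
```

The cost is linear in tokens, so halving --max-model-len halves the cache a single full-length sequence needs.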

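Q: Don't want to set up the environment by hand?

Tencent Cloud's HAI service can deploy common AI models in one click; pick HAI from the featured cloud products.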
