AI Inference Frameworks
Deploy major AI inference frameworks on a high-availability (HA) architecture.
Keep AI services running 24/7 without interruption using CoreLab Cluster.
Supported Frameworks
🚀 vLLM
High-throughput LLM inference engine with PagedAttention.
Optimized for large-scale inference workloads. Multi-node setup with Ray Cluster.
⚡ SGLang
Efficient LLM server with RadixAttention.
Prompt caching and dynamic scheduling speed up workloads that repeatedly reuse the same prompt prefixes. Multi-node setup with Ray Cluster.
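To illustrate why prompt caching helps when many requests share a prefix (for example, the same system prompt), here is a toy sketch of prefix reuse. It is not SGLang's RadixAttention implementation: the `PrefixCache` class and the whitespace "tokenization" are assumptions made purely for illustration.

```python
class PrefixCache:
    """Toy token trie that reports how many leading tokens were seen before."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return the count of leading tokens already cached, then cache the rest."""
        node = self.root
        hits = 0
        matching = True
        for tok in tokens:
            if matching and tok in node:
                hits += 1                  # this token's work could be reused
            else:
                matching = False           # prefix diverged; the rest is new work
                node.setdefault(tok, {})
            node = node[tok]
        return hits


cache = PrefixCache()
system = "You are a support agent .".split()   # hypothetical shared system prompt
first = cache.match_and_insert(system + "How do I reset my password ?".split())
second = cache.match_and_insert(system + "How do I add a node ?".split())
print(first, second)
```

In this toy run the second request reuses the whole system prompt plus the shared start of the question, which is the kind of saving a prefix cache aims for.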
🦙 llama.cpp
Lightweight LLM library supporting CPU/GPU hybrid inference.
Runs quantized models efficiently even without a GPU. Parallel execution for fast inference. (Recommended for small-scale workloads)
🔥 Ollama
Open-source platform simplifying local LLM execution.
One-click model download, run, and API serving. (Recommended for small-scale workloads)
🔗 LiteLLM
Unified interface for OpenAI, Anthropic, and other LLM APIs.
Multi-model routing, cost optimization, distributed processing. API-level HA integrated with CoreLab.
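The idea of multi-model routing with fallback can be sketched in plain Python. This is not LiteLLM's API: `Backend`, `route`, and the cost figures are hypothetical names and numbers used only to show the pattern of trying the cheapest backend first and falling back when it fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Backend:
    name: str                    # hypothetical label, e.g. "vllm-primary"
    cost_per_1k_tokens: float    # illustrative number, not real pricing
    call: Callable[[str], str]   # sends the prompt and returns the reply text


def route(prompt: str, backends: List[Backend]) -> Tuple[str, str]:
    """Try backends from cheapest to most expensive; fall back on any error."""
    errors = []
    for backend in sorted(backends, key=lambda b: b.cost_per_1k_tokens):
        try:
            return backend.name, backend.call(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{backend.name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Each `call` would wrap whatever client the backend needs, so the routing logic stays independent of any one provider.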
Inference Server Clustering Setup
| Component | Configuration |
|---|---|
| Master Node | Runs the inference server (vLLM, SGLang, etc.) |
| Worker Node | Standby with the same model loaded; state synchronized in real time |
| HA Setup | Redundant Master Nodes and redundant Redis (GCS service sync); available on physical servers or VMs |
| Model Sync | Shared storage or real-time replication of model files |
| Monitoring | CoreLab web console integrated with inference API health checks |
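As a rough sketch of how health checks can drive failover between the nodes in the table above, the following picks the first healthy node in priority order. It assumes each inference server exposes an HTTP health endpoint (vLLM's OpenAI-compatible server provides `/health`); the hostnames and ports in the usage note are hypothetical, and this is not CoreLab's actual mechanism.

```python
import urllib.error
import urllib.request
from typing import Callable, List, Optional


def http_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # unreachable, timed out, or returned an HTTP error


def pick_active(
    nodes: List[str],
    is_healthy: Callable[[str], bool] = http_healthy,
) -> Optional[str]:
    """Return the first healthy node in priority order (master first), or None."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None
```

A caller might run `pick_active(["http://master:8000", "http://worker:8000"])` on a timer and repoint traffic when the result changes. Passing `is_healthy` as a parameter keeps the selection logic testable without a live server.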
Use Cases
💬 AI Chatbot Service
Run LLM-powered customer support chatbots 24/7 on an HA setup.
Automatic failover ensures zero service downtime.
📄 Document Analysis Pipeline
Run redundant inference servers for large-scale document RAG pipelines.
If a failure occurs during batch processing, a standby node takes over immediately.
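Recovery during a batch job can be sketched as follows: when the active node fails on a document, the same document is retried on the next standby and already completed results are kept. `run_batch` and the callable-per-node model are assumptions for illustration, not CoreLab's implementation.

```python
from typing import Callable, Dict, List


def run_batch(docs: List[str], nodes: List[Callable[[str], str]]) -> Dict[str, str]:
    """Process docs in order; on failure, retry the same doc on the next node."""
    results: Dict[str, str] = {}
    active = 0  # index of the node currently serving requests
    for doc in docs:
        while True:
            try:
                results[doc] = nodes[active](doc)
                break
            except Exception:
                active += 1  # promote the next standby; finished docs are kept
                if active >= len(nodes):
                    raise RuntimeError(
                        f"no nodes left while processing {doc!r}"
                    ) from None
    return results
```

Because results are keyed by document, this sketch assumes each entry in `docs` is a unique identifier; a production pipeline would typically also persist progress so a restarted job can resume.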