HomeAI Inference

AI Inference
AI Inference Frameworks

Deploy major AI inference frameworks with high availability architecture.
Enable 24/7 uninterrupted AI service operation with CoreLab Cluster.

Supported Frameworks

🚀 vLLM

High-throughput LLM inference engine with PagedAttention.
Optimized for large-scale inference workloads. Multi-node setup with Ray Cluster.

⚡ sglang

Efficient LLM server with RadixAttention.
Prompt caching and dynamic scheduling maximize iterative inference. Multi-node setup with Ray Cluster.

🦙 llama.cpp

Lightweight LLM library supporting CPU/GPU hybrid inference.
Efficiently runs quantized models without GPU. Parallel execution for fast inference. (Recommended for small-scale workloads)

🔥 Ollama

Open-source platform simplifying local LLM execution.
One-click model download, run, and API serving. (Recommended for small-scale workloads)

🔗 LiteLLM

Unified interface for OpenAI, Anthropic, and other LLM APIs.
Multi-model routing, cost optimization, distributed processing. API-level HA integrated with CoreLab.

Inference Server Clustering Setup

Master Node Running inference server (vLLM / sglang, etc.)
Worker Node Same model loaded, waiting · Real-time state sync
HA Setup Master Node duplication, Redis duplication (GCS service sync) - Physical, VM available
Model Sync Shared storage or real-time replication for model files
Monitoring CoreLab web console + Inference API health check integration

Use Cases

💬 AI Chatbot Service

Run LLM-powered customer support chatbots 24/7 with HA setup.
Automatic failover ensures zero service downtime.

📄 Document Analysis Pipeline

Dualize inference servers for large-scale document RAG pipelines.
Instant recovery from standby node during batch processing failures.

Inquire about AI Inference →