serving-llms-vllm - AI Research Skills
Description
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
Type
Skill
Ecosystem
Cross-platform
Trust Score
85%
Related Skills
Stable Diffusion WebUI
Feature-rich web interface for Stable Diffusion image generation by AUTOMATIC1111.
LangChain
Comprehensive framework for building LLM-powered applications with chains, agents, and retrieval.
LobeChat
Modern, extensible AI chat framework with plugin ecosystem and multi-model support.
Open WebUI
Self-hosted web UI for LLMs with multi-model support, RAG, and plugin system.