LocalAI - Backends

vllm

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

vllm-development

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

cuda11-vllm

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

cuda12-vllm

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

rocm-vllm

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

intel-sycl-f32-vllm

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

intel-sycl-f16-vllm

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

cuda11-vllm-development

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

cuda12-vllm-development

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

rocm-vllm-development

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

intel-sycl-f32-vllm-development

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

intel-sycl-f16-vllm-development

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Fast model execution with CUDA/HIP graph Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 Optimized CUDA kernels, including integration with FlashAttention and FlashInfer Speculative decoding Chunked prefill

https://github.com/vllm-project/vllm

Backend Management

Filter by type:

vllm

vllm-development

cuda11-vllm

cuda12-vllm

rocm-vllm

intel-sycl-f32-vllm

intel-sycl-f16-vllm

cuda11-vllm-development

cuda12-vllm-development

rocm-vllm-development

intel-sycl-f32-vllm-development

intel-sycl-f16-vllm-development

rerankers

rerankers-development

cuda11-rerankers

cuda12-rerankers

intel-sycl-f32-rerankers

intel-sycl-f16-rerankers

rocm-rerankers

cuda11-rerankers-development

cuda12-rerankers-development