LLM
Specs
Frameworks
- ADK
- go-sdk - The official Go SDK for Model Context Protocol servers and clients. Maintained in collaboration with Google.
- mcphost - A CLI host application that enables Large Language Models (LLMs) to interact with external tools through the Model Context Protocol (MCP).
Serving
- aibrix - Cost-efficient and pluggable Infrastructure components for GenAI inference
- bifrost - Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
- mistral.rs - Blazingly fast LLM inference
- ollama - Get up and running with Llama 3, Mistral, Gemma, and other large language models (see the quick-start sketch after this list)
- LMCache - Supercharge Your LLM with the Fastest KV Cache Layer
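As a quick start for ollama from the list above, a minimal sketch; the gemma3 tag is an assumption (any model from the Ollama library works), and ollama also exposes an OpenAI-compatible API on port 11434:

```bash
# pull a model and chat with it on the command line
ollama pull gemma3
ollama run gemma3 "Explain what a KV cache is in one paragraph."

# ollama also serves an OpenAI-compatible API on localhost:11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3", "messages": [{"role": "user", "content": "Hello!"}]}'
```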
MCP Servers
Coding Agents
Tools
- TokenCost - Easy token price estimates for 400+ LLMs. A TokenOps project.
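A small sketch of typical TokenCost usage, assuming the tokencost package is installed; the model name is only an example:

```python
from tokencost import calculate_prompt_cost, calculate_completion_cost

model = "gpt-4o-mini"  # example model; TokenCost covers 400+ price entries
prompt = [{"role": "user", "content": "Hello world"}]
completion = "How may I assist you today?"

# estimate input and output cost separately, then total them
prompt_cost = calculate_prompt_cost(prompt, model)
completion_cost = calculate_completion_cost(completion, model)
print(prompt_cost + completion_cost)
```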
Models
| Creator | Name | Hugging Face | Ollama |
|---|---|---|---|
| Alibaba | Qwen 3 VL | | Ollama |
| BAAI | bge-m3 | HF | Ollama |
| DeepSeek | DeepSeek OCR | HF | Ollama |
| Google | embeddinggemma | | Ollama |
| Google | gemma3 | | Ollama |
| Google | gemma3n | | Ollama |
| Google | vaultgemma | HF | |
| PaddlePaddle | Paddle OCR | HF | |
| SCB 10X | typhoon-ocr-3b | | Ollama |
| SCB 10X | typhoon-translate-4b | | Ollama |
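The Hugging Face and Ollama columns indicate where each checkpoint is published; a fetch sketch for one row, assuming current repo and tag names (verify them on the respective hubs):

```bash
# Hugging Face column: download the repo with the Hugging Face CLI
huggingface-cli download BAAI/bge-m3

# Ollama column: pull the tag into the local Ollama library
ollama pull bge-m3
```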
llama-server
```bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --fim-qwen-7b-default --host 0.0.0.0 --port 8080
```

```bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --gpt-oss-120b-default --host 0.0.0.0 --port 8080
```
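Once llama-server is up, it exposes a health endpoint and an OpenAI-compatible API; a quick check, assuming the default host and port from the commands above:

```bash
# liveness check
curl http://localhost:8080/health

# chat completion against the single loaded model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```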
lemonade-server
```bash
lemonade-server pull user.gemma-3-12b \
  --checkpoint unsloth/gemma-3-12b-it-GGUF:Q4_K_M \
  --recipe llamacpp
```

```bash
lemonade-server pull user.qwen3-30b \
  --checkpoint unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M \
  --recipe llamacpp
```

```bash
lemonade-server pull user.qwen3-next-80b \
  --checkpoint Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M \
  --recipe llamacpp
```
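After pulling, the models can be queried through Lemonade Server's OpenAI-compatible API; a sketch only, where the port 8000 and the /api/v1 prefix are assumptions to verify against the lemonade-server docs:

```bash
# chat completion against a pulled model (port and path are assumptions)
curl http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "user.gemma-3-12b", "messages": [{"role": "user", "content": "Hello!"}]}'
```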
TTS
- Kokoro
- See available voices: https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX
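A minimal synthesis sketch, assuming the kokoro-onnx Python package and locally downloaded model and voice files; the file names and voice id are assumptions, see the repo above for the current ones:

```python
# pip install kokoro-onnx soundfile  (assumed package names)
import soundfile as sf
from kokoro_onnx import Kokoro

# model and voice files downloaded from the ONNX community repo linked above
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Hello from Kokoro!", voice="af_sarah", speed=1.0, lang="en-us"
)
sf.write("hello.wav", samples, sample_rate)
```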
Request
Curl
```bash
curl -u "username:password" -X POST https://example.com/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Python
```python
import os

import httpx
from openai import OpenAI

http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # openai sdk requires a dummy API key
    http_client=http_client,
)

completion = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
)

print(completion.choices[0].message.content)
```
Streaming
```python
import os

import httpx
from openai import OpenAI

http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # openai sdk requires a dummy API key
    http_client=http_client,
)

stream = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
    stream=True,
    # max_tokens=500,
    # temperature=0.7
)

with open("response.txt", "w") as f:
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            f.write(content)
            print(content, end="", flush=True)
```

Hardware
Setting up NVIDIA DGX Spark with ggml
```bash
bash <(curl -s https://ggml.ai/dgx-spark.sh)
```

Catalogs
Vendors
Google
OpenAI
Demo
Apps
- gallery - A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.