LLM

Specs

  • agents.md - A simple, open format for guiding coding agents, used by over 20k open-source projects.
  • A2A - Agent2Agent, an open protocol for communication and interoperability between agentic applications.

Frameworks

  • ADK - Agent Development Kit, Google's framework for building and orchestrating AI agents.
  • go-sdk - The official Go SDK for Model Context Protocol servers and clients. Maintained in collaboration with Google.
  • mcphost - A CLI host application that enables Large Language Models (LLMs) to interact with external tools through the Model Context Protocol (MCP); see the sketch below.
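
As a sketch of how such a host is pointed at a model and MCP servers, here is an illustrative invocation; the model string and flags are assumptions from memory, so verify against the mcphost README:

```bash
# illustrative only: flag names and config path are assumed, check `mcphost --help`
mcphost -m ollama:qwen3 --config ./mcp.json
```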

Serving

  • aibrix - Cost-efficient and pluggable infrastructure components for GenAI inference
  • bifrost - Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
  • mistral.rs - Blazingly fast LLM inference
  • ollama - Get up and running with Llama 3, Mistral, Gemma, and other large language models
  • LMCache - Supercharge Your LLM with the Fastest KV Cache Layer

MCP Servers

Coding Agents

Tools

  • TokenCost - Easy token price estimates for 400+ LLMs. TokenOps. (usage sketch below)
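
A minimal sketch of the kind of estimate TokenCost gives, assuming its calculate_prompt_cost / calculate_completion_cost helpers; verify the names against the project README:

```python
# assumption: tokencost exposes these helpers; check the README before relying on them
from tokencost import calculate_completion_cost, calculate_prompt_cost

prompt = [{"role": "user", "content": "Hello, world!"}]
completion = "Hi there! How can I help you today?"

# price the prompt and completion separately, then sum
cost = calculate_prompt_cost(prompt, "gpt-4o") + calculate_completion_cost(completion, "gpt-4o")
print(f"estimated cost: ${cost}")
```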

Models

| Creator | Name | Hugging Face | Ollama |
| --- | --- | --- | --- |
| Alibaba | Qwen 3 VL | | Ollama |
| BAAI | bge-m3 | HF | Ollama |
| DeepSeek | DeepSeek OCR | HF | Ollama |
| Google | embeddinggemma | | Ollama |
| Google | gemma3 | | Ollama |
| Google | gemma3n | | Ollama |
| Google | vaultgemma | HF | |
| PaddlePaddle | Paddle OCR | HF | |
| SCB 10X | typhoon-ocr-3b | | Ollama |
| SCB 10X | typhoon-translate-4b | | Ollama |
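
For rows with an Ollama entry, trying a model locally is a one-liner; the tags below are assumptions, so check the Ollama library for the exact names:

```bash
ollama pull bge-m3   # fetch the embedding model from the table above (tag assumed)
ollama run gemma3    # open an interactive chat with Google's gemma3 (tag assumed)
```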

llama-server

```bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --fim-qwen-7b-default --host 0.0.0.0 --port 8080
```

```bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --gpt-oss-120b-default --host 0.0.0.0 --port 8080
```
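
Once either server is up, it exposes an OpenAI-compatible API, so a quick smoke test (assuming the port above) looks like:

```bash
# llama-server serves the loaded model regardless of the "model" field
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```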

lemonade-server

```bash
lemonade-server pull user.gemma-3-12b \
  --checkpoint unsloth/gemma-3-12b-it-GGUF:Q4_K_M \
  --recipe llamacpp
```

```bash
lemonade-server pull user.qwen3-30b \
  --checkpoint unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M \
  --recipe llamacpp
```

```bash
lemonade-server pull user.qwen3-next-80b \
  --checkpoint Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M \
  --recipe llamacpp
```
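
After pulling, the models are served through lemonade-server's OpenAI-compatible endpoint; the subcommands below are assumptions, so verify them with `lemonade-server --help`:

```bash
# assumed subcommands: list what was pulled, then start serving
lemonade-server list
lemonade-server serve
```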

TTS

Request

Curl

```bash
curl -u "username:password" -X POST https://example.com/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Python

```python
import os

import httpx
from openai import OpenAI

http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # openai sdk requires a dummy API key
    http_client=http_client,
)

completion = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
)

print(completion.choices[0].message.content)
```

Streaming

```python
import os

import httpx
from openai import OpenAI

http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # openai sdk requires a dummy API key
    http_client=http_client,
)

stream = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
    stream=True,
    # max_tokens=500,
    # temperature=0.7
)

with open("response.txt", "w") as f:
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            f.write(content)
            print(content, end="", flush=True)
```

Hardware

Setting up NVIDIA DGX Spark with ggml

```bash
bash <(curl -s https://ggml.ai/dgx-spark.sh)
```
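
After the script completes, a basic sanity check that the GPU is visible (standard NVIDIA tooling, independent of the script):

```bash
# lists detected NVIDIA devices, driver version, and CUDA version
nvidia-smi
```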

Catalogs

Vendors

Google

OpenAI

Demo

Apps

  • gallery - A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.

Tools

Resources