GitHub · Gemini · Claude

FreeLLMAPI: A Self-Hosted Gateway That Pools 14 Free AI APIs Into One OpenAI-Compatible Endpoint

By 沉默王二 · May 30, 2026

Read original on juejin.cn ↗ Google Translate ↗ Alt translation

Why it matters

For Western developers burning through expensive API credits on Claude, GPT-4, or Gemini, FreeLLMAPI offers a practical escape hatch: a self-hosted gateway that turns 14 free tiers into a single, reliable API endpoint. It's a concrete example of how the open-source community is systematically dismantling the cost barrier to AI tooling — and the router's sophisticated penalty and sticky-session logic makes it production-viable, not just a toy.

Summary

FreeLLMAPI is an open-source, self-hosted API gateway that aggregates the free tiers of 14 major AI model providers — including Google Gemini, Groq, Mistral, Cerebras, SambaNova, OpenRouter, GitHub Models, Cloudflare Workers AI, Cohere, Zhipu, HuggingFace, and NVIDIA NIM — into a single OpenAI-compatible endpoint. Users register for each platform's free API key (no credit card required), configure them in FreeLLMAPI's dashboard, and get a unified Bearer Token. All requests then go to `http://localhost:3001/v1/chat/completions`, and the router automatically selects the best available model.

The router is the project's standout feature. It uses a dynamic penalty system: models that return 429 rate-limit errors get a penalty score that decays over time, automatically sinking them in the priority queue. Rate limiting uses a sliding window algorithm with dual memory and SQLite storage for persistence across restarts. Cooldowns escalate from 2 minutes to 24 hours for repeated failures. A Sticky Session mechanism keeps multi-turn conversations on the same model for 30 minutes, with graceful fallback if that model becomes unavailable. The system retries up to 20 times across all configured providers.

The project has already garnered 6.2k stars on GitHub. The original author, a Chinese developer known as "Silent King Er," reports spending over $400/month on Claude Code and Codex, and built this integration as a cost-saving measure. He also demonstrates integrating FreeLLMAPI with his own open-source Agent CLI tool, PaiCLI, showing how developers can route all their AI tooling through the free tier.

Key takeaways

— FreeLLMAPI aggregates free API quotas from 14 AI providers into a single OpenAI-compatible endpoint.

— Users register for each platform's free key (no credit card required) and configure them in a local dashboard.

— The router uses a dynamic penalty system: models returning 429 errors get a penalty score that decays over time, automatically shifting traffic to healthier models.

— Rate limiting uses a sliding window algorithm with dual memory and SQLite storage for persistence across restarts.

— Cooldowns escalate: first 429 in 24 hours = 2 minutes, second = 10 minutes, third = 1 hour, fourth = 24 hours.

— Sticky Sessions keep multi-turn conversations on the same model for 30 minutes, with graceful fallback if that model becomes unavailable.

— The system retries up to 20 times across all configured providers before returning a 429 error.

— 401 authentication failures do not trigger retries, as switching models won't fix an invalid key.

— The project has 6.2k stars on GitHub and is built by developer tashfeenahmed.

— The original author reports spending over $400/month on Claude Code and Codex, motivating this integration.

Our take

The dynamic penalty routing is a clever piece of engineering: it turns rate limiting from a hard failure into a self-healing load-balancing signal, without requiring any manual configuration.

The escalating cooldown strategy (2 min → 10 min → 1 hr → 24 hr) shows a deep understanding of free-tier behavior — repeated 429s usually mean the daily quota is exhausted, so waiting a full day is more efficient than retrying every few seconds.

Dual memory + SQLite for rate limit tracking is a pragmatic tradeoff: memory for speed, SQLite for crash recovery. It's the kind of detail that separates a hobby project from something you'd actually run in production.

The Sticky Session mechanism solves a real UX problem in multi-turn AI interactions — context coherence — without forcing the user to pin a specific model. The 30-minute window is a reasonable balance between stability and flexibility.

The fact that 401 errors are explicitly excluded from retry logic shows careful error handling: some failures are structural, not transient, and retrying them would just waste time and quota.

FreeLLMAPI's existence signals a broader trend: as AI model APIs proliferate, the value is shifting from the models themselves to the infrastructure that routes, manages, and optimizes access to them.

Concepts & terms

Sliding Window Rate Limiting

A rate-limiting algorithm that tracks requests over a continuous, rolling time window (e.g., 'the last 60 seconds') rather than resetting at fixed intervals. This prevents burst traffic at window boundaries and provides smoother enforcement.

Sticky Session

A routing strategy that ensures all requests from the same multi-turn conversation are sent to the same backend model, preserving context coherence. In FreeLLMAPI, it uses a SHA1 hash of the first user message as a session key, with a 30-minute validity window.

Dynamic Penalty Routing

A load-balancing approach where models accumulate penalty scores for failures (like rate limiting), which decay over time and with successful requests. This automatically shifts traffic away from failing models without manual intervention.

Source: juejin.cn ↗ Google Translate ↗ Backup ↗