Best Practices for Working with Free LLM APIs

Master the art of building AI-powered applications with free resources. Learn optimization, security, prompt engineering, and more.

Getting Started

Essential setup and configuration

1. Choose the Right Provider for Your Use Case

Consider These Factors:

  • Rate Limits: How many requests do you need per minute/day?
  • Context Window: Do you need to process long documents?
  • Speed: Is real-time response critical?
  • Privacy: Are you handling sensitive data?

Quick Recommendations:

  • For Speed → Groq (800+ tokens/sec)
  • For Long Contexts → Gemini 2.0 Flash (1M token context window)
  • For Privacy → Ollama (local, 100% offline capable)
  • For Coding → DeepSeek Coder (specialized for code generation)

2. Implement Proper Error Handling

Free APIs can fail (rate limits, downtime, network issues). Always build resilient code:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def call_llm_with_retry(url, headers, data, max_retries=3):
    """Robust API call with exponential backoff"""
    
    # Configure retry strategy
    session = requests.Session()
    retry = Retry(
        total=max_retries,
        backoff_factor=1,  # Wait ~1s, 2s, 4s between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=frozenset({"POST"})  # POST is not retried by default
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('https://', adapter)
    
    try:
        response = session.post(url, headers=headers, json=data, timeout=30)
        response.raise_for_status()
        return response.json()
    
    except requests.exceptions.RetryError:
        print("Retries exhausted (persistent 429s or 5xx). Consider switching providers.")
        raise
    
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            print("Rate limit hit. Consider switching providers.")
        raise
    
    except requests.exceptions.Timeout:
        print("Request timed out. Provider may be slow.")
        raise
    
    except requests.exceptions.ConnectionError:
        print("Network issue. Check your internet.")
        raise
    
    finally:
        session.close()

# Usage
result = call_llm_with_retry(url, headers, data)

Pro Tip: Implement Provider Fallbacks

Don't rely on a single provider. If Groq hits rate limits, automatically fall back to Google AI Studio or OpenRouter, as in the sketch below. A fallback chain keeps your app responding even when one provider is throttled or down.
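A minimal sketch of this fallback pattern, reusing the call_llm_with_retry helper above. The endpoint URLs and model names are illustrative placeholders; check each provider's current documentation before using them:

import os
import requests

# Providers are tried in order. URLs and model names below are examples only.
PROVIDERS = [
    {
        "name": "Groq",
        "url": "https://api.groq.com/openai/v1/chat/completions",
        "key": os.getenv("GROQ_API_KEY"),
        "model": "llama-3.3-70b-versatile",
    },
    {
        "name": "OpenRouter",
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "key": os.getenv("OPENROUTER_API_KEY"),
        "model": "meta-llama/llama-3.3-70b-instruct:free",
    },
]

def call_with_fallback(prompt):
    """Try each provider in turn; move on if one is rate limited or down."""
    for provider in PROVIDERS:
        headers = {"Authorization": f"Bearer {provider['key']}"}
        data = {
            "model": provider["model"],
            "messages": [{"role": "user", "content": prompt}],
        }
        try:
            return call_llm_with_retry(provider["url"], headers, data)
        except requests.exceptions.RequestException as e:
            print(f"{provider['name']} failed ({e}); trying next provider...")
    raise RuntimeError("All providers failed")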

3. Secure Your API Keys

Never Do This:

  • ❌ Hardcode keys in source code
  • ❌ Commit keys to Git repositories
  • ❌ Expose keys in frontend JavaScript
  • ❌ Share keys in public forums/Discord
  • ❌ Use same key across projects

Always Do This:

  • ✓ Store keys in environment variables
  • ✓ Use .env files (add to .gitignore)
  • ✓ Rotate keys regularly
  • ✓ Use backend proxy for frontend apps
  • ✓ Monitor key usage for anomalies

Example .env file:

GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxx
GOOGLE_API_KEY=AIzaSyXXXXXXXXXXXXXXXXXXXXXX
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxxxxxx
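A minimal sketch for loading these keys at runtime. It assumes the python-dotenv package; plain os.environ works just as well if you export the variables in your shell or deployment environment:

import os
from dotenv import load_dotenv  # pip install python-dotenv

# Read key=value pairs from .env into the process environment
load_dotenv()

GROQ_API_KEY = os.environ["GROQ_API_KEY"]      # raises KeyError if missing
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")   # returns None if missing

headers = {"Authorization": f"Bearer {GROQ_API_KEY}"}

This way the keys never appear in your source code or Git history.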

Optimization Techniques

Maximize your free tier usage

1. Implement Response Caching

The same prompt rarely needs a fresh response. Cache aggressively; on repetitive workloads this can cut API calls by 40-70%:

import hashlib
import json
from functools import lru_cache

# Option 1: In-Memory Cache (simple, fast)
@lru_cache(maxsize=1000)
def cached_llm_call(prompt, model="llama-3.3-70b", temp=0.7):
    # Cache based on prompt + parameters
    response = call_api(prompt, model, temp)
    return response

# Option 2: Redis Cache (persistent, scalable)
import redis
r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cached_llm_call_redis(prompt, model, temp, ttl=3600):
    # Create cache key
    key = hashlib.md5(f"{prompt}{model}{temp}".encode()).hexdigest()
    
    # Check cache
    cached = r.get(key)
    if cached:
        print("Cache hit!")
        return json.loads(cached)
    
    # Cache miss - call API
    response = call_api(prompt, model, temp)
    r.setex(key, ttl, json.dumps(response))  # Cache for 1 hour
    return response

When to Cache:

  • ✓ FAQ responses (same questions asked repeatedly)
  • ✓ Product descriptions or content generation
  • ✓ Code snippets for common tasks
  • ✓ Sentiment analysis of static data

2. Optimize Prompt Length

Techniques:

  1. Remove redundancy: Don't repeat instructions. State once clearly.
  2. Truncate context: For chatbots, keep only last 5-10 messages, not entire history.
  3. Use abbreviations: "Summarize in 3 bullets" instead of lengthy explanations.
  4. Limit output: Set max_tokens to prevent overly long responses.

❌ Inefficient (250 tokens):

"I need you to analyze this product review and tell me if the sentiment is positive, negative, or neutral. Please provide a detailed explanation..."

✓ Optimized (50 tokens):

"Sentiment (positive/negative/neutral): [review text]"

3. Batch Processing

Instead of making 100 API calls for 100 items, process multiple items in a single request:

❌ Inefficient:

for review in reviews:
    sentiment = llm(f"Sentiment: {review}")
# 100 API calls = slow + expensive

✓ Optimized:

batch = "\n".join([f"{i}. {r}" for i, r in enumerate(reviews, 1)])
result = llm(f"Sentiment for each:\n{batch}")
# 1 API call = fast!

Tip: Use JSON output format for easy parsing: "Return as JSON: [{"id": 1, "sentiment": "positive"}, ...]"
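A sketch of the batched call with JSON output. The llm function is the same placeholder used above, and real model output sometimes needs cleanup before parsing, so the json.loads call is guarded:

import json

reviews = ["Great product!", "Broke after a week.", "Does the job."]

batch = "\n".join(f"{i}. {r}" for i, r in enumerate(reviews, 1))
prompt = (
    "Classify the sentiment of each numbered review as positive, negative, or neutral. "
    'Return only a JSON array like [{"id": 1, "sentiment": "positive"}].\n' + batch
)

raw = llm(prompt)  # placeholder LLM call, as in the examples above

try:
    results = json.loads(raw)
except json.JSONDecodeError:
    # Models sometimes wrap JSON in extra text or code fences; handle that case here
    print("Could not parse model output:", raw)
    results = []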

4. Rate Limit Management

import time
from collections import deque

class RateLimiter:
    def __init__(self, max_calls_per_minute=30):
        self.max_calls = max_calls_per_minute
        self.calls = deque()
    
    def wait_if_needed(self):
        now = time.time()
        
        # Remove calls older than 1 minute
        while self.calls and self.calls[0] < now - 60:
            self.calls.popleft()
        
        # If we've hit the limit, wait
        if len(self.calls) >= self.max_calls:
            sleep_time = 60 - (now - self.calls[0])
            print(f"Rate limit reached. Waiting {sleep_time:.1f}s...")
            time.sleep(sleep_time)
            self.calls.popleft()
        
        self.calls.append(time.time())

# Usage
limiter = RateLimiter(max_calls_per_minute=30)
for prompt in prompts:
    limiter.wait_if_needed()
    result = call_api(prompt)

Prompt Engineering

Get better outputs with smarter prompts

7 Proven Prompting Techniques

1. Be Specific and Clear

Vague:

"Write about AI"

Specific:

"Write a 500-word article explaining how transformers work for a beginner audience"

2. Use Chain-of-Thought

Add "Let's think step by step" or "Explain your reasoning" for complex tasks:

"Calculate 47 × 23. Think step by step."

✓ Chain-of-thought prompting markedly improves accuracy on multi-step reasoning tasks

3. Provide Examples (Few-Shot)

Extract structured data from text.

Example 1:
Input: "John Doe, [email protected], age 30"
Output: {{"name": "John Doe", "email": "[email protected]", "age": 30}}

Example 2:
Input: "Jane Smith, [email protected], age 25"
Output: {{"name": "Jane Smith", "email": "[email protected]", "age": 25}}

Now extract from: "Bob Wilson, [email protected], age 45"

4. Set Constraints

  • Length: "Answer in 50 words or less"
  • Format: "Return as JSON" or "Use markdown"
  • Tone: "Explain like I'm 5" or "Professional business tone"
  • Structure: "Use bullet points, no paragraphs"

Security Best Practices

Protect your applications and data

Essential Security Checklist

  • Never send passwords, API keys, or credentials to LLMs
  • Sanitize user input to prevent prompt injection (see the sketch after these lists)
  • Use HTTPS for all API requests
  • Implement rate limiting on your backend
  • Monitor API usage for anomalies
  • Use local models for sensitive data (HIPAA, GDPR compliance)

⚠️ Never Do This

  • ❌ Send user passwords to LLMs for "validation"
  • ❌ Include PII (SSN, credit cards) in prompts
  • ❌ Trust LLM output without validation
  • ❌ Execute code generated by LLMs without review
  • ❌ Allow users to control system prompts directly
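A minimal sketch of the input-sanitization and output-validation points above. The patterns, length cap, and allowed labels are illustrative assumptions; filters like these reduce, but do not eliminate, prompt-injection risk:

import re

MAX_INPUT_CHARS = 4000

# Phrases commonly seen in prompt-injection attempts (illustrative, not exhaustive)
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def sanitize_user_input(text):
    """Trim oversized input and reject obvious injection attempts."""
    text = text[:MAX_INPUT_CHARS]
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            raise ValueError("Input rejected: possible prompt injection")
    return text

def validate_llm_output(raw, allowed_labels=frozenset({"positive", "negative", "neutral"})):
    """Never trust model output blindly; check it against what you expect."""
    label = raw.strip().lower()
    if label not in allowed_labels:
        raise ValueError(f"Unexpected model output: {raw!r}")
    return label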

Recommended Tools & Libraries

Python

  • LangChain: LLM framework
  • LiteLLM: Unified API interface
  • Haystack: RAG pipelines
  • tiktoken: Token counting

JavaScript

  • Vercel AI SDK (ai package): Streaming UI and React hooks
  • LangChain.js: JS framework
  • OpenAI SDK: Official client

Testing & Debugging

  • PromptFoo: Prompt testing
  • LangSmith: Observability
  • Weights & Biases: Tracking
  • Helicone: Monitoring