Best Practices for Working with Free LLM APIs

Master the art of building AI-powered applications with free resources. Learn optimization, security, prompt engineering, and more.

Getting Started

Essential setup and configuration

1. Choose the Right Provider for Your Use Case

Consider These Factors:

  • Rate Limits: How many requests do you need per minute/day?
  • Context Window: Do you need to process long documents?
  • Speed: Is real-time response critical?
  • Privacy: Are you handling sensitive data?

Quick Recommendations:

  • For Speed → Groq (800+ tokens/sec)
  • For Long Contexts → Gemini 2.0 Flash (1M token context window)
  • For Privacy → Ollama (local, 100% offline capable)
  • For Coding → DeepSeek Coder (specialized for code generation)

2. Implement Proper Error Handling

Free APIs can fail (rate limits, downtime, network issues). Always build resilient code:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def call_llm_with_retry(url, headers, data, max_retries=3):
    """Robust API call with exponential backoff"""
    
    # Configure retry strategy
    session = requests.Session()
    retry = Retry(
        total=max_retries,
        backoff_factor=1,  # Wait ~1s, 2s, 4s between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=frozenset({"POST"})  # POST is not retried by default
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('https://', adapter)
    
    try:
        response = session.post(url, headers=headers, json=data, timeout=30)
        response.raise_for_status()
        return response.json()
    
    except requests.exceptions.RetryError:
        print("Retries exhausted (persistent 429s or 5xx). Consider switching providers.")
        raise
    
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            print("Rate limit hit. Consider switching providers.")
        raise
    
    except requests.exceptions.Timeout:
        print("Request timed out. Provider may be slow.")
        raise
    
    except requests.exceptions.ConnectionError:
        print("Network issue. Check your internet.")
        raise
    
    finally:
        session.close()

# Usage
result = call_llm_with_retry(url, headers, data)

Pro Tip: Implement Provider Fallbacks

Don't rely on a single provider. If Groq hits rate limits, automatically fall back to Google AI Studio or OpenRouter, as in the sketch below. A fallback chain keeps your app responding even when one provider is throttled or down.
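A minimal sketch of this fallback pattern, reusing the call_llm_with_retry helper above. The endpoint URLs and model names are illustrative placeholders; check each provider's current documentation before using them:

import os
import requests

# Providers are tried in order. URLs and model names below are examples only.
PROVIDERS = [
    {
        "name": "Groq",
        "url": "https://api.groq.com/openai/v1/chat/completions",
        "key": os.getenv("GROQ_API_KEY"),
        "model": "llama-3.3-70b-versatile",
    },
    {
        "name": "OpenRouter",
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "key": os.getenv("OPENROUTER_API_KEY"),
        "model": "meta-llama/llama-3.3-70b-instruct:free",
    },
]

def call_with_fallback(prompt):
    """Try each provider in turn; move on if one is rate limited or down."""
    for provider in PROVIDERS:
        headers = {"Authorization": f"Bearer {provider['key']}"}
        data = {
            "model": provider["model"],
            "messages": [{"role": "user", "content": prompt}],
        }
        try:
            return call_llm_with_retry(provider["url"], headers, data)
        except requests.exceptions.RequestException as e:
            print(f"{provider['name']} failed ({e}); trying next provider...")
    raise RuntimeError("All providers failed")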

3. Secure Your API Keys

Never Do This:

  • ❌ Hardcode keys in source code
  • ❌ Commit keys to Git repositories
  • ❌ Expose keys in frontend JavaScript
  • ❌ Share keys in public forums/Discord
  • ❌ Use same key across projects

Always Do This:

  • ✓ Store keys in environment variables
  • ✓ Use .env files (add to .gitignore)
  • ✓ Rotate keys regularly
  • ✓ Use backend proxy for frontend apps
  • ✓ Monitor key usage for anomalies

Example .env file:

GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxx
GOOGLE_API_KEY=AIzaSyXXXXXXXXXXXXXXXXXXXXXX
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxxxxxx
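A minimal sketch for loading these keys at runtime. It assumes the python-dotenv package; plain os.environ works just as well if you export the variables in your shell or deployment environment:

import os
from dotenv import load_dotenv  # pip install python-dotenv

# Read key=value pairs from .env into the process environment
load_dotenv()

GROQ_API_KEY = os.environ["GROQ_API_KEY"]      # raises KeyError if missing
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")   # returns None if missing

headers = {"Authorization": f"Bearer {GROQ_API_KEY}"}

This way the keys never appear in your source code or Git history.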

Optimization Techniques

Maximize your free tier usage

1. Implement Response Caching

The same prompt rarely needs a fresh response. Cache aggressively; on repetitive workloads this can cut API calls by 40-70%:

import hashlib
import json
from functools import lru_cache

# Option 1: In-Memory Cache (simple, fast)
@lru_cache(maxsize=1000)
def cached_llm_call(prompt, model="llama-3.3-70b", temp=0.7):
    # Cache based on prompt + parameters
    response = call_api(prompt, model, temp)
    return response

# Option 2: Redis Cache (persistent, scalable)
import redis
r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cached_llm_call_redis(prompt, model, temp, ttl=3600):
    # Create cache key
    key = hashlib.md5(f"{prompt}{model}{temp}".encode()).hexdigest()
    
    # Check cache
    cached = r.get(key)
    if cached:
        print("Cache hit!")
        return json.loads(cached)
    
    # Cache miss - call API
    response = call_api(prompt, model, temp)
    r.setex(key, ttl, json.dumps(response))  # Cache for 1 hour
    return response

When to Cache:

  • ✓ FAQ responses (same questions asked repeatedly)
  • ✓ Product descriptions or content generation
  • ✓ Code snippets for common tasks
  • ✓ Sentiment analysis of static data

2. Optimize Prompt Length

Techniques:

  1. Remove redundancy: Don't repeat instructions. State once clearly.
  2. Truncate context: For chatbots, keep only last 5-10 messages, not entire history.
  3. Use abbreviations: "Summarize in 3 bullets" instead of lengthy explanations.
  4. Limit output: Set max_tokens to prevent overly long responses.

❌ Inefficient (250 tokens):

"I need you to analyze this product review and tell me if the sentiment is positive, negative, or neutral. Please provide a detailed explanation..."

✓ Optimized (50 tokens):

"Sentiment (positive/negative/neutral): [review text]"

3. Batch Processing

Instead of making 100 API calls for 100 items, process multiple items in a single request:

❌ Inefficient:

for review in reviews:
    sentiment = llm(f"Sentiment: {review}")
# 100 API calls = slow + expensive

✓ Optimized:

batch = "\n".join([f"{i}. {r}" for i, r in enumerate(reviews, 1)])
result = llm(f"Sentiment for each:\n{batch}")
# 1 API call = fast!

Tip: Use JSON output format for easy parsing: "Return as JSON: [{"id": 1, "sentiment": "positive"}, ...]"
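A sketch of the batched call with JSON output. The llm function is the same placeholder used above, and real model output sometimes needs cleanup before parsing, so the json.loads call is guarded:

import json

reviews = ["Great product!", "Broke after a week.", "Does the job."]

batch = "\n".join(f"{i}. {r}" for i, r in enumerate(reviews, 1))
prompt = (
    "Classify the sentiment of each numbered review as positive, negative, or neutral. "
    'Return only a JSON array like [{"id": 1, "sentiment": "positive"}].\n' + batch
)

raw = llm(prompt)  # placeholder LLM call, as in the examples above

try:
    results = json.loads(raw)
except json.JSONDecodeError:
    # Models sometimes wrap JSON in extra text or code fences; handle that case here
    print("Could not parse model output:", raw)
    results = []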

4. Rate Limit Management

import time
from collections import deque

class RateLimiter:
    def __init__(self, max_calls_per_minute=30):
        self.max_calls = max_calls_per_minute
        self.calls = deque()
    
    def wait_if_needed(self):
        now = time.time()
        
        # Remove calls older than 1 minute
        while self.calls and self.calls[0] < now - 60:
            self.calls.popleft()
        
        # If we've hit the limit, wait
        if len(self.calls) >= self.max_calls:
            sleep_time = 60 - (now - self.calls[0])
            print(f"Rate limit reached. Waiting {sleep_time:.1f}s...")
            time.sleep(sleep_time)
            self.calls.popleft()
        
        self.calls.append(time.time())

# Usage
limiter = RateLimiter(max_calls_per_minute=30)
for prompt in prompts:
    limiter.wait_if_needed()
    result = call_api(prompt)

Prompt Engineering

Get better outputs with smarter prompts

7 Proven Prompting Techniques

1. Be Specific and Clear

Vague:

"Write about AI"

Specific:

"Write a 500-word article explaining how transformers work for a beginner audience"

2. Use Chain-of-Thought

Add "Let's think step by step" or "Explain your reasoning" for complex tasks:

"Calculate 47 × 23. Think step by step."

✓ Chain-of-thought prompting markedly improves accuracy on multi-step reasoning tasks

3. Provide Examples (Few-Shot)

Extract structured data from text.

Example 1:
Input: "John Doe, [email protected], age 30"
Output: {{"name": "John Doe", "email": "[email protected]", "age": 30}}

Example 2:
Input: "Jane Smith, [email protected], age 25"
Output: {{"name": "Jane Smith", "email": "[email protected]", "age": 25}}

Now extract from: "Bob Wilson, [email protected], age 45"

4. Set Constraints

  • Length: "Answer in 50 words or less"
  • Format: "Return as JSON" or "Use markdown"
  • Tone: "Explain like I'm 5" or "Professional business tone"
  • Structure: "Use bullet points, no paragraphs"

Security Best Practices

Protect your applications and data

Essential Security Checklist

  • Never send passwords, API keys, or credentials to LLMs
  • Sanitize user input to prevent prompt injection (see the sketch after these lists)
  • Use HTTPS for all API requests
  • Implement rate limiting on your backend
  • Monitor API usage for anomalies
  • Use local models for sensitive data (HIPAA, GDPR compliance)

⚠️ Never Do This

  • ❌ Send user passwords to LLMs for "validation"
  • ❌ Include PII (SSN, credit cards) in prompts
  • ❌ Trust LLM output without validation
  • ❌ Execute code generated by LLMs without review
  • ❌ Allow users to control system prompts directly
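A minimal sketch of the input-sanitization and output-validation points above. The patterns, length cap, and allowed labels are illustrative assumptions; filters like these reduce, but do not eliminate, prompt-injection risk:

import re

MAX_INPUT_CHARS = 4000

# Phrases commonly seen in prompt-injection attempts (illustrative, not exhaustive)
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def sanitize_user_input(text):
    """Trim oversized input and reject obvious injection attempts."""
    text = text[:MAX_INPUT_CHARS]
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            raise ValueError("Input rejected: possible prompt injection")
    return text

def validate_llm_output(raw, allowed_labels=frozenset({"positive", "negative", "neutral"})):
    """Never trust model output blindly; check it against what you expect."""
    label = raw.strip().lower()
    if label not in allowed_labels:
        raise ValueError(f"Unexpected model output: {raw!r}")
    return label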

Recommended Tools & Libraries

Python

  • LangChain: LLM framework
  • LiteLLM: Unified API interface
  • Haystack: RAG pipelines
  • tiktoken: Token counting

JavaScript

  • Vercel AI SDK (ai package): Streaming UI and React hooks
  • LangChain.js: JS framework
  • OpenAI SDK: Official client

Testing & Debugging

  • PromptFoo: Prompt testing
  • LangSmith: Observability
  • Weights & Biases: Tracking
  • Helicone: Monitoring