Flash API endpoints let you build HTTP APIs with FastAPI that run on Runpod Serverless workers. Use them to deploy production APIs that need GPU or CPU acceleration.
Unlike standalone scripts that run once and return results, a Flash API endpoint runs as a persistent service that handles incoming HTTP requests. Each request is processed by a Serverless worker using the same remote functions you’d use in a standalone script.
Flash API endpoints are currently available for local testing only. Run flash run to start the API server on your local machine. Production deployment support is coming in future updates.
Step 1: Initialize a new project
Use the flash init command to generate a structured project template with a preconfigured FastAPI application entry point.
Run this command to initialize a new project directory:
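# Replace my_project with the name you want for your project directory
flash init my_project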
You can also initialize your current directory:
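flash init .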
Step 2: Explore the project template
This is the structure of the project template created by flash init:
my_project/
├── main.py                  # FastAPI application entry point
├── workers/
│   ├── gpu/                 # GPU worker example
│   │   ├── __init__.py      # FastAPI router
│   │   └── endpoint.py      # GPU script with @remote decorated function
│   └── cpu/                 # CPU worker example
│       ├── __init__.py      # FastAPI router
│       └── endpoint.py      # CPU script with @remote decorated function
├── .env                     # Environment variable template
├── .gitignore               # Git ignore patterns
├── .flashignore             # Flash deployment ignore patterns
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation
This template includes:
- A FastAPI application entry point and routers.
- Templates for Python dependencies, .env, .gitignore, etc.
- Flash scripts (endpoint.py) for both GPU and CPU workers, which include:
  - Pre-configured worker scaling limits using the LiveServerless() object.
  - A @remote decorated function that returns a response from a worker.
When you start the FastAPI server, it creates API endpoints at /gpu/hello and /cpu/hello, which call the remote functions defined in their respective endpoint.py files.
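For reference, the GPU worker's endpoint.py follows this general shape (a simplified sketch; the exact names and configuration values in the generated template may differ):
from tetra_rp import remote, LiveServerless

# Worker scaling limits are preconfigured on the LiveServerless object
gpu_config = LiveServerless(
    name="example-gpu-worker",  # illustrative name; the template sets its own
    workersMax=1,
)

@remote(resource_config=gpu_config)
def gpu_hello(message: str) -> dict:
    # Runs on a Runpod Serverless GPU worker and returns a simple response
    return {"worker": "gpu", "echo": message}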
Step 3: Install Python dependencies
After initializing the project, navigate into the project directory:
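cd my_project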
Install required dependencies:
pip install -r requirements.txt
Step 4: Add your Runpod API key
Open the .env template file in a text editor:
# Use your text editor of choice, e.g.
cursor .env
Remove the # symbol from the beginning of the RUNPOD_API_KEY line and replace your_api_key_here with your actual Runpod API key:
RUNPOD_API_KEY=your_api_key_here
# FLASH_HOST=localhost
# FLASH_PORT=8888
# LOG_LEVEL=INFO
Save the file and close it.
Step 5: Start the local API server
Use flash run to start the API server:
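flash run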
Open a new terminal tab or window and test your GPU API using cURL:
curl -X POST http://localhost:8888/gpu/hello \
-H "Content-Type: application/json" \
-d '{"message": "Hello from the GPU!"}'
If you switch back to the terminal tab where you used flash run, you’ll see the details of the job’s progress.
Faster testing with auto-provisioning
For development with multiple endpoints, use --auto-provision to deploy all resources before testing:
flash run --auto-provision
This eliminates cold-start delays by provisioning all serverless endpoints upfront. Endpoints are cached and reused across server restarts, making subsequent runs faster. Resources are identified by name, so the same endpoint won’t be re-deployed if the configuration hasn’t changed.
Step 6: Open the API explorer
Besides starting the API server, flash run also starts an interactive API explorer. Point your web browser at http://localhost:8888/docs to explore the API.
To run remote functions in the explorer:
- Expand one of the functions under GPU Workers or CPU Workers.
- Click Try it out and then Execute.
You’ll get a response from your workers right in the explorer.
Step 7: Customize your API
To customize your API endpoint and functionality:
- Add or edit remote functions in your endpoint.py files.
- Test the scripts individually by running python endpoint.py (see the sketch below).
- Configure your FastAPI routers by editing the __init__.py files.
- Add any new endpoints to your main.py file.
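For example, a minimal test harness at the bottom of an endpoint.py file might look like this (a sketch that assumes a remote function named gpu_hello, as in the earlier example; adapt it to your own function names):
if __name__ == "__main__":
    import asyncio

    # Calls to @remote functions are awaitable, so run them in an event loop
    result = asyncio.run(gpu_hello("Hello from the GPU!"))
    print(result)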
Example: Adding a custom endpoint
To add a new GPU endpoint for image generation:
- Create a new file at workers/gpu/image_gen.py:
from tetra_rp import remote, LiveServerless, GpuGroup

config = LiveServerless(
    name="image-generator",
    gpus=[GpuGroup.AMPERE_24],
    workersMax=2
)

@remote(
    resource_config=config,
    dependencies=["diffusers", "torch", "transformers"]
)
def generate_image(prompt: str, width: int = 512, height: int = 512):
    import torch
    from diffusers import StableDiffusionPipeline
    import base64
    import io

    pipeline = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    image = pipeline(prompt=prompt, width=width, height=height).images[0]

    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()

    return {"image": img_str, "prompt": prompt}
- Add a route in workers/gpu/__init__.py:
from fastapi import APIRouter
from .image_gen import generate_image

router = APIRouter()

@router.post("/generate")
async def generate(prompt: str, width: int = 512, height: int = 512):
    result = await generate_image(prompt, width, height)
    return result
- Include the router in main.py if not already included.
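If the GPU router isn't already registered, the include in main.py looks roughly like this (a sketch; the generated main.py may already register it under the /gpu prefix):
from fastapi import FastAPI
from workers.gpu import router as gpu_router

app = FastAPI()
app.include_router(gpu_router, prefix="/gpu", tags=["GPU Workers"])
With the server running, you can then call the new route. Because the handler declares simple typed parameters, FastAPI exposes them as query parameters:
curl -X POST "http://localhost:8888/gpu/generate?prompt=a%20sunset%20over%20the%20ocean"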
Load-balanced endpoints
For API endpoints requiring low-latency HTTP access with direct routing, use load-balanced endpoints:
from tetra_rp import LiveLoadBalancer, remote

api = LiveLoadBalancer(name="api-service")

@remote(api, method="POST", path="/api/process")
async def process_data(x: int, y: int):
    return {"result": x + y}

@remote(api, method="GET", path="/api/health")
def health_check():
    return {"status": "ok"}

# Call functions directly (from within async code)
result = await process_data(5, 3)  # → {"result": 8}
Key differences from queue-based endpoints:
- Direct HTTP routing: Requests routed directly to workers, no queue.
- Lower latency: No queuing overhead.
- Custom HTTP methods: GET, POST, PUT, DELETE, PATCH support.
- No automatic retries: Users handle errors directly.
Load-balanced endpoints are ideal for REST APIs, webhooks, and real-time services. Queue-based endpoints are better for batch processing and fault-tolerant workflows.
Next steps