Reliability and data durability

The recovery commitments Kraty makes — RPO, RTO, backup retention, restore drills, and what happens to your data if you leave.

The promises on this page are contractual. They cover what happens when something goes wrong (infrastructure, data, your account) and how your data behaves over time.

This page focuses on what you, the integrating studio, can rely on: recovery targets, backup retention, and the commitments Kraty holds itself to during an incident.

Recovery targets

Metric	Commitment	What it means for you
RPO (recovery point)	15 minutes	Worst-case data loss in a recovery scenario. Anything written more than 15 min before the incident survives.
RTO (recovery time)	4 hours	Worst-case time to full recovery from a Postgres-level incident. The API may stay degraded for about 5 minutes afterwards while leaderboards rebuild from a cold cache, but writes return to normal.
Auto-failover	API instances	If a single API instance dies, the load balancer routes around it with no observable downtime.

Smaller incidents (instance crash, region brownout) recover in seconds via the load balancer. The 4-hour figure is the upper bound for the worst routine scenarios, such as logical Postgres corruption or a bad-migration rollback.
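
As a worked example of the RPO figure, the sketch below checks whether a write is old enough to fall outside the 15-minute window. The function name and the timestamps are hypothetical and are not part of any Kraty API. A result of False means only that survival is not guaranteed, not that the write is lost.

```python
from datetime import datetime, timedelta, timezone

# RPO from the recovery targets table: 15 minutes.
RPO = timedelta(minutes=15)


def guaranteed_to_survive(written_at: datetime, incident_at: datetime) -> bool:
    """Return True if the write happened more than one RPO before the incident.

    Hypothetical helper for illustration only.
    """
    return incident_at - written_at > RPO


incident = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(guaranteed_to_survive(incident - timedelta(minutes=20), incident))  # True
print(guaranteed_to_survive(incident - timedelta(minutes=5), incident))   # False
```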

Backups

Continuous Postgres backups with point-in-time recovery (PITR) at 5-second granularity over a 14-day rolling window. The database can be restored to any moment in the last two weeks.
Daily logical snapshots retained for 1 year in a separate storage bucket. This is the last-line-of-defense backup, used for incidents older than 14 days and as the artifact behind any full data-export request.
Audit-archive bucket is versioned in place — historical audit records can never be silently overwritten.
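
To make the two retention windows concrete, here is a minimal sketch that picks the backup tier covering a given restore target. The function and tier names are hypothetical, and "1 year" is approximated as 365 days. Restores are performed by Kraty, so this is only a way to reason about coverage.

```python
from datetime import datetime, timedelta, timezone

PITR_WINDOW = timedelta(days=14)          # continuous backups, 14-day rolling window
SNAPSHOT_RETENTION = timedelta(days=365)  # daily snapshots, "1 year" assumed to be 365 days


def backup_tier(target: datetime, now: datetime) -> str:
    """Return which backup tier could cover a restore to `target` (hypothetical helper)."""
    age = now - target
    if age < timedelta(0):
        raise ValueError("restore target is in the future")
    if age <= PITR_WINDOW:
        return "pitr"            # restorable at 5-second granularity
    if age <= SNAPSHOT_RETENTION:
        return "daily-snapshot"  # restorable only to a daily snapshot
    return "none"                # older than every retained backup


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(backup_tier(now - timedelta(days=3), now))   # pitr
print(backup_tier(now - timedelta(days=30), now))  # daily-snapshot
```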

Production data lives in TBD-at-deploy. Backups live in a separate region.

Restore drills

Kraty runs a quarterly restore drill: a production backup is restored to a fresh database in an isolated environment, and the full integration test suite runs against it. The drill validates the RTO commitment and catches restore-path breakage before a real incident does.

Pre-GA studios can request the most recent drill date.

Webhooks during incidents

If Kraty is degraded but operational, webhook deliveries continue through the normal retry schedule (30s, 2m, 10m, 1h, 6h, 24h; up to six attempts). Your receivers see a brief silence followed by catch-up traffic when delivery resumes.
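
The retry schedule above can be turned into cumulative offsets to estimate how long a receiver outage the retries would span. This sketch assumes each delay is the wait between consecutive attempts, starting from the first failed delivery. The source does not state this, so treat the totals as illustrative and check the Webhooks page for the authoritative behavior.

```python
from datetime import timedelta
from itertools import accumulate

# Delays from the retry schedule: 30s, 2m, 10m, 1h, 6h, 24h.
# Assumption: each value is the wait before the next attempt.
RETRY_DELAYS = [
    timedelta(seconds=30),
    timedelta(minutes=2),
    timedelta(minutes=10),
    timedelta(hours=1),
    timedelta(hours=6),
    timedelta(hours=24),
]


def retry_offsets(delays=RETRY_DELAYS):
    """Cumulative time after the first failed delivery at which each retry fires."""
    return list(accumulate(delays))


offsets = retry_offsets()
print(offsets[-1].total_seconds())  # 112350.0
```

Under that assumption the last retry fires about 31 hours and 12 minutes after the first failure, so a receiver that is unreachable for longer than that may miss the delivery.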

If Kraty is fully down, in-flight webhook deliveries pause and resume from where they left off when service returns — none are dropped. See Webhooks → Retries and circuit breaker.

Status communication

For any incident that affects service availability or causes data loss, Kraty will:

Update the status page (TBD-at-deploy) within 10 minutes of detection.
Email designated contacts on your studio's billing account when the incident is declared and again when resolved.
Publish a post-mortem within 5 business days for any customer-impacting incident lasting longer than 30 minutes.

Player data export

You can take your data out of Kraty at any time.

Per-player export is live today: GET /server/v1/players/:externalId/export returns a player's complete record — profile, attempts, grants, inventory, wallet, lobbies. See the GDPR recipe.
Full studio export (a single dump of every table scoped to your studio) is on the roadmap and lands before GA. Until then, request one through your studio admin contact and Kraty will generate it.
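
A minimal sketch of calling the per-player export endpoint follows. Only the path GET /server/v1/players/:externalId/export comes from this page. The base URL, the Bearer-style Authorization header, and the JSON response body are assumptions made for illustration, so check the Authentication and REST API conventions pages for the real scheme.

```python
import json
import urllib.request
from urllib.parse import quote


def export_url(base_url: str, external_id: str) -> str:
    """Build the export URL, percent-encoding the player's external ID."""
    encoded_id = quote(external_id, safe="")
    return f"{base_url.rstrip('/')}/server/v1/players/{encoded_id}/export"


def fetch_player_export(base_url: str, external_id: str, api_key: str) -> dict:
    """Fetch one player's export (hypothetical auth header and JSON body)."""
    request = urllib.request.Request(
        export_url(base_url, external_id),
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)  # assumes the endpoint returns JSON


# The host below is a placeholder, not a real Kraty hostname.
print(export_url("https://kraty.example/", "player/42"))
# https://kraty.example/server/v1/players/player%2F42/export
```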

Where to find more

Per-endpoint behavior: REST API conventions.
Webhook retry + circuit breaker: Webhooks.
Authentication recovery (lost player secrets, key rotation): Authentication.