Trade-offs & Challenges

The Hard Problems

Resumability isn't a silver bullet. Understanding its limitations is essential for building robust systems.

Challenge 1: What Can't Be Serialized

Some state is inherently ephemeral or external

External API State

Database connections, open file handles, active WebSocket connections, OAuth tokens mid-refresh.

Database cursors · HTTP sessions · File locks

Time-Sensitive Information

Stock prices, inventory counts, weather data, real-time analytics that change between checkpoint and restore.

Prices ($45.00 → $47.50) · Inventory (5 → 0) · User online status

User Session Changes

Between sessions the user may log out, gain or lose permissions, or have their account state change.

Permission revoked · Account suspended · Profile updated

Model Version Differences

When restoring to a different model version, learned behaviors or capabilities may differ.

Claude 2 → 3 · GPT-3.5 → 4 · Fine-tuned vs base

Mitigation Strategy

Checkpoints should include metadata about external dependencies so the restoration process can validate or refresh them:

typescript
interface CheckpointDependencies {
  externalAPIs: Array<{
    name: string;
    lastVerified: number;
    refreshStrategy: 'revalidate' | 'invalidate' | 'ignore';
  }>;

  timeSensitiveData: Array<{
    key: string;
    value: unknown;
    capturedAt: number;
    staleDuration: number;  // After this many ms, mark as stale
  }>;

  sessionRequirements: {
    userId: string;
    requiredPermissions: string[];
    validUntil?: number;
  };
}
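
A minimal sketch of how a restore step might consume this metadata. The `checkDependencies` helper, the `DependencyReport` shape, and the `grantedPermissions` argument are hypothetical names introduced for illustration; the interface is repeated so the block stands alone.

```typescript
interface CheckpointDependencies {
  externalAPIs: Array<{
    name: string;
    lastVerified: number;
    refreshStrategy: 'revalidate' | 'invalidate' | 'ignore';
  }>;
  timeSensitiveData: Array<{
    key: string;
    value: unknown;
    capturedAt: number;
    staleDuration: number;
  }>;
  sessionRequirements: {
    userId: string;
    requiredPermissions: string[];
    validUntil?: number;
  };
}

// Hypothetical report shape: what the restore process should act on.
interface DependencyReport {
  staleKeys: string[];          // time-sensitive entries past their staleDuration
  apisToRevalidate: string[];   // APIs flagged 'revalidate'
  apisToInvalidate: string[];   // APIs flagged 'invalidate'
  missingPermissions: string[]; // required but not currently granted
  sessionExpired: boolean;      // validUntil is set and has passed
}

function checkDependencies(
  deps: CheckpointDependencies,
  grantedPermissions: string[],
  now: number = Date.now()
): DependencyReport {
  const granted = new Set(grantedPermissions);
  const { requiredPermissions, validUntil } = deps.sessionRequirements;

  return {
    staleKeys: deps.timeSensitiveData
      .filter((d) => now - d.capturedAt > d.staleDuration)
      .map((d) => d.key),
    apisToRevalidate: deps.externalAPIs
      .filter((a) => a.refreshStrategy === 'revalidate')
      .map((a) => a.name),
    apisToInvalidate: deps.externalAPIs
      .filter((a) => a.refreshStrategy === 'invalidate')
      .map((a) => a.name),
    missingPermissions: requiredPermissions.filter((p) => !granted.has(p)),
    sessionExpired: validUntil !== undefined && now > validUntil,
  };
}
```

A restore step could refuse to resume when `sessionExpired` is true or `missingPermissions` is non-empty, and re-fetch every key listed in `staleKeys` before trusting it.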

Challenge 2: Validation & Staleness

When is a checkpoint still valid? How do we know?

Staleness Factors

Time Decay

Confidence decreases as checkpoint ages

confidence = originalConfidence × e^(-λt)

After 24 hours, a 0.95 confidence fact might be 0.7
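
The decay formula can be sketched directly. The function name `decayedConfidence` and the choice of hours as the time unit are assumptions; the rate constant is derived from the 0.95 → 0.7 example above rather than taken from any specification.

```typescript
// confidence = originalConfidence × e^(-λt), with t in hours (assumed unit).
function decayedConfidence(
  originalConfidence: number,
  ageHours: number,
  lambdaPerHour: number
): number {
  return originalConfidence * Math.exp(-lambdaPerHour * ageHours);
}

// λ implied by "0.95 becomes 0.7 after 24 hours":
// 0.7 = 0.95 × e^(-24λ)  =>  λ = ln(0.95 / 0.7) / 24 ≈ 0.0127 per hour
const impliedLambda = Math.log(0.95 / 0.7) / 24;

const after24h = decayedConfidence(0.95, 24, impliedLambda); // 0.7 up to rounding
```

With that rate a fact loses half its confidence in roughly 54 hours (ln 2 / λ), so λ is the knob that sets how quickly a domain's facts go stale.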

External Events

Known world changes invalidate specific beliefs

if (event.affects(belief)) belief.confidence = 0

API deprecation announcement invalidates integration plans

Dependency Chains

If a parent fact becomes stale, children may too

child.maxConfidence = min(ancestors.confidence)

If 'user is admin' is stale, 'user can delete' is too
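
The event and dependency rules above can be combined in a small sketch. The `Belief` shape, the `parents` field, and both function names are hypothetical; treating an unknown parent as zero confidence is also an assumption.

```typescript
interface Belief {
  id: string;
  confidence: number;
  parents: string[]; // ids of the beliefs this one depends on
}

// External event: every affected belief drops to zero confidence.
function invalidate(beliefs: Map<string, Belief>, affectedIds: string[]): void {
  for (const id of affectedIds) {
    const belief = beliefs.get(id);
    if (belief) belief.confidence = 0;
  }
}

// Dependency chain: a belief is never more trustworthy than its weakest
// ancestor, so take the minimum over the belief and everything above it.
function effectiveConfidence(id: string, beliefs: Map<string, Belief>): number {
  let result = Infinity;
  const stack = [id];
  const seen = new Set<string>(); // guards against cycles and repeated visits

  while (stack.length > 0) {
    const current = stack.pop()!;
    if (seen.has(current)) continue;
    seen.add(current);

    const belief = beliefs.get(current);
    if (!belief) return 0; // unknown dependency: assume no confidence
    result = Math.min(result, belief.confidence);
    stack.push(...belief.parents);
  }
  return result;
}
```

For example, if 'user is admin' sits at 0.2 and 'user can delete' at 0.9 with 'user is admin' as its parent, `effectiveConfidence` reports 0.2 for the child, and 0 once an event invalidates the parent.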

Full Restore Strategy

When the checkpoint is recent and the context hasn't changed.

  • Load entire checkpoint as-is
  • Trust all facts and beliefs
  • Resume from exact point

Partial Restore Strategy

When some facts are stale but structure is valid.

  • Load structure and relationships
  • Re-verify time-sensitive facts
  • Mark stale beliefs for re-evaluation
validation.ts
function validateCheckpoint(
  checkpoint: CognitiveCheckpoint,
  currentContext: Context
): ValidationResult {
  const warnings: string[] = [];
  const invalidations: string[] = [];

  // Check time-based staleness
  const ageHours = (Date.now() - checkpoint.timestamp) / (1000 * 60 * 60);
  if (ageHours > 24) {
    warnings.push(`Checkpoint is ${ageHours.toFixed(1)}h old`);
  }

  // Check external dependencies
  for (const fact of checkpoint.epistemicState.facts) {
    if (fact.source.startsWith('api:')) {
      const apiName = fact.source.slice(4);
      if (!currentContext.availableAPIs.includes(apiName)) {
        invalidations.push(`API ${apiName} no longer available`);
      }
    }
  }

  // Check user context
  if (checkpoint.metadata?.userId !== currentContext.userId) {
    invalidations.push('Different user context');
  }

  return {
    valid: invalidations.length === 0,
    warnings,
    invalidations,
    suggestedStrategy: invalidations.length > 0
      ? 'partial'
      : warnings.length > 0
        ? 'verify'
        : 'full',
  };
}
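
One way the `suggestedStrategy` value could drive a restore is sketched below. The `Fact`, `RestorePlan`, and `planRestore` names are hypothetical, and the meaning given to 'verify' (keep using every fact while externally sourced ones are re-checked) is an assumption, since the text above only defines the full and partial strategies.

```typescript
type RestoreStrategy = 'full' | 'verify' | 'partial';

interface Fact {
  id: string;
  source: string; // e.g. 'api:pricing' or 'user'
}

interface RestorePlan {
  usable: Fact[];   // facts the agent may rely on immediately
  reverify: Fact[]; // facts queued for a fresh check
}

function planRestore(facts: Fact[], strategy: RestoreStrategy): RestorePlan {
  const isExternal = (f: Fact) => f.source.startsWith('api:');
  const external = facts.filter(isExternal);
  const internal = facts.filter((f) => !isExternal(f));

  switch (strategy) {
    case 'full':
      // Recent checkpoint, unchanged context: trust everything.
      return { usable: [...facts], reverify: [] };
    case 'verify':
      // Assumed semantics: keep working, but re-check external facts.
      return { usable: [...facts], reverify: external };
    case 'partial':
      // Withhold external facts until they have been re-verified.
      return { usable: internal, reverify: external };
  }
}
```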

Challenge 3: Security & Privacy

Checkpoints contain sensitive cognitive state

Sensitive Data in State

  • PII from conversations
  • API keys observed during tool use
  • Proprietary business logic
  • Medical/legal information

Access Control Challenges

  • Who can restore a checkpoint?
  • Can checkpoints be shared?
  • Cross-organization transfer
  • Audit trail requirements

Compliance Considerations

  • GDPR right to deletion
  • HIPAA data handling
  • SOC2 access controls
  • Data residency requirements

Security Best Practices

Encryption
  • Encrypt checkpoints at rest (AES-256)
  • Encrypt in transit (TLS 1.3)
  • Per-user encryption keys
  • Key rotation policies
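
As a rough illustration of encryption at rest, the sketch below uses Node.js's built-in `crypto` module with AES-256-GCM. The `EncryptedBlob` shape and both function names are hypothetical, and key handling (per-user keys, rotation, a KMS lookup) is left out: the caller is assumed to supply a 32-byte key.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Hypothetical storage shape: everything needed to decrypt except the key.
interface EncryptedBlob {
  iv: string;         // base64, 12 random bytes, unique per encryption
  authTag: string;    // base64, GCM authentication tag
  ciphertext: string; // base64
}

function encryptCheckpoint(plaintext: string, key: Buffer): EncryptedBlob {
  const iv = randomBytes(12); // 96-bit IV, the recommended size for GCM
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plaintext, 'utf8'),
    cipher.final(),
  ]);
  return {
    iv: iv.toString('base64'),
    authTag: cipher.getAuthTag().toString('base64'),
    ciphertext: ciphertext.toString('base64'),
  };
}

function decryptCheckpoint(blob: EncryptedBlob, key: Buffer): string {
  const decipher = createDecipheriv(
    'aes-256-gcm',
    key,
    Buffer.from(blob.iv, 'base64')
  );
  decipher.setAuthTag(Buffer.from(blob.authTag, 'base64'));
  // final() throws if the ciphertext or tag was tampered with.
  return Buffer.concat([
    decipher.update(Buffer.from(blob.ciphertext, 'base64')),
    decipher.final(),
  ]).toString('utf8');
}
```

GCM is an authenticated mode, so a modified checkpoint fails to decrypt instead of silently restoring corrupted state.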
Access Control
  • Owner-only access by default
  • Explicit sharing with consent
  • Time-limited access tokens
  • Audit logging of all access
secure-checkpoint.ts
interface SecureCheckpoint extends CognitiveCheckpoint {
  security: {
    encryptedAt: number;
    algorithm: 'AES-256-GCM';
    keyId: string;  // Reference to key in KMS

    accessControl: {
      owner: string;           // User ID
      allowedUsers: string[];  // Explicit shares
      expiresAt?: number;      // Auto-delete time
    };

    redaction: {
      piiRemoved: boolean;
      sensitiveFieldsHashed: string[];
      redactionPolicy: string;
    };

    compliance: {
      dataClassification: 'public' | 'internal' | 'confidential' | 'restricted';
      retentionPolicy: string;
      auditLogId: string;
    };
  };
}

async function storeSecureCheckpoint(
  checkpoint: CognitiveCheckpoint,
  options: SecurityOptions
): Promise<SecureCheckpoint> {
  // 1. Redact sensitive information
  const redacted = await redactPII(checkpoint, options.redactionPolicy);

  // 2. Encrypt the content
  const { ciphertext, keyId } = await encrypt(
    JSON.stringify(redacted),
    options.keyId
  );

  // 3. Assemble the record with access controls
  const secure: SecureCheckpoint = {
    // Spread the redacted copy, not the original, so unredacted fields
    // never sit in plaintext next to the ciphertext
    ...redacted,
    // Replace content with encrypted version
    nodes: ciphertext as any,
    security: {
      encryptedAt: Date.now(),
      algorithm: 'AES-256-GCM',
      keyId,
      accessControl: {
        owner: options.userId,
        allowedUsers: [],
        expiresAt: options.ttl ? Date.now() + options.ttl : undefined,
      },
      redaction: {
        piiRemoved: true,
        sensitiveFieldsHashed: options.hashFields ?? [],
        redactionPolicy: options.redactionPolicy,
      },
      compliance: options.compliance,
    },
  };

  // 4. Log to audit trail
  await auditLog.record({
    action: 'checkpoint_created',
    checkpointId: checkpoint.id,
    userId: options.userId,
    timestamp: Date.now(),
  });

  return secure;
}
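
The read side of the `accessControl` block might look like the sketch below. The `canRestore` function is a hypothetical helper, and treating an expired checkpoint as inaccessible even to its owner is an assumption drawn from the "auto-delete time" comment.

```typescript
interface AccessControl {
  owner: string;          // User ID
  allowedUsers: string[]; // Explicit shares
  expiresAt?: number;     // Auto-delete time (ms since epoch)
}

function canRestore(
  accessControl: AccessControl,
  userId: string,
  now: number = Date.now()
): boolean {
  // Expired checkpoints are treated as already deleted (assumption).
  if (accessControl.expiresAt !== undefined && now >= accessControl.expiresAt) {
    return false;
  }
  // Owner-only by default; anyone else needs an explicit share.
  return userId === accessControl.owner || accessControl.allowedUsers.includes(userId);
}
```

A restore endpoint would call this before decrypting anything, and write the outcome (granted or denied) to the audit trail.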

Challenge 4: Versioning & Schema Evolution

Checkpoints need to survive schema changes

Version Compatibility Matrix

Change Type        | Backward Compatible | Migration Strategy
Add optional field | Yes                 | Use default value when missing
Add required field | No                  | Derive from existing data or mark invalid
Remove field       | Yes                 | Ignore unknown fields on read
Rename field       | No                  | Migration function to map old → new
Change field type  | No                  | Version-aware deserialization
Add new node type  | Yes                 | Treat unknown types as generic

Schema Version Tracking

typescript
interface VersionedCheckpoint {
  schemaVersion: number;  // Increment on breaking changes
  minReaderVersion: number;  // Minimum compatible reader

  migrations: Array<{
    fromVersion: number;
    toVersion: number;
    applied: boolean;
  }>;
}

Migration Registry

typescript
const migrations: Migration[] = [
  {
    from: 1, to: 2,
    migrate: (cp) => ({
      ...cp,
      // v2 split 'metadata' into separate fields
      metrics: cp.metadata?.metrics,
      tags: cp.metadata?.tags,
    }),
  },
];
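
A registry like this needs something to walk it. The sketch below chains migrations until the target version is reached; the simplified `Checkpoint` and `Migration` types and the `migrateTo` function are assumptions made so the block stands alone.

```typescript
interface Checkpoint {
  schemaVersion: number;
  [key: string]: unknown;
}

interface Migration {
  from: number;
  to: number;
  migrate: (cp: Checkpoint) => Checkpoint;
}

const registry: Migration[] = [
  {
    from: 1,
    to: 2,
    // v2 split 'metadata' into separate fields
    migrate: (cp) => {
      const metadata = (cp['metadata'] ?? {}) as { metrics?: unknown; tags?: unknown };
      return { ...cp, metrics: metadata.metrics, tags: metadata.tags };
    },
  },
];

function migrateTo(cp: Checkpoint, targetVersion: number, migrations: Migration[]): Checkpoint {
  let current = cp;
  while (current.schemaVersion < targetVersion) {
    const version = current.schemaVersion;
    const step = migrations.find((m) => m.from === version);
    if (!step) {
      throw new Error(`No migration registered from schema v${version}`);
    }
    if (step.to <= version) {
      throw new Error(`Migration from v${version} does not advance the version`);
    }
    // Stamp the new version here so individual migrations cannot forget to.
    current = { ...step.migrate(current), schemaVersion: step.to };
  }
  return current;
}
```

Failing loudly on a missing step is deliberate: a checkpoint that cannot be brought up to the reader's version is safer rejected than half-migrated.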

Git-Inspired Versioning

Just like git manages code history, cognitive state can be version-controlled:

  1. Commits = Checkpoints with parent references
  2. Branches = Parallel reasoning paths
  3. Merge = Combining branch insights
  4. Diff = Incremental checkpoints
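
To make the analogy concrete, here is a minimal sketch of checkpoints as commits with parent references, plus the lookup a merge would start from: the nearest checkpoint two branches share. The `CheckpointCommit` shape and both function names are hypothetical.

```typescript
interface CheckpointCommit {
  id: string;
  parentIds: string[]; // one parent normally, two or more after a merge
}

// Collect a checkpoint and everything reachable through parent references.
function ancestorsOf(id: string, store: Map<string, CheckpointCommit>): Set<string> {
  const seen = new Set<string>();
  const queue = [id];
  while (queue.length > 0) {
    const current = queue.shift()!;
    if (seen.has(current)) continue;
    seen.add(current);
    queue.push(...(store.get(current)?.parentIds ?? []));
  }
  return seen;
}

// Breadth-first walk back from b until reaching something in a's history.
function mergeBase(
  a: string,
  b: string,
  store: Map<string, CheckpointCommit>
): string | undefined {
  const historyOfA = ancestorsOf(a, store);
  const seen = new Set<string>();
  const queue = [b];
  while (queue.length > 0) {
    const current = queue.shift()!;
    if (historyOfA.has(current)) return current;
    if (seen.has(current)) continue;
    seen.add(current);
    queue.push(...(store.get(current)?.parentIds ?? []));
  }
  return undefined; // the two checkpoints share no history
}
```

With the merge base in hand, each branch's insights can be expressed as a diff against that shared checkpoint, which is also what makes incremental checkpoints cheap to store.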

Summary: When to Use Resumability

Good Fit

  • ✓ Long-running conversations (hours/days)
  • ✓ Complex multi-step reasoning tasks
  • ✓ Frequent resume/pause patterns
  • ✓ Branching exploration needed
  • ✓ Cost-sensitive applications

Poor Fit

  • ✗ Short, stateless Q&A
  • ✗ Highly time-sensitive data required
  • ✗ Context changes completely between sessions
  • ✗ Strict compliance regimes that prohibit storing state
  • ✗ Simple tasks where overhead isn't worth it