Version: 1.2

12. Retry Engine & Dead Letter Queue

12.1 Delivery state machine

NotificationDeliveryState (Shumoul.Notification.Contracts):

Values: Pending=0, Processing=1, Succeeded=2, RetryScheduled=3, Retrying=4, Failed=5, DeadLetter=6, Cancelled=7, Expired=8.

12.2 Retry strategies

RetryStrategy enum: None=0, Fixed=1, ExponentialBackoff=2, Linear=3, Immediate=4. The strategies are implemented by NoRetryStrategy, FixedRetryStrategy, ExponentialBackoffRetryStrategy, LinearRetryStrategy, and ImmediateRetryStrategy (Shumoul.Notification.Core/RetryStrategies/) and resolved via RetryStrategyProvider.

Strategy	Delay formula
Immediate	`0`
Fixed	`initialDelaySeconds` (constant every attempt)
Linear	`initialDelaySeconds × attemptNumber`
ExponentialBackoff	`initialDelaySeconds × backoffMultiplier^(attemptNumber - 1)`, capped at `maximumDelaySeconds`
None	never retried — first failure goes straight to `Failed`/`DeadLetter`
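
The delay formulas in the table can be sketched as a single function. This is a minimal illustration in Python, not the Shumoul.Notification.Core implementation: the function name, the 1-based attempt numbering, and the default values for the multiplier and the cap are assumptions made for the example.

```python
from enum import IntEnum
from typing import Optional


class RetryStrategy(IntEnum):
    # Numeric values mirror the RetryStrategy enum listed above.
    NONE = 0
    FIXED = 1
    EXPONENTIAL_BACKOFF = 2
    LINEAR = 3
    IMMEDIATE = 4


def retry_delay_seconds(
    strategy: RetryStrategy,
    attempt_number: int,
    initial_delay_seconds: float,
    backoff_multiplier: float = 2.0,        # illustrative default, not a seed value
    maximum_delay_seconds: float = 3600.0,  # illustrative default, not a seed value
) -> Optional[float]:
    """Return the delay before the given attempt, or None when no retry is scheduled."""
    if attempt_number < 1:
        raise ValueError("attempt_number is assumed to be 1-based")
    if strategy == RetryStrategy.NONE:
        return None  # never retried
    if strategy == RetryStrategy.IMMEDIATE:
        return 0.0
    if strategy == RetryStrategy.FIXED:
        return float(initial_delay_seconds)
    if strategy == RetryStrategy.LINEAR:
        return float(initial_delay_seconds * attempt_number)
    if strategy == RetryStrategy.EXPONENTIAL_BACKOFF:
        raw = initial_delay_seconds * backoff_multiplier ** (attempt_number - 1)
        # Only the exponential strategy is documented as capped.
        return float(min(raw, maximum_delay_seconds))
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With an initial delay of 30 seconds and a multiplier of 2, the third exponential attempt waits 30 × 2² = 120 seconds, and later attempts stop growing once they reach the cap.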

The strategy, maxRetries, and the set of retryable NotificationFailureType values are configured per channel via Notification Delivery Policies; see that page's seed-defaults table.

12.3 Failure classification

NotificationFailureType: Unknown=0, Temporary=1, Permanent=2, Timeout=3, RateLimit=4, NetworkFailure=5, AuthenticationFailure=6, InvalidRecipient=7, QuotaExceeded=8. A channel-aware classifier inspects the raw provider error and assigns one of these values. The delivery policy then checks the assigned type against its RetryOn{FailureType} flags to decide retryability. For example, RetryOnPermanentFailure defaults to false on every seeded policy, because a permanently invalid recipient is not worth retrying.
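
The retryability decision described here can be sketched as follows. This is a hedged illustration only: the DeliveryPolicy shape, the should_retry helper, and the example policy values are hypothetical, and the per-type RetryOn flags are modeled as a set of failure types rather than as individual boolean properties.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet


class NotificationFailureType(IntEnum):
    # Numeric values mirror the NotificationFailureType enum listed above.
    UNKNOWN = 0
    TEMPORARY = 1
    PERMANENT = 2
    TIMEOUT = 3
    RATE_LIMIT = 4
    NETWORK_FAILURE = 5
    AUTHENTICATION_FAILURE = 6
    INVALID_RECIPIENT = 7
    QUOTA_EXCEEDED = 8


@dataclass(frozen=True)
class DeliveryPolicy:
    """Hypothetical stand-in for a per-channel delivery policy."""
    max_retries: int
    retry_on: FrozenSet[NotificationFailureType]  # stands in for the RetryOn flags


def should_retry(
    policy: DeliveryPolicy,
    failure_type: NotificationFailureType,
    retries_so_far: int,
) -> bool:
    """A failure is retried only if its type is retryable and retries remain."""
    return failure_type in policy.retry_on and retries_so_far < policy.max_retries


# Example values are illustrative, not the seeded defaults.
example_policy = DeliveryPolicy(
    max_retries=3,
    retry_on=frozenset({
        NotificationFailureType.TEMPORARY,
        NotificationFailureType.TIMEOUT,
        NotificationFailureType.NETWORK_FAILURE,
    }),
)
```

Leaving PERMANENT out of retry_on corresponds to RetryOnPermanentFailure being false: the failure is never rescheduled, regardless of how many retries remain.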

12.4 Atomic claiming

NotificationRetryQueue rows are claimed via an ExecuteUpdateAsync-based atomic claim using a ProcessingToken (Guid?, null = available) — this prevents two concurrent worker instances (e.g. during a rolling deployment) from double-processing the same retry. A row claimed but stuck in Retrying for more than 2 hours is treated as orphaned and repaired (token cleared) by the retry-expire job so it can be reclaimed.
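
The claim-and-repair pattern can be demonstrated with a conditional UPDATE whose affected-row count tells the worker whether it won the claim. The sketch below uses Python's sqlite3 module in place of EF Core's ExecuteUpdateAsync purely for illustration; the ClaimedAtEpoch column and the function names are assumptions, and the real table has more columns (including the delivery state).

```python
import sqlite3
import uuid
from typing import Optional

ORPHAN_AFTER_SECONDS = 2 * 60 * 60  # the documented 2-hour orphan threshold


def create_schema(conn: sqlite3.Connection) -> None:
    # Reduced, hypothetical schema: only the columns needed for claiming.
    conn.execute(
        "CREATE TABLE NotificationRetryQueue ("
        " Id INTEGER PRIMARY KEY,"
        " ProcessingToken TEXT NULL,"   # NULL means the row is available
        " ClaimedAtEpoch REAL NULL)"    # hypothetical claim timestamp
    )


def try_claim(conn: sqlite3.Connection, row_id: int, now_epoch: float) -> Optional[str]:
    """Claim the row if it is unclaimed; return the token on success, else None."""
    token = str(uuid.uuid4())
    cur = conn.execute(
        "UPDATE NotificationRetryQueue"
        " SET ProcessingToken = ?, ClaimedAtEpoch = ?"
        " WHERE Id = ? AND ProcessingToken IS NULL",
        (token, now_epoch, row_id),
    )
    conn.commit()
    # Exactly one affected row means this worker won the claim.
    return token if cur.rowcount == 1 else None


def repair_orphans(conn: sqlite3.Connection, now_epoch: float) -> int:
    """Clear tokens on rows claimed more than 2 hours ago; return the repaired count."""
    cur = conn.execute(
        "UPDATE NotificationRetryQueue"
        " SET ProcessingToken = NULL, ClaimedAtEpoch = NULL"
        " WHERE ProcessingToken IS NOT NULL AND ClaimedAtEpoch < ?",
        (now_epoch - ORPHAN_AFTER_SECONDS,),
    )
    conn.commit()
    return cur.rowcount
```

Because the availability check and the token write happen in one statement, a second worker issuing the same UPDATE matches zero rows and simply skips the retry.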

12.5 Hangfire jobs

Every retry/cleanup job now runs through the platform's Background Jobs Framework — job IDs carry a -pipeline suffix, and each is a thin adapter (I{X}PipelineJob) that still calls the same underlying worker class as before the migration. Cron schedules are unchanged from the original delivery-engine design.

Job ID	Interface	Calls	Schedule
`notification-retry-processor-pipeline`	`INotificationRetryProcessorPipelineJob` → `NotificationRetryProcessorJobAdapter`	`INotificationRetryWorker.ProcessDueRetriesAsync`	`*/2 * * * *` (every 2 min)
`notification-retry-expire-pipeline`	`INotificationRetryExpirePipelineJob`	Expire stuck/stale retries, orphan-token repair	`0 * * * *` (hourly)
`notification-deadletter-cleanup-pipeline`	`INotificationDeadLetterCleanupPipelineJob` → `NotificationDeadLetterCleanupJobAdapter`	`NotificationDeadLetterCleanupWorker.CleanupDeadLettersAsync` + `CleanupStaleRetriesAsync`	`0 2 * * *` (daily, 02:00 UTC)

The legacy job ID retry-failed-notifications (a pre-framework retry sweep) has been de-registered with no replacement — this was an intentional retirement, not an oversight, once the framework's own retry queue took over that responsibility for framework-dispatched notifications. It is unrelated to NotificationHistoryController.RetryFailed (§9.12), which still exists and targets the separate legacy TenantNotificationLog table.

12.6 Dead Letter Queue

NotificationDeadLetter (table NotificationDeadLetters) is the terminal store — see §9.7 for the full field list and API surface.

Status values (NotificationDeadLetterStatus): Pending=0, Requeued=1, Cancelled=2.

Cleanup policy (NotificationRetentionSettings, enforced by notification-deadletter-cleanup-pipeline):

Dead letters older than DeadLetterRetentionDays (default 90) are soft-deleted — except rows still in Pending status, which are retained indefinitely regardless of age (an unresolved failure should never silently disappear from an operator's queue).
Cancelled/expired/succeeded retry-queue rows older than CancelledRetryRetentionDays (default 7) are cleaned up in the same job run.
A third documented setting, for delivery-attempt retention, exists in NotificationRetentionSettings but no worker currently implements cleanup for NotificationDeliveryAttempt rows — this table grows unbounded today. Flag this as an operational planning item, not a bug to silently work around.
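
The dead-letter retention rule can be expressed as a small predicate. This is a sketch under stated assumptions: the function name is hypothetical, and the source does not say which timestamp "older than" is measured from, so the example assumes the row's creation time.

```python
from datetime import datetime, timedelta, timezone
from enum import IntEnum


class NotificationDeadLetterStatus(IntEnum):
    # Numeric values mirror the NotificationDeadLetterStatus enum listed above.
    PENDING = 0
    REQUEUED = 1
    CANCELLED = 2


DEAD_LETTER_RETENTION_DAYS = 90  # documented default


def is_dead_letter_purgeable(
    status: NotificationDeadLetterStatus,
    created_at: datetime,  # assumption: age is measured from creation time
    now: datetime,
    retention_days: int = DEAD_LETTER_RETENTION_DAYS,
) -> bool:
    """Return True when the cleanup job may soft-delete this dead letter."""
    if status == NotificationDeadLetterStatus.PENDING:
        # Pending rows are retained indefinitely, regardless of age.
        return False
    return created_at < now - timedelta(days=retention_days)
```

Checking the status before the age keeps the Pending exemption explicit, so an unresolved failure cannot be purged by a change to the retention window alone.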

12.7 Requeue vs Retry Now — the difference

Action	Where	What it does
`PUT NotificationRetryQueue/RetryNow`	§9.6	Forces an already-scheduled retry to run on the next pipeline tick instead of waiting for its backoff delay
`PUT NotificationDeadLetters/Requeue`	§9.7	Resurrects a terminal dead letter by creating a brand-new `NotificationRetryQueue` row from its preserved `PayloadSnapshot`, with a reset attempt counter
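
The difference between the two actions can be illustrated with an in-memory model. Everything here is hypothetical: the RetryRow and DeadLetter shapes, their field names, and the helper functions are stand-ins for the real entities, and marking the dead letter as Requeued and scheduling the new row for immediate pickup are assumptions rather than documented behavior.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class RetryRow:
    """Hypothetical, reduced view of a NotificationRetryQueue row."""
    payload: str
    attempt_count: int
    next_attempt_at: datetime


@dataclass(frozen=True)
class DeadLetter:
    """Hypothetical, reduced view of a NotificationDeadLetter row."""
    payload_snapshot: str
    status: str  # "Pending", "Requeued", or "Cancelled"


def retry_now(row: RetryRow, now: datetime) -> RetryRow:
    """Retry Now: same row, same attempt counter, due on the next pipeline tick."""
    return replace(row, next_attempt_at=now)


def requeue(dead_letter: DeadLetter, now: datetime) -> Tuple[DeadLetter, RetryRow]:
    """Requeue: build a brand-new retry row from the preserved payload snapshot."""
    new_row = RetryRow(
        payload=dead_letter.payload_snapshot,
        attempt_count=0,       # the attempt counter is reset
        next_attempt_at=now,   # assumption: picked up on the next tick
    )
    # Assumption: the dead letter moves to the Requeued status.
    return replace(dead_letter, status="Requeued"), new_row
```

Retry Now only moves the schedule of an existing row forward, while Requeue starts a fresh retry lifecycle for a notification that had already reached its terminal state.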

12.8 Simulating retry / dead-letter behavior for testing

See Chapter 17 — Testing Guide § Failure simulation for how to force a channel failure in a lower environment and observe the retry queue and dead-letter transitions described in this chapter end to end.
