12. Retry Engine & Dead Letter Queue
12.1 Delivery state machine
NotificationDeliveryState (Shumoul.Notification.Contracts):
Values: Pending=0, Processing=1, Succeeded=2, RetryScheduled=3, Retrying=4, Failed=5, DeadLetter=6, Cancelled=7, Expired=8.
12.2 Retry strategies
RetryStrategy enum: None=0, Fixed=1, ExponentialBackoff=2, Linear=3, Immediate=4, implemented by
ImmediateRetryStrategy, FixedRetryStrategy, LinearRetryStrategy, ExponentialBackoffRetryStrategy,
NoRetryStrategy (Shumoul.Notification.Core/RetryStrategies/), resolved via RetryStrategyProvider.
| Strategy | Delay formula |
|---|---|
| Immediate | 0 |
| Fixed | initialDelaySeconds (constant every attempt) |
| Linear | initialDelaySeconds × attemptNumber |
| ExponentialBackoff | initialDelaySeconds × backoffMultiplier^(attemptNumber - 1), capped at maximumDelaySeconds |
| None | never retried — first failure goes straight to Failed/DeadLetter |
The strategy, maxRetries, and the set of NotificationFailureType values that are retryable at all are
configured per channel via Notification Delivery Policies; see the seed-defaults table on that page.
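The delay formulas in the table above can be sketched as a single function. This is a language-neutral Python illustration, not the C# strategy classes themselves. It assumes attemptNumber is 1-based (so the first exponential retry waits exactly initialDelaySeconds), and the default values for backoff_multiplier and maximum_delay_seconds are placeholders rather than seeded defaults.

```python
from enum import IntEnum
from typing import Optional


class RetryStrategy(IntEnum):
    # Numeric values as listed in section 12.2.
    NONE = 0
    FIXED = 1
    EXPONENTIAL_BACKOFF = 2
    LINEAR = 3
    IMMEDIATE = 4


def compute_delay_seconds(
    strategy: RetryStrategy,
    attempt_number: int,
    initial_delay_seconds: float,
    backoff_multiplier: float = 2.0,        # placeholder default (assumption)
    maximum_delay_seconds: float = 3600.0,  # placeholder default (assumption)
) -> Optional[float]:
    """Return the delay before the given attempt, or None when no retry is scheduled."""
    if attempt_number < 1:
        raise ValueError("attempt_number is assumed to be 1-based")
    if strategy is RetryStrategy.NONE:
        return None  # never retried
    if strategy is RetryStrategy.IMMEDIATE:
        return 0.0
    if strategy is RetryStrategy.FIXED:
        return float(initial_delay_seconds)
    if strategy is RetryStrategy.LINEAR:
        return float(initial_delay_seconds * attempt_number)
    # ExponentialBackoff: only this strategy is capped in the table.
    raw = initial_delay_seconds * backoff_multiplier ** (attempt_number - 1)
    return float(min(raw, maximum_delay_seconds))
```

Under these assumptions, a 30-second initial delay with a multiplier of 2 yields 30, 60, and 120 seconds for attempts 1 to 3, and the cap takes over once the raw value exceeds maximumDelaySeconds.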
12.3 Failure classification
NotificationFailureType: Unknown=0, Temporary=1, Permanent=2, Timeout=3, RateLimit=4, NetworkFailure=5, AuthenticationFailure=6, InvalidRecipient=7, QuotaExceeded=8. A channel-aware classifier inspects the raw
provider error and assigns one of these, which the delivery policy then checks against its
RetryOn{FailureType} flags to decide retryability (e.g. RetryOnPermanentFailure defaults false on every
seeded policy — a permanently-invalid recipient is not worth retrying).
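As a rough illustration of the retryability check described above, the following Python sketch models a policy as a map from failure type to a retry flag. The DeliveryPolicy shape, the should_retry helper, the max_retries comparison, and the choice to treat a missing flag as "not retryable" are all assumptions made for the example; only the enum values and the Permanent=false default come from this chapter.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict


class NotificationFailureType(IntEnum):
    # Numeric values as listed in section 12.3.
    UNKNOWN = 0
    TEMPORARY = 1
    PERMANENT = 2
    TIMEOUT = 3
    RATE_LIMIT = 4
    NETWORK_FAILURE = 5
    AUTHENTICATION_FAILURE = 6
    INVALID_RECIPIENT = 7
    QUOTA_EXCEEDED = 8


@dataclass
class DeliveryPolicy:
    # Hypothetical stand-in for a per-channel delivery policy.
    max_retries: int
    retry_on: Dict[NotificationFailureType, bool] = field(default_factory=dict)


def should_retry(
    policy: DeliveryPolicy,
    failure_type: NotificationFailureType,
    attempts_so_far: int,
) -> bool:
    """Return True when the classified failure may be retried under the policy."""
    if attempts_so_far >= policy.max_retries:
        return False  # retry budget exhausted (assumed semantics)
    # A failure type without an explicit flag is treated as non-retryable here.
    return policy.retry_on.get(failure_type, False)


# Illustrative policy; only the Permanent=False entry reflects a documented default.
example_policy = DeliveryPolicy(
    max_retries=3,
    retry_on={
        NotificationFailureType.TEMPORARY: True,
        NotificationFailureType.PERMANENT: False,
    },
)
```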
12.4 Atomic claiming
NotificationRetryQueue rows are claimed via an ExecuteUpdateAsync-based atomic claim using a
ProcessingToken (Guid?, null = available) — this prevents two concurrent worker instances (e.g. during a
rolling deployment) from double-processing the same retry. A row claimed but stuck in Retrying for more
than 2 hours is treated as orphaned and repaired (token cleared) by the retry-expire job so it can be
reclaimed.
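The claim is essentially a compare-and-set: one conditional UPDATE that succeeds only while ProcessingToken is still null. The sketch below shows that pattern in Python against an in-memory SQLite table, not the EF Core ExecuteUpdateAsync call itself. The ClaimedAt column, the function names, and the use of epoch seconds are assumptions for the example, and the check on the Retrying state is left out for brevity.

```python
import sqlite3
import uuid
from typing import Optional

ORPHAN_AFTER_SECONDS = 2 * 60 * 60  # the 2-hour threshold from section 12.4


def create_schema(conn: sqlite3.Connection) -> None:
    # Minimal stand-in for the NotificationRetryQueue table.
    conn.execute(
        "CREATE TABLE NotificationRetryQueue ("
        " Id INTEGER PRIMARY KEY,"
        " ProcessingToken TEXT NULL,"
        " ClaimedAt INTEGER NULL)"  # hypothetical column, epoch seconds
    )


def try_claim(conn: sqlite3.Connection, row_id: int, now_epoch: int) -> Optional[str]:
    """Claim the row if it is available; return the token, or None if another worker won."""
    token = str(uuid.uuid4())
    cur = conn.execute(
        "UPDATE NotificationRetryQueue SET ProcessingToken = ?, ClaimedAt = ? "
        "WHERE Id = ? AND ProcessingToken IS NULL",
        (token, now_epoch, row_id),
    )
    conn.commit()
    # Exactly one affected row means this worker now owns the claim.
    return token if cur.rowcount == 1 else None


def repair_orphans(conn: sqlite3.Connection, now_epoch: int) -> int:
    """Clear tokens held for more than two hours so the rows can be reclaimed."""
    cur = conn.execute(
        "UPDATE NotificationRetryQueue SET ProcessingToken = NULL, ClaimedAt = NULL "
        "WHERE ProcessingToken IS NOT NULL AND ClaimedAt < ?",
        (now_epoch - ORPHAN_AFTER_SECONDS,),
    )
    conn.commit()
    return cur.rowcount
```

Because the availability check and the token write happen in one statement, two workers racing for the same row cannot both see an affected-row count of 1.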
12.5 Hangfire jobs
Every retry/cleanup job now runs through the platform's Background Jobs Framework — job IDs carry a
-pipeline suffix, and each is a thin adapter (I{X}PipelineJob) that still calls the same underlying
worker class as before the migration. Cron schedules are unchanged from the original delivery-engine design.
| Job ID | Interface | Calls | Schedule |
|---|---|---|---|
| notification-retry-processor-pipeline | INotificationRetryProcessorPipelineJob → NotificationRetryProcessorJobAdapter | INotificationRetryWorker.ProcessDueRetriesAsync | */2 * * * * (every 2 min) |
| notification-retry-expire-pipeline | INotificationRetryExpirePipelineJob | Expire stuck/stale retries, orphan-token repair | 0 * * * * (hourly) |
| notification-deadletter-cleanup-pipeline | INotificationDeadLetterCleanupPipelineJob → NotificationDeadLetterCleanupJobAdapter | NotificationDeadLetterCleanupWorker.CleanupDeadLettersAsync + CleanupStaleRetriesAsync | 0 2 * * * (daily, 02:00 UTC) |
The legacy job ID retry-failed-notifications (a pre-framework retry sweep) has been de-registered with no
replacement — this was an intentional retirement, not an oversight, once the framework's own retry queue
took over that responsibility for framework-dispatched notifications. It is unrelated to
NotificationHistoryController.RetryFailed (§9.12),
which still exists and targets the separate legacy TenantNotificationLog table.
12.6 Dead Letter Queue
NotificationDeadLetter (table NotificationDeadLetters) is the terminal store — see
§9.7 for the full field list and API surface.
Status values (NotificationDeadLetterStatus): Pending=0, Requeued=1, Cancelled=2.
Cleanup policy (NotificationRetentionSettings, enforced by notification-deadletter-cleanup-pipeline):
- Dead letters older than DeadLetterRetentionDays (default 90) are soft-deleted, except rows still in Pending status, which are retained indefinitely regardless of age (an unresolved failure should never silently disappear from an operator's queue).
- Cancelled/expired/succeeded retry-queue rows older than CancelledRetryRetentionDays (default 7) are cleaned up in the same job run.
- A third documented setting, for delivery-attempt retention, exists in NotificationRetentionSettings, but no worker currently implements cleanup for NotificationDeliveryAttempt rows, so this table grows unbounded today. Flag this as an operational planning item, not a bug to silently work around.
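The dead-letter rule in the first bullet can be expressed as a small predicate. This is a Python sketch under stated assumptions: the RetentionSettings dataclass and the created_at field used to measure age are hypothetical names, since this chapter does not say which timestamp the worker compares against.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class NotificationDeadLetterStatus(IntEnum):
    # Numeric values as listed in section 12.6.
    PENDING = 0
    REQUEUED = 1
    CANCELLED = 2


@dataclass
class RetentionSettings:
    # Defaults taken from the cleanup policy above.
    dead_letter_retention_days: int = 90
    cancelled_retry_retention_days: int = 7


def should_soft_delete_dead_letter(
    status: NotificationDeadLetterStatus,
    created_at: datetime,  # hypothetical age field
    now: datetime,
    settings: RetentionSettings,
) -> bool:
    """Return True when a dead letter is eligible for soft deletion."""
    if status is NotificationDeadLetterStatus.PENDING:
        return False  # unresolved failures are kept regardless of age
    cutoff = now - timedelta(days=settings.dead_letter_retention_days)
    return created_at < cutoff
```

The Pending check comes first so that no age comparison can ever remove an unresolved dead letter.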
12.7 Requeue vs Retry Now — the difference
| Action | Where | What it does |
|---|---|---|
| PUT NotificationRetryQueue/RetryNow | §9.6 | Forces an already-scheduled retry to run on the next pipeline tick instead of waiting for its backoff delay |
| PUT NotificationDeadLetters/Requeue | §9.7 | Resurrects a terminal dead letter by creating a brand-new NotificationRetryQueue row from its preserved PayloadSnapshot, with a reset attempt counter |
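To make the contrast concrete, here is a minimal Python sketch of the two actions on simplified in-memory records. The RetryQueueRow and DeadLetter shapes, the field names other than the payload snapshot, the choice to make a requeued row due immediately, and the status change to Requeued are assumptions for illustration, not the actual endpoint behavior.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class RetryQueueRow:
    # Simplified stand-in for a NotificationRetryQueue row.
    payload: str
    attempt_count: int
    next_attempt_at: datetime


@dataclass
class DeadLetter:
    # Simplified stand-in for a NotificationDeadLetter row.
    payload_snapshot: str
    status: str = "Pending"


def retry_now(row: RetryQueueRow, now: datetime) -> RetryQueueRow:
    """Retry Now: same row, same attempt counter, only the due time moves forward."""
    row.next_attempt_at = now
    return row


def requeue(dead_letter: DeadLetter, queue: List[RetryQueueRow], now: datetime) -> RetryQueueRow:
    """Requeue: a brand-new row built from the snapshot, with a reset attempt counter."""
    new_row = RetryQueueRow(
        payload=dead_letter.payload_snapshot,
        attempt_count=0,
        next_attempt_at=now,  # assumed to be due immediately
    )
    queue.append(new_row)
    dead_letter.status = "Requeued"  # assumed status transition
    return new_row
```

The key difference is identity: retry_now mutates an existing scheduled row, while requeue leaves the dead letter in place and adds a fresh row to the queue.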
12.8 Simulating retry / dead-letter behavior for testing
See Chapter 17 — Testing Guide § Failure simulation for how to force a channel failure in a lower environment and observe the retry queue and dead-letter transitions described in this chapter end to end.