Version: 1.2

12. Retry Engine & Dead Letter Queue

12.1 Delivery state machine

NotificationDeliveryState (Shumoul.Notification.Contracts):

Values: Pending=0, Processing=1, Succeeded=2, RetryScheduled=3, Retrying=4, Failed=5, DeadLetter=6, Cancelled=7, Expired=8.

12.2 Retry strategies

RetryStrategy enum: None=0, Fixed=1, ExponentialBackoff=2, Linear=3, Immediate=4. Each value is implemented by a matching class (NoRetryStrategy, FixedRetryStrategy, ExponentialBackoffRetryStrategy, LinearRetryStrategy, ImmediateRetryStrategy) under Shumoul.Notification.Core/RetryStrategies/ and is resolved via RetryStrategyProvider.

| Strategy | Delay formula |
| --- | --- |
| Immediate | 0 |
| Fixed | initialDelaySeconds (constant every attempt) |
| Linear | initialDelaySeconds × attemptNumber |
| ExponentialBackoff | initialDelaySeconds × backoffMultiplier^(attemptNumber - 1), capped at maximumDelaySeconds |
| None | never retried; first failure goes straight to Failed/DeadLetter |
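
The delay formulas in the table above can be sketched in a few lines. This is an illustrative, language-neutral sketch in Python, not the code in Shumoul.Notification.Core; the function name, the 1-based attempt numbering, and the default values for backoff_multiplier and maximum_delay_seconds are assumptions made for the example.

```python
from enum import IntEnum
from typing import Optional


class RetryStrategy(IntEnum):
    # Numeric values mirror the RetryStrategy enum listed in this section.
    NONE = 0
    FIXED = 1
    EXPONENTIAL_BACKOFF = 2
    LINEAR = 3
    IMMEDIATE = 4


def retry_delay_seconds(
    strategy: RetryStrategy,
    attempt_number: int,
    initial_delay_seconds: float,
    backoff_multiplier: float = 2.0,        # assumed default, not from the source
    maximum_delay_seconds: float = 3600.0,  # assumed default, not from the source
) -> Optional[float]:
    """Return the delay before a retry, or None when the strategy never retries."""
    if attempt_number < 1:
        raise ValueError("attempt_number is assumed to be 1-based")
    if strategy == RetryStrategy.NONE:
        return None
    if strategy == RetryStrategy.IMMEDIATE:
        return 0.0
    if strategy == RetryStrategy.FIXED:
        return float(initial_delay_seconds)
    if strategy == RetryStrategy.LINEAR:
        return float(initial_delay_seconds * attempt_number)
    if strategy == RetryStrategy.EXPONENTIAL_BACKOFF:
        raw = initial_delay_seconds * backoff_multiplier ** (attempt_number - 1)
        # The cap applies only to the exponential strategy, as in the table.
        return float(min(raw, maximum_delay_seconds))
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With initialDelaySeconds = 30 and backoffMultiplier = 2, attempts 1 through 4 under ExponentialBackoff wait 30, 60, 120, and 240 seconds; under Linear the same attempts wait 30, 60, 90, and 120 seconds.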

The strategy, maxRetries, and the set of retryable NotificationFailureType values are configured per channel via Notification Delivery Policies; see that page's seed-defaults table.

12.3 Failure classification

NotificationFailureType: Unknown=0, Temporary=1, Permanent=2, Timeout=3, RateLimit=4, NetworkFailure=5, AuthenticationFailure=6, InvalidRecipient=7, QuotaExceeded=8. A channel-aware classifier inspects the raw provider error and assigns one of these values. The delivery policy then checks the assigned type against its RetryOn{FailureType} flags to decide whether the failure is retryable. For example, RetryOnPermanentFailure defaults to false on every seeded policy, because a permanently invalid recipient is not worth retrying.
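
The retryability check can be pictured as a lookup of the classified failure type in the policy's flags, combined with the attempt budget. The sketch below is a hypothetical Python illustration: the flag values in EXAMPLE_POLICY are invented for the example (only RetryOnPermanentFailure = false comes from this page), and treating a missing flag as "not retryable" is an assumption.

```python
from enum import IntEnum
from typing import Mapping


class NotificationFailureType(IntEnum):
    # Numeric values mirror the NotificationFailureType enum above.
    UNKNOWN = 0
    TEMPORARY = 1
    PERMANENT = 2
    TIMEOUT = 3
    RATE_LIMIT = 4
    NETWORK_FAILURE = 5
    AUTHENTICATION_FAILURE = 6
    INVALID_RECIPIENT = 7
    QUOTA_EXCEEDED = 8


# Hypothetical stand-in for a policy's RetryOn{FailureType} flags.
# Only the PERMANENT entry reflects a documented default.
EXAMPLE_POLICY: Mapping[NotificationFailureType, bool] = {
    NotificationFailureType.TEMPORARY: True,
    NotificationFailureType.TIMEOUT: True,
    NotificationFailureType.NETWORK_FAILURE: True,
    NotificationFailureType.PERMANENT: False,
}


def should_retry(
    failure_type: NotificationFailureType,
    retry_flags: Mapping[NotificationFailureType, bool],
    attempts_made: int,
    max_retries: int,
) -> bool:
    """Retry only if the attempt budget remains and the policy allows this type."""
    if attempts_made >= max_retries:
        return False
    # Assumption: a failure type with no flag is treated as non-retryable.
    return retry_flags.get(failure_type, False)
```

Keeping classification (what went wrong) separate from policy (whether that is worth retrying) lets each channel tune its flags without touching the classifier.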

12.4 Atomic claiming

NotificationRetryQueue rows are claimed via an ExecuteUpdateAsync-based atomic claim using a ProcessingToken (Guid?, null = available) — this prevents two concurrent worker instances (e.g. during a rolling deployment) from double-processing the same retry. A row claimed but stuck in Retrying for more than 2 hours is treated as orphaned and repaired (token cleared) by the retry-expire job so it can be reclaimed.
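
The claim described above is a conditional update: a worker writes its token only if the row's token is still null, and the affected-row count tells it whether it won. The following is a minimal sketch of that pattern using Python's sqlite3 module rather than the platform's ExecuteUpdateAsync code. The ClaimedAt column, the function names, and the use of epoch seconds are assumptions for illustration; the real job also considers the Retrying state, which this sketch omits.

```python
import sqlite3
import uuid
from typing import Optional

ORPHAN_AFTER_SECONDS = 2 * 60 * 60  # the 2-hour orphan threshold from the text


def create_queue(conn: sqlite3.Connection) -> None:
    # Reduced, hypothetical schema: only the columns the claim needs.
    conn.execute(
        "CREATE TABLE NotificationRetryQueue ("
        " Id INTEGER PRIMARY KEY,"
        " ProcessingToken TEXT NULL,"
        " ClaimedAt REAL NULL)"
    )


def try_claim(conn: sqlite3.Connection, row_id: int, now: float) -> Optional[str]:
    """Claim the row if it is unclaimed; return the token, or None if we lost."""
    token = str(uuid.uuid4())
    cur = conn.execute(
        "UPDATE NotificationRetryQueue "
        "SET ProcessingToken = ?, ClaimedAt = ? "
        "WHERE Id = ? AND ProcessingToken IS NULL",
        (token, now, row_id),
    )
    conn.commit()
    # Exactly one affected row means this worker won the claim.
    return token if cur.rowcount == 1 else None


def repair_orphans(conn: sqlite3.Connection, now: float) -> int:
    """Clear tokens on rows claimed longer ago than the orphan threshold."""
    cur = conn.execute(
        "UPDATE NotificationRetryQueue "
        "SET ProcessingToken = NULL, ClaimedAt = NULL "
        "WHERE ProcessingToken IS NOT NULL AND ClaimedAt < ?",
        (now - ORPHAN_AFTER_SECONDS,),
    )
    conn.commit()
    return cur.rowcount
```

Because the null check and the token write happen in a single UPDATE statement, two workers racing for the same row cannot both see it as available.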

12.5 Hangfire jobs

Every retry/cleanup job now runs through the platform's Background Jobs Framework — job IDs carry a -pipeline suffix, and each is a thin adapter (I{X}PipelineJob) that still calls the same underlying worker class as before the migration. Cron schedules are unchanged from the original delivery-engine design.

| Job ID | Interface | Calls | Schedule |
| --- | --- | --- | --- |
| notification-retry-processor-pipeline | INotificationRetryProcessorPipelineJob | NotificationRetryProcessorJobAdapter → INotificationRetryWorker.ProcessDueRetriesAsync | */2 * * * * (every 2 min) |
| notification-retry-expire-pipeline | INotificationRetryExpirePipelineJob | Expire stuck/stale retries, orphan-token repair | 0 * * * * (hourly) |
| notification-deadletter-cleanup-pipeline | INotificationDeadLetterCleanupPipelineJob | NotificationDeadLetterCleanupJobAdapter → NotificationDeadLetterCleanupWorker.CleanupDeadLettersAsync + CleanupStaleRetriesAsync | 0 2 * * * (daily, 02:00 UTC) |

The legacy job ID retry-failed-notifications (a pre-framework retry sweep) has been de-registered with no replacement — this was an intentional retirement, not an oversight, once the framework's own retry queue took over that responsibility for framework-dispatched notifications. It is unrelated to NotificationHistoryController.RetryFailed (§9.12), which still exists and targets the separate legacy TenantNotificationLog table.

12.6 Dead Letter Queue

NotificationDeadLetter (table NotificationDeadLetters) is the terminal store — see §9.7 for the full field list and API surface.

Status values (NotificationDeadLetterStatus): Pending=0, Requeued=1, Cancelled=2.

Cleanup policy (NotificationRetentionSettings, enforced by notification-deadletter-cleanup-pipeline):

  • Dead letters older than DeadLetterRetentionDays (default 90) are soft-deleted — except rows still in Pending status, which are retained indefinitely regardless of age (an unresolved failure should never silently disappear from an operator's queue).
  • Cancelled/expired/succeeded retry-queue rows older than CancelledRetryRetentionDays (default 7) are cleaned up in the same job run.
  • A third documented setting, for delivery-attempt retention, exists in NotificationRetentionSettings but no worker currently implements cleanup for NotificationDeliveryAttempt rows — this table grows unbounded today. Flag this as an operational planning item, not a bug to silently work around.
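
The first two retention rules above reduce to simple predicates. The sketch below is a hypothetical Python illustration, not the cleanup worker itself: the function names are invented, and measuring age from a creation or completion timestamp is an assumption, since this page does not say which timestamp the worker compares.

```python
from datetime import datetime, timedelta, timezone
from enum import IntEnum


class NotificationDeadLetterStatus(IntEnum):
    # Numeric values mirror the status enum in this section.
    PENDING = 0
    REQUEUED = 1
    CANCELLED = 2


def is_dead_letter_purgeable(
    status: NotificationDeadLetterStatus,
    created_at: datetime,  # assumed reference timestamp
    now: datetime,
    retention_days: int = 90,  # DeadLetterRetentionDays default
) -> bool:
    """Pending rows are never purged; others are purged once past retention."""
    if status == NotificationDeadLetterStatus.PENDING:
        return False
    return created_at < now - timedelta(days=retention_days)


def is_retry_row_purgeable(
    state: str,
    finished_at: datetime,  # assumed reference timestamp
    now: datetime,
    retention_days: int = 7,  # CancelledRetryRetentionDays default
) -> bool:
    """Only cancelled, expired, or succeeded retry rows are cleaned up."""
    terminal = {"Cancelled", "Expired", "Succeeded"}
    return state in terminal and finished_at < now - timedelta(days=retention_days)
```

For example, a Requeued dead letter that is 91 days old qualifies for soft deletion, while a Pending one of the same age does not.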

12.7 Requeue vs Retry Now — the difference

| Action | Where | What it does |
| --- | --- | --- |
| PUT NotificationRetryQueue/RetryNow | §9.6 | Forces an already-scheduled retry to run on the next pipeline tick instead of waiting for its backoff delay |
| PUT NotificationDeadLetters/Requeue | §9.7 | Resurrects a terminal dead letter by creating a brand-new NotificationRetryQueue row from its preserved PayloadSnapshot, with a reset attempt counter |
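
The difference between the two actions is easiest to see in terms of which row changes. The sketch below is a hypothetical in-memory model in Python, not the controller code from §9.6 or §9.7: the dataclass fields, the function names, marking the dead letter as Requeued, and making the new row due immediately are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import count


@dataclass
class RetryQueueRow:
    id: int
    payload: str
    attempt_count: int
    next_attempt_at: datetime


@dataclass
class DeadLetter:
    id: int
    payload_snapshot: str
    status: str = "Pending"


_next_row_id = count(1)  # stand-in for a database identity column


def retry_now(row: RetryQueueRow, now: datetime) -> RetryQueueRow:
    """Retry Now: same row, same attempt counter, only the due time moves."""
    row.next_attempt_at = now
    return row


def requeue(dead_letter: DeadLetter, now: datetime) -> RetryQueueRow:
    """Requeue: a brand-new row built from the snapshot, attempt counter reset."""
    dead_letter.status = "Requeued"  # assumed status transition
    return RetryQueueRow(
        id=next(_next_row_id),
        payload=dead_letter.payload_snapshot,
        attempt_count=0,
        next_attempt_at=now,  # assumed: due on the next pipeline tick
    )
```

Retry Now mutates an existing scheduled row and keeps its history, whereas Requeue leaves the dead letter in place and starts a fresh retry lifecycle from the preserved payload.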

12.8 Simulating retry / dead-letter behavior for testing

See Chapter 17 — Testing Guide § Failure simulation for how to force a channel failure in a lower environment and observe the retry queue and dead-letter transitions described in this chapter end to end.