9. Distributed Lock

9.1 The contract

public interface IBackgroundJobDistributedLock
{
    Task<IBackgroundJobLockHandle?> TryAcquireAsync(
        BackgroundJobLockKey key, TimeSpan timeout, CancellationToken ct = default);
}

public interface IBackgroundJobLockHandle : IAsyncDisposable
{
    BackgroundJobLockKey Key { get; }
    Task ReleaseAsync(CancellationToken ct = default);
}

BackgroundJobLockKey is a sealed record, (string MethodName, Guid? TenantId), with a computed Resource property of the form "backgroundjob:{MethodName}[:{TenantId}]". The bracketed tenant suffix is present only when TenantId is set.
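The key shape described above can be sketched roughly as follows. This is an illustration rather than the framework's actual source: it assumes the tenant suffix is simply omitted when TenantId is null, and that the Guid is rendered in its default hyphenated format.

```csharp
using System;

// Sketch only: mirrors the shape described above; the real type may differ.
public sealed record BackgroundJobLockKey(string MethodName, Guid? TenantId)
{
    // Assumption: the ":{TenantId}" suffix is dropped when TenantId is null.
    public string Resource => TenantId is { } tenantId
        ? $"backgroundjob:{MethodName}:{tenantId}"
        : $"backgroundjob:{MethodName}";
}
```

Under these assumptions, a key without a tenant and a key with a tenant produce different resource strings, so a tenant-scoped lock never contends with the global lock for the same method.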

9.2 Architecture — native Hangfire distributed lock, not a custom implementation

BackgroundJobDistributedLockAdapter (in the Adapters package) implements IBackgroundJobDistributedLock using Hangfire's own storage-level distributed lock:

storage.GetConnection().AcquireDistributedLock(key.Resource, timeout)

storage is the same JobStorage instance Hangfire itself uses (backed by SQL Server in this platform's deployment). The lock is therefore a real, storage-backed distributed lock shared with every other Hangfire client pointed at the same storage, not an in-process or custom implementation. If acquisition fails with a DistributedLockTimeoutException, the adapter returns null rather than throwing, so the caller decides what "could not acquire the lock" means for its own job.

The internal HangfireLockHandle wraps the Hangfire lock's underlying IDisposable and IStorageConnection, using Interlocked.Exchange to make double-dispose safe and idempotent.
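The double-dispose guard mentioned above can be illustrated with a minimal sketch. The class name IdempotentLockHandle is hypothetical, and both wrapped objects are typed as plain IDisposable so the example needs no Hangfire reference; the real HangfireLockHandle may be structured differently.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for HangfireLockHandle; only the release pattern is shown.
public sealed class IdempotentLockHandle : IAsyncDisposable
{
    private IDisposable? _lock;        // the storage-level lock
    private IDisposable? _connection;  // the connection that holds it

    public IdempotentLockHandle(IDisposable @lock, IDisposable connection)
    {
        _lock = @lock;
        _connection = connection;
    }

    public Task ReleaseAsync(CancellationToken ct = default)
    {
        // Exchange swaps each field to null atomically, so only the first
        // caller sees a non-null value and disposes it. Later calls are no-ops.
        Interlocked.Exchange(ref _lock, null)?.Dispose();
        Interlocked.Exchange(ref _connection, null)?.Dispose();
        return Task.CompletedTask;
    }

    public ValueTask DisposeAsync() => new ValueTask(ReleaseAsync());
}
```

The lock is released before the connection is disposed, and calling ReleaseAsync followed by DisposeAsync (or either one twice) disposes each underlying object at most once.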

9.3 Lock ownership

A lock is scoped by BackgroundJobLockKey.Resource — a string combining a method name and an optional tenant ID. Whoever successfully calls TryAcquireAsync and receives a non-null handle owns that resource string until it calls ReleaseAsync (or disposes the handle). There is no lock renewal/heartbeat concept exposed at this layer — Hangfire's own distributed lock implementation handles timeout semantics internally.
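A caller-side sketch of this ownership model might look like the following. The contract types are repeated from §9.1 (with the key reduced to its two fields) so the example compiles on its own; LockedRunner, RunExclusiveAsync, and the 5-second wait are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Contract types repeated from §9.1 so the sketch compiles on its own.
public sealed record BackgroundJobLockKey(string MethodName, Guid? TenantId);

public interface IBackgroundJobLockHandle : IAsyncDisposable
{
    BackgroundJobLockKey Key { get; }
    Task ReleaseAsync(CancellationToken ct = default);
}

public interface IBackgroundJobDistributedLock
{
    Task<IBackgroundJobLockHandle?> TryAcquireAsync(
        BackgroundJobLockKey key, TimeSpan timeout, CancellationToken ct = default);
}

// Hypothetical caller: runs `work` only if the lock was acquired.
public static class LockedRunner
{
    public static async Task<bool> RunExclusiveAsync(
        IBackgroundJobDistributedLock locks,
        BackgroundJobLockKey key,
        Func<CancellationToken, Task> work,
        CancellationToken ct = default)
    {
        // The 5-second wait is an arbitrary illustrative value.
        IBackgroundJobLockHandle? handle =
            await locks.TryAcquireAsync(key, TimeSpan.FromSeconds(5), ct);

        if (handle is null)
        {
            // Someone else owns the resource; this sketch simply skips the run.
            return false;
        }

        await using (handle)
        {
            await work(ct); // ownership lasts until the handle is disposed
        }
        return true;
    }
}
```

Skipping the run on a null handle is only one possible policy. Because the adapter returns null instead of throwing, each job can choose whether to skip, reschedule, or fail.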

9.4 Why business-level locking is not handled here

This is one of the framework's most important deliberate design decisions. It was worked out concretely during the highest-risk migration in the framework's history (campaign-workflow-executor; see §14.2).

The problem that could have been "solved" with a distributed lock: two concurrent instances of the same recurring job could both pick up the same pending business row (e.g. the same NotificationCampaignExecution) and both process it — for a job with real-world side effects like sending notifications, that means duplicate, irreversible deliveries.

Why a job-level distributed lock was evaluated and explicitly rejected as the fix: a lock scoped to "this job is running" would be far too coarse. It would serialize every execution of the job against every other execution, even when they touch entirely unrelated rows — for example, blocking a run that's processing tenant A's campaign from starting while a run processing tenant B's completely unrelated campaign is still finishing. This punishes throughput system-wide to prevent a problem that only actually occurs when two runs touch the same specific row.

The actual fix: per-row atomic claiming, owned by whichever domain framework owns the data — ICampaignExecutionRepository.ClaimAsync, an atomic SQL UPDATE ... WHERE Status = Pending AND WorkerToken IS NULL, implemented inside the Notification Framework. Only the row actually being contested is protected; every other row's processing proceeds unimpeded.
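The per-row claim can be sketched with plain ADO.NET abstractions. This is not the Notification Framework's actual ClaimAsync: the table name, column names, status values, and the TryClaimAsync helper are all assumptions made for illustration, and the connection is assumed to be open already.

```csharp
using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a per-row claim; table, column, and status names are assumed.
public static class CampaignExecutionClaim
{
    public const string ClaimSql = @"
UPDATE NotificationCampaignExecution
SET    Status = 'Processing', WorkerToken = @workerToken
WHERE  Id = @id AND Status = 'Pending' AND WorkerToken IS NULL;";

    // Returns true only for the single caller whose UPDATE changed the row.
    public static async Task<bool> TryClaimAsync(
        DbConnection connection, Guid executionId, Guid workerToken,
        CancellationToken ct = default)
    {
        await using DbCommand command = connection.CreateCommand();
        command.CommandText = ClaimSql;
        AddParameter(command, "@id", executionId);
        AddParameter(command, "@workerToken", workerToken);

        // One statement performs both the check and the write, so a second
        // concurrent claimer matches zero rows and gets 0 back.
        int rowsAffected = await command.ExecuteNonQueryAsync(ct);
        return rowsAffected == 1;
    }

    private static void AddParameter(DbCommand command, string name, object value)
    {
        DbParameter parameter = command.CreateParameter();
        parameter.ParameterName = name;
        parameter.Value = value;
        command.Parameters.Add(parameter);
    }
}
```

A worker that gets false back simply moves on to the next row. No other row is blocked, which is the throughput property a job-level lock cannot offer. The rows-affected check assumes the connection reports affected-row counts (for SQL Server, that NOCOUNT is off).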

The resulting permanent rule: duplicate-execution prevention for a specific business entity is never the Background Jobs Framework's responsibility to solve with IBackgroundJobDistributedLock. It belongs to whichever domain framework owns the row being protected. Implementing a job-level lock as a workaround for a missing per-row claim is a rejected approach, and reopening it requires a new ADR.

9.5 What IBackgroundJobDistributedLock is for

Legitimate uses remain. One example is ensuring that only one instance of a genuinely global, singleton-shaped operation runs at a time: a per-job concern, not a per-row one, where serializing the whole job really is the correct semantic. None of the six currently migrated jobs uses this primitive today. Two rely on per-row claiming (retry processor, campaign workflow executor), and four are naturally idempotent at the set level (dead-letter cleanup, retry expire, campaign cleanup, campaign scheduler). See §14.2 for each job's specific concurrency model.