If your org's implementation uses single-execution Jobs, there is a bug in the platform waiting to bite you and stop your async processing from working! If you want this bug fixed, please upvote this idea.
Now the detail:
Scheduling a single-execution Job, whether via System.schedule or System.scheduleBatch, with an execution time only a few minutes in the future can appear to succeed, yet the job can fail to run when the org is on an instance that is heavily loaded (we have a support case open for this: 27945318).
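For context, the pattern in question schedules a job with a cron expression only a couple of minutes ahead of the submission time. The sketch below (Java, mirroring the "seconds minutes hours day month day-of-week year" cron format the platform accepts) shows how narrow the margin between submission and fire time is; the class and method names are illustrative, not platform APIs.

```java
import java.time.LocalDateTime;

public class NearFutureCron {
    // Build a single-execution cron expression ("ss mm HH dd MM ? yyyy")
    // for a fire time a fixed number of minutes after submission. If the
    // handoff to the scheduler takes longer than that margin, the fire
    // time is already in the past by the time the trigger is registered.
    static String cronFor(LocalDateTime fireTime) {
        return String.format("%d %d %d %d %d ? %d",
                fireTime.getSecond(), fireTime.getMinute(), fireTime.getHour(),
                fireTime.getDayOfMonth(), fireTime.getMonthValue(), fireTime.getYear());
    }

    public static void main(String[] args) {
        LocalDateTime submittedAt = LocalDateTime.of(2020, 11, 5, 2, 13, 17);
        LocalDateTime fireTime = submittedAt.plusMinutes(2); // only a 2-minute margin
        System.out.println(cronFor(fireTime)); // 17 15 2 5 11 ? 2020
    }
}
```

Anything that delays the submission path by more than those two minutes leaves the expression describing a time that will never arrive again.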
For me, this is a critical bug in the platform since it removes any trust that a single-execution Job will be processed, especially when implementing "adaptive self-scheduling" (which the platform does not directly support). It also removes trust in System.scheduleBatch with a short minutes-from-now delay. I have only raised this "idea" because Salesforce Support requested it, to see whether others think this needs fixing too.
The technical reason for the failure was expressed to us as follows in our case:
>> We use Quartz (http://www.quartz-scheduler.org/) for this piece of functionality. If the "system.schedule" is called at 2:13:17 (next to Fire Time) and takes more time to call Quartz, due to resource unavailability, then the time specified in the schedule will be in the past, and Quartz will never fire the trigger.
>> The resource availability dependency is the CRUX of Async processing. Thus, if due to reasons, the server gets busy and resource is not available, and it surpasses the NextFireTime of the crontrigger, it will cease the execution of scheduled job and status will be set to 'Waiting'.
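Translated out of scheduler jargon, the failure mode is this: the single fire time is fixed at submission, but by the time the trigger is actually registered with Quartz that time can already be in the past, and a single-execution trigger with no future fire time simply never fires. A minimal sketch of that logic (not Quartz itself; names are illustrative):

```java
import java.time.Instant;
import java.util.Optional;

public class SingleFireRace {
    // A single-execution trigger has exactly one candidate fire time.
    // If that time is not strictly in the future at registration, the
    // scheduler computes "no next fire time" and the trigger is never
    // fired; the job just sits there.
    static Optional<Instant> nextFireTime(Instant scheduledFor, Instant registeredAt) {
        return scheduledFor.isAfter(registeredAt)
                ? Optional.of(scheduledFor)
                : Optional.empty(); // fire time already past: job silently never runs
    }

    public static void main(String[] args) {
        Instant fireAt = Instant.parse("2020-11-05T02:15:17Z");
        // Handoff completed in time: the trigger fires as scheduled.
        System.out.println(nextFireTime(fireAt, fireAt.minusSeconds(90)).isPresent()); // true
        // Handoff delayed past the fire time: the trigger will never fire.
        System.out.println(nextFireTime(fireAt, fireAt.plusSeconds(30)).isPresent());  // false
    }
}
```

A repeating trigger would survive this (the next occurrence is still in the future); only the single-execution case has nothing left to fire.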
Whilst I have sympathy with the Salesforce R&D team in that this is an easy bug to introduce, it should actually be quite easy to fix (in the "Schedule Management Layer" in Salesforce) by spotting that the handoff to Quartz has taken too long and adjusting the fire time accordingly, guaranteeing job execution in the single-execution case.
This is the suggested solution (the "idea"):
Consider that the AsyncApexJob entry records the date/time at which it was inserted, along with the computed next fire time, which will be later than that creation date/time. When a job has not yet been queued with Quartz, is single execution, was submitted before its execution date/time, and now has an execution date/time in the past, adjust the execution date/time to a few seconds in the future when submitting to Quartz. This guarantees that the job does actually execute, even if in a delayed manner.
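The proposed adjustment can be sketched as follows, under the conditions listed above (single execution, submitted before its fire time, not yet queued, fire time now in the past). This is an illustration of the idea only; the method and field names are hypothetical, not platform internals, and the five-second grace period is an arbitrary example value.

```java
import java.time.Duration;
import java.time.Instant;

public class FireTimeAdjuster {
    static final Duration GRACE = Duration.ofSeconds(5); // example grace period

    // Returns the fire time to hand to Quartz. Only a single-execution
    // job that was submitted before its fire time, has not yet been
    // queued, and whose fire time has since slipped into the past is
    // nudged forward; every other job passes through unchanged.
    static Instant effectiveFireTime(Instant submittedAt, Instant scheduledFor,
                                     boolean singleExecution, boolean alreadyQueued,
                                     Instant now) {
        boolean slipped = singleExecution
                && !alreadyQueued
                && submittedAt.isBefore(scheduledFor)
                && !scheduledFor.isAfter(now);
        return slipped ? now.plus(GRACE) : scheduledFor;
    }

    public static void main(String[] args) {
        Instant submitted = Instant.parse("2020-11-05T02:13:17Z");
        Instant scheduled = Instant.parse("2020-11-05T02:15:17Z");
        Instant now = Instant.parse("2020-11-05T02:16:00Z"); // handoff ran late
        System.out.println(effectiveFireTime(submitted, scheduled, true, false, now));
        // -> 2020-11-05T02:16:05Z: nudged 5s ahead so the job still runs
    }
}
```

The guard conditions matter: a job whose requested execution time was already in the past at submission was arguably invalid input, and a repeating job recovers on its own, so neither should be rewritten.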
To add clarity: this edge case impacted one of our customers' production orgs for about a day, starting the day after the org was migrated to Winter '21. Clearly the instance was particularly heavily loaded at that time, causing slow processing and exposing this flaw in the platform's design.
This caused significant disruption to their operations, because our product leverages both System.schedule and System.scheduleBatch with short delays before job execution; this is necessary so the product can perform its async processing without exhausting the daily async quotas.
If you, like me, feel that this is a critical issue (it could randomly impact any org that uses single-execution jobs and thus break async processing) that erodes user, implementer and partner trust in the Salesforce Lightning Platform, please upvote this idea.