Start untangling orchestrator #1739
Draft
mnonnenmacher wants to merge 10 commits into main from untangle-orchestrator
18acb53 chore(orchestrator): Fix a function name (mnonnenmacher)
f5b0f6d chore(orchestrator): Remove unnecessary whitespace (mnonnenmacher)
be8a3cb chore(orchestrator): Remove an unnecessary suppression (mnonnenmacher)
3b2f14f refactor(orchestrator): Inline the `scheduleNextJobs` function (mnonnenmacher)
391a733 refactor(orchestrator): Rename `getCurrentOrtRun` to `getOrtRun` (mnonnenmacher)
9400db2 refactor(orchestrator): Simplify defining job dependencies (mnonnenmacher)
8bb979a refactor(orchestrator): Extract scheduling logic (mnonnenmacher)
5c10474 fix(orchestrator): Make the reporter depend on the analyzer (mnonnenmacher)
f3b26ae refactor(orchestrator): Take `OrtRunInfo` into use (mnonnenmacher)
3af763d chore(orchestrator): Remove an unnecessary helper function (mnonnenmacher)
@@ -78,7 +78,7 @@ private val log = LoggerFactory.getLogger(Orchestrator::class.java)
  * It creates jobs for the single processing steps and passes them to the corresponding workers. It collects the results
  * produced by the workers until the complete ORT result is available or the run has failed.
  */
-@Suppress("LongParameterList", "TooManyFunctions")
+@Suppress("TooManyFunctions")
 class Orchestrator(
     private val db: Database,
     private val workerJobRepositories: WorkerJobRepositories,
@@ -99,9 +99,8 @@ class Orchestrator(
                 "Repository '${ortRun.repositoryId}' not found."
             }

-            val context = WorkerScheduleContext(ortRun, workerJobRepositories, publisher, header, emptyMap())
-            context to listOf { scheduleConfigWorkerJob(ortRun, header, updateRun = true) }
-        }.scheduleNextJobs {
+            scheduleConfigWorkerJob(ortRun, header, updateRun = true)
+        }.onFailure {
             log.warn("Failed to handle 'CreateOrtRun' message.", it)
         }
     }
@@ -111,10 +110,12 @@ class Orchestrator(
      */
     fun handleConfigWorkerResult(header: MessageHeader, configWorkerResult: ConfigWorkerResult) {
         db.blockingQueryCatching(transactionIsolation = isolationLevel) {
-            val ortRun = getCurrentOrtRun(configWorkerResult.ortRunId)
+            val ortRun = getOrtRun(configWorkerResult.ortRunId)

-            nextJobsToSchedule(ConfigEndpoint, ortRun.id, header, jobs = emptyMap())
-        }.scheduleNextJobs {
+            createWorkerScheduleContext(ortRun, header)
+        }.onSuccess { context ->
+            scheduleNextJobs(context)
+        }.onFailure {
             log.warn("Failed to handle 'ConfigWorkerResult' message.", it)
         }
     }
@@ -248,11 +249,12 @@ class Orchestrator(
                     "ORT run '$ortRunId' not found."
                 }

-                repository.tryComplete(job.id, Clock.System.now(), JobStatus.FAILED)?.let {
-                    nextJobsToSchedule(Endpoint.fromConfigPrefix(workerError.endpointName), job.ortRunId, header)
-                }
-            } ?: (createWorkerSchedulerContext(getCurrentOrtRun(ortRunId), header, failed = true) to emptyList())
-        }.scheduleNextJobs {
+                repository.tryComplete(job.id, Clock.System.now(), JobStatus.FAILED)
+                createWorkerScheduleContext(getOrtRun(ortRunId), header)
+            } ?: createWorkerScheduleContext(getOrtRun(ortRunId), header, failed = true)
+        }.onSuccess { context ->
+            scheduleNextJobs(context)
+        }.onFailure {
             log.warn("Failed to handle 'WorkerError' message.", it)
         }
     }
@@ -265,23 +267,26 @@ class Orchestrator(
         log.info("Handling a lost schedule for ORT run {}.", lostSchedule.ortRunId)

         db.blockingQueryCatching(transactionIsolation = isolationLevel) {
-            val ortRun = getCurrentOrtRun(lostSchedule.ortRunId)
-            val context = createWorkerSchedulerContext(ortRun, header)
+            val ortRun = getOrtRun(lostSchedule.ortRunId)
+            val context = createWorkerScheduleContext(ortRun, header)

-            if (context.jobs.isNotEmpty()) {
-                fetchNextJobs(context)
+            if (context.jobs.isEmpty()) {
+                scheduleConfigWorkerJob(ortRun, header, updateRun = false)
+                null
             } else {
-                context to listOf { scheduleConfigWorkerJob(ortRun, header, updateRun = false) }
+                context
             }
-        }.scheduleNextJobs {
+        }.onSuccess { context ->
+            context?.let { scheduleNextJobs(context) }
+        }.onFailure {
             log.warn("Failed to handle 'LostSchedule' message.", it)
         }
     }

     /**
      * Obtain the [OrtRun] with the given [ortRunId] of fail with an exception if it does not exist.
      */
-    private fun getCurrentOrtRun(ortRunId: Long): OrtRun =
+    private fun getOrtRun(ortRunId: Long): OrtRun =
         requireNotNull(ortRunRepository.get(ortRunId)) {
             "ORT run '$ortRunId' not found."
         }
@@ -332,52 +337,20 @@ class Orchestrator(
             val job = workerJobRepositories.updateJobStatus(endpoint, message.jobId, status)
             if (issues.isNotEmpty()) ortRunRepository.update(job.ortRunId, issues = issues.asPresent())

-            nextJobsToSchedule(endpoint, job.ortRunId, header)
-        }.scheduleNextJobs {
+            createWorkerScheduleContext(getOrtRun(job.ortRunId), header)
+        }.onSuccess { context ->
+            scheduleNextJobs(context)
+        }.onFailure {
             log.warn("Failed to handle '{}' message.", message::class.java.simpleName, it)
         }
     }

-    /**
-     * Determine the next jobs that can be scheduled after a job for the given [endpoint] for the run with the given
-     * [ortRunId] has completed. Use the given [header] to send messages to the worker endpoints. Optionally,
-     * accept a map with the [jobs] that have been run. Return a list with the new jobs to schedule and the current
-     * [WorkerScheduleContext].
-     */
-    private fun nextJobsToSchedule(
-        endpoint: Endpoint<*>,
-        ortRunId: Long,
-        header: MessageHeader,
-        jobs: Map<String, WorkerJob>? = null
-    ): Pair<WorkerScheduleContext, List<JobScheduleFunc>> {
-        log.info("Handling a completed job for endpoint '{}' and ORT run {}.", endpoint.configPrefix, ortRunId)
-
-        val ortRun = getCurrentOrtRun(ortRunId)
-        val scheduleContext = createWorkerSchedulerContext(ortRun, header, workerJobs = jobs)
-
-        return fetchNextJobs(scheduleContext)
-    }
-
-    /**
-     * Convenience function to evaluate and process this [Result] with information about the next jobs to be scheduled.
-     * If the result is successful, actually trigger the jobs. Otherwise, call the given [onFailure] function with the
-     * exception that occurred.
-     */
-    private fun Result<Pair<WorkerScheduleContext, List<JobScheduleFunc>>>.scheduleNextJobs(
-        onFailure: (Throwable) -> Unit
-    ) {
-        onSuccess { (context, schedules) ->
-            scheduleCreatedJobs(context, schedules)
-        }
-        this@scheduleNextJobs.onFailure { onFailure(it) }
-    }
-
     /**
      * Create a [WorkerScheduleContext] for the given [ortRun] and message [header] with the given [failed] flag.
      * The context is initialized with the status of all jobs for this run, either from the given [workerJobs]
      * parameter or by loading the job status from the database.
      */
-    private fun createWorkerSchedulerContext(
+    private fun createWorkerScheduleContext(
         ortRun: OrtRun,
         header: MessageHeader,
         failed: Boolean = false,
@@ -390,23 +363,41 @@ class Orchestrator(
         return WorkerScheduleContext(ortRun, workerJobRepositories, publisher, header, jobs, failed)
     }

-    /**
-     * Trigger the scheduling of the given new [createdJobs] for the ORT run contained in the given [context]. This
-     * also includes sending corresponding messages to the worker endpoints.
-     */
-    private fun scheduleCreatedJobs(context: WorkerScheduleContext, createdJobs: CreatedJobs) {
-        // TODO: Handle errors during job scheduling.
+    /** Schedule the next jobs for the current ORT run based on the current state of the run. */
+    private fun scheduleNextJobs(context: WorkerScheduleContext) {
+        val configuredJobs = WorkerScheduleInfo.entries.filterTo(mutableSetOf()) {
+            it.isConfigured(context.jobConfigs())
+        }
+
+        val jobInfos = configuredJobs.mapNotNull {
+            context.jobs[it.endpoint.configPrefix]?.let { job ->
+                it to WorkerJobInfo(job.id, job.status)
+            }
+        }.toMap()
+
+        val ortRunInfo = OrtRunInfo(context.ortRun.id, context.failed, configuredJobs, jobInfos)

-        createdJobs.forEach { it() }
+        val nextJobs = ortRunInfo.getNextJobs()

-        if (createdJobs.isEmpty() && !context.hasRunningJobs()) {
+        nextJobs.forEach { info ->
+            info.createJob(context)?.let { job ->
+                // TODO: Handle errors during job scheduling.
+                info.publishJob(context, job)
+                context.workerJobRepositories.updateJobStatus(
+                    info.endpoint,
+                    job.id,
+                    JobStatus.SCHEDULED,
+                    finished = false
+                )
+            }
+        }
+
+        if (nextJobs.isEmpty() && !context.hasRunningJobs()) {
             cleanupJobs(context.ortRun.id)

             val ortRunStatus = when {
                 context.isFailed() -> OrtRunStatus.FAILED

                 context.isFinishedWithIssues() -> OrtRunStatus.FINISHED_WITH_ISSUES

                 else -> OrtRunStatus.FINISHED
             }
@@ -466,11 +457,6 @@ class Orchestrator(
     )
 }

-/**
- * Type definition to represent a list of jobs that have been created and must be scheduled.
- */
-typealias CreatedJobs = List<JobScheduleFunc>
-
 /**
  * Create an [Issue] object representing an error that occurred in any [Endpoint].
  */
@@ -480,12 +466,3 @@ fun <T : Any> Endpoint<T>.createErrorIssue(): Issue = Issue(
     message = "The $configPrefix worker failed due to an unexpected error.",
     severity = Severity.ERROR
 )
-
-/**
- * Return a [Pair] with the given [scheduleContext] and the list of jobs that can be scheduled in the current phase
- * of the affected ORT run.
- */
-private fun fetchNextJobs(
-    scheduleContext: WorkerScheduleContext
-): Pair<WorkerScheduleContext, List<JobScheduleFunc>> =
-    scheduleContext to WorkerScheduleInfo.entries.mapNotNull { it.createAndScheduleJobIfPossible(scheduleContext) }
@@ -0,0 +1,79 @@
+/*
+ * Copyright (C) 2024 The ORT Server Authors (See <https://github.com/eclipse-apoapsis/ort-server/blob/main/NOTICE>)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * SPDX-License-Identifier: Apache-2.0
+ * License-Filename: LICENSE
+ */
+
+package org.eclipse.apoapsis.ortserver.orchestrator
+
+import org.eclipse.apoapsis.ortserver.model.JobStatus
+
+/** A class to store the required information to determine which jobs can be run. */
+internal class OrtRunInfo(
+    /** The ORT run ID. */
+    val id: Long,
+
+    /** Whether the config worker has failed. */
+    val configWorkerFailed: Boolean,
+
+    /** The jobs configured to run in this ORT run. */
+    val configuredJobs: Set<WorkerScheduleInfo>,
+
+    /** Status information for already created jobs. */
+    val jobInfos: Map<WorkerScheduleInfo, WorkerJobInfo>
+) {
+    /** Get the next jobs that can be run. */
+    fun getNextJobs(): Set<WorkerScheduleInfo> = WorkerScheduleInfo.entries.filterTo(mutableSetOf()) { canRun(it) }
+
+    /** Return true if the job can be run. */
+    private fun canRun(info: WorkerScheduleInfo): Boolean =
+        isConfigured(info) &&
+            !wasScheduled(info) &&
+            canRunIfPreviousJobFailed(info) &&
+            info.dependsOn.all { isCompleted(it) } &&
+            info.runsAfterTransitively.none { isPending(it) }
+
+    /** Return true if no previous job has failed or if the job is configured to run after a failure. */
+    private fun canRunIfPreviousJobFailed(info: WorkerScheduleInfo): Boolean = info.runAfterFailure || !isFailed()
+
+    /** Return true if the job has been completed. */
+    private fun isCompleted(info: WorkerScheduleInfo): Boolean = jobInfos[info]?.status?.final == true
+
+    /** Return true if the job is configured to run. */
+    private fun isConfigured(info: WorkerScheduleInfo): Boolean = info in configuredJobs
+
+    /** Return true if any job has failed. */
+    private fun isFailed(): Boolean = configWorkerFailed || jobInfos.any { it.value.status == JobStatus.FAILED }
+
+    /** Return true if the job is pending execution. */
+    private fun isPending(info: WorkerScheduleInfo): Boolean =
+        isConfigured(info) &&
+            !isCompleted(info) &&
+            canRunIfPreviousJobFailed(info) &&
+            info.dependsOn.all { wasScheduled(it) || isPending(it) }
+
+    /** Return true if the job has been scheduled. */
+    private fun wasScheduled(info: WorkerScheduleInfo): Boolean = jobInfos.containsKey(info)
+}
+
+/** A class to store information of a worker job required by [OrtRunInfo]. */
+internal class WorkerJobInfo(
+    /** The job ID. */
+    val id: Long,
+
+    /** The job status. */
+    val status: JobStatus
+)
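To make the `canRun` and `isPending` rules in the new `OrtRunInfo` class easier to follow, here is a minimal, self-contained sketch of the same decision logic. Everything in it is an assumption made for illustration: `Step`, `Status`, `RunState`, and the dependency graph are hypothetical stand-ins, not the real `WorkerScheduleInfo`, `JobStatus`, or the actual ORT Server job graph, and `runsAfter` is a flat stand-in for `runsAfterTransitively`.

```kotlin
// Hypothetical stand-in for JobStatus; only the `final` flag matters here.
enum class Status(val final: Boolean) { SCHEDULED(false), RUNNING(false), FINISHED(true), FAILED(true) }

// Hypothetical stand-in for WorkerScheduleInfo with an assumed dependency graph.
enum class Step(val runAfterFailure: Boolean = false) {
    ANALYZER, ADVISOR, SCANNER, REPORTER(runAfterFailure = true);

    // Jobs that must be completed before this one may start (assumed).
    val dependsOn: Set<Step>
        get() = when (this) {
            ANALYZER -> emptySet()
            ADVISOR, SCANNER, REPORTER -> setOf(ANALYZER)
        }

    // Jobs that this one must wait for while they are still pending (assumed).
    val runsAfter: Set<Step>
        get() = if (this == REPORTER) setOf(ADVISOR, SCANNER) else emptySet()
}

class RunState(val configured: Set<Step>, val statuses: Map<Step, Status>) {
    // Mirrors getNextJobs(): every step whose preconditions hold right now.
    fun nextSteps(): Set<Step> = Step.entries.filterTo(mutableSetOf()) { canRun(it) }

    private fun canRun(step: Step): Boolean =
        step in configured &&
            step !in statuses &&
            (step.runAfterFailure || !anyFailed()) &&
            step.dependsOn.all { statuses[it]?.final == true } &&
            step.runsAfter.none { isPending(it) }

    private fun anyFailed(): Boolean = statuses.values.any { it == Status.FAILED }

    // A step is pending if it is configured, not completed, and could still be run.
    private fun isPending(step: Step): Boolean =
        step in configured &&
            statuses[step]?.final != true &&
            (step.runAfterFailure || !anyFailed()) &&
            step.dependsOn.all { it in statuses || isPending(it) }
}
```

With all four steps configured and no job created yet, only the analyzer is returned. Once the analyzer has finished, the advisor and scanner become eligible while the reporter waits for them. If the analyzer fails, only the reporter is returned, because it is the only step flagged to run after a failure in this sketch.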
While I like the idea of extracting the scheduling logic into a dedicated class, I have some problems with the current implementation:
- `OrtRunInfo` is meaningless in this context and is rather reminiscent of a data model class.
- `WorkerScheduleContext` is unclear.
- `Orchestrator` now creates a `WorkerScheduleContext` and, with the help of this context, an `OrtRunInfo`. This is because the latter has its own state derived from the context (this is not really untangling). It would be better if `OrtRunInfo` were stateless and only implemented the scheduling strategy. The `getNextJobs()` function could be passed a `WorkerScheduleContext` info object and obtain all required information from there.
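One possible reading of this suggestion is sketched below: a stateless strategy object whose `getNextJobs()` takes a read-only view of the run and derives every decision from it. This is only an illustration under stated assumptions. `Worker`, `RunView`, `SchedulingStrategy`, `MapRunView`, and the dependency graph are hypothetical names invented here, not part of the pull request, and the sketch leaves out the failure handling and the run-after ordering that `OrtRunInfo` implements.

```kotlin
// Hypothetical stand-in for WorkerScheduleInfo.
enum class Worker { ANALYZER, SCANNER, REPORTER }

// Assumed dependency graph, for illustration only.
val Worker.dependsOn: Set<Worker>
    get() = when (this) {
        Worker.ANALYZER -> emptySet()
        Worker.SCANNER, Worker.REPORTER -> setOf(Worker.ANALYZER)
    }

/** Read-only view of a run; a hypothetical stand-in for what a schedule context could expose. */
interface RunView {
    val configured: Set<Worker>
    fun isScheduled(worker: Worker): Boolean
    fun isCompleted(worker: Worker): Boolean
}

/** Stateless strategy: it holds no fields, so every answer depends only on the view passed in. */
object SchedulingStrategy {
    fun getNextJobs(view: RunView): Set<Worker> =
        Worker.entries.filterTo(mutableSetOf()) { worker ->
            worker in view.configured &&
                !view.isScheduled(worker) &&
                worker.dependsOn.all { view.isCompleted(it) }
        }
}

/** Minimal in-memory view used only to exercise the sketch; the map value says whether a job is completed. */
class MapRunView(
    override val configured: Set<Worker>,
    private val completed: Map<Worker, Boolean>
) : RunView {
    override fun isScheduled(worker: Worker): Boolean = worker in completed
    override fun isCompleted(worker: Worker): Boolean = completed[worker] == true
}
```

In this shape the orchestrator would build only one object per message (the view) and hand it to the strategy, instead of first creating a context and then copying parts of it into a second stateful object.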