Replies: 4 comments
Runtime selection

A very common scenario is when there is a requirement to run a job in a particular type of runtime, but exactly which one to pick is not specified. The scheduler then has to pick one arbitrarily. For example, the "baseline on QEMU" config from the original post says that the runtime should be of the "lava" type but it doesn't specify which LAVA lab, so the scheduler needs to rely on some logic to pick one and run the job there. A trivial implementation would be to randomly pick one from the pool of available labs.
The next step is about providing additional clues to the scheduler so it can make a better guess when picking one runtime within the specified type. For example, we may have 10 LAVA labs and 5 Kubernetes clusters, and the Runtime class implementation could have methods to help with this kind of logic too, for example to report how busy or available each runtime currently is. That is however still a bit simplistic, as the same job can cause a different load in different runtimes: a Kubernetes cluster might be tuned for a particular kind of job (high RAM, CPU or storage) and LAVA labs may have a different number of platform instances (QEMU or any other type of hardware). An ideal, generic way to handle this might be to have another method to estimate the load a particular job would cause on each runtime. For example, with a pool of 3 runtimes, the scheduler could compare the estimates and pick the one where the job would have the least impact.
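As a rough sketch of what this could look like in the scheduler (the class and method names below, such as get_load() and estimate_job_load(), are hypothetical and not an existing KernelCI API):

import random


class Runtime:
    """Hypothetical base class; the method names are assumptions for this sketch."""

    kind = None  # runtime type, e.g. "lava" or "kubernetes"

    def get_load(self):
        """Return a number describing how busy this runtime currently is."""
        raise NotImplementedError

    def estimate_job_load(self, job):
        """Return an estimate of the extra load a given job would cause here."""
        raise NotImplementedError


def pick_runtime(runtimes, job, runtime_type):
    """Pick a runtime of the requested type, preferring the least loaded one.

    When no load information is available, fall back to a random choice,
    which is the trivial implementation described above.
    """
    candidates = [runtime for runtime in runtimes if runtime.kind == runtime_type]
    if not candidates:
        return None
    try:
        return min(candidates,
                   key=lambda r: r.get_load() + r.estimate_job_load(job))
    except NotImplementedError:
        return random.choice(candidates)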
Platform criteria

Within a particular type of runtime, each job can have some criteria for which "platform" to run on. The example in this discussion just mentions the CPU architecture (arch: x86_64). It would then be up to each runtime implementation to find out the criteria for its platforms. A simple approach for LAVA labs would be to keep the YAML device types configuration from the legacy system, which contains the CPU architecture for each device type; this config would probably be loaded by the Runtime itself rather than by a global KernelCI core config. The Runtime could also query the LAVA API and keep a cache of the online devices etc. like the legacy system does, but this would all be abstracted behind the Runtime implementation. For Kubernetes, the nodes information could be retrieved dynamically, and some YAML config could also be added if needed.
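For example, a minimal sketch of a LAVA runtime resolving such criteria against a device types config (the YAML layout, the entries and the find_platforms() method are assumptions for this sketch, not the legacy format verbatim):

import yaml

# Hypothetical device types description, loosely modelled on the legacy
# YAML config; the exact layout and entries are assumptions.
DEVICE_TYPES_YAML = """
device_types:
  qemu-x86:
    arch: x86_64
  minnowboard-turbot:
    arch: x86_64
  bcm2836-rpi-2-b:
    arch: arm
"""


class LavaRuntime:
    """Sketch of a LAVA runtime resolving platform criteria locally."""

    def __init__(self, device_types_yaml=DEVICE_TYPES_YAML):
        self._device_types = yaml.safe_load(device_types_yaml)['device_types']

    def find_platforms(self, criteria):
        """Return the device types matching all the criteria, e.g. {'arch': 'x86_64'}."""
        return [
            name for name, attrs in self._device_types.items()
            if all(attrs.get(key) == value for key, value in criteria.items())
        ]


print(LavaRuntime().find_platforms({'arch': 'x86_64'}))
# ['qemu-x86', 'minnowboard-turbot']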
Multiple job runs

Adding redundancy to the jobs can be very useful: showing the differences between results when running the same job several times, either in exactly the same environment or in multiple different ones, can greatly help when investigating issues. It also makes the system more robust against infrastructure issues; while we could have some mechanism to resubmit jobs, it's not always obvious when they fail to run, and they may time out much later. This could be done simply by adding a field in the scheduler config to tell it to schedule the same job multiple times.

The intention when specifying "loose" criteria is that the job can be run on a variety of platforms, so I would expect the scheduler implementation to try and pick very different ones. If the intention was to run a particular job multiple times in exactly the same way, then the runtime name could be specified along with very narrow criteria (e.g. an exact device type name). However, with the ideas suggested in the previous topics, the tendency would be to just pick the runtime and platform that would result in the lowest load, so potentially always the same one. For example, if the job needs to be run on any x86 platform and one lab has lots of a particular kind of x86 hardware, then it'll most likely end up running all these jobs. One way to deal with this would be to pass the criteria associated with the platforms used in previous jobs to the Runtime implementation, and to have a way to specify a constraint on the number of runs. Based on the original example:

scheduler:
  - job: baseline-x86
    event:
      channel: node
      name: kbuild-gcc-10-x86
      result: pass
    runtime:
      type: lava
      criteria:
        arch: x86_64
    runs:
      number: 3
      variant: device_type

The scheduler would schedule the first job like before, then for the second run it would add a constraint that the device_type should be different from the one used in the first run, and likewise for the third run. In this case, one grey area is if no Runtime can schedule jobs with a different device type but could run it again on the same one. I guess we might have "strict" variations where it's required to have different ones or an error is raised, and "permissive" variations where, if only one device type is available, it runs all the jobs anyway.
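A rough sketch of the scheduler side of this, reusing the hypothetical find_platforms() from the platform criteria sketch above together with an equally hypothetical submit() method:

def schedule_runs(runtime, job, criteria, number, strict=False):
    """Schedule `number` runs of a job, trying to pick a different
    platform (e.g. device type) for each run.

    With strict=True an error is raised when not enough different
    platforms match the criteria; otherwise the remaining runs reuse
    platforms already picked ("permissive" behaviour).
    """
    used = []  # platforms picked for previous runs
    for _ in range(number):
        candidates = runtime.find_platforms(criteria)
        if not candidates:
            raise RuntimeError("no platform matches the criteria")
        fresh = [platform for platform in candidates if platform not in used]
        if fresh:
            platform = fresh[0]
        elif strict:
            raise RuntimeError(
                f"only {len(set(used))} platform(s) available for {number} runs")
        else:
            platform = candidates[0]
        used.append(platform)
        runtime.submit(job, platform)  # hypothetical submission method
    return used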
Job statistics

In addition to the Runtime-specific implementations, the scheduler itself could keep track of when it submits a job, when it starts running and when it completes, using node state changes. It could then accumulate this data for each job signature, essentially the set of parameters used when generating and submitting the job. This could be stored in the API too, or in a local database used by the scheduler, or added directly as meta-data to the nodes. The scheduler could then estimate how long a job would wait until it gets started and how long it would take to complete, and maybe also get some indication of the load caused by the job if this could be retrieved from the Runtime. It could then combine this with the other suggested ways of dynamically assessing the availability of each runtime when picking one. This may in fact be more helpful when Runtimes themselves can't get this information; in the worst case where no information can be retrieved from a Runtime at all, these accumulated statistics would be the only data the scheduler has to go on.
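As a sketch of the bookkeeping this would involve (the event fields, state names and the notion of a job signature string below are assumptions, not the actual API schema):

from collections import defaultdict
from statistics import mean


class JobStats:
    """Accumulate per-job-signature timings from node state change events."""

    def __init__(self):
        self._pending = {}  # node id -> signature and per-state timestamps
        self._history = defaultdict(list)  # signature -> [(wait, duration), ...]

    def on_event(self, event):
        """Record a node state change, e.g.
        {'id': '123', 'signature': 'baseline-x86/lava/qemu-x86',
         'state': 'running', 'time': 1700000000.0}"""
        node = self._pending.setdefault(
            event['id'], {'signature': event['signature'], 'times': {}})
        node['times'][event['state']] = event['time']
        if event['state'] == 'done' and {'submitted', 'running'} <= node['times'].keys():
            wait = node['times']['running'] - node['times']['submitted']
            duration = node['times']['done'] - node['times']['running']
            self._history[node['signature']].append((wait, duration))
            del self._pending[event['id']]

    def estimate(self, signature):
        """Return (average wait, average duration) for a job signature, or None."""
        runs = self._history.get(signature)
        if not runs:
            return None
        waits, durations = zip(*runs)
        return mean(waits), mean(durations)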
The legacy pipeline system had everything hard-coded in YAML configuration: which git branches to monitor, which kernels to build for each branch, which tests to run for each build on which device and in which lab. While this makes the load very deterministic, it also means a lot of manual curation and a sub-optimal use of the available resources. It's also rather difficult to comprehend and maintain.
It wasn't really designed to be like this; rather, it evolved from a smaller objective of building just some plain defconfigs and running some boot tests in LAVA. Then we added support for multiple compilers, extended the test coverage, added various config fragments for specialised builds, and added filters to deal with issues in some corner cases. So it "works", but it really needs a fresh start.
Transitioning to the new API & Pipeline provides a natural opportunity to come up with a better configuration mechanism. Now, we don't just have fixed concepts of builds and runtime tests but generic "jobs". And instead of filters with implicit dependencies between builds and tests we have a "scheduler" which has its own YAML configuration to describe such things. This is the part of the pipeline that decides which jobs get run in which runtimes, and we can implement any logic we want there.
The initial inputs are events received from the API, typically whenever some node data changes, and the YAML scheduler configuration. For example, the "baseline on QEMU" entry discussed in the comments tells the scheduler that, when receiving an event about a successful x86 build, it should run a baseline test on QEMU in a LAVA lab.
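A minimal sketch of that matching step, with the config entry embedded as a YAML string; the exact keys, in particular the platforms one, are assumptions loosely based on the extended entry shown in the "Multiple job runs" comment above:

import yaml

# Hypothetical reconstruction of the "baseline on QEMU" entry; the
# `platforms` key is an assumption for this sketch.
SCHEDULER_CONFIG = yaml.safe_load("""
scheduler:
  - job: baseline-x86
    event:
      channel: node
      name: kbuild-gcc-10-x86
      result: pass
    runtime:
      type: lava
      platforms:
        - qemu-x86
""")


def jobs_for_event(config, channel, event):
    """Yield the (job, runtime) pairs to schedule for an incoming event."""
    for entry in config['scheduler']:
        wanted = entry['event']
        if wanted.get('channel') != channel:
            continue
        # All the remaining keys must match fields of the event payload.
        if all(event.get(key) == value
               for key, value in wanted.items() if key != 'channel'):
            yield entry['job'], entry['runtime']


# A node event about a successful x86 kernel build.
event = {'name': 'kbuild-gcc-10-x86', 'result': 'pass'}
for job, runtime in jobs_for_event(SCHEDULER_CONFIG, 'node', event):
    print(job, runtime['type'], runtime.get('platforms'))
# baseline-x86 lava ['qemu-x86']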
Now we can discuss how to extend things based on that. Some of it will be required in order to match the legacy system's coverage (e.g. run that test on multiple platforms and not just QEMU without having to duplicate the whole config entry), but I think that's already well understood. The really important aspect which the new API enables is to go beyond this and have a more effective way of achieving test coverage.
API Roadmap issue: kernelci/kernelci-api#349