Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lib lpc55 rng seed #1820

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

Conversation

flihp
Copy link
Contributor

@flihp flihp commented Jul 11, 2024

This PR implements a new strategy to improve the quality of the initial seeding and reseeding of the PRNG hosted in the rng_driver task.

@flihp flihp requested review from cbiffle and labbott as code owners July 11, 2024 16:53
@flihp flihp marked this pull request as draft July 11, 2024 16:53
@flihp
Copy link
Contributor Author

flihp commented Jul 11, 2024

I've marked this as a draft and it will remain a draft till:

  • it currently crash loops if one of the dice_* features hasn't been enabled in the kernel
  • has only been tested on rot-carrier & lpc55xpresso dev boards, has not yet been tested on a proper gimlet

The bulk of work here will remain unchanged as I address these last few points so reading over this while still in draft won't be wasted time.

@labbott
Copy link
Collaborator

labbott commented Jul 11, 2024

current draft LGTM at a high level, I'll take another pass after testing

@flihp flihp force-pushed the lib-lpc55-rng-seed branch from a527d0e to 509e53d Compare July 12, 2024 05:23
@flihp
Copy link
Contributor Author

flihp commented Jul 12, 2024

I've updated the two commits that pull data from the stage0-handoff region. They now handle failures to pull data from these regions w/ an Option for the two seed components. If these end up being None then the error written in the ringbuf and that component is left out of the seed. I'm not sure if this is the right balance between flexibility and security but it'll keep the task from crash looping on lpc55s w/ DICE disabled.

@flihp flihp marked this pull request as ready for review July 12, 2024 16:04
@andrewjstone
Copy link
Contributor

I've updated the two commits that pull data from the stage0-handoff region. They now handle failures to pull data from these regions w/ an Option for the two seed components. If these end up being None then the error written in the ringbuf and that component is left out of the seed. I'm not sure if this is the right balance between flexibility and security but it'll keep the task from crash looping on lpc55s w/ DICE disabled.

I understand why you did this. However, this strategy makes me somewhat nervous that we'll accidentally squash an error and leave ourselves vulnerable in production.

Is DICE always going to sometimes be disabled on the lpc55s for dev use? Or is this a temporary state of affairs? If the latter, we should open an issue to revert this code when DICE is no longer ever disabled. If the former, we still need to prevent the crash looping. In that case, can we add an extra check where we return an option that matches on any specific error due to DICE being disabled and then confirms that DICE is actually disabled. If DICE is not disabled then we should panic rather than return None.

@flihp
Copy link
Contributor Author

flihp commented Jul 15, 2024

Yes, that's what I meant when I said I'm not sure if this is the right balance between flexibility and security. I probably shouldn't have flipped off the Draft bit since I'm still trying to come up with reasonable guard rails here. That said we only ship this task on dev boards currently.

Is DICE always going to sometimes be disabled on the lpc55s for dev use?

I spent some time these last few days trying to come to a decision on this point. My current thinking is that we can't assume that DICE is enabled at all. Most of this is because it can be turned on / off at will which (I think) means we have to make the strongest RNG seed the default and produce errors / panic unless the app.toml explicitly opts into a weaker RNG seed.

So my current thinking is that I'll introduce a new feature(s) to the RNG task that controls whether or not missing / unavailable components of the seed are hard errors. This will have to deal with conditional availability of the stage0-handoff region in build.rs as well as augmenting the Option stuff I added at the end of last week to reject None values (your feedback focuses mostly on this poitn). I've just starting to prototype this so it's not yet clear but I think this will work best if this feature is enabled by default, requiring selective disabling / explicit opt-in to weaken the seed.

@labbott
Copy link
Collaborator

labbott commented Jul 15, 2024

Is DICE always going to sometimes be disabled on the lpc55s for dev use?

These days it's less likely as everyone is probably working on a sled which has been setup or a rot-carrier which also usually gets embootleby'd . The only people who are likely to be running with DICE off are those of us who are working off of boards that have not been setup or testing something very specific.

@flihp flihp force-pushed the lib-lpc55-rng-seed branch from 509e53d to 6542743 Compare July 16, 2024 04:52
@flihp
Copy link
Contributor Author

flihp commented Jul 16, 2024

I've added a feature to gate inclusion of these components in the various RNG seeds. If this feature is enabled and the task can't access one of these memory regions then we record the failure in the ringbuf and panic. This isn't what we want, but it at least "fails closed" as they say. I'd prefer this feature were enabled by default so as to require the app.toml to disable it explicitly. This doesn't currently work (unless I screwed something up). Alternatively this flag could have been a negative one (enabling the flag disables the feature) but those get confusing.

I plan to take some time over the next day or so to get a rough idea of the level of effort associated with supporting default features for task Cargo.tomls since it would be convenient here. If folks think the current approach is sufficient I don't see a need to hold this PR up, but if there are lingering concerns over possible foot guns I'll be weighing the level of effort of task default features against turning the feature flag introduced here into a negative one.

Copy link
Contributor

@aapoalas aapoalas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from the peanut gallery. Really pleasant code to read <3

drv/lpc55-rng/src/config.rs Outdated Show resolved Hide resolved
drv/lpc55-rng/src/main.rs Outdated Show resolved Hide resolved
drv/lpc55-rng/src/main.rs Outdated Show resolved Hide resolved
drv/lpc55-rng/src/main.rs Show resolved Hide resolved
@flihp flihp marked this pull request as draft July 22, 2024 17:19
@flihp flihp force-pushed the lib-lpc55-rng-seed branch from 6542743 to 9f7600a Compare July 23, 2024 04:42
@flihp
Copy link
Contributor Author

flihp commented Jul 23, 2024

This has been successfully tested on a gimlet, psc & sidecar. To do so required adding the rng_driver task to an oxide-rot-1 image. The app-dev.toml / oxide-rot-1-dev image is lowest risk.

The remaining work to unblock this PR is in #1830

@flihp flihp force-pushed the lib-lpc55-rng-seed branch 3 times, most recently from e36e985 to 46caef9 Compare August 19, 2024 17:32
@flihp flihp force-pushed the lib-lpc55-rng-seed branch 3 times, most recently from aafbf22 to 5c28b5f Compare September 10, 2024 04:39
Copy link
Contributor

@aapoalas aapoalas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More peanuts.

lib/lpc55-rng/src/lib.rs Outdated Show resolved Hide resolved
drv/lpc55-rng/src/main.rs Outdated Show resolved Hide resolved
drv/lpc55-rng/src/main.rs Outdated Show resolved Hide resolved
drv/lpc55-rng/src/main.rs Outdated Show resolved Hide resolved
drv/lpc55-rng/src/main.rs Outdated Show resolved Hide resolved
This is intended to enable testing the HRNG on the LPC55 independet of
the rng_driver task.
Implementing `BlockRngCore` for `Lpc55Core` got us an implementation of
RngCore for free. Unfortunately it cost us the ability to return errors.
This wasn't a good tradeoff. Better to implement `RngCore` directly.
Not sure what I was thinking here.
The `ReseedingRng` can handle us pulling more than 4 bytes out at a
time.
flihp and others added 2 commits September 17, 2024 17:51
where:
- N > 0
- `HRNG_N(count)` represents count bytes taken from the hardware RNG
- `PRNG_N(count)` represents count bytes taken from the Nth generation
  of the PRNG

This commit changes our algorithm for constructing the seed `SEED_N` for
the PRNG instance `PRNG_N` from:

```
SEED_N = HRNG(32)
```

to:

```
SEED_N = sha3_256(PRNG_N-1(32) | HRNG(32))
```

We use `sha3_256` as a mixing function to combine these two components
of the seed though the implementation is generic over the digest w/
constraints on the length.
@flihp flihp force-pushed the lib-lpc55-rng-seed branch from 5c28b5f to 6507771 Compare September 17, 2024 18:23
flihp and others added 3 commits September 17, 2024 19:12
Previously we constructed the initial seed as 32 bytes from the hardware
RNG:

```
SEED_0 = HRNG(32)
```

This commit includes a seed value constructed by the measured boot
implementation from the DICE CDI. This is passed through to the RNG task
using the `stage0-handoff` mechanism. This 32 byte value is now
extracted and mixed with 32 bytes from the HRNG to construct SEED_0:

```
SEED_0 = sha3_256(DICE_SEED | HRNG(32))
```

Use of this feature is gated by the `dice-seed` feature.
Platforms assigned a unique serial number can include this string in the
initial seed to ensure uniqueness in the bit stream produced by the RNG.

We now construct the intial seed as:

```
SEED_0 = sha3_256(DICE_SEED | SN | HRNG(32))
```

Extracting the Platform Id / serial number from the platform identity
cert required exposing the relevant module from the lib-dice crate. We
also add additional constants to the template module that are required
to know the length of the platform id string at compile time. Finally
this feature is gated by the same `dice-seed` feature used for the seed
derived by measured boot for simplicity.
@flihp flihp force-pushed the lib-lpc55-rng-seed branch from 6507771 to 6036d0e Compare September 17, 2024 19:18
@flihp flihp marked this pull request as ready for review September 23, 2024 17:30
uses = ["rng", "pmc"]
start = true
stacksize = 2200
stacksize = 4200
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this'd make it the highest stack usage in the system, above control-plane-agent. Do we have a sense for what caused this? (Asking because I've been winding up playing stack size whack-a-mole lately.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having read more of the code, I can answer this question at least partially: the Server type used in the RNG server is large, and it's being created on the stack and passed around.

It's large primarily due to the Sha3 embedded within it, which I recently noticed in an unrelated case was about 2 kiB.

The easiest way to fix this would be to allocate a static Sha3 instance using ClaimOnceCell (from lib/static-cell) and then pass it into the server constructor to take ownership of (via a &'static mut).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to use mutable-statics because Sha3 doesn't have a const constructor (per discussion in chat): b43295c

/// Create a new Lpc55Rng instance after powering on, enabling the clocks
/// and reseting the underlying HRNG.
pub fn new(syscon: &Syscon) -> Self {
let pmc = unsafe { &*PMC::ptr() };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest having the PMC passed in (like the Syscon is) to make this dependency more obvious (and lose the unsafe!).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the ptr() function on PMC & RNG definitely qualifies as holding the lpc55-pac crate wrong. Should be addressed by bce3889.


Lpc55Rng {
pmc,
rng: unsafe { &*RNG::ptr() },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same as on line 25)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bce3889 as well.

let mut filled = 0;
while filled < dst.len() {
if self.pmc.pdruncfg0.read().pden_rng().bits() {
return Err(RngError::PoweredOff.into());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this looks like the only way this operation can fail. Given that the RNG is powered on in new, finding it powered off would be a surprising program bug. Perhaps this should be a panic? Otherwise I expect all callers to wind up unwrapping this at some point... could be wrong though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on whether to panic or return the error here a few times. Given the change requested in #1820 (comment) new will now take ownership of instances of the PMC & RNG types which makes it clear that we don't intend for them to change behind our backs. Returning the error was part of an attempt to be resilient to this, but it's so far from the common case that it's not worth making things less ergonomic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed by: c29dffa

let mut mixer = H::default();
if let Some(seed) = seed {
// mix platform unique seed derived by measured boot
Digest::update(&mut mixer, seed.as_bytes());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about the explicit qualification on Digest:: -- does mixer.update(seed.as_bytes()) not work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type we're using for the digest implements multiple update functions. One's for the Digest trait like we're expecting here but the other is for the Update trait. So I think this is just a disambiguation.

) -> Option<T> {
use core::slice;

// Safety: This memory is setup by code executed before hubris and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This safety comment isn't actually correct, it assumes that region is valid (but the caller can pass any region it likes -- the fields are public). This function should probably be unsafe in its current form, since it's basically equivalent to ptr::read.

You could make it safe if the caller couldn't pass any arbitrary region -- maybe hardcode which region, or identify it with an enum, or....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's bad. Thanks for the recommendations. I went w/ hiding the details of the memory regions behind an enum. Seems to do what you suggest: cf2704f

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize how badly this change affected stack usage ... working on that now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The majority of the stack bloat from the change in cf2704f is mitigated in bf092d3.

Digest::update(&mut mixer, buf.as_ref());

// create initial instance of the SeedableRng from the seed
let inner = T::from_seed(mixer.finalize_fixed_reset().into());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, you might see lighter stack usage here by using one of the finalize_into operations, instead of returning the array. It might also not change anything. Hard to tell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems to be thwarted by T::from_seed taking ownership of the seed. If we keep a GenericArray in the ReseedingRng to hold the digest we end up having to clone it in try_fill_bytes. If we keep a [u8; 32] in the ReseedingRng the compiler will do the copy for us. This ends up taking up more stack not less. I didn't commit this experiment to this PR branch but put a wip here if you think it could be made to work.

Not sure if having the server take ownership of the PMC & RNG instances
is optimal but it seems to most accurately reflect our intent.

This resolves:
- oxidecomputer#1820 (comment)
- oxidecomputer#1820 (comment)
This prevents the caller from attempting to deserialize data from arbitrary
memory regions.
Decorating `HandoffDataRegion::load_data` with the `#[inline(always])`
attribute mitigates the increase in stack usage from cf2704f
This reduces stack usage by ~74 bytes.
We do this to move the parameterization of the ReseedingRng to the top
level / `main` and for simplification. Our stack usage after this commit
is unchanged as we explicitly create the inputs to the `ReseedingRng` in
a dedicated scope.
This requires restructuring the ReseedingRng constructor a bit but it
saves us 400 bytes of stack.
@flihp flihp requested a review from cbiffle September 26, 2024 19:46
@@ -283,13 +284,22 @@ fn main() -> ! {
)
};

let mixer = mutable_statics! {
static mut MIXER: [Sha3_256; 1] = [Sha3_256::new; _];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the change I had in mind -- I'm surprised it didn't reduce stack usage more! Oh well. It's still an improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants