Remove doubling of buffer size in realiseEndpoint() #1482
Conversation
realiseEndpoint() was unnecessarily allocating twice the buffer space for each endpoint buffer. This was presumably for the sake of the hardware SIE's double-buffering (EVEN/ODD) system, but that's vestigial - this implementation doesn't use the double-buffering capability at all, leaving the ODDRST bit in the CTL register always set. The double-size allocation is a pure waste of memory.
Remove doubling of buffer size in realiseEndpoint()
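For context, a minimal sketch of what the change amounts to, assuming a simple bump allocator over a static endpoint-buffer pool; the names and pool size here are illustrative, not the actual mbed USBHAL_KL25Z code:

```cpp
// Illustrative only -- a simplified stand-in for the HAL's endpoint buffer
// allocation, showing the before/after of the size calculation.
#include <cstddef>
#include <cstdint>

static uint8_t buffer_pool[512];     // hypothetical pool size
static size_t  pool_used = 0;

static uint8_t *allocateEndpointBuffer(size_t maxPacketSize)
{
    // Before: space was reserved for both the EVEN and ODD BDT entries, even
    // though ODDRST stays set and only the EVEN entry is ever handed to the SIE.
    // size_t bytes = maxPacketSize * 2;     // half of this was never used

    // After: one buffer per endpoint, since hardware double buffering is unused.
    size_t bytes = maxPacketSize;

    if (pool_used + bytes > sizeof(buffer_pool)) {
        return nullptr;                      // pool exhausted
    }
    uint8_t *buf = &buffer_pool[pool_used];
    pool_used += bytes;
    return buf;
}
```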
I recently noticed the same oddity in the implementation; there are also a lot of places where the EVEN bit is checked or used in macros. I would prefer that the code be modified to allow double buffering, or at least leave it as an option - you can't get good throughput with single buffering. Double buffering endpoint 0 seems pretty useless to me, but since ODDRST affects all endpoints, the code for EP0 would need to change significantly. One could always just use one EP0 buffer and update the pointer in the BDT once every ISR. This would have the nice side effect of not having to keep track of the DATA0/DATA1-ness of packets (just leave it configured in the BDT).

While we're on the topic of the BDT: the uint8_t * endpoint_buffer[(NUMBER_OF_PHYSICAL_ENDPOINTS - 2) * 2]; and uint8_t * endpoint_buffer_iso[2*2]; arrays are useless - these pointers are already contained in the BDT (albeit in the form of a uint32_t for some reason). And while we're on the topic of saving RAM, I don't see the use of putting the callbacks in an array (epCallback[0] = &USBHAL::EP1_OUT_callback; and so on). A switch statement would not use much more flash than the initialization of the array and would save 120 bytes of RAM, and RAM seems to be the more common bottleneck. Plus, removing the array would give link-time optimization more of a chance at understanding the code, which will inevitably save flash in the long run.

I do not own any of the boards supported by this file. I got it running on an MKL27Z with a little hacking, but I don't feel comfortable testing against that. How is testing usually done? Does someone have all the boards and is willing to do some testing for me? ;)
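To illustrate the switch-versus-table point, here is a minimal self-contained sketch; the callback names and endpoint numbering follow the comment above, but the surrounding structure is invented and is not the actual mbed HAL code.

```cpp
// Self-contained sketch comparing the two dispatch styles; not mbed code.
#include <cstdio>

static void EP1_OUT_callback() { std::puts("EP1 OUT"); }
static void EP1_IN_callback()  { std::puts("EP1 IN");  }

// Style 1: a RAM-resident table of function pointers, filled in at init time
// (as the HAL does today with epCallback[0] = &USBHAL::EP1_OUT_callback; etc.).
static void (*epCallback[2])() = { EP1_OUT_callback, EP1_IN_callback };

// Style 2: a switch in the ISR -- no RAM table, and the call targets stay
// visible to the compiler/linker for inlining and dead-code elimination.
static void dispatchEndpoint(int phyEndpoint)
{
    switch (phyEndpoint) {
        case 2: EP1_OUT_callback(); break;
        case 3: EP1_IN_callback();  break;
        default: break;              // unconfigured endpoint: ignore the token
    }
}

int main()
{
    epCallback[0]();                 // table dispatch
    dispatchEndpoint(3);             // switch dispatch
}
```

How much the switch actually wins depends on how many endpoints the build configures, so the 120-byte figure is best checked against a map file.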
I actually did look into implementing the hardware double-buffering scheme, but I decided it wasn't worth the effort. The hardware maintains opaque internal state about the even/odd parity that the software must keep in sync with - if you get out of sync, the hardware starts ignoring BDT hand-offs. I'm sure it would be possible to use with a careful design, but with the current HAL design there are too many edge cases where I couldn't keep the bookkeeping working. I think a ground-up redesign of the HAL would be needed.
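For readers unfamiliar with the EVEN/ODD mechanism, a rough picture of the bookkeeping being described, assuming the usual Kinetis BDT layout (endpoint x direction x even/odd); the ParityTracker type and everything around it is invented for illustration.

```cpp
// Invented illustration of the even/odd bookkeeping, not the mbed HAL.
#include <cstdint>

struct BdtEntry { volatile uint32_t info; volatile uint32_t addr; };

// Kinetis BDT layout: [endpoint][direction][even/odd], two 32-bit words each.
static BdtEntry bdt[16][2][2];

struct ParityTracker {
    bool nextOdd[16][2] = {};             // software's mirror of the SIE parity

    BdtEntry &claim(int ep, int dirIn) {
        BdtEntry &bd = bdt[ep][dirIn][nextOdd[ep][dirIn] ? 1 : 0];
        nextOdd[ep][dirIn] = !nextOdd[ep][dirIn];   // must flip on every hand-off
        return bd;
    }
    // If this toggle ever gets out of step with the hardware (after a stall,
    // a bus reset, or a hand-off the SIE rejected), the SIE quietly stops
    // accepting buffer descriptors -- which is the failure mode described above.
};
```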
Yeah, the reference implementation from Kinetis really sucks in my opinion.

[data0/1-ness] Oh yeah, stall, damn! Just off the top of my head: in the unstall handler, couldn't you just invert the DATA01 flags in both BDTs? Coming out of a stall the processor should own both BDTs, right? By the way, how do you test whether you're handling a stall right? I set a breakpoint in the stall handler and tried to screw up the physical communication enough to make one happen, but I either killed everything or it recovered without a stall.

[your branch] Great work! I tried to get full throughput on an MKL27Z last week and noticed a lot of weird bugs, so I will look through your work and see if I found anything you didn't, and vice versa. I ended up getting about 1 megabyte per second on most hubs, but some USB 3.0 hubs didn't do more than 400k.

"After spending some time with the code and seeing how many clear concurrency errors it has, it became rather amazing that it ever worked at all" - I've had this experience time and again with embedded code, and mbed is no exception. Besides race conditions, mailbox-like "write-one-to-clear" bits are also a super common source of bugs. In fact, in your code https://developer.mbed.org/users/mjr/code/USBDevice/file/d684a8ce5d88/USBDevice/USBHAL_KL25Z.cpp lines 692, 725 and 735, I think you are mistakenly clearing all flags rather than just the one.

I think if you were to do a HAL rewrite, getting rid of all indirect function calls using compile-time polymorphism would also shrink the size considerably.
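The "clearing all flags rather than just the one" remark refers to the write-1-to-clear behaviour of the Kinetis ISTAT register. A small self-contained model of the pitfall follows; the register is faked here so the snippet runs anywhere, and only the bit positions and w1c semantics follow the Kinetis reference manual.

```cpp
// Faked register so the example runs anywhere; only the bit positions and the
// write-1-to-clear behaviour follow the Kinetis USB ISTAT register.
#include <cstdint>
#include <cstdio>

struct FakeUsbRegs { volatile uint32_t ISTAT; };
static FakeUsbRegs usb0 = { 0 };

constexpr uint32_t ISTAT_SOFTOK = 1u << 2;   // start-of-frame token
constexpr uint32_t ISTAT_TOKDNE = 1u << 3;   // token done

static void w1cWrite(uint32_t value)         // models the hardware: 1-bits clear
{
    usb0.ISTAT &= ~value;
}

int main()
{
    usb0.ISTAT = ISTAT_TOKDNE | ISTAT_SOFTOK;          // both events pending

    // BUG: read-modify-write. The read returns *all* pending flags, so the
    // OR'd value written back clears SOFTOK along with TOKDNE.
    w1cWrite(usb0.ISTAT | ISTAT_TOKDNE);
    std::printf("after |= : ISTAT = 0x%02x (SOFTOK lost)\n", (unsigned)usb0.ISTAT);

    usb0.ISTAT = ISTAT_TOKDNE | ISTAT_SOFTOK;          // reset the scenario

    // Correct: write only the flag you intend to clear.
    w1cWrite(ISTAT_TOKDNE);
    std::printf("after  = : ISTAT = 0x%02x (SOFTOK preserved)\n", (unsigned)usb0.ISTAT);
}
```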
[handling all interrupt flags anyway] Yes, the chances of this screwing anything up are super slim, but if you were to get really pedantic you could argue that one of these flags could be set after ISTAT is locally buffered and before one of the |= operations, and therefore lost.

[double buffering] Just hacking around last week, I got double buffering working by simply clearing ODDRST after setup and then implementing it for my bulk endpoint, but it's a super hack and will need to be rewritten before production code. The advantage I see with double buffering is that the maximum interrupt latency tolerated while still maintaining maximum throughput is twice as long, because the hardware has two buffers it can send before it runs out.

[better synchronization] I think allowing stuff to access the buffers/SFRs outside of the ISR context is just wrong. Rather than giving every IN endpoint a buffer, I would allocate a few community buffers which also hold info like byte count and endpoint, and make two thread-safe queues: one acting as a free list and one acting as an output queue. A public call to write would then just fill a buffer, push it to the out queue, and set the USB ISR flag manually, which in turn would actually send the buffer. That way you don't need any further synchronization in the internals of the HAL. The problem with disabling the USB ISR alone as a synchronization mechanism is that interrupts could potentially be off for a while. Let's say we have the USB interrupt at a high priority and a few UARTs, I2C and other things with long timeouts running at a lower priority. These low-priority interrupts can't delay USB transmission directly, but if a lot of them occur while we are in a critical section in a function call…

Don't get me wrong, though: your solution is way better than what it replaced, and it's probably close to the best you can do without rewriting the API. If you're worried about the overhead of calling the ISR needlessly when the output buffers are full anyway, you could also check whether the buffer is empty and only set the interrupt flag if it is (because otherwise the interrupt will go off in the near future anyway). I think that would be pretty close to the maximum conceivable throughput.
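A rough sketch of the free-list / output-queue idea, under the assumption that all BDT and SFR access happens inside the USB ISR; the types, names and pool size are invented, and std::mutex merely stands in for an interrupt-masking critical section on real hardware.

```cpp
// Invented sketch of the buffer pool and queues; pendUsbIrq() stands in for
// NVIC_SetPendingIRQ(USB0_IRQn) on the real part.
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>

struct Packet {
    uint8_t ep;                    // target IN endpoint
    uint8_t len;                   // byte count
    uint8_t data[64];              // max full-speed bulk payload
};

template <std::size_t N>
class PacketQueue {                // minimal thread-safe FIFO
public:
    bool push(const Packet &p) {
        std::lock_guard<std::mutex> g(lock_);
        if (count_ == N) return false;
        buf_[(head_ + count_++) % N] = p;
        return true;
    }
    std::optional<Packet> pop() {
        std::lock_guard<std::mutex> g(lock_);
        if (count_ == 0) return std::nullopt;
        Packet p = buf_[head_];
        head_ = (head_ + 1) % N;
        --count_;
        return p;
    }
private:
    std::array<Packet, N> buf_{};
    std::size_t head_ = 0, count_ = 0;
    std::mutex lock_;
};

static PacketQueue<4> freeList;    // empty buffers owned by the application
static PacketQueue<4> outQueue;    // filled buffers waiting for the USB ISR

static void pendUsbIrq() { /* NVIC_SetPendingIRQ(USB0_IRQn) on the real part */ }

static void usbBufferPoolInit() {  // call once at startup
    for (int i = 0; i < 4; ++i) freeList.push(Packet{});
}

// Application-side write: fill a buffer, queue it, poke the ISR. All BDT and
// SFR access then stays inside the ISR, which drains outQueue and returns
// finished buffers to freeList.
static bool usbWrite(uint8_t ep, const uint8_t *src, uint8_t len) {
    std::optional<Packet> pkt = freeList.pop();
    if (!pkt) return false;                        // no free buffer
    if (len > sizeof(pkt->data)) {                 // oversize payload
        freeList.push(*pkt);                       // hand the buffer back
        return false;
    }
    pkt->ep = ep;
    pkt->len = len;
    for (uint8_t i = 0; i < len; ++i) pkt->data[i] = src[i];
    outQueue.push(*pkt);           // cannot overflow: pool size == capacity
    pendUsbIrq();
    return true;
}
```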
[Can you get the USB interrupt triggered manually, though?] Good point, I haven't tried it on this chip yet - maybe it's not possible. It has worked on LPCs in the past. Looking at the generic ARM documentation, setting the corresponding bit in NVIC_ISPR should generate an interrupt, but the Freescale docs say nothing, so I'll try it tomorrow.

[istat flags don't change inside ISR] I have seen the flags change when stepping through the ISR, and I can see in my log that the TOKDNE bit gets set again during an ISR, so I'm pretty certain this is a potential race condition, although bound to be super, super rare.

[ODDRST success] Like I said, it's a super hack. I will eventually need to get this working - my current design is super throughput-critical - and I will open-source whatever I come up with. It could be on kvasir.io, though, if I have to make big API changes.
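For reference, the CMSIS way to pend the interrupt from software, which is what gets confirmed a few comments below; USB0_IRQn is the vector name in the Kinetis device headers, and the commented line shows the equivalent raw ISPR write on a Cortex-M0+.

```cpp
// Pending the USB interrupt from software (CMSIS). USB0_IRQn comes from the
// Kinetis device header; the raw ISPR write is the Cortex-M0+ equivalent.
#include "MKL25Z4.h"   // or the MKL27Z header, depending on the target

static void triggerUsbIsr(void)
{
    NVIC_SetPendingIRQ(USB0_IRQn);   // CMSIS computes the ISPR bit for us

    // Equivalent direct register write on a Cortex-M0+ (single ISPR word):
    // NVIC->ISPR[0] = 1u << ((uint32_t)USB0_IRQn & 0x1Fu);
}
```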
[It does mean that you have to be careful to clear TOKDNE exactly once per ISR call, as clearing that bit has this complex side effect (it's not just a passive register write).] That's what I'm getting at: if you |= rather than = somewhere else, you may be clearing it twice per ISR.
I can confirm that setting the NVIC_ISPR bit to generate an interrupt in software works.

I have also looked at making my double-buffering hack pretty enough that other people can use it. The problem is that I am not going through the upper layers but just handling packets in my code injected right into the HAL. As soon as I go through upper layers, the buffer even/oddness gets lost. OK, I could read STAT again and figure it out, but it just seems like a lot of work to fight with the API. At the end of the day, if someone is super throughput-concerned then this API is not for them.

One thing I particularly don't understand about the API is why the HAL notifies upper layers about a packet which has arrived and then the upper layer calls the endpointReadResult function to actually read it. Why not read it in the HAL and pass the packet to the upper layer? That would save indirect calls and information loss like even/odd-ness. I also consider the fact that setup packets are decoded into a data structure and then referenced from all over the place bad design. The upper layer does look at the last OUT packet when an IN request comes in, as is the case for USBCallback_requestCompleted(); however, the way this is done is just plain ugly and totally breaks encapsulation. If one were to model USBDevice and USBCdc etc. as what they truly are - state machines - then none of this 'magic state which is private but actually more like protected but can also be accessed by an interrupt which finds it through a static instance pointer at any time' is needed, and the object lifetimes of information like line coding are suddenly well defined. Sure, we need to keep some state around in order to handle control transfers (http://www.beyondlogic.org/usbnutshell/usb4.shtml), but let's model it as a state machine and only keep a pointer to the last data packet throughout…
Great - sounds like your all-ISR based design works as far as that goes. Pondering this more, though, I'm not sure there's actually any benefit to generating an interrupt artificially. The reason to handle everything in the ISR is that there are no threading issues, right? Because you can't get another of the same interrupt while you're already handling one. The thing is, that's just the same as creating a critical section. I think it might actually be detrimental to do the synthetic interrupts because it could create race conditions (I'm thinking missed interrupts) when the SIE hardware wants to generate a hard interrupt while you're handling a soft one. It also just strikes me as less efficient in that you have to go through that whole 7-branch if-else tree in the ISR to see what work there is to do, even though you already know exactly what work you wanted to accomplish because you initiated the fake interrupt. So I'm thinking that the ISR isn't really the right abstraction point here. I think the basic idea is on the right track, though, that you want to unify all (or most) of the buffer operations into common code rather than having gobs of separate ISR and application code.
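The critical-section alternative being argued for here could look roughly like this; the function name is illustrative, the CMSIS calls are standard, and USB0_IRQn is the Kinetis vector name.

```cpp
// Illustrative only: serialise against the USB ISR by masking just that IRQ
// around the shared bookkeeping instead of pending a synthetic interrupt.
#include "MKL25Z4.h"   // CMSIS + Kinetis device header

static void updateSharedUsbState(void)
{
    NVIC_DisableIRQ(USB0_IRQn);   // the USB ISR cannot preempt this section
    __DSB();                      // make sure the disable has taken effect
    __ISB();

    // ...touch the BDT entries / pending-transfer bookkeeping here...

    NVIC_EnableIRQ(USB0_IRQn);    // an IRQ raised meanwhile is still pended
}
```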
[software setting interrupts] If we think of the USBHAL as being chip-specific and the USBDevice as being purely stuff that is in the standard, then there is a very limited set of data flow paths from USBDevice to the USBHAL. The obvious flow path would be packets: if we make a data structure with a buffer, a size and an endpoint number, push it to a queue and enable the interrupt, then we don't actually need any information about the USBHAL - we just need to know where to push it to. Besides packets, we have things like stallEndpoint, setAddress, addEndpoint etc. All of these are essentially the result of a control packet going upstream. If we make the USBDevice handler return a sort of hardware-command variant downstream, I think all other special cases could be propagated to the USBHAL without the USBDevice actually having to know anything about it.

Breaking the recursive dependency between USBDevice and USBHAL means we can easily pass the USBDevice as a template parameter to the USBHAL, make all the member functions and data static, and hello static dispatch. Eliminating virtual functions should shrink the FLASH footprint enormously! I implemented a USB CDC for a bootloader on an LPC1768 which turned out to be almost 10x smaller than the mbed equivalent. If the USBDevice were to take the "packet buffer and queue pool" as a template parameter (kind of like an allocator in the STL), the user would have far more control over the amount of buffer space used.

@salkinium I see you have some experience with the STM32F072 - can you comment on the feasibility of this kind of HAL with it? (Totally self-interested, as I will need to get USB working on that chip as well in the future; all the LPC11U, LPC15 and LPC17 parts I have worked with should be implementable with the proposed HAL.)
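A compressed sketch of the statically dispatched layering being proposed; every name below is hypothetical and the packet plumbing is reduced to a stub, so it only shows the dependency direction and the compile-time binding.

```cpp
// Hypothetical layering sketch -- not the existing mbed USBDevice/USBHAL API.
#include <cstdint>
#include <cstdio>

struct PacketView {
    uint8_t        ep;
    const uint8_t *data;
    uint8_t        len;
};

// "USBDevice" layer: pure USB-standard logic, no knowledge of the chip.
struct MyUsbDevice {
    static void onPacket(const PacketView &p) {
        std::printf("EP%d: %d bytes\n", p.ep, p.len);
    }
};

// "USBHAL" layer: chip specific, parameterised on the device class, so the
// upstream call is resolved at compile time (no vtable, friendly to LTO).
template <typename Device>
struct UsbHal {
    static void isr() {
        // ...read the token/BDT state from the hardware here...
        static const uint8_t payload[] = { 0x01, 0x02, 0x03 };
        Device::onPacket(PacketView{1, payload, sizeof(payload)});
    }
};

int main() {
    UsbHal<MyUsbDevice>::isr();   // stand-in for the real interrupt entry
}
```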
Forwarding to @ekiwi who has experience with the STM32F072. |
OK, correction: we do need to know which NVIC_ISPR bit to set, but that could be resolved using a traits class or the linker.
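A traits class along those lines might look like this; the chip tags and IRQ numbers are placeholders (the real value would come from each part's USB0_IRQn definition).

```cpp
// Hypothetical traits approach; the IRQ numbers here are placeholders -- take
// the real values from the USB0_IRQn definition in each part's device header.
template <typename Chip>
struct UsbIrqTraits;                    // no generic definition: unknown chips fail to compile

struct Kl25z  {};
struct Mkl27z {};

template <> struct UsbIrqTraits<Kl25z>  { static constexpr int usbIrq = 24; };
template <> struct UsbIrqTraits<Mkl27z> { static constexpr int usbIrq = 24; };

template <typename Chip>
void pendUsbIrq()
{
    // NVIC_SetPendingIRQ(static_cast<IRQn_Type>(UsbIrqTraits<Chip>::usbIrq));
}
```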
According to this site: http://www.usbmadesimple.co.uk/ums_3.htm

"Control Transfer - This is a bi-directional transfer which uses both an IN and an OUT endpoint. Each control transfer is made up of from 2 to several transactions. It is divided into three stages. The SETUP stage carries 8 bytes called the Setup packet. This defines the request, and specifies whether and how much data should be transferred in the DATA stage. The DATA stage is optional. If present, it always starts with a transaction containing a DATA1. The type of transaction then alternates between DATA0 and DATA1 until all the required data has been transferred. The STATUS stage is a transaction containing a zero-length DATA1 packet. If the DATA stage was IN then the STATUS stage is OUT, and vice versa. Control transfers are used for initial configuration of the device by the host, using Endpoint 0 OUT and Endpoint 0 IN, which are reserved for this purpose. They may be used (on the same endpoints) after configuration as part of the device-specific control protocol, if required. The max packet size for the data stage is 8 bytes at low speed, 8, 16, 32 or 64 at full speed and 64 for high speed."

This seems to suggest that DATA0/DATA1-ness does not actually always alternate but "starts over" at DATA1 with each stage change in a control transfer. If this is true, it's maybe why you experienced difficulties with double buffering. I think control transfers need to be modeled as a state machine in the HAL, since the USBDevice should not need to know about DATA0/1-ness. Multi-packet transfers from host to device are also theoretically possible, so the USBDevice needs to accept an array of packets before handling everything and pushing an ack downstream.

The only thing that seems to break my HAL concept so far is GET_STATUS -> ENDPOINT_RECIPIENT. This should naturally be handled in USBDevice, since decoding it at least is not hardware-specific. The response, however, is hardware-specific in that we need information about the endpoint stall state. I could send a packet downstream with everything except the stall state and then have one 'hardware command' be "write stall status into buffer with pointer x". This seems very dirty, though (although still not as bad as the current mbed implementation).
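The stage behaviour quoted above is easy to capture as a small state machine in the HAL, which is roughly what is being argued for; the following is an illustrative sketch, not the mbed implementation.

```cpp
// Minimal sketch of modelling the control-transfer stages (and the DATA0/1
// reset per stage described in the quote) as a state machine inside the HAL.
#include <cstdint>

enum class Ep0Stage : uint8_t { Idle, DataIn, DataOut, StatusIn, StatusOut };

struct Ep0StateMachine {
    Ep0Stage stage = Ep0Stage::Idle;
    bool     data1 = false;           // toggle for the next EP0 packet

    void onSetup(bool hasDataStage, bool deviceToHost) {
        // SETUP itself is always DATA0; the DATA stage (if any) restarts at DATA1.
        data1 = true;
        if (!hasDataStage)
            stage = Ep0Stage::StatusIn;              // zero-length DATA1 IN
        else
            stage = deviceToHost ? Ep0Stage::DataIn : Ep0Stage::DataOut;
    }

    void onDataPacketDone(bool lastPacket) {
        data1 = !data1;                              // alternates within the stage
        if (lastPacket) {
            // STATUS runs in the opposite direction and is always DATA1.
            stage = (stage == Ep0Stage::DataIn) ? Ep0Stage::StatusOut
                                                : Ep0Stage::StatusIn;
            data1 = true;
        }
    }

    void onStatusDone() { stage = Ep0Stage::Idle; }
};
```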
To me this is a HAL problem, because no one can really know how big the transfer can ever get. The user can theoretically send large amounts of data over control transfers, so putting them in a buffer that lives in USBDevice is stupid; without passing an array of pointers to packets, this cannot really be done well. In my proposed HAL, the user can decide how big the biggest control transfer can be simply by specifying enough packets in the pool.
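A sketch of what "specifying enough packets in the pool" could mean in practice, kind of like an STL allocator parameter; the class and parameter names are hypothetical.

```cpp
// Hypothetical API sketch: the pool size is a template parameter, so the user
// decides how large a control transfer the stack can buffer.
#include <array>
#include <cstddef>
#include <cstdint>

struct Packet { uint8_t len; uint8_t data[64]; };

template <typename Device, std::size_t PoolSize>
class UsbStack {
    std::array<Packet, PoolSize> pool_{};   // largest control transfer =
                                            // PoolSize * 64 bytes of payload
};

// A device that never expects large control transfers keeps the pool small,
// e.g. UsbStack<MyUsbDevice, 2>; one that accepts big vendor requests might
// instantiate UsbStack<MyUsbDevice, 16> instead.
```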
I think I have an API for a new HAL worked out which will provide the following advantages over the current one:
USBDevice takes as template parameters:
USBHAL takes a USBDevice and calls its static members:
I built a USB error maker out of a microcontroller and a few transistors and capacitors which causes physical-layer errors every 5 milliseconds by forcing a few recessive bits. Since a USB frame is 1 ms, there should be plenty of room on the bus for data to get through if error handling works correctly. Using the mbed implementation, device enumeration fails 9 out of 10 times, so there is surely room to improve the error handling as well.
I recently committed my USB abstraction to Kvasir, in case anyone is interested in moving the conversation there.