Remove doubling of buffer size in realiseEndpoint() #1482
Conversation
realiseEndpoint() was unnecessarily allocating twice the buffer space for each endpoint buffer. This was presumably for the sake of the hardware SIE's double-buffering (EVEN/ODD) system, but that's vestigial - this implementation doesn't use the double-buffering capability at all, leaving the ODDRST bit in the CTL register always set. The double-size allocation is a pure waste of memory.
Remove doubling of buffer size in realiseEndpoint()
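For context, a minimal sketch of what the change amounts to, assuming a simple bump allocator over a static endpoint-buffer pool; the names and pool size here are illustrative, not the actual mbed USBHAL_KL25Z code:

```cpp
// Illustrative only -- a simplified stand-in for the HAL's endpoint buffer
// allocation, showing the before/after of the size calculation.
#include <cstddef>
#include <cstdint>

static uint8_t buffer_pool[512];     // hypothetical pool size
static size_t  pool_used = 0;

static uint8_t *allocateEndpointBuffer(size_t maxPacketSize)
{
    // Before: space was reserved for both the EVEN and ODD BDT entries, even
    // though ODDRST stays set and only the EVEN entry is ever handed to the SIE.
    // size_t bytes = maxPacketSize * 2;     // half of this was never used

    // After: one buffer per endpoint, since hardware double buffering is unused.
    size_t bytes = maxPacketSize;

    if (pool_used + bytes > sizeof(buffer_pool)) {
        return nullptr;                      // pool exhausted
    }
    uint8_t *buf = &buffer_pool[pool_used];
    pool_used += bytes;
    return buf;
}
```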
I recently noticed the same oddity in the implementation; there are also a lot of places where the EVEN bit is checked or used in macros. I would prefer that the code be modified to allow double buffering, or at least leave it as an option - you can't get good throughput with single buffering. Double buffering endpoint 0 seems pretty useless to me, but since ODDRST affects all endpoints, the code for EP0 would need to change significantly. One could always just use one EP0 buffer and update the pointer in the BDT once every ISR. This would have the nice side effect of not having to keep track of the DATA0/DATA1-ness of packets (just leave it configured in the BDT).

While we're on the topic of the BDT: the uint8_t * endpoint_buffer[(NUMBER_OF_PHYSICAL_ENDPOINTS - 2) * 2]; and uint8_t * endpoint_buffer_iso[2*2]; arrays are useless - these pointers are already contained in the BDT (albeit in the form of a uint32_t for some reason). And while we're on the topic of saving RAM, I don't see the use of putting the callbacks in an array (epCallback[0] = &USBHAL::EP1_OUT_callback; and so on). A switch statement would not use much more flash than the initialization of the array and would save 120 bytes of RAM, and RAM seems to be the more common bottleneck. Plus, removing the array would give link-time optimization more of a chance at understanding the code, which will inevitably save flash in the long run.

I do not own any of the boards supported by this file. I got it running on an MKL27Z with a little hacking, but I don't feel comfortable testing against that. How is testing usually done? Does someone have all the boards and is willing to do some testing for me? ;)
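To illustrate the switch-versus-table point, here is a minimal self-contained sketch; the callback names and endpoint numbering follow the comment above, but the surrounding structure is invented and is not the actual mbed HAL code.

```cpp
// Self-contained sketch comparing the two dispatch styles; not mbed code.
#include <cstdio>

static void EP1_OUT_callback() { std::puts("EP1 OUT"); }
static void EP1_IN_callback()  { std::puts("EP1 IN");  }

// Style 1: a RAM-resident table of function pointers, filled in at init time
// (as the HAL does today with epCallback[0] = &USBHAL::EP1_OUT_callback; etc.).
static void (*epCallback[2])() = { EP1_OUT_callback, EP1_IN_callback };

// Style 2: a switch in the ISR -- no RAM table, and the call targets stay
// visible to the compiler/linker for inlining and dead-code elimination.
static void dispatchEndpoint(int phyEndpoint)
{
    switch (phyEndpoint) {
        case 2: EP1_OUT_callback(); break;
        case 3: EP1_IN_callback();  break;
        default: break;              // unconfigured endpoint: ignore the token
    }
}

int main()
{
    epCallback[0]();                 // table dispatch
    dispatchEndpoint(3);             // switch dispatch
}
```

How much the switch actually wins depends on how many endpoints the build configures, so the 120-byte figure is best checked against a map file.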
I actually did look into implementing the hardware double-buffering scheme, but I decided it wasn't worth the effort. The hardware maintains opaque internal state about the even/odd parity that the software must keep in sync with - if you get out of sync, the hardware starts ignoring BDT hand-offs. I'm sure it would be possible to use with a careful design, but with the current HAL design there are too many edge cases where I couldn't keep the bookkeeping working. I think a ground-up redesign of the HAL would be needed.
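For readers unfamiliar with the EVEN/ODD mechanism, a rough picture of the bookkeeping being described, assuming the usual Kinetis BDT layout (endpoint x direction x even/odd); the ParityTracker type and everything around it is invented for illustration.

```cpp
// Invented illustration of the even/odd bookkeeping, not the mbed HAL.
#include <cstdint>

struct BdtEntry { volatile uint32_t info; volatile uint32_t addr; };

// Kinetis BDT layout: [endpoint][direction][even/odd], two 32-bit words each.
static BdtEntry bdt[16][2][2];

struct ParityTracker {
    bool nextOdd[16][2] = {};             // software's mirror of the SIE parity

    BdtEntry &claim(int ep, int dirIn) {
        BdtEntry &bd = bdt[ep][dirIn][nextOdd[ep][dirIn] ? 1 : 0];
        nextOdd[ep][dirIn] = !nextOdd[ep][dirIn];   // must flip on every hand-off
        return bd;
    }
    // If this toggle ever gets out of step with the hardware (after a stall,
    // a bus reset, or a hand-off the SIE rejected), the SIE quietly stops
    // accepting buffer descriptors -- which is the failure mode described above.
};
```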
Yeah, the reference implementation from Kinetis really sucks in my opinion.

[data0/1-ness] Oh yeah, stall, damn! Just off the top of my head: in the unstall handler, couldn't you just invert the DATA01 flags in both BDTs? Coming out of a stall the processor should own both BDTs, right? By the way, how do you test whether you're handling a stall right? I set a breakpoint in the stall handler and tried to screw up the physical communication enough to make one happen, but I either killed everything or it recovered without a stall.

[your branch] Great work! I tried to get full throughput on an MKL27Z last week and noticed a lot of weird bugs, so I will look through your work and see if I found anything you didn't, and vice versa. I ended up getting about 1 megabyte per second on most hubs, but some USB 3.0 hubs didn't do more than 400k.

"After spending some time with the code and seeing how many clear concurrency errors it has, it became rather amazing that it ever worked at all" - I've had this experience time and again with embedded code, and mbed is no exception. Besides race conditions, mailbox-like "write-one-to-clear" bits are also a super common source of bugs. In fact, in your code https://developer.mbed.org/users/mjr/code/USBDevice/file/d684a8ce5d88/USBDevice/USBHAL_KL25Z.cpp lines 692, 725 and 735, I think you are mistakenly clearing all flags rather than just the one.

I think if you were to do a HAL rewrite, getting rid of all indirect function calls using compile-time polymorphism would also shrink the size considerably.
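The "clearing all flags rather than just the one" remark refers to the write-1-to-clear behaviour of the Kinetis ISTAT register. A small self-contained model of the pitfall follows; the register is faked here so the snippet runs anywhere, and only the bit positions and w1c semantics follow the Kinetis reference manual.

```cpp
// Faked register so the example runs anywhere; only the bit positions and the
// write-1-to-clear behaviour follow the Kinetis USB ISTAT register.
#include <cstdint>
#include <cstdio>

struct FakeUsbRegs { volatile uint32_t ISTAT; };
static FakeUsbRegs usb0 = { 0 };

constexpr uint32_t ISTAT_SOFTOK = 1u << 2;   // start-of-frame token
constexpr uint32_t ISTAT_TOKDNE = 1u << 3;   // token done

static void w1cWrite(uint32_t value)         // models the hardware: 1-bits clear
{
    usb0.ISTAT &= ~value;
}

int main()
{
    usb0.ISTAT = ISTAT_TOKDNE | ISTAT_SOFTOK;          // both events pending

    // BUG: read-modify-write. The read returns *all* pending flags, so the
    // OR'd value written back clears SOFTOK along with TOKDNE.
    w1cWrite(usb0.ISTAT | ISTAT_TOKDNE);
    std::printf("after |= : ISTAT = 0x%02x (SOFTOK lost)\n", (unsigned)usb0.ISTAT);

    usb0.ISTAT = ISTAT_TOKDNE | ISTAT_SOFTOK;          // reset the scenario

    // Correct: write only the flag you intend to clear.
    w1cWrite(ISTAT_TOKDNE);
    std::printf("after  = : ISTAT = 0x%02x (SOFTOK preserved)\n", (unsigned)usb0.ISTAT);
}
```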
[handling all interrupt flags anyway] Yes, the chances of this screwing anything up are super slim, but if you were to get really pedantic you could argue that one of these flags could be set after ISTAT is locally buffered and before one of the |= operations, and therefore lost.

[double buffering] Just hacking around last week, I got double buffering working by simply clearing ODDRST after setup and then implementing it for my bulk endpoint, but it's a super hack and will need to be rewritten before production code. The advantage I see with double buffering is that the maximum interrupt latency tolerated while still maintaining maximum throughput is twice as long, because the hardware has two buffers it can send before it runs out.

[better synchronization] I think allowing stuff to access the buffers/SFRs outside of the ISR context is just wrong. Rather than giving every IN endpoint a buffer, I would allocate a few community buffers which also hold info like byte count and endpoint, and make two thread-safe queues: one acting as a free list and one acting as an output queue. A public call to write would then just fill a buffer, push it to the out queue, and set the USB ISR flag manually, which in turn would actually send the buffer. That way you don't need any further synchronization in the internals of the HAL. The problem with disabling the USB ISR alone as a synchronization mechanism is that interrupts could potentially be off for a while. Let's say we have the USB interrupt at a high priority and a few UARTs, I2C and other things with long timeouts running at a lower priority. These low-priority interrupts can't delay USB transmission directly, but if a lot of them occur while we are in a critical section in a function call…

Don't get me wrong, though: your solution is way better than what it replaced, and it's probably close to the best you can do without rewriting the API. If you're worried about the overhead of calling the ISR needlessly when the output buffers are full anyway, you could also check whether the buffer is empty and only set the interrupt flag if it is (because otherwise the interrupt will go off in the near future anyway). I think that would be pretty close to the maximum conceivable throughput.
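A rough sketch of the free-list / output-queue idea, under the assumption that all BDT and SFR access happens inside the USB ISR; the types, names and pool size are invented, and std::mutex merely stands in for an interrupt-masking critical section on real hardware.

```cpp
// Invented sketch of the buffer pool and queues; pendUsbIrq() stands in for
// NVIC_SetPendingIRQ(USB0_IRQn) on the real part.
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>

struct Packet {
    uint8_t ep;                    // target IN endpoint
    uint8_t len;                   // byte count
    uint8_t data[64];              // max full-speed bulk payload
};

template <std::size_t N>
class PacketQueue {                // minimal thread-safe FIFO
public:
    bool push(const Packet &p) {
        std::lock_guard<std::mutex> g(lock_);
        if (count_ == N) return false;
        buf_[(head_ + count_++) % N] = p;
        return true;
    }
    std::optional<Packet> pop() {
        std::lock_guard<std::mutex> g(lock_);
        if (count_ == 0) return std::nullopt;
        Packet p = buf_[head_];
        head_ = (head_ + 1) % N;
        --count_;
        return p;
    }
private:
    std::array<Packet, N> buf_{};
    std::size_t head_ = 0, count_ = 0;
    std::mutex lock_;
};

static PacketQueue<4> freeList;    // empty buffers owned by the application
static PacketQueue<4> outQueue;    // filled buffers waiting for the USB ISR

static void pendUsbIrq() { /* NVIC_SetPendingIRQ(USB0_IRQn) on the real part */ }

static void usbBufferPoolInit() {  // call once at startup
    for (int i = 0; i < 4; ++i) freeList.push(Packet{});
}

// Application-side write: fill a buffer, queue it, poke the ISR. All BDT and
// SFR access then stays inside the ISR, which drains outQueue and returns
// finished buffers to freeList.
static bool usbWrite(uint8_t ep, const uint8_t *src, uint8_t len) {
    std::optional<Packet> pkt = freeList.pop();
    if (!pkt) return false;                        // no free buffer
    if (len > sizeof(pkt->data)) {                 // oversize payload
        freeList.push(*pkt);                       // hand the buffer back
        return false;
    }
    pkt->ep = ep;
    pkt->len = len;
    for (uint8_t i = 0; i < len; ++i) pkt->data[i] = src[i];
    outQueue.push(*pkt);           // cannot overflow: pool size == capacity
    pendUsbIrq();
    return true;
}
```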
[Can you get the USB interrupt triggered manually, though?] Good point, I haven't tried it on this chip yet - maybe it's not possible. It has worked on LPCs in the past. Looking at the generic ARM documentation, setting the corresponding bit in NVIC_ISPR should generate an interrupt, but the Freescale docs say nothing, so I'll try it tomorrow.

[istat flags don't change inside ISR] I have seen the flags change when stepping through the ISR, and I can see in my log that the TOKDNE bit gets set again during an ISR, so I'm pretty certain this is a potential race condition, although bound to be super, super rare.

[ODDRST success] Like I said, it's a super hack. I will eventually need to get this working - my current design is super throughput-critical - and I will open-source whatever I come up with. It could be on kvasir.io, though, if I have to make big API changes.
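For reference, the CMSIS way to pend the interrupt from software, which is what gets confirmed a few comments below; USB0_IRQn is the vector name in the Kinetis device headers, and the commented line shows the equivalent raw ISPR write on a Cortex-M0+.

```cpp
// Pending the USB interrupt from software (CMSIS). USB0_IRQn comes from the
// Kinetis device header; the raw ISPR write is the Cortex-M0+ equivalent.
#include "MKL25Z4.h"   // or the MKL27Z header, depending on the target

static void triggerUsbIsr(void)
{
    NVIC_SetPendingIRQ(USB0_IRQn);   // CMSIS computes the ISPR bit for us

    // Equivalent direct register write on a Cortex-M0+ (single ISPR word):
    // NVIC->ISPR[0] = 1u << ((uint32_t)USB0_IRQn & 0x1Fu);
}
```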
[It does mean that you have to be careful to clear TOKDNE exactly once per ISR call, as clearing that bit has this complex side effect (it's not just a passive register write).] That's what I'm getting at: if you |= rather than = somewhere else, you may be clearing it twice per ISR.
I can confirm that setting the NVIC_ISPR bit to generate an interrupt in software works.

I have also looked at making my double-buffering hack pretty enough that other people can use it. The problem is that I am not going through the upper layers but just handling packets in my code injected right into the HAL. As soon as I go through upper layers, the buffer even/oddness gets lost. OK, I could read STAT again and figure it out, but it just seems like a lot of work to fight with the API. At the end of the day, if someone is super throughput-concerned then this API is not for them.

One thing I particularly don't understand about the API is why the HAL notifies upper layers about a packet which has arrived and then the upper layer calls the endpointReadResult function to actually read it. Why not read it in the HAL and pass the packet to the upper layer? That would save indirect calls and information loss like even/odd-ness. I also consider the fact that setup packets are decoded into a data structure and then referenced from all over the place bad design. The upper layer does look at the last OUT packet when an IN request comes in, as is the case for USBCallback_requestCompleted(); however, the way this is done is just plain ugly and totally breaks encapsulation. If one were to model USBDevice and USBCdc etc. as what they truly are - state machines - then none of this 'magic state which is private but actually more like protected but can also be accessed by an interrupt which finds it through a static instance pointer at any time' is needed, and the object lifetimes of information like line coding are suddenly well defined. Sure, we need to keep some state around in order to handle control transfers (http://www.beyondlogic.org/usbnutshell/usb4.shtml), but let's model it as a state machine and only keep a pointer to the last data packet throughout…
Great - sounds like your all-ISR based design works as far as that goes. Pondering this more, though, I'm not sure there's actually any benefit to generating an interrupt artificially. The reason to handle everything in the ISR is that there are no threading issues, right? Because you can't get another of the same interrupt while you're already handling one. The thing is, that's just the same as creating a critical section. I think it might actually be detrimental to do the synthetic interrupts because it could create race conditions (I'm thinking missed interrupts) when the SIE hardware wants to generate a hard interrupt while you're handling a soft one. It also just strikes me as less efficient in that you have to go through that whole 7-branch if-else tree in the ISR to see what work there is to do, even though you already know exactly what work you wanted to accomplish because you initiated the fake interrupt. So I'm thinking that the ISR isn't really the right abstraction point here. I think the basic idea is on the right track, though, that you want to unify all (or most) of the buffer operations into common code rather than having gobs of separate ISR and application code.
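The critical-section alternative being argued for here could look roughly like this; the function name is illustrative, the CMSIS calls are standard, and USB0_IRQn is the Kinetis vector name.

```cpp
// Illustrative only: serialise against the USB ISR by masking just that IRQ
// around the shared bookkeeping instead of pending a synthetic interrupt.
#include "MKL25Z4.h"   // CMSIS + Kinetis device header

static void updateSharedUsbState(void)
{
    NVIC_DisableIRQ(USB0_IRQn);   // the USB ISR cannot preempt this section
    __DSB();                      // make sure the disable has taken effect
    __ISB();

    // ...touch the BDT entries / pending-transfer bookkeeping here...

    NVIC_EnableIRQ(USB0_IRQn);    // an IRQ raised meanwhile is still pended
}
```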
[software setting interrupts] If we think of the USBHAL as being chip-specific and the USBDevice as being purely stuff that is in the standard, then there is a very limited set of data flow paths from USBDevice to the USBHAL. The obvious flow path would be packets: if we make a data structure with a buffer, a size and an endpoint number, push it to a queue and enable the interrupt, then we don't actually need any information about the USBHAL - we just need to know where to push it to. Besides packets, we have things like stallEndpoint, setAddress, addEndpoint etc. All of these are essentially the result of a control packet going upstream. If we make the USBDevice handler return a sort of hardware-command variant downstream, I think all other special cases could be propagated to the USBHAL without the USBDevice actually having to know anything about it.

Breaking the recursive dependency between USBDevice and USBHAL means we can easily pass the USBDevice as a template parameter to the USBHAL, make all the member functions and data static, and hello static dispatch. Eliminating virtual functions should shrink the FLASH footprint enormously! I implemented a USB CDC for a bootloader on an LPC1768 which turned out to be almost 10x smaller than the mbed equivalent. If the USBDevice were to take the "packet buffer and queue pool" as a template parameter (kind of like an allocator in the STL), the user would have far more control over the amount of buffer space used.

@salkinium I see you have some experience with the STM32F072 - can you comment on the feasibility of this kind of HAL with it? (Totally self-interested, as I will need to get USB working on that chip as well in the future; all the LPC11U, LPC15 and LPC17 parts I have worked with should be implementable with the proposed HAL.)
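A compressed sketch of the statically dispatched layering being proposed; every name below is hypothetical and the packet plumbing is reduced to a stub, so it only shows the dependency direction and the compile-time binding.

```cpp
// Hypothetical layering sketch -- not the existing mbed USBDevice/USBHAL API.
#include <cstdint>
#include <cstdio>

struct PacketView {
    uint8_t        ep;
    const uint8_t *data;
    uint8_t        len;
};

// "USBDevice" layer: pure USB-standard logic, no knowledge of the chip.
struct MyUsbDevice {
    static void onPacket(const PacketView &p) {
        std::printf("EP%d: %d bytes\n", p.ep, p.len);
    }
};

// "USBHAL" layer: chip specific, parameterised on the device class, so the
// upstream call is resolved at compile time (no vtable, friendly to LTO).
template <typename Device>
struct UsbHal {
    static void isr() {
        // ...read the token/BDT state from the hardware here...
        static const uint8_t payload[] = { 0x01, 0x02, 0x03 };
        Device::onPacket(PacketView{1, payload, sizeof(payload)});
    }
};

int main() {
    UsbHal<MyUsbDevice>::isr();   // stand-in for the real interrupt entry
}
```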
Forwarding to @ekiwi who has experience with the STM32F072. |
OK, correction: we do need to know which NVIC_ISPR bit to set, but that could be resolved using a traits class or the linker.
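A traits class along those lines might look like this; the chip tags and IRQ numbers are placeholders (the real value would come from each part's USB0_IRQn definition).

```cpp
// Hypothetical traits approach; the IRQ numbers here are placeholders -- take
// the real values from the USB0_IRQn definition in each part's device header.
template <typename Chip>
struct UsbIrqTraits;                    // no generic definition: unknown chips fail to compile

struct Kl25z  {};
struct Mkl27z {};

template <> struct UsbIrqTraits<Kl25z>  { static constexpr int usbIrq = 24; };
template <> struct UsbIrqTraits<Mkl27z> { static constexpr int usbIrq = 24; };

template <typename Chip>
void pendUsbIrq()
{
    // NVIC_SetPendingIRQ(static_cast<IRQn_Type>(UsbIrqTraits<Chip>::usbIrq));
}
```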
According to this site: http://www.usbmadesimple.co.uk/ums_3.htm

"Control Transfer - This is a bi-directional transfer which uses both an IN and an OUT endpoint. Each control transfer is made up of from 2 to several transactions. It is divided into three stages. The SETUP stage carries 8 bytes called the Setup packet. This defines the request, and specifies whether and how much data should be transferred in the DATA stage. The DATA stage is optional. If present, it always starts with a transaction containing a DATA1. The type of transaction then alternates between DATA0 and DATA1 until all the required data has been transferred. The STATUS stage is a transaction containing a zero-length DATA1 packet. If the DATA stage was IN then the STATUS stage is OUT, and vice versa. Control transfers are used for initial configuration of the device by the host, using Endpoint 0 OUT and Endpoint 0 IN, which are reserved for this purpose. They may be used (on the same endpoints) after configuration as part of the device-specific control protocol, if required. The max packet size for the data stage is 8 bytes at low speed, 8, 16, 32 or 64 at full speed and 64 for high speed."

This seems to suggest that DATA0/DATA1-ness does not actually always alternate but "starts over" at DATA1 with each stage change in a control transfer. If this is true, it's maybe why you experienced difficulties with double buffering. I think control transfers need to be modeled as a state machine in the HAL, since the USBDevice should not need to know about DATA0/1-ness. Multi-packet transfers from host to device are also theoretically possible, so the USBDevice needs to accept an array of packets before handling everything and pushing an ack downstream.

The only thing that seems to break my HAL concept so far is GET_STATUS -> ENDPOINT_RECIPIENT. This should naturally be handled in USBDevice, since decoding it at least is not hardware-specific. The response, however, is hardware-specific in that we need information about the endpoint stall state. I could send a packet downstream with everything except the stall state and then have one 'hardware command' be "write stall status into buffer with pointer x". This seems very dirty, though (although still not as bad as the current mbed implementation).
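The stage behaviour quoted above is easy to capture as a small state machine in the HAL, which is roughly what is being argued for; the following is an illustrative sketch, not the mbed implementation.

```cpp
// Minimal sketch of modelling the control-transfer stages (and the DATA0/1
// reset per stage described in the quote) as a state machine inside the HAL.
#include <cstdint>

enum class Ep0Stage : uint8_t { Idle, DataIn, DataOut, StatusIn, StatusOut };

struct Ep0StateMachine {
    Ep0Stage stage = Ep0Stage::Idle;
    bool     data1 = false;           // toggle for the next EP0 packet

    void onSetup(bool hasDataStage, bool deviceToHost) {
        // SETUP itself is always DATA0; the DATA stage (if any) restarts at DATA1.
        data1 = true;
        if (!hasDataStage)
            stage = Ep0Stage::StatusIn;              // zero-length DATA1 IN
        else
            stage = deviceToHost ? Ep0Stage::DataIn : Ep0Stage::DataOut;
    }

    void onDataPacketDone(bool lastPacket) {
        data1 = !data1;                              // alternates within the stage
        if (lastPacket) {
            // STATUS runs in the opposite direction and is always DATA1.
            stage = (stage == Ep0Stage::DataIn) ? Ep0Stage::StatusOut
                                                : Ep0Stage::StatusIn;
            data1 = true;
        }
    }

    void onStatusDone() { stage = Ep0Stage::Idle; }
};
```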
To me this is a HAL problem, because no one can really know how big the transfer can ever get. The user can theoretically send large amounts of data over control transfers, so putting them in a buffer that lives in USBDevice is stupid; without passing an array of pointers to packets, this cannot really be done well. In my proposed HAL, the user can decide how big the biggest control transfer can be simply by specifying enough packets in the pool.
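A sketch of what "specifying enough packets in the pool" could mean in practice, kind of like an STL allocator parameter; the class and parameter names are hypothetical.

```cpp
// Hypothetical API sketch: the pool size is a template parameter, so the user
// decides how large a control transfer the stack can buffer.
#include <array>
#include <cstddef>
#include <cstdint>

struct Packet { uint8_t len; uint8_t data[64]; };

template <typename Device, std::size_t PoolSize>
class UsbStack {
    std::array<Packet, PoolSize> pool_{};   // largest control transfer =
                                            // PoolSize * 64 bytes of payload
};

// A device that never expects large control transfers keeps the pool small,
// e.g. UsbStack<MyUsbDevice, 2>; one that accepts big vendor requests might
// instantiate UsbStack<MyUsbDevice, 16> instead.
```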
I think I have an API for a new HAL worked out which will provide the following advantages over the current one:
USBDevice takes as template parameters:
USBHAL takes a USBDevice and calls its static members:
I built a USB error maker out of a microcontroller and a few transistors and capacitors which causes physical-layer errors every 5 milliseconds by forcing a few recessive bits. Since a USB frame is 1 ms, there should be plenty of room on the bus for data to get through if error handling works correctly. Using the mbed implementation, device enumeration fails 9 out of 10 times, so there is surely room to improve the error handling as well.
I recently committed my USB abstraction to Kvasir, in case anyone is interested in moving the conversation there.