TODO: Make sure tcp_send returns an error on input pbufs that are larger than the MSS. Also provide an API to probe the MSS.
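A minimal sketch of the MSS check and the probe API; the names below (tcp_probe_mss, the pbuf and socket layouts) are illustrative assumptions, not the actual ndpip definitions:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct pbuf { size_t len; };          /* hypothetical pbuf layout */
struct tcp_sock { size_t mss; };      /* hypothetical socket layout */

/* Proposed probe API: let the application query the MSS up front. */
static size_t tcp_probe_mss(const struct tcp_sock *s)
{
    return s->mss;
}

/* tcp_send should fail fast on any pbuf larger than the MSS. */
static int tcp_send(struct tcp_sock *s, struct pbuf **bufs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (bufs[i]->len > s->mss)
            return -EMSGSIZE;
    /* ... enqueue the pbufs for transmission ... */
    return 0;
}

int main(void)
{
    struct tcp_sock s = { .mss = 1460 };
    struct pbuf big = { .len = 2048 };
    struct pbuf *v[1] = { &big };
    printf("mss=%zu, send=%d\n", tcp_probe_mss(&s), tcp_send(&s, v, 1));
    return 0;
}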
TODO: Eliminate race conditions between the application and worker threads, for example over rto_timer. This could be done by moving the TX and RX logic onto the same worker thread, with a separate application thread that handles socket control and enqueues and dequeues data segments. Such a worker-thread model might be the easiest way to solve concurrency issues on TCP sockets.
Assuming a worker-thread model with both RX and TX logic on the same thread, should packets be enqueued individually, or should whole sockets be enqueued?
Enqueueing sockets has the advantage that the socket pointer is dereferenced only once per socket burst. However, a ring buffer with fixed capacity requires that only unique sockets be present inside it, and scanning the ring for uniqueness on socket add creates a race condition between the application thread and the worker thread, because the ring's semantics are single-producer single-consumer. For example, while the application thread scans the ring, a ring pop may occur on the worker thread; this shortens the ring while it is being read, potentially leading to an out-of-bounds read and unpredictable behaviour. (One flag-based workaround is sketched after this discussion.)
If a shorter ring is used without uniqueness constraints, then on overflow the packets must be dequeued from the socket, the API call must fail, and the packets must be returned to the sender. This would cause the application thread to busy-wait on the socket, unless the epoll API exposes a notification when the ring frees up. A non-comprehensive ring may not be necessary, as we need not optimize for an unbounded number of concurrent sending threads.
If a linked list is used instead, locks must be held to keep the list consistent; this would cause contention during bidirectional traffic and slow down ACK processing in send-heavy applications.
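One way to keep the ring unique without ever scanning it would be a per-socket atomic "already enqueued" flag: the application thread pushes a socket only if it wins the flag, and the worker clears the flag before draining. A sketch under these assumptions (the in_tx_ring field and the toy SPSC ring are invented for illustration):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tcp_sock {
    atomic_bool in_tx_ring;   /* invented field: "already enqueued" marker */
};

/* Tiny illustrative SPSC ring of socket pointers. */
#define RING_CAP 64
static struct tcp_sock *ring[RING_CAP];
static atomic_size_t ring_head;   /* advanced only by the worker (consumer) */
static atomic_size_t ring_tail;   /* advanced only by the app (producer) */

static bool ring_push(struct tcp_sock *s)
{
    size_t t = atomic_load(&ring_tail);
    if (t - atomic_load(&ring_head) == RING_CAP)
        return false;                     /* full */
    ring[t % RING_CAP] = s;
    atomic_store(&ring_tail, t + 1);
    return true;
}

static struct tcp_sock *ring_pop(void)
{
    size_t h = atomic_load(&ring_head);
    if (h == atomic_load(&ring_tail))
        return NULL;                      /* empty */
    struct tcp_sock *s = ring[h % RING_CAP];
    atomic_store(&ring_head, h + 1);
    return s;
}

/* Application thread: push the socket only if it is not already queued,
 * without scanning the ring. */
static bool socket_enqueue(struct tcp_sock *s)
{
    bool expected = false;
    if (!atomic_compare_exchange_strong(&s->in_tx_ring, &expected, true))
        return true;                      /* already in the ring */
    if (!ring_push(s)) {
        atomic_store(&s->in_tx_ring, false);  /* roll back on a full ring */
        return false;
    }
    return true;
}

/* Worker thread: clear the flag *before* draining, so a producer that
 * enqueues data afterwards wins the flag again and re-adds the socket;
 * no segment can be stranded. */
static void worker_service_one(void)
{
    struct tcp_sock *s = ring_pop();
    if (s == NULL)
        return;
    atomic_store(&s->in_tx_ring, false);
    /* ... drain and transmit the socket's pending segments ... */
}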
TODO: On the RX side, when receiving segments smaller than the MSS, move them to a malloc-based allocator and push them to the application.
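A rough sketch of that copy-out, assuming a hypothetical pbuf layout and leaving the actual delivery hook out:

#include <stdlib.h>
#include <string.h>
#include <stddef.h>

struct pbuf { size_t len; unsigned char *payload; };  /* hypothetical layout */

/* Segments shorter than the MSS are copied into a plain malloc'ed buffer
 * so the scarce, driver-owned pbuf can be recycled immediately; the copy
 * is what gets pushed to the application. */
static int rx_copy_small(const struct pbuf *pb, size_t mss,
                         unsigned char **out, size_t *out_len)
{
    if (pb->len >= mss)
        return 0;             /* full-sized segment: deliver in place */

    unsigned char *copy = malloc(pb->len);
    if (copy == NULL)
        return -1;

    memcpy(copy, pb->payload, pb->len);
    *out = copy;
    *out_len = pb->len;
    return 1;                 /* copied: the pbuf can go back to the RX pool */
}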
Another issue is that the state of a TCP socket must be consistent between API calls on the application thread and reads on the worker thread. Consider, for example, closing a TCP socket on the application thread while the worker thread is feeding packets into it after having already read the socket state: the socket would send superfluous packets when the application expects it to be closed. A state change made by one thread is not guaranteed to be visible to the other consistently, nor instantaneously; the real concern is whether the state is consistent for the duration of packet processing. Even then, conflicts between the application and worker threads on TCP state changes remain possible: for example, a closed socket could be set to CONNECTED if a SYN+ACK packet arrives at an inconvenient time.
Another potential solution, and perhaps the easiest, would be to place locks around socket processing on both the worker and application threads.
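For the specific SYN+ACK race, one lock-free alternative would be to make the socket state atomic and let the worker transition it with a compare-and-swap, so a SYN+ACK can only move a socket from SYN_SENT to CONNECTED and never resurrects one the application has closed. The state names and layout below are assumptions:

#include <stdatomic.h>
#include <stdbool.h>

enum tcp_state { TCP_CLOSED, TCP_SYN_SENT, TCP_CONNECTED };  /* abridged */

struct tcp_sock {
    _Atomic enum tcp_state state;
};

/* Worker thread, on SYN+ACK: only the SYN_SENT -> CONNECTED transition
 * is allowed; if the application closed the socket in the meantime the
 * CAS fails and the segment is dropped instead of resurrecting it. */
static bool on_syn_ack(struct tcp_sock *s)
{
    enum tcp_state expected = TCP_SYN_SENT;
    return atomic_compare_exchange_strong(&s->state, &expected,
                                          TCP_CONNECTED);
}

/* Application thread: closing is an unconditional store; any in-flight
 * CAS on the worker side will then fail. */
static void app_close(struct tcp_sock *s)
{
    atomic_store(&s->state, TCP_CLOSED);
    /* ... hand off a FIN/RST to the worker thread to transmit ... */
}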
TODO: Presently packets are delivered to the receiving socket once per burst. The performance gain is greatest when there is only one CONNECTED socket and worst when a single burst contains packets from as many distinct sockets as the burst size. A solution would be to deliver packets to a socket only once its pad buffer is full, say 64 packets, or on the expiration of a timer. The performance increase is due to the fact that ndpip_ring_push_one is very slow, while ndpip_ring_push is faster.
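A sketch of the proposed pad-buffer batching; ndpip_ring_push is named above but its signature here is assumed, and the pad-buffer layout and deadline handling are invented:

#include <stddef.h>
#include <stdint.h>

#define PAD_CAP 64

struct sock_pad {
    void    *pkts[PAD_CAP];   /* packets staged for one socket */
    size_t   count;
    uint64_t deadline;        /* flush-by time, in an arbitrary clock unit */
};

/* Assumed bulk-push signature: one call amortizes the per-call overhead
 * that makes ndpip_ring_push_one slow. */
int ndpip_ring_push(void *ring, void **pkts, size_t count);

static void pad_flush(void *ring, struct sock_pad *pad)
{
    if (pad->count == 0)
        return;
    ndpip_ring_push(ring, pad->pkts, pad->count);
    pad->count = 0;
}

/* Stage one packet; flush on a full pad or an expired deadline. */
static void pad_deliver(void *ring, struct sock_pad *pad, void *pkt,
                        uint64_t now)
{
    pad->pkts[pad->count++] = pkt;
    if (pad->count == PAD_CAP || now >= pad->deadline)
        pad_flush(ring, pad);
}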