Failure to create VPorts - sn_host_ioctl_create_netdev():447 invalid BAR address #918
Comments
This is DPDK's idea of "cannot convert an address". See ./lib/librte_eal/common/include/rte_memory.h and ./lib/librte_eal/common/rte_malloc.c in the DPDK source. |
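For readers who want to see what "cannot convert an address" looks like in practice, here is a minimal sketch. It assumes the virtual-to-physical lookup goes through /proc/self/pagemap (my assumption, not something confirmed in this thread) and that failure is reported as an all-ones address, which matches the phys=ffffffffffffffff / 18446744073709551615 values in the original report. The function names are hypothetical, and the rules for treating an entry as unusable are illustrative rather than DPDK's exact code.

```python
import struct

PAGE_SIZE = 4096                 # assumes regular 4 KiB pages for the lookup
PFN_MASK = (1 << 55) - 1         # pagemap bits 0-54: page frame number
PRESENT_BIT = 1 << 63            # pagemap bit 63: page present in RAM
BAD_PHYS_ADDR = (1 << 64) - 1    # all ones, i.e. (uint64_t)-1


def phys_from_pagemap_entry(entry, virt, page_size=PAGE_SIZE):
    """Turn one 64-bit pagemap entry into a physical address (hypothetical helper)."""
    if not entry & PRESENT_BIT:
        return BAD_PHYS_ADDR     # page not resident, nothing to convert
    pfn = entry & PFN_MASK
    if pfn == 0:
        return BAD_PHYS_ADDR     # kernel hides the PFN from unprivileged readers
    # The offset inside the page is the same in both address spaces.
    return pfn * page_size + (virt % page_size)


def virt_to_phys(virt):
    """Look up the calling process's own mapping; Linux only."""
    try:
        with open("/proc/self/pagemap", "rb") as f:
            f.seek((virt // PAGE_SIZE) * 8)   # one 8-byte entry per page
            raw = f.read(8)
    except OSError:
        return BAD_PHYS_ADDR
    if len(raw) != 8:
        return BAD_PHYS_ADDR
    (entry,) = struct.unpack("=Q", raw)
    return phys_from_pagemap_entry(entry, virt)
```

An entry that is not present, or whose frame number is hidden, yields the same all-ones value that shows up in the dmesg and bessd.INFO output of the report.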
It now works if I do not pass -m 0 when starting bess, so that hugepages get allocated and DPDK works. I wonder whether this is a bug: there is an option to run bess without hugepages and with DPDK disabled, yet the kernel module still relies on DPDK for the address conversion. |
Seeing this again, after I start BESS with some DPDK bound ports.
I1024 14:45:51.234269 20 bessctl.cc:1004] CreatePortRequest from client:
[3011571.907763] bess - sn_host_ioctl_create_netdev():449 invalid BAR address: phys=7efd6f47b780 virt=00000000b3a38191 |
This is some sort of DPDK bug. I will try to dig into it further, but we should consider upgrading to a later DPDK. |
Ack, will try moving to a newer DPDK. I saw some release notes saying that bugs with using ioctl inside containers were fixed in DPDK 19.x. Let's see what happens. |
Interestingly, there is now a different problem after moving to DPDK 19.08. It happens between the AllocBar function in vport.cc and sn_create_netdev: the memory is being corrupted between the allocation and netdev creation time. When rte_zmalloc is used for the allocation I now hit:
[88961.395932] bess - sn_create_netdev():765 invalid BAR size 0
[90433.161394] bess - sn_create_netdev():765 invalid BAR size 0
which comes from the BAR size check in sn_create_netdev.
However, I know the BAR size set in AllocBar is correct. If I modify the code to use rte_zmalloc_socket and specify socket 0, I get the same result. However, if I specify socket 1, I get a different error:
[90645.592499] bess - sn_create_netdev():774 invalid ioctl arguments: num_txq=18627, num_rxq=17803
I know that in AllocBar num_txq and num_rxq are both set to 1, which makes me think there is a memory corruption issue here. @kot-begemot-uk any ideas on how to proceed? |
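To illustrate why a corrupted (or wrongly addressed) configuration area surfaces as these two different messages, here is a rough sketch of the order of checks implied by the log line numbers (765 before 774). The limits MIN_BAR_SIZE and MAX_QUEUES are hypothetical stand-ins, not values taken from the BESS kernel module, and validate_conf is an illustrative name.

```python
MIN_BAR_SIZE = 64    # hypothetical stand-in for sizeof(struct sn_conf_space)
MAX_QUEUES = 128     # hypothetical upper bound on queue counts


def validate_conf(bar_size, num_txq, num_rxq):
    """Return the first error the checks would report, or None if all pass."""
    # The BAR size is checked first, so zeroed memory stops here.
    if bar_size < MIN_BAR_SIZE:
        return f"invalid BAR size {bar_size}"
    # Garbage with a large enough size field gets past the first check
    # and is caught by the queue-count check instead.
    if not (1 <= num_txq <= MAX_QUEUES) or not (1 <= num_rxq <= MAX_QUEUES):
        return f"invalid ioctl arguments: num_txq={num_txq}, num_rxq={num_rxq}"
    return None
```

With the values from the report, validate_conf(0, 1, 1) produces the BAR size message, while validate_conf(9408, 18627, 17803) produces the ioctl arguments message. Both outcomes are consistent with the kernel reading memory other than what AllocBar wrote.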
This looks like memory corruption similar to the one I proposed to fix
in the 64 vs 128 max CPUs patch.
Which patches do you have applied to your tree and what is the hardware
config on the machine?
A.
…On 01/11/2019 17:08, Tim Rozet wrote:
Interestingly there is now a different problem after moving to DPDK
19.08. The problem is happening between the AllocBar function in
vport.cc and sn_create_netdev:
https://github.com/NetSys/bess/blob/master/core/drivers/vport.cc#L242
https://github.com/NetSys/bess/blob/master/core/kmod/sn_netdev.c#L755
The memory is being corrupted between the allocation and netdev create
time. I can see that when rte_zmalloc is called I now hit:
[88961.395932] bess - sn_create_netdev():765 invalid BAR size 0
[90433.161394] bess - sn_create_netdev():765 invalid BAR size 0
Which comes from sn_create_netdev:
if (conf->bar_size < sizeof(struct sn_conf_space)) {
        log_err("invalid BAR size %llu\n", conf->bar_size);
        return -EINVAL;
}
However I know the barsize set in AllocBar is correct. If I modify the
code to use rte_zmalloc_socket and specify socket 0, I get the same
result. However, if I use rte_zmalloc_socket and specify socket 1, I
get a different error:
[90645.592499] bess - sn_create_netdev():774 invalid ioctl arguments:
num_txq=18627, num_rxq=17803
I know that in AllocBar the num_txq and num_rxq are both set to 1,
which makes me think there is some memory corruption issue here.
@kot-begemot-uk <https://github.com/kot-begemot-uk> any ideas on how
to proceed?
--
Anton R. Ivanov
https://www.kot-begemot.co.uk/
|
/**
* phys_to_virt - map physical address to virtual
* @address: address to remap
*
* The returned virtual address is a current CPU mapping for
* the memory address given. It is only valid to use this function on
* addresses that have a kernel mapping
*
* This function does not handle bus mappings for DMA transfers. In
* almost all conceivable cases a device driver should not be using
* this function
*/
Note the second paragraph.
A.
|
This is seriously wrong - you cannot have an even address in phys become an odd address in virtual and vice versa. I need a bit more info on the actual case where you see this to debug it as I myself cannot reproduce it. |
I'm using the patches from #943. I also just added the patch from #946 so that I can isolate to a single NUMA node. I only see this on the 2-socket servers you and I had both been using. So far I do not see any issues on my laptop, even with the newer DPDK. I'll PM you the details of the setup so you can debug there. |
What DPDK returns is simply not normal. It returns phys = virt. Phys should be != virt if we are talking about the same terms: https://www.kernel.org/doc/Documentation/arm/Porting As a result, if the address happens to be inside the kernel memory map you get memory corruption, if not you get it bounced as invalid. Is it using 1G pages? |
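The symptoms reported so far in this thread can be told apart mechanically. The sketch below is only a sanity check under stated assumptions: 4 KiB page granularity (a translation preserves the offset within a page, so this is a necessary condition, not a sufficient one) and an all-ones value as the "cannot convert" marker. The function name and labels are hypothetical.

```python
PAGE_SIZE = 4096
BAD_ADDR = (1 << 64) - 1   # all ones: conversion failed


def classify(virt, phys, page_size=PAGE_SIZE):
    """Label a (virt, phys) pair reported by the logs."""
    if phys == BAD_ADDR:
        return "unconvertible"        # e.g. no hugepages / lookup failed
    if phys == virt:
        return "phys-equals-virt"     # what IOVA-as-VA mode would hand back
    if (phys ^ virt) & (page_size - 1):
        return "offset-mismatch"      # in-page offsets differ: not one mapping
    return "plausible"


# Values copied from the logs in this thread.
print(classify(0x7f4db7a5ad00, 18446744073709551615))
print(classify(0x00000000b3a38191, 0x7efd6f47b780))
```

One caveat: if the kernel module prints the virt value with a plain %p, recent kernels print a hashed pointer, so the virt value taken from dmesg may not be the real address and an "offset-mismatch" result from dmesg values alone should be treated with care.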
2M hugepages. I managed to get around the issue by using PA mode for IOVA instead of IOMMU/VA mode, by adding rte_args.Append({"--iova-mode=pa"}); to init_eal().
|
That explains it.
If it is in IOMMU (VA) mode, the driver will need something from the remap
family of functions to get the right address.
In PA mode you get real physical addresses and it should all work.
I will check whether there is a way to tell the driver to switch
between the two.
A.
On 05/11/2019 15:15, Tim Rozet wrote:
2M hugepages. I managed to get around the issue by using PA mode for
IOVA instead of IOMMU/VA mode:
'''
@@ -135,6 +137,7 @@ void init_eal(int dpdk_mb_per_socket, int socket, std::string nonworker_corelist
     rte_args.Append({"--huge-unlink"});
   }
+  rte_args.Append({"--iova-mode=pa"});
'''
|
Upon trying to create VPort for either a container or just on the host, I see:
localhost:10514 $ run file /home/trozet/Code/bess/bessctl/trozet.py
*** Error: Unhandled exception in the configuration script (most recent call last)
File "/home/trozet/Code/bess/bessctl/trozet.py", line 1, in <module>
container_if = VPort(ifname='eth_host_test', ip_addrs=['10.255.99.1/24'])
File "/home/trozet/Code/bess/bessctl/../pybess/port.py", line 42, in __init__
self.choose_arg(None, kwargs))
File "/home/trozet/Code/bess/bessctl/../pybess/bess.py", line 383, in create_port
return self._request('CreatePort', request)
File "/home/trozet/Code/bess/bessctl/../pybess/bess.py", line 272, in _request
raise self.Error(code, errmsg, query=name, query_arg=req_dict)
*** Error: SN_IOC_CREATE_HOSTNIC failure
BESS daemon response - errno=1 (EPERM: Operation not permitted)
query: CreatePort
query_arg: {'driver': 'VPort', 'arg': {'type_url': 'type.googleapis.com/bess.pb.VPortArg', 'value': '\n\reth_host_testJ\x0e10.255.99.1/24'}}
[trozet@localhost bin]$ dmesg | tail -n 1
[778856.477580] bess - sn_host_ioctl_create_netdev():447 invalid BAR address: phys=ffffffffffffffff virt=00000000501baed2
[trozet@localhost bin]$ uname -a
Linux localhost.localdomain 4.18.12-200.fc28.x86_64 #1 SMP Thu Oct 4 15:46:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
bessd.INFO:
I0523 11:47:05.589193 13205 bessctl.cc:487] *** All workers have been paused ***
I0523 11:47:05.590729 13206 vport.cc:241] BAR total_bytes = 9408
I0523 11:47:05.590832 13206 vport.cc:545] virt: 0x7f4db7a5ad00, phys: 18446744073709551615
I0523 11:47:05.592839 13207 bessctl.cc:691] Checking scheduling constraints