.. _performance:

Performance Guide
=================

The BNG Blaster handles all traffic sent and received (I/O) in the main thread by default.
With this default behavior, you can achieve between 100.000 and 250.000 PPS bidirectional
traffic in most environments. Depending on the actual setup, this can be even less or much
more, which is primarily driven by the single-thread performance of the given CPU.

Those numbers can be increased by splitting the workload over multiple I/O worker threads.
Every I/O thread will handle only one interface and direction. It is also possible to start
multiple threads for the same interface and direction.

The number of I/O threads can be configured globally for all interfaces or per interface link.

.. code-block:: json

    {
        "interfaces": {
            "rx-threads": 2,
            "tx-threads": 1,
            "links": [
                {
                    "interface": "eth1",
                    "rx-threads": 4,
                    "tx-threads": 2
                }
            ]
        }
    }

The configuration per interface link allows asymmetric thread pools. Assume you send
massive unidirectional traffic from eth1 to eth2. In such a scenario, you would set up multiple
TX threads and one RX thread on eth1. For eth2, you would do the opposite, meaning multiple
RX threads but only one TX thread.

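The following sketch shows how such an asymmetric setup could look, assuming the eth1/eth2
scenario above (the interface names and thread counts are placeholders, not a recommendation):

.. code-block:: json

    {
        "interfaces": {
            "links": [
                {
                    "interface": "eth1",
                    "rx-threads": 1,
                    "tx-threads": 4
                },
                {
                    "interface": "eth2",
                    "rx-threads": 4,
                    "tx-threads": 1
                }
            ]
        }
    }
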
It is also possible to start dedicated threads for TX but keep RX in the main thread, or
vice versa, by setting the number of threads to zero (default).

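As a sketch, the following per-link configuration (the interface name eth1 is a placeholder)
starts two dedicated TX threads while RX stays in the main thread because ``rx-threads``
remains at zero:

.. code-block:: json

    {
        "interfaces": {
            "links": [
                {
                    "interface": "eth1",
                    "rx-threads": 0,
                    "tx-threads": 2
                }
            ]
        }
    }
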
With multithreading, you should be able to scale up to 8 million PPS bidirectional, depending on
the actual configuration and setup. This allows, for example, starting 1 million flows with 1 PPS
per flow over at least 4 TX threads to verify all prefixes of a BGP full table.

The configured traffic streams are automatically balanced over all TX threads of the corresponding
interfaces, but a single stream can't be split over multiple threads to prevent re-ordering issues.

Enabling multithreaded I/O comes with some limitations. First of all, it works only on systems with
CPU cache coherence, which should apply to all modern CPU architectures. TX threads are not allowed
for LAG (Link Aggregation) interfaces, but RX threads are supported. It is also not possible to capture
traffic streams sent or received on threaded interfaces. All other traffic is still captured on threaded
interfaces.

.. note::

    The BNG Blaster is currently tested for 8 million PPS with 10 million flows, which is not a
    hard limitation, but everything above should be considered with caution. It is also possible to
    scale far beyond using DPDK-enabled interfaces.

    A single stream will always be handled by a single thread to prevent re-ordering.

It is also recommended to increase the hardware and software queue sizes of your
network interface links to the maximum for higher throughput, as explained
in the :ref:`Operating System Settings <interfaces>`.

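For example, ``ethtool -g`` displays the current and maximum ring sizes and ``ethtool -G`` sets
them. The values below are only an illustration; the supported maximum depends on the adapter:

.. code-block:: none

    ethtool -g <interface>
    ethtool -G <interface> rx 4096 tx 4096
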
The packet receive performance can be increased by raising the number of RX threads and I/O slots (``io-slots``).

.. code-block:: json

    {
        "interfaces": {
            "rx-threads": 20,
            "io-slots": 32768
        }
    }

The packet receive performance is also limited by the ability of your network
interfaces to properly distribute the traffic over multiple hardware queues using
receive side scaling (RSS). This is a technology that allows network applications
to distribute the processing of incoming network packets across multiple CPUs,
improving performance. RSS uses a hashing function to assign packets to different
CPUs based on their source and destination addresses and ports. RSS requires
hardware support from the network adapter and the driver.

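Whether and how the queue layout can be tuned depends on the adapter and driver, but as an
illustration, ``ethtool -l`` shows the number of available channels (queues) and ``ethtool -L``
changes it:

.. code-block:: none

    ethtool -l <interface>
    ethtool -L <interface> combined 8
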
Some network interfaces are not able to distribute traffic for PPPoE/L2TP or even
MPLS traffic. Even double-tagged VLANs with the default type 0x8100 are
often not supported.

Therefore, the best results can be achieved with single-tagged IPoE traffic. Depending
on the actual network adapter, there are different options to address this
limitation. For instance, Intel adapters support different
Dynamic Device Personalization (DDP) profiles to enable RSS for PPPoE traffic.

You can also boost the performance by adjusting some driver settings. For example,
we found that the following settings improved the performance for the
`Intel 700 Series <https://www.kernel.org/doc/html/v6.6/networking/device_drivers/ethernet/intel/i40e.html>`_
in some of our tests. However, these settings may vary depending on your specific
test environment.

.. code-block:: none

    ethtool -C <interface> adaptive-rx off adaptive-tx off rx-usecs 125 tx-usecs 125

.. note::

    We are continuously working to increase performance. Contributions, proposals,
    or recommendations on how to further increase performance are welcome!

NUMA
----

NUMA, which stands for Non-Uniform Memory Access, is a computer memory design used in multi-processor systems.
In a NUMA system, each processor, or a group of processors, has its own local memory. The processors can access
their own local memory faster than non-local memory, which is the memory local to another processor or shared
between processors.

On such systems, the best performance can be achieved by manually assigning RX and TX threads to a set of CPUs
to ensure that the corresponding threads of an interface are running on the same NUMA node. The NUMA node
of the interface can be derived from the file ``/sys/class/net/<interface>/device/numa_node``.

.. code-block:: none

    cat /sys/class/net/eth0/device/numa_node
    0
    cat /sys/class/net/eth1/device/numa_node
    1

The command ``lscpu`` returns the number of NUMA nodes with the associated
CPUs for each NUMA node.

.. code-block:: none

    ...
    NUMA:
      NUMA node(s):      2
      NUMA node0 CPU(s): 0-17,36-53
      NUMA node1 CPU(s): 18-35,54-71

The following shows an example configuration.

.. code-block:: json

    {
        "interfaces": {
            "links": [
                {
                    "interface": "eth0",
                    "rx-threads": 4,
                    "rx-cpuset": [0, 36, 1, 37],
                    "tx-threads": 4,
                    "tx-cpuset": [2, 38, 3, 39]
                },
                {
                    "interface": "eth1",
                    "rx-threads": 4,
                    "rx-cpuset": [18, 54, 19, 55],
                    "tx-threads": 4,
                    "tx-cpuset": [20, 56, 21, 57]
                }
            ]
        }
    }

The following is a real-world example from a system with two CPU sockets (NUMA nodes) and two physical NIC adapters,
each connected to a different socket (NUMA node). This example was optimized to send loss-free 20G from
ens2f2np2 and ens2f3np3 (NUMA node 0) to ens5f2np2 and ens5f3np3 (NUMA node 1).

.. code-block:: json

    {
        "interfaces": {
            "links": [
                {
                    "interface": "ens2f2np2",
                    "tx-threads": 4,
                    "tx-cpuset": [0, 36, 1, 37]
                },
                {
                    "interface": "ens2f3np3",
                    "tx-threads": 4,
                    "tx-cpuset": [2, 38, 3, 39]
                },
                {
                    "interface": "ens5f2np2",
                    "rx-threads": 16,
                    "rx-cpuset": [18, 54, 19, 55, 20, 56, 21, 57, 22, 58, 23, 59, 24, 60, 25, 61],
                    "io-slots-rx": 32768
                },
                {
                    "interface": "ens5f3np3",
                    "rx-threads": 16,
                    "rx-cpuset": [26, 62, 27, 63, 28, 64, 29, 65, 30, 66, 31, 67, 32, 68, 33, 69],
                    "io-slots-rx": 32768
                }
            ]
        }
    }

This example also illustrates that typically more RX threads than TX threads are required.

.. _dpdk-usage:

DPDK
----

Using the experimental `DPDK <https://www.dpdk.org/>`_ support requires building
the BNG Blaster from sources with DPDK enabled, as explained
in the corresponding :ref:`installation <install-dpdk>` section.

.. note::

    The official BNG Blaster Debian release packages do not support
    `DPDK <https://www.dpdk.org/>`_!

.. code-block:: json

    {
        "interfaces": {
            "io-slots": 32768,
            "links": [
                {
                    "interface": "0000:23:00.0",
                    "io-mode": "dpdk",
                    "rx-threads": 8,
                    "rx-cpuset": [4, 5, 6, 7],
                    "tx-threads": 3,
                    "tx-cpuset": [1, 2, 3]
                },
                {
                    "interface": "0000:23:00.2",
                    "io-mode": "dpdk",
                    "rx-threads": 8,
                    "rx-cpuset": [12, 13, 14, 15],
                    "tx-threads": 3,
                    "tx-cpuset": [9, 10, 11]
                }
            ],
            "a10nsp": [
                {
                    "__comment__": "PPPoE Server",
                    "interface": "0000:23:00.0"
                }
            ],
            "access": [
                {
                    "__comment__": "PPPoE Client",
                    "interface": "0000:23:00.2",
                    "type": "pppoe",
                    "outer-vlan-min": 1,
                    "outer-vlan-max": 4000,
                    "inner-vlan-min": 1,
                    "inner-vlan-max": 4000,
                    "stream-group-id": 1
                }
            ]
        },
        "pppoe": {
            "reconnect": true
        },
        "dhcpv6": {
            "enable": false
        },
        "streams": [
            {
                "stream-group-id": 1,
                "name": "S1",
                "type": "ipv4",
                "direction": "both",
                "pps": 1000,
                "a10nsp-interface": "0000:23:00.0"
            }
        ]
    }

DPDK assigns one hardware queue to each RX thread, so you need to increase
the number of threads to utilize more queues and enhance performance.