The Npcap driver allocates a buffer for each open instance (the handle returned by `PacketOpenAdapter()`) that stores packets until the user program retrieves them (via `PacketReceivePacket()`). The buffer is split into N segments, where N is the number of logical CPUs on the system. When a packet is processed, it is placed into the segment of the buffer corresponding to the number of the processor where the filter callback is running (`KeGetCurrentProcessorNumber()`). If that segment is full, the packet is dropped and not captured.
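To make the scheme concrete, here is a minimal sketch of how the per-processor segment selection works today. The structure and function names (`SEGMENT`, `OPEN_INSTANCE`, `StorePacket`) are illustrative placeholders rather than the actual driver source, and synchronization with the read path is omitted for brevity.

```c
#include <ntddk.h>

/* Illustrative only: one segment per logical CPU inside an open instance. */
typedef struct _SEGMENT {
    PUCHAR Buffer;   /* this segment's slice of the instance buffer       */
    ULONG  Size;     /* capacity = total buffer size / number of CPUs     */
    ULONG  Used;     /* bytes currently queued, not yet read by user mode */
} SEGMENT;

typedef struct _OPEN_INSTANCE {
    ULONG    NSegments;  /* today: the number of logical processors */
    SEGMENT *Segments;
} OPEN_INSTANCE;

/* Called from the filter callback for each captured packet. */
VOID StorePacket(OPEN_INSTANCE *Open, PUCHAR Packet, ULONG Len)
{
    /* The segment is chosen solely by the CPU this callback runs on. */
    ULONG    Index = KeGetCurrentProcessorNumber();
    SEGMENT *Seg   = &Open->Segments[Index];

    if (Seg->Used + Len > Seg->Size) {
        /* This segment is full, so the packet is dropped -- even if the
         * segments belonging to other (idle) CPUs are completely empty. */
        return;
    }

    RtlCopyMemory(Seg->Buffer + Seg->Used, Packet, Len);
    Seg->Used += Len;
}
```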
On systems with many processors, the segments can become small enough to drop packets even while the buffer as a whole is largely empty. This is even more likely when Windows simply never schedules NDIS threads on some processors, leaving those buffer segments entirely unused. That is probably a common situation, since the scheduler prefers to keep threads on the processors they last ran on in order to avoid cache misses.
Additionally, Npcap currently handles numbers of processors greater than the number of bits in `KAFFINITY`
(64 on 64-bit Windows, 32 on 32-bit) very oddly. In at least one place it maps any system processor ID over 63 to 63, which means that if Npcap is running on a system with 128 processors in 2 groups, any threads which run on processors 63 through 127 will all share the same segment of the buffer. If the proposed change below is not implemented, we will have to rewrite this code anyway for NDIS 6.20, which adds support for more than 64 processors.
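The behavior described above amounts to something like the following paraphrase (not the actual source). NDIS 6.20 / Windows 7 provides group-aware replacements such as `KeGetCurrentProcessorNumberEx()`, which are not limited to 64 processors:

```c
#include <ntddk.h>

/* Paraphrase of the problematic mapping: every processor index above 63
 * is collapsed onto 63 so that it still fits in a 64-bit KAFFINITY mask. */
ULONG ClampedProcessorIndex(VOID)
{
    ULONG Cpu = KeGetCurrentProcessorNumber();
    if (Cpu > 63) {
        /* On a 128-CPU, 2-group machine, CPUs 63..127 all land here,
         * so every thread running on them shares one buffer segment. */
        Cpu = 63;
    }
    return Cpu;
}

/* Group-aware alternative available since NDIS 6.20 / Windows 7:
 * returns a system-wide processor index, not limited to 64. */
ULONG GroupAwareProcessorIndex(VOID)
{
    return KeGetCurrentProcessorNumberEx(NULL);
}
```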
We need to document the current behavior better (issue to be opened later), but a stronger fix would be to limit the number of segments the buffer is split into and somehow encourage threads to use them all. I haven't worked out all the details, but here are some thoughts:
- The limit should be user-configurable via a Packet.DLL function call, which would communicate with the driver via a new IoControl. The default value would probably be close to the number of CPU cores on a typical lower-end system, with an upper limit of the number of logical processors on the system.
- If it's necessary for threads on a single processor to always use the same buffer segment, we would calculate the segment number as `processor_number % number_of_segments` (see the sketch after this list). However, I don't think this is even necessary: the threads could round-robin over the buffer segments to ensure they are all used. The benefit of segments at that point would be to avoid waiting on a spinlock to access the buffer; that is, the chance of not having to wait on a spinlock is `number_of_segments` times better than with a single buffer. The current scheme probably performs better in this respect, at the cost of wasted kernel memory and a higher likelihood of dropped packets.
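Here is a rough sketch of the two selection schemes discussed in the second bullet, assuming a `NumberOfSegments` value that was configured through the (not yet defined) Packet.DLL call and IoControl; the function names are placeholders, not an existing API:

```c
#include <ntddk.h>

/* Option 1: pin each processor to a segment. Threads that keep running on
 * the same CPU keep using the same segment, preserving cache locality. */
ULONG SegmentByProcessor(ULONG NumberOfSegments)
{
    ULONG Cpu = KeGetCurrentProcessorNumberEx(NULL); /* group-aware index */
    return Cpu % NumberOfSegments;
}

/* Option 2: round-robin over the segments so that all of them are used
 * even if NDIS only ever schedules callbacks on a handful of processors.
 * NextSegment would live in the open instance. */
ULONG SegmentRoundRobin(volatile LONG *NextSegment, ULONG NumberOfSegments)
{
    ULONG n = (ULONG)InterlockedIncrement(NextSegment);
    return n % NumberOfSegments;
}
```

Either way, each segment would keep its own spinlock, so the chance of contending on any one lock drops roughly in proportion to the number of segments.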