Uploaded by Hajime Tazaki

Library Operating System for Linux #netdev01

This document introduces a library operating system approach for using the Linux network stack in userspace. Some key points:
- It describes building the Linux network stack (including components such as ARP, TCP/IP, Qdisc, etc.) as a library that can be loaded and used in userspace.
- This allows flexible experimentation with and testing of new network stack ideas without modifying the kernel. Code can be added and tested through the library interface.
- The implementations described are Direct Code Execution (DCE), which integrates the stack with a network simulator, and Network Stack in Userspace (NUSE), which provides a full-featured POSIX-like platform for the network stack.

In this document
Presentation on library operating systems using the Linux network stack by Tazaki et al., highlighting its significance.

Explores why network stacks have historically lived in kernel space and what moving them to userspace offers, such as network stack personalization.

Discussion on various userspace network stacks and their evolution, including motivations and insights.

Raises important questions regarding benefits and adaptations needed for utilizing matured network stacks.

Proposes using the Linux network stack directly as a userspace library, hinting at practical applications.

Overview of the talk focused on introducing a library operating system for Linux and its implementation.

Design overview explaining hardware-independent architecture with three main components: host backend, kernel layer, and POSIX layer.

Examples of kernel glue and POSIX glue code, demonstrating the integration between userspace applications and kernel functionalities.

Details the implementations involving Direct Code Execution (DCE) and their networking platform benefits.

Explains ns-3 integration of network simulations with deterministic scheduling and network stack control.

Describes a userspace network stack running on Linux, emphasizing personalization and comprehensive features.

Demonstrates the execution process of the userspace network stack with an example using 'ping' command.

Illustrates scenarios where NUSE provides network stack personalization benefiting specific applications.

Outline of the workflow for writing and testing patches within the network stack development.

Discussion on continuous integration processes used for network stack testing and validation.

Details the stepwise approach for writing patches and creating test scenarios using ns-3.

Presents performance metrics of NUSE under various configurations, particularly in high-speed Ethernet environments.

Describes the setup used for measuring NUSE performance across different hardware and software configurations.

Detailed performance metrics of NUSE involving throughput and RTT analysis across various configurations.

Compares NUSE to other alternatives such as UML, containers, and scratch-based network stacks.

Highlights limitations encountered with ad-hoc kernel glues and performance issues in NUSE.

Summarizes the advantages of using a library operating system while planning future developments.

Provides links to GitHub repository and relevant resources for accessing the system discussed.

Contains ancillary materials or backup information related to the presentation subject.

Demonstrates how to utilize debugging tools to ensure proper functionality of network nodes.

Discusses the use of tools like Valgrind for memory error detection in the context of network stacks.

Instructions for configuring and building a kernel source tree tailored for NUSE.

Explains how timers function within the context of the userspace networking stack.

Presents the call graphs for network transmission and reception processes within the userspace framework.

Library Operating System with Mainline Linux Network Stack
Hajime Tazaki, Ryo Nakamura, Yuji Sekiya
netdev0.1, Feb. 2015

Motivation
- Why kernel space?
  - Packets were expensive in the 1970s
- Why not userspace?
  - well grown over decades, costs have dropped
  - obtain network stack personalization
  - controllable by userspace utilities

Userspace network stacks
- A lot of userspace network stacks
  - written from scratch: mTCP, Mirage, lwIP
  - ported: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)
- Motivated by their own problems (specialized NIC, cloud, high-speed apps)
- Writing a network stack is a 1-week DIY, but writing an operable network stack is a decades-long DIY (which is not DIY)

Questions
- How to benefit from a matured network stack in userspace?
- How to trivially introduce your idea on a network stack?
  - xxTCP, IPvX, etc.
- How to flexibly test your code with a complex scenario?

The answers
- Using the Linux network stack as-is!
- as a userspace library (library operating system)

This talk is about
- an introduction of a library operating system for Linux
- and its implementation
- with a couple of useful use cases

Outlook (design)
- hardware-independent arch (arch/lib)
- 3 components
  - Host backend layer
  - Kernel layer
  - POSIX layer
https://github.com/libos-nuse/net-next-nuse

Outlook (cont'd)
[Figure: layered architecture diagram. Labels: Application; POSIX glue layer; Kernel layer (ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling); bottom halves/rcu/timer/interrupt; struct net_device; Host backend layer (scheduler, netdev, clock source).]
1) Build Linux src tree w/ glues as a library
2) put backend (vNIC, clock source, scheduler) and bind
3) add POSIX glue code
4) applications magically run

Kernel glue code
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/sched.c

    void schedule(void)
    {
        lib_task_wait();
    }

    signed long schedule_timeout(signed long timeout)
    {
        u64 ns;
        struct SimTask *self;

        if (timeout == MAX_SCHEDULE_TIMEOUT) {
            lib_task_wait();
            return MAX_SCHEDULE_TIMEOUT;
        }
        lib_assert(timeout >= 0);
        ns = ((__u64)timeout) * (1000000000 / HZ);
        self = lib_task_current();
        lib_event_schedule_ns(ns, &trampoline, self);
        lib_task_wait();
        /* we know that we are always perfectly on time. */
        return 0;
    }

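The tick-to-nanosecond conversion in schedule_timeout() can be checked with a few lines of arithmetic. The sketch below is Python rather than C, uses a hypothetical helper name (jiffies_to_ns), and treats HZ as a parameter because its real value depends on the kernel configuration. It mirrors the expression timeout * (1000000000 / HZ), where the division is an integer division in C.

```python
NSEC_PER_SEC = 1_000_000_000


def jiffies_to_ns(timeout: int, hz: int) -> int:
    """Mirror `timeout * (1000000000 / HZ)` with C-style integer division.

    Hypothetical helper for illustration; not part of the LibOS sources.
    """
    if timeout < 0:
        raise ValueError("timeout must be non-negative")
    return timeout * (NSEC_PER_SEC // hz)


print(jiffies_to_ns(250, 1000))  # 250 ticks at HZ=1000
print(jiffies_to_ns(300, 300))   # one second worth of ticks at HZ=300
```

For HZ values that do not divide one billion evenly, the per-tick length is truncated before the multiplication, so the second call comes out slightly short of a full second.
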
POSIX glue code
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/nuse-glue.c

    int nuse_socket(int domain, int type, int protocol)
    {
        lib_update_jiffies();
        struct socket *kernel_socket = malloc(sizeof(struct socket));
        int ret, real_fd;

        memset(kernel_socket, 0, sizeof(struct socket));
        ret = lib_sock_socket(domain, type, protocol, &kernel_socket);
        if (ret < 0)
            errno = -ret;
        (snip)
        lib_softirq_wakeup();
        return real_fd;
    }
    weak_alias(nuse_socket, socket);

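The errno = -ret line follows the Linux kernel convention of reporting failure as a negative errno value, while POSIX callers expect a positive errno. A small Python sketch of that translation follows. The helper name is hypothetical, and the -1 return on failure is an assumption, since the slide snips that part of the function.

```python
import errno
import os
from typing import Tuple


def kernel_ret_to_posix(ret: int) -> Tuple[int, int]:
    """Translate a kernel-style return value into (posix_return, errno_value).

    Assumption: on failure the glue returns -1, as POSIX socket(2) does.
    """
    if ret < 0:
        return -1, -ret
    return ret, 0


rc, err = kernel_ret_to_posix(-errno.EAFNOSUPPORT)
print(rc, errno.errorcode[err], os.strerror(err))
```
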
Implementations (Instances)
- Direct Code Execution (DCE)
  - network simulator integration (ns-3)
  - for more testing
- Network Stack in Userspace (NUSE)
  - gives new platform of Linux network stack
  - for ad-hoc network stack

Direct Code Execution
- ns-3 integration
- deterministic scheduler
- single-process model virtualization
- dlmopen(3)-like virtualization
- full control over multiple network stacks

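A deterministic scheduler fires events in simulated-time order, independent of wall-clock time, so the same scenario replays identically on every run. The toy event queue below illustrates the idea only. It is not ns-3 or DCE code, and every name in it is hypothetical.

```python
import heapq
import itertools


class ToyScheduler:
    """Minimal discrete-event scheduler with a virtual clock (illustration only)."""

    def __init__(self):
        self.now_ns = 0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker keeps insertion order stable

    def schedule_ns(self, delay_ns, callback, *args):
        entry = (self.now_ns + delay_ns, next(self._seq), callback, args)
        heapq.heappush(self._queue, entry)

    def run(self):
        while self._queue:
            when, _, callback, args = heapq.heappop(self._queue)
            self.now_ns = when  # jump the virtual clock; nothing really sleeps
            callback(*args)


log = []
sched = ToyScheduler()
sched.schedule_ns(2_000, lambda: log.append(("node1 wakes", sched.now_ns)))
sched.schedule_ns(1_000, lambda: log.append(("node0 wakes", sched.now_ns)))
sched.schedule_ns(1_000, lambda: log.append(("node2 wakes", sched.now_ns)))
sched.run()
print(log)
```

Events with the same timestamp run in the order they were scheduled, which is what makes the replay order reproducible in this sketch.
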
Execution (DCE)

    main() => dlmopen(ping, liblinux.so)
      => main() => socket(2) => dce_socket()
      => (do whatever)

Network Stack in Userspace
- Userspace network stack running on Linux (POSIX) platform
- Network stack personalization
- Full features by design (full stack)
  - ARP/ND, UDP/TCP (all cc algorithms), SCTP, DCCP, QDISC, XFRM, netfilter, etc.

[Figure: NUSE architecture diagram. Labels: Application; POSIX glue layer (system call hijack); Kernel layer (ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling); bottom halves/rcu/timer/interrupt; struct net_device; Host backend layer (NUSE): scheduler, netdev, clock source; RAW, DPDK, netmap, ...; NIC; master process, slave processes; rump syscall proxy, rump server.]

Execution (NUSE)

    LD_PRELOAD=libnuse-linux.so ping www.google.com

    ping(8) => socket(2) => nuse_socket() => raw(7) => (network)

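The LD_PRELOAD invocation can also be prepared programmatically, for example from a test harness. The sketch below only builds the environment dictionary. The helper name is hypothetical, and it assumes the library can be found by the dynamic linker under the name shown on the slide.

```python
import os


def preload_env(library, base_env=None):
    """Return a copy of the environment with `library` first in LD_PRELOAD.

    Hypothetical helper; ld.so accepts a colon-separated LD_PRELOAD list.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = library if not existing else f"{library}:{existing}"
    return env


env = preload_env("libnuse-linux.so", {"PATH": "/usr/bin"})
print(env["LD_PRELOAD"])
# To launch a program with it, one could then call, for example:
#   subprocess.run(["ping", "www.google.com"], env=env)
```
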
When it's useful?
- ad-hoc network stack (network stack personalization)
  - LD_PRELOAD=liblinux-mptcp.so firefox
- Bundle with kernel bypasses
  - Intel DPDK / netmap / PF_RING / etc.
- debugging/testing with ns-3

Testing workflow
1. Write/modify code (patches)
2. Write a test code (incl. packet exchanges)
3. if PASS; accept pull-request
   else; reject

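The accept/reject rule in step 3 amounts to a gate over the test results. Below is a minimal sketch with hypothetical names. How the actual CI job reports results is not shown on the slides, and rejecting an empty result set is a design choice made here, not something the slides state.

```python
from typing import Dict


def review_pull_request(test_results: Dict[str, bool]) -> str:
    """Accept only if at least one test ran and every test passed."""
    if test_results and all(test_results.values()):
        return "accept"
    return "reject"


print(review_pull_request({"ping6-handoff": True, "tcp-bulk": True}))
print(review_pull_request({"ping6-handoff": True, "tcp-bulk": False}))
```
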
continuous integration (CI)
http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/

T1) write a patch

    Fixes: de3b7a06dfe1 ("xfrm6: Fix transport header offset in _decode_session6.")
    Signed-off-by: Hajime Tazaki <tazaki@sfc.wide.ad.jp>
    ---
     net/ipv6/xfrm6_policy.c | 1 +
     1 file changed, 1 insertion(+)

    diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
    index 48bf5a0..8d2d01b 100644
    --- a/net/ipv6/xfrm6_policy.c
    +++ b/net/ipv6/xfrm6_policy.c
    @@ -200,6 +200,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)

     #if IS_ENABLED(CONFIG_IPV6_MIP6)
             case IPPROTO_MH:
    +            offset += ipv6_optlen(exthdr);
                 if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) {
                     struct ip6_mh *mh;

http://patchwork.ozlabs.org/patch/436351/

T2) write a test
- As ns-3 scenario
  - C++ or python
- create a topology
- config nodes
- run/check results (e.g., ping6)

                    +-----------+
                    |    HA     |
                    +-----------+
                         |sim0
              +----------+------------+
              |sim0                   |sim0
     sim2+----+---+              +----+---+
    - - -|  AR1   |              |  AR2   |
         +---+----+              +----+---+
             |sim1                    |sim1
             |                        |
        sim0                      sim0
        +----+------+ (Movement) +----+-----+
        |    MR     |  <=====>   |    MR    |
        +-----------+            +----------+
             |sim1                    |sim1
        +---------+              +---------+
        |   MNN   |              |   MNN   |
        +---------+              +---------+

http://code.nsnam.org/thehajime/ns-3-dce-umip/file/tip/test/dce-umip-test.cc

    #!/usr/bin/python

    from ns.dce import *
    from ns.core import *

    nodes = NodeContainer()
    nodes.Create (100)
    dce = DceManagerHelper()
    dce.SetNetworkStack ("liblinux.so")
    dce.Install (nodes)

    app = DceApplicationHelper()
    app.SetBinary ("ping6")
    app.Install (nodes)
    (snip)

    NS_TEST_ASSERT_MSG_EQ (m_pingStatus, true, "Umip test " << m_testname
        << " did not return successfully: " << g_testError)

    Simulator.Stop (Seconds(1000.0))
    Simulator.Run ()

Performance of NUSE
- 10G Ethernet back-to-back
- transmission
- IP forwarding
- native Linux, raw socket, tap, dpdk, netmap

Performance: setup
[Figure: Tx, NUSE and Rx nodes connected by 10G links. Tools shown: ping, flowgen, vnstat (packet count).]

              NUSE node                             Tx/Rx nodes
    CPU       Xeon E5-2650v2 @ 2.60GHz (16 core)    Xeon L3426 @ 1.87GHz (8 core)
    Memory    32GB                                  4GB
    NIC       Intel X520                            Intel X520
    OS        host: 3.13.0-32, nuse: 3.17.0-rc1     host: 3.13.0-32

Host Tx
[Figure: NUSE node as Tx, sending to an Rx node. Two bar charts over dpdk, native, netmap, raw and tap: throughput (1024 byte, UDP) in Mbps, and ping RTT in ms.]
- native: ping A.B.C.D
- others: ./nuse ping A.B.C.D

L3 Routing
Sender->NUSE->Receiver
[Figure: Tx node, NUSE node and Rx node in a line. Two bar charts over dpdk, native, netmap, raw and tap: throughput (1024 byte, UDP) in Mbps, and ping RTT in ms.]

Alternatives
- UML/LKL (1proc/1vm, no POSIX i/f)
- Containers (can't change kernel)
- scratch-based (mTCP, Mirage)
- rumpkernel (in NetBSD)

Limitations
- ad-hoc kernel glues required
  - when we changed a member of a struct, LibOS needs to follow it
- Performance drawbacks on NUSE
  - adapt known techniques (mTCP)

(not) Conclusions
- An abstraction for multiple benefits
- Conservative
  - Use past decades of effort as much as possible
  - with a small amount of effort
- Planning to RFC for upstreaming

- github: https://github.com/libos-nuse/net-next-nuse
- DCE: http://bit.ly/ns-3-dce
- twitter: @thehajime

Backups

Bug reproducibility
[Figure: handoff scenario. Labels: Wi-Fi, Wi-Fi, Home Agent, AP1, AP2, handoff, ping6, mobile node, correspondent node.]

    (gdb) b mip6_mh_filter if dce_debug_nodeid()==0
    Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
    <continue>
    (gdb) bt 4
    #0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0)
        at net/ipv6/mip6.c:109
    #1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135)
        at net/ipv6/raw.c:199
    #2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135)
        at net/ipv6/raw.c:232
    #3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0)
        at net/ipv6/ip6_input.c:197

Debugging
- Memory error detection
  - among distributed nodes
  - in a single process
- using Valgrind

    ==5864== Memcheck, a memory error detector
    ==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
    ==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright in
    ==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
    ==5864==
    ==5864== Conditional jump or move depends on uninitialised value(s)
    ==5864==    at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
    ==5864==    by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
    ==5864==    by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
    ==5864==    by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
    ==5864==    by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
    ==5864==    by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
    ==5864==    by 0x7D442E4: ip_rcv_finish (dst.h:318)
    ==5864==    by 0x7D2313F: process_backlog (dev.c:3368)
    ==5864==    by 0x7D23455: net_rx_action (dev.c:3526)
    ==5864==    by 0x7CF2477: do_softirq (softirq.c:65)
    ==5864==    by 0x7CF2544: softirq_task_function (softirq.c:21)
    ==5864==    by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manage
    ==5864==  Uninitialised value was created by a stack allocation
    ==5864==    at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
    ==5864==

Fine-grained parameter coverage
- Code coverage measurement with DCE
- With fine-grained network, node, protocol parameters

1) kernel build
- build kernel source tree w/ the patch

    make menuconfig ARCH=sim
    make library ARCH=sim

- ➔ libnuse-linux-3.17-rc1.so

Example: How timer works
[Figure: timer flow diagram. Labels: add_timer(), TIMER_SOFTIRQ, timer_list, run_timer_softirq(), timer handler, timer thread (timer_create(2)).]

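In the Linux kernel, add_timer() queues a timer_list entry and run_timer_softirq() runs the handlers whose expiry time has passed. The toy model below mimics that flow in Python to show the moving parts named on the slide. It is an illustration with hypothetical class and method names, not the NUSE implementation, and it assumes the host-side timer thread simply advances a tick counter and then triggers the softirq handler.

```python
import heapq
import itertools


class ToyTimerList:
    """Toy stand-in for a kernel timer list driven by a host timer thread."""

    def __init__(self):
        self.jiffies = 0
        self._pending = []
        self._seq = itertools.count()  # keeps ordering stable for equal expiries

    def add_timer(self, expires, handler):
        heapq.heappush(self._pending, (expires, next(self._seq), handler))

    def run_timer_softirq(self):
        """Run every handler whose expiry is not after the current tick."""
        while self._pending and self._pending[0][0] <= self.jiffies:
            _, _, handler = heapq.heappop(self._pending)
            handler()

    def tick(self):
        """Stand-in for the host timer thread: advance time, then run timers."""
        self.jiffies += 1
        self.run_timer_softirq()


fired = []
timers = ToyTimerList()
timers.add_timer(3, lambda: fired.append(("retransmit", timers.jiffies)))
timers.add_timer(1, lambda: fired.append(("delayed-ack", timers.jiffies)))
for _ in range(3):
    timers.tick()
print(fired)
```
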
Tx callgraph

    sendmsg ()                      (socket API)
    lib_sock_sendmsg ()             (NUSE)
    sock_sendmsg ()
    ip_send_skb ()
    ip_finish_output2 ()
    dst_neigh_output ()             (existing-kernel)
    neigh_resolve_output ()
    arp_solicit ()
    dev_queue_xmit ()
    lib_dev_xmit ()                 (NUSE)
    nuse_vif_raw_write ()

Rx callgraph

    vNIC rx:
    start_thread ()                 (pthread)
    nuse_netdev_rx_trampoline ()
    nuse_vif_raw_read ()            (NUSE)
    lib_dev_rx ()
    netif_rx ()                     (ex-kernel)

    softirq rx:
    start_thread ()                 (pthread)
    do_softirq ()                   (NUSE)
    net_rx_action ()
    process_backlog ()              (ex-kernel)
    __netif_receive_skb_core ()
    ip_rcv ()


