Uploaded by Hajime Tazaki

Library Operating System for Linux #netdev01

This document introduces a library operating system approach for using the Linux network stack in userspace. Key points:
- It describes building the Linux network stack (including components such as ARP, TCP/IP, and Qdisc) as a library that can be loaded and used in userspace.
- This allows flexible experimentation with and testing of new network stack ideas without changing the host's running kernel. Code can be added and tested through the library interface.
- The implementations described are Direct Code Execution (DCE), which integrates the stack with the ns-3 network simulator, and Network Stack in Userspace (NUSE), which runs the full-featured stack on a POSIX platform in userspace.

In this document
Title slide: a library operating system built on the mainline Linux network stack, presented by Hajime Tazaki, Ryo Nakamura, and Yuji Sekiya at netdev0.1 (Feb. 2015).

Contrasts kernel-space and userspace network stacks: packets were expensive in the 1970s, while a userspace stack offers network stack personalization and control by userspace utilities.

Discussion on various userspace network stacks and their evolution, including motivations and insights.

Raises important questions regarding benefits and adaptations needed for utilizing matured network stacks.

Proposes using the Linux network stack directly as a userspace library, hinting at practical applications.

Overview of the talk focused on introducing a library operating system for Linux and its implementation.

Design overview explaining hardware-independent architecture with three main components: host backend, kernel layer, and POSIX layer.

Examples of kernel glue and POSIX glue code, demonstrating the integration between userspace applications and kernel functionalities.

Introduces the two implementations: Direct Code Execution (DCE), which integrates with the ns-3 network simulator, and Network Stack in Userspace (NUSE), a new platform for the Linux network stack.

Explains ns-3 integration of network simulations with deterministic scheduling and network stack control.

Describes a userspace network stack running on Linux, emphasizing personalization and comprehensive features.

Demonstrates how the userspace network stack is executed, using the ping command as an example.

Illustrates scenarios where NUSE provides network stack personalization benefiting specific applications.

Outline of the workflow for writing and testing patches within the network stack development.

Discussion on continuous integration processes used for network stack testing and validation.

Details the stepwise approach for writing patches and creating test scenarios using ns-3.

Presents performance metrics of NUSE under various configurations, particularly in high-speed Ethernet environments.

Describes the setup used for measuring NUSE performance across different hardware and software configurations.

Detailed performance metrics of NUSE involving throughput and RTT analysis across various configurations.

Compares NUSE to other alternatives such as UML, containers, and scratch-based network stacks.

Highlights limitations encountered with ad-hoc kernel glues and performance issues in NUSE.

Summarizes the advantages of using a library operating system while planning future developments.

Provides links to GitHub repository and relevant resources for accessing the system discussed.

Marks the start of the backup slides.

Demonstrates reproducing a bug under gdb, using a breakpoint conditioned on the simulated node ID.

Discusses the use of tools like Valgrind for memory error detection in the context of network stacks.

Instructions for configuring and building a kernel source tree tailored for NUSE.

Explains how timers function within the context of the userspace networking stack.

Presents the call graphs for network transmission and reception processes within the userspace framework.

Library Operating System with Mainline Linux Network Stack
Hajime Tazaki, Ryo Nakamura, Yuji Sekiya
netdev0.1, Feb. 2015

Motivation
- Why kernel space?
  - Packets were expensive in the 1970s
- Why not userspace?
  - matured over decades, costs have come down
  - obtain network stack personalization
  - controllable by userspace utilities

Userspace network stacks
- A lot of userspace network stacks
  - written from scratch: mTCP, Mirage, lwIP
  - ported: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)
- Motivated by their own problems (specialized NICs, cloud, high-speed apps)
- Writing a network stack is a one-week DIY project, but writing an operable network stack is a decades-long DIY project (which is not DIY)

Questions
- How to benefit from a matured network stack in userspace?
- How to trivially introduce your idea into the network stack? (xxTCP, IPvX, etc.)
- How to flexibly test your code with a complex scenario?

The answers
- Use the Linux network stack as-is!
- as a userspace library (library operating system)

This talk is about
- an introduction to a library operating system for Linux
- and its implementation
- with a couple of useful use cases

Outlook (design)
- hardware-independent arch (arch/lib)
- 3 components
  - Host backend layer
  - Kernel layer
  - POSIX layer
https://github.com/libos-nuse/net-next-nuse

Outlook (cont'd)
[Figure: layered architecture, top to bottom]
- Application
- POSIX glue layer
- Kernel layer: ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling
- bottom halves / RCU / timer / interrupt, struct net_device
- Host backend layer: scheduler, netdev, clock source
Steps:
1) Build the Linux source tree with glues as a library
2) Put the backend (vNIC, clock source, scheduler) in place and bind it
3) Add POSIX glue code
4) Applications magically run

Kernel glue code
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/sched.c

    void schedule(void)
    {
        lib_task_wait();
    }

    signed long schedule_timeout(signed long timeout)
    {
        u64 ns;
        struct SimTask *self;

        if (timeout == MAX_SCHEDULE_TIMEOUT) {
            lib_task_wait();
            return MAX_SCHEDULE_TIMEOUT;
        }
        lib_assert(timeout >= 0);
        ns = ((__u64)timeout) * (1000000000 / HZ);
        self = lib_task_current();
        lib_event_schedule_ns(ns, &trampoline, self);
        lib_task_wait();
        /* we know that we are always perfectly on time. */
        return 0;
    }

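The timeout handling in the glue above can be illustrated outside the kernel. The following is a minimal Python sketch, not the project's code: `HZ = 100`, `jiffies_to_ns`, and the `VirtualClock` class are all assumptions made for illustration. It mirrors the jiffies-to-nanoseconds conversion (with C-style integer division) and shows why a task woken by an event scheduler that jumps straight to each deadline is "always perfectly on time".

```python
import heapq

HZ = 100  # assumed tick rate; the real value comes from the kernel configuration
NSEC_PER_SEC = 1_000_000_000


def jiffies_to_ns(timeout_jiffies, hz=HZ):
    """Mirror `timeout * (1000000000 / HZ)` using integer division, as C does."""
    if timeout_jiffies < 0:
        raise ValueError("timeout must be non-negative")
    return timeout_jiffies * (NSEC_PER_SEC // hz)


class VirtualClock:
    """Hypothetical stand-in for the host backend's event scheduler."""

    def __init__(self):
        self.now_ns = 0
        self._events = []  # min-heap of (deadline_ns, sequence, callback)
        self._seq = 0      # tie-breaker so callbacks are never compared

    def schedule_ns(self, delay_ns, callback):
        heapq.heappush(self._events, (self.now_ns + delay_ns, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._events:
            deadline, _, callback = heapq.heappop(self._events)
            self.now_ns = deadline  # time jumps straight to the deadline
            callback()


def demo():
    """Schedule two wake-ups and record the virtual time at which each fires."""
    clock = VirtualClock()
    woken_at = []
    clock.schedule_ns(jiffies_to_ns(5), lambda: woken_at.append(clock.now_ns))
    clock.schedule_ns(jiffies_to_ns(2), lambda: woken_at.append(clock.now_ns))
    clock.run()
    return woken_at
```

With a wall-clock backend the wake-up would only be approximately on time; the sketch models the simulated-clock case.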
POSIX glue code
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/nuse-glue.c

    int nuse_socket(int domain, int type, int protocol)
    {
        lib_update_jiffies();
        struct socket *kernel_socket = malloc(sizeof(struct socket));
        int ret, real_fd;

        memset(kernel_socket, 0, sizeof(struct socket));
        ret = lib_sock_socket(domain, type, protocol, &kernel_socket);
        if (ret < 0)
            errno = -ret;
    (snip)
        lib_softirq_wakeup();
        return real_fd;
    }
    weak_alias(nuse_socket, socket);

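The glue above turns a kernel-style negative return value into a POSIX `errno`. As a rough Python analogue of that convention (the helper name `check_kernel_ret` is hypothetical and not part of NUSE), a wrapper can raise `OSError` with the negated code:

```python
import errno
import os


def check_kernel_ret(ret):
    """Translate a kernel-style return value into POSIX-style error reporting.

    Kernel functions report failure as a negative errno value; userspace
    callers expect the positive code (here carried by OSError.errno).
    """
    if ret < 0:
        raise OSError(-ret, os.strerror(-ret))
    return ret
```

A caller would wrap each call into the library, for example `fd = check_kernel_ret(ret)`, so that application code only ever sees POSIX-style errors.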
Implementations (Instances)
- Direct Code Execution (DCE)
  - network simulator integration (ns-3)
  - for more testing
- Network Stack in Userspace (NUSE)
  - gives a new platform for the Linux network stack
  - for ad-hoc network stacks

Direct Code Execution
- ns-3 integration
- deterministic scheduler
- single-process model virtualization
- dlmopen(3)-like virtualization
- full control over multiple network stacks

Execution (DCE)
main() => dlmopen(ping, liblinux.so)
  => main() => socket(2) => dce_socket()
  => (do whatever)

Network Stack in Userspace
- Userspace network stack running on a Linux (POSIX) platform
- Network stack personalization
- Full features by design (full stack)
  - ARP/ND, UDP/TCP (all cc algorithms), SCTP, DCCP, QDISC, XFRM, netfilter, etc.

[Figure: NUSE architecture, top to bottom]
- Application (system call hijack)
- POSIX glue layer
- Kernel layer: ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling
- bottom halves / RCU / timer / interrupt, struct net_device
- Host backend layer (NUSE): scheduler, netdev, clock source; RAW, DPDK, netmap, ...
- NIC
- Master process and slave processes: application, rump syscall proxy, rump server

Execution (NUSE)
LD_PRELOAD=libnuse-linux.so ping www.google.com
ping(8) => socket(2) => nuse_socket() => raw(7) => (network)

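A test harness could launch an unmodified binary under the preloaded stack in the same way as the command line above. This is a minimal sketch assuming the library name from the slide; the helper names `preload_env` and `run_with_stack` are hypothetical. It prepends the library to any existing `LD_PRELOAD` value (the dynamic linker accepts colon-separated entries) instead of overwriting it.

```python
import os
import subprocess


def preload_env(library, base_env=None):
    """Return a copy of the environment with `library` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_with_stack(argv, library="libnuse-linux.so"):
    """Run `argv` with the userspace stack preloaded (assumes the library
    can be found by the dynamic linker on this machine)."""
    return subprocess.run(argv, env=preload_env(library), check=False)
```

For example, `run_with_stack(["ping", "-c", "1", "www.google.com"])` corresponds to the command shown on the slide.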
When is it useful?
- ad-hoc network stack (network stack personalization)
  - LD_PRELOAD=liblinux-mptcp.so firefox
- Bundle with kernel bypasses
  - Intel DPDK / netmap / PF_RING / etc.
- debugging/testing with ns-3

Testing workflow
1. Write/modify code (patches)
2. Write test code (incl. packet exchanges)
3. If PASS, accept the pull request; else, reject it

Continuous integration (CI)
http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/

T1) write a patch

    Fixes: de3b7a06dfe1 ("xfrm6: Fix transport header offset in _decode_session6.")
    Signed-off-by: Hajime Tazaki <tazaki@sfc.wide.ad.jp>
    ---
     net/ipv6/xfrm6_policy.c | 1 +
     1 file changed, 1 insertion(+)

    diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
    index 48bf5a0..8d2d01b 100644
    --- a/net/ipv6/xfrm6_policy.c
    +++ b/net/ipv6/xfrm6_policy.c
    @@ -200,6 +200,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)

     #if IS_ENABLED(CONFIG_IPV6_MIP6)
             case IPPROTO_MH:
    +                offset += ipv6_optlen(exthdr);
                     if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) {
                             struct ip6_mh *mh;

http://patchwork.ozlabs.org/patch/436351/

T2) write a test
- As an ns-3 scenario
  - C++ or Python
  - create a topology
  - configure nodes
  - run/check results (e.g., ping6)

                   +-----------+
                   |    HA     |
                   +-----------+
                        |sim0
             +----------+------------+
             |sim0                   |sim0
     sim2+----+---+             +----+---+
    - - -|  AR1   |             |  AR2   |
         +---+----+             +----+---+
             |sim1                   |sim1
             |                       |
         sim0                    sim0
        +----+------+ (Movement) +----+-----+
        |    MR     |  <=====>   |    MR    |
        +-----------+            +----------+
             |sim1                    |sim1
        +---------+              +---------+
        |   MNN   |              |   MNN   |
        +---------+              +---------+

http://code.nsnam.org/thehajime/ns-3-dce-umip/file/tip/test/dce-umip-test.cc

    #!/usr/bin/python

    from ns.dce import *
    from ns.core import *

    nodes = NodeContainer()
    nodes.Create (100)
    dce = DceManagerHelper()
    dce.SetNetworkStack ("liblinux.so")
    dce.Install (nodes)

    app = DceApplicationHelper()
    app.SetBinary ("ping6")
    app.Install (nodes)
    (snip)

    NS_TEST_ASSERT_MSG_EQ (m_pingStatus, true, "Umip test " << m_testname
    << " did not return successfully: " << g_testError)

    Simulator.Stop (Seconds(1000.0))
    Simulator.Run ()

Performance of NUSE
- 10G Ethernet, back-to-back
- transmission
- IP forwarding
- native Linux, raw socket, tap, dpdk, netmap

Performance: setup

            NUSE node                             Tx/Rx nodes
    CPU     Xeon E5-2650v2 @ 2.60GHz (16 core)    Xeon L3426 @ 1.87GHz (8 core)
    Memory  32GB                                  4GB
    NIC     Intel X520                            Intel X520
    OS      host: 3.13.0-32, nuse: 3.17.0-rc1     host: 3.13.0-32

[Figure: Tx, NUSE, and Rx nodes connected by 10G links; tools shown: ping, flowgen, vnstat (packet count)]

Host Tx
[Figure: NUSE node as Tx, sending to Rx. Two bar charts over dpdk, native, netmap, raw, tap: throughput in Mbps (1024-byte UDP, axis 0 to 6000) and ping RTT in ms (axis 0 to 1).]
native: ping A.B.C.D
others: ./nuse ping A.B.C.D

L3 Routing: Sender -> NUSE -> Receiver
[Figure: NUSE node forwarding between Tx and Rx. Two bar charts over dpdk, native, netmap, raw, tap: throughput in Mbps (1024-byte UDP, axis 0 to 6000) and ping RTT in ms (axis 0 to 1).]

Alternatives
- UML/LKL (1 proc/1 VM, no POSIX i/f)
- Containers (can't change kernel)
- scratch-based (mTCP, Mirage)
- rump kernel (in NetBSD)

Limitations
- ad-hoc kernel glues required
  - when a member of a kernel struct changes, LibOS needs to follow it
- Performance drawbacks on NUSE
  - adapt known techniques (mTCP)

(not) Conclusions
- An abstraction for multiple benefits
- Conservative
  - Use past decades' effort as much as possible
  - with a small amount of effort
- Planning to send an RFC for upstreaming

github: https://github.com/libos-nuse/net-next-nuse
DCE: http://bit.ly/ns-3-dce
twitter: @thehajime

Backups
Bug reproducibility
[Figure labels: Wi-Fi, Wi-Fi, Home Agent, AP1, AP2, handoff, ping6, mobile node, correspondent node]

    (gdb) b mip6_mh_filter if dce_debug_nodeid()==0
    Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
    <continue>
    (gdb) bt 4
    #0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0)
        at net/ipv6/mip6.c:109
    #1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135)
        at net/ipv6/raw.c:199
    #2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135)
        at net/ipv6/raw.c:232
    #3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0)
        at net/ipv6/ip6_input.c:197

Debugging
- Memory error detection
  - among distributed nodes
  - in a single process
- using Valgrind!

    ==5864== Memcheck, a memory error detector
    ==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
    ==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright in
    ==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
    ==5864==
    ==5864== Conditional jump or move depends on uninitialised value(s)
    ==5864==    at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
    ==5864==    by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
    ==5864==    by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
    ==5864==    by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
    ==5864==    by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
    ==5864==    by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
    ==5864==    by 0x7D442E4: ip_rcv_finish (dst.h:318)
    ==5864==    by 0x7D2313F: process_backlog (dev.c:3368)
    ==5864==    by 0x7D23455: net_rx_action (dev.c:3526)
    ==5864==    by 0x7CF2477: do_softirq (softirq.c:65)
    ==5864==    by 0x7CF2544: softirq_task_function (softirq.c:21)
    ==5864==    by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manage
    ==5864== Uninitialised value was created by a stack allocation
    ==5864==    at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
    ==5864==

Fine-grained parameter coverage
- Code coverage measurement with DCE
- With fine-grained network, node, and protocol parameters

1) kernel build
- build the kernel source tree w/ the patch
  - make menuconfig ARCH=sim
  - make library ARCH=sim
  ➔ libnuse-linux-3.17-rc1.so

Example: How the timer works
[Figure labels: add_timer(), timer_list, timer thread (timer_create(2)), TIMER_SOFTIRQ, run_timer_softirq(), timer handler]

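The labels in the timer figure suggest a flow in which `add_timer()` queues an entry, a host timer thread raises `TIMER_SOFTIRQ`, and `run_timer_softirq()` then calls the handlers of expired timers. The following Python model is only a sketch of that flow under this reading; the class names and the one-jiffy `tick()` are assumptions, not NUSE code.

```python
class Timer:
    """Minimal stand-in for struct timer_list: an expiry and a handler."""

    def __init__(self, expires, handler):
        self.expires = expires  # absolute expiry time in jiffies
        self.handler = handler


class TimerBase:
    def __init__(self):
        self.jiffies = 0
        self.pending = []
        self.softirq_raised = False

    def add_timer(self, timer):
        self.pending.append(timer)

    def tick(self):
        """Role of the host timer thread: advance time, raise the softirq."""
        self.jiffies += 1
        self.softirq_raised = True

    def run_timer_softirq(self):
        """Run handlers of expired timers; return how many fired."""
        if not self.softirq_raised:
            return 0
        self.softirq_raised = False
        expired = [t for t in self.pending if t.expires <= self.jiffies]
        self.pending = [t for t in self.pending if t.expires > self.jiffies]
        for timer in sorted(expired, key=lambda t: t.expires):
            timer.handler()
        return len(expired)


def demo():
    """Arm two timers and record the jiffies value at which each fires."""
    base = TimerBase()
    fired = []
    base.add_timer(Timer(3, lambda: fired.append(base.jiffies)))
    base.add_timer(Timer(2, lambda: fired.append(base.jiffies)))
    for _ in range(3):
        base.tick()
        base.run_timer_softirq()
    return fired
```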
Tx callgraph

    sendmsg ()                 (socket API)
    lib_sock_sendmsg ()        (NUSE)
    sock_sendmsg ()            (existing kernel, through dev_queue_xmit ())
    ip_send_skb ()
    ip_finish_output2 ()
    dst_neigh_output ()
    neigh_resolve_output ()
    arp_solicit ()
    dev_queue_xmit ()
    lib_dev_xmit ()            (NUSE)
    nuse_vif_raw_write ()

Rx callgraph

vNIC rx:

    start_thread ()                  (pthread)
    nuse_netdev_rx_trampoline ()
    nuse_vif_raw_read ()             (NUSE)
    lib_dev_rx ()
    netif_rx ()                      (ex-kernel)

softirq rx:

    start_thread ()                  (pthread)
    do_softirq ()                    (NUSE)
    net_rx_action ()
    process_backlog ()               (ex-kernel)
    __netif_receive_skb_core ()
    ip_rcv ()

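The Rx callgraph shows two threads: one reads frames from the vNIC and hands them over (the `netif_rx ()` role), the other drains the backlog and delivers frames up the stack (the `process_backlog ()` to `ip_rcv ()` role). A minimal Python sketch of that producer/consumer split follows; the function `rx_pipeline` and the tuple it records are illustrative assumptions, not the NUSE implementation.

```python
import queue
import threading

_STOP = object()  # sentinel telling the softirq thread to exit


def rx_pipeline(frames):
    """Pass `frames` from a vNIC reader thread to a softirq thread."""
    backlog = queue.Queue()  # plays the role of the per-CPU backlog queue
    delivered = []

    def vnic_rx():
        # Reader thread: enqueue each received frame (netif_rx() role).
        for frame in frames:
            backlog.put(frame)
        backlog.put(_STOP)

    def softirq_rx():
        # Softirq thread: drain the backlog and deliver (ip_rcv() role).
        while True:
            frame = backlog.get()
            if frame is _STOP:
                return
            delivered.append(("ip_rcv", frame))

    threads = [threading.Thread(target=vnic_rx),
               threading.Thread(target=softirq_rx)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return delivered
```

With a single reader thread the queue preserves arrival order, so frames are delivered in the order they were read.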

