higepon blog

この広告は、90日以上更新していないブログに表示しています。

2022-12-31

2022年振り返り

プライベートなことは Notion に書いた。

Kaggle コンペ参加は1回のみ。
バレエを始めた。
週2-3回の筋トレはよく続いた。
8月にコロナに罹った。熱を出して寝込んだ。家族はほぼ無症状。後遺症はなし。
息子は中学生になり、子育ての負荷がぐっと下がった。
仕事は新しいプロジェクト。比較的忙しい1年だった。
ロシアがウクライナと戦争を始めた。
肋骨を骨折した。
帯状疱疹になった。
Mosh の M1 対応。Rust で書き直す実験など。
ダッシュして膝を痛めた。老後に膝が痛いというのはこのような感じなんだろうか。確かに出不精になりそう。
ELSA Speak 始めた。
新しい習い事を始める勇気がなかった。

2022-12-03

How GC in unsafe loxido works

ceronman/loxido: Rust implementation of the Lox programming language.

How to allocate object

let bar= gc.alloc(LoxString::from_string("bar".to_owned()));

LoxString::from_string andGcObject implementation. They are staraightforward.

pubstructLoxString {pub header: GcObject,pub s:String,pub hash:usize,}impl LoxString {pubfnfrom_string(s:String)->Self {let hash=LoxString::hash_string(&s);        LoxString {            header:GcObject::new(ObjectType::LoxString),            s,            hash,    }__snip__pubstructGcObject {    marked:bool,    next:Option<NonNull<GcObject>>,    obj_type: ObjectType,}

Gc::alloc should be doing thetrick. I added a comment for each line.

pubfnalloc<T: Display+'static>(&mutself, object: T)-> GcRef<T> {unsafe {// Alloc object in heap.let boxed=Box::new(object);// Get raw pointer from Box::into_raw.// pointer is NonNull<T>.let pointer=NonNull::new_unchecked(Box::into_raw(boxed));// This assumes &T has GcObject as the first field with the exact sae alignhment.letmut header: NonNull<GcObject>=mem::transmute(pointer.as_ref());// Gc.first is linked list of GcObject(=header).            header.as_mut().next=self.first.take();// The newly allocated one becomes head of the linked list.self.first=Some(header);// GcRef is a struct with one field pointer: NonNull<T>.            GcRef { pointer }        }    }

Mark

Adding my comments inline.

fnmark_roots(&mutself) {// Mark VM stack.for&valuein&self.stack[0..self.stack_len()] {self.gc.mark_value(value);        }// Mark call frames.for framein&self.frames[..self.frame_count] {self.gc.mark_object(frame.closure)        }for&upvaluein&self.open_upvalues {self.gc.mark_object(upvalue);        }self.gc.mark_table(&self.globals);self.gc.mark_object(self.init_string);    }

Trace

mark_object pushes in-use objects togrey_stack.blacken_object marks recursively in objects. Unlike RustGC this doesn't have Trace trait. So inblacken_object it shouldenum all GcRef field.

fntrace_references(&mutself) {whileletSome(pointer)=self.grey_stack.pop() {self.blacken_object(pointer);        }    }

Sweep

if the object is not marked release the object from heap.

fnsweep(&mutself) {letmut previous:Option<NonNull<GcObject>>=None;letmut current:Option<NonNull<GcObject>>=self.first;whileletSome(mut object)= current {unsafe {let object_ptr= object.as_mut();                current= object_ptr.next;if object_ptr.marked {                    object_ptr.marked=false;                    previous=Some(object);                }else {ifletSome(mut previous)= previous {                        previous.as_mut().next= object_ptr.next                    }else {self.first= object_ptr.next                    }Box::from_raw(object_ptr);                }            }        }    }

2022-12-02

Finalize, Sweep and Rooting - Understanding Rust GC

Finalize Trait

Previously we exploredTrace inTrace - Understanding Rust GC - higepon blog. Let's look intoFinalize trait. This is really simple. EveryGC-ed object should implement this finalize.

pubtraitFinalize {fnfinalize(&self) {}}

Let's look at finalize implementation forarray andOption<T>. They are empty. Hmm. I'm not quite sure why. Let me start fromcollect_garbage and come back here later.

;; Arrayimpl<T: Trace,const N:usize> Finalizefor [T; N] {};;Option<T>impl<T: Trace> FinalizeforOption<T> {}

collect_garbage

Incollect_garbage after themark returns unmarked objects (= not used objects) and it will callfinalize_glue() for eachGC<T> object.

let unmarked=mark(&mut st.boxes_start);if unmarked.is_empty() {return;        }for nodein&unmarked {Trace::finalize_glue(&(*node.this.as_ptr()).data);        }

finalize_glue callsfinalize in the Trait. So say an array is unmarked, thegc eventually calls emptytrace of the array. Now I remember thatfinalize give the object a chance to any necessary work before it's deallocated. For example an object may want to close a file associated with it.

Sweep

Now we know that thegc collects unmarked objects and calls finalize for the unmarked objects. It's time tosweep them.

unsafefnsweep(finalized:Vec<Unmarked>, bytes_allocated:&mutusize) {let _guard=DropGuard::new();for nodein finalized.into_iter().rev() {if (*node.this.as_ptr()).header.marked.get() {continue;            }let incoming= node.incoming;letmut node=Box::from_raw(node.this.as_ptr());*bytes_allocated-=mem::size_of_val::<GcBox<_>>(&*node);*incoming= node.header.next.take();        }    }

This is actually very interesting. It recreates Box from returning raw pointer using from_raw which makes sense!

Rooting

Let's get back toroot andunroot.A Tour of Safe Tracing GC Designs in Rust - In Pursuit of Laziness has a good summary and examples of what is rooting and why we need it.

In one word: Rooting. In a garbage collector, the objects “directly” in use on the stack are the “roots”, and you need to be able to identify them.

structFoo {    bar:Option<Gc<Bar>>}// this is a rootlet bar=Gc::new(Bar::new());// this is also a rootlet foo=Gc::new(Foo::new());// bar should no longer be a root (but we can't detect that!)foo.bar=Some(bar);// but foo should still be a root here since it's not inside// another GC'd objectlet v=vec![foo];

To track root objects, theGC maintains root counts inGC_BOX. In shortGC_BOX with root count > 0 is a root object.Rooting - Designing a GC in Rust explains it very well. Note that the count is incremented or decremented whenGC<T> object is moved.

Summary

Finalize gives an object an opportunity to do some clean up work before it's free-ed.
Sweep returns unmarked objects usingBox.
To track objects directly use in stack thegc maintains root count.

2022-12-01

Trace - Understanding Rust GC

Goal

I want to understand howRust GC work to see if I can use it inmy Scheme VM interpreter to be written in Rust.

How to use

GC-ed objects should implement Trace and Finalize. You should useGc::new instead ofBox::new to allocate objects in heap. Here is an example from theofficial document.

let x=Gc::new(1_u8);let y=Gc::new(Box::new(Gc::new(1_u8)));#[derive(Trace, Finalize)]structFoo {    a: Gc<u8>,    b:u8}let z=Gc::new(Foo {a: x.clone(), b:1})

What isGc<Foo>?

z variable above isGc<Foo> type. You canaccess fields of Foo likez.a. But why? It doesn't have z field in it. Because it implementsDeref Trait Rust compiler can take care of it.

impl<T: Trace+ ?Sized> Dereffor Gc<T> {typeTarget= T;#[inline]fnderef(&self)->&T {&self.inner().value()    }}

The definition of the structGC<T> as follows. It has only 2 fields.

pubstructGc<T: Trace+ ?Sized+'static> {    ptr_root: Cell<NonNull<GcBox<T>>>,    marker: PhantomData<Rc<T>>,}

Trace Trait

PerDesigning a GC in Rust - In Pursuit of Laziness the purpose oftrace is a way of walking the fields of a given object and finding innerGc<T> fields. For example if you have oneGC<T> object, you should be able to find allGC<T> fields or inner fields by the tracing.

Let's look into the Trace trait. I'm not sure what root and unroot are doing yet. We'll get back to here later.

/// The Trace trait, which needs to be implemented on garbage-collected objects.pubunsafetraitTrace: Finalize {/// Marks all contained `Gc`s.unsafefntrace(&self);/// Increments the root-count of all contained `Gc`s.unsafefnroot(&self);/// Decrements the root-count of all contained `Gc`s.unsafefnunroot(&self);/// Runs Finalize::finalize() on this object and all/// contained subobjectsfnfinalize_glue(&self);}

Here is one Trace implementation of for 'static lifetime struct.

unsafeimpl<T: ?Sized> Tracefor&'static T {unsafe_empty_trace!();}

unsafe_empty_trace! is a macro which has no-op trace, root and unroot method. This makes sense because 'static lifetime indicates that the data pointed to by the reference lives for the entire lifetime of the running program. So we don't event need to track or sweep. Let's look at one more example. This is the trace for an array. I would guess this is marking all elements in the array. Let's confirm.

unsafeimpl<T: Trace,const N:usize> Tracefor [T; N] {custom_trace!(this, {for vin this {mark(v);        }    });}

custom_trace! is implemented as follows. It defines inlinemark method and call it in the$body.

/// This rule implements the trace method.////// You define a `this` parameter name and pass in a body, which should call `mark` on every/// traceable element inside the body. The mark implementation will automatically delegate to the/// correct method on the argument.#[macro_export]macro_rules! custom_trace {    ($this:ident,$body:expr)=> {#[inline]unsafefntrace(&self) {#[inline]unsafefnmark<T:$crate::Trace+ ?Sized>(it:&T) {$crate::Trace::trace(it);            }let$this=self;$body        }

Ithink$crate::Trace::trace(it); is callingtrace() forGc<T> but not 100% sure.

unsafeimpl<T: Trace+ ?Sized> Tracefor Gc<T> {#[inline]unsafefntrace(&self) {self.inner().trace_inner();    }

Then it callstrace_inner() forGcBox.

/// Marks this `GcBox` and marks through its data.pub(crate)unsafefntrace_inner(&self) {let marked=self.header.marked.get();if!marked {self.header.marked.set(true);self.data.trace();        }    }

Remember thatGcBox is the raw pointer allocated inGc::new and stored inptr_root.

pubfnnew(value: T)->Self {assert!(mem::align_of::<GcBox<T>>()>1);unsafe {// Allocate the memory for the objectlet ptr=GcBox::new(value);// When we create a Gc<T>, all pointers which have been moved to the// heap no longer need to be rooted, so we unroot them.            (*ptr.as_ptr()).value().unroot();let gc= Gc {                ptr_root:Cell::new(NonNull::new_unchecked(ptr.as_ptr())),                marker: PhantomData,            };

Let's quickly look atGcBox definition.

pub(crate)structGcBoxHeader {//XXX This is horribly space inefficient - not sure if we care// We are using a word word bool - there is a full 63 bits of unused data :(//XXX: Should be able to store marked in the high bit of roots?    roots: Cell<usize>,    next:Option<NonNull<GcBox<dyn Trace>>>,    marked: Cell<bool>,}#[repr(C)]// to justify the layout computation in Gc::from_rawpub(crate)structGcBox<T: Trace+ ?Sized+'static> {    header: GcBoxHeader,    data: T,}

Okay. Now I have better understanding. Thetrace method just setmarked=true which means the ptr is in use. By the way who is calling thetrace method and when? The answer iscollect_garbage =>GcBox::trace_inner() =>trace. This is exciting! We're so close to the core of thegc. Let's look atcollect_garbage.

/// Collects garbage.fncollect_garbage(st:&mut GcState) {__snip__unsafefnmark(head:&mutOption<NonNull<GcBox<dyn Trace>>>)->Vec<Unmarked> {// Walk the tree, tracing and marking the nodesletmut mark_head=*head;whileletSome(node)= mark_head {__snip__unsafe {let unmarked=mark(&mut st.boxes_start);if unmarked.is_empty() {return;        }

Now we knowGcState.boxes_start is the actual starting point ofmark and it recursively marksGC<T> objects. My next question is who's setting upboxes_start? The answer is it is done inGcBox::new whenever it allocates new GcBox, it maintains a link list of GcBox and set it toboxes_start.

let gcbox=Box::into_raw(Box::new(GcBox {                header: GcBoxHeader {                    roots:Cell::new(1),                    marked:Cell::new(false),                    next: st.boxes_start.take(),                },                data: value,            }));            st.boxes_start=Some(unsafe {NonNull::new_unchecked(gcbox) });

And finally GcState is a thread local object as follows. So GcState is a kind of global variable which is local to a thread.

// The garbage collector's internal state.thread_local!(static GC_STATE: RefCell<GcState> = RefCell::new(GcState {    bytes_allocated: 0,    threshold: INITIAL_THRESHOLD,    boxes_start: None,}));

Summary of Trace (Mark)

GC-ed object should implementTrace trait.
Thetrace method should recursively calltrace method for inner objects.
In mark phase thegc callstrace forGcBox::newed objects.

Next:Finalize, Sweep, root and unroot - Understanding Rust GC - higepon blog.

2022-10-30

Mosh compiler/VM のデバッグを1週間続けた話

Mosh は R7RS に対応しつつあるのでecraven/r7rs-benchmarks というベンチマークを走らせてみた。今回は速度は気にせずにR7RSScheme 処理系としてコードを間違いなく正しく実行できるかに注力。結局57個中で3つのベンチマークで実行に失敗することが分かった。うち2つは比較的簡単に修正できた。最後の1つが手強かったので記録を残す。

スタート

失敗するのは conform というベンチマーク。期待される結果とは違うものをMosh が返してしまう。そして実行時間がやけに短い。conformベンチマークのスクリプト（r7rs-benchmarks/conform.scm）を見てみるとグラフ構造を作って何かやっているらしい。正直コードを理解しようとするのは3秒で諦めた。

この時点ではデバッグが難航する事は知る由もないのでデバッグ print を入れてなぜ期待と違うかを調べようとするがすぐに挫折。なぜならば

A) グラフ構造を扱っているので自分の中で自分を参照していてデバッグ print と相性が悪いこと。write/ss で shared structure を print できるがそれでも視認性が悪い。
B) データ構造が大きく print しきれない
C) そもそも何をやっているのかコードから読み取れない。

この状態で2-3時間無駄に使った。

心理的安全性

これは難航しそうだとようやく気付いたので少し落ち着くことにした。正しい状態がよく分からないのが問題なのでGauche で実行して比べることとした。次に処理内容が分からないのは良いとして、メインの処理を何となく名前から特定した。ここでようやくメインの処理にデバッグ print を入れてGauche と比較できるようになり、ある関数で一瞬で間違った値が返っていることが分かった。

間違うポイントが分かったので勝利を確信。その関数への入力を維持したまま再現コードを小さくしていくことにした。ところがこれがかなり難しい。入力も出力もグラフなので文字列や数字を扱うのとは別の難しさがある。色々やっているうちにぐちゃぐちゃになってしまった。元に戻らなくなってしまい大反省。debug という git branch を作り少しずつ進むようにしたら急に捗ったし壊すことも無くなった。チェックポイントあるし壊しても大丈夫という心理的安全性大事。1日かけて小さなコードに絞ることができた。

(define (foo)  (define (bar  n)    (cond ((= n1)#t)          (else           (let loop ((lst'(1)))             (if (null? lst)#t                 (and                   (display"=> recusive1\n")                  (bar1)                  (display"=> loop\n")                                           (loop (cdr lst))))))))    (bar0)    )(foo)

このコードの(bar 1) の呼び出しの後に(display "=> loop\n") が呼ばれないことが分かった。これは明らかにおかしいなぜならば(bar 1) は #t を返すから。

それで誰が悪いのか？

Scheme 処理系を書いたことのある人なら分かると思うが、これは色々怪しい。define の中に define があり named let があり末尾再帰があり。どこでバグっていてもおかしくない。

この時点でバグの原因候補は実行順に

コンパイラ手前の S 式の変換。例）and を if に変換
コンパイラの S式 => iform への変換
コンパイラの iform に対する最適化
コンパイラのコード生成部
コンパイラの merge instructions 部
VM
R6RS -> バックエンド変換部
R7RS ->R6RS 変換部

というのを切り替えないといけない。その上この辺りを触るのは10年以上ぶりである！

コンパイラとVM を調べる

最適化を OFF にすると再現しなくなったので

最適化そのものが間違っている
最適化で出てくるコードを正しく実行できないパスがある

あたりが怪しい。

pass2 の最適化の途中。1週間のデバッグを終えた今の目で見ればおかしいところは明らか。

($if ($asm NUMBER_EQUAL       ($lref n[1;0])       ($const1))  ($const#t)  ($letrec ((loop[0;0] ($lambda[loop;2] (lst[2;0])                         ($label #0                           ($if ($asm NULL_P                                  ($lref lst[2;0]))                             ($const#t)                             ($if ($call ($grefdisplay)                                    ($const"=> recusive1\n"))                               ($if ($call[tail-rec] ($lref bar[2;0])                                      ($const1))                                 ($if ($call ($grefdisplay)                                        ($const"=> loop\n"))                                   ($call[jump] ($call[embed] ($lambda[loop;2] (lst[2;0])                                                                label#0)                                                  ($const (1)))                                     ($asm CDR                                       ($lref lst[2;0])))                                   ($it))                                 ($it))                               ($it))))))            )    ($call[embed] ($lambda[loop;2] (lst[2;0])                    label#0)      ($const (1)))))

pass2 の最適化後の iform は以下の通り。ここで時間を使いすぎた。

($define () foo  ($lambda[foo;0] ()    ($call[embed40] ($lambda[bar;2] (n[1;0])                        ($label #0                          ($if ($asm NUMBER_EQUAL                                 ($lref n[1;0])                                 ($const1))                            ($const#t)                            ($call[embed70] ($lambda[loop;2] (lst[2;0])                                                ($label #1                                                  ($if ($asm NULL_P                                                         ($lref lst[2;0]))                                                    ($const#t)                                                    ($if ($call ($grefdisplay)                                                           ($const"=> recusive1\n"))                                                      ($if ($call[jump00] ($call[embed40] ($lambda[bar;2] (n[1;0])                                                                                                label#0)                                                                              ($const0))                                                             ($const1))                                                        ($if ($call ($grefdisplay)                                                               ($const"=> loop\n"))                                                          ($call[jump00] ($call[embed70] ($lambda[loop;2] (lst[2;0])                                                                                               label#1)                                                                             ($const (1)))                                                            ($asm CDR                                                              ($lref lst[2;0])))                                                          ($it))                                                        ($it))                                                      ($it)))))                              ($const (1))))))      ($const0))))

結局 pass2 の最適化で local call を埋め込む処理あたりで何かがおかしい事は分かるのだが。この iform がおかしいのか。後続の処理がおかしいのか分からないので後続も見る。実際のVM instruction 列を表示してみると。ますます分からない。

  CLOSURE810#f011  ((reproduce.scm1) foo)  LET_FRAME7  CONSTANT_PUSH0  ENTER;; Label #01  REFER_LOCAL_PUSH_CONSTANT01  BRANCH_NOT_NUMBER_EQUAL;; if (= n 1)5  CONSTANT#t  LOCAL_JMP;; goto label #157  LET_FRAME5  REFER_LOCAL_PUSH0  DISPLAY1  CONSTANT_PUSH  (1)  ENTER1  REFER_LOCAL_BRANCH_NOT_NULL05  CONSTANT#t  LOCAL_JMP38  FRAME6  CONSTANT_PUSH  => recusive1  REFER_GLOBAL_CALLdisplay;; (display "=> recusive1\n")1;; # of args  TEST;; (display ...) return value is true. So we skip the +1 next line and go to +2 line.29  CONSTANT_PUSH;; Come here after (display ...) call1  SHIFTJ;; adjust SP and FP1;; depth4;; diff0;; How many closure to go up?  LOCAL_JMP;; Jump to label #0-42  TEST19  FRAME6  CONSTANT_PUSH  => loop  REFER_GLOBAL_CALLdisplay1  TEST10  REFER_LOCAL0  CDR_PUSH  SHIFTJ110  LOCAL_JMP-43  LEAVE1  LEAVE;; label #21;; adjust stack  RETURN;; return to the code (the code is taken from the top of stack). ** But we don't have the code in the stack?***0  DEFINE_GLOBAL  foo  HALT NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP)

動的にVM で実行される様子を stack と共に。

========================FRAME  FP|0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"========================REFER_GLOBAL  FP|0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"========================CALL    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"========================LET_FRAME                   # LET_FRAME forlambda[foo]    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"  FP|4: 4                   # Save fp    |5:"#(#(CLOSURE ...))" # Closure========================CONSTANT    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"  FP|4: 4    |5:"#(#(CLOSURE ...))"========================PUSH    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"  FP|4: 4    |5:"#(#(CLOSURE ...))"    |6: 0                  # Constant 0========================ENTER                      # Adjust fp    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:0========================REFER_LOCAL               # a=(Constant0)    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 0========================PUSH    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:0                  # Constant0    |7: 0                  # Constant 0========================CONSTANT                   # a=(Constant 1)    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:0    |7: 0========================BRANCH_NOT_NUMBER_EQUAL   # a != stack-bottom(Constant 0)    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:0                  # Discarded stack top.========================LET_FRAME                  # LET_FRAME for loop.    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 0    |7:6                   # Save fp    |8: "#(#(CLOSURE ...))" # Push closure========================REFER_LOCAL                 # a=(Constant 0) REALLY???    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:0    |7: 6    |8:"#(#(CLOSURE ...))"========================PUSH    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:0                  # push (Constant0)========================DISPLAY                    # Create adisplayand set it to closure.    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 0    |7:6    |8: "#(#(CLOSURE ...))"# Note stack is popped.========================CONSTANT                   # a=(1)    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:0    |7: 6    |8:"#(#(CLOSURE ...))"========================PUSH    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:1                  # (1)========================ENTER    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"  FP|9:1                 # New FP.========================REFER_LOCAL               # a=(1)    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"  FP|9:1========================BRANCH_NOT_NULL          # Go toelse.    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"  FP|9:1========================FRAME                     # Push stuff.    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"  FP|9:1    |10: "(#(#f ...)"      # closure    |11:9                 # fp    |12: 54                # pc + n    |13:"(#(CLOSURE ...)" # Current codes========================CONSTANT                   # a="=> recursive1"    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"  FP|9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"========================PUSH    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"  FP|9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"    |14: "=> recusive1\n"========================REFER_GLOBAL               # a=<display>    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"    |6:0    |7: 6    |8:"#(#(CLOSURE ...))"  FP|9: 1    |10:"(#(#f ...)"    |11: 9    |12:54    |13: "(#(CLOSURE ...)"    |14:"=> recusive1\n"========================CALL                       # call a=<display>. Note codes is now body ofdisplay.    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"  FP|14: "=> recusive1\n"    |15: ()               #display has optional-arg'()========================REFER_LOCAL               # a="=> recursive1\n"    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"  FP|14: "=> recusive1\n"    |15: ()========================BRANCH_NOT_NULL           # a isnot null so we go to Else. But this is really weird.    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"  FP|14: "=> recusive1\n"    |15: ()========================REFER_LOCAL                # a="=> recursive1"    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"  FP|14: "=> recusive1\n"    |15: ()========================PUSH    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"  FP|14: "=> recusive1\n"    |15: ()    |16: "=> recusive1\n"========================REFER_FREE 0               # Point codes + 6    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"    |6:0    |7: 6    |8:"#(#(CLOSURE ...))"    |9: 1    |10:"(#(#f ...)"    |11: 9    |12:54    |13: "(#(CLOSURE ...)"  FP|14:"=> recusive1\n"    |15: ()    |16:"=> recusive1\n"========================TAIL_CALL                  # call <display>and jump to#(RETURN1 ...)=> recusive1    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"    |6: 0    |7:6    |8: "#(#(CLOSURE ...))"    |9:1    |10: "(#(#f ...)"    |11:9    |12: 54    |13:"(#(CLOSURE ...)"  FP|14: "=> recusive1\n"========================RETURN    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"    |6:0    |7: 6    |8:"#(#(CLOSURE ...))"  FP|9: 1                   # this is ()========================TEST                        # Return value of display is #t    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"    |6:0    |7: 6    |8:"#(#(CLOSURE ...))"  FP|9: 1========================CONSTANT    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"    |6:0    |7: 6    |8:"#(#(CLOSURE ...))"  FP|9: 1========================PUSH    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"    |6:0    |7: 6    |8:"#(#(CLOSURE ...))"  FP|9: 1                    # (1)    |10:1                   #1========================SHIFTJ140                 # Adjust frame for jump. Stack is now for bar call.    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 1========================LOCAL_JMP    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:1========================REFER_LOCAL                # a=1    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 1========================PUSH    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:1    |7: 1========================CONSTANT    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:1    |7: 1========================BRANCH_NOT_NUMBER_EQUAL    # Now 1=1. Jump to else.    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:1========================CONSTANT#t    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"    |4: 4    |5:"#(#(CLOSURE ...))"  FP|6: 1========================LOCAL_JMP 66    |0:"#(#(FRAME ...))"    |1: 0    |2:6    |3: "(#(FRAME ...)"    |4:4    |5: "#(#(CLOSURE ...))"  FP|6:1========================LEAVE    |0: "#(#(FRAME ...))"    |1:0    |2: 6    |3:"(#(FRAME ...)"========================RETURN========================HALT

これを脳内で何回も再生し iform と行ったり来たりして実行していくと(bar 1) を実行した後に local_jump したら戻ってこれないのは自明であることに気づく。この lambda 埋め込みは間違いである。それに気づけばあとは簡単で(and の途中で埋め込まれてしまったのは(bar 1) 呼び出しが末尾再帰と誤認されたのが原因。

この1行の修正でお仕事完了。

2022-10-18

Scheme で MNIST 続き

Scheme で MNIST からの学び - higepon blog の続き。

f64array

前回のプロトタイプをもとにMosh のシステムライブラリとして f64array というものを導入した。double 型の2次元の行列を表現したもの。追加した手続きは最低限のもの。

make-f64array
f64array-ref
f64array-set!
f64array-shape
f64array-dot-product)

f64array の構築、setter/getter、dot product 。プロファイラで見てみると行列の shape を返す手続きも無視できない回数呼ばれている。おそらく行列同士の計算での broadcast で呼ばれているのだろう。

Portable なコードを書く

R7RS では処理系ごとの違いをうまく扱える仕組みが用意されているのでGauche とMosh 両方で動くように整えてみよう。行列の初期化で make-normal-generator を使うのだがGauche にはSRFI-194 がないようなので data.random ライブラリから似たような手続きを持ってきて rename している。ところでSRFI-194 の author は Shiro さんなのであった。

  (cond-expand   [gauche      (import(rename (data random) (reals-normal$ make-normal-generator)))]   [(library (srfi194))     (import(only (srfi 194) make-normal-generator))])

またMosh では .mosh.sld という拡張子のライブラリを優先して読み込むという機能があるので以下のようにMosh が読み込むライブラリ、Mosh 以外が読み込むライブラリを分けることができる。ここで行列の実装を分岐している。

そしてどちらの行列ライブラリにも共通なコードは(include "matrix-common.scm") のように直接ファイルを include する。

というわけでめでたくMosh とGauche の両方で動く MNIST デモができました。他のR7RS 処理系でも少しの手直しで動くのではないかと期待している。PR 待ってます。コードはmosh/tests/mnist at master · higepon/mosh · GitHub。

2022-10-09

Scheme で MNIST からの学び

久しぶりにMosh Scheme に触っている。良い機会なのでゼロから作るDeep Learning ―Pythonで学ぶディープラーニングの理論と実装を参考にScheme で動く MNIST デモを実装しようと思い立つ。MNIST は 0 から 9 までの手書き数字の認識デモ。Neural Networkフレームワークのデモやチュートリアルでよく使われるあれだ。なおScheme にはPython でいうところの numpy は存在しないので必要な行列演算はすべてScheme で実装することとする。

データサイズ

train data 画像(28 x 28 x 1) と正解ラベル（0-9）
train data: 60000件
test data: 10000件

実装前の期待と予想

他の言語・フレームワークでデモ可能だったのだから CPU 上で十分速く動作する。
比較的小さなモデル・train data なので pureScheme 実装で大丈夫だろう。
matrix-mul がボトルネックになるかもしれない。

技術的な選択

Matrix はVector のVector として表現する。UniformVector （格納する型が決まっているもの）もありだがMosh では未実装なので見送り。
行列演算の高速実装は特に目指さない。ナイーブな実装で良い。

matrix-mul の実装

特に工夫のないナイーブな実装。３重ループ。

(define (matrix-mul a b)  (unless (= (matrix-shape a1) (matrix-shape b0))    (error"matrix-mul shapes don't match" (matrix-shape a) (matrix-shape b)))  (let* ([nrows (matrix-shape a0)]         [ncols (matrix-shape b1)]         [m     (matrix-shape a1)]         [mat (matrix nrows ncols)])    (define (mul row col)      (let loop ([k0]                 [ret0])        (if (= k m)            ret            (loop (+ k1) (+ ret (* (mat-at a row k) (mat-at b k col)))))))    (do ((i0 (+ i1)))        ((= i nrows) mat)      (do ((j0 (+ j1)))          ((= j ncols))        (mat-at mat i j (mul i j))))))

結果1

2層の NN を hidden_size = 10 batch_size=3 で動かしてみたが 1 batch 回したプロファイラの結果。matrix-mul は 32165 回呼ばれ 95%（98 秒）を占めている。

Time%        msec      calls   name                              location  95        98500      32165   (matrix-mul a b)                  <transcoded-textual-input-port <binary-input-port tests/mnist.scm>>:175   1         1140      64344   (matrix-map proc a)               <transcoded-textual-input-port <binary-input-port tests/mnist.scm>>:96

結果2

1 は明らかに遅いのでC++ 側で実装してみる。98秒->75秒だが思ったよりも速くない。逆に言えばVM ループは思ったよりも速い。

  92        74590      32165   matrix-mul                           1          680      64543   (matrix-element-wise op a...)     <transcoded-textual-input-port <binary-input-port tests/mnist.scm>>:186

ここでC++ 側の実装を注意して見る必要がある。途中の計算結果の保持にScheme Object を使っていること。行列に Fixnum もしくは Flonum が存在するので型チェックをしてくれる Arithmetic::add と mul を利用して計算していること。これだとmost inner loop で heap allocation が発生してしまう。

    Object ret = Object::makeVector(numRowsA);for (size_t i =0; i < numRowsA; i++) {        ret.toVector()->set(i, Object::makeVector(numColsB));    }for (size_t i =0; i < numRowsA; i++) {for (size_t j =0; j < numColsB; j++) {            Object value = Object::makeFlonum(0.0);for (size_t k =0; k < numColsA; k++) {                Object aik = a->ref(i).toVector()->ref(k);                Object bkj = b->ref(k).toVector()->ref(j);                value = Arithmetic::add(value, Arithmetic::mul(aik, bkj));            }            ret.toVector()->ref(i).toVector()->set(j, value);        }    }

結果3

もしも UniformVector なら内部データの型チェックは不要になり、すべての計算を double のまま行うことが可能なはず。これを模した適当な実装をやってみる。不正なデータを入力されたらインタプリタが落ちてしまうがデモなのでよしとする。98秒が60m秒になり実用的な速度に落ち着いた。heap allocation は避けたいと思わせる結果。

    1           60      32165   matrix-mul

double value =0.0;for (size_t k =0; k < numColsA; k++) {                Object aa = a->ref(i).toVector()->ref(k);                Object bb = b->ref(k).toVector()->ref(j);double aik = aa.isFixnum() ? aa.toFixnum() : aa.toFlonum()->value();double bkj = bb.isFixnum() ? bb.toFixnum() : bb.toFlonum()->value();                value += aik * bkj;            }            ret.toVec

学びと改善のアイデア

Matrix をVector で表現すると型チェックのオーバーヘッドがある
Matrix の内部表現は double/int の2次元配列としC++ でアクセサと mul を実装すれば良い。
matrix-mul 以外の行列操作はScheme 側で書いてもほぼ問題なさそう。多少遅くても呼ばれる回数が matrix-mul と比べて極端に少ない。例）add, element-wise multiply, transpose, shape。
matrix-mul はSIMD などでもっと速くなるだろう。

余談

numpy は歴史が長いだけあってよく出来ている。
オブジェクト指向の操作の方がコードが読みやすい場合があるa.t は a の転置。(t a) だと若干意味合いが違うし広すぎる。
numpy の broadcasting の仕組みは便利だし必須部品であることがよく分かった。スカラ lr と行列A の掛け算とか。
numpy の subset がSRFI で提案されたら面白いと思う。ただ上記の理由によりScheme だけで実装すると遅いと思うので悩ましい。各処理系が独自に実装することを期待はできなそうな。
現実的には Tensorflow フロントエンドでScheme が使えれば良いのかもしれないが、それならPython で十分。
Scheme で気軽に機械学習が楽しめるという Path は意外と遠い。Python 1強である理由も分かった。

Profile

id:higepon

SubscribedunsubscribeSubscribeSubscribe

About this blog

Links

Movatterモバイル変換

How to allocate object

Mark

Trace

Sweep

Finalize Trait

collect_garbage

Sweep

Rooting

Summary

Goal

How to use

What isGc<Foo>?

Trace Trait

Summary of Trace (Mark)

スタート

心理的安全性

それで誰が悪いのか？

コンパイラとVM を調べる

f64array

Portable なコードを書く

データサイズ

実装前の期待と予想

技術的な選択

matrix-mul の実装

結果1

結果2

結果3

学びと改善のアイデア

余談