Multi-core implementation of a word, character, line counting program
kgrz/kwc
An attempt at a file offset-based wc implementation that can use multiple cores to read the same file. Outline:
- Get the number of CPUs on the machine.
- Create that many goroutines, each of which reads the file in chunks.
- Offsets are computed from that number so that each goroutine reads its chunk starting from its own offset.
Mostly works on *nix machines
Warning: Code in here is crap, don't read it.
I haven't created releases or per-OS packages, so the only way to try this out is via go get, which means you need to have a working Go installation.
go get github.com/kgrz/kwc
That should compile and install the binary into your $GOPATH (under $GOPATH/bin). Then run the binary as kwc. If it's not there, then cd into $GOPATH/src/github.com/kgrz/kwc and run go install.
I'm finding it not straightforward to do UTF-8 aware reading, because if a chunk boundary cuts a particular multi-byte character in the middle, that shouldn't be counted as two separate characters! If we use utf8.RuneCount() on a slice that has a partial multi-byte character, that count can end up being wrong.

Update: I think I have a solution for this! Will implement it soon.
It's fast™
The os.File.ReadAt Go function internally uses the pread syscall, which works well with multi-threaded access of the same file because it reads at an explicit offset without moving the shared file position: http://man7.org/linux/man-pages/man2/pread.2.html

The initial implementation used a naive isspace function I wrote that only catered to spaces and tabs (ASCII 32 and 9). But as per the man page of wc and the isspace function that gets used in it, a "space" for the purposes of wc includes both whitespace characters and newlines or their equivalents:
- ascii space (32)
- ascii tab (9) \t
- new line (10) \n
- vertical tab (11) \v
- form feed (12) \f
- carriage return (13) \r
- non breaking space (0xA0)
- next line character (0x85)
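A minimal sketch of a check that covers the list above, together with a word counter built on it. The isSpace and countWords names are made up for this example and are not the functions in this repo.

```go
package main

import "fmt"

// isSpace (hypothetical) reports whether r is one of the characters
// listed above.
func isSpace(r rune) bool {
	switch r {
	case ' ', '\t', '\n', '\v', '\f', '\r', 0xA0, 0x85:
		return true
	}
	return false
}

// countWords (hypothetical) counts maximal runs of non-space characters.
func countWords(s string) int {
	words := 0
	inWord := false
	for _, r := range s {
		if isSpace(r) {
			inWord = false
		} else if !inWord {
			inWord = true
			words++
		}
	}
	return words
}

func main() {
	fmt.Println(countWords("one\ttwo\u00a0three\r\nfour\vfive")) // 5
}
```

For characters in the Latin-1 range, the standard library's unicode.IsSpace treats this same set as whitespace, so it could be used instead of a hand-written switch.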
Avoiding bufio.Scanner's Scan() is maybe something you'd want to consider if you're looking for speed. The Scan() function does a lot of extra things, like basic consistent error handling, and it's very useful if you want to store the scanned bytes into lines/words for every iteration. We don't need to do that when just counting the characters or words, so we avoid using it. The perf impact is considerable.

To do a basic test of this hypothesis, try running the program on cat-ed output, which uses the scanner codepath, and compare it with wc.