kgrz/kwc


Multi-core implementation of a word, character, line counting program

License

MIT license


kwc

An attempt at a file offset-based `wc` implementation that can use multiple cores to read the same file. Outline:

Get the number of CPUs on the machine.
Create that many goroutines, each of which reads the file in chunks.
Compute offsets from that number so that each goroutine reads its chunk starting from its own offset.
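
The outline above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.go`: the names `chunk`, `splitOffsets`, `countLines` and `countLinesParallel` are hypothetical, and it counts only newlines to keep the example short.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"runtime"
	"sync"
)

// chunk is a hypothetical helper type: the byte range [off, off+size) of a file.
type chunk struct {
	off, size int64
}

// splitOffsets divides a file of the given size into at most n contiguous
// chunks. The last chunk absorbs the remainder.
func splitOffsets(size int64, n int) []chunk {
	if size <= 0 {
		return nil
	}
	if n < 1 {
		n = 1
	}
	if int64(n) > size {
		n = int(size)
	}
	per := size / int64(n)
	chunks := make([]chunk, 0, n)
	for i := 0; i < n; i++ {
		off := int64(i) * per
		sz := per
		if i == n-1 {
			sz = size - off
		}
		chunks = append(chunks, chunk{off, sz})
	}
	return chunks
}

// countLines counts '\n' bytes in one chunk. The SectionReader reads through
// f.ReadAt, so goroutines never share a file position.
func countLines(f *os.File, c chunk) (int, error) {
	buf := make([]byte, 64*1024)
	r := io.NewSectionReader(f, c.off, c.size)
	total := 0
	for {
		n, err := r.Read(buf)
		total += bytes.Count(buf[:n], []byte{'\n'})
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// countLinesParallel starts one goroutine per chunk and sums the results.
func countLinesParallel(path string, workers int) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	chunks := splitOffsets(info.Size(), workers)
	counts := make([]int, len(chunks))
	errs := make([]error, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c chunk) {
			defer wg.Done()
			counts[i], errs[i] = countLines(f, c)
		}(i, c)
	}
	wg.Wait()
	total := 0
	for i := range counts {
		if errs[i] != nil {
			return 0, errs[i]
		}
		total += counts[i]
	}
	return total, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: linecount FILE")
		os.Exit(1)
	}
	n, err := countLinesParallel(os.Args[1], runtime.NumCPU())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(n)
}
```

Line counts can simply be summed across chunks. Word counts need extra care, because a word may straddle a chunk boundary.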

Mostly works on *nix machines

Warning: Code in here is crap, don't read it.

Installation:

I haven't created releases or per-OS packages, so the only way to try this out is via `go get`, which means you need to have a working Go installation.

go get github.com/kgrz/kwc

That should compile and install the binary into your `$GOPATH`. Then run the binary as `kwc`. If it's not there, then `cd` into `$GOPATH/src/github.com/kgrz/kwc` and run `go install`.

Some problems:

I'm finding it not straightforward to do UTF-8-aware reading: if a chunk boundary cuts a particular multi-byte character in the middle, the two halves shouldn't be counted as separate characters! If we use `utf8.RuneCount()` on a slice that has a partial multi-byte character, that count can end up being wrong.
Update: I think I have a solution for this! Will implement it soon.
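
To make the problem concrete, here is a small sketch that splits a string in the middle of a two-byte character and compares the counts. The helper `countRuneStarts` is hypothetical and is not necessarily the solution mentioned above. It shows one common workaround: count only the bytes that are not UTF-8 continuation bytes, so each character is attributed to the chunk that holds its first byte.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRuneStarts counts the bytes that are not UTF-8 continuation bytes
// (bit pattern 10xxxxxx). For valid UTF-8 this equals the number of
// characters, and it stays correct when the input is split at any byte.
func countRuneStarts(b []byte) int {
	n := 0
	for _, c := range b {
		if c&0xC0 != 0x80 {
			n++
		}
	}
	return n
}

func main() {
	data := []byte("h\u00e9llo") // the second character is encoded as 2 bytes
	left, right := data[:2], data[2:] // the split lands inside that character

	fmt.Println(utf8.RuneCount(data))                             // whole buffer
	fmt.Println(utf8.RuneCount(left) + utf8.RuneCount(right))     // per-chunk RuneCount
	fmt.Println(countRuneStarts(left) + countRuneStarts(right))   // per-chunk start bytes
}
```

The per-chunk `utf8.RuneCount` sum comes out one higher than the whole-buffer count, because each half of the split character is treated as a separate invalid rune of width one byte. Note that for input that is not valid UTF-8, `countRuneStarts` and `utf8.RuneCount` can disagree even without any splitting.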

What's the advantage?:

It's fast™

Learnings:

The `os.File.ReadAt` Go function internally uses the `pread` syscall, which works well with multi-threaded access to the same file: http://man7.org/linux/man-pages/man2/pread.2.html
The initial implementation used a naive `isspace` function I wrote that only catered to spaces and tabs (ASCII 32 and 9). But as per the man page of `wc` and the `isspace` function that gets used in it, a "space" for the purposes of `wc` includes both whitespace characters and newlines or equivalents:
- ascii space (32)
- ascii tab (9) \t
- new line (10) \n
- vertical tab (11) \v
- form feed (12) \f
- carriage return (13) \r
- non breaking space (0xA0)
- next line character (0x85)
Avoiding `bufio.Scanner`'s `Scan()` is maybe something you'd want to consider if you're looking for speed. The `Scan()` function does a lot of extra things, like basic consistent error handling, and it's very useful if you want to store the scanned bytes as lines/words on every iteration. We don't need to do that when just counting the characters or words, so we avoid using it. The perf impact is considerable.
To do a basic test of this hypothesis, try running the program on `cat`-ed output, which uses the scanner codepath, and compare it with `wc`.
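
As a rough illustration of the last two points, the sketch below counts lines, words and characters directly over a byte buffer, without `bufio.Scanner`, using the separator list above. The function names `isSpace` and `count` are hypothetical and this is not the code in `main.go`. It also assumes that 0xA0 and 0x85 are meant as the code points U+00A0 and U+0085 in UTF-8 input, so it decodes runes rather than comparing raw bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isSpace reports whether r is one of the separators listed above.
func isSpace(r rune) bool {
	switch r {
	case ' ', '\t', '\n', '\v', '\f', '\r', 0xA0, 0x85:
		return true
	}
	return false
}

// count walks the buffer once and returns the number of newlines, words and
// characters. Nothing is stored per line or per word; only counters change.
func count(buf []byte) (lines, words, chars int) {
	inWord := false
	for len(buf) > 0 {
		r, size := utf8.DecodeRune(buf) // size is 1 for invalid bytes
		buf = buf[size:]
		chars++
		if r == '\n' {
			lines++
		}
		if isSpace(r) {
			inWord = false
		} else if !inWord {
			inWord = true
			words++
		}
	}
	return lines, words, chars
}

func main() {
	l, w, c := count([]byte("one two\tthree\nfour\u00a0five\n"))
	fmt.Println(l, w, c)
}
```

A single pass with a one-bit "in a word" state is all that counting needs, which is why skipping the scanner's token handling can pay off.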
