Multi-core implementation of a word, character, line counting program


kgrz/kwc


An attempt at a file offset-based `wc` implementation that can use multiple cores to read the same file. Outline:

  1. Get the number of CPUs on the machine.
  2. Create that many goroutines, each of which reads the file in chunks.
  3. Compute file offsets based on that number, so that each goroutine starts reading from its own offset.

Mostly works on *nix machines.

Warning: Code in here is crap, don't read it.

Installation:

I haven't created releases or per-OS packages, so the only way to try this out is via `go get`, which means you need to have a working Go installation.

go get github.com/kgrz/kwc

That should compile and install the binary into your `$GOPATH`. Then run the binary as `kwc`. If it's not there, then `cd` into `$GOPATH/src/github.com/kgrz/kwc` and run `go install`.

Some problems:

  1. I'm finding it non-straightforward to do UTF-8 aware reading, because if a chunk boundary cuts a multi-byte character in the middle, that character shouldn't be counted as two separate characters! If we use `utf8.RuneCount()` on a slice that contains a partial multi-byte character, the count can end up being wrong.

    Update: I think I have a solution for this! Will implement it soon.

What's the advantage?:

It's fast™

Learnings:

  1. Go's `os.File.ReadAt` method internally uses the `pread` syscall on *nix systems. `pread` reads from an explicit offset without changing the file offset, so it works well with multi-threaded access to the same file: http://man7.org/linux/man-pages/man2/pread.2.html

  2. The initial implementation used a naive `isspace` function I wrote that only catered to spaces and tabs (ASCII 32 and 9). But as per the man page of `wc` and the `isspace` function it uses, a "space" for the purposes of `wc` covers both whitespace characters and newlines or their equivalents:

    • ascii space (32)
    • ascii tab (9) \t
    • new line (10) \n
    • vertical tab (11) \v
    • form feed (12) \f
    • carriage return (13) \r
    • non breaking space (0xA0)
    • next line character (0x85)

  3. Avoiding `bufio.Scanner.Scan()` is something you may want to consider if you're looking for speed. The `Scan()` function does a lot of extra things, like basic consistent error handling, and it's very useful if you want to store the scanned bytes as lines/words on every iteration. We don't need to do that when just counting characters or words, so we avoid using it. The perf impact is considerable.

    To do a basic test of this hypothesis, try running the program on `cat`-ed output, which uses the scanner codepath, and compare it with `wc`.
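
Learnings 2 and 3 can be sketched together: a byte-level `isSpace` covering the values listed above, and a single pass over a raw buffer that counts lines and words without going through `bufio.Scanner`. The function names are hypothetical and this is not the repository's code. One assumption to be aware of: treating the bytes 0x85 and 0xA0 as spaces only makes sense for single-byte encodings such as Latin-1, since in UTF-8 those values also occur as continuation bytes inside other characters.

```go
package main

import "fmt"

// isSpace reports whether b is one of the "space" byte values from the
// list above. 0x85 and 0xA0 are only meaningful as spaces in single-byte
// encodings; in UTF-8 they can be part of a multi-byte character.
func isSpace(b byte) bool {
	switch b {
	case ' ', '\t', '\n', '\v', '\f', '\r', 0x85, 0xA0:
		return true
	}
	return false
}

// countChunk counts newlines and words in buf in one pass. A word is a
// maximal run of non-space bytes. No per-word or per-line slices are
// allocated, which is the work a Scanner would otherwise do.
func countChunk(buf []byte) (lines, words int) {
	inWord := false
	for _, b := range buf {
		if b == '\n' {
			lines++
		}
		if isSpace(b) {
			inWord = false
		} else if !inWord {
			inWord = true
			words++
		}
	}
	return lines, words
}

func main() {
	lines, words := countChunk([]byte("hello  world\nfoo\tbar\n"))
	fmt.Println(lines, words)
}
```

When a file is split into chunks, a word can straddle a chunk boundary. A multi-chunk version would also need to check whether one chunk ends and the next begins in the middle of a word, and subtract one from the total in that case.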
