Sorting Algorithm
Sort a huge file too large to fit into memory.
The algorithm consists of reading the large file in chunks of data small enough to fit in main memory, sorting each chunk, writing it out to a temporary file, and finally merging the smaller sorted files into a single larger one.
For more info see: https://en.wikipedia.org/wiki/External_sorting
The sorting algorithm used on each chunk can be any popular sort, like quicksort.
For simplicity, one can assume that the file consists of fixed-length integers and that the sort function is less-than (<).
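As a concrete illustration of these steps, here is a minimal sketch in Python (the chunk size and file layout — whitespace-separated integers — are arbitrary choices for this sketch; `heapq.merge` performs the k-way merge of the sorted runs):

```python
import heapq
import os
import tempfile

def write_run(values):
    """Spill one sorted chunk ("run") to its own temporary file."""
    fd, path = tempfile.mkstemp(prefix="run_", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in values)
    return path

def external_sort(input_path, output_path, chunk_size=100_000):
    """Sort a file of whitespace-separated integers that may not fit in memory."""
    run_paths, chunk = [], []
    # Phase 1: read fixed-size chunks, sort each in memory, spill to temp files.
    with open(input_path) as f:
        for line in f:
            for token in line.split():
                chunk.append(int(token))
                if len(chunk) == chunk_size:
                    run_paths.append(write_run(sorted(chunk)))
                    chunk = []
    if chunk:
        run_paths.append(write_run(sorted(chunk)))
    # Phase 2: k-way merge of the sorted runs into the output file.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            for value in heapq.merge(*(map(int, run) for run in runs)):
                out.write(f"{value}\n")
    finally:
        for run in runs:
            run.close()
        for p in run_paths:
            os.remove(p)
```

Only one chunk — plus one buffered line per run during the merge — is ever held in memory at a time.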
This effort uses an in-place Quicksort, which tends to move things around less than a merge sort and so hopefully means fewer writes to the file and a shorter wait for the task to finish. It suits the characteristics of AppleScript's File Read/Write commands and the fact that stable sorting isn't required with integers.
A "half chunk" of integers at a time is read to each of two buffer lists covering different sections of the file range being partitioned. Only those integers needing to be swapped are written back to the file and each list is replaced as it's used up. When the converging sections eventually overlap, a single list is used instead which is updated in parallel with the file to ensure that the partitioning repeat stops in the right place. Partitions less than a "chunk" in length are sorted in memory with a heap sort. (The Foundation framework has a built-in NSMutableArray sort which is faster than a vanilla heap sort — even with the necessary derivation of NSMutableArrays from the lists and lists from the sorted arrays — but I don't know how well this fits the task's "low memory" conceit.)
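The overall control flow just described — partition with a median-of-three pivot until a range is short enough to sort in memory, recursing into the shorter partition and iterating on the longer one — can be sketched in Python. This is only an illustration of the strategy, not of the buffered file I/O: an in-memory list stands in for the file, and the threshold is arbitrary.

```python
def hybrid_sort(data, lo, hi, threshold=64):
    """Quicksort a range of `data`; hand small partitions to an in-memory sort.
    `data` stands in for the file; the AppleScript version reads it in chunks."""
    while lo < hi:
        if hi - lo + 1 <= threshold:
            data[lo:hi + 1] = sorted(data[lo:hi + 1])  # "in-memory" sort
            return
        # Median-of-three pivot selection.
        mid = (lo + hi) // 2
        pivot = sorted((data[lo], data[mid], data[hi]))[1]
        i, j = lo, hi
        while i <= j:
            while data[i] < pivot:
                i += 1
            while data[j] > pivot:
                j -= 1
            if i <= j:
                data[i], data[j] = data[j], data[i]
                i += 1
                j -= 1
        # Recurse into the shorter partition, iterate on the longer one
        # (this is the usual tail-call-elimination trick).
        if j - lo < hi - i:
            hybrid_sort(data, lo, j, threshold)
            lo = i
        else:
            hybrid_sort(data, i, hi, threshold)
            hi = j
```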
(*Quicksort algorithm: S.A.R. (Tony) Hoare, 1960.Optimisations by Robert Sedgewick and others at various times.Heap sort algorithm: J.W.J. Williams, 1964.*)useAppleScriptversion"2.3.1"-- MacOS 10.9 (Mavericks) or later — for these 'use' commands!useinternalSorter:script"Heap sort"-- <https://www.rosettacode.org/wiki/Sorting_algorithms/Heapsort#AppleScript>.usescriptingadditions-- Configuration.propertymaxChunkSize:256000-- 256 KBytes (64000 AppleScript integers). The larger this figure can be, the less slow the sort.propertyintegerSize:4propertymaxInternalSortSize:maxChunkSizeonexternalSort(theFile)-- Param: file, alias, or HFS path.(* Properties and handlers for the sort. *)scripto-- Precalculated values.propertytwiceIntegerSize:integerSize+integerSizepropertymaxHalfChunkSize:maxChunkSizedivtwiceIntegerSize*integerSize-- Reference number of the system file handle for the open file.propertyfRef:missing value-- Start byte indices of integers in the file.propertyi:missing valuepropertyj:missing value-- Start byte indices of the most recent additions to pivot stores in the file.propertypLeft:missing valuepropertypRight:missing value-- Two buffer lists and assocatiated info.propertyleftList:missing valuepropertyleftEndByte:missing value-- End byte index of equivalent range in file.propertya:missing value-- Left list index.propertyleftListLength:missing valuepropertyrightList:missing valuepropertyrightStartByte:missing value-- Start byte index of equivalent range in file.propertyb:missing value-- Right list index.-- Whether or not down to single-list working.propertysingleList:missing value(* Quicksort handler. Sorts a file of integers in place. 
*)onqsrt(l,r)-- Params: start-byte indices in the file of the first and last integers in a range to be partitioned.repeat-- Tail call elimination repeat.-- Prime the properties for this range.set{i,j,pLeft,pRight,leftEndByte,a,leftListLength,rightStartByte,b,singleList}to¬{l,r,l-integerSize,r+integerSize,l-1,0,0,r+integerSize,0,false}-- Get the first and last integers in the range, setting up the first two buffer lists in the process.setleftValuetonextLeftInteger(l,r)setrightValuetonextRightInteger(l,r)-- Read a third integer directly from the middle of the range in the file.setpivotto(readfReffrom((l+r-2)divtwiceIntegerSize*integerSize+1)forintegerSizeasinteger)-- Choose one of the three as the pivot (median-of-3 pivot selection).setleftGreaterThanRightto(leftValue>rightValue)if(leftValue>pivot)thenif(leftGreaterThanRight)thenif(rightValue>pivot)thensetpivottorightValueelsesetpivottoleftValueendifelseif(pivot>rightValue)thenif(leftGreaterThanRight)thensetpivottoleftValueelsesetpivottorightValueendifendif-- Whichever's now the pivot, swap the outermost two if the left's greater than the right.-- If either of them *is* the pivot, advance the pivot store boundary on the relevant side.if(leftGreaterThanRight)thenwriteleftValuetofRefasintegerstartingatrwriterightValuetofRefasintegerstartingatlif(leftValue=pivot)thensetpRighttorelseif(rightValue=pivot)thensetpLefttolendifelseif(leftValue=pivot)thensetpLefttolif(rightValue=pivot)thensetpRighttorendif-- Continue partitioning the range.setitol+integerSizesetjtor-integerSizerepeatuntil(i>j)-- Partitioning repeat.setleftValuetonextLeftInteger(l,r)repeatwhile(leftValue<pivot)setitoi+integerSizesetleftValuetonextLeftInteger(l,r)endrepeatsetrightValuetonextRightInteger(l,r)repeatwhile(rightValue>pivot)setjtoj-integerSizesetrightValuetonextRightInteger(l,r)endrepeatif(j>i)then-- Three-way partitioning: if either value to be swapped is a pivot instance, extend the pivot store-- on the destination side and, if the appropriated 
slot isn't already the pivot destination, swap its-- current content for (a copy of) the pivot and use the retrieved value instead in the main swap.if(leftValue=pivot)thensetpRighttopRight-integerSizeif(pRight>j)thensetleftValueto(readfReffrompRightforintegerSizeasinteger)writepivotasintegertofRefstartingatpRightendifendifif(rightValue=pivot)thensetpLefttopLeft+integerSizeif(pLeft<i)thensetrightValueto(readfReffrompLeftforintegerSizeasinteger)writepivotasintegertofRefstartingatpLeftendifendif-- Write the values to be swapped to the appropriate places in the file.writerightValuetofRefasintegerstartingatiwriteleftValuetofRefasintegerstartingatj-- If down to a single buffer list, update this too so that the repeat will know when to stop.if(singleList)thensetitemaofmyleftListtorightValuesetitembofmyleftListtoleftValueendifelseif(i>j)thenexitrepeatendifsetitoi+integerSizesetjtoj-integerSizeendrepeat-- Partitioning.-- Swap any stored pivot instances into the slots next to the crossed indices-- and advance the indices to exclude the pivots from the rest of the sort.repeatwithpfromltopLeftbyintegerSizeif(j>pLeft)thenwrite(readfReffromjforintegerSizeasinteger)tofRefasintegerstartingatpwritepivottofRefasintegerstartingatjsetjtoj-integerSizeelse-- Don't bother swapping where store and target slots overlap.setjtop-integerSizeexitrepeatendifendrepeatrepeatwithpfromrtopRightby-integerSizeif(i<pRight)thenwrite(readfReffromiforintegerSizeasinteger)tofRefasintegerstartingatpwritepivottofRefasintegerstartingatisetitoi+integerSizeelsesetitop+integerSizeexitrepeatendifendrepeat-- Where the new partitions are short enough, sort them in memory with a non-recursive sort.-- Otherwise subpartition the shorter one recursively, then the longer 
iteratively.setleftDifftoj-lsetrightDifftor-iif(leftDiff<rightDiff)thenset{shorterDiff,ls,rs,longerDiff,l}to{leftDiff,l,j,rightDiff,i}elseset{shorterDiff,ls,rs,longerDiff,r}to{rightDiff,i,r,leftDiff,j}endifif(shorterDiff<maxInternalSortSize)thenif(rs>ls)thensortInMemory(ls,rs)elseqsrt(ls,rs)endifif(longerDiff<maxInternalSortSize)thenif(r>l)thensortInMemory(l,r)exitrepeat-- … and return from the handler.endif-- Otherwise go round again to handle the longer partition.endrepeat-- Tail call elimination.endqsrt(* Return the next integer from the left buffer list, setting up or replacing the list as necessary. *)onnextLeftInteger(l,r)setatoa+1if(a>leftListLength)then-- The existing left list has been used up or doesn't yet exist.setleftEndBytetoleftEndByte+maxHalfChunkSize-- Derive a new left list from the next half-chunk of data — unless any of this is already-- covered by the other list, in which case replace both lists with a single one.if(leftEndByte<rightStartByte)thensetleftListto(readfReffromiformaxHalfChunkSizeasinteger)aslistsetato1setleftListLengthto(countleftList)elsegoToSingleList(l,r)setbtob+1endifendifreturnitemaofmyleftListendnextLeftInteger(* Return the next integer from the right buffer list, simile. *)onnextRightInteger(l,r)setbtob-1if(b<1)thensetrightStartBytetorightStartByte-maxHalfChunkSizeif(rightStartByte>leftEndByte)thensetrightListto(readfReffromrightStartByteformaxHalfChunkSizeasinteger)aslistsetbto(countrightList)elsegoToSingleList(l,r)endifendifreturnitembofmyrightListendnextRightInteger(* Set up a single buffer list for use in the closing stage of a partitioning repeat. 
*)ongoToSingleList(l,r)-- The range to read from the file is from bytes i to (j + integerSize - 1)-- PLUS an integer either side, if these are within the range being partitioned,-- to help ensure that the partitioning repeat stops in the right place.if(i>l)thensetreadStarttoi-integerSizesetato2elsesetreadStarttoisetato1endifif(j<r)thensetreadEndtoj+twiceIntegerSize-1elsesetreadEndtoj+integerSize-1endif-- Ditch the existing right list.setrightListtomissing value-- Read the integers from the calculated range and set both list properties to the same result instance.setleftListto(readfReffromreadStarttoreadEndasinteger)aslistsetrightListtoleftList-- Set the other relevant properties.setbto(countrightList)if(j<r)thensetbtob-1setleftListLengthtobsetsingleListtotrueendgoToSingleList(* Read integers from a given range in the file, sort them in memory, and write them back to the same range. *)onsortInMemory(l,r)setrightListtomissing valuesetleftListto(readfReffromlto(r+integerSize-1)asinteger)aslisttellinternalSortertosort(leftList,1,-1)readfReffromlfor0-- Set the file handle's position pointer.repeatwithxfrom1to(countleftList)write(itemxofmyleftList)asintegertofRefendrepeatendsortInMemoryendscript(* Main handler code. Sets up and starts the sort. *)-- Check the input.trysettheFiletotheFileasaliasonerrordisplay dialog"The specified file doesn't exist."buttons{"Stop"}defaultbutton1cancelbutton1withiconstopendtrysetfileSizeto(get eoftheFile)if(fileSizeis0)thendisplay dialog"The file is empty."buttons{"Stop"}defaultbutton1cancelbutton1withiconstopelseif(fileSizemodintegerSize>0)thendisplay dialog¬"The file size isn't an exact number of integers."buttons{"Stop"}defaultbutton1cancelbutton1withiconstopendif-- Get the user to specify the destination file. 
Can be the original.setoldPathtotheFileastextsetastidtoAppleScript'stext item delimiterssetAppleScript'stext item delimitersto":"telloldPathtoset{rootPath,oldName}to{text1thrutextitem-2,textitem-1}setAppleScript'stext item delimitersto"."telloldNametosetnewNametotext1thrutextitem-2&" (sorted copy)."&textitem-1setAppleScript'stext item delimiterstoastidsetnewFileto¬(choose file namewithprompt"Save the sorted result as…"defaultnamenewNamedefaultlocation(rootPathasalias))-- If the original wasn't chosen, copy the data to the new location.-- There are simpler ways to copy a file, but this still practically instantaneous-- and definitely only involves maxChunkSize bytes at a time.if(newFileastextis notoldPath)thensetreadRefto(open for accesstheFile)setwriteRefto(open for accessnewFilewithwritepermission)tryset eofwriteRefto0repeatwithifrom1tofileSizebymaxChunkSizesetdto(readreadRefformaxChunkSizeasdata)writedasdatatowriteRefendrepeatclose accesswriteRefclose accessreadRefonerrorerrMsgclose accesswriteRefclose accessreadRefdisplay dialogerrMsgbuttons{"Stop"}defaultbutton1cancelbutton1withiconstopendtrysettheFiletonewFileendif-- Open the target file with write permission and perform the sort.seto'sfRefto(open for accesstheFilewithwritepermission)try-- Handler parameters: first-byte indices of the first and last integers in the file.if(fileSize>maxChunkSize)thentellotoqsrt(1,fileSize+1-integerSize)elsetellotosortInMemory(1,fileSize+1-integerSize)endifclose accesso'sfRefonerrorerrMsgclose accesso'sfRefdisplay dialogerrMsgbuttons{"Stop"}defaultbutton1cancelbutton1withiconstopendtry-- Return the specifier for the sorted file.returntheFileendexternalSortsettheFileto(path todesktopastext)&"Test.dat"setsortedFiletoexternalSort(theFile)
Sort a file of integers using external merge sort.
The input file is read into a single 32-byte buffer (8 ints); the 8 ints are sorted and then written to a temp file on disk as strings.
This continues until all the numbers from the input file have been read and distributed into files of 8 integer strings.
The last file may hold fewer than 8 integers.
All the temp files are opened as ifstreams held in a dynamically allocated array.
A min-heap reads ints from each file stream in turn, and outputs its contents to the output file.
(The heap generally only fills to 6 items before it writes its contents to the output file.)
All sorted streams are merged in this way into an external output file, merged.txt.
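The min-heap merge just described can be sketched in Python: the heap holds (value, stream-index) pairs, and whenever the minimum is popped, the heap is refilled from the stream that supplied it. (The iterator-based streams and list-based output here are illustrative simplifications of the file handling.)

```python
import heapq

def merge_streams(streams, out):
    """k-way merge: each stream is an iterator of ints in ascending order."""
    heap = []
    # Seed the heap with the first value from each non-empty stream.
    for index, stream in enumerate(streams):
        first = next(stream, None)
        if first is not None:
            heapq.heappush(heap, (first, index))
    while heap:
        value, index = heapq.heappop(heap)   # smallest value across all streams
        out.append(value)
        refill = next(streams[index], None)  # refill from the same stream
        if refill is not None:
            heapq.heappush(heap, (refill, index))
```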
/* ExternalSort.cpp */
#include <iostream>
#include <fstream>
#include <queue>
#include <string>
#include <vector>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

/* function signatures */
int main(int argc, char* argv[]);
void write_vals(int* const, const size_t, const size_t);
std::string mergeFiles(size_t, const std::string&);

/* Comparison object compares the first items of 2 std::pairs of ints:
   true if the first item is larger or the same. MinHeap sorts with this.
   It gets called a lot, so the simpler the better. The STL api stipulates
   a boolean predicate. A "functor" object defines a function call
   operator () that returns a value. */
struct Compare {
    // compare 2 pairs by first element
    bool operator()(std::pair<int, int>& p1, std::pair<int, int>& p2) {
        return p1.first >= p2.first;  // Ascending order
    }
};

/* aliases */
using ipair = std::pair<int, int>;
using pairvector = std::vector<ipair>;
using MinHeap = std::priority_queue<ipair, pairvector, Compare>;

/* constants */
const size_t memsize = 32;                       // 32 bytes
const size_t chunksize = memsize / sizeof(int);  // 8 ints
const std::string tmp_prefix{"tmp_out_"};        // tmpfile prefix
const std::string tmp_suffix{".txt"};            // tmpfile suffix
const std::string merged_file{"merged.txt"};     // output file

/* functions */

// write int array to file
void write_vals(int* const values, const size_t size, const size_t chunk) {
    // tmp_out_1.txt, tmp_out_2.txt ...
    std::string output_file = (tmp_prefix + std::to_string(chunk) + tmp_suffix);
    std::ofstream ofs(output_file.c_str());  // output file
    for (size_t i = 0; i < size; i++) ofs << values[i] << '\t';
    ofs << '\n';
    ofs.close();
}

/* merge all external sorted files into one output file
   (same size as original input file) */
std::string mergeFiles(size_t chunks, const std::string& merge_file) {
    std::ofstream ofs(merge_file.c_str());
    MinHeap minHeap;
    // array of ifstreams
    std::ifstream* ifs_tempfiles = new std::ifstream[chunks];
    for (size_t i = 1; i <= chunks; i++) {
        int topval = 0;
        // generate a unique name for each temp file (tmp_out_1.txt, tmp_out_2.txt ...)
        std::string sorted_file = (tmp_prefix + std::to_string(i) + tmp_suffix);
        // open an input file stream object for each name
        ifs_tempfiles[i - 1].open(sorted_file.c_str());  // bind to tmp_out_{i}.txt
        // get val from temp file
        if (ifs_tempfiles[i - 1].is_open()) {
            ifs_tempfiles[i - 1] >> topval;  // first value in the file (min)
            ipair top(topval, (i - 1));      // 2nd value is tempfile number
            minHeap.push(top);               // minHeap autosorts
        }
    }
    while (minHeap.size() > 0) {
        int next_val = 0;
        ipair min_pair = minHeap.top();  // get min
        minHeap.pop();
        ofs << min_pair.first << ' ';    // write value to file
        std::flush(ofs);
        if (ifs_tempfiles[min_pair.second] >> next_val) {
            ipair np(next_val, min_pair.second);
            minHeap.push(np);
        }
    }
    // close open files
    for (size_t i = 1; i <= chunks; i++) {
        ifs_tempfiles[i - 1].close();
    }
    ofs << '\n';
    ofs.close();
    delete[] ifs_tempfiles;  // free memory
    return merged_file;      // string
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: ExternalSort <filename>\n";
        return 1;
    }
    // open input file for reading
    std::ifstream ifs(argv[1]);
    if (ifs.fail()) {
        std::cerr << "error opening " << argv[1] << "\n";
        return 2;
    }
    // temp array for input (small) (32 bytes -- 8 ints)
    int* inputValues = new int[chunksize];
    int chunk = 1;     // counter (which chunk)
    int val = 0;       // int for reading
    size_t count = 0;  // count reads
    bool done = false;
    std::cout << "internal buffer is " << memsize << " bytes" << "\n";
    // read chunksize values at a time from the input file
    while (ifs >> val) {
        done = false;
        inputValues[count] = val;
        count++;
        if (count == chunksize) {
            std::sort(inputValues, inputValues + count);
            write_vals(inputValues, count, chunk);  // output vals to chunk file
            chunk++;
            count = 0;
            done = true;
        }
    }
    if (!done) {  // one more file
        std::sort(inputValues, inputValues + count);
        write_vals(inputValues, count, chunk);  // output vals to chunk file
    } else {
        chunk--;  // fix overshoot
    }
    ifs.close();           // done with original input file
    delete[] inputValues;  // free dynamically allocated memory
    // perform external mergesort on sorted temp files, if any.
    if (chunk == 0)
        std::cout << "no data found\n";
    else
        std::cout << "Sorted output is in file: "
                  << mergeFiles(chunk, merged_file) << "\n";
    return EXIT_SUCCESS;
}

/* compile: clang++ -std=c++11 -Wall -pedantic -o ExternalSort ExternalSort.cpp */
/* input file: integers -- one per line for simplicity */
input:
2 65 76 88 55 22 35 11 76 99 7 111 4 55 34 63 22 13 45 9 0 112 123 345 456 8 654 678 999 888 10 3 555 534 236 143 860 648 1 627 711 223 332 5 443 445
tmp_out_1.txt
2 11 22 35 55 65 76 88
tmp_out_2.txt
4 7 34 55 63 76 99 111
tmp_out_3.txt
0 9 13 22 45 112 123 345
tmp_out_4.txt
3 8 10 456 654 678 888 999
tmp_out_5.txt
1 143 236 534 555 627 648 860
tmp_out_6.txt
5 223 332 443 445 711
merged.txt:
0 1 2 3 4 5 7 8 9 10 11 13 22 22 34 35 45 55 55 63 65 76 76 88 99 111 112 123 143 223 236 332 345 443 445 456 534 555 627 648 654 678 711 860 888 999
This entry illustrates how to use DuckDB's COPY command to sort the records of a CSV file much larger than will fit into memory. As is generally the case, DuckDB will "spill" data to a disk file in a specific location provided there is enough disk space available at that location.
The CSV file used for this task is a well-known file of Japanese trade statistics from 1988 to 2019. The file, which is about 4.5 GB in size, is available at https://www.kaggle.com/datasets/zanjibar/100-million-data-csv. DuckDB can read files over the Internet directly, but here we'll suppose it is available in the pwd as custom_1988_2020.csv.
The CSV file has 113,607,322 records, each with 8 columns. For this exercise, we'll sort it by the values in the last column, which is an amount in Yen.
To force the spilling of data, we'll set the `memory_limit` option to '1GB', and we'll set `temp_directory` both to ensure there is enough disk space and so that, during execution of the program, the presence of the temporary files can be verified.
To verify the limit has not been exceeded, we'll use /usr/bin/time -lp
SET memory_limit = '1GB';
SET temp_directory = '/tmp/duckdb/tmp/';
COPY (FROM 'custom_1988_2020.csv' ORDER BY column7)
  TO '/tmp/custom_1988_2020.sorted.csv';
The peak memory footprint was 975,155,200 bytes; to verify that the output file has been sorted, I ran:
cut -d, -f8 < /tmp/custom_1988_2020.sorted.csv | sort -nc
Type MinHeapNode
    element As Integer
    index As Integer
End Type

Type MinHeap
    nodes(Any) As MinHeapNode
End Type

Sub minHeapify(heap() As MinHeapNode, Byval i As Integer)
    Dim As Integer l, r, smallest, heapSize = Ubound(heap) + 1
    Dim As MinHeapNode tmp
    Do
        l = (2 * i + 1)
        r = (2 * i + 2)
        smallest = i
        If l < heapSize Andalso heap(l).element < heap(i).element Then smallest = l
        If r < heapSize Andalso heap(r).element < heap(smallest).element Then smallest = r
        If smallest = i Then Exit Do
        tmp = heap(i)
        heap(i) = heap(smallest)
        heap(smallest) = tmp
        i = smallest
    Loop
End Sub

Sub buildMinHeap(heap() As MinHeapNode)
    For i As Integer = (Ubound(heap) - 1) \ 2 To 0 Step -1
        minHeapify(heap(), i)
    Next
End Sub

Sub merge(arr() As Integer, Byval l As Integer, Byval m As Integer, Byval r As Integer)
    Static As Integer tl(1023), tr(1023)  'Static for better performance
    Dim As Integer n1 = m - l + 1
    Dim As Integer n2 = r - m
    Dim As Integer i, j, k
    For i = 0 To n1 - 1 : tl(i) = arr(l + i) : Next
    For j = 0 To n2 - 1 : tr(j) = arr(m + 1 + j) : Next
    i = 0 : j = 0 : k = l
    While i < n1 Andalso j < n2
        If tl(i) <= tr(j) Then
            arr(k) = tl(i) : i += 1
        Else
            arr(k) = tr(j) : j += 1
        End If
        k += 1
    Wend
    While i < n1 : arr(k) = tl(i) : i += 1 : k += 1 : Wend
    While j < n2 : arr(k) = tr(j) : j += 1 : k += 1 : Wend
End Sub

Sub mergeSort(arr() As Integer, Byval l As Integer, Byval r As Integer)
    If l < r Then
        Dim As Integer m = l + (r - l) \ 2
        mergeSort(arr(), l, m)
        mergeSort(arr(), m + 1, r)
        merge(arr(), l, m, r)
    End If
End Sub

Sub mergeFiles(outputFile As String, Byval n As Integer, Byval k As Integer)
    Static As Integer inFiles(15)  'Assuming max k = 16
    Dim As MinHeapNode nodes(k - 1)
    Dim As Integer outFile, cnt, i
    For i = 0 To k - 1
        inFiles(i) = Freefile
        Open "es" & i For Input As #inFiles(i)
        Input #inFiles(i), nodes(i).element
        nodes(i).index = i
    Next
    For i = (k - 2) \ 2 To 0 Step -1 : minHeapify(nodes(), i) : Next
    outFile = Freefile
    Open outputFile For Output As #outFile
    Do
        Print #outFile, nodes(0).element;
        If Eof(inFiles(nodes(0).index)) Then
            nodes(0).element = &h7FFFFFFF
            cnt += 1
            If cnt = k Then Exit Do
        Else
            Input #inFiles(nodes(0).index), nodes(0).element
        End If
        minHeapify(nodes(), 0)
    Loop
    For i = 0 To k - 1 : Close #inFiles(i) : Next
    Close #outFile
End Sub

Sub createInitialRuns(inputFile As String, Byval runSize As Integer, Byval numWays As Integer)
    Static As Integer arr(1023)  'Static for better performance
    Dim As Integer inFile = Freefile
    Dim As Integer outFiles(numWays - 1)
    Dim As Integer i, j, nextOutputFile
    Open inputFile For Input As #inFile
    For i = 0 To numWays - 1
        outFiles(i) = Freefile
        Open "es" & i For Output As #outFiles(i)
    Next
    Do While Not Eof(inFile)
        For i = 0 To runSize - 1
            If Eof(inFile) Then Exit For
            Input #inFile, arr(i)
        Next
        If i > 0 Then
            mergeSort(arr(), 0, i - 1)
            For j = 0 To i - 1
                Print #outFiles(nextOutputFile), arr(j);
            Next
            nextOutputFile += 1
        End If
    Loop
    Close #inFile
    For i = 0 To numWays - 1 : Close #outFiles(i) : Next
End Sub

Sub externalSort(inputFile As String, outputFile As String, numWays As Integer, runSize As Integer)
    createInitialRuns(inputFile, runSize, numWays)
    mergeFiles(outputFile, runSize, numWays)
End Sub

'Main program
Randomize Timer
Dim As Integer numWays = 4
Dim As Integer runSize = 10
Dim As String inputFile = "i:\external_sort_input.txt"
Dim As String outputFile = "i:\external_sort_output.txt"

'Create input file with random numbers
Dim As Integer ff = Freefile
Open inputFile For Output As #ff
For i As Integer = 1 To numWays * runSize
    Print #ff, Int(Rnd * &h7FFFFFFF);
Next
Close #ff

'Sort
externalSort(inputFile, outputFile, numWays, runSize)

'Clean up temporary files
For i As Integer = 0 To numWays - 1 : Kill "es" & i : Next
Sleep
Sample run:
Contents of external_sort_input.txt:
771153728 440569039 1143474915 640249120 2085499789 443596200 105984682 1927843875 263537832 976541783 1608099832 1814237183 1505334772 877843328 1826156632 280344412 662866168 1154720301 385337647 712228972 437466640 344847458 992007931 1831227913 1129470434 702714149 1794352639 1306396046 187118382 1460500964 624660213 198693642 308483284 575723384 314802427 127277568 234618966 2085409103 572022291 771492143
Contents of external_sort_output.txt:
105984682 127277568 187118382 198693642 234618966 263537832 280344412 308483284 314802427 344847458 385337647 437466640 440569039 443596200 572022291 575723384 624660213 640249120 662866168 702714149 712228972 771153728 771492143 877843328 976541783 992007931 1129470434 1143474915 1154720301 1306396046 1460500964 1505334772 1608099832 1794352639 1814237183 1826156632 1831227913 1927843875 2085409103 2085499789
This is a translation of the C++ code here, which implements external sorting using a merge sort. In the interests of brevity, the extensive comments in the C++ version have been largely omitted.
A small test file consisting of random integers has been generated and sorted to demonstrate that the approach works.
package main

import (
    "fmt"
    "io"
    "log"
    "math"
    "math/rand"
    "os"
    "time"
)

type MinHeapNode struct {
    element, index int
}

type MinHeap struct {
    nodes []MinHeapNode
}

func left(i int) int  { return (2*i + 1) }
func right(i int) int { return (2*i + 2) }

func newMinHeap(nodes []MinHeapNode) *MinHeap {
    mh := new(MinHeap)
    mh.nodes = nodes
    for i := (len(nodes) - 1) / 2; i >= 0; i-- {
        mh.minHeapify(i)
    }
    return mh
}

func (mh *MinHeap) getMin() MinHeapNode {
    return mh.nodes[0]
}

func (mh *MinHeap) replaceMin(x MinHeapNode) {
    mh.nodes[0] = x
    mh.minHeapify(0)
}

func (mh *MinHeap) minHeapify(i int) {
    l, r := left(i), right(i)
    smallest := i
    heapSize := len(mh.nodes)
    if l < heapSize && mh.nodes[l].element < mh.nodes[i].element {
        smallest = l
    }
    if r < heapSize && mh.nodes[r].element < mh.nodes[smallest].element {
        smallest = r
    }
    if smallest != i {
        mh.nodes[i], mh.nodes[smallest] = mh.nodes[smallest], mh.nodes[i]
        mh.minHeapify(smallest)
    }
}

func merge(arr []int, l, m, r int) {
    n1, n2 := m-l+1, r-m
    tl := make([]int, n1)
    tr := make([]int, n2)
    copy(tl, arr[l:])
    copy(tr, arr[m+1:])
    i, j, k := 0, 0, l
    for i < n1 && j < n2 {
        if tl[i] <= tr[j] {
            arr[k] = tl[i]
            k++
            i++
        } else {
            arr[k] = tr[j]
            k++
            j++
        }
    }
    for i < n1 {
        arr[k] = tl[i]
        k++
        i++
    }
    for j < n2 {
        arr[k] = tr[j]
        k++
        j++
    }
}

func mergeSort(arr []int, l, r int) {
    if l < r {
        m := l + (r-l)/2
        mergeSort(arr, l, m)
        mergeSort(arr, m+1, r)
        merge(arr, l, m, r)
    }
}

// Merge k sorted files: es0, es1 etc.
func mergeFiles(outputFile string, n, k int) {
    in := make([]*os.File, k)
    var err error
    for i := 0; i < k; i++ {
        fileName := fmt.Sprintf("es%d", i)
        in[i], err = os.Open(fileName)
        check(err)
    }
    out, err := os.Create(outputFile)
    check(err)
    nodes := make([]MinHeapNode, k)
    i := 0
    for ; i < k; i++ {
        _, err = fmt.Fscanf(in[i], "%d", &nodes[i].element)
        if err == io.EOF {
            break
        }
        check(err)
        nodes[i].index = i
    }
    hp := newMinHeap(nodes[:i])
    count := 0
    for count != i {
        root := hp.getMin()
        fmt.Fprintf(out, "%d ", root.element)
        _, err = fmt.Fscanf(in[root.index], "%d", &root.element)
        if err == io.EOF {
            root.element = math.MaxInt32
            count++
        } else {
            check(err)
        }
        hp.replaceMin(root)
    }
    for j := 0; j < k; j++ {
        in[j].Close()
    }
    out.Close()
}

func check(err error) {
    if err != nil {
        log.Fatal(err)
    }
}

// Create initial runs, divide them evenly amongst the output files
// and then merge-sort them.
func createInitialRuns(inputFile string, runSize, numWays int) {
    in, err := os.Open(inputFile)
    out := make([]*os.File, numWays)
    for i := 0; i < numWays; i++ {
        fileName := fmt.Sprintf("es%d", i) // es0, es1 etc.
        out[i], err = os.Create(fileName)
        check(err)
    }
    arr := make([]int, runSize)
    moreInput := true
    nextOutputFile := 0
    var i int
    for moreInput {
        for i = 0; i < runSize; i++ {
            _, err := fmt.Fscanf(in, "%d", &arr[i])
            if err == io.EOF {
                moreInput = false
                break
            }
            check(err)
        }
        mergeSort(arr, 0, i-1)
        for j := 0; j < i; j++ {
            fmt.Fprintf(out[nextOutputFile], "%d ", arr[j])
        }
        nextOutputFile++
    }
    for j := 0; j < numWays; j++ {
        out[j].Close()
    }
    in.Close()
}

func externalSort(inputFile, outputFile string, numWays, runSize int) {
    createInitialRuns(inputFile, runSize, numWays)
    mergeFiles(outputFile, runSize, numWays)
}

func main() {
    // Create a small test file of 40 random ints and split it into 4 files
    // of 10 integers each.
    numWays := 4
    runSize := 10
    inputFile := "input.txt"
    outputFile := "output.txt"
    in, err := os.Create(inputFile)
    check(err)
    rand.Seed(time.Now().UnixNano())
    for i := 0; i < numWays*runSize; i++ {
        fmt.Fprintf(in, "%d ", rand.Intn(math.MaxInt32))
    }
    in.Close()
    externalSort(inputFile, outputFile, numWays, runSize)
    // remove temporary files
    for i := 0; i < numWays; i++ {
        fileName := fmt.Sprintf("es%d", i)
        err = os.Remove(fileName)
        check(err)
    }
}
Contents of input.txt:
921996447 760852351 223421434 1245608832 745990119 1414811249 1947335121 762344474 588429291 993452626 2592794 491133923 1275871423 1152039534 649892156 278215570 595760601 1878223040 1267371451 2097209826 1409628494 1147072290 309824251 108477605 1705270413 1821354697 1703557665 473708588 110138202 1292465428 946557804 148800949 1471244316 1508853596 1306802817 1016358698 1661284048 527644251 546155704 337874167
Contents of output.txt:
2592794 108477605 110138202 148800949 223421434 278215570 309824251 337874167 473708588 491133923 527644251 546155704 588429291 595760601 649892156 745990119 760852351 762344474 921996447 946557804 993452626 1016358698 1147072290 1152039534 1245608832 1267371451 1275871423 1292465428 1306802817 1409628494 1414811249 1471244316 1508853596 1661284048 1703557665 1705270413 1821354697 1878223040 1947335121 2097209826
Untested on a memory mapped file.
NB. Apply an in-place sorting algorithm to a memory mapped file
NB. in-place sort is translation of in-place python quicksort.
require 'jmf'
JCHAR map_jmf_ 'DATA';'file.huge'
NB. The noun DATA now refers to the memory mapped file.
NB. Use: quicksort DATA

quicksort=: 3 : 'qsinternal 0 , <:@:# ARRAY=: y'  NB. ARRAY is global

qsinternal=: 3 : 0
  'start stop'=. y
  if. 0 < stop-start do.
    'left right pivot'=. start,stop,start{ARRAY   NB. pivot, left, right = array[start], start, stop
    while. left <: right do.                      NB. while left <= right:
      while. pivot > left{ARRAY do.               NB. while array[left] < pivot:
        left=. >: left
      end.
      while. pivot < right{ARRAY do.              NB. while array[right] > pivot:
        right=. <: right                          NB. right -= 1
      end.
      if. left <: right do.                       NB. if left <= right:
        NB. mapped files work by reference, assignment not required, but for testing.
        ARRAY=: (left,right){`(|.@:[)`]}ARRAY     NB. array[left], array[right] = array[right], array[left]
        left=. >: left                            NB. left += 1
        right=. <: right                          NB. right -= 1
      end.
    end.
    qsinternal start,right                        NB. _quicksort(array, start, right)
    qsinternal left,stop                          NB. _quicksort(array, left, stop)
  end.
  i. 0 0                                          NB. verbs return the final noun
)
Demonstration that the sorting works:
   quicksort ?~ 10
   ARRAY
0 1 2 3 4 5 6 7 8 9
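The same idea — sorting a binary file in place through a memory map — can be sketched in Python with the standard mmap module. The quicksort mirrors the one translated above; the assumption here is a file of native-endian signed 64-bit integers, and the helper names are illustrative.

```python
import mmap
import struct

def sort_mapped_file(path, int_size=8):
    """In-place quicksort over a memory-mapped file of native-endian int64s."""
    fmt = "q"  # signed 64-bit integer, native endianness

    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)  # map the whole file, read-write
        n = len(mm) // int_size

        def get(i):
            return struct.unpack_from(fmt, mm, i * int_size)[0]

        def put(i, v):
            struct.pack_into(fmt, mm, i * int_size, v)

        def qsort(start, stop):
            if start >= stop:
                return
            pivot, left, right = get(start), start, stop
            while left <= right:
                while get(left) < pivot:
                    left += 1
                while get(right) > pivot:
                    right -= 1
                if left <= right:
                    vl, vr = get(left), get(right)
                    put(left, vr)   # swap through the mapping:
                    put(right, vl)  # writes go straight to the file
                    left += 1
                    right -= 1
            qsort(start, right)
            qsort(left, stop)

        qsort(0, n - 1)
        mm.flush()  # push dirty pages back to disk
        mm.close()
```

The OS pages the file in and out on demand, so only the touched pages need to be resident at any moment.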
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ThreadLocalRandom;

public final class ExternalSort {

    public static void main(String[] args) throws IOException {
        Path inputPath = Path.of("./input.txt");
        Path outputPath = Path.of("./output.txt");

        // Create a "large" input file of a bounded random size
        ThreadLocalRandom random = ThreadLocalRandom.current();
        try ( BufferedWriter writer = Files.newBufferedWriter(inputPath, StandardCharsets.UTF_8) ) {
            for ( int i = 0; i < random.nextInt(80, 100); i++ ) {
                writer.write(random.nextInt(0, 100) + "\n");
            }
        }

        final int inMemoryFileSize = 20;
        final long tempFileCount = Files.lines(inputPath).count() / inMemoryFileSize + 1;
        createTemporaryFiles(inputPath, inMemoryFileSize);
        mergeFiles(outputPath, inMemoryFileSize, tempFileCount);
    }

    private static void createTemporaryFiles(Path inputPath, int inMemoryFileSize) throws IOException {
        try ( BufferedReader inputFileReader = Files.newBufferedReader(inputPath) ) {
            List<BufferedWriter> outputFileWriters = new ArrayList<BufferedWriter>();
            int tempFileIndex = 0;
            boolean inputAvailable = true;
            while ( inputAvailable ) {
                List<Integer> data = new ArrayList<Integer>();
                for ( int i = 0; i < inMemoryFileSize && inputAvailable; i++ ) {
                    String line = inputFileReader.readLine();
                    if ( line != null ) {
                        data.add(Integer.parseInt(line));
                    } else {
                        inputAvailable = false;
                    }
                }
                Collections.sort(data);

                // Write the elements to the appropriate temporary output file
                Path tempPath = Path.of(String.valueOf(tempFileIndex));
                outputFileWriters.addLast(Files.newBufferedWriter(tempPath, StandardCharsets.UTF_8));
                for ( int element : data ) {
                    outputFileWriters.get(tempFileIndex).write(element + "\n");
                }
                outputFileWriters.get(tempFileIndex).close();
                tempFileIndex += 1;
            }
        }
    }

    private static void mergeFiles(Path outputPath, int inMemoryFileSize, long tempFileCount) throws IOException {
        try ( BufferedWriter writer = Files.newBufferedWriter(outputPath, StandardCharsets.UTF_8) ) {
            List<BufferedReader> readers = new ArrayList<BufferedReader>();
            Comparator<Node> custom = Comparator.comparingInt( a -> a.element );
            PriorityQueue<Node> priorityQueue = new PriorityQueue<Node>(custom);

            // Read the temporary files and place an element from each file into the priority queue
            for ( int i = 0; i < tempFileCount; i++ ) {
                readers.addLast(Files.newBufferedReader(Path.of(String.valueOf(i))));
                String element = readers.get(i).readLine();
                if ( element != null ) {
                    priorityQueue.add( new Node(Integer.parseInt(element), i) );
                }
            }

            while ( ! priorityQueue.isEmpty() ) {
                // Remove the minimum node from the priority queue and store it in the output file
                Node node = priorityQueue.poll();
                writer.write(node.element + "\n");

                // Replace the node removed from the priority queue
                String element = readers.get(node.fileIndex).readLine();
                if ( element != null ) {
                    priorityQueue.add( new Node(Integer.parseInt(element), node.fileIndex) );
                }
            }

            // Delete temporary files
            for ( int i = 0; i < tempFileCount; i++ ) {
                Files.delete(Path.of(String.valueOf(i)));
            }
        }
    }

    private static record Node(int element, int fileIndex) { }

}
using Mmap   # needed for Mmap.mmap

intfile = open("/tmp/mmap.bin", "r+")
arr = Mmap.mmap(intfile, Vector{Int64}, div(stat(intfile).size, 8))  # Int64 is 8 bytes
sort!(arr)
import algorithm, heapqueue, os, random, sequtils, strformat, strutils

proc sortFiles(filenames: seq[string]) =
  for filename in filenames:
    var lines = filename.readFile()
    lines.stripLineEnd()    # Ignore last line feed, if any.
    var sortedLines = sorted(lines.splitLines())
    echo &"""{filename} => {sortedLines.join(", ")}"""
    filename.writeFile(sortedLines.join("\n"))

proc mergeFiles(outfile: File; filenames: seq[string]) =
  var queue: HeapQueue[(string, File)]
  for filename in filenames:
    let f = open(filename)
    queue.push((f.readLine(), f))
  while queue.len > 0:
    let (s, infile) = queue.pop()
    outfile.write s & '\n'
    if infile.endOfFile:
      infile.close()
    else:
      queue.push((infile.readLine(), infile))

when isMainModule:
  const WriteToFile = false   # Compile time switch.
  randomize()
  let nf = rand(2..4)   # Number of files.
  let lp = 3            # Lines per file.
  var filenames: seq[string]
  var lines = toSeq(1..nf*lp)
  lines.shuffle()
  for i in 1..nf:
    let filename = &"file{i}.txt"
    filenames.add filename
    let f = open(filename, fmWrite)
    for l in 1..lp:
      f.write &"Line {lines[^l]:2}\n"
    lines.setLen(lines.len - lp)
    f.close()
  echo &"sorting {nf * lp} lines split over {nf} files"
  sortFiles(filenames)
  when WriteToFile:
    let f = open("results.txt", fmWrite)
    mergeFiles(f, filenames)
    f.close()
  else:
    mergeFiles(stdout, filenames)
  for filename in filenames:
    removeFile(filename)
sorting 12 lines split over 4 files
file1.txt => Line 3, Line 4, Line 10
file2.txt => Line 1, Line 6, Line 12
file3.txt => Line 2, Line 5, Line 8
file4.txt => Line 7, Line 9, Line 11
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
Line 11
Line 12
The task is simulated by reading from the 'DATA' handle and using a tiny record limit. As written, it works for any numeric input, but any kind of customized sorting could be defined.
use strict;
use warnings;
use IO::File;   # provides IO::File->new_tmpfile, used below

my $max = 4;    # records per merge file
my (@chunk, @tempf);

sub mysort ($$) { return $_[0] <=> $_[1] }

sub store {
    my ($a) = @_;
    my $f = IO::File->new_tmpfile;   # self-deleting after program exit
    print $f sort mysort @$a;
    seek $f, 0, 0 or warn "Oops: $!";
    push(@tempf, { fh => $f, queued => scalar <$f> });
}

# read input and create sorted temporary files
while (<DATA>) {
    push @chunk, $_;
    store(\@chunk), @chunk = () if @chunk == $max;
}
store(\@chunk) if @chunk;

# merge everything
while (1) {
    my ($lowest) = (sort { mysort($a->{queued}, $b->{queued}) }
                    grep(defined $_->{queued}, @tempf))[0];
    last unless $lowest->{queued};
    print $lowest->{queued};
    $lowest->{queued} = $lowest->{fh}->getline();
}

__DATA__
432
345
321
543
987
456
678
123
765
567
876
654
789
234
123
234
321
345
432
456
543
567
654
678
765
789
876
987
Slight variation on Stream_Merge.
without js -- file i/o
include builtins/pqueue.e
include builtins/pfile.e -- write_lines() - not [yet] documented

procedure add(integer fn, pq)
    object line = gets(fn)
    if line=-1 then
        close(fn)
    else
        pq_add({fn,line},pq)
    end if
end procedure

procedure sort_files(sequence filenames)
    for i=1 to length(filenames) do
        sequence lines = get_text(filenames[i],GT_LF_STRIPPED),
                 sorted = sort(lines)
        printf(1,"%s:%v => %v\n",{filenames[i],lines,sorted})
        if write_lines(filenames[i],sorted)!=1 then ?9/0 end if
    end for
end procedure

procedure merge_files(integer outfn, sequence filenames)
    integer pq = pq_new()
    for i=1 to length(filenames) do
        add(open(filenames[i],"r"),pq)
    end for
    while not pq_empty(pq) do
        {integer fn, string line} = pq_pop(pq)
        puts(outfn,line)
        add(fn,pq)
    end while
    pq_destroy(pq)
end procedure

procedure test()
    integer nf = rand(5),   -- number of files
            lp = 3          -- lines per file
    sequence filenames = {},
             lines = shuffle(tagset(nf*lp))
    for i=1 to nf do
        string filename = sprintf("file%d.txt",i)
        filenames = append(filenames,filename)
        integer fn = open(filename,"w")
        for l=1 to lp do
            printf(fn,"Line %02d\n",lines[l])
        end for
        lines = lines[lp+1..$]
        close(fn)
    end for
    printf(1,"sorting %d lines split over %d files\n",{nf*lp,nf})
    sort_files(filenames)
    integer outfn = 1 -- or open("results.txt","w")
    merge_files(outfn,filenames)
    -- close(outfn)
    for i=1 to nf do
        {} = delete_file(filenames[i])
    end for
end procedure
test()
sorting 9 lines split over 3 files
file1.txt:{"Line 04","Line 01","Line 09"} => {"Line 01","Line 04","Line 09"}
file2.txt:{"Line 06","Line 07","Line 02"} => {"Line 02","Line 06","Line 07"}
file3.txt:{"Line 08","Line 03","Line 05"} => {"Line 03","Line 05","Line 08"}
Line 01
Line 02
Line 03
Line 04
Line 05
Line 06
Line 07
Line 08
Line 09

A technique demonstrated with short string character data.
#! /usr/bin/python3

'''
    $ # example session in bash
    $ python3 external_sort.py
    expect 123456789
    memory size 1 passed
    memory size 2 passed
    memory size 3 passed
    memory size 4 passed
    memory size 5 passed
    memory size 6 passed
    memory size 7 passed
    memory size 8 passed
    memory size 9 passed
    memory size 10 passed
    memory size 11 passed
'''

import io

def sort_large_file(n: int, source: open, sink: open, file_opener=open) -> None:
    '''
    approach:
        break the source into files of size n
        sort each of these files
        merge these onto the sink
    '''
    # store sorted chunks into files of size n
    mergers = []
    while True:
        text = list(source.read(n))
        if not len(text):
            break
        text.sort()
        merge_me = file_opener()
        merge_me.write(''.join(text))
        mergers.append(merge_me)
        merge_me.seek(0)
    # merge onto sink
    stack_tops = [f.read(1) for f in mergers]
    while stack_tops:
        c = min(stack_tops)
        sink.write(c)
        i = stack_tops.index(c)
        t = mergers[i].read(1)
        if t:
            stack_tops[i] = t
        else:
            del stack_tops[i]
            mergers[i].close()
            del mergers[i]  # __del__ method of file_opener should delete the file

def main():
    '''
    test case: sort 6,7,8,9,2,5,3,4,1 with several memory sizes
    '''
    # load test case into a file-like object
    input_file_too_large_for_memory = io.StringIO('678925341')
    # generate the expected output
    t = list(input_file_too_large_for_memory.read())
    t.sort()
    expect = ''.join(t)
    print('expect', expect)
    # attempt to sort with several memory sizes
    for memory_size in range(1, 12):
        input_file_too_large_for_memory.seek(0)
        output_file_too_large_for_memory = io.StringIO()
        sort_large_file(memory_size,
                        input_file_too_large_for_memory,
                        output_file_too_large_for_memory,
                        io.StringIO)
        output_file_too_large_for_memory.seek(0)
        assert output_file_too_large_for_memory.read() == expect
        print('memory size {} passed'.format(memory_size))

if __name__ == '__main__':
    example = main
    example()
(formerly Perl 6) Borrowing from Stream_Merge here. Temporary files are automatically deleted when the program is done, so no explicit clean-up is required.
use File::Temp;

sub merge_streams ( @streams ) {
    my @s = @streams.map({ hash( STREAM => $_, HEAD => .get ) })
                    .grep({ .<HEAD>.defined });
    return gather while @s {
        my $h = @s.min: +*.<HEAD>;
        take $h<HEAD>;
        $h<HEAD> := $h<STREAM>.get
            orelse @s .= grep( { $_ !=== $h } );
    }
}

sub store (@values) {
    my ($filename, $filehandle) = tempfile(:prefix('external-sort.'));
    $filehandle.say: join "\n", @values.sort: +*;
    $filename
}

# we're going to pretend that this is a long stream of input from stdin...
my (@chunk, @files);
for (<43 2 45 32 15 4 3 -9 45 66 0 42 78 123 -11 76 55 87 -2 64 92 34>) {
    @chunk.push: $_;
    @files.push: store(@chunk) and @chunk = () if @chunk.elems == 4; # limit of records per merge file
}
@files.push: store(@chunk) if @chunk;

say join ' ', merge_streams @files».&open;
-11 -9 -2 0 2 3 4 15 32 34 42 43 45 45 55 64 66 76 78 87 92 123
Programming note: the input is generated by creating a file of N random records,
which are then broken up into sort work files of no more than 10 records each (the limit).
The sort work files are sorted with an external sort, and then merged into one big file.
This particular example uses the DOS SORT and ERASE commands.
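The split/sort/merge pipeline described above can be sketched in Python; this is a minimal sketch using only the standard library, where the `external_sort` name and the record `limit` are illustrative, in-memory `sorted` stands in for the DOS SORT command, and `heapq.merge` performs the k-way merge:

```python
import heapq
import tempfile

def external_sort(source, sink, limit=10):
    """Split `source` into sorted work files of at most `limit` records,
    then k-way merge the work files onto `sink`."""
    workfiles = []
    chunk = []

    def flush(records):
        # One sort work file per chunk; TemporaryFile is deleted on close.
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(sorted(records))
        f.seek(0)
        workfiles.append(f)

    # Phase 1: break the input into sorted runs of at most `limit` lines.
    for line in source:
        chunk.append(line)
        if len(chunk) == limit:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)

    # Phase 2: heapq.merge lazily merges the already-sorted work files,
    # keeping only one line per file in memory at a time.
    sink.writelines(heapq.merge(*workfiles))
    for f in workfiles:
        f.close()
```

Note that the comparison here is lexicographic on whole lines, matching the string-based merges elsewhere on this page; numeric records would need a `key=` or fixed-width formatting.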
/*REXX pgm reads a file, splits into smaller files, sorts 'em, combines into sorted file*/
parse arg FID n lim seed .                       /*obtain optional arguments from the CL*/
if FID=='' | FID==","  then FID= 'SORT_EXT.OUT'  /*name of the output (sorted) file.    */
if n==''   | n==","    then n= 500               /*number of records (rand #s) to gen.  */
if lim=='' | lim==","  then lim= 10              /*number of records per SORTWORK file. */
if datatype(seed, 'W') then call random ,,seed   /*Numeric?  Then use it as a rand seed.*/
sWork= 'SORTWORK.'                               /*the filename of the  SORTWORK  files.*/
call gen n, lim                                  /*generate   SORTWORK.nnn   files.     */
call srt #                                       /*sort records in all  SORTWORK  files.*/
call mrg                                         /*merge records in the SORTWORK files. */
exit 0                                           /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
mrg: procedure expose FID sWork;  parse arg #    /*#:  the number of SORTWORK files.    */
     @.= copies('ff'x, 1e5)                      /*no value should be larger than this. */
     call lineout FID,,1                         /*position the output file at rec # 1. */
       do j=1  until @.j==@.;  call rdr j        /*read any number of SORTWORK files,   */
       end   /*j*/                               /*but initially just 1 record per file.*/
     j= j - 1                                    /*adj. J; read from a non─existent file*/
       do forever;  y= @.                        /*find the lowest value for  N  values.*/
       z= 0
            do k=1  for j                        /*traipse through the stemmed @ array. */
            if @.k==@.  then call rdr k          /*Not defined? Then read a file record */
            if @.k<<y   then do;  y= @.k         /*Lowest so far? Then mark this as min.*/
                                  z= k
                             end
            end   /*k*/                          /* [↑]  note use of << exact comparison*/
       if z==0  then leave                       /*Any more records?   No,  close file. */
       call lineout FID, @.z                     /*write the value to the output file.  */
       call rdr z                                /*re-populate a value from the # file. */
       end   /*forever*/                         /*keep reading/merging until exhausted.*/
     call lineout FID                            /*close the output file (just in case).*/
     'ERASE' sWORK"*"                            /*delete all the SORTWORK files.       */
     return
/*──────────────────────────────────────────────────────────────────────────────────────*/
gen: procedure expose #;  parse arg m,siz;  d= digits()            /*used for justify*/
     #= 0                                        /*number of  SORTWORK.nnn  files so far*/
       do j=1  for m;  #= 1 + j % siz                              /*create workfile#*/
       call lineout 'SORTWORK.'#, right(random(, 1e5), d)          /*write rand #.   */
       end   /*j*/
       do k=1  for #;  call lineout 'SORTWORK.'#                   /*close a workfile*/
       end   /*k*/
     return
/*──────────────────────────────────────────────────────────────────────────────────────*/
rdr: parse arg a;  @.a= @.;  f= sWork || a;  if lines(f)\==0  then @.a= linein(f);  return
/*──────────────────────────────────────────────────────────────────────────────────────*/
srt: procedure expose sWork;  parse arg #
       do j=1  for #;  fn= sWORK || j;  'SORT' fn "/O" fn;  end  /*j*/;    return
A bit simpler than the Go version, as we use fixed-length integers which (together with a following space) can be sorted as strings.
import "io" for File
import "random" for Random
import "./dynamic" for Struct
import "./sort" for Sort
import "./str" for Str

var MinHeapNode = Struct.create("MinHeapNode", ["element", "index"])

class MinHeap {
    construct new(nodes) {
        _nodes = nodes
        var start = ((_nodes.count - 1) / 2).floor
        for (i in start..0) minHeapify(i)
    }

    left(i)  { 2*i + 1 }
    right(i) { 2*i + 2 }

    nodes { _nodes }
    min   { _nodes[0] }

    replaceMin(x) {
        _nodes[0] = x
        minHeapify(0)
    }

    minHeapify(i) {
        var l = left(i)
        var r = right(i)
        var smallest = i
        var heapSize = _nodes.count
        if (l < heapSize && Str.lt(_nodes[l].element, _nodes[i].element)) smallest = l
        if (r < heapSize && Str.lt(_nodes[r].element, _nodes[smallest].element)) smallest = r
        if (smallest != i) {
            _nodes.swap(i, smallest)
            minHeapify(smallest)
        }
    }
}

// Merge k sorted files: es0, es1 etc.
var mergeFiles = Fn.new { |outputFile, k, e|
    var inp = List.filled(k, null)
    var offset = List.filled(k, 0) // current offset for each input file
    for (i in 0...k) {
        var fileName = "es%(i)"
        inp[i] = File.open(fileName)
    }
    var out = File.create(outputFile)
    var nodes = List.filled(k, null)
    for (i in 0...k) nodes[i] = MinHeapNode.new(0, 0)
    var i = 0
    while (i < k) {
        var bytes = inp[i].readBytes(e)
        if (bytes.count < e) break // end of file reached
        nodes[i].element = bytes
        nodes[i].index = i
        offset[i] = offset[i] + e
        i = i + 1
    }
    var hp = MinHeap.new(nodes[0...i])
    var count = 0
    while (count != i) {
        var root = hp.min
        out.writeBytes(root.element)
        var bytes = inp[root.index].readBytes(e, offset[root.index])
        if (bytes.count < e) {
            // end of file reached
            root.element = "999999 "
            count = count + 1
        } else {
            root.element = bytes
            offset[root.index] = offset[root.index] + e
        }
        hp.replaceMin(root)
    }
    for (j in 0...k) inp[j].close()
    out.close()
}

// Create initial runs, divide them evenly amongst the output files
// and then merge-sort them.
var createInitialRuns = Fn.new { |inputFile, numWays, runSize, elementSize|
    var inp = File.open(inputFile)
    var offset = 0
    for (i in 0...numWays) {
        var fileName = "es%(i)" // es0, es1 etc.
        var bytes = inp.readBytes(runSize * elementSize, offset)
        offset = offset + runSize * elementSize
        var numbers = Str.chunks(bytes, elementSize)
        numbers = Sort.merge(numbers)
        File.create(fileName) { |f| f.writeBytes(numbers.join("")) }
    }
    inp.close()
}

var externalSort = Fn.new { |inputFile, outputFile, numWays, runSize, elementSize|
    createInitialRuns.call(inputFile, numWays, runSize, elementSize)
    mergeFiles.call(outputFile, numWays, elementSize)
}

// Create a small test file of 40 random 6 digit integers and split it into 4 files
// of 10 such integers each.
var numWays = 4
var runSize = 10
var elementSize = 7 // 6 digits + a following space
var inputFile = "external_sort_input.txt"
var outputFile = "external_sort_output.txt"
var inp = File.create(inputFile)
var rand = Random.new()
var min = 100000
var max = 999999
for (i in 0...numWays*runSize) inp.writeBytes("%(rand.int(min, max).toString) ")
inp.close()
externalSort.call(inputFile, outputFile, numWays, runSize, elementSize)

// remove temporary files
for (i in 0...numWays) {
    var fileName = "es%(i)"
    File.delete(fileName)
}
Sample run:
Contents of external_sort_input.txt:
195387 279593 270645 457221 187563 459521 984067 443317 890630 986820 357072 302605 354825 295908 541221 273855 318978 913819 961359 776939 337617 640070 100140 266938 597987 305187 731698 449166 388165 121283 516001 256453 197931 660491 785453 544828 346520 532447 688793 194774

Contents of external_sort_output.txt:
100140 121283 187563 194774 195387 197931 256453 266938 270645 273855 279593 295908 302605 305187 318978 337617 346520 354825 357072 388165 443317 449166 457221 459521 516001 532447 541221 544828 597987 640070 660491 688793 731698 776939 785453 890630 913819 961359 984067 986820
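The Wren version's trick of merging fixed-length integers as raw strings works because equal-width decimal strings compare lexicographically in the same order as the integers they encode. A quick Python check of that property (the values are illustrative; the Wren example avoids explicit padding by generating 6-digit values only, while this sketch zero-pads with `%06d`):

```python
# Fixed-width decimal strings (same length, no sign) sort
# lexicographically in the same order as their integer values,
# so a string-based merge also produces numeric order.
nums = [195387, 4067, 890630, 100140]
fixed = ["%06d " % n for n in nums]   # 6 digits + a following space

assert sorted(fixed) == ["%06d " % n for n in sorted(nums)]
assert min(fixed) == "004067 "        # smallest value sorts first
```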