- Notifications
You must be signed in to change notification settings - Fork0
Simple 9x9 Sudoku brute force solver with intrinsic parallel candidate set processing using bits to represent digits in the [1, 9] range, and bitwise operations to test a candidate against the candidate set, all at once.
nilostolte/Sudoku
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Simple 9x9 Sudoku brute force solver with intrinsic parallel candidate set processing using bits to represent digits in the [1, 9] range, and bitwise operations to test a candidate against the candidate set, all at once.
It can be upgraded for 16x16 or 25x25 grids.
The algorithm was implemented inJava, inC, as well as inZig. The descriptionbelow concerns the Java implementation, even though theC implementation is quite similar, butwithout classes.Zig implementation is similar to C's but faster and with an OOP style stack. Inthe Zig versionmany optimizations allowed to achieve a minimum running time of 0.7916miliseconds for the sametest grid run on my
Intel Core i7-2670QM @ 2.20GHz laptop:
The supplied Windows 64 executables for theC andZig implementations can beused to solve arbitrary grids as described in thedocumentation.
This is theclass containing the grid to be solved.
The grid can be initialized using a 9x9 matrix of typechar[][]
or through a linear string containing all the elements, representatingempty elements as 0 (or ' . ' in the C or Zig version), both given line by line. Thechar[][]
is the unique input, however, and it must exist before being able to useany other input format. Even though the 9x9 matrix contains characters (it's achar[][]
), the digits are not represented as ASCII or Unicodecharacters but rather as integers. In other words, the character '0' is actually represented by 0, and so forth.
In the string input format the string is just copied over the existing inputchar[][]
matrix using the static functionset
. This string usesASCII representation for the digits which are converted to integers by the functionset
.
An additional representation is possible, as illustrated inMain.java, byrepresenting the charcater '0' with the character '.' in the string. In this case one adds.replace('.','0')
at the end of the string as shown.
Both string input formats are common representations of Sudoku grids on the web.
The main data structure inGrid
ismatrix
which is a 9x9 matrix in identical format as the input matrix for the grid. This is the matrixwhere the input matrix is copied to.
The main auxiliary data structures are the most interesting part of this class, besides the solver algorithm itself:
lines
- an array with 9 positions, each one, corresponding to a line in the grid, and functioning as a set where each bit represents a digitalready present in that line.cols
- an array with 9 positions, each one, corresponding to a column in the grid, and functioning as a set where each bit represents a digitalready present in that column.cells
- a 3x3 matrix, corresponding to a 3x3 cell that the grid is subdivided, with 9 positions, each one functioning as a set where each bitrepresents a digit already present in that cell.
stk
- the stack to implement the backtracking algorithm. It uses an array of 81 positions. It uses thepush
andpop
operators as shown inthealgorithm below. Thepush
operator not only stores the digit, itsbinary representation,the line and column (i
andj
) of the element inserted in a stack node (StkNode
),"pushing" the node in the stack, but also inserts thedigit in the internal matrix (matrix[i][j]
) as well as its binary representation into the auxiliary data structures, thus, updating the candidateset of the new element inserted. Thepop
operation only removes the node from the stack, but the node is not garbage collected. It remains in thestack as an unused element. Nodes are lazily allocated, asnull
elements are found while pushing.cel
- an array with 9 positions, each one is the inverse mapping of the indices in the lines and columns transformed into indices in the 3x3matrixcells
.
All main auxiliary data structures use a common notation to represent a set of digits present in the line, column, or cell, accordingly.A bit is set to one at the position corresponding to a digit present in the set, or set to zero if it's position corresponds to a digit thatis absent. By reversing the bits one gets the "candidate set" of digits that are still missing in the corresponding line, column or cell. Fora better understanding of this candidate set scheme, please refer to thesubsection explaining how digits are represented in binary.
Let's suppose a particular line, column or cell having the digits, 1, 3, 4 and 9. This set is then represented by the following binary number:
100001101 =0x10D
- the first rightmost bit corresponds to the digit 1, and in this case it's present in the set already.
- the second bit on its left corresponds to the digit 2, and its clearly not present yet since its value is zero.
- bits three and four, corresponding to the digits 3 and 4, respectively, are clearly present, because they are both set to one.
- bits five, six, seven, and eight are all zeros, and thus, digits 5, 6, 7 and 8 are clearly absent in the set.
- bit 9 is 1. Therefore, the digit 9 is also present in the set.
In order to obtain a candidate set for a givenmatrix[i][j]
element of the grid one calculates:
lines[i] | cols[j] | cells[ cel[i] ][ cel[j] ]
(1)
The expression in (1) gives a set where all bits containing zeros correspond to the available digits that are possible to be inmatrix[i][j]
.The candidate set is detected by the absent elements in the set, that is, all bits which are zero.
The interest in this notation is that the concatenation of all three sets is obtained by just using two bitwise or operations.
One can observe howcel
inverse mapping works to access the corresponding cell incells
. First,i
andj
are used as indices incel
.cel[i]
andcel[j]
give the corresponding line and column incells
. Therefore,cells[cel[i]][cel[j]]
corresponds to the cell wherematrix[i][j]
is contained.
publicvoidsolve() {StkNodenode;intdigit =1,code =1,inserted;inti,j;char[]line =matrix[0];charc;i =j =0;do {c =line[j];if (c ==0) {inserted =lines[i]|cols[j]|cells[cel[i]][cel[j]];for ( ;digit !=10 ;digit++,code <<=1 ) {if ((code &inserted ) ==0 ) {push(i,j,code,digit);digit =code =1;break; } }if (digit ==10 ) {// no insertion -> backtrack to previous elementnode =pop();// pop previous inserted i, j, and digiti =node.i;j =node.j;digit =node.digit;code =node.code;remove(node);// remove digit from data structuresdigit++;code <<=1;// let's try next digit;line =matrix[i];// maybe line has changedcontinue;// short-circuit line by line logic } }if (j ==8 ) {// line by line logicj = -1;i++;// last line element, advance to next lineif (i <9)line =matrix[i];// update line from grid matrix }j++;// advance to next element in the line }while (i <9); }
In the binary representation, a digit is always a power of two, since it's a number with only one bit set to 1 at the position correspondingto the digit. The table below shows the correspondence between digits and their binary representation:
Digit | Binary Representation | Hexadecimal | Decimal |
---|---|---|---|
0 | 000000000 | 0x000 | 0 |
1 | 000000001 | 0x001 | 1 |
2 | 000000010 | 0x002 | 2 |
3 | 000000100 | 0x004 | 4 |
4 | 000001000 | 0x008 | 8 |
5 | 000010000 | 0x010 | 16 |
6 | 000100000 | 0x020 | 32 |
7 | 001000000 | 0x040 | 64 |
8 | 010000000 | 0x080 | 128 |
9 | 100000000 | 0x100 | 256 |
The binary representation as exposed in the table above is often called here as the"code" of the digit.
As we can see the variableinserted
contains the "candidate set" for a givenmatrix[i][j]
. This algorithm is quite simple but itcontains a major drawback. Since the digit is represented with a 1 bit in its corresponding position in variablecode
, and it accessesthe candidate set in a sequential way, it loops until an empty bit is found (( code & inserted ) == 0 )
) or if it finds no availablecandidate (digit == 10
).
This means that even if there are no available candidates, the algorithm has to loop over all the remaining bits sequentially. Even if the binaryrepresentation allows to deal with the candidate set with all elements in parallel, that is, all elements at once, we still have to accessit one by one sequentially even when there are no useful results. This problem is adressed with some partial solutions as shownhere andhere, but this later employs far too many operations, despitethe fact it's a branchless solution. It's only interesting when associated with other optimizations as it has been done in theC version.
Digits are tried in ascending order from 1 to 9 for each element in the grid that is not yet occupied. That's whydigit
andcode
variables are both initialized with 1. Every time a new digit is tried against the candidate set, and a successful candidate is found(that is, when( code & inserted ) == 0 )
), the digit is pushed on the stack.
Thepush
function also updatesmatrix[i][j]
,lines[i]
,cols[j]
andcells[cel[i]][cel[j]]
with the new digit. Please check thecode and the description ofstk
for details.
When no suitable candidate is found (that is, when( code & inserted ) == 0 )
fails for every candidate tried), then thefor
loopends, anddigit == 10
. In this case, we need to backtrack, that is, remove the current candidate, and advance the previous inserteddigit to be the next candidate. This is taken care by the instructions found under theif ( digit == 10 )
statement, where the previouscandidate is popped from the stack, removed frommatrix
and the auxiliary data structures (functionremove
), and advanced tobe the next candidate (digit
is incremented andcode
is shifted left). Notice that this command sequence terminates with acontinue
statement in order to skip the line by line logic. Since the line and column (i
andj
) of the element to be dealt next are alreadyknown (they were popped from the stack), modifyingi
orj
is not required. Also of note, if all the possible candidates weretried,digit
will become 10, thefor
loop is summarily skipped, and the flow goes back into this code sequence to backtrack onceagain, dealing with the cases of "cascaded" backtracking sequences.
This completes the backtracking mechanism, allowing, as can be easily infered, to obtain the solution of the input grid in the internalmatrix. As shown inMain.java, the solution is printed using the functionprint
.
The logic to check if there are no candidates with no loops is much more involved than what's done in the algorithm above, but its not rocketscience. It only requires more effort to use our bit representation in a smarter way.
Every power of two subtracted by one is always equal to a sequence of ones on the right of the position it was previously one (except in thecase of 02, since there are no more binary digits on the right of 1). For example, the digit 8 in binary is 128 in ourrepresentation. When subtracted by one, that's 127, that is, 8 bits set to 1 on the right of bit 8:
128 - 1 = 010000000 - 1 = 001111111
By reversing every bit of this result one obtains a mask that's unique when all these bits are 1, that is, when there are no candidates fromthe bit in the current position until the last bit:
~001111111 = 110000000
That is, by executing a bitwiseand operation (&
) between this mask and a candidate set, and if the result is identical to this mask,we can say there is no available candidates left in the candidate set, starting with the digit we are trying, 8 in this case.
Let's check the same logic with digit = 5:
~(000010000 - 1) = ~000001111 = 111110000
Then by testing if
111110000 & inserted == 111110000
What this is actually saying is that there are no candidates neither for 5, neither any digit above it. In other words, this is exactly thecondition we were looking for.
One could call this asreachable
, that is, more formally speaking what we've got is:
reacheable = (~(code-1)) & 0x1ff;
Notice that we have to filter out all bits above bit 9. Then the condition searched would be written like
if ( (inserted & reacheable ) == reacheable )
(2)
In this caseif
statement (2) can substitute the followingif
statement in the algorithm:
if ( digit == 10 )
And we should placeif
statement (2) above thefor
loop statement instead of the order presented in the algorithm. In thiscase thefor
can be written with no final condition, since it would never be reached:
for ( ; ; digit++, code <<= 1 )
(3)
The reason for that is that if there are no candidates, as calculated here, then the condition of theif
statement (2) must be trueand, therefore, thecontinue
statement relative to the do-while statement is executed before thefor
statement (3) is ever reached.This obviously short-circuits thefor
statement (3), since it is now below theif
statement (2). If thefor
statement (3) is reached,the condition in theif
statement (2) must have been false. In this situation there will always be a valid candidate and thebreak
command relative to thefor
statement (3) will be executed, always ending this loop with no need to test the end condition.
Another way to see this optimization is by observing that instead of calculating the mask as explained above, which implies using an intermediatevariablereacheable
, one can infere an equivalent conclusion by simply discarding this variable and using the following test instead of if statement (2):
if ( (inserted + code ) > 511 )
(2a)
Which we call here an alternative to (2), or (2a) for short.
If there are only ones ininserted
starting at the position of the 1 incode
, addingcode
toinserted
will result insome value that is obviously beyond 511 (or 0x1ff). Therefore, we can detect the same situation with only the test (2a), not onlyeliminating the need of calculating the mask, but also the need of the variablereacheable
.
The benchmarks to measure algorithm performance were performed on an i7 2.2 Ghz machine in Java and in C.Theexecutable file compiled in C hasbeen done with optimization option-O3
using thegcc compiler on Windows provided inw64devkit, which is a Mingw-w64gcc compiler that is portable (can be installed by justcopying the directory structure in disk, SD card, or thumb drive).
The benchmarks were executed with several different grids, but particularly with thisone, which is known to be time consuming in automatic methods, and used to compare speed of different methodson the web:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|
1 | |||||||||
2 | |||||||||
3 | |||||||||
4 | |||||||||
6 | |||||||||
6 | |||||||||
7 | |||||||||
8 | |||||||||
9 |
The minimal time measured for the optimized algorithm to solve the above grid after several attempts was 10 miliseconds,and the double for the unoptimized algorithm. Nevertheless, the times verified were quite variable as usual in Javawhile measuring fast algorithms like this. This is the reason it would worth trying to implement it with an entirelycompiled language (Java is only compiled when the JIT compiler is triggered) to verify if execution times are lessvariable. It looks like that for this kind of problem, an enterily compiled language would be more appropriate, sinceone expects similar times for the same grid running at different times. Unfortunately this is not the case for thisJava implementation.
Astonishingly, execution times running the executable compiled in C were only slightly more constant than in Java. Thetimes varied from 1.5 miliseconds to 5.26 miliseconds. However, these variations were considerably much less significantthan in Java. Also, C offered roughly about an order of magnitude to about twice less time than the Java implementationof the same optimized algorithm. Several optimizations were devised besides the ones mentioned below. After all theseoptimizations were applied, one obtained a significantimprovement in performance and theWindows 64 executable supplied was generated withthe resulting source code.
The parallel test for no candidates allows to discard unnecessaryfor
loop iterations, while also discarding the unecessary endcondition of thefor
loop (since the order of theif
statement (2) and thefor
statement was reversed). Nevertheless, fordetecting the first candidate one still has to loop and test the digits one by one sequentially against theinserted
set.
But there is a way to calculate the next candidate without any loop. The technique can be illustrated through and example.Supposing the setincluded = 101011110
(that is {9,7,5,4,3,2}, the set of digits already inserted) anddigit = 000000010
(2), one starts by adding both:
101011110 // included digits set: {9,7,5,4,3,2} + 000000010 // digit = 2
Which is equal to101100000
. One now does an exclusive or withincluded
:
101100000 ^ 101011110
Which is equal to000111110
. One now adds digit again:
000111110 + 000000010
Which is equal to001000000
. The bit representation of the next candidate, is obtained by shifting one position toright:
001000000 >> 1
Which is equal to000100000
. This corresponds to the digit 6, which is exactly the first zero bit found byapplying the for loop (3).
Therefore, assumingcode
as the bit representation of the digit, one calculates the next candidate doing:
code = (((code +inserted) ^inserted) +code) >>1;// branchless code calculation
The problem is that one only obtains the bit representation of the digit, not the digit itself. As, one can see,digit
isnecessary to be able to use this technique.
To obtain the digit from its code, one "assembles" the bit configuration of the digit from its bit representation (code
) as follows:
digit = (code >>8 ) | ((code &0x40 ) >>6 ) | ((code &0x140) >>5 ) | ((code &0xf0 ) >>4 ) | ((code &0x20 ) >>3 ) | ((code &0x14 ) >>2 ) | ((code &0x0c ) >>1 ) | (code &3);
This conversion is not only complex to understand, but also requires a high number of operations. Trying out this code and thebrachless calculation of the next candidate as shownpreviously, the minimal timein C passed from 1.5 to 1.4 miliseconds, which apparently wouldn't seem to justify the effort.
However, after multiple further opimizations, including usingregister
variables, the minimal running time was reduced to1.2 miliseconds. This corresponds to a speedup of roughly 20%, which starts to become quite consequential. It's clear thatthis is also consequence of the highly "imperative" way of implementing thisalgorithmwhich manifestly highly benefits the C implementation, that in itself is more easily optimizable by employing extremely low levelgimmicks that are absent in Java.
Another way to do the calculationaboveis using tables. For example, in C:
unsigned shortc1[]= {0,1,2,0,3 };unsigned shortc2[]= {0,4,5,0,6 };unsigned shortc3[]= {0,7,8,0,9 };
One can compose the digit from its bit representationcode
in the following way:
digit=c1[code&7] |c2[(code >>3)&7] |c3[code >>6];
This code is more understandable than thepreviousone. If the digit is 1, 2 or 3, one simply filters the first 3 bits ofcode
and index the tablec1
with this result. Position 3 is invalidsincecode
has only 1 bit set, and, thus, it can't be 3. Notwithstanding, the resulting operation can be zero, in the case the binaryrepresentation doesn't have any bit set in that range. In this case, to satisfy the branchless logic, the table value is 0.If the digit is 4, 5 or 6, one shiftscode
to the right 3 positions and filters the first 3 bits and index the tablec2
with this result. The same logic applies to digits 7, 8 and 9, using tablec3
. Since one doesn't know which one is correct, one simplyapply a binary or operation with the 3 results, after all only one of them contains the good digit. The other two will be zero.
Trying this solution instead of theprevious,had a significant impact in the minimal execution time of the compiled C code, that was reduced to practically 1 millisecond, that is,an optimization of more than 30%, since the initial minimal time in our comparisons was 1.5 milliseconds.
The several optimizations proposed are complex to understand and most of them do not result in a significant speed up. Theinitial algorithm andin theJava andCcodes, are more clear and relatively easy to understand after the binary representation is understood.
The idea of parallelizing the code by dealing with the whole candidate set at once just using binary representation is promising.However, it falls short if one was thinking in using its intrisic parallelism in the entire algorithm. As seenabove, theapproach allows branchless solutions for the sequential search of a candidate from an arbitrary digit value, which only partiallyexploits this intrisic paralellism. Notwithstanding, it's heavily relying on the integer addition carry propagation mechanism,which is actually a sequential mechanism, but implemented highly efficiently in hardware. This is just additional ingenuity, butnot the same approach. The actual problem in this partial solution is that it's highly complex and requires a high numberof operations. Thus, it highly diverges from the extreme simplicity of the original algorithm. Fortunately, associated with numerousother low level optimizations in C language, it contributed to a significantspeedup (as can be seenhere), anda better speedup as well as less complexity (as seenhere).
A comparative test between the Java implementation and an identical C inplementation has given a considerable advantage to the Cimplementation, not only in terms of raw performance, but also in terms of less variability in times measured for solvingthe same grid, even though, variable execution times were also present in the C implementation. This was expected since Javaactivates the JIT compiler not quite regularly in codes that are executed in short ammounts of time like this one.
Given the extremely short execution times, the low level nature of theoriginal algorithm,and the considerable amount of low level optmizations that are possible in C language, one may confortably conclude that C is themost appropriate language to use the algorithm, since it will provide faster answers. This means, that the C implementation can be seenas the ideal engine for an interactive program where the grid can be entered through a GUI and that the solution must be suppliedin real time when it is requested by the user.
About
Simple 9x9 Sudoku brute force solver with intrinsic parallel candidate set processing using bits to represent digits in the [1, 9] range, and bitwise operations to test a candidate against the candidate set, all at once.