
Detect and deal with mis-balancing in GEMM macro-kernel (#437) #562

Open
hominhquan wants to merge 1 commit into flame:master from hominhquan:misbalanced

Conversation

@hominhquan (Contributor)

Details:

  • In some multi-threading schemes, JR_NT and IR_NT may produce idle threads
    that perform no computation.
  • This commit detects such situations and implements a collapse of the JR/IR loops.

Note: run git show -w <sha1> for better readability, due to the new indentation level introduced by the added if ().

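For context, here is a minimal sketch of what "collapsing" the JR/IR loops means (function and variable names are illustrative only, not the actual BLIS macrokernel code): the m_iter * n_iter microtiles of the current MC x NC block are flattened into a single index space and dealt round-robin to all jr_nt * ir_nt threads, so no thread idles while unclaimed microtiles remain.

```c
#include "blis.h"  /* for dim_t, the signed integer type used by BLIS */

/* Sketch of a collapsed JR/IR loop; illustrative names, not BLIS API. */
void collapsed_jr_ir_sketch( dim_t m_iter, dim_t n_iter,
                             dim_t jr_nt,  dim_t ir_nt,
                             dim_t jr_tid, dim_t ir_tid )
{
    const dim_t n_threads = jr_nt * ir_nt;            /* threads in the macrokernel */
    const dim_t tid       = jr_tid * ir_nt + ir_tid;  /* flattened thread id */
    const dim_t n_tiles   = m_iter * n_iter;          /* microtiles in this MC x NC block */

    /* Round-robin over the flattened microtile index space. */
    for ( dim_t t = tid; t < n_tiles; t += n_threads )
    {
        const dim_t jr = t / m_iter;  /* microtile column (JR) index */
        const dim_t ir = t % m_iter;  /* microtile row (IR) index    */

        /* ... invoke the microkernel on microtile (ir, jr) of the
           current MC x NC block of C ... */
        (void)jr; (void)ir;
    }
}
```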
@fgvanzee (Member)

@devinamatthews Curious: did you ever take a look at this PR? What are your thoughts?

@devinamatthews (Member)

@fgvanzee I'll look at the changes in more detail. This was discussed in #437 and only affects cases where BLIS_JR_NT is probably too large.

@devinamatthews (Member) commented Jul 12, 2022 (edited)

Yeah, I'm not sure I'm comfortable with this. I really think the better answer is just to not use so many threads in the JR loop. @hominhquan Is there a particular reason to set BLIS_JR_NT so high? @fgvanzee ping

@hominhquan (Contributor, Author)

> @hominhquan Is there a particular reason to set BLIS_JR_NT so high?

Yes. On our MPPA processor, where matrix tiles (A, B, C) are copied to scratchpad by DMA, memory footprint is a big concern. We have 16 cores on each "cluster", but we only spawn one simultaneous NC-loop (aka BLIS_JC_NT = 1) at the top, then go down and share the computation between the cores in the macro-kernel. So the macro-kernel dispatch is subject to BLIS_JR_NT * BLIS_IR_NT = 16.

Spawning two NC-flows (BLIS_JC_NT = 2) could lighten the above constraint, but would inevitably double the local DMA scratchpad requirement, which we can't afford (see illustration: https://user-images.githubusercontent.com/1337056/178685971-7510aca3-ab44-46b2-aed3-43c25985f67b.png).

More generally, I think collapsing the two JR/IR loops in the macro-kernel could help reduce loop overhead and give better load balancing in any threading scheme, making it more "hardware-friendly". I remember (I may be wrong) having seen some commits or discussions about adding custom hand-optimized macro-kernels to BLIS? If so, I am curious to know the motivation behind that.

Note: In this PR I kept the original two nested loops and added a new collapsed one, which sadly makes the code bigger; if this were my personal repo, I would keep only the collapsed version.

@devinamatthews (Member)

You aren't doing any threading along the M dimension (BLIS_IC_NT)?

@hominhquan (Contributor, Author)

No, both my BLIS_JC_NT and BLIS_IC_NT are set to 1.

Only BLIS_JR_NT and BLIS_IR_NT can take any value in {1, 2, 4, 8, 16}, under the condition that BLIS_JR_NT * BLIS_IR_NT = 16.

@devinamatthews (Member)

Is this also a memory thing? Parallelizing along the IC loop would definitely be preferable.

Alternatively, since you are currently just collapsing the IR/JR loops, why not set IR_NT=4 and JR_NT=4?

@hominhquan (Contributor, Author) commented Jul 13, 2022 (edited)

Setting IR_NT=4 and JR_NT=4 could improve the situation, but it is not the ultimate solution. Here is an example of how the two-nested-loop dispatch can be mis-balanced on edge blocks:

M = N = 3000, MC = NC = 256, MR = 8, NR = 16.
The edge macro-block is 184-by-184 (3000 % 256 = 184) => m_iter = ceil(184/8) = 23, n_iter = ceil(184/16) = 12.

  • If BLIS_JR_NT = 16 and BLIS_IR_NT = 1, only 12 threads work in the JR loop, and 4 threads stand idle.
  • If BLIS_JR_NT = 8 and BLIS_IR_NT = 2, there is enough work for the 8 JR threads in the first batch (8 < n_iter),
    but in the second batch only 4 JR threads are working (n_iter - 8 = 4), and 4 JR threads stand idle.
    The same holds for the IR threads (since m_iter = 23 is not a multiple of BLIS_IR_NT).
  • If BLIS_JR_NT = 4 and BLIS_IR_NT = 4, the dispatch is perfect in the JR loop (n_iter = 12 is a multiple of BLIS_JR_NT), but not in the IR loop, where m_iter = 23 is not a multiple of BLIS_IR_NT. In the last batch of m_iter's (20 + 3), 3 IR threads are working and 1 IR thread is idle, times 4 JR threads = 4 threads in standby.

In this case, setting IR_NT=4 and JR_NT=4 can significantly reduce the mis-balancing for edge blocks of size 256x184 or 184x256 (right and bottom edges), though it does not remove it completely: we still have idle threads in some trailing iterations, and it does not help the final 184x184 edge block in the bottom-right corner, even if that impacts only one 184x184 block. (See the check below.)
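For concreteness, a tiny standalone program (my own check, not part of the PR) that reproduces these idle-thread counts from m_iter = 23 and n_iter = 12 for the two-nested-loop dispatch:

```c
#include <stdio.h>

/* Standalone check (not BLIS code) of the idle-thread counts quoted above
   for the 184x184 edge block: m_iter = 23, n_iter = 12. */
int main( void )
{
    const int m_iter = 23, n_iter = 12;
    const int nt[3][2] = { { 16, 1 }, { 8, 2 }, { 4, 4 } };  /* { JR_NT, IR_NT } */

    for ( int p = 0; p < 3; ++p )
    {
        const int jr_nt = nt[p][0], ir_nt = nt[p][1];
        const int jr_rem = n_iter % jr_nt;  /* JR iterations in the trailing JR batch */
        const int ir_rem = m_iter % ir_nt;  /* IR iterations in the trailing IR batch */
        const int jr_idle = jr_rem ? jr_nt - jr_rem : 0;
        const int ir_idle = ir_rem ? ir_nt - ir_rem : 0;

        printf( "JR_NT=%2d, IR_NT=%2d: %d JR thread(s) idle (x %d IR each) "
                "in last JR batch; %d IR thread(s) idle (x %d JR each) "
                "in last IR batch\n",
                jr_nt, ir_nt, jr_idle, ir_nt, ir_idle, jr_nt );
    }
    return 0;
}
```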

@hominhquan (Contributor, Author) commented Jul 13, 2022 (edited)

> Is this also a memory thing? Parallelizing along the IC loop would definitely be preferable.

Yes, it is still the memory-footprint issue. Setting IC_NT to anything bigger than 1 would also increase the DMA scratchpad requirement.

@devinamatthews (Member)

@fgvanzee what might happen if the collapsed version were used all the time?

@fgvanzee (Member)

@devinamatthews For context: The only reason I asked about this PR when I did is because an industry partner asked about it during a conference call, so I wanted to be able to tell them what was going on with it. Now that I've looked at it more closely, I can say that I'm not comfortable with the current implementation. But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.

> @fgvanzee what might happen if the collapsed version were used all the time?

I would need to study the code more before answering this.

@devinamatthews (Member)

> But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.

This was my concern as well, and hence the question.

@hominhquan (Contributor, Author)

> But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.

> This was my concern as well, and hence the question.

Maybe I was overly cautious when doing this PR and kept the original implementation. As I said before, if this were my code I would replace the two nested loops with the collapsed one.

@hominhquan (Contributor, Author)

> But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.

> > @fgvanzee what might happen if the collapsed version were used all the time?
>
> I would need to study the code more before answering this.

@devinamatthews @fgvanzee I've looked at the code again and saw that thrinfo_t* thread and thrinfo_t* caucus are already created in the thread-info tree. My modification, which uses yet another temporary thrinfo_t, is not optimal either.

Furthermore, I would like to know your thoughts on the relevance of always collapsing the JR/IR loops. Is there any other macro-kernel in which collapsing the JR/IR loops is bad, difficult, or impossible? (1m? trsm? mixed precision? mixed domain?)

  • If the answer is yes, then we can apply the collapsing case by case, where the modification is trivial (e.g. real-real gemm).
  • If the answer is no, and if you are convinced that collapsing is good for performance, we might apply it everywhere and reconsider the utility of "caucus" in the thread-info tree.

fgvanzee added a commit that referenced this pull request on Dec 9, 2022
Details:
  • Reimplemented parallelization of the JR loop in gemmt (which is recycled for herk, her2k, syrk, and syr2k). Previously, the rectangular region of the current MC x NC panel of C would be parallelized separately from the diagonal region of that same submatrix, with the rectangular portion being assigned to threads via slab or round-robin (rr) partitioning (as determined at configure-time) and the diagonal region being assigned via round-robin. This approach did not work well when extracting lots of parallelism from the JR loop and was often suboptimal even for smaller degrees of parallelism. This commit implements tile-level load balancing (tlb), in which the IR loop is effectively subjugated in service of more equitably dividing work in the JR loop. This approach is especially potent for certain situations where the diagonal region of the MC x NR panel of C is significant relative to the entire region. However, it also seems to benefit many problem sizes of other level-3 operations (excluding trsm, which has an inherent algorithmic dependency in the IR loop that prevents the application of tlb). For now, tlb is implemented as _var2b.c macrokernels for gemm (which forms the basis for gemm, hemm, and symm), gemmt (which forms the basis of herk, her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and trmm3). Which function pointers (_var2() or _var2b()) are embedded in the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp macro is defined, which is controlled by the value passed to the existing --thread-part-jrir=METHOD (or -r METHOD) configure option. This script adds 'tlb' as a valid option alongside the previously supported values of 'slab' and 'rr'. ('tlb' is now the default.) Thanks to Leick Robinson for abstractly inspiring this work, and to Minh Quan Ho for inquiring (in PR #562, and before that in Issue #437) about the possibility of improved load balance in macrokernel loops, and even prototyping what it might look like, long before I fully understood the problem.
  • In bli_thread_range_weighted_sub(), tweaked the way we compute the area of the current MC x NC trapezoidal panel of C by better taking into account the microtile structure along the diagonal. Previously, it was an underestimate, as it assumed MR = NR = 1 (that is, it assumed that the microtile column of C that overlapped with the diagonal exactly coincided with the diagonal). Now, we only assume MR = NR. This is still a slight underestimate when MR != NR, so the additional area is scaled by 1.5 in a hackish attempt to compensate for this, as well as other additional effects that are difficult to model (such as the increased cost of writing to temporary tiles before finally updating C). The net effect of this better estimation of the trapezoidal area should be (on average) slightly larger regions assigned to threads that have little or no overlap with the diagonal region (and correspondingly slightly smaller regions in the diagonal region), which we expect will lead to slightly better load balancing in most situations.
  • Spun off the contents of bli_thread.[ch] that relate to computing thread ranges into one of three source/header file pairs:
    - bli_thread_range.[ch], which defines functions that are not specific to the jr/ir loops;
    - bli_thread_range_slab_rr.[ch], which defines functions that implement slab or round-robin partitioning for the jr/ir loops;
    - bli_thread_range_tlb.[ch], which defines functions that implement tlb for the jr/ir loops.
  • Fixed the computation of a_next in the last iteration of the IR loop in bli_gemmt_l_ker_var2(). Previously, it always "wrapped" back around to the first micropanel of the current MC x KC packed block of A. However, this is almost never actually the micropanel that is used next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use in the upper-stored case (which *does* actually always choose the first micropanel of A as its a_next at the end of the IR loop).
  • Removed adjustments for a_next/b_next (a2/b2) for the diagonal-intersecting case of gemmt_l_ker_var2() and the above-diagonal case of gemmt_u_ker_var2(), since these cases will only coincide with the last iteration of the IR loop in very small problems.
  • Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of which explicitly considers whether the current microtile is the last tile that intersects the diagonal. (The former does the same, but the computation coincides with the original bli_is_last_iter().) These functions are now used in gemmt to test when a_next (or a2) should "wrap" (as discussed above). Also defined bli_is_last_iter_tlb_l() and bli_is_last_iter_tlb_u(), which are similar to the aforementioned functions but are used when employing tlb in gemmt.
  • Redefined macros in bli_packm_thrinfo.h, which test whether an iteration of work is assigned to a thread, as static inline functions in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h). In the process of redefining these macros, I also renamed them from bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl().
  • Renamed:
    - bli_thread_range_jrir_rr() -> bli_thread_range_rr()
    - bli_thread_range_jrir_sl() -> bli_thread_range_sl()
    - bli_thread_range_jrir()    -> bli_thread_range_slrr()
    - bli_is_last_iter()         -> bli_is_last_iter_slrr()
  • Defined bli_info_get_thread_jrir_tlb() and renamed:
    - bli_info_get_thread_part_jrir_slab() -> bli_info_get_thread_jrir_slab()
    - bli_info_get_thread_part_jrir_rr()   -> bli_info_get_thread_jrir_rr()
  • Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism into the JR loop when tlb is enabled for non-trsm level-3 operations.
  • Added a sanity check to prevent bli_prune_unref_mparts() from being used on packed objects. This prohibition is necessary because the current implementation does not take into account the atomicity of packed micropanel widths relative to the diagonal of structured matrices. That is, the function prunes greedily without regard to whether doing so would prune off part of a micropanel *which has already been packed* and assigned to a thread for inclusion in the computation.
  • Further restricted early returns in bli_prune_unref_mparts() to situations where the primary matrix is not only of general structure but also dense (in terms of its uplo_t value). The addition of the matrix's dense-ness to the conditional is required because gemmt is somewhat unusual in that its C matrix has general structure but is marked as lower- or upper-stored via its uplo_t. By only checking for general structure, attempts to prune gemmt C matrices would incorrectly result in early returns, even though that operation effectively treats the matrix as symmetric (and stored in only one triangle).
  • Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges were computed when 1 < bf. Thankfully, this bug was not yet manifesting since all current invocations used bf == 1.
  • Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2() that would perform incorrect pruning of unreferenced regions above where the diagonal of a lower-stored matrix intersects the right edge. Thankfully, the bug was not harming anything since those unreferenced regions were being pruned prior to the macrokernel.
  • Rewrote slab/rr-based gemmt macrokernels so that they no longer carve C into rectangular and diagonal regions prior to parallelizing each separately. The new macrokernels use a unified loop structure in which quadratic (slab) partitioning is used.
  • Updated all level-3 macrokernels to have a more uniform coding style, such as with respect to combining variable declarations with initializations, as well as the use of const.
  • Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that corresponded to functions that were removed in aeb5f0c.
  • Other very minor cleanups.
  • Comment updates.
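As a rough illustration of the tlb idea described in this commit message (a sketch under assumed names, not the actual _var2b macrokernel code): each thread receives a contiguous, nearly equal range of the m_iter * n_iter microtiles, rather than a slab of JR iterations subdivided across IR threads.

```c
#include "blis.h"  /* for dim_t */

/* Sketch of tile-level load balancing (tlb); illustrative only. All
   m_iter * n_iter microtiles are split into contiguous ranges, one per
   thread, so per-thread workloads differ by at most one microtile. */
void tlb_range_sketch( dim_t m_iter, dim_t n_iter,
                       dim_t tid,    dim_t n_threads )
{
    const dim_t n_tiles = m_iter * n_iter;
    const dim_t t_lo    = ( n_tiles * ( tid     ) ) / n_threads;  /* inclusive */
    const dim_t t_hi    = ( n_tiles * ( tid + 1 ) ) / n_threads;  /* exclusive */

    for ( dim_t t = t_lo; t < t_hi; ++t )
    {
        const dim_t jr = t / m_iter;  /* microtile column (JR) index */
        const dim_t ir = t % m_iter;  /* microtile row (IR) index    */

        /* ... update microtile (ir, jr) of the current MC x NC panel of C ... */
        (void)jr; (void)ir;
    }
}
```

Unlike the round-robin collapse sketched earlier in the thread, a contiguous range keeps consecutive microtiles within the same microtile column together, while still bounding the per-thread imbalance to a single tile.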
fgvanzee added a commit that referenced this pull request on Jan 11, 2023
[Commit message essentially identical to the Dec 9, 2022 commit above, except that 'slab' remains the default value for --thread-part-jrir, and one additional item: updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and bli_thrinfo_thread_id(), respectively (a change that probably should have been included in aeb5f0c).]
fgvanzee added a commit that referenced this pull request on May 20, 2024
[Same commit message as the Jan 11, 2023 commit above; cherry picked from commit 2e1ba9d.]
3 participants: @hominhquan, @fgvanzee, @devinamatthews
