OpenSmalltalk/opensmalltalk-vm


Update BitBlt support (primarily for 64-bit ARM) #565

Open

bavison wants to merge 17 commits into OpenSmalltalk:Cog from bavison:bitblt_aarch64


Conversation


Contributor

bavison commented May 4, 2021

The accelerated BitBlt framework was initially targeted at the ARM11, running the AArch32 instruction set (which is the only one it fully supported).

More recent ARMs run much faster, which has enabled more comprehensive testing via the BitBlt fuzz test framework (https://github.com/bavison/SqueakBitBltTest). This has detected a handful of bugs in both the AArch32-specific and the architecture-neutral parts of the fast BitBlt framework. First I address these.

Next, I add a number of BitBlt fast paths written in platform-independent C. The 8-to-32bpp conversion routine is as fast as anything I could manage with hand-crafted AArch64 assembly. Others are useful as reference implementations for other architectures, or to fill in gaps in their abilities (for example, while I've introduced a class of fast paths for colour maps that only feature two distinct colours, I haven't retrospectively written any AArch32 fast paths for them, so the C fast path will be used for them on AArch32).

The fast path that handles operations with scalar halftoning and 32bpp destination images is a bit of a special case, in that it acts to extend the capabilities of other fast paths. It thus accelerates both AArch32 and AArch64.

The most significant commit, however, is the last one. This features a collection of fast paths implemented using inline AArch64 assembly, tuned for Cortex-A72 (as found in the Raspberry Pi 4). Based on the results of profiling, this has an emphasis on operations with a 32bpp destination image.

Operations with any source depth, in conjunction with 22 of the possible combinationRules (including the common sourceWord, pixPaint and alphaBlend rules) should all be accelerated, providing you don't use little-endian pixel packing, vector halftoning, or non-standard colour map rules when converting from different colour depths.

There are additional fast paths for alphaBlend for either a constant source colour, or a source image whose colour map only consists of two different colours (i.e. where the source image is effectively used as a 1bpp mask, despite being of a greater depth).

bavison added 17 commits

May 4, 2021 11:30

Don't pass -m32 to GCC for ARM builds

9ebc245

Correct various "#if ENABLE_FAST_BLT" to "#ifdef"

38c4283

ENABLE_FAST_BLT is typically not assigned a value even when it is defined, so the "#if" form is technically a syntax error.
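A minimal sketch of the difference, assuming the macro is defined with an empty body. The helper `fast_blt_available` and the macro `FAST_BLT_AVAILABLE` are hypothetical names used only for illustration; they are not part of the plugin.

```c
/* Stand-in for the build configuration: defined, but with no value. */
#define ENABLE_FAST_BLT

/* "#if ENABLE_FAST_BLT" would expand to "#if" with no expression here,
 * which the preprocessor rejects. "#ifdef" only asks whether the macro
 * exists, so it works whether or not a value was assigned. */
#ifdef ENABLE_FAST_BLT
#define FAST_BLT_AVAILABLE 1
#else
#define FAST_BLT_AVAILABLE 0
#endif

/* Returns 1 when the fast paths were compiled in, 0 otherwise. */
int fast_blt_available(void)
{
    return FAST_BLT_AVAILABLE;
}
```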

Don't assume sourcePPW is valid on entry to copyBitsFallback

43ce975

This is not the case when being called from the "fuzz" or "bench" test applications. It may also not be accurate if a fast path has been synthesised from a combination of copyBitsFallback and one or more other fast paths.

Fallback routines need extra help to detect intra-image operations

df667ca

In some places, sourceForm and destForm were being compared to determine which code path to follow. However, when being called from fuzz or other test tools, these structures aren't used to pass parameters, so the pointers haven't been initialised and default to 0, so the wrong code path is followed. Detect such cases and initialise them from sourceBits and destBits instead, since these will perform the same under equality tests.
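The idea can be sketched as follows. The `blit_op` structure and both function names are hypothetical simplifications; the real plugin's types and field layout differ, and only the field names sourceForm, destForm, sourceBits and destBits come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified operation descriptor. */
typedef struct {
    const void *sourceForm;         /* left as NULL by test tools */
    const void *destForm;           /* left as NULL by test tools */
    const unsigned int *sourceBits; /* always valid */
    unsigned int *destBits;         /* always valid */
} blit_op;

/* If neither form was supplied, substitute the pixel pointers, which
 * compare equal exactly when source and destination are the same image. */
static void fix_up_forms(blit_op *op)
{
    assert(op != NULL);
    if (op->sourceForm == NULL && op->destForm == NULL) {
        op->sourceForm = op->sourceBits;
        op->destForm = op->destBits;
    }
}

/* The equality test that selects the intra-image code path. */
static int is_intra_image(const blit_op *op)
{
    return op->sourceForm == op->destForm;
}
```

Without the fix-up, two distinct images passed in by a test tool would both carry a null form pointer and would wrongly be treated as the same image.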

Remove invalid shortcut in rgbComponentAlphawith

51df83e

This shortcut is triggered more frequently than it used to be, due to improvements in copyLoop() that avoid buffer overruns.

Fix bug in 32-bit ARM fast paths

a64c5b6

When classed as "wide" because each line is long enough to warrant pipelined prefetching as we go along, the inner loop is unrolled enough that there is at least one prefetch instruction per iteration. Loading the source image can only be done in atoms of 32 bit words due to big-endian packing, so when destination pixels are 8 times wider (or more) than source pixels, the loads happen less frequently than the store atoms (quadwords) and a conditional branch per subblock is required to decide whether to do a load or not, depending on the skew and the number of pixels remaining to process. The 'x' register is only updated once per loop, so an assembly-time constant derived from the unrolling subblock number needs to be factored in, but since the number of pixels remaining decreases as the subblock number increases, this should have been a subtraction.

In practice, since only the least-significant bits of the result matter, addition and subtraction behave the same when the source:destination pixel ratio is 8, so the only operations affected were 1->16bpp, 2->32bpp and 1->32bpp. The exact threshold that counts as "wide" depends on the prefetch distance that was selected empirically, but typically would require an operation that is several hundreds of pixels wide.

Fix buffer overflow bugs

b577ab2

In fastPathDepthConv (which combines sourceWord colour-depth conversion with another fast path for another combinationRule at a constant colour depth) and fastPathRightToLeft, it could overflow the temporary buffer and thereby corrupt other local variables if the last chunk of a pixel row was 2048 bytes (or just under). This was most likely to happen with 32bpp destination images and widths of about 512 pixels.

Fix corruption bugs with wide 1bpp source images

9084c17

For images that were wide enough to invoke intra-line preloads, there was a register clash between the preload address calculation and one of the registers holding the deskewed source pixels (this only occurred once per destination cacheline).

Fix type of halftone array for 64-bit targets

2b0279a

The halftone array is accessed using a hard-coded multiplier of 4 bytes, therefore the type of each element needs to be 32 bit on every platform. `sqInt` is not appropriate for this use, since it is a 64-bit type on 64-bit platforms. Rather than unilaterally introduce C99 stdint types, use `unsigned int` since this will be 32-bit on both current fast path binary targets.
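To illustrate why the element type matters, here is a small sketch of a lookup that uses a fixed 4-byte stride. The names `halftone_word` and `halftone_at` are hypothetical and are not taken from the plugin source.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical element type; assumed to be 32 bits wide on the target. */
typedef unsigned int halftone_word;

/* Reads entry `index` using a hard-coded stride of 4 bytes, in the same
 * way as code that scales the index by 4 instead of by sizeof(element). */
static unsigned int halftone_at(const void *base, size_t index)
{
    /* The fixed stride is only correct if elements really are 32-bit. */
    assert(sizeof(halftone_word) * CHAR_BIT == 32);

    const unsigned char *p = (const unsigned char *)base + index * 4;
    halftone_word w;
    memcpy(&w, p, sizeof w);
    return w;
}
```

If the array elements were a 64-bit type instead, entry `index` would live at byte offset `index * 8`, and the fixed stride would read from the wrong place for every index other than 0.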

Detect and add a new fast path flag for effective-1bpp colour maps

e22ae0b

Sometimes, colour maps are used such that all entries except the first contain the same value. Combined with the fact that only source colour 0 uses colour map entry 0 (any other colours for which all non-0 bits would otherwise be discarded during index generation are forced to use entry 1 instead), this effectively acts as a 2-entry (or 1bpp) map, depending on whether the source colour is 0 or not. This is far more efficiently coded in any fast path by a test against zero, than by a table lookup - it frees up 2 KB, 16 KB or 128 KB of data cache space, depending on whether a 9-, 12- or 15-bit colour map was used. There is an up-front cost to scanning the colour map to see if its entries are of this nature, however in most "normal" colour maps, this scan will rapidly be aborted.
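A sketch of the detection scan and the resulting lookup, under the assumption that the map is a flat array of 32-bit entries. The function names are hypothetical and the plugin's actual scan may be organised differently.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every entry from index 1 onwards holds the same value.
 * The loop exits at the first mismatch, so an ordinary colour map is
 * usually rejected after inspecting only a few entries. */
static int is_effectively_1bpp(const unsigned int *map, size_t entries)
{
    assert(map != NULL && entries >= 2);
    for (size_t i = 2; i < entries; i++) {
        if (map[i] != map[1])
            return 0;
    }
    return 1;
}

/* With such a map, a lookup reduces to a test of the source pixel
 * against zero, touching at most two entries of the table. */
static unsigned int lookup_1bpp(const unsigned int *map, unsigned int src)
{
    return src == 0 ? map[0] : map[1];
}
```

The cache figures in the commit message follow from the entry size: a 9-bit map has 512 entries of 4 bytes (2 KB), a 12-bit map has 4096 (16 KB) and a 15-bit map has 32768 (128 KB).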

C fast path for 32bpp alphaBlend

e44e2c8

This runs approx 2.6x faster when benchmarked on Cortex-A72 in AArch64.
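For readers unfamiliar with the operation, the following is a plain per-pixel sketch of a 32bpp alpha blend. It is not the fast path from this commit: the function names are hypothetical, the rounding is an assumption, and treating the alpha byte like the colour channels is a simplification that may not match the plugin's behaviour exactly.

```c
#include <assert.h>

/* Blends one 8-bit channel: alpha 0 keeps dst, alpha 255 takes src.
 * Uses an exact division by 255 with rounding (an assumption). */
static unsigned int blend_channel(unsigned int s, unsigned int d,
                                  unsigned int a)
{
    assert(s <= 255u && d <= 255u && a <= 255u);
    return (s * a + d * (255u - a) + 127u) / 255u;
}

/* Blends a 32-bit source pixel over a destination pixel, taking the
 * alpha value from the top byte of the source. */
static unsigned int alpha_blend_32(unsigned int src, unsigned int dst)
{
    unsigned int a = (src >> 24) & 0xFFu;
    unsigned int out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned int s = (src >> shift) & 0xFFu;
        unsigned int d = (dst >> shift) & 0xFFu;
        out |= blend_channel(s, d, a) << shift;
    }
    return out;
}
```

A fully opaque source pixel replaces the destination, and a fully transparent one leaves the destination unchanged.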

C fast path for planar alphaBlend

45649aa

C fast path for 8->32bpp conversion

405f35b

C fast path for alphaBlend with 1bpp colour map and scalar halftone

dac723f

Apply scalar halftoning to colour map entries instead for 32bpp destination

80cd2da

This makes better use of existing fast paths, and applies to all platforms.
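A sketch of the idea, assuming that halftoning is a bitwise AND of the halftone word with the source word, and that a 32bpp destination holds exactly one pixel per word. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Folds a single (scalar) halftone word into every colour map entry.
 * Afterwards, a plain colour-mapped copy gives the same result as
 * mapping each pixel and then ANDing it with the halftone word. */
static void apply_scalar_halftone(unsigned int *map, size_t entries,
                                  unsigned int halftone)
{
    assert(map != NULL || entries == 0);
    for (size_t i = 0; i < entries; i++)
        map[i] &= halftone;
}
```

Because the masking is done once per map entry rather than once per pixel, an existing fast path with no halftone support can then handle the operation unchanged.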

Enable fast blit code for AArch64

e4a27ec

AArch64 assembly optimisations

10d8a11


Contributor

nicolas-cellier-aka-nice commented May 4, 2021

Great!
The only thing to be changed is that src/plugins/BitBltPlugin/BitBltPlugin.c is generated from Smalltalk slang code.
I think that we can merge first, then modify VMMaker and regenerate BitBltPlugin
Thanks!!!

hogoww referenced this pull request in hogoww/opensmalltalk-vm

Dec 25, 2021

Mutantpharo-project#565, Installing [ Replace #ifTrue: receiver with…

d795dda

… true ] on method [ markWeaklingsAndMarkAndFireEphemerons ]

hogoww referenced this pull request in hogoww/opensmalltalk-vm

Dec 25, 2021

Mutantpharo-project#565, Reverting [ Replace #ifTrue: receiver with …

034e704

…true ] on method [ markWeaklingsAndMarkAndFireEphemerons ] KILLED by 1/10 test cases.

hogoww referenced this pull request in hogoww/opensmalltalk-vm

Feb 26, 2022

Mutantpharo-project#565, Installing [ Replace #+ with #- ] on method…

9f9eeba

… [ attemptToShrink ] 10 test cases.

hogoww referenced this pull request in hogoww/opensmalltalk-vm

Feb 26, 2022

Mutantpharo-project#565, Reverting [ Replace #+ with #- ] on method …

f5b0e8c

…[ attemptToShrink ] 7/10 Test Cases are NOT EQUIVALENT

guillep added a commit to tesonep/opensmalltalk-vm that referenced this pull request

May 12, 2023

Merge pull request OpenSmalltalk#565 from guillep/fix/ephemeron-immediate-keys

aede3fc

Verify ephemeron key is not immediate when marking

krono mentioned this pull request

Aug 27, 2025

Update BitBltArmSimdSourceWord.s to latest ben avison version #592

Open
