This repository was archived by the owner on Feb 10, 2025. It is now read-only.

High performance YUV decode/encode library

License

Apache-2.0 (LICENSE.md), BSD-3-Clause (LICENSE-BSD.md)

awxkee/sparkyuv

The library converts between RGB and Y'UV formats at high speed, using platform SIMD acceleration and an appropriate approximation level.

It also contains many convenience methods, including a very fast Gaussian blur.

Some useful info is in the documentation.

Supported YUV matrix

Almost all YUV formats supported, everything you might need :)

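YUV      | 444 | 422 | 420 | 411 | 410 | 400 | NV12/NV21 | NV16/NV61 | NV24/NV42
YCbCr    |     |     |     |     |     |     |           |           |
YcCbcCrc |     |     |     |     |     |     | N/A       | N/A       | N/A
YCgCo    |     |     |     |     |     |     | N/A       | N/A       | N/A
YCgCo-Ro |     |     |     |     |     |     | N/A       | N/A       | N/A
YCgCo-Re |     |     |     |     |     |     | N/A       | N/A       | N/A
YDzDx    |     |     |     |     |     |     | N/A       | N/A       | N/A
YIQ      |     |     |     |     |     |     | N/A       | N/A       | N/A
YDbDr    |     |     |     |     |     |     | N/A       | N/A       | N/A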

When encoding to NV12/NV21, NV16/NV61, 4:2:0 or 4:2:2, bilinear scaling is automatically applied for chroma subsampling. There is no option to turn this off. Bilinear scaling is probably not as good as libsharpyuv, but it is still good, and due to the nature of this transformation it is exceptionally fast in those cases.
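To make the chroma subsampling step concrete, here is a minimal sketch, not sparkyuv's actual kernel. It relies on the fact that, for an exact 2:1 reduction sampled at the centre of each 2x2 block, bilinear interpolation reduces to the rounded mean of the four samples. The function name `DownsampleChroma420` and the edge handling (odd edges reuse the last row or column) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: halves a full-resolution 8-bit chroma plane in both
// directions (4:4:4 -> 4:2:0). Each output sample is the rounded mean of a
// 2x2 block; on odd dimensions the last row/column is reused.
std::vector<std::uint8_t> DownsampleChroma420(const std::vector<std::uint8_t>& src,
                                              std::size_t width, std::size_t height) {
  assert(width > 0 && height > 0 && src.size() == width * height);
  const std::size_t outW = (width + 1) / 2;
  const std::size_t outH = (height + 1) / 2;
  std::vector<std::uint8_t> dst(outW * outH);
  for (std::size_t y = 0; y < outH; ++y) {
    const std::size_t y0 = 2 * y;
    const std::size_t y1 = (y0 + 1 < height) ? y0 + 1 : y0;
    for (std::size_t x = 0; x < outW; ++x) {
      const std::size_t x0 = 2 * x;
      const std::size_t x1 = (x0 + 1 < width) ? x0 + 1 : x0;
      const unsigned sum = src[y0 * width + x0] + src[y0 * width + x1] +
                           src[y1 * width + x0] + src[y1 * width + x1];
      dst[y * outW + x] = static_cast<std::uint8_t>((sum + 2) / 4);  // round to nearest
    }
  }
  return dst;
}
```

A 4:2:2 variant would average horizontal pairs only and keep the full height.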

All NV, YUV444 (4:4:4), YUV422 (4:2:2), YUV420 (4:2:0), YUV411 (4:1:1) and YUV410 (4:1:0) functions do not accept null U and V planes. If you need 4:0:0 chroma subsampling, please use the 4:0:0 functions where they are available.
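Since every non-4:0:0 layout needs real U and V planes, it helps to know how large they must be. The sketch below computes chroma plane dimensions for the common layouts. The names `ChromaLayout`, `PlaneSize` and `ChromaPlaneSize` are hypothetical, and rounding odd dimensions up is an assumption, so check the library headers for the exact convention. 4:1:0 is left out because sources disagree on its vertical factor.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, for illustration only.
enum class ChromaLayout { k444, k422, k420, k411, k400 };

struct PlaneSize {
  std::size_t width;
  std::size_t height;
};

// Returns the size, in samples, of one chroma plane (U or V) for an image of
// the given luma size. Odd dimensions are rounded up (an assumption).
PlaneSize ChromaPlaneSize(ChromaLayout layout, std::size_t width, std::size_t height) {
  assert(width > 0 && height > 0);
  switch (layout) {
    case ChromaLayout::k444: return {width, height};
    case ChromaLayout::k422: return {(width + 1) / 2, height};
    case ChromaLayout::k420: return {(width + 1) / 2, (height + 1) / 2};
    case ChromaLayout::k411: return {(width + 3) / 4, height};
    case ChromaLayout::k400: return {0, 0};  // luma only, no chroma planes
  }
  return {0, 0};
}
```

For NV layouts the U and V samples are interleaved in one plane, so that plane holds twice as many samples per row as the width reported here.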

Usage example

sparkyuv::YCbCr422BT601ToRGBA(reinterpret_cast<uint16_t *>(rgba16Data.data()),
                              rgba16Stride,
                              inWidth, inHeight,
                              reinterpret_cast<uint16_t *>(yPlane.data()),
                              yPlaneStride,
                              reinterpret_cast<uint16_t *>(uPlane.data()),
                              uvPlaneStride,
                              reinterpret_cast<uint16_t *>(vPlane.data()),
                              uvPlaneStride,
                              0.299f, 0.114f,
                              sparkyuv::YUV_RANGE_TV);

sparkyuv::YCgCoR420P8ToRGBA8(rgbaData.data(), rgbaStride, inWidth, inHeight,
                             yPlane.data(), yPlaneStride,
                             uPlane.data(), uvPlaneStride,
                             vPlane.data(), uvPlaneStride, sparkyuv::YCGCO_RE);
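The values `0.299f, 0.114f` in the first call match the BT.601 luma coefficients, so they are presumably kr and kb; that reading is an assumption. The floating-point sketch below shows how a YCbCr pixel follows from kr and kb, and what the 8-bit TV (limited) range scaling looks like. `RgbToYCbCr8Limited` and `YCbCr8` are hypothetical names, and the library's own fixed-point approximation is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct YCbCr8 {
  std::uint8_t y, cb, cr;
};

// Reference (non-SIMD) conversion of one 8-bit RGB pixel to 8-bit YCbCr in
// TV range: Y in [16, 235], Cb/Cr in [16, 240] centred on 128.
YCbCr8 RgbToYCbCr8Limited(std::uint8_t r8, std::uint8_t g8, std::uint8_t b8,
                          float kr, float kb) {
  assert(kr > 0.f && kb > 0.f && kr + kb < 1.f);
  const float kg = 1.f - kr - kb;
  const float r = r8 / 255.f, g = g8 / 255.f, b = b8 / 255.f;
  const float y = kr * r + kg * g + kb * b;          // luma in [0, 1]
  const float cb = (b - y) / (2.f * (1.f - kb));     // chroma in [-0.5, 0.5]
  const float cr = (r - y) / (2.f * (1.f - kr));
  auto quantize = [](float v) {
    const long q = std::lround(v);
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, q)));
  };
  return {quantize(16.f + 219.f * y),
          quantize(128.f + 224.f * cb),
          quantize(128.f + 224.f * cr)};
}
```

With kr = 0.299 and kb = 0.114, pure red maps to the familiar BT.601 triple (81, 90, 240), and white maps to (235, 128, 128).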

Things to note

  • YCgCo-Ro/YCgCo-Re 8-bit cannot be represented in 8-bit unsigned storage, so 8-bit YCgCo-Ro/YCgCo-Re requires a storage buffer at least twice as wide (a 16-bit storage type).
  • YCgCo-Ro/YCgCo-Re cannot use the limited YUV range at the moment, since it is not clear how to do this range reduction with a dynamic bit depth. For now, it is always in full (PC) range.
  • The current YCgCo implementation allows range reduction to TV range. That approximation is significantly slower than the full-range transform. If that matters, range reduction may be removed, and a speedup of about 30-60% is expected. 10/12-bit transformations may see about a 100-200% slowdown due to range reduction.
  • YcCbcCrc (YUV constant luminance) is primarily intended for the BT.2020 CL (BT.2020 constant luminance) color space; however, ITU-R provides an implementation for any possible kr, kb.
  • YcCbcCrc is a direct transformation by nature, so expect it to be at least 1000% slower than any of the approximation matrices. It has especially good acceleration on arm64-v8a with full FP16 support, but it is still 1000% slower than the other approximations. Compared with a naive implementation, the current transformation is about 400-600% faster.
  • YDbDr should be computed from linearized components; however, the library expects the content to be already linearized and won't do that.
  • YDbDr requires a very high-precision matrix for decoding, but a low-precision approximation is used, so some color information loss is highly likely, especially in TV range.
  • YUV (4:1:1) and YUV (4:1:0) use only box scaling. It's not very good, but it is the only option available at the moment to keep performance in line.
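The first note, that 8-bit YCgCo-Ro/YCgCo-Re needs wider storage, follows from the reversible lifting transform these formats are built on. The sketch below uses the generic YCgCo-R lifting steps; the specific bit-depth rules and offsets that distinguish the Re and Ro variants are not reproduced, and the type and function names are hypothetical. It assumes `>>` on a negative `int` is an arithmetic shift, which holds on mainstream compilers and is guaranteed from C++20.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb8 {
  std::uint8_t r, g, b;
};

// Chroma needs one bit more than the input, hence the 16-bit storage.
struct YCgCoR {
  std::int16_t y, cg, co;
};

// Forward lifting transform: exactly invertible in integer arithmetic.
YCgCoR ForwardYCgCoR(Rgb8 p) {
  const int co = p.r - p.b;        // range [-255, 255] for 8-bit input
  const int t = p.b + (co >> 1);
  const int cg = p.g - t;          // range [-255, 255] as well
  const int y = t + (cg >> 1);     // stays within [0, 255]
  return {static_cast<std::int16_t>(y), static_cast<std::int16_t>(cg),
          static_cast<std::int16_t>(co)};
}

// Inverse transform: undoes the lifting steps in reverse order.
Rgb8 InverseYCgCoR(YCgCoR v) {
  const int t = v.y - (v.cg >> 1);
  const int g = v.cg + t;
  const int b = t - (v.co >> 1);
  const int r = b + v.co;
  assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);
  return {static_cast<std::uint8_t>(r), static_cast<std::uint8_t>(g),
          static_cast<std::uint8_t>(b)};
}
```

For pure red (255, 0, 0) the forward step gives Co = 255 and Cg = -127, which no signed or unsigned 8-bit sample can hold, while the inverse still recovers the input exactly.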

Performance

Comparison with libyuv:

All tests were performed on an Apple M3 Pro.

Not every conversion path exists in libyuv, so not everything can be benchmarked.

  • Since an approach very close to libyuv's is used for YCbCr, 8-bit decoding performance is generally very close to libyuv. It is faster on arm64-v8a (NEON); on other platforms it should be considered the same.
  • Encoding of 8-bit YCbCr is faster in libyuv.
  • 10/12-bit YCbCr is faster in this library than in libyuv.
  • YcCbcCrc is a very slow transformation.
  • YCgCo-Re/YCgCo-Ro are the fastest transformations available.
  • YCgCo may be reworked for up to a 100% performance gain in the full-range transform.
  • Some additional optimizations were made for NEON, so it is expected to be slightly better on arm64-v8a.

Benchmark

Run on (12 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x12)
Load Average: 6.41, 4.63, 3.87
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
SparkyuvYCbCr444P10ToRGBA10      620223 ns       619880 ns         1135
LibYuvYCbCr444P10ToRGBA8        1314001 ns      1312990 ns          523
SparkyuvYCbCr422P10ToRGBA10      552089 ns       551719 ns         1274
LibYuvYCbCr422P10ToRGBA8        1092268 ns      1091448 ns          639
SparkyuvYCbCr420P10ToRGBA10      569507 ns       568057 ns         1272
LibYuvYCbCr420P10ToRGBA8        1090866 ns      1089950 ns          639
SparkyuvRGBAP10ToYCbCr420P10     494450 ns       494201 ns         1394
SparkyuvRGBA10ToYCbCr422P10      551609 ns       551187 ns         1275
SparkyuvRGBA10ToYCbCr444P10      493970 ns       493355 ns         1424
SparkyuvYCbCr444ToRGBA8          332122 ns       331858 ns         2106
LibYuvYCbCr444ToRGBA8            365939 ns       365582 ns         1909
SparkyuvYCbCr422ToRGBA8          346422 ns       346065 ns         2022
LibYuvYCbCr422ToRGBA8            458754 ns       458291 ns         1506
SparkyuvYCbCr420ToRGBA8          347903 ns       347500 ns         2020
LibYuvYCbCr420ToRGBA8            458823 ns       458356 ns         1504
SparkyuvRGBA8ToYCbCr420          543371 ns       542838 ns         1291
LibyuvRGBA8ToYCbCr420            270961 ns       270760 ns         2575
LibYuvRGBA8ToYCbCr422            402977 ns       402710 ns         1735
SparkyuvRGBA8ToYCbCr444          519188 ns       518637 ns         1354
SparkyuvRGBA8ToYCbCr422          552492 ns       551868 ns         1269
SparkyuvRGBA8ToYCbCr411          585180 ns       584749 ns         1216
SparkyuvYCbCr411ToRGBA8          412131 ns       411310 ns         1688
SparkyuvRGBA8ToYCbCr410          343129 ns       342826 ns         2041
SparkyuvYCbCr410ToRGBA8          410149 ns       409832 ns         1634
SparkyuvRGBA8ToYCbCr400          241941 ns       241749 ns         2873
SparkyuvYCbCr400ToRGBA8          174510 ns       173868 ns         4416
SparkyuvRGBA8ToNV21              605454 ns       604053 ns         1183
SparkyuvNV21ToRGBA8              394745 ns       393337 ns         1721
LibyuvNV21ToRGBA8                407932 ns       405722 ns         1744
SparkyuvRGBA8ToNV12              602741 ns       601062 ns         1134
SparkyuvNV12ToRGBA8              386739 ns       385713 ns         1775
LibyuvNV12ToRGBA8                387258 ns       386217 ns         1781
SparkyuvRGBA8ToNV16              630768 ns       629137 ns         1135
SparkyuvNV16ToRGBA8              391858 ns       390851 ns         1749
SparkyuvRGBA8ToNV24              542312 ns       541239 ns         1245
SparkyuvNV24ToRGBA8              402250 ns       400533 ns         1699
SparkyuvYCgCoR444ToRGBA8         401596 ns       400683 ns         1704
SparkyuvYCgCoR422ToRGBA8         397587 ns       396831 ns         1760
SparkyuvYCgCoR420ToRGBA8         398611 ns       398151 ns         1759
SparkyuvRGBA8ToYCgCoR420         237379 ns       236989 ns         2997
SparkyuvRGBA8ToYCgCoR422         295757 ns       295151 ns         2391
SparkyuvRGBA8ToYCgCoR444         257157 ns       256537 ns         2762
SparkyuvYCgCo444ToRGBA8          463513 ns       462416 ns         1504
SparkyuvYCgCo422ToRGBA8          474461 ns       473481 ns         1390
SparkyuvYCgCo420ToRGBA8          492716 ns       490531 ns         1421
SparkyuvRGBA8ToYCgCo420          368884 ns       367747 ns         1910
SparkyuvRGBA8ToYCgCo422          504085 ns       502893 ns         1000
SparkyuvRGBA8ToYCgCo444          374294 ns       373273 ns         1863
SparkyuvYcCbcCrc444ToRGBA8     16413192 ns     16376605 ns           43
SparkyuvYcCbcCrc422ToRGBA8     16129819 ns     16099205 ns           44
SparkyuvYcCbcCrc420ToRGBA8     16051337 ns     16028295 ns           44
SparkyuvRGBA8ToYcCbcCrc420     18415296 ns     18357973 ns           37
SparkyuvRGBA8ToYcCbcCrc422     18768154 ns     18751189 ns           37
SparkyuvRGBA8ToYcCbcCrc444     18134075 ns     18117923 ns           39
SparkyuvYIQ444ToRGBA8            661008 ns       660297 ns         1061
SparkyuvYIQ422ToRGBA8            718039 ns       717432 ns          976
SparkyuvYIQ420ToRGBA8            719200 ns       718537 ns          976
SparkyuvRGBA8ToYIQ420            516908 ns       515223 ns         1410
SparkyuvRGBA8ToYIQ422            689602 ns       683831 ns         1044
SparkyuvRGBA8ToYIQ444            637150 ns       632302 ns         1051
SparkyuvYDzDx444ToRGBA8          664280 ns       659891 ns         1134
SparkyuvYDzDx422ToRGBA8          694884 ns       691275 ns          994
SparkyuvYDzDx420ToRGBA8          680278 ns       679061 ns         1045
SparkyuvRGBA8ToYDzDx420         1014901 ns      1013896 ns          689
SparkyuvRGBA8ToYDzDx422         1158669 ns      1155639 ns          617
SparkyuvRGBA8ToYDzDx444          235243 ns       234500 ns         3005
SparkyuvYDbDr444ToRGBA8          683540 ns       681427 ns         1027
SparkyuvYDbDr422ToRGBA8          733433 ns       731622 ns          924
SparkyuvYDbDr420ToRGBA8          758845 ns       755145 ns          976
SparkyuvRGBA8ToYDbDr420          501362 ns       500499 ns         1401
SparkyuvRGBA8ToYDbDr422          686498 ns       684271 ns         1052
SparkyuvRGBA8ToYDbDr444          617821 ns       616719 ns         1134
SparkyuvRGB10BitToF16            376059 ns       375126 ns         1869
LibyuvPremultiply                213810 ns       213188 ns         3219
SparkyuvPremultiply              259198 ns       258596 ns         2720
SparkyuvUnpremultiply            895136 ns       893093 ns          797
LibyuvUnpremultiply             1382779 ns      1380055 ns          492
SparkyuvWide8To10Fixed           251359 ns       250905 ns         2765
SparkyuvWide8To10Dynamic         303078 ns       302300 ns         2254
SparkyuvSaturate10To8Fixed       307243 ns       306721 ns         2316
SparkyuvSaturate10To8Dynamic     317039 ns       316536 ns         2233
YcCbcCrc

Without F16 Support

Run on (12 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x12)
Load Average: 3.59, 2.77, 2.86
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
SparkyuvYcCbcCrc444ToRGBA8   20296729 ns     20080429 ns           35
SparkyuvYcCbcCrc422ToRGBA8   19318374 ns     19229343 ns           35
SparkyuvYcCbcCrc420ToRGBA8   19421785 ns     19355111 ns           36
SparkyuvRGBA8ToYcCbcCrc420   19789903 ns     19645216 ns           37
SparkyuvRGBA8ToYcCbcCrc422   19822250 ns     19736371 ns           35
SparkyuvRGBA8ToYcCbcCrc444   19283523 ns     19162514 ns           35

With F16 Support

Run on (12 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x12)
Load Average: 3.20, 2.79, 2.86
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
SparkyuvYcCbcCrc444ToRGBA8   16379008 ns     16325651 ns           43
SparkyuvYcCbcCrc422ToRGBA8   16312538 ns     16259651 ns           43
SparkyuvYcCbcCrc420ToRGBA8   16300961 ns     16251651 ns           43
SparkyuvRGBA8ToYcCbcCrc420   18579207 ns     18525289 ns           38
SparkyuvRGBA8ToYcCbcCrc422   19774963 ns     19647667 ns           36
SparkyuvRGBA8ToYcCbcCrc444   18945549 ns     18874472 ns           36
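To read the two YcCbcCrc tables side by side, the wall-clock times can be turned into a speedup factor. The helper below is only arithmetic on the published numbers; `Speedup` is a hypothetical name, and since these are single runs some noise should be expected.

```cpp
#include <cassert>

// Ratio of baseline time to candidate time; values above 1.0 mean the
// candidate (here: the F16 build) is faster.
double Speedup(double baselineNs, double candidateNs) {
  assert(baselineNs > 0.0 && candidateNs > 0.0);
  return baselineNs / candidateNs;
}
```

For example, `Speedup(20296729.0, 16379008.0)` for `SparkyuvYcCbcCrc444ToRGBA8` is about 1.24, while `Speedup(19822250.0, 19774963.0)` for `SparkyuvRGBA8ToYcCbcCrc422` is barely above 1.0, so in these runs the F16 path mainly helps decoding.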

Targets

Sparkyuv uses libhwy as its backend, which means it uses SIMD acceleration on all supported platforms.

Sparkyuv (the same as libhwy) supports 22 targets, listed in alphabetical order of platform:

  • Any: EMU128, SCALAR;
  • Arm: NEON (Armv7+), SVE, SVE2, SVE_256, SVE2_128;
  • IBM Z: Z14, Z15;
  • POWER: PPC8 (v2.07), PPC9 (v3.0), PPC10 (v3.1B, not yet supported due to compiler bugs, see #1207; also requires QEMU 7.2);
  • RISC-V: RVV (1.0);
  • WebAssembly: WASM, WASM_EMU256 (a 2x unrolled version of wasm128, enabled if HWY_WANT_WASM2 is defined. This will remain supported until it is potentially superseded by a future version of WASM.);
  • x86:
    • SSE2
    • SSSE3 (~Intel Core)
    • SSE4 (~Nehalem, also includes AES + CLMUL).
    • AVX2 (~Haswell, also includes BMI2 + F16 + FMA)
    • AVX3 (~Skylake, AVX-512F/BW/CD/DQ/VL)
    • AVX3_DL (~Icelake, includes BitAlg + CLMUL + GFNI + VAES + VBMI + VBMI2 + VNNI + VPOPCNT; requires opt-in by defining HWY_WANT_AVX3_DL unless compiling for static dispatch),
    • AVX3_ZEN4 (like AVX3_DL but optimized for AMD Zen4; requires opt-in by defining HWY_WANT_AVX3_ZEN4 if compiling for static dispatch)
    • AVX3_SPR (~Sapphire Rapids, includes AVX-512FP16)
