pdf: Improve text with characters outside embedded font limits #30512


Merged
QuLogic merged 4 commits into matplotlib:text-overhaul from QuLogic:pdf-text-subsets
Sep 26, 2025

QuLogic (Member) commented Sep 4, 2025 (edited)

PR summary

For character codes outside the embedded font limits (256 for type 3 and 65536 for type 42), we output them as XObjects instead of using text commands. But there is nothing in the PDF spec that requires any specific encoding like this.

Since we now support subsetting all fonts before embedding, split each font into groups based on the maximum character code (e.g., 256-entry groups for type 3), then switch text strings to a different font subset and re-map character codes to it when necessary.

This means all text is true text (albeit with some strange encoding), and we no longer need any XObjects for glyphs. For users of non-English text, this means it will become selectable and copyable again.
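As a rough illustration of the block-based splitting described above, here is a minimal sketch. The names `LIMITS`, `split_charcode`, and `group_text` are hypothetical and are not matplotlib's API; the sketch only assumes that a character code is split into a subset index and a code within that subset by dividing by the font type's limit.

```python
# Hypothetical sketch; not the actual matplotlib implementation.
# Limits taken from the PR description: 256 codes for type 3, 65536 for type 42.
LIMITS = {3: 256, 42: 65536}


def split_charcode(charcode: int, fonttype: int) -> tuple[int, int]:
    """Return (subset index, code within that subset) for a character code."""
    return divmod(charcode, LIMITS[fonttype])


def group_text(text: str, fonttype: int) -> list[tuple[int, list[int]]]:
    """Split text into runs that each use a single font subset.

    Each run is (subset index, list of re-mapped codes); a new run starts
    whenever the next character lives in a different subset.
    """
    runs: list[tuple[int, list[int]]] = []
    for ch in text:
        subset, code = split_charcode(ord(ch), fonttype)
        if runs and runs[-1][0] == subset:
            runs[-1][1].append(code)
        else:
            runs.append((subset, [code]))
    return runs
```

With this sketch, `group_text("aĀb", 3)` yields three runs, because `Ā` (U+0100) falls in the second 256-entry block and forces a switch to another subset and back.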

There are 3 steps to achieve this change:

  1. Track both character codes and glyphs in CharacterTracker. This class takes care of splitting characters into subsets that fit the desired PDF font type limits. -> moved to "pdf/ps: Track full character map in CharacterTracker" #30566
  2. Output each used font block as a separate subsetted font. Also change the subset prefix to use the glyph indices, which are unique, unlike the character codes. -> first commit here
  3. Generate a ToUnicode dictionary for the subset font. We already did this for type 42 fonts, but the implementation was incorrect as it didn't correctly handle non-BMP characters. For type 3, support was added in PDF 1.2, but we produce 1.4; there is a fallback to the glyph names, but it is inconsistent and probably depends on the original font having the right names. -> second commit here
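To make step 3 more concrete, the following sketch shows how ToUnicode `bfchar` entries could be built with UTF-16BE destination strings, so that non-BMP characters become surrogate pairs. The helper names and the `subset_map` structure are assumptions for illustration only, not the code in this PR.

```python
# Hypothetical sketch; helper names and data layout are assumptions.
def to_unicode_hex(ch: str) -> str:
    """Hex string of the UTF-16BE encoding of one character.

    Non-BMP characters become a surrogate pair (four bytes), which a
    fixed two-byte UCS-2 encoding cannot represent.
    """
    return ch.encode("utf-16-be").hex().upper()


def bfchar_lines(subset_map: dict[int, str], code_width: int) -> list[str]:
    """Build `<code> <unicode>` lines for a ToUnicode CMap.

    subset_map maps a code within the font subset to the original character;
    code_width is the number of hex digits per code (e.g. 2 for one-byte
    codes, 4 for two-byte codes).
    """
    return [
        f"<{code:0{code_width}X}> <{to_unicode_hex(ch)}>"
        for code, ch in sorted(subset_map.items())
    ]
```

For example, `bfchar_lines({0: "😀", 1: "A"}, 2)` maps code `00` to the surrogate pair `D83DDE00` and code `01` to `0041`.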

In the future, we may wish to extend the implementation in CharacterTracker to "compress" the character map it produces (e.g., if you use 255 characters, each from a different 256-sized block, with type 3, you get 255 fonts, but we could compress that to a single font). I tried to avoid hard-coding any assumptions that the mapping is block-by-block, but it is possible that something slipped through, so I do not want to spend too much time on that right now.
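One possible shape for such a compression is sketched below, under the assumption that codes can be handed out in first-use order rather than block-by-block. `compress_charmap` is a hypothetical name; nothing like it is implemented in this PR, and a real version might need to reserve particular codes.

```python
# Hypothetical sketch of the "compressed" mapping idea; not part of this PR.
from collections.abc import Iterable


def compress_charmap(charcodes: Iterable[int],
                     limit: int) -> dict[int, tuple[int, int]]:
    """Assign each distinct character code a (subset, code) slot.

    Slots are handed out in first-seen order, so a subset is only
    started once the previous one holds `limit` characters.
    """
    mapping: dict[int, tuple[int, int]] = {}
    for charcode in charcodes:
        if charcode not in mapping:
            mapping[charcode] = divmod(len(mapping), limit)
    return mapping


# 255 characters, each from a different 256-sized block.
codes = [block * 256 for block in range(1, 256)]
blockwise_subsets = {code // 256 for code in codes}
compressed_subsets = {subset for subset, _ in compress_charmap(codes, 256).values()}
```

Here `blockwise_subsets` has 255 members while `compressed_subsets` has only one, matching the example in the paragraph above.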

Formerly, with multi_font_type3.pdf (after adding the emoji to the test), copying the text in evince would produce:

There are basic characters ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 !”#$%&’()*+,-./:;¡=¿?@[“]ˆ˙‘—–˝˜ and accented characters ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ in between!

and with multi_font_type42.pdf:

There are basic characters ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ and accented characters ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ in between!

and now we get for both type 3 and 42:

There are basic characters ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ and accented characters ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏 in between!

Note how in the third line for type 3:

  1. the quotes are 'curly' instead of straight quotes
  2. the chevrons <> are inverted exclamation/question marks ¡¿
  3. the backslash \ is a curly opening double quote
  4. the caret ^, underscore _, and tilde ~ are {circumflex, dot, tilde} accents/smaller glyphs ˆ˙˜
  5. the braces {} are em-dash and curly quotes —˝
  6. the pipe | is en-dash

Everything from the seventh to second-last line is missing in type 3 since it's outside of the 256 limit, and all the emoji are missing from type 42 since that's outside the 65536 limit.
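The missing ranges follow directly from the code point values. A small sketch, using only the limits stated in this PR (the helper name `outside_limit` is hypothetical), shows which sample characters exceed each limit:

```python
# Hypothetical helper illustrating the limits quoted in the PR description.
def outside_limit(text: str, limit: int) -> list[str]:
    """Return the characters whose code point does not fit below `limit`."""
    return [ch for ch in text if ord(ch) >= limit]


# One character from each range used in the test text:
# Latin-1 (ÿ, U+00FF), Latin Extended (Ā, U+0100; ɏ, U+024F), emoji (😀, U+1F600).
sample = "Aÿ" + "Āɏ" + "😀"

beyond_type3 = outside_limit(sample, 256)     # characters past the type 3 limit
beyond_type42 = outside_limit(sample, 65536)  # characters past the type 42 limit
```

In this sample, the Latin Extended characters and the emoji exceed the type 3 limit, while only the emoji exceeds the type 42 limit.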

This depends on #30520, #30335, #30566, and #30567.

anntzer (Contributor) commented:

This is great and would also allow getting rid of _get_pdf_charprocs. I'll try to have a look at #30335 to start...

anntzer (Contributor) commented Sep 5, 2025 (edited)

The first two commits (the loop merge and the Type3 encoding change) seem independent from the rest (even from the switch to glyph index tracking) and could be merged first via a separate PR? (I can probably approve them right away.)
I still need to properly review the next one (charmap tracking) but that can also come next by itself?

QuLogic (Member, Author) commented:

I split the type3 encoding to #30520, but the loop merge has conflicts with the glyph index change.

QuLogic (Member, Author) commented:

> The first two commits (the loop merge and the Type3 encoding change) seem independent from the rest (even from the switch to glyph index tracking) and could be merged first via a separate PR? (I can probably approve them right away.)

Split the loop merge as well.

QuLogic linked an issue on Sep 17, 2025 that may be closed by this pull request
QuLogic moved this from Waiting for other PR to Ready for Review in Font and text overhaul on Sep 19, 2025
QuLogic marked this pull request as ready for review on September 19, 2025 07:36
For character codes outside the embedded font limits (256 for type 3 and 65536 for type 42), we output them as XObjects instead of using text commands. But there is nothing in the PDF spec that requires any specific encoding like this.

Since we now support subsetting all fonts before embedding, split each font into groups based on the maximum character code (e.g., 256-entry groups for type 3), then switch text strings to a different font subset and re-map character codes to it when necessary.

This means all text is true text (albeit with some strange encoding), and we no longer need any XObjects for glyphs. For users of non-English text, this means it will become selectable and copyable again.

Fixes matplotlib#21797

For Type 3 fonts, add a `ToUnicode` mapping (which was added in PDF 1.2), and for Type 42 fonts, correct the Unicode encoding, which should be UTF-16BE, not UCS2.

These characters are outside the BMP and should test subset splitting for type 42 output in PDF.

QuLogic (Member, Author) commented:

Rebased without images (moved to text-overhaul-figures branch) so that it can be merged.

QuLogic merged commit a1ed4ef into matplotlib:text-overhaul on Sep 26, 2025
34 of 35 checks passed
github-project-automation bot moved this from Ready for Review to Done in Font and text overhaul on Sep 26, 2025
QuLogic deleted the pdf-text-subsets branch on September 26, 2025 01:49
Reviewers

tacaswell approved these changes
anntzer approved these changes

Assignees
No one assigned
Projects
Status: Done
Milestone
v3.11.0
Development

Successfully merging this pull request may close these issues.

[Bug]: Math fonts (Type 3) incorrectly embedded in PDF?
3 participants
@QuLogic, @anntzer, @tacaswell
