Fix duplicated subtitle issue--core deduplication logic and screen-display part #1448

Open
TransZAllen wants to merge 4 commits into TeamNewPipe:dev from TransZAllen:duplicated_subtitle_8_on_newest_dev


  • [ √ ] I carefully read the contribution guidelines and agree to them.
  • [ √ ] I have tested the API against NewPipe.
  • [ √ ] I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.

… URL parameters.
- Add `V`, `LANG`, `TLANG` constants to `YoutubeParsingHelper`
- Implement `extractVideoId()`, `extractLanguageCode()`, `extractTranslationCode()`
- Add `extractQueryParam()` utility in `Utils.java`

- Add core deduplicated logic/method
- Reproduce bug with the YouTube video: https://www.youtube.com/watch?v=b7vmW_5HSpE
- Introduce `SubtitleDeduplicator.java` to check and remove duplicates, storing results in cache.
- Add `SubtitleOrigin` and `SubtitleState` enums to model subtitle type and state.
- Ensure cache directory is recreated if missing.

…ntegrate deduplicated subtitles, calling `checkAndDeduplicate()` to remove duplicates and store results in cache.
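
The query-parameter helpers named in the first commit could look roughly like the sketch below. This is only an illustration: the class name `QueryParamSketch`, the method signature, and the example timedtext URL are assumptions, and the PR's actual `extractQueryParam()` in `Utils.java` may be written differently.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch; not the code from the PR.
final class QueryParamSketch {
    private QueryParamSketch() { }

    /**
     * Returns the decoded value of the first query parameter called name,
     * or an empty Optional if the URL has no such parameter.
     * URI.create() throws IllegalArgumentException for malformed URLs.
     */
    static Optional<String> extractQueryParam(final String url, final String name) {
        final String query = URI.create(url).getRawQuery();
        if (query == null) {
            return Optional.empty();
        }
        for (final String pair : query.split("&")) {
            final int eq = pair.indexOf('=');
            final String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                final String raw = eq < 0 ? "" : pair.substring(eq + 1);
                return Optional.of(URLDecoder.decode(raw, StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }
}
```

With a subtitle URL such as `https://www.youtube.com/api/timedtext?v=b7vmW_5HSpE&lang=en-GB` (an assumed shape), `extractQueryParam(url, "v")` would yield the video ID and `extractQueryParam(url, "tlang")` would be empty.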
TransZAllen (Author) commented Jan 30, 2026

Related issue

Scope of changes

This PR involves two repositories:

  • NewPipeExtractor (main changes)
    Implements subtitle deduplication logic in
    SubtitleDeduplicator.

  • NewPipe (supporting changes)
    Initializes the cache/subtitle_cache directory and ensures locally cached
    subtitle files can still be manually downloaded as '*.srt'.

Reproduction case

Android device, duplicated subtitles visible during playback

YouTube video used for testing:
https://www.youtube.com/watch?v=b7vmW_5HSpE

Subtitle cache location

Cached subtitle files (*.ttml) are stored at:

/storage/emulated/0/Android/data/<package_name>/cache/subtitle_cache

The directory name corresponds to subCacheDir,
defined in SubtitleDeduplicator.

Cache file naming

Cached subtitle filenames are intentionally descriptive,
so their meaning can be understood without reading the code
(e.g. source, format, origin, subtitle state). For example:
cache/subtitle_cache $ ls -l
total 48
-rw-rw---- 1 u0_a579 sdcard_rw 1214 2026-01-29 17:44 b7vmW_5HSpE--en--auto_generated--original.ttml
-rw-rw---- 1 u0_a579 sdcard_rw 42426 2026-01-29 17:44 b7vmW_5HSpE--en-GB--human_provided--deduplicated.ttml
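
A minimal sketch of how names in this pattern could be assembled is shown below. The enum constants are inferred from the two file names in the listing, and the method `cacheFileName` is hypothetical; the real `SubtitleOrigin` and `SubtitleState` enums in the PR may define different values.

```java
import java.util.Locale;

// Hypothetical sketch; enum constants are inferred from the listing above.
final class CacheFileNameSketch {
    enum SubtitleOrigin { AUTO_GENERATED, HUMAN_PROVIDED }

    enum SubtitleState { ORIGINAL, DEDUPLICATED }

    private CacheFileNameSketch() { }

    /** Joins the parts with "--" and appends the ".ttml" extension. */
    static String cacheFileName(final String videoId,
                                final String languageCode,
                                final SubtitleOrigin origin,
                                final SubtitleState state) {
        return String.join("--",
                videoId,
                languageCode,
                origin.name().toLowerCase(Locale.ROOT),
                state.name().toLowerCase(Locale.ROOT)) + ".ttml";
    }
}
```

Because YouTube video IDs may themselves contain `-`, a name built this way is best treated as descriptive only; splitting it back on `--` is not guaranteed to be unambiguous.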

Cache lifecycle & storage impact

Do cached subtitle files need to be deleted?

No.

SubtitleDeduplicator does not delete cached subtitle files,
regardless of whether duplication is detected.

Why keep cached subtitles?

  1. If a remote subtitle download fails, a previously cached version
    can be reused.

  2. In practice, download failures are rare:

    • User-uploaded and auto-generated YouTube subtitles were
      consistently downloadable in tests.
    • Auto-translated subtitles showed a higher failure rate,
      but that feature is not merged into the dev branch and is out of scope here.
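
The reuse-on-failure behaviour described in point 1 can be sketched as follows. `RemoteFetcher` and `loadSubtitle` are hypothetical names standing in for whatever the PR uses to download subtitles; this is not the code from `SubtitleDeduplicator`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of "prefer remote, fall back to cache".
final class CachedFallbackSketch {
    /** Stand-in for the component that performs the remote download. */
    interface RemoteFetcher {
        String fetch() throws IOException;
    }

    private CachedFallbackSketch() { }

    static String loadSubtitle(final RemoteFetcher remote, final Path cachedFile) {
        try {
            // Always try the remote copy first so the latest version is used.
            return remote.fetch();
        } catch (final IOException downloadError) {
            if (Files.isRegularFile(cachedFile)) {
                try {
                    // Download failed: reuse the previously cached subtitle.
                    return new String(Files.readAllBytes(cachedFile),
                            StandardCharsets.UTF_8);
                } catch (final IOException readError) {
                    downloadError.addSuppressed(readError);
                }
            }
            // No usable cache either: report the original download failure.
            throw new UncheckedIOException(downloadError);
        }
    }
}
```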

Storage considerations

  • Subtitle files are small (roughly 1 KB and 42 KB in the listing above).
  • Even with many cached subtitles, storage usage grows slowly.
  • In the worst case, Android will notify users of low storage
    and suggest clearing app cache, which includes NewPipe’s subtitle cache.

Unit tests

extractor/src/test/java/org/schabi/newpipe/extractor/utils/SubtitleDeduplicatorTest.java

Tests focus on the core deduplication logic:
detecting duplicated adjacent subtitle segments and verifying
the resulting output.

Why SubtitleDeduplicator operates on raw TTML text

SubtitleDeduplicator intentionally operates on raw TTML text before XML entity decoding.
Deduplication is limited to lightweight, string-level normalization so that subtitles are
not fully parsed twice: once here and again in the screen-display or manual SRT download layer.

This design is intended to be practical and simple. At this stage, the goal is only to detect obviously
duplicated subtitle segments from the same TTML source, not to fully interpret
or normalize subtitle semantics.
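
To illustrate what string-level deduplication of adjacent segments can mean, here is a minimal sketch. It assumes that a duplicate is a `<p>` cue whose attributes and text, after whitespace collapsing, equal those of the cue directly before it, and that cues are not nested. The class and method names are hypothetical, and the actual rules in `SubtitleDeduplicator` may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; works on raw TTML text without XML parsing
// and without decoding entities such as &amp;.
final class AdjacentDedupSketch {
    // One <p ...>...</p> cue: group 1 = attributes, group 2 = raw content.
    private static final Pattern CUE =
            Pattern.compile("<p\\b([^>]*)>(.*?)</p>", Pattern.DOTALL);

    private AdjacentDedupSketch() { }

    /** Collapses whitespace runs; entities and tags are left untouched. */
    static String normalize(final String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    static String dedupe(final String ttml) {
        final Matcher m = CUE.matcher(ttml);
        final StringBuilder out = new StringBuilder();
        String prevAttrs = null;
        String prevText = null;
        int copiedUpTo = 0;
        while (m.find()) {
            final String attrs = normalize(m.group(1));
            final String text = normalize(m.group(2));
            // Only whitespace may separate two cues that count as adjacent.
            final boolean adjacent =
                    ttml.substring(copiedUpTo, m.start()).trim().isEmpty();
            final boolean duplicate = adjacent
                    && attrs.equals(prevAttrs) && text.equals(prevText);
            if (!duplicate) {
                out.append(ttml, copiedUpTo, m.end());
                prevAttrs = attrs;
                prevText = text;
            }
            // For a duplicate, the cue and the gap before it are skipped.
            copiedUpTo = m.end();
        }
        return out.append(ttml.substring(copiedUpTo)).toString();
    }
}
```

Keeping the comparison at this level means `&amp;` is compared as the literal five characters, which is enough for spotting identical neighbours from the same TTML source.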

Difference from SrtFromTtmlWriter

These two components serve different purposes:

  • SrtFromTtmlWriter

    • Performs full TTML XML parsing
    • Decodes entities
    • Resolves tags (<span>, <br>, etc.)
    • Generates SRT for manual download
  • SubtitleDeduplicator

    • A lightweight pre-processing utility
    • Does not parse XML
    • Does not decode entities
    • Performs minimal string-level normalization only

Note on subtitle caching

SubtitleDeduplicator always fetches remote subtitle content to ensure the latest version is used when detecting duplicated entries.

During playback, however, ExoPlayer may serve subtitle data from its internal cache (cache/exoplayer) if a cached version is available. As a result, there is a potential inconsistency where the subtitle content displayed to the user may not immediately reflect a recently updated remote subtitle.

This is intentional and won’t be changing for now:

  • YouTube subtitles don’t update often, so while users might encounter outdated cached subtitles, the chance is really low.
  • ExoPlayer’s caching is part of the player, and I’m not sure how much code changing it would require, but that’s outside the scope of this PR.
  • This PR is all about fixing subtitle duplication, not touching the caching setup between the extractor and player.

TransZAllen (Author) commented Jan 30, 2026

The fix has been tested with a YouTube video link: https://www.youtube.com/watch?v=b7vmW_5HSpE

Before the fix, the subtitle is shown as follows:

After applying the fix, the subtitle is displayed as follows:

AudricV (Member) left a comment

I think we don't want NewPipe Extractor to download files directly, so your approach must be changed, especially as you do not delete files. Also, I would avoid downloading each subtitle to avoid reaching rate limits.

The extractor is not an Android library, therefore Android specific comments should be removed.

If YouTube provides incorrect subtitles, it should not be up to the extractor to fix them, in my opinion. To me, it makes more sense to fix this on the app side with a custom ExoPlayer component.
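
As a rough illustration of the app-side alternative, the sketch below filters already-parsed cues instead of rewriting the downloaded file. `Cue` here is a hypothetical value class, not an ExoPlayer type, and how such a filter would be wired into ExoPlayer is not shown and not assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of cue-level filtering on the app side.
final class CueFilterSketch {
    /** Minimal parsed cue; stands in for whatever the player produces. */
    static final class Cue {
        final long startMs;
        final long endMs;
        final String text;

        Cue(final long startMs, final long endMs, final String text) {
            this.startMs = startMs;
            this.endMs = endMs;
            this.text = text;
        }
    }

    private CueFilterSketch() { }

    /** Drops a cue when it repeats the timing and text of the cue kept before it. */
    static List<Cue> dropAdjacentDuplicates(final List<Cue> cues) {
        final List<Cue> result = new ArrayList<>();
        for (final Cue cue : cues) {
            final Cue last = result.isEmpty() ? null : result.get(result.size() - 1);
            final boolean duplicate = last != null
                    && last.startMs == cue.startMs
                    && last.endMs == cue.endMs
                    && last.text.equals(cue.text);
            if (!duplicate) {
                result.add(cue);
            }
        }
        return result;
    }
}
```

With this shape the extractor keeps returning subtitle URLs only, and no extra download or cache file is needed.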

AudricV added the bug (Issue is related to a bug) and youtube (service, https://www.youtube.com/) labels on Jan 30, 2026
TransZAllen (Author) commented

@AudricV

Thanks for the feedback, it’s helpful for me to better understand the intended boundaries of NewPipeExtractor.

I’m preparing some follow-up comments to explain these commits, especially around subtitle downloading. I’m also taking some time to think about whether this design makes sense.

I’ll add more comments soon.

TransZAllen (Author) commented Feb 1, 2026

About “we don't want NewPipe Extractor to download files directly”:

@AudricV

Just to make sure I understand correctly: currently, the extractor only provides
subtitle URLs, and the actual downloading is done later on the app side
(either by ExoPlayer or by the manual subtitle download feature), right?

My original idea was to fix duplicated subtitles as early as possible and
in a centralized place — at the source where subtitle URLs are produced.
That’s why I chose getSubtitles(final MediaFormat format) in
YoutubeStreamExtractor.java. If the source is deduplicated, all later code
would receive clean subtitles.

However, I now realize that my changes effectively moved the subtitle downloading
responsibility. Previously, subtitle URLs were passed through the extractor and
only downloaded on the app side. With this change, subtitles are downloaded
inside SubtitleDeduplicator instead.

At first, I thought this was acceptable since subtitles are eventually downloaded
anyway, and this could even reduce network requests by avoiding separate downloads
for playback (via ExoPlayer) and for manual SRT downloads. But after tracing the
code path further, I see that from getSubtitles() in NewPipeExtractor to
VideoPlaybackResolver.resolve() on the app side, subtitles are still handled
purely as URLs, without any download happening in the extractor.

So, performing file downloads inside NewPipeExtractor crosses its intended boundary, right?


Reviewers

AudricV requested changes

Requested changes must be addressed to merge this pull request.
