Import, maintain and export tag metadata to/from audio files and a dynamically created SQLite table. Automates incremental tag cleanup, enrichment and standardisation for your digital audio library at scale using pre-scripted SQL queries and Polars, achieving quality and consistency in your metadata not possible with a tagger


audiomuze/tagminder


Please note: tagminder is a work in progress and is being redeveloped to leverage Polars DataFrames, so this documentation is out of date. Legacy code is contained in ztm.py (the main, monolithic script) and tags2db3-polarsv2.py (the script that imports/exports tags to/from the database).

The rewritten code is for the time being comprised of standalone scripts:

tags2db-polars-multidrive-optimised.py - ingests tags from your music files into a SQLite database and exports them back to the files

01...xx.py - each accomplishes a specific task, working on the database created by ingesting tags

Each of these scripts writes to a changelog table showing the rowid, tag name, old value and new value determined by the script, so you can inspect the changes before exporting them back to tags.

Tagging music has (for me) always been a time-consuming, mind-numbing and laborious task, prone to human error and inconsistency. tagminder has reduced it to something that is now almost entirely automated, ensuring consistency throughout your digital music library.

tagminder is comprised of two Python scripts. One imports audio metadata from underlying audio files into a dynamically created SQLite database and can export metadata back to the underlying files (but only if you specifically ask it to). The other processes the imported metadata in the SQLite database in order to correct anomalies, improve consistency and enrich the metadata where possible.

It enables you to effect mass updates / changes using SQL and ultimately write those changes back to the underlying files. It leverages the puddletag codebase to read/write tags, so you need to either install puddletag or at least pull it from the git repo to be able to access its code, specifically puddletag/puddlestuff/audioinfo.

Tags are read/written using the Mutagen library as used in puddletag. Requires Python 3.x and the following Python libraries:

from collections import OrderedDict
import csv
import os
from os.path import exists, dirname
import re
import sqlite3
import sys
import time
import uuid
import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, \
    group_similar_strings, compute_pairwise_similarities, \
    StringGrouper

Furthermore, it's a great tool to use when adding new music to your music collection and wanting to ensure consistency in tag treatment, contributor names, MBIDs, file naming and directory naming. It's now at a point functionally where I tag using Picard (to pull MusicBrainz metadata), then run tagminder against the files, export its changes sight unseen and do a final review using Puddletag to pick up the few edge cases that aren't worth otherwise coding to automate.

It's a Work in Progress:

I work on tagminder when I have the time. ztm.py is a fairly complete standalone tag processor and is the codebase of the original tagminder. I've since been rewriting components of tagminder to leverage Polars and the vectorised processing it offers, which has significantly improved processing speed, especially when handling large collections. For the time being the recoded functionality is written as standalone scripts named a...z. The naming also represents the logical sequence you'd want to run them in, because some scripts build on the work of those that come before them to get the best outcomes in terms of metadata quality and consistency. Ultimately I plan to bring all the scripts that make up tagminder into a more modular codebase that shares common functions rather than duplicating code unnecessarily.

Apart from the advantages that come with Polars, the biggest change to tagminder is that it now maintains a changelog of all changes it processes to tags in your music, recording the rowid, the tag affected, the old value, the replacement value, the responsible script, and the date and timestamp of the change.

This makes inspecting changes pretty simple in that every row in the changelog is easily reviewed by browsing the changelog table itself, or you can add context to the change by linking it back to the underlying track metadata using an inner join based on rowid e.g.

query to inspect outcomes:

select distinct old_value,new_value from changelog order by old_value;

inspect changelog:

select changelog.rowid, alib.albumartist, alib.album, column, old_value, alib.releasetype from changelog inner join alib on alib.rowid == changelog.rowid;
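The same inspection can be scripted with Python's built-in sqlite3 module. The sketch below is illustrative only: it builds a tiny in-memory database so it can run on its own, and the alib_rowid column name and the sample rows are assumptions, not tagminder's actual changelog schema.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway database with a minimal alib and changelog table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE alib (albumartist TEXT, album TEXT, title TEXT);
        -- alib_rowid is a hypothetical name for the link back to alib
        CREATE TABLE changelog (
            alib_rowid INTEGER, "column" TEXT, old_value TEXT, new_value TEXT
        );
        INSERT INTO alib VALUES ('10cc', 'Sheet Music', ' Silly Love');
        INSERT INTO changelog VALUES (1, 'title', ' Silly Love', 'Silly Love');
    """)
    return conn


def changelog_with_context(conn):
    """Return every logged change together with its album context."""
    query = """
        SELECT c.alib_rowid, a.albumartist, a.album,
               c."column", c.old_value, c.new_value
        FROM changelog AS c
        INNER JOIN alib AS a ON a.rowid = c.alib_rowid
        ORDER BY c.alib_rowid
    """
    return conn.execute(query).fetchall()


rows = changelog_with_context(build_demo_db())
for row in rows:
    print(row)
```

Against a real database you would replace build_demo_db() with sqlite3.connect() on your own file and adjust the column names to match what the scripts actually write.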

General philosophy and rationale

Make sense from chaos

I have a relatively large music collection and rely on good metadata to enhance my ability to explore and listen to my music collection in useful and interesting ways.

Taggers are great but can only take you so far. Tag sources also vary in consistency and quality, often times including issues like adding 'feat. artist' entries to track titles or artist and performer tags. This makes it more difficult for a music server to correctly identify performers and identify a track as a performance of a particular song, and thus include the performance alongside other performances of the same song. It also means 'feat. xx ' type entries don't give rise to metadata in a form that can be used to browse your library.

tagminder lets you automatically address these sorts of issues and does a lot of cleanup work that is difficult to do at scale or consistently using a tagger. It delivers consistent and repeatable results, whether you're handling 1,000 or 1,000,000 tracks; this is simply an impossible task prone to inconsistency and human error when tackled via a tagger.

Preserve your prior work

tagminder takes your existing tags as a given, not trying to second guess you by replacing your metadata with externally sourced metadata, but rather it looks for common issues in your metadata and solves those automatically where it is possible to get a reliable result. It also leverages existing metadata related to tracks and contributors already in your library to enrich albums without e.g. composer metadata or tracks/albums without any genre metadata.

MusicBrainz aware

Music servers are increasingly leveraging MusicBrainz MBIDs when present. tagminder seeks to add MusicBrainz MBIDs to your metadata where MBIDs are already available in your existing metadata e.g. if one performance by an artist happens to have a MBID included in its metadata, tagminder will replicate that MBID in every other performance that contains the same performer name.

To do this, it builds a table of distinct artist/performer/composer and albumartist names that have an associated MBID in your tags and then replicates that MBID to all occurrences of that artist/performer/composer in your tag metadata. If your music server is MusicBrainz aware, there's a good chance adding MBIDs to your tags will prevent it from merging the work of unrelated artists in your music collection that share the same name.

If you happen to have namesakes within your metadata (i.e. the same artist/performer/composer name but with different MBIDs), these artist/performer/composer MBIDs will not be replicated, as there would be no way for tagminder to know which MBID to apply. After running tagminder look for REF_namesakes* tables in the database - any records therein represent contributors requiring manual disambiguation by adding the appropriate MBID to the matching records in the alib table.
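The replication rule described above can be sketched in plain Python. This is a simplified illustration, not tagminder's code: the record layout (dicts with 'artist' and 'mbid' keys) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def replicate_mbids(records):
    """Fill in missing MBIDs where a name maps to exactly one MBID.

    records: list of dicts with 'artist' and 'mbid' keys (mbid may be None).
    Returns the updated records and the names that need manual disambiguation.
    """
    seen = defaultdict(set)
    for rec in records:
        if rec.get("mbid"):
            seen[rec["artist"]].add(rec["mbid"])

    # Names tied to a single MBID are safe to replicate
    unambiguous = {n: next(iter(ids)) for n, ids in seen.items() if len(ids) == 1}
    # Names tied to several MBIDs are namesakes and are left alone
    namesakes = sorted(n for n, ids in seen.items() if len(ids) > 1)

    updated = [
        {**rec, "mbid": rec.get("mbid") or unambiguous.get(rec["artist"])}
        for rec in records
    ]
    return updated, namesakes


# Placeholder identifiers, not real MBIDs
library = [
    {"artist": "Nina Simone", "mbid": "mbid-a"},
    {"artist": "Nina Simone", "mbid": None},
    {"artist": "John Williams", "mbid": "mbid-b"},
    {"artist": "John Williams", "mbid": "mbid-c"},
    {"artist": "John Williams", "mbid": None},
]
updated, namesakes = replicate_mbids(library)
print(namesakes)
```

The untagged 'John Williams' record stays without an MBID because the name is associated with two different identifiers.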

It can also go one step further, leveraging a dump of MusicBrainz contributors and MBIDs against which to validate and/or add MBID metadata when the reference table _REF_mb_disambiguated is present. This would be the most effective way to validate and add MusicBrainz identifiers to your music without having to resort to retagging and risking metadata edits you've made being overwritten by a tagger.

If you want to create your own table by downloading from the MusicBrainz database dump, _REF_mb_disambiguated contains 3 fields:

  • mbid - the MusicBrainz identifier
  • entity - the contributor name
  • lentity - lowercase representation of entity
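As a rough sketch of building that reference table yourself, the snippet below creates the three fields and derives lentity from entity. The TEXT column types, the helper name and the sample rows are assumptions; how you extract (mbid, name) pairs from the MusicBrainz dump is up to you.

```python
import sqlite3


def create_mb_reference(conn, contributors):
    """Create and populate _REF_mb_disambiguated from (mbid, name) pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS _REF_mb_disambiguated "
        "(mbid TEXT, entity TEXT, lentity TEXT)"
    )
    conn.executemany(
        "INSERT INTO _REF_mb_disambiguated (mbid, entity, lentity) "
        "VALUES (?, ?, ?)",
        # lentity is simply the lowercase form of entity
        [(mbid, name, name.lower()) for mbid, name in contributors],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Placeholder identifiers, not real MBIDs
create_mb_reference(conn, [("mbid-a", "Nina Simone"), ("mbid-b", "10cc")])
reference_rows = conn.execute(
    "SELECT mbid, entity, lentity FROM _REF_mb_disambiguated ORDER BY mbid"
).fetchall()
print(reference_rows)
```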

Leaves your files untouched unless you explicitly choose to export changes

tagminder writes changes to a database table and logs which tracks have had metadata changes. It does not make changes to your files unless you explicitly invoke tags2db.py using its export option. All tables in the database can be viewed and edited using a SQLite database editor like SQLiteStudio or DB Browser for SQLite. This enables you to browse your metadata and inspect tags to see exactly what would be written to files if you chose to export your changes to the underlying files.

In addition to running the automated changes you're also able to manually edit any records using the aforementioned database editors to further enhance/correct metadata issues manually, or code and run your own SQL queries if you're so inclined.

Backing out changes is easy

All originally-ingested records are written to a rollback table, so in the event you've made changes to your metadata you don't like, you can simply reinstate your old tags by exporting from the rollback table.

Reducing the need for incremental file backups

If your music collection is static in terms of filename and location, you can also use the metadata database as a means of backing up and versioning metadata simply by keeping various iterations of the database. This obviates the need to overwrite a previous backup of the underlying music files, reducing storage needs, backup times and complexity.

Getting metadata current after restoring a dated backup of your music files is as simple as exporting the most recent database against the restored files. The added benefit is it eliminates the need to create incremental backups of your music files simply because you've augmented the metadata - just backup the database and as long as your file locations remain static you have everything you need - the audio files and their metadata.

By default tagminder generates a version 4 UUID for every file, which is added to your tags when you export changes. A future update will remove the dependency on static filenames and locations by referencing the UUID, rather than the file path, to ascertain which files to write to when exporting changes from the database. This would make your metadata impervious to file move and rename operations. The code has been written, but I've not yet had a chance to incorporate it; I will do so as I refactor tagminder to carry out most operations in a Polars dataframe, enhancing its speed and efficiency by leveraging vectorised operations.

Understanding the scripts

tags2db-polars-multidrive.py (used to import / export tracks between files and database)

Handles the import and export from/to the underlying files and SQLite database. It is the means of getting your tags in and out of your underlying audio files.

This is where the puddletag dependency originates. I've modified Keith's (puddletag's original author) Python 2.x tags to database code to run under Python 3. To get it to work, all that's required is that you pull a copy of the puddletag source, then copy tags2db.py into the puddletag root folder so that it has access to puddletag's code library.

You do not need a functioning puddletag with all dependencies installed to be able to use tags2db.py, albeit in time you might find puddletag handy for some cleansing/ editing that's best left to human intervention.

This code uses parallel processing and is able to concurrently ingest tags from multiple drives, providing massive gains over sequential processing without thrashing drives or causing I/O bottlenecks. Be careful not to specify two ingestion paths on the same drive because that will only serve to thrash the drive making it do what drives hate most - parallel reads.

What --chunk-size does

--chunk-size controls how many files are processed per batch (chunk) by each worker.
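To make the idea of a chunk concrete, here is a minimal sketch of splitting a list of files into batches of a given size. It is not taken from the script; the chunked helper and the sample file names are hypothetical, and the README does not show how the script batches work internally.

```python
from itertools import islice


def chunked(paths, chunk_size):
    """Yield successive lists of at most chunk_size items from paths."""
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


files = [f"track{n:02d}.flac" for n in range(1, 6)]
batches = list(chunked(files, 2))
for batch in batches:
    print(len(batch), batch)
```

Five files with a chunk size of 2 give three batches, the last one holding the single leftover file. Smaller chunks mean more batches to schedule; larger chunks mean more files held in memory per worker at once.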

Key Behaviors

| Setting | Effect | Trade-offs |
|---|---|---|
| Small (e.g., 500) | More frequent updates; lower memory use per worker | Higher overhead (more scheduling); slower for large libraries |
| Medium (e.g., 2000) | Balanced throughput/memory; default in the script | Moderate overhead; good for most systems |
| Large (e.g., 5000) | Maximizes CPU utilization; fewer sync delays | Higher RAM use; delayed progress updates |

When to Adjust It

| Scenario | Recommended --chunk-size | Reason |
|---|---|---|
| Low RAM (<32GB) | 500–1000 | Avoids memory spikes |
| High RAM (64GB+) | 2000–5000 | Maximizes CPU usage (fewer chunks = less overhead) |
| Network Storage | 1000–2000 | Balances I/O latency and CPU |
| Debugging | 100 | Faster feedback (smaller batches complete quicker) |

Example: Optimizing for an AMD Ryzen 7 PRO 4750GE with 64GB RAM and 16 threads:

# High-throughput setting (large chunks)
python tags2db-polars-multidrive.py import db.sqlite D:\Music --chunk-size 5000 --workers 16
# Balanced setting (default)
python tags2db-polars-multidrive.py import db.sqlite D:\Music --chunk-size 2000 --workers 16
# Low-memory setting (for background tasks)
python tags2db-polars-multidrive.py import db.sqlite D:\Music --chunk-size 500 --workers 8

TL;DR

  • Smaller --chunk-size:

    • Safer for RAM
    • Worse for CPU utilization
  • Larger --chunk-size:

    • Faster for big libraries
    • Needs more RAM
  • Sweet spot: Start with 2000 and adjust based on your system monitor.

What --workers does

--workers sets the number of parallel processes (CPU cores/threads) used to scan files and process tags.

Key Behaviors

| Setting | Effect | Trade-offs |
|---|---|---|
| Low (e.g., 4) | Light on CPU/RAM; good for HDDs or shared systems | Slower processing; underutilizes modern CPUs |
| Default (None) | Auto-scales to (drives × 8); capped at 32 (max_total_workers) | Balanced for most systems |
| High (e.g., 16) | Maximizes CPU usage (Ryzen 7 PRO 4750GE = 16 threads); fastest for SSDs | Risk of I/O bottlenecks on HDDs; higher RAM use |
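The default scaling rule (drives × 8, capped at 32) amounts to a one-line calculation. The function below is a sketch of that arithmetic only; its name and the per_drive parameter are made up for the example, while max_total_workers is the cap named above.

```python
def default_workers(drive_count, requested=None, per_drive=8, max_total_workers=32):
    """Return the worker count: the explicit request, else drives x 8 capped at 32."""
    if requested is not None:
        return requested
    return max(1, min(drive_count * per_drive, max_total_workers))


print(default_workers(2))               # two drives
print(default_workers(5))               # five drives hit the cap
print(default_workers(2, requested=4))  # explicit --workers value wins
```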

When to Adjust It

| Scenario | Recommended --workers | Reason |
|---|---|---|
| Multi-drive (HDDs) | 8 | Avoids I/O contention (HDDs hate parallel seeks) |
| Single SSD | 16 | Maximizes Ryzen 7's 16 threads |
| Background Task | 4 | Leaves CPU free for other apps |
| Debugging | 1 | Easier error tracing (single-threaded) |

Example: Optimizing for an AMD Ryzen 7 PRO 4750GE

# Max performance (16 threads, SSD)
python tags2db-polars-multidrive.py import db.sqlite D:\Music --workers 16 --chunk-size 5000
# Balanced (8 threads, HDD)
python tags2db-polars-multidrive.py import db.sqlite D:\Music --workers 8 --chunk-size 2000
# Default (auto-scales to drives × 8)
python tags2db-polars-multidrive.py import db.sqlite D:\Music E:\Music  # Uses 16 workers (2 drives × 8)

tagminder.py (currently parading as ztm.py as it'll be deprecated at some juncture)

Does the heavy lifting where metadata is concerned, handling the cleanup of tags in the SQL table 'alib'. A SQL trigger flags any changed records, whether they're changed by way of a SQL update or a manual edit (the trigger field 'sqlmodded' is incremented every time a tag value in a record is updated).
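For readers unfamiliar with SQLite triggers, the following is a minimal sketch of the flagging idea, run against an in-memory database. The trigger name, the cut-down alib columns and the WHEN clause (which skips updates that do not actually change a value) are assumptions; tagminder's real trigger covers whatever tag columns your alib table contains and may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alib (
        __path TEXT PRIMARY KEY, title TEXT, artist TEXT,
        sqlmodded INTEGER DEFAULT 0
    );
    -- Hypothetical trigger: bump sqlmodded whenever a watched tag changes
    CREATE TRIGGER flag_modified AFTER UPDATE OF title, artist ON alib
    FOR EACH ROW
    WHEN old.title IS NOT new.title OR old.artist IS NOT new.artist
    BEGIN
        UPDATE alib SET sqlmodded = COALESCE(sqlmodded, 0) + 1
        WHERE rowid = new.rowid;
    END;
    INSERT INTO alib (__path, title, artist)
    VALUES ('/music/a.flac', 'Intro ', 'Artist');
""")


def sqlmodded(path):
    """Read the change counter for one file."""
    row = conn.execute(
        "SELECT sqlmodded FROM alib WHERE __path = ?", (path,)
    ).fetchone()
    return row[0]


conn.execute("UPDATE alib SET title = 'Intro' WHERE __path = '/music/a.flac'")
after_change = sqlmodded("/music/a.flac")

# Writing the same value again does not count as a change in this sketch
conn.execute("UPDATE alib SET title = 'Intro' WHERE __path = '/music/a.flac'")
after_noop = sqlmodded("/music/a.flac")

print(after_change, after_noop)
```

Selecting rows WHERE sqlmodded > 0 then gives exactly the records that would need exporting.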

This enables tagminder to generate a database 'export.db' containing only changed records, enabling you to write changes only to those files that have had their metadata modified in the database by tagminder or the user. As a bonus, tagminder creates a text file called affected_files.csv every time it is run, listing the individual files that have been updated. A user-executed bash shell script, addsec2modtime.sh, reads that file and adds 1 second to the last modified date of every file listed therein. This ensures that any update scan by a music server (whether batch or real-time) is able to detect the underlying files that need rescanning, as opposed to rescanning all files in your collection.

At present, Tagminder's specific capabilities are as follows:

General tag cleanup

  • strips all spurious tags from the database so that your files only contain the sanctioned tags listed in tagminder.py (you can obviously modify to suit your needs)

  • trims all text fields to remove leading and trailing spaces

  • removes all spurious CR/LF occurrences in text tags (you'll be surprised how many there are). It does not process the LYRICS or REVIEW tags.

  • replaces all grave accent apostrophes with ASCII and ISO 8859 compliant apostrophe: " ' "

  • removes PERFORMER tags where they match or are already present in the ARTIST tag

  • sorts and eliminates duplicate tag values in tags
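A few of the cleanup rules above are easy to picture as small string functions. The sketch below is only an illustration under stated assumptions: it replaces CR/LF runs with a single space (the README does not say whether they are dropped or replaced), it assumes the backslash is the multi-value delimiter, and the function names are invented for the example.

```python
import re

DELIMITER = "\\"  # assumed multi-value delimiter, as in TAGNAME=value1\value2


def clean_text(value):
    """Replace CR/LF runs, swap grave accents for apostrophes, trim spaces."""
    value = re.sub(r"[\r\n]+", " ", value)
    value = value.replace("`", "'")
    return value.strip()


def dedupe_values(value):
    """Sort the delimited values and drop duplicates and empty entries."""
    parts = {clean_text(part) for part in value.split(DELIMITER)}
    return DELIMITER.join(sorted(part for part in parts if part))


print(clean_text("  Don`t Stop\r\n"))
print(dedupe_values("Rock\\Pop\\Rock "))
```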

Tag standardisation

  • merges ALBUM and VERSION tags into the ALBUM tag to get around Logitech Media Server (LMS), Navidrome and other music servers merging different versions of an album into a single album. VERSION is left intact, making it simple to reverse with an UPDATE query

  • adds [bit depth/sampling rate kHz], [Mixed Res] or [DSD] to the end of all album names where an album is not redbook (16/44.1). This is because very few music servers differentiate different releases properly if they share exactly the same name, and the devs typically don't see getting this right as a priority, and if they do they completely overengineer their solution rather than use tags

  • sets COMPILATION = 1 for all Various Artists albums and 0 for all others. Tests for presence or otherwise of ALBUMARTIST and whether __dirname of album begins with ‘VA -’ to make its determination. Does the same for all albums where the __dirname begins with ‘OST - ’ (denoting Original Soundtrack).

  • removes 'Various Artists' from ALBUMARTIST tag

  • writes out multiple TAGNAME=value entries rather than TAGNAME=value1\value2 delimited tag entries, and in doing so respects the underlying file type's tagging 'specification' (if one considers the bull that's been conjured over the years to be standards)

  • normalises RELEASETYPE entries to First Letter Caps for better presentation in music server front-ends that leverage RELEASETYPE (support for RELEASETYPE was added to Logitech Media Server, massively improving its ability to list an artist's work in a meaningful manner rather than as one long unstructured list)

  • adds MusicBrainz identifiers to contributors (artists, albumartists, composers, engineers and producers) leveraging what already exists in your file tags, or where a master table of MBIDs exists it leverages that. Where a contributor name is associated with > 1 MBID in your tags these contributors are ignored so as not to conflate contributors. Check for tables INF_namesakes* for contributors requiring manual disambiguation and confirmation

  • Makes albumartist, artist, composer, engineer, producer text case consistent with their representation in the MusicBrainz ecosystem. If they don't exist in MusicBrainz it converts them to Firstlettercaps. Can also replace the text case of artist names in _REF_mb_disambiguated with matching names found in table _REF_contributor_matched_on_allmusic. So if you want to change the text case of a contributor throughout your collection, just add a record to _REF_contributor_matched_on_allmusic and populate the name in the text case of your choosing - records in _REF_mb_disambiguated are always updated to reflect the text case in _REF_contributor_matched_on_allmusic prior to being applied elsewhere.

  • removes zero padding from discnumber and track tags

  • Substantially implements RYM Capitalisation rules for English language insofar as is possible without resorting to leveraging an LLM to understand word context.

Handling of ‘Live’ in album names and track titles

  • removes all instances and variations of Live entries from track titles and moves or appends that to the SUBTITLE tag as appropriate and ensures that the LIVE tag is set to 1 where this is not already the case. It does not corrupt track names where the word ‘Live’ is part of a song title

  • removes (live) from end of all album names, sets LIVE = '1' where it's not already set to '1' and appends (Live) to SUBTITLE tag where this is not already the case

  • ensures LIVE tag is set to 1 for all Live performances where [(live...)] and its many variations appears in TITLE or SUBTITLE tags

Handling of Feat. in track title and artist tags

  • removes most instances and variations of Feat. entries from ARTIST and TITLE tags and appends \ delimited performer names to the ARTIST tag
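To illustrate the kind of transformation involved, here is a deliberately naive sketch that pulls a trailing 'feat.' clause out of ARTIST and TITLE and appends the names to ARTIST using the \ delimiter. The regular expression and function names are assumptions for the example; the real handling covers many more variations, and this version splits names on ',' and '&' so it would wrongly split a group name that contains either.

```python
import re

# Matches a trailing "feat. X", "(ft. X)", "[featuring X]" and similar
FEAT_RE = re.compile(
    r"\s*[\(\[]?\s*\b(?:featuring|feat\.?|ft\.?)\s+(?P<names>[^\)\]]+)[\)\]]?\s*$",
    re.IGNORECASE,
)


def split_featured(text):
    """Return (text without the feat. clause, list of featured names)."""
    match = FEAT_RE.search(text)
    if not match:
        return text, []
    names = [n.strip() for n in re.split(r"\s*(?:,|&)\s*", match.group("names"))]
    return text[: match.start()].rstrip(), [n for n in names if n]


def move_featured_to_artist(artist, title, delimiter="\\"):
    """Strip feat. clauses from both tags and append the names to ARTIST."""
    artist, from_artist = split_featured(artist)
    title, from_title = split_featured(title)
    performers = [artist]
    for name in from_artist + from_title:
        if name not in performers:
            performers.append(name)
    return delimiter.join(performers), title


print(move_featured_to_artist("John Smith", "Song Title (feat. Jane Doe)"))
```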

Identifying duplicated FLAC audio content

  • identifies all duplicated albums based on records in the alib table. The code assumes every folder contains an album and relies on the md5sum embedded in properly-encoded FLAC files. It basically creates a concatenated string from the sorted md5sum of all tracks in a folder and compares that against the same for all other folders. If the strings match you have a 100% match of the audio stream and thus a duplicate album, irrespective of what tags / metadata might tell you. You can confidently remove all but one of the matched folders.
  • If any FLAC files are missing the md5sum or the md5sum is zero then a table is created listing all folders containing FLAC files that should be reprocessed by the official FLAC encoder using flac -f -8 --verify *.flac. Be careful not to delete duplicates where the concatenated md5sum is a bunch of zeroes or otherwise empty - re-encode these files and re-run tagminder.
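The comparison described above can be sketched in a few lines of Python. This is an illustration of the technique rather than tagminder's code: the input is assumed to be (folder, md5) pairs, one per track, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_albums(tracks):
    """Group folders whose sorted, concatenated track md5sums are identical.

    tracks: iterable of (folder, md5) pairs, one per FLAC file.
    Returns (groups of duplicate folders, folders needing re-encoding).
    """
    per_folder = defaultdict(list)
    for folder, md5 in tracks:
        per_folder[folder].append(md5 or "")

    by_signature = defaultdict(list)
    needs_reencode = []
    for folder, sums in per_folder.items():
        # A missing or all-zero md5 cannot be trusted for comparison
        if any(not s.strip("0") for s in sums):
            needs_reencode.append(folder)
            continue
        by_signature["".join(sorted(sums))].append(folder)

    duplicates = [sorted(f) for f in by_signature.values() if len(f) > 1]
    return duplicates, sorted(needs_reencode)


# Shortened placeholder checksums for readability
tracks = [
    ("/a", "aa11"), ("/a", "bb22"),
    ("/b", "bb22"), ("/b", "aa11"),
    ("/c", "aa11"), ("/c", "cc33"),
    ("/d", ""), ("/d", "0000"),
]
duplicates, needs_reencode = find_duplicate_albums(tracks)
print(duplicates, needs_reencode)
```

Sorting the checksums before joining them makes the signature independent of track order, and folders with empty or zeroed checksums are set aside instead of being matched against each other.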

Renaming of music files and directories based on tag metadata and file attributes

Renames audio files as follows:

  • if compilation is set to 1: file renaming: 'discnumber-track - artist - title.ext' ; folder renaming: 'VA - album [release] [bit depth sample rate]'.
  • if compilation is set to 0: file renaming: 'discnumber-track - title.ext' ; folder renaming: 'albumartist - album [release] [bit depth sample rate]'

In all instances [bit depth sample rate] is only included where an album is not redbook.

Files and directories are renamed in-situ rather than being moved elsewhere in directory tree. This means all other files associated with an album remain in the renamed folders.

Normalising artist, albumartist, composer, engineer and producer names and getting them consistent throughout your collection. (record labels to be incorporated in future).

Tagminder includes the capability to effect mass changes across hundreds of thousands of records almost instantaneously. Music servers typically employ database models that mean '10CC', '10cc', '10 cc' and '10cc.' are four different artists. Tagminder includes a transformation function that enables you to transform all instances of names like 10CC, 10cc. and 10 cc to 10cc throughout your collection in a single operation, without having to write any code. These transformation rules need only be captured once, and are then available for all future metadata ingestion, ensuring that your collection achieves a level of consistency that would otherwise be very difficult (if not impossible) to attain and maintain.
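Conceptually, a transformation rule is just a mapping from a variant spelling to the preferred one, applied to every contributor value. The sketch below shows that idea with a plain dictionary; the rule storage, the function name and the backslash delimiter for multi-value tags are assumptions for illustration, since tagminder itself drives this from a table in the database.

```python
# Variant spelling -> preferred spelling (captured once, reused on every run)
RULES = {
    "10CC": "10cc",
    "10 cc": "10cc",
    "10cc.": "10cc",
}


def normalise_names(value, rules=RULES, delimiter="\\"):
    """Replace each delimited contributor name that has a rule."""
    names = [name.strip() for name in value.split(delimiter)]
    return delimiter.join(rules.get(name, name) for name in names)


print(normalise_names("10CC"))
print(normalise_names("10 cc\\Godley & Creme"))
```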

To aid in identifying variations of contributor names that may be the same artist (e.g. the 10cc examples above), tagminder uses string-grouper to compare all unique contributor names in your metadata and presents these to you in a table showing possible matches with a confidence level. All that's required from you is to insert 1 or 0 in the field indicating whether or not the name on the left should be replaced with the name on the right, making it trivial to populate the disambiguation table used to drive normalisation of names throughout your music. tagminder identifies names it thinks might represent the same contributor, then eliminates any you have previously confirmed as false positives or as requiring replacement, by reference to matching names in _REF_disambiguation_workspace, where false positives and replacements required are represented as status = 0 and status = 1 respectively. The remaining names can be found in table _INF_string_grouper_possible_namesakes, for consideration by the user.


Identifying different versions of an album

If you're a music fanatic you may have multiple releases of the same album. At some point your rational mind may get the better of you and you might want to get rid of a few versions that are substantially the same ... same track count, same dynamic range, same bit depth and sampling rate. Tagminder can point these out for you and auto-select some candidates for culling, leaving you with a table of versions to peruse and edit/override or accept versions it has selected as candidates for removal. Tagminder will not flag a version as a candidate for removal if any of the following keywords are present in the directory name:

audiophile label signifier
afz
audio fidelity
compact classics
dcc
fim
gzs
mfsl
mobile fidelity
mofi
mastersound
sbm
xrcd

Whilst tagminder will never remove the versions for you, the table contains everything you need to be able to export the directory paths of those versions you're sure you want to let go of. A bash script can then do the dirty work or you can work through it manually. Versions can be found in the table _INF_versions.
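The keyword protection can be pictured as a simple membership test on the directory name. The sketch below assumes a case-insensitive substring match, which may not be exactly how tagminder applies the list; the function name and example directory names are made up.

```python
PROTECTED_KEYWORDS = (
    "afz", "audio fidelity", "compact classics", "dcc", "fim", "gzs",
    "mfsl", "mobile fidelity", "mofi", "mastersound", "sbm", "xrcd",
)


def is_protected(dirname):
    """True when the directory name carries an audiophile label signifier."""
    lowered = dirname.lower()
    return any(keyword in lowered for keyword in PROTECTED_KEYWORDS)


print(is_protected("Steely Dan - Aja [MFSL]"))
print(is_protected("Steely Dan - Aja [1977]"))
```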

Pointing out missing metadata and other useful information

Whilst assessing and improving your metadata consistency tagminder populates a number of tables along the way. All tables that begin with _INF as a prefix contain data you may want to peruse because they point to metadata or library issues you may want to address. The tables and their contents are described below:

| table name | purpose |
|---|---|
| _INF___dirpaths_with_FLACs_to_kill | list of duplicate albums you can delete, leaving behind only one copy |
| _INF___dirpaths_with_same_content | list of albums that are duplicated (as in every track has an identical audio stream as one or more other albums) |
| _INF_albums_missing_artist | albums containing tracks without a track artist |
| _INF_albums_missing_tracknumbers | albums containing tracks without a track number |
| _INF_albums_with_duplicated_tracknumbers | albums containing tracks with a track number appearing > 1x |
| _INF_albums_with_nameless_tracks | albums containing tracks without a track title |
| _INF_albums_with_no_genre | albums with no genre tags |
| _INF_albums_with_no_year | albums with no year tags |
| _INF_missing_tracknumbers | missing track sequences by album |
| _INF_nonstandard_FLACS | FLAC files without the embedded md5 of the audio stream |
| _INF_string_grouper_possible_namesakes | possible namesakes for disambiguation or correction to ensure consistency of contributor name |
| _INF_tracks_without_artist | tracks without an artist tag |
| _INF_tracks_without_title | tracks without a title tag |
| _INF_versions | albums where multiple versions are present in library. killit == 'Investigate' means version has same key attributes as other versions. killit == '1' means a higher DR version has been identified that is either same or higher sampling rate and bit depth |

TODO:

Refer to the issues list and filter on enhancements. Refactor all code to leverage Polars DataFrames wherever possible, leveraging vectorisation and significantly improving performance.

USAGE:

I generally tag with Picard or another semi-automated metadata source, then run the lot through tagminder, then use Puddletag for fine tuning, then re-process the lot through tagminder to ensure I've not introduced any inconsistencies through manual tagging.

I strongly suggest writing the SQLite database to /tmp as its 'alib' table is dynamically modified every time a new tag is encountered when tags are being imported from audio files. (albeit the refactored code handles the entire import in memory and only writes the database at the end).

It'll work on physical disk, but it'll take longer. It'll also trigger a lot of writes whilst ingesting metadata and dynamically altering the table to ingest new tags, so you probably want to avoid hammering an SSD by ensuring that you're not writing the database directly to it. Use /tmp!

First import tags from your files into a nominated database:

python /path.to/puddletag/tags2db.py import /tmp/dbname.db /path/to/import/from

Let that run - it'll take a while to ingest tags from your library, writing each file's metadata to a table called 'alib'

Run tagminder.py against the same database:

python ~/tagminder.py /tmp/dbname.db

It'll report its workings and stats as it goes.

When it's done, the results (changed records only) are written to 'export.db', which can be exported back to the underlying files like so:

python /path.to/puddletag/tags2db.py export /tmp/export.db /path/imported/from

This will overwrite the tags in the associated files, replacing them with the metadata tags stored in 'export.db'

Workflow

  • run it once against your entire music collection to process updates en masse

  • thereafter use tagminder to cleanup tags for any music you want to add to your music collection. For me that means tagging via Picard followed by Puddletag (to leverage tag sources other than MusicBrainz, inspect tags, standardise filenames, rename folders etc.) and then running tagminder to pick up anything I may have overlooked.
