Caution

This is a proof of concept. All releases with version numbers below v1.0.0 may break backward compatibility and produce incompatible Semantic Text-Codes. The algorithms of this repository are experimental and not part of the official ISO 24138:2024 standard.
iscc-sct is a Semantic-Code Text implementation for the ISCC (International Standard Content Code). The Semantic-Code Text is a new ISCC-UNIT for semantic text identification. The algorithm creates similar (low Hamming distance) codes for semantically similar text inputs across different languages. The SCT ISCC-UNIT is a compact binary code created from binarized document-vector text-embeddings.
# Install the package
pip install iscc-sct

# Generate a semantic code
python -c "import iscc_sct as sct; print(sct.create('Your text here').iscc)"

# Or use the CLI
sct "path/to/textfile.txt"
The ISCC is a combination of various similarity preserving fingerprints and an identifier for digital media content.

ISCCs are generated algorithmically from digital content, just like cryptographic hashes. However, instead of using a single cryptographic hash function to identify data only, the ISCC uses various algorithms to create a composite identifier that exhibits similarity-preserving properties (soft hash or Simprint).

The component-based structure of the ISCC identifies content at multiple levels of abstraction. Each component is self-describing, modular, and can be used separately or with others to aid in various content identification tasks. The algorithmic design supports content deduplication, database synchronization, indexing, integrity verification, timestamping, versioning, data provenance, similarity clustering, anomaly detection, usage tracking, allocation of royalties, fact-checking and general digital asset management use-cases.
| Feature | ISCC Content-Code Text | ISCC Semantic-Code Text |
|---|---|---|
| Focus | Lexical similarity | Semantic similarity |
| Cross-lingual | No | Yes |
| Use case | Near-duplicate detection | Semantic similarity, translations |
The ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate matching. The ISCC Semantic Text-Code is a planned additional ISCC-UNIT focused on capturing a more abstract and broader semantic similarity. It is engineered to be robust against a wide range of variations and, most remarkably, translations of text that cannot be matched based on lexical similarity alone.

One of the most interesting aspects of the Semantic Text-Code is its ability to generate (near)-identical codes for translations or paraphrased versions of the same text. This means that the same content, expressed in different languages, can be identified and linked, opening up new possibilities for cross-lingual content identification and similarity detection.
- Semantic Similarity: Utilizes deep learning models to generate codes that reflect the semantic essence of text.
- Translation Matching: Creates nearly identical codes for text translations, enabling cross-lingual content identification.
- Bit-Length Flexibility: Supports generating codes of various bit lengths (up to 256 bits), allowing for adjustable granularity in similarity detection.
- ISCC Compatible: Generates codes fully compatible with the ISCC specification, facilitating seamless integration with existing ISCC-based systems.
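The bit-length trade-off from the feature list above can be illustrated without the library: if codes are compared by Hamming distance over their binarized vectors, a shorter prefix makes comparison coarser but cheaper. The following is a minimal sketch in plain Python; treating shorter codes as prefixes of the 256-bit vector is an assumption for illustration, not the documented API.

```python
def hamming_prefix(a: int, b: int, bits: int) -> int:
    """Hamming distance over the top `bits` bits of two 256-bit code bodies."""
    shift = 256 - bits
    return bin((a >> shift) ^ (b >> shift)).count("1")

# Two hypothetical 256-bit code bodies differing in a few low bits
a = int("ab" * 32, 16)
b = a ^ 0b1011  # flip three of the lowest bits

print(hamming_prefix(a, b, 64))   # → 0 (the flips fall outside a 64-bit prefix)
print(hamming_prefix(a, b, 256))  # → 3 (full-length comparison sees all flips)
```

A shorter comparison window can thus miss fine-grained differences that a full-length comparison detects, which is the granularity knob the `bits` parameter controls.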
Ensure you have Python 3.10 or newer installed on your system. Install the library using:
pip install iscc-sct
For systems with GPU CUDA support, enhance performance by installing with:
pip install iscc-sct[gpu]
Generate a Semantic Text-Code using the create function:
>>> import iscc_sct as sct
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sct.create(text, bits=256)
{
  "iscc": "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI",
  "characters": 77
}
For granular (per chunk) feature outputs:
>>> import iscc_sct as sct
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sct.create(text, bits=256, granular=True)
{
  "iscc": "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI",
  "characters": 77,
  "features": [
    {
      "maintype": "semantic",
      "subtype": "text",
      "version": 0,
      "byte_offsets": false,
      "simprints": [
        {
          "simprint": "XZjeSfdyVi0",
          "offset": 0,
          "size": 77,
          "content": "This is some sample text. It can be a longer document or even an entire book."
        }
      ]
    }
  ]
}
Tip
By default, granular features (simprints) report their offsets as character positions. If the byte_offsets option is enabled (via the ISCC_SCT_BYTE_OFFSETS environment variable or as an option in code), the offsets are computed on the UTF-8 representation of the text. This can be useful when you need to retrieve individual text chunks via random access from remote storage.
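The difference between character and byte offsets only shows up with non-ASCII text. A small plain-Python illustration (no iscc-sct required): the same 15-character chunk spans 18 bytes in UTF-8 because each umlaut or eszett takes two bytes.

```python
text = "Grüße aus Köln. Ein kurzer Beispieltext."

# Character offsets slice the Python str directly
chunk_chars = text[0:15]

# Byte offsets slice the UTF-8 encoding instead
data = text.encode("utf-8")
chunk_bytes = data[0:18].decode("utf-8")

# "ü", "ß" and "ö" each occupy two UTF-8 bytes, so the same chunk
# covers 15 characters but 18 bytes
print(chunk_chars)  # → Grüße aus Köln.
print(chunk_chars == chunk_bytes)  # → True
```

Byte offsets are what you want for ranged HTTP requests or seeks into raw files, since remote storage addresses bytes, not characters.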
import iscc_sct as sct

# Generate codes for two texts
text1 = """An ISCC applies to a specific digital asset and is a data-descriptor deterministically constructed
from multiple hash digests using the algorithms and rules in this document. This document does not
provide information on registration of ISCCs."""

text2 = """Ein ISCC bezieht sich auf ein bestimmtes digitales Gut und ist ein Daten-Deskriptor, der
deterministisch aus mehreren Hash-Digests unter Verwendung der Algorithmen und Regeln in diesem
Dokument erstellt wird. Dieses Dokument enthält keine Informationen über die Registrierung von ISCCs."""

code1 = sct.create(text1)
code2 = sct.create(text2)

distance = sct.iscc_distance(code1.iscc, code2.iscc)
print(f"Hamming distance in bits: {distance}")
The installation also provides a sct command-line tool:
usage: sct [-h] [-b BITS] [-g] [-d] [path]

Generate Semantic Text-Codes for text files.

positional arguments:
  path                  Path to text files (supports glob patterns) or 'gui' to launch Gradio demo.

options:
  -h, --help            show this help message and exit
  -b BITS, --bits BITS  Bit-Length of Code (default 256)
  -g, --granular        Activate granular processing.
  -d, --debug           Show debugging messages.
Text Input → Text Chunking → Embedding Generation → Vector Aggregation → Binarization → ISCC Encoding
iscc-sct employs the following process:

- Splits the text into overlapping chunks (using syntactically sensible breakpoints).
- Uses a pre-trained deep learning model for text embedding.
- Generates feature vectors capturing essential characteristics of the chunks.
- Aggregates these vectors and binarizes them to produce a Semantic Text-Code.
- Prefixes the binarized vector with the matching ISCC header, encodes it with base32, and adds the "ISCC:" prefix.

This process ensures robustness to variations and translations, enabling cross-lingual matching based on a short Simprint.
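The aggregation and binarization steps of the pipeline above can be sketched in a few lines. Mean-pooling the chunk embeddings and thresholding each dimension at zero are common choices and are assumptions of this sketch; the actual aggregation is defined in the iscc-sct source.

```python
def aggregate_and_binarize(chunk_vectors: list[list[float]]) -> list[int]:
    """Mean-pool per-chunk embeddings, then binarize each dimension by sign."""
    dims = len(chunk_vectors[0])
    n = len(chunk_vectors)
    # Element-wise mean over all chunk vectors (vector aggregation)
    mean = [sum(v[d] for v in chunk_vectors) / n for d in range(dims)]
    # Sign threshold turns the float vector into a bit vector (binarization)
    return [1 if x > 0 else 0 for x in mean]

# Three toy 4-dimensional chunk embeddings
chunks = [
    [0.2, -0.5, 0.1, 0.9],
    [0.4, -0.1, -0.3, 0.7],
    [0.3, -0.2, 0.5, 0.8],
]
print(aggregate_and_binarize(chunks))  # → [1, 0, 1, 1]
```

Small perturbations of the input embeddings rarely flip the sign of a dimension's mean, which is why the resulting bit vector is similarity-preserving.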
ISCC-SCT can be configured using environment variables:
| Environment Variable | Description | Default |
|---|---|---|
| ISCC_SCT_BITS | Default bit-length of generated code | 64 |
| ISCC_SCT_MAX_TOKENS | Maximum tokens per chunk | 127 |
| ISCC_SCT_OVERLAP | Maximum token overlap between chunks | 48 |
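Since these options are environment variables, they need to be set before the library reads its settings. A minimal sketch, assuming the settings are read when iscc_sct is first imported (as is typical for pydantic-style settings objects; verify against iscc_sct/options.py):

```python
import os

# Configure before importing iscc_sct; variable names are from the table above
os.environ["ISCC_SCT_BITS"] = "128"     # longer codes, finer-grained matching
os.environ["ISCC_SCT_OVERLAP"] = "32"   # reduce token overlap between chunks

# import iscc_sct as sct  # would now pick up the settings above

print(os.environ["ISCC_SCT_BITS"])  # → 128
```

In a shell, the equivalent is `ISCC_SCT_BITS=128 sct myfile.txt`.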
See iscc_sct/options.py for more configuration settings.
- The embedding model is downloaded on first execution.
- CPU vs GPU: On systems with CUDA-compatible GPUs, install with pip install iscc-sct[gpu] for significantly faster processing.
We welcome contributions to enhance the capabilities and efficiency of this proof of concept. For development, install the project in development mode using Poetry:

git clone https://github.com/iscc/iscc-sct.git
cd iscc-sct
poetry install
If you have suggestions for improvements or bug fixes, please open an issue or pull request. For major changes, please open an issue first to discuss your ideas.

We particularly welcome recommendations for other multilingual text embedding models trained with Matryoshka Representation Learning (MRL) and optimized for binarization. Such contributions could significantly improve the performance and efficiency of the ISCC Semantic Text-Code generation.

This repository also provides an interactive Gradio demo that allows you to explore the capabilities of ISCC Semantic Text-Code. The demo showcases:
- Generation of ISCC Semantic Text-Codes for input texts
- Comparison of two texts and their similarity based on the generated codes
- Visualization of text chunking and granular matches
- Adjustable parameters like ISCC bit-length and maximum tokens per chunk
You can access the live version of the Gradio demo at: https://huggingface.co/spaces/iscc/iscc-sct

To run the Gradio demo locally, you first need to install the iscc-sct package with the optional demo dependency:
pip install iscc-sct[demo]
This will ensure that Gradio and other necessary dependencies for the demo are installed.
After installation, you can use the sct command-line tool that comes with the package:
sct gui
This command will launch the Gradio interface in your default web browser, allowing you to interact with the demo on your local machine.
- The semantic matching works best for texts with at least several sentences.
- Very short texts (a few words) may not generate reliable semantic codes.
- Performance may vary across different language pairs.
- The model size is approximately 450MB, which may impact initial loading time.
Arabic, Armenian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese (China), Chinese (Taiwan),Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, French (Canada),Galician, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian,Japanese, Kannada, Korean, Kurdish, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi,Mongolian, Norwegian Bokmål, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian,Serbian, Sinhala, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian,Urdu, Vietnamese.
If you use ISCC-SCT in your research, please cite:
@software{iscc_sct,
  author = {Pan, Titusz},
  title = {ISCC-SCT: Semantic Text-Code for the International Standard Content Code},
  url = {https://github.com/iscc/iscc-sct},
  version = {0.1.4},
  year = {2025},
}
The current chunking strategy tries to maximize chunk sizes (up to 127 tokens) while still splitting at lexically sensible boundaries with an overlap of up to 48 tokens. See text-splitter.

Cross-document chunk matching via granular Simprints can likely be improved significantly with a semantically aware and shift-resistant chunking strategy. Better shift resistance would improve the chances that the boundaries detected for semantically similar text sequences in different documents are aligned.
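One well-known family of shift-resistant approaches is content-defined chunking, where boundaries depend on a rolling hash of local content, so an insertion early in a document only perturbs nearby boundaries instead of shifting every later chunk. A toy token-level sketch follows; CRC32 over a sliding window stands in for a proper rolling hash, and none of this is the current iscc-sct algorithm.

```python
import zlib

def content_defined_boundaries(tokens: list[str], mask: int = 0x07) -> list[int]:
    """Place a chunk boundary wherever the hash of the preceding 3-token
    window matches a bit pattern; boundaries depend only on local content."""
    bounds = [0]
    for i in range(3, len(tokens)):
        window = " ".join(tokens[i - 3:i]).encode("utf-8")
        if zlib.crc32(window) & mask == 0:
            bounds.append(i)
    bounds.append(len(tokens))
    return bounds

tokens = "the quick brown fox jumps over the lazy dog again and again".split()
shifted = ["inserted"] + tokens  # prepending a token realigns later boundaries
print(content_defined_boundaries(tokens))
```

Because a boundary decision looks only at the three tokens before it, identical passages in two documents tend to produce identically placed boundaries regardless of what precedes them, which is exactly the alignment property granular Simprint matching would benefit from.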
A text embedding model trained with Matryoshka Representation Learning may yield better results with short 64-bit Semantic Text-Codes.

A text embedding model with support for a larger max_token size (currently 128) may yield higher-order granular simprints based on larger chunks of text.
- Text Chunking: text-splitter
- Text Embeddings: Sentence-Transformers