Represent, send, store and search multimodal data
The data structure for multimodal data
⬆️ DocArray v2: We are currently working on v2 of DocArray. Keep reading here if you are interested in the current (stable) version, or check out the v2 alpha branch and v2 roadmap!
DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.
🚪 Door to the multimodal world: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, and 3D mesh data. The foundational data structure of Jina, CLIP-as-service, DALL·E Flow, DiscoArt, etc.
🧑‍🔬 Data science powerhouse: greatly accelerate data scientists' work on embedding, k-NN matching, querying, visualizing, and evaluating via Torch/TensorFlow/ONNX/PaddlePaddle on CPU/GPU.
🚡Data in transit: optimized for network communication, ready-to-wire at anytime with fast and compressed serialization in Protobuf, bytes, base64, JSON, CSV, DataFrame. Perfect for streaming and out-of-memory data.
🔎 One-stop k-NN: unified and consistent API for nearest neighbor search across mainstream vector databases, including Elasticsearch, Redis, AnnLite, Qdrant, and Weaviate.
👒For modern apps: GraphQL support makes your server versatile on request and response; built-in data validation and JSON Schema (OpenAPI) help you build reliable web services.
🐍Pythonic experience: as easy as a Python list. If you can Python, you can DocArray. Intuitive idioms and type annotation simplify the code you write.
🛸IDE integration: pretty-print and visualization on Jupyter notebook and Google Colab; comprehensive autocomplete and type hints in PyCharm and VS Code.
Read more on why you should use DocArray and how it compares to alternatives.
DocArray was released under the open-source Apache License 2.0 in January 2022. It is currently a sandbox project under the LF AI & Data Foundation.
Requires Python 3.7+

```shell
pip install docarray
```

or via Conda:

```shell
conda install -c conda-forge docarray
```

Commonly used features can be enabled via `pip install "docarray[common]"`.
DocArray consists of three simple concepts:
- Document: a data structure for easily representing nested, unstructured data.
- DocumentArray: a container for efficiently accessing, manipulating, and understanding multiple Documents.
- Dataclass: a high-level API for intuitively representing multimodal data.
Let's see DocArray in action with some examples.
You can easily represent a multimodal document, such as a news article card, with `docarray.dataclass` and type annotations.
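As a rough analogy using Python's standard `dataclasses` (DocArray's own decorator is `docarray.dataclass`, with modality types such as `Image` and `Text` from `docarray.typing`; the field names below are made up for illustration):

```python
from dataclasses import dataclass


# Plain-Python analogy: DocArray's @dataclass looks similar, but uses
# modality types (Image, Text, ...) so each field becomes a sub-Document
# that can be embedded and searched independently.
@dataclass
class ArticleCard:
    banner: str       # would be docarray.typing.Image
    headline: str     # would be docarray.typing.Text
    description: str  # would be docarray.typing.Text


card = ArticleCard(
    banner='banner.png',
    headline='Hello World',
    description='A short teaser for the article.',
)
print(card.headline)
```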
Let's search for the top-5 sentences most similar to "she smiled too much" in Pride and Prejudice:
```python
from docarray import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(Document.embed_feature_hashing, backend='process')

q = (
    Document(text='she smiled too much')
    .embed_feature_hashing()
    .match(da, metric='jaccard', use_scipy=True)
)
print(q.matches[:5, ('text', 'scores__jaccard__value')])
```
```text
[['but she smiled too much.', '_little_, she might have fancied too _much_.', 'She perfectly remembered everything that had passed in', 'tolerably detached tone. While she spoke, an involuntary glance', 'much as she chooses.”'],
 [0.3333333333333333, 0.6666666666666666, 0.7, 0.7272727272727273, 0.75]]
```
Here the feature embedding is done by simple feature hashing, and the distance metric is Jaccard distance. You have better embeddings? Of course you do! We look forward to seeing your results!
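Under the hood, feature hashing maps each token into a fixed number of buckets, and Jaccard distance compares the resulting sets. A minimal plain-Python sketch of the idea (illustrative only; DocArray's own implementation differs):

```python
def feature_hash(text: str, n_dims: int = 65536) -> set:
    # Map each token to a bucket index; the set of active buckets
    # acts as a crude binary embedding of the sentence.
    return {hash(tok) % n_dims for tok in text.lower().split()}


def jaccard_distance(a: set, b: set) -> float:
    # 1 - |a ∩ b| / |a ∪ b|; 0.0 means identical bucket sets.
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


q = feature_hash('she smiled too much')
d1 = feature_hash('but she smiled too much.')
d2 = feature_hash('much as she chooses.')

# Closer sentences get a smaller Jaccard distance.
assert jaccard_distance(q, d1) < jaccard_distance(q, d2)
```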
When your data is too big, storing it in memory is not the best idea. DocArray supports multiple storage backends such as SQLite, Weaviate, Qdrant, and AnnLite. They're all unified under the exact same user experience and API. Take the above snippet: you only need to change one line to use SQLite:
```python
da = DocumentArray(
    (Document(text=s.strip()) for s in d.text.split('\n') if s.strip()),
    storage='sqlite',
)
```
The code snippet can still run as-is. All APIs remain the same; the subsequent code simply runs in an "in-database" manner.
Besides saving memory, you can leverage storage backends for persistence and faster retrieval (e.g. on nearest-neighbor queries).
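To picture what "in-database" means here, a minimal stdlib-`sqlite3` sketch of a disk-backed document store (illustrative only; DocArray's SQLite backend additionally handles serialization, indexing, and the full DocumentArray API):

```python
import sqlite3

# Documents live in a database table rather than in Python memory.
conn = sqlite3.connect(':memory:')  # use a file path for real persistence
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT)')
conn.executemany(
    'INSERT INTO docs (text) VALUES (?)',
    [('first line',), ('second line',)],
)


def get_doc(i: int) -> str:
    # Fetch a document on demand instead of keeping everything in RAM.
    row = conn.execute('SELECT text FROM docs WHERE id = ?', (i,)).fetchone()
    return row[0]


print(get_doc(1))
```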
Let's use DocArray and the Totally Looks Like dataset to build a simple meme image search. The dataset contains 6,016 image pairs stored in `/left` and `/right`. Images that share the same filename appear similar to the human eye. For example:
*(example image pairs: left/00018.jpg ↔ right/00018.jpg, left/00131.jpg ↔ right/00131.jpg)*
Given an image from `/left`, can we find its most-similar image in `/right`? (Without looking at the filename, of course.)
First we load the images. You can go to the Totally Looks Like website, unzip, and load the images as below:

```python
from docarray import DocumentArray

left_da = DocumentArray.from_files('left/*.jpg')[:1000]
```
Or you can simply pull it from Jina AI Cloud:

```python
left_da = DocumentArray.pull('jina-ai/demo-leftda', show_progress=True)[:1000]
```
Note: if you have more than 15GB of RAM and want to try the whole dataset instead of just the first 1,000 images, remove `[:1000]` when loading the files into the DocumentArrays `left_da` and `right_da`.
You'll see a progress bar indicating how much has been downloaded.
To get a feel for the data, we can plot it in one sprite image. You'll need matplotlib and torch installed to run this snippet:
```python
left_da.plot_image_sprites()
```
Let's do some standard computer vision pre-processing:
```python
from docarray import Document


def preproc(d: Document):
    return (
        d.load_uri_to_image_tensor()  # load
        .set_image_tensor_normalization()  # normalize color
        .set_image_tensor_channel_axis(-1, 0)  # switch color axis for the PyTorch model later
    )


left_da.apply(preproc)
```

Did I mention `apply` works in parallel?
Now let's convert images into embeddings using a pretrained ResNet50:
```python
import torchvision

model = torchvision.models.resnet50(pretrained=True)  # load ResNet50
left_da.embed(model, device='cuda')  # embed via GPU to speed up
```
This step takes ~30 seconds on a GPU. Besides PyTorch, you can also use TensorFlow, PaddlePaddle, or ONNX models in `.embed(...)`.
You can visualize the embeddings via t-SNE in an interactive embedding projector. You'll need pydantic, uvicorn, and FastAPI installed to run this snippet:
```python
left_da.plot_embeddings(image_sprites=True)
```
Fun is fun, but our goal is to match left images against right images, and so far we have only handled the left. Let's repeat the same procedure for the right:
Pull from Cloud:

```python
right_da = (
    DocumentArray.pull('jina-ai/demo-rightda', show_progress=True)[:1000]
    .apply(preproc)
    .embed(model, device='cuda')
)
```

Or download, unzip, and load from local:

```python
right_da = (
    DocumentArray.from_files('right/*.jpg')[:1000]
    .apply(preproc)
    .embed(model, device='cuda')
)
```
Now we can match the left to the right and take the top-9 results.
```python
left_da.match(right_da, limit=9)
```
Let's inspect what's inside `left_da`'s matches now:

```python
for m in left_da[0].matches:
    print(left_da[0].uri, m.uri, m.scores['cosine'].value)
```
```text
left/02262.jpg right/03459.jpg 0.21102
left/02262.jpg right/02964.jpg 0.13871843
left/02262.jpg right/02103.jpg 0.18265384
left/02262.jpg right/04520.jpg 0.16477376
...
```
Or shorten the loop to a one-liner using the element and attribute selector:
```python
print(left_da['@m', ('uri', 'scores__cosine__value')])
```
Better to see it:

```python
(
    DocumentArray(left_da[8].matches, copy=True)
    .apply(
        lambda d: d.set_image_tensor_channel_axis(0, -1).set_image_tensor_inv_normalization()
    )
    .plot_image_sprites()
)
```
Here we reversed the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so you can visualize them using image sprites.
Serious as you are, visual inspection is surely not enough. Let's calculate the recall@K. First we construct the groundtruth matches:
```python
groundtruth = DocumentArray(
    Document(uri=d.uri, matches=[Document(uri=d.uri.replace('left', 'right'))])
    for d in left_da
)
```
Here we created a new DocumentArray with real matches by simply replacing the filename, e.g. `left/00001.jpg` to `right/00001.jpg`. That's all we need: if a predicted match has the identical `uri` as the groundtruth match, then it is correct.
Now let's check recall rate from 1 to 5 over the full dataset:
```python
for k in range(1, 6):
    print(
        f'recall@{k}',
        left_da.evaluate(
            groundtruth, hash_fn=lambda d: d.uri, metric='recall_at_k', k=k, max_rel=1
        ),
    )
```
```text
recall@1 0.02726063829787234
recall@2 0.03873005319148936
recall@3 0.04670877659574468
recall@4 0.052194148936170214
recall@5 0.0573470744680851
```
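With `max_rel=1`, recall@k boils down to: for each query, is its single true match among the top-k predictions? A plain-Python sketch of that computation with hypothetical filenames (not DocArray's `evaluate`):

```python
def recall_at_k(predictions: dict, groundtruth: dict, k: int) -> float:
    # predictions: {query uri: ranked list of matched uris}
    # groundtruth: {query uri: the single true match uri}
    hits = sum(
        1 for q, ranked in predictions.items() if groundtruth[q] in ranked[:k]
    )
    return hits / len(predictions)


preds = {
    'left/00018.jpg': ['right/00042.jpg', 'right/00018.jpg', 'right/00007.jpg'],
    'left/00131.jpg': ['right/00131.jpg', 'right/00002.jpg', 'right/00099.jpg'],
}
truth = {
    'left/00018.jpg': 'right/00018.jpg',
    'left/00131.jpg': 'right/00131.jpg',
}

print(recall_at_k(preds, truth, 1))  # 0.5: only the second query hits at k=1
print(recall_at_k(preds, truth, 2))  # 1.0: both true matches are in the top-2
```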
You can also use other metrics like `precision_at_k`, `ndcg_at_k`, and `hit_at_k`.
If you think a pretrained ResNet50 is good enough, let me tell you that with Finetuner you can do much better with just another ten lines of code.
You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV or Protobuf message with/without compression. In its simplest form:
```python
left_da.save('left_da.bin')
```
To reuse that DocumentArray's data, use `left_da = DocumentArray.load('left_da.bin')`.
If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:
```python
left_da.push('my_shared_da')
```
Now anyone who knows the token `my_shared_da` can pull and work on it:
```python
left_da = DocumentArray.pull('<username>/my_shared_da')
```
Intrigued? That's only scratching the surface of what DocArray is capable of. Read our docs to learn more.
- Join our Discord server and chat with other community members about ideas.
- Join our public meetings where we discuss the future of the project.
DocArray is a trademark of LF AI Projects, LLC