# Intugle

The GenAI-powered toolkit for automated data intelligence.
Transform Fragmented Data into Connected Semantic Data Model
Intugle’s GenAI-powered open-source Python library builds a semantic data model over your existing data systems. At its core, it discovers meaningful links and relationships across data assets — enriching them with profiles, classifications, and business glossaries. With this connected knowledge layer, you can enable semantic search and auto-generate queries to create unified data products, making data integration and exploration faster, more accurate, and far less manual.
- Data Engineers & Architects often spend weeks manually profiling, classifying, and stitching together fragmented data assets. With Intugle, they can automate this process end-to-end, uncovering meaningful links and relationships to instantly generate a connected semantic layer.
- Data Analysts & Scientists spend endless hours on data readiness and preparation before they can even start the real analysis. Intugle accelerates this by providing contextual intelligence, automatically generating SQL and reusable data products enriched with relationships and business meaning.
- Business Analysts & Decision Makers are slowed down by constant dependence on technical teams for answers. Intugle removes this bottleneck by enabling natural language queries and semantic search, giving them trusted insights on demand.
- Semantic Data Model - Transform raw, fragmented datasets into an intelligent semantic graph that captures entities, relationships, and context — the foundation for connected intelligence.
- Business Glossary & Semantic Search: Auto-generate a business glossary and enable search that understands meaning, not just keywords — making data more accessible across technical and business users.
- Data Products - Instantly generate SQL and reusable data products enriched with context, eliminating manual pipelines and accelerating data-to-insight.
- Conceptual Search - Generate data product plans from natural language queries, bridging the gap between business questions and executable data product definitions. Learn more in the documentation.
| Category | Integrations |
|---|---|
| Data Warehouses | Snowflake, Databricks |
| Databases | SQLite, PostgreSQL, SQL Server, MySQL |
| Local | Pandas, DuckDB (CSV, Parquet, Excel) |
The `intugle` library includes a Streamlit application that provides an interactive web interface for building and visualizing semantic data models.
*(Demo video: Github.Streamlit.Demo.mp4)*
To use the Streamlit app, install `intugle` with the `streamlit` extra:
pip install intugle[streamlit]
You can launch the Streamlit application using the `intugle-streamlit` command or `uvx`:

```bash
intugle-streamlit

# Or using uvx
uvx --from "intugle[streamlit]" intugle-streamlit
```

Open the URL provided in your terminal (usually http://localhost:8501) to access the application. For more details, refer to the Streamlit App documentation.
To run the app in a cloud environment like Google Colab, please refer to ourStreamlit quickstart notebook.
For Windows and Linux, you can follow these steps. For macOS, please see the additional steps in the macOS section below.
Before installing, it is recommended to create a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```

Then, install the package:
pip install intugle
For macOS users, you may need to install the `libomp` library:
brew install libomp
If you installed Python using the official installer from python.org, you may also need to install SSL certificates by running the following command in your terminal. Please replace `3.XX` with your specific Python version. This step is not necessary if you installed Python using Homebrew.
/Applications/Python\ 3.XX/Install\ Certificates.command
Before running the project, you need to configure an LLM. This is used for tasks like generating business glossaries and predicting links between tables.
You can configure the LLM by setting the following environment variables:
- `LLM_PROVIDER`: The LLM provider and model to use (e.g., `openai:gpt-3.5-turbo`), following LangChain's conventions.
- `API_KEY`: Your API key for the LLM provider. The exact name of the variable may vary from provider to provider (e.g., `OPENAI_API_KEY` for OpenAI).
Here's an example of how to set these variables in your environment:
```bash
export LLM_PROVIDER="openai:gpt-3.5-turbo"
export OPENAI_API_KEY="your-openai-api-key"
```
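If you prefer to configure this from Python (for example at the top of a notebook), the standard-library `os.environ` works the same way; a minimal sketch:

```python
import os

# Equivalent configuration from Python, e.g. at the top of a notebook.
# Set these before building the semantic model so the library can read them.
os.environ["LLM_PROVIDER"] = "openai:gpt-3.5-turbo"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # replace with a real key
```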
For a detailed, hands-on introduction to the project, please see our quickstart notebooks:
| Domain | Notebook | Open in Colab |
|---|---|---|
| Healthcare | quickstart_healthcare.ipynb | |
| Tech Manufacturing | quickstart_tech_manufacturing.ipynb | |
| FMCG | quickstart_fmcg.ipynb | |
| Sports Media | quickstart_sports_media.ipynb | |
| Databricks Unity Catalog [Healthcare] | quickstart_healthcare_databricks.ipynb | Databricks Notebook Only |
| Snowflake Horizon Catalog [FMCG] | quickstart_fmcg_snowflake.ipynb | Snowflake Notebook Only |
| Native Snowflake with Cortex Analyst [Tech Manufacturing] | quickstart_native_snowflake.ipynb | |
| Native Databricks with AI/BI Genie [Tech Manufacturing] | quickstart_native_databricks.ipynb | |
| Streamlit App | quickstart_streamlit.ipynb | |
| Conceptual Search | quickstart_conceptual_search.ipynb | |
| Composite Relationships Prediction | quickstart_basketball_composite_links.ipynb | |
These notebooks will take you through the following steps:
- Generate Semantic Model → The unified layer that transforms fragmented datasets, creating the foundation for connected intelligence.
- 1.1 Profile and classify data → Analyze your data sources to understand their structure, data types, and other characteristics.
- 1.2 Discover links & relationships among data → Reveal meaningful connections (PK & FK), including composite keys, across fragmented tables.
- 1.3 Generate a business glossary → Create business-friendly terms and use them to query data with context.
- 1.4 Enable semantic search → Intelligent search that understands meaning, not just keywords—making data more accessible across both technical and business users.
- 1.5 Visualize semantic model → Access the semantic layer's enriched metadata as YAML files and visualize it as a graph.
- Build Unified Data Products → Simply pick the attributes across your data tables, and let the toolkit auto-generate queries with all the required joins, transformations, and aggregations using the semantic layer. When executed, these queries produce reusable data products.
For more detailed information, advanced usage, and tutorials, please refer to our fulldocumentation site.
The core workflow of the project involves using the `SemanticModel` to build a semantic layer, and then using the `DataProduct` to generate data products from that layer.
```python
from intugle import SemanticModel

# Define your datasets
datasets = {
    "allergies": {"path": "path/to/allergies.csv", "type": "csv"},
    "patients": {"path": "path/to/patients.csv", "type": "csv"},
    "claims": {"path": "path/to/claims.csv", "type": "csv"},
    # ... add other datasets
}

# Build the semantic model
sm = SemanticModel(datasets, domain="Healthcare")
sm.build()

# Access the profiling results
print(sm.profiling_df.head())

# Access the discovered links
print(sm.links_df)
```
For detailed code examples and a complete walkthrough, please see our quickstart notebooks.
Once the semantic model is built, you can use the `DataProduct` class to generate unified data products from the semantic layer.
```python
from intugle import DataProduct

# Define an ETL model
etl = {
    "name": "top_patients_by_claim_count",
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "patients.last", "name": "last_name"},
        {
            "id": "claims.id",
            "name": "number_of_claims",
            "category": "measure",
            "measure_func": "count",
        },
    ],
    "filter": {
        "sort_by": [
            {"id": "claims.id", "alias": "number_of_claims", "direction": "desc"}
        ],
        "limit": 10,
    },
}

# Create a DataProduct and build it
dp = DataProduct()
data_product = dp.build(etl)

# View the data product as a DataFrame
print(data_product.to_df())
```
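Assuming `to_df()` returns a pandas DataFrame (as the search API's results do), standard pandas operations work downstream. A minimal sketch, using a fabricated frame with the same columns as the ETL model above in place of a real `data_product.to_df()` result:

```python
import pandas as pd

# Stand-in for the real data_product.to_df() result; the columns mirror
# the ETL model above. This frame is fabricated for illustration only.
df = pd.DataFrame({
    "first_name": ["Ada", "Grace"],
    "last_name": ["Lovelace", "Hopper"],
    "number_of_claims": [12, 9],
})

# Persist the data product for reuse, e.g. as a CSV file
df.to_csv("top_patients_by_claim_count.csv", index=False)
```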
The semantic search feature allows you to search for columns in your datasets using natural language. It is built on top of the Qdrant vector database.
For full setup instructions (including Docker commands and environment variables), please refer to the Semantic Search documentation.
Once you have built the semantic model, you can use the `search` method to perform a semantic search. The search function returns a pandas DataFrame containing the search results, including the column's profiling metrics, category, table name, and table glossary.
```python
from intugle import SemanticModel

# Define your datasets
datasets = {
    "allergies": {"path": "path/to/allergies.csv", "type": "csv"},
    "patients": {"path": "path/to/patients.csv", "type": "csv"},
    "claims": {"path": "path/to/claims.csv", "type": "csv"},
    # ... add other datasets
}

# Build the semantic model
sm = SemanticModel(datasets, domain="Healthcare")
sm.build()

# Perform a semantic search
search_results = sm.search("reason for hospital visit")

# View the search results
print(search_results)
```
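Because the results come back as a pandas DataFrame, you can refine them with ordinary pandas filters. A hedged sketch with a fabricated results frame; the column names (`table_name`, `column_name`, `score`) are assumptions for illustration, so check the actual frame returned by `search()`:

```python
import pandas as pd

# Fabricated stand-in for sm.search(...); real column names may differ.
search_results = pd.DataFrame({
    "table_name": ["claims", "allergies", "patients"],
    "column_name": ["reason_code", "description", "marital"],
    "score": [0.91, 0.78, 0.40],
})

# Keep only strong matches from a specific table
strong = search_results[
    (search_results["score"] > 0.7) & (search_results["table_name"] == "claims")
]
print(strong)
```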
For detailed code examples and a complete walkthrough, please see our quickstart notebooks.
Intugle includes a built-in MCP (Model Context Protocol) server that exposes your semantic layer to AI assistants and LLM-powered clients. Its main purpose is to allow agents to understand your data's structure by using tools like `get_tables` and `get_schema`.
Once your semantic model is built, you can start the server with a simple command:
intugle-mcp
This enables AI agents to interact with your data context programmatically, and also supports "vibe coding" with the library.
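Many MCP clients are configured with a JSON file; a hedged sketch, assuming your client follows the common `mcpServers` convention (check the Intugle documentation and your client's docs for the exact entry):

```json
{
  "mcpServers": {
    "intugle": {
      "command": "intugle-mcp"
    }
  }
}
```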
For detailed instructions on setting up the server and connecting your favorite client, please see our full documentation.
Join our community to ask questions, share your projects, and connect with other users.
Contributions are welcome! Please see the CONTRIBUTING.md file for guidelines.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.