Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

Datagenerator for Data Services

License

NotificationsYou must be signed in to change notification settings

frischHWC/datagen

Repository files navigation

Goal of this project is to generate meaningful Data into various Big Data Services and File systems.

Built originally for Cloudera Data Platform, it works with S3, ADLS, GCS, local file systems, and open-source services (such as kafka, hive etc…​).

A quick demo

What is it ?

Datagen is a Web Server with a UI and a sets of APIs (with a swagger) to interact with.

Datagen is able to:

  • Push Data to multiple systems (HDFS, S3, ADLS, GCS, Kafka, Hive, HBase etc…​)

  • Handle multiple formats (Avro, Parquet, ORC, JSON, CSV)

  • Generate structured data of different types (string, integer, timestamp, uuid etc…​)

  • Provide pre-defined data filterable (countries, 40K+ cities, 35K+ names, phone pattern etc…​ )

  • Make data respecting rules (min, max, repartition, pattern, regex etc…​)

  • Add relations between data generated (one column value depends on other’s values with possible complex evaluation)

  • Generate unstructured data leveraging AI (OpenAI, Bedrock, Ollama, and even any model locally embedded)

  • Schedule generation of data

  • Analyze existing data to mimic it (currently in alpha version)

Public Documentation

Refer to this documentation to understand Datagen, know how to deploy and use it:

Dependencies

Datagen is made with:

  • Java 17

  • Maven 3.9

It is built for:

  • CDP 7.1.9+

  • AWS, GCP, Azure

  • Docker & K8s

Datagen is made as a Spring Boot project with dependencies on SDK to interact with various services.

Deployment

Datagen can be run in two ways:

  • As a standalone web server, anywhere, but you’ll need to provide configurations manually

  • Fully integrated into CDP using a parcel, a CSD and available as service through Cloudera Manager (configuration is automatic)

Standalone Web Server

To make a fast and easy deployment, here is a procedure:

wget https://datagen-repo.s3.eu-west-3.amazonaws.com/1.0.0/standalone/datagen-standalone-files.tar.gz
tar -xvzf datagen-standalone-files.tar.gzcd datagen_standalone-1.0.0/
./launch.sh

By default, TLS is de-activated and a user/password is required (admin/admin by default).

On CDP

Refer to this detailed procedure available onDatagen’s Documentation

By default, TLS is activated and a user/password is required (admin/admin by default).


[8]ページ先頭

©2009-2025 Movatter.jp