- Notifications
You must be signed in to change notification settings - Fork47
Cylon is a fast, scalable, distributed memory, parallel runtime with a Pandas like DataFrame.
License
cylondata/cylon
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Cylon is a fast, scalable distributed memory data parallel libraryfor processing structured data. Cylon implements a set of relational operators to process data.While ”Core Cylon” is implemented using system level C/C++, multiple language interfaces(Python and Java ) are provided to seamlessly integrate with existing applications, enablingboth data and AI/ML engineers to invoke data processing operators in a familiar programming language.By default it works with MPI for distributing the applications.
Internally Cylon usesApache Arrow to represent the data in a column format.
The documentation can be found athttps://cylondata.org
Email -cylondata@googlegroups.com
Mailing List -Join
We can use Conda to install PyCylon. At the moment Cylon only works on Linux Systems. The Conda binaries need Ubuntu 16.04 or higher.
conda create -n cylon-0.4.0 -c cylondata pycylon python=3.7conda activate cylon-0.4.0
Now lets run our first Cylon application inside the Conda environment. The following code creates two DataFrames and joins them.
frompycylonimportDataFrame,CylonEnvfrompycylon.netimportMPIConfigdf1=DataFrame([[1,2,3], [2,3,4]])df2=DataFrame([[1,1,1], [2,3,4]])# local mergedf3=df1.merge(right=df2,on=[0,1])print("Local Merge")print(df3)
Now lets run a parallel version of this program. Here if we create n processes (parallelism), n instances of theprogram will run. They will each load two DataFrames in their memory and do a distributed join among the DataFrames.The results will be created in the parallel processes as well.
frompycylonimportDataFrame,CylonEnvfrompycylon.netimportMPIConfigimportrandom# distributed joinenv=CylonEnv(config=MPIConfig())df1=DataFrame([random.sample(range(10*env.rank,15*(env.rank+1)),5),random.sample(range(10*env.rank,15*(env.rank+1)),5)])df2=DataFrame([random.sample(range(10*env.rank,15*(env.rank+1)),5),random.sample(range(10*env.rank,15*(env.rank+1)),5)])df2.set_index([0],inplace=True)print("Distributed Join")df3=df1.join(other=df2,on=[0],env=env)print(df3)
You can run the above program in the Conda environment by using the following command. It usesmpirun
command with 2 parallel processes.
mpirun -np 2 python<name of your python file>
Refer to the documentation on how to compile Cylon
Cylon uses the Apache Lincense Version 2.0
About
Cylon is a fast, scalable, distributed memory, parallel runtime with a Pandas like DataFrame.