Python script that generates random csv

Question 1

I have a Python script that given some configuration, generates a randomCSV file that can be used for testing purposes.

I want to know if it adheres to the best Python and coding practices. It works for cases where I need upwards of 10K+ rows fast enough for my requirements, so I am not too worried about performance althoughinputs on performance are also appreciated.

Input:

Schema: as a dict, information about each column name, data type and some other constraints (like fixed length/in a range/ from a given list)
Number of rows
Name of the output CSV file

Script:

import random as rndimport csvfrom abc import ABC, abstractmethod# csv creator, creates a csv files with a given configroundPrecision = 3class BoundType(ABC):    def __init__(self, dtype, params):        self.dType = dtype        self.params = params    @abstractmethod    def generate(self):        passclass FixedLength(BoundType):    # params is length    def generate(self):        length = self.params.get("len", 1)        if self.dType == "int":            return rnd.randint(10 ** (length - 1), 10 ** length - 1)        elif self.dType == "float":            return FixedLength("int", self.params).generate() + round(rnd.random(), roundPrecision)        elif self.dType == "string":            alphabet = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")            word = [rnd.choice(alphabet) for _ in range(length)]            return ''.join(word)        else:            return Noneclass FixedRange(BoundType):    # params is range    def generate(self):        lo, hi = (self.params.get("lohi"))        if self.dType == "int":            return rnd.randint(lo, hi)        elif self.dType == "float":            return round(rnd.uniform(lo, hi), roundPrecision)        else:            return Noneclass FromPossibleValues(BoundType):    # params is a list    def generate(self):        possibleval = self.params.get("set", set())        return rnd.choice(possibleval)def createcsv(rows, filename, schema):    with open(f'./output/{filename}.csv', 'w', encoding='UTF8', newline='') as f:        writer = csv.writer(f)        writer.writerow(schema.keys())        for _ in range(rows):            writer.writerow([x.generate() for x in schema.values()])

Test:

from csvGen.csvGenerator import FixedLength, FixedRange, FromPossibleValues, createcsvschema = {    "col1": FixedLength("int", {"len": 5}),    "col2": FixedLength("float", {"len": 5}),    "col3": FixedLength("string", {"len": 5}),    "col4": FixedRange("int", {"lohi": (10, 15)}),    "col5": FixedRange("float", {"lohi": (5.5, 6.7)}),    "col6": FromPossibleValues("int", {"set": [1, 2, 3, 4, 5]}),    "col7": FromPossibleValues("int", {"set": [1.1, 2.2, 3.3]}),    "col8": FromPossibleValues("int", {"set": ["A", "AB"]})}rows = 10fileName = "eightVals"createcsv(rows, fileName, schema)

This is what the output looks like for the given test :

col1	col2	col3	col4	col5	col6	col7	col8
51685	71830.471	PAXBK	12	6.192	1	2.2	AB
60384	42341.991	RHNUK	11	6.037	1	1.1	AB
73505	30997.171	DVOGT	10	6.69	5	2.2	A
60528	85072.731	FWWXW	10	5.761	1	2.2	A
23048	65401.245	EVPUX	13	6.474	4	1.1	AB
74748	66969.774	PEULP	15	6.546	3	2.2	AB
88763	34749.184	VOAUO	10	6.402	4	2.2	AB
77351	44566.163	JOBQF	13	5.683	1	2.2	AB
50820	73002.154	EACZT	15	5.711	1	1.1	AB
53037	89225.572	YTLBI	13	6.328	1	2.2	AB

Question 2

@Peter, your edit was mostly good, but please don't change the code comments for grammar/spelling, as that's an aspect of review.

Question 3

This seems overbuilt, and would also be more tersely expressed with Numpy/Pandas. Additionally, your manual loops - whereas you're satisfied with current performance - will be faster in vectorised form.

This program can be as simple as

import numpy as npimport pandas as pdfrom string import ascii_uppercasefrom numpy.random import default_rnground_precision = 3nrows = 10rand = default_rng()array_3 = rand.choice(tuple(ascii_uppercase), size=(nrows, 5))df = pd.DataFrame({    'col1': rand.integers(low=1e4, high=1e5, size=nrows),    'col2': rand.uniform(low=1e4, high=1e5, size=nrows).round(decimals=round_precision),    'col3': pd.DataFrame(array_3).agg(''.join, axis=1),    'col4': rand.integers(low=10, high=16, size=nrows),    'col5': rand.uniform(low=5.5, high=6.7, size=nrows).round(decimals=round_precision),    'col6': rand.integers(low=1, high=6, size=nrows),    'col7': rand.choice(np.linspace(start=1.1, stop=3.3, num=3), size=nrows),    'col8': rand.choice(('A', 'AB'), size=nrows),})df.to_csv('eightVals.csv', index=None)

with output

col1,col2,col3,col4,col5,col6,col7,col890510,54337.357,TITJJ,12,6.254,2,2.2,A82827,10498.364,MLZFY,11,5.881,5,2.2,A81292,95873.98,UOHVL,14,5.931,2,1.1,AB44278,69859.781,RGQVO,11,6.492,2,2.2,AB26424,18106.356,XPCYV,15,6.176,5,2.2,AB12324,52143.779,VUEWA,15,5.959,2,3.3,A38774,34881.201,URXXL,15,6.545,2,2.2,AB25801,31699.204,XEWGS,14,6.427,3,3.3,A59555,98153.322,WYGBV,11,6.062,1,2.2,A44449,86569.519,MDNXN,15,6.229,1,3.3,A

If this needs to be reusable (and I'm not entirely convinced that it is, given what you've shown), you can adapt the above to utility functions. I don't know that classes are necessary.

Question 4

Storing the config separately makes a lot of sense in data pipelines IMHO. Classes don't seem necessary just for that indeed though.

Question 5

I like how you started designing withgenerate method as interface for your generators. I like how you define your schema. I like also how the final loop and code works.

Really can't agree with the way you implemented yourgenerate methods. First it was incredibly confusing and it took me unnecessarily too long to figure out, what are these classes about. To be more specific:

Why passing params as dict? Why not pass named parameters and make them instance variables? It would make code much more readable and easier to use. Only reason I can think about is that you could instantiate all those classes with same type of parameters (but you don't need to do that).
That bigif in yourgenerate method is code smell. Each of those if branches should be it's own class with it's owngenerate implementation.
NamesFixedLength andFromPossibleValues can be better.

Some possibilities examples how to implement your classes better (imho):

class FixedLengthIntGenerator(BoundType):    def __init__(self, length):        self.length = length    def generate(self):        length = self.length            return rnd.randint(10 ** (length - 1), 10 ** length - 1)# TODO: other classes for string, float, range ...class FromPresetChoicesGenerator(BoundType):    def __init__(self, choices):        self.choices = choices    def generate(self):       return rnd.choice(self.choices)

Yes it is more verbose, but it is more readable, extensible, different implementations are not so bound. Also it's not necessary to pass this "dType" anymore. Instead, you choose what implementation to use.

Question 6

If you use a class having two methods, one of which is__init__, you should be using a function instead. Additionally using inheritance without a single call tosuper() or any of the parent class' features, is a big no no for me.

Reinderien 71.2k5 gold badges76 silver badges257 bronze badges · Accepted Answer · 2022-10-18 01:00:09Z

This seems overbuilt, and would also be more tersely expressed with Numpy/Pandas. Additionally, your manual loops - whereas you're satisfied with current performance - will be faster in vectorised form.

This program can be as simple as

import numpy as npimport pandas as pdfrom string import ascii_uppercasefrom numpy.random import default_rnground_precision = 3nrows = 10rand = default_rng()array_3 = rand.choice(tuple(ascii_uppercase), size=(nrows, 5))df = pd.DataFrame({    'col1': rand.integers(low=1e4, high=1e5, size=nrows),    'col2': rand.uniform(low=1e4, high=1e5, size=nrows).round(decimals=round_precision),    'col3': pd.DataFrame(array_3).agg(''.join, axis=1),    'col4': rand.integers(low=10, high=16, size=nrows),    'col5': rand.uniform(low=5.5, high=6.7, size=nrows).round(decimals=round_precision),    'col6': rand.integers(low=1, high=6, size=nrows),    'col7': rand.choice(np.linspace(start=1.1, stop=3.3, num=3), size=nrows),    'col8': rand.choice(('A', 'AB'), size=nrows),})df.to_csv('eightVals.csv', index=None)

with output

col1,col2,col3,col4,col5,col6,col7,col890510,54337.357,TITJJ,12,6.254,2,2.2,A82827,10498.364,MLZFY,11,5.881,5,2.2,A81292,95873.98,UOHVL,14,5.931,2,1.1,AB44278,69859.781,RGQVO,11,6.492,2,2.2,AB26424,18106.356,XPCYV,15,6.176,5,2.2,AB12324,52143.779,VUEWA,15,5.959,2,3.3,A38774,34881.201,URXXL,15,6.545,2,2.2,AB25801,31699.204,XEWGS,14,6.427,3,3.3,A59555,98153.322,WYGBV,11,6.062,1,2.2,A44449,86569.519,MDNXN,15,6.229,1,3.3,A

If this needs to be reusable (and I'm not entirely convinced that it is, given what you've shown), you can adapt the above to utility functions. I don't know that classes are necessary.

Storing the config separately makes a lot of sense in data pipelines IMHO. Classes don't seem necessary just for that indeed though.

Movatterモバイル変換

Stack Exchange Network

Python script that generates random csv

2 Answers2

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS