A custom reader for delimited files in Python. Ability to ingest big data files.

canimus/alphareader

After several attempts to use the csv package or pandas for reading large files with custom delimiters, I ended up writing a little program that does the job without complaints.

AlphaReader is a high-performance, pure-Python, 15-line library that reads chunks of bytes from your files and returns their content line by line.
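The chunk-and-split idea described above can be sketched in a few lines. This is an illustration only, not AlphaReader's actual source: the function name `iter_records` and the `chunk_size` parameter are hypothetical, and the sketch assumes a single-byte encoding such as CP1252 so that a chunk can be decoded without splitting a character.

```python
import io


def iter_records(handle, encoding="cp1252", terminator=10, delimiter=44, chunk_size=8192):
    """Yield one list of fields per line, reading the handle in fixed-size chunks."""
    line_sep = bytes([terminator]).decode(encoding)
    field_sep = bytes([delimiter]).decode(encoding)
    tail = ""  # partial line carried over from the previous chunk
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk.decode(encoding)).split(line_sep)
        tail = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            yield line.split(field_sep)
    if tail:  # final line without a trailing terminator
        yield tail.split(field_sep)


# io.BytesIO stands in for open('file.csv', 'rb')
data = io.BytesIO(b"1,John,Doe,2010\n2,Mary,Smith,2011\n3,Peter,Jones,2012")
rows = list(iter_records(data, chunk_size=7))
print(rows[0])  # ['1', 'John', 'Doe', '2010']
```

Because the function is a generator, only one chunk plus one partial line is held in memory at a time.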

The inspiration for this library came from having to extract data from an MS-SQL Server database and deal with the CP1252 encoding. AlphaReader uses this encoding by default, as it was useful in our use case.

It also works with HDFS through the pyarrow library, but pyarrow is not a dependency.

CSVs

    # !cat file.csv
    # 1,John,Doe,2010
    # 2,Mary,Smith,2011
    # 3,Peter,Jones,2012

    > from alphareader import AlphaReader
    > reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44)
    > next(reader)
    > ['1','John','Doe','2010']

TSVs

    # !cat file.tsv
    # 1    John    Doe    2010
    # 2    Mary    Smith  2011
    # 3    Peter   Jones  2012

    > reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
    > next(reader)
    > ['1','John','Doe','2010']

XSVs

    # !cat file.tsv
    # 1¦John¦Doe¦2010
    # 2¦Mary¦Smith¦2011
    # 3¦Peter¦Jones¦2012

    > ord('¦')
    > 166
    > chr(166)
    > '¦'
    > reader = AlphaReader(open('file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=166)
    > next(reader)
    > ['1','John','Doe','2010']
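When picking an integer for an unusual delimiter, it can help to compare the Unicode code point from `ord()` with the byte values the character takes in the file's encoding. For `¦` in CP1252 the two agree, but that is not true of every character. The helper `delimiter_codes` below is hypothetical and only inspects a character; it makes no claim about which of the two values AlphaReader uses internally.

```python
def delimiter_codes(char, encoding="cp1252"):
    """Return (Unicode code point, encoded byte values) for one character."""
    return ord(char), list(char.encode(encoding))


print(delimiter_codes("¦"))                    # (166, [166])
print(delimiter_codes("\t"))                   # (9, [9])
print(delimiter_codes("€"))                    # (8364, [128])
print(delimiter_codes("¦", encoding="utf-8"))  # (166, [194, 166])
```

A character whose byte list has more than one entry is a multi-byte delimiter in that encoding.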

HDFS

    # !hdfs dfs -cat /raw/tsv/file.tsv
    # 1    John    Doe    2010
    # 2    Mary    Smith  2011
    # 3    Peter   Jones  2012

    > import pyarrow as pa
    > fs = pa.hdfs.connect()
    > reader = AlphaReader(fs.open('/raw/tsv/file.tsv', 'rb'), encoding='cp1252', terminator=10, delimiter=9)
    > next(reader)
    > ['1','John','Doe','2010']

Transformations

    # !cat file.csv
    # 1,2,3
    # 10,20,30
    # 100,200,300

    > fn = lambda x: int(x)
    > reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=fn)
    > next(reader)
    > [1,2,3]
    > next(reader)
    > [10,20,30]

Chain Transformations

    # !cat file.csv
    # 1,2,3
    # 10,20,30
    # 100,200,300

    > fn_1 = lambda x: x+1
    > fn_2 = lambda x: x*10
    > reader = AlphaReader(open('file.csv', 'rb'), encoding='cp1252', terminator=10, delimiter=44, fn_transform=[int, fn_1, fn_2])
    > next(reader)
    > [20,30,40]
    > next(reader)
    > [110,210,310]
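The outputs above are consistent with each function in the list being applied in order to every field of a row. A minimal sketch of that composition, independent of AlphaReader (the helper name `apply_chain` is hypothetical, and this is not necessarily how the library implements it):

```python
from functools import reduce


def apply_chain(fields, fns):
    """Apply each function in `fns`, in order, to every field of a row."""
    return [reduce(lambda value, fn: fn(value), fns, field) for field in fields]


fn_1 = lambda x: x + 1
fn_2 = lambda x: x * 10

print(apply_chain(['1', '2', '3'], [int, fn_1, fn_2]))     # [20, 30, 40]
print(apply_chain(['10', '20', '30'], [int, fn_1, fn_2]))  # [110, 210, 310]
```

Order matters: `int` has to come first, since the fields arrive as strings.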

Caution

    > reader = AlphaReader(open('large_file.xsv', 'rb'), encoding='cp1252', terminator=172, delimiter=173)
    > records = list(reader)  # Avoid this, as it loads the whole file into memory
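Since the reader is consumed with `next()`, one way to avoid materializing everything is to pull a bounded number of rows at a time. The sketch below uses `itertools.islice` on a stand-in generator; the helper name `in_batches` is hypothetical, and it should work with any iterator of rows.

```python
from itertools import islice


def in_batches(records, size):
    """Yield lists of at most `size` records from any iterator."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for a reader: any iterator of rows behaves the same way.
fake_reader = (["row", str(i)] for i in range(5))
sizes = [len(batch) for batch in in_batches(fake_reader, 2)]
print(sizes)  # [2, 2, 1]
```

Only one batch is held in memory at a time, so the batch size bounds memory use regardless of file size.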

Limitations

  • No support for multi-byte delimiters
  • Relatively slower performance than the csv library. Use csv and dialects when your files have \r\n terminators
  • Transformations are applied per row; vectorization could perhaps improve performance
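For files with \r\n terminators, the standard library route mentioned above could look roughly like this. The dialect name `broken_bar` and the sample data are made up for illustration; `io.StringIO` stands in for a file opened with `open(path, encoding='cp1252', newline='')`.

```python
import csv
import io

# A custom dialect with a single-character delimiter and no quote handling.
csv.register_dialect("broken_bar", delimiter="¦", quoting=csv.QUOTE_NONE)

# newline="" leaves \r\n untouched so that csv.reader handles it itself.
sample = io.StringIO("1¦John¦Doe¦2010\r\n2¦Mary¦Smith¦2011\r\n", newline="")
rows = list(csv.reader(sample, dialect="broken_bar"))
print(rows[0])  # ['1', 'John', 'Doe', '2010']
```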

Performance

  • 24MB file loaded with list(AlphaReader(file_handle))

    tests/test_profile.py::test_alphareader_with_encoding
    --------------------------------------------------------------------------------- live log call
    INFO     root:test_profile.py:22          252343 function calls in 0.386 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       119605    0.039    0.000    0.386    0.000 .\alphareader\__init__.py:39(AlphaReader)
       122228    0.266    0.000    0.266    0.000 {method 'split' of 'str' objects}
         2625    0.005    0.000    0.054    0.000 {method 'decode' of 'bytes' objects}
         2624    0.001    0.000    0.049    0.000 .\Python-3.7.4\lib\encodings\cp1252.py:14(decode)
         2624    0.048    0.000    0.048    0.000 {built-in method _codecs.charmap_decode}
         2625    0.027    0.000    0.027    0.000 {method 'read' of '_io.BufferedReader' objects}
            1    0.000    0.000    0.000    0.000 .\__init__.py:5(_validate)
            1    0.000    0.000    0.000    0.000 {built-in method _codecs.lookup}
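A report in this shape can be produced with the standard `cProfile` and `pstats` modules. The sketch below is not the project's tests/test_profile.py; the helper `profile_call` and the stand-in workload are hypothetical, and timings will differ from the figures above.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Stand-in workload: split 10,000 comma-separated lines.
lines = ["1,John,Doe,2010"] * 10_000
rows, report = profile_call(lambda: [line.split(",") for line in lines])
print(report)
```

To profile the reader itself, the lambda would be swapped for a call that consumes the reader over an open file handle.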
