Importing data with genfromtxt

NumPy provides several functions to create arrays from tabular data. We focus here on the genfromtxt function.

In a nutshell, genfromtxt runs two main loops. The first loop converts each line of the file into a sequence of strings. The second loop converts each string to the appropriate data type. This mechanism is slower than a single loop, but gives more flexibility. In particular, genfromtxt is able to take missing data into account, whereas faster and simpler functions like loadtxt cannot.
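
The difference in how missing data is handled can be seen with a small sketch. The sample text is made up for illustration, and the exact error message raised by loadtxt depends on the NumPy version, so only the exception type is checked:

```python
from io import StringIO

import numpy as np

# The second row has an empty middle field.
text = "1,2,3\n4,,6"

# genfromtxt treats the empty field as missing and fills it with nan.
filled = np.genfromtxt(StringIO(text), delimiter=",")

# loadtxt has no notion of missing data and rejects the same input.
try:
    np.loadtxt(StringIO(text), delimiter=",")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

print(filled)
print("loadtxt failed:", loadtxt_failed)
```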

Note

When giving examples, we will use the following conventions:

>>> import numpy as np
>>> from io import StringIO

Defining the input

The only mandatory argument of genfromtxt is the source of the data. It can be a string, a list of strings, a generator, or an open file-like object with a read method (for example, a file or io.StringIO object). If a single string is provided, it is assumed to be the name of a local or remote file. If a list of strings or a generator returning strings is provided, each string is treated as one line in a file. When the URL of a remote file is passed, the file is automatically downloaded to the current directory and opened.

Recognized file types are text files and archives. Currently, the function recognizes gzip and bz2 (bzip2) archives. The type of the archive is determined from the extension of the file: if the filename ends with '.gz', a gzip archive is expected; if it ends with '.bz2', a bzip2 archive is assumed.
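
As a rough sketch of these input forms, the snippet below reads the same kind of table from a list of strings, from a generator, and from a gzip archive. The file name table.txt.gz and the temporary directory are assumptions made only so the example is self-contained:

```python
import gzip
import os
import tempfile

import numpy as np

# A list of strings: each item is read as one line.
from_list = np.genfromtxt(["1 2 3", "4 5 6"])

# A generator yielding strings is consumed the same way.
from_generator = np.genfromtxt(f"{i} {i * 10}" for i in range(3))

# A gzip archive, recognized by its '.gz' extension (hypothetical file name).
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "table.txt.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("1 2 3\n4 5 6\n")
    from_gzip = np.genfromtxt(path)

print(from_list)
print(from_generator)
print(from_gzip)
```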

Splitting the lines into columns

The delimiter argument

Once the file is defined and open for reading, genfromtxt splits each non-empty line into a sequence of strings. Empty or commented lines are just skipped. The delimiter keyword is used to define how the splitting should take place.

Quite often, a single character marks the separation between columns. For example, comma-separated files (CSV) use a comma (,) or a semicolon (;) as delimiter:

>>> data = u"1, 2, 3\n4, 5, 6"
>>> np.genfromtxt(StringIO(data), delimiter=",")
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

Another common separator is "\t", the tabulation character. However, we are not limited to a single character: any string will do. By default, genfromtxt assumes delimiter=None, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space.
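
The two behaviors just described, whitespace splitting by default and an arbitrary string as delimiter, can be sketched as follows. The "::" separator is an arbitrary choice for illustration:

```python
from io import StringIO

import numpy as np

# Default delimiter=None: runs of spaces and tabs count as one separator.
whitespace = np.genfromtxt(StringIO("1   2\t3\n4 5     6"))

# A multi-character string is also accepted as a delimiter.
double_colon = np.genfromtxt(StringIO("1::2::3\n4::5::6"), delimiter="::")

print(whitespace)
print(double_colon)
```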

Alternatively, we may be dealing with a fixed-width file, where columns are defined as a given number of characters. In that case, we need to set delimiter to a single integer (if all the columns have the same size) or to a sequence of integers (if columns can have different sizes):

>>> data = u"  1  2  3\n  4  5 67\n890123  4"
>>> np.genfromtxt(StringIO(data), delimiter=3)
array([[   1.,    2.,    3.],
       [   4.,    5.,   67.],
       [ 890.,  123.,    4.]])
>>> data = u"123456789\n   4  7 9\n   4567 9"
>>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
array([[ 1234.,   567.,    89.],
       [    4.,     7.,     9.],
       [    4.,   567.,     9.]])

The autostrip argument

By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or trailing white spaces. This behavior can be overridden by setting the optional argument autostrip to a value of True:

>>> data = u"1, abc , 2\n 3, xxx, 4"
>>> # Without autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']],
      dtype='|U5')
>>> # With autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
array([['1', 'abc', '2'],
       ['3', 'xxx', '4']],
      dtype='|U5')

The comments argument

The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored:

>>> data = u"""#
... # Skip me !
... # Skip me too !
... 1, 2
... 3, 4
... 5, 6 #This is the third line of the data
... 7, 8
... # And here comes the last line
... 9, 0
... """
>>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.],
       [ 7.,  8.],
       [ 9.,  0.]])

New in version 1.7.0: When comments is set to None, no lines are treated as comments.
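
A minimal sketch of a non-default comment marker and of comments=None follows. The '%' marker and the sample text are assumptions chosen for illustration. With comments=None the '#' is left in place, so the field '#1' cannot be parsed as a float and ends up as nan:

```python
from io import StringIO

import numpy as np

text = "% header written by a hypothetical logger\n1, 2\n3, 4 % trailing remark\n"

# Treat '%' as the comment marker instead of the default '#'.
percent = np.genfromtxt(StringIO(text), comments="%", delimiter=",")

# With comments=None nothing is stripped, so '#1' reaches the float
# conversion, fails to parse, and is replaced by nan.
raw = np.genfromtxt(StringIO("#1 2\n3 4"), comments=None)

print(percent)
print(raw)
```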

Note

There is one notable exception to this behavior: if the optional argument names=True, the first commented line will be examined for names.

Skipping lines and choosing columns

The skip_header and skip_footer arguments

The presence of a header in the file can hinder data processing. In that case, we need to use the skip_header optional argument. The value of this argument must be an integer which corresponds to the number of lines to skip at the beginning of the file, before any other action is performed. Similarly, we can skip the last n lines of the file by using the skip_footer argument and giving it a value of n:

>>> data = u"\n".join(str(i) for i in range(10))
>>> np.genfromtxt(StringIO(data),)
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> np.genfromtxt(StringIO(data),
...               skip_header=3, skip_footer=5)
array([ 3.,  4.])

By default, skip_header=0 and skip_footer=0, meaning that no lines are skipped.

The usecols argument

In some cases, we are not interested in all the columns of the data but only a few of them. We can select which columns to import with the usecols argument. This argument accepts a single integer or a sequence of integers corresponding to the indices of the columns to import. Remember that by convention, the first column has an index of 0. Negative integers behave the same as regular Python negative indexes.

For example, if we want to import only the first and the last columns, we can use usecols=(0,-1):

>>> data = u"1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data), usecols=(0, -1))
array([[ 1.,  3.],
       [ 4.,  6.]])

If the columns have names, we can also select which columns to import by giving their name to the usecols argument, either as a sequence of strings or a comma-separated string:

>>> data = u"1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a", "c"))
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
...               names="a, b, c", usecols=("a, c"))
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])

Choosing the data type

The main way to control how the sequences of strings we have read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:

  • a single type, such as dtype=float. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the names argument (see below). Note that dtype=float is the default for genfromtxt.
  • a sequence of types, such as dtype=(int, float, float).
  • a comma-separated string, such as dtype="i4,f8,|U3".
  • a dictionary with two keys 'names' and 'formats'.
  • a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].
  • an existing numpy.dtype object.
  • the special value None. In that case, the type of the columns will be determined from the data itself (see below).

In all the cases but the first one, the output will be a 1D array with a structured dtype. This dtype has as many fields as items in the sequence. The field names are defined with the names keyword.

When dtype=None, the type of each column is determined iteratively from its data. We start by checking whether a string can be converted to a boolean (that is, if the string matches true or false in lower case); then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string. This behavior may be changed by modifying the default mapper of the StringConverter class.

The option dtype=None is provided for convenience. However, it is significantly slower than setting the dtype explicitly.
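
Three of the dtype forms listed above can be sketched side by side. The sample text and the field names id, value and tag are made up for illustration. The encoding="utf-8" argument is an addition not discussed in this section; it is passed on the assumption that the reader's NumPy version accepts it, so that text columns come back as str:

```python
from io import StringIO

import numpy as np

text = "1 2.5 abc\n2 3.5 def"

# Comma-separated string: fields get the default names f0, f1, f2.
by_string = np.genfromtxt(StringIO(text), dtype="i4,f8,U3", encoding="utf-8")

# Dictionary with 'names' and 'formats' keys (hypothetical field names).
spec = {"names": ("id", "value", "tag"), "formats": ("i4", "f8", "U3")}
by_dict = np.genfromtxt(StringIO(text), dtype=spec, encoding="utf-8")

# dtype=None: the type of each column is inferred from its contents.
inferred = np.genfromtxt(StringIO(text), dtype=None, encoding="utf-8")

print(by_string.dtype)
print(by_dict["tag"])
print(inferred.dtype)
```

All three results are 1D structured arrays with one record per input line; only the way the field types are specified differs.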

Setting the names

The names argument

A natural approach when dealing with tabular data is to allocate a name to each column. A first possibility is to use an explicit structured dtype, as mentioned previously:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
array([(1, 2, 3), (4, 5, 6)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])

Another simpler possibility is to use the names keyword with a sequence of strings or a comma-separated string:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, names="A, B, C")
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

In the example above, we used the fact that by default, dtype=float. By giving a sequence of names, we are forcing the output to a structured dtype.

We may sometimes need to define the column names from the data itself. In that case, we must use the names keyword with a value of True. The names will then be read from the first line (after the skip_header ones), even if the line is commented out:

>>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
>>> np.genfromtxt(data, skip_header=1, names=True)
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
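
A header line does not have to be commented out for names=True to pick it up. The sketch below assumes a small CSV text with an ordinary header row and shows that each column can then be retrieved by its field name:

```python
from io import StringIO

import numpy as np

# The header line is not commented out here; names=True still reads it.
text = "x,y\n1,2\n3,4"
table = np.genfromtxt(StringIO(text), delimiter=",", names=True)

# Each column is then available by field name.
x_values = table["x"]

print(table.dtype.names)
print(x_values)
```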

The default value of names is None. If we give any other value to the keyword, the new names will overwrite the field names we may have defined with the dtype:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> ndtype = [('a', int), ('b', float), ('c', int)]
>>> names = ["A", "B", "C"]
>>> np.genfromtxt(data, names=names, dtype=ndtype)
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])

The defaultfmt argument

If names=None but a structured dtype is expected, names are defined with the standard NumPy default of "f%i", yielding names like f0, f1 and so forth:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int))
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

In the same way, if we don’t give enough names to match the length of the dtype, the missing names will be defined with this default template:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int), names="a")
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])

We can overwrite this default with the defaultfmt argument, which takes any format string:

>>> data = StringIO("1 2 3\n 4 5 6")
>>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
array([(1, 2.0, 3), (4, 5.0, 6)],
      dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])

Note

We need to keep in mind that defaultfmt is used only if some names are expected but not defined.

Validating names

NumPy arrays with a structured dtype can also be viewed as recarray, where a field can be accessed as if it were an attribute. For that reason, we may need to make sure that the field name doesn’t contain any space or invalid character, or that it does not correspond to the name of a standard attribute (like size or shape), which would confuse the interpreter. genfromtxt accepts three optional arguments that provide finer control over the names:

deletechars
Gives a string combining all the characters that must be deleted from the name. By default, invalid characters are ~!@#$%^&*()-=+~\|]}[{';:/?.>,<.
excludelist
Gives a list of the names to exclude, such as return, file, print… If one of the input names is in this list, an underscore character ('_') will be appended to it.
case_sensitive
Whether the names should be case-sensitive (case_sensitive=True), converted to upper case (case_sensitive=False or case_sensitive='upper') or to lower case (case_sensitive='lower').
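
The effect of these arguments can be sketched with deliberately awkward column names. The names and data below are hypothetical. The replacement of spaces by underscores is the default behavior of genfromtxt as far as we know, and is not controlled by the three arguments above:

```python
from io import StringIO

import numpy as np

text = "20.5, 1, 3.2\n21.0, 0, 4.1"
raw_names = ["temp (C)", "return", "Wind-Speed"]  # hypothetical column names

# Defaults: spaces become underscores, invalid characters are deleted,
# and names found in the exclusion list get a trailing underscore.
default = np.genfromtxt(StringIO(text), delimiter=",", names=raw_names)

# Same rules, with every name also converted to lower case.
lowered = np.genfromtxt(StringIO(text), delimiter=",", names=raw_names,
                        case_sensitive="lower")

print(default.dtype.names)
print(lowered.dtype.names)
```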

Tweaking the conversion

The converters argument

Usually, defining a dtype is sufficient to define how the sequence of strings must be converted. However, some additional control may sometimes be required. For example, we may want to make sure that a date in a format YYYY/MM/DD is converted to a datetime object, or that a string like xx% is properly converted to a float between 0 and 1. In such cases, we should define conversion functions with the converters argument.

The value of this argument is typically a dictionary with column indices or column names as keys and conversion functions as values. These conversion functions can either be actual functions or lambda functions. In any case, they should accept only a string as input and output only a single element of the wanted type.

In the following example, the second column is converted from a string representing a percentage to a float between 0 and 1:

>>> convertfunc = lambda x: float(x.strip("%"))/100.
>>> data = u"1, 2.3%, 45.\n6, 78.9%, 0"
>>> names = ("i", "p", "n")
>>> # General case .....
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names)
array([(1.0, nan, 45.0), (6.0, nan, 0.0)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

We need to keep in mind that by default, dtype=float. A float is therefore expected for the second column. However, the strings ' 2.3%' and ' 78.9%' cannot be converted to float and we end up having np.nan instead. Let’s now use a converter:

>>> # Converted case ...
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
...               converters={1: convertfunc})
array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

The same results can be obtained by using the name of the second column ("p") as key instead of its index (1):

>>> # Using a name for the converter ...
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
...               converters={"p": convertfunc})
array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

Converters can also be used to provide a default for missing entries. In the following example, the converter convert transforms a stripped string into the corresponding float or into -999 if the string is empty. We need to explicitly strip the string from white spaces as it is not done by default:

>>> data = u"1, , 3\n 4, 5, 6"
>>> convert = lambda x: float(x.strip() or -999)
>>> np.genfromtxt(StringIO(data), delimiter=",",
...               converters={1: convert})
array([[   1., -999.,    3.],
       [   4.,    5.,    6.]])

Using missing and filling values

Some entries may be missing in the dataset we are trying to import. In a previous example, we used a converter to transform an empty string into a float. However, user-defined converters may rapidly become cumbersome to manage.

The genfromtxt function provides two other complementary mechanisms: the missing_values argument is used to recognize missing data and a second argument, filling_values, is used to process these missing data.

missing_values

By default, any empty string is marked as missing. We can also consider more complex strings, such as "N/A" or "???" to represent missing or invalid data. The missing_values argument accepts three kinds of values:

a string or a comma-separated string
This string will be used as the marker for missing data for all the columns.
a sequence of strings
In that case, each item is associated with a column, in order.
a dictionary
Values of the dictionary are strings or sequences of strings. The corresponding keys can be column indices (integers) or column names (strings). In addition, the special key None can be used to define a default applicable to all columns.

filling_values

We know how to recognize missing data, but we still need to provide a value for these missing entries. By default, this value is determined from the expected dtype according to this table:

Expected type   Default
bool            False
int             -1
float           np.nan
complex         np.nan+0j
string          '???'

We can get finer control over the conversion of missing values with the filling_values optional argument. Like missing_values, this argument accepts different kinds of values:

a single value
This will be the default for all columns.
a sequence of values
Each entry will be the default for the corresponding column.
a dictionary
Each key can be a column index or a column name, and the corresponding value should be a single object. We can use the special key None to define a default for all columns.

In the following example, we suppose that the missing values are flagged with "N/A" in the first column and by "???" in the third column. We wish to transform these missing values to 0 if they occur in the first and second column, and to -999 if they occur in the last column:

>>> data = u"N/A, 2, 3\n4, ,???"
>>> kwargs = dict(delimiter=",",
...               dtype=int,
...               names="a,b,c",
...               missing_values={0:"N/A", 'b':" ", 2:"???"},
...               filling_values={0:0, 'b':0, 2:-999})
>>> np.genfromtxt(StringIO(data), **kwargs)
array([(0, 2, 3), (4, 0, -999)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
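
The two simpler forms of filling_values, a single value and a sequence, can be sketched in the same spirit. The sample text is made up, and only empty fields are treated as missing here:

```python
from io import StringIO

import numpy as np

text = "1,,3\n,5,6"

# A single value fills every missing entry.
zeros = np.genfromtxt(StringIO(text), delimiter=",", filling_values=0)

# A sequence gives one fill value per column.
per_column = np.genfromtxt(StringIO(text), delimiter=",",
                           filling_values=(-1, -2, -3))

print(zeros)
print(per_column)
```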

usemask

We may also want to keep track of the occurrence of missing data by constructing a boolean mask, with True entries where data was missing and False otherwise. To do that, we just have to set the optional argument usemask to True (the default is False). The output array will then be a MaskedArray.
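
As a small sketch of usemask, the snippet below marks both empty fields and the string "N/A" as missing, then inspects the mask. The sample text is an assumption for illustration:

```python
from io import StringIO

import numpy as np

text = "1,,3\n4,5,N/A"
masked = np.genfromtxt(StringIO(text), delimiter=",",
                       missing_values="N/A", usemask=True)

# The mask is True exactly where an entry was missing.
mask = masked.mask

# Masked entries are ignored by reductions such as mean().
column_means = masked.mean(axis=0)

print(mask)
print(column_means)
```

Working with the mask keeps the missing entries out of later computations without having to pick a sentinel fill value.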

Shortcut functions

In addition to genfromtxt, the numpy.lib.npyio module provides several convenience functions derived from genfromtxt. These functions work the same way as the original, but they have different default values.

ndfromtxt
Always sets usemask=False. The output is always a standard numpy.ndarray.
mafromtxt
Always sets usemask=True. The output is always a MaskedArray.
recfromtxt
Returns a standard numpy.recarray (if usemask=False) or a MaskedRecords array (if usemask=True). The default dtype is dtype=None, meaning that the types of each column will be automatically determined.
recfromcsv
Like recfromtxt, but with a default delimiter=",".


  • © Copyright 2008-2018, The SciPy community.
  • Last updated on Jul 24, 2018.