Loading and Examining Data¶
Loading a data file¶
The first step in using the Stoner module is to load some data from a measurement.:
from Stoner import Data
d=Data('my_data.txt')
In this example we have loaded data from my_data.txt, which should be in the current directory.
The Stoner.Data class is actually a shorthand for importing the Stoner.core.data.Data
class which in turn is a superset of many of the classes in the Stoner package. This includes code to automatically
detect the format of many of the measurement files that we use in our research.
The native file format for the Stoner package is known as the TDI 1.5 format: a tab-delimited text file that stores arbitrary metadata and a single 2D data set. It closely matches the DataFile class of the Stoner.Core module.
Note
Data will also read a related text format, where the first column of the first line contains the string TDI Format=Text 1.0, which is produced by some of the LabVIEW rigs used by the Device Materials Group in Cambridge.
The Various Flavours of the DataFile Class¶
To support a variety of different input file formats, the Stoner package provides a slew of subclasses of the base DataFile class. Each subclass typically provides its own version of the DataFile._load() method that understands how to read the relevant file.
Base Classes and Generic Formats¶
DataFile
Tagged Data Interchange Format 1.5 – the default format produced by the LabVIEW measurement rigs in the CM Physics group in Leeds
Stoner.formats.generic.CSVFile
Reads a generic comma separated value file. The Stoner.FileFormats.CSVFile.load() routine and the constructor take four additional parameters. In addition to the two extra arguments used for the BigBlue variant, a further two parameters specify the delimiters for the data and header rows. Stoner.FileFormats.CSVFile also offers a save method to allow data to be saved in a simple delimited text format (see Section Saving Data for details).
Stoner.formats.generic.JustNumbersFile
This is a subclass of CSVFile dedicated for reading a text file that consists purely of rows of numbers with no header or metadata.
Stoner.formats.instruments.SPCFile
Loads a Raman scan file (.spc format) produced by the Renishaw and Horiba Raman spectrometers. This may also work for other instruments that produce spc files, but has not been extensively tested.
Stoner.formats.generic.TDMSFile
Loads a file saved in the National Instruments TDMS format.
Stoner.formats.instruments.QDFile
Loads data from various Quantum Design instruments, including PPMS, MPMS and SQUID VSM.
Stoner.formats.simulations.OVFFile
OVF files are output by a variety of micromagnetics simulators. The standard was designed for the OOMMF code. This class will handle rectangular mesh files with text or binary formats, versions 1.0 and 2.0.
Classes for Specific Instruments (Mainly ones owned by the CM Physics Group in Leeds)¶
Stoner.formats.instruments.VSMFile
The text files produced by the group’s Oxford Instruments VSM
Stoner.formats.rigs.BigBlueFile
Data files produced by VB code running on the Big Blue cryostat. The Stoner.FileFormats.BigBlueFile versions of the DataFile.load() method and the DataFile constructor take two additional parameters that specify the row on which the column headers will be found and the row on which the data starts. This class is now largely obsolete as no new data files have been produced in this format for more than 5 years.
Stoner.formats.instruments.XRDFile
Loads a scan file produced by Arkengarthdale, the group’s Bruker D8 XRD machine.
Stoner.formats.rigs.MokeFile
Loads a file from the Leeds Condensed Matter Physics MOKE system as written by its old VB6 code. Like the BigBlueFile, this format is largely obsolete now.
Stoner.formats.rigs.FmokeFile
Loads a file from Dan Allwood’s Focussed MOKE System in Sheffield.
Stoner.formats.instruments.LSTemperatureFile
Loads and saves data from Lakeshore Inc.’s .340 Temperature Calibration file format.
Classes for Instruments at Major Facilities¶
Stoner.formats.facilities.BNLFile
Loads a SPEC file from Brookhaven (so far only tested on u4b files but may well work with other synchrotron data). Produces metadata Snumber: Scan number, Stype: Type of scan, Sdatetime: date time stamp for the measurement, Smotor: z motor position.
Stoner.formats.facilities.OpenGDAFile
Reads an ASCII scan file generated by OpenGDA, a software suite used at synchrotrons such as Diamond.
Stoner.formats.facilities.RasorFile
Simply an alias for OpenGDAFile used for the RASOR instrument on I10 at Diamond.
Stoner.formats.facilities.SNSFile
Reads the ASCII export file format from QuickNXS, the data reduction software used on the BL-4A Beam Line at the SNS in Oak Ridge.
Stoner.formats.facilities.MDAASCIIFile
This class will read some variants of the output of mda2ascii as used at the APS in Argonne.
Stoner.HDF5.SLS_STXMFile
This reads an HDF file from the Swiss Light Source Pollux beam line as a data file (as opposed to an image)
Stoner.formats.maximus.MAximusSpectra
This reads a .hdr/.xsp spectra file from the Maximus STXM beamline at Bessy.
These classes can be used directly to load data from the appropriate format.:
import Stoner
d=Stoner.Data.load("my_data.txt")
v=Stoner.Data.load("my_VSM_data.fld", filetype=Stoner.formats.instruments.VSMFile)
c=Stoner.Data.load('data.csv',1,0,',',',', filetype=Stoner.formats.generic.CSVFile)
Note
Data.load() is a class method, meaning it creates (and returns) a new instance of the Data class. Most of the methods of Data objects will return a copy of the modified instance, allowing several methods to be chained together into a single operation.
Sometimes you won’t know exactly which subclass of Data is the one to use. Unfortunately, there is no sure-fire way of telling, but Data.load() will do the best it can and will try all of the subclasses in memory in turn to see if one will load the file without throwing an error. If this succeeds then the actual type of file that worked is stored in the metadata of the loaded file.
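The trial-and-error strategy described above can be sketched in plain Python. This is only an illustration of the idea: the two loader functions and auto_load below are hypothetical stand-ins written for this example and are not part of the Stoner API.

```python
def load_numbers(text):
    """Hypothetical loader: accepts only whitespace-separated numbers."""
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def load_csv_numbers(text):
    """Hypothetical loader: accepts only comma-separated numbers."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if "," not in line:
            raise ValueError("not a comma separated line")
        rows.append([float(tok) for tok in line.split(",")])
    return rows


def auto_load(text, loaders):
    """Try each loader in turn; return (loader name, data) for the first success."""
    for loader in loaders:
        try:
            return loader.__name__, loader(text)
        except ValueError:
            continue  # this loader rejected the input, so try the next one
    raise ValueError("no loader recognised the data")


name, data = auto_load("1,2\n3,4", [load_numbers, load_csv_numbers])
```

The sketch only works because each hypothetical loader raises an error when it is handed data it does not understand, which is exactly the assumption that automatic loading relies on.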
Warning
The automatic loading assumes that each load routine does sufficient sanity checking that it will throw an error if it gets bad data. Whilst one might wish this was always true, it relies on whoever writes the load method to make sure of this! If you want to stop the automatic guessing from happening, use the auto_load=False keyword in the load() method, or provide an explicit filetype parameter.
You can also specify a filetype parameter to the Data.load() method or directly to the Stoner.Data constructor, as illustrated below, to load a simple text file of unlabelled numbers:
from Stoner import Data
d=Data("numbers.txt",filetype="JustNumbers",column_headers=["z","I","dI"],setas="xye")
If filetype is a string, it can match either the complete name of the subclass to use to load the file, or part of it.
Loading Data from a string or iterable object¶
In some circumstances you may have a string representation of a DataFile object and want to transform this into a proper DataFile object. This might be, for example, from transmitting the data over a network connection or receiving it from another program. In these situations the left shift operator << can be used.:
data=Stoner.Core.DataFile() << string_of_data
data=Stoner.Core.DataFile() << iterable_object
The second example would allow any object that can be iterated (i.e. has a next() method that returns lines of the data file) to be used as the source of the data. The DataFile() call creates an empty object so that the left shift operator calls the method in DataFile to read the data in. It also determines the type of the object data. This also provides an alternative syntax for reading a file from disk:
data=Stoner.Core.DataFile()<<open("File on Disk.txt")
Constructing DataFiles from Scratch¶
The DataFile constructor, DataFile.__init__(), will try its best to guess what your intention was in constructing a new instance of a DataFile. First of all, a constructor function is called based on the number of positional arguments that were passed:
Single Argument Constructor¶
A single argument passed to DataFile.__init__() is interpreted as follows:
A string is assumed to be a filename, and therefore a DataFile is created by loading a file.
A 2D numpy array is taken as the numeric data for the new DataFile
A list or other iterable of strings is assumed to represent the column headers
A list or other iterable of numpy arrays is assumed to represent a sequence of columns
A dictionary with string keys and numpy array values of equal length is taken as a set of columns whose header labels are the keys of the dictionary.
A pandas.DataFrame is used to provide data, column headers and, if it has a suitable multi-level column index, the Stoner.Data.setas attribute.
Otherwise a dictionary is treated as the metadata for the new DataFile instance.
Two Argument Constructor¶
If the second argument is a dictionary or a list or other iterable of strings, then the first argument is interpreted as for the single argument constructor above, and then the second argument is interpreted in the same way. (Internally this is done by calling the single argument constructor function twice, once for each argument.)
Many Argument Constructor¶
If there are more than two arguments and they are numpy arrays, then they are used as the columns of the new DataFile.
Keyword Arguments in Constructor¶
After the positional arguments are dealt with, any keywords that match attributes of the DataFile are used to set the corresponding attribute.
Examples¶
# Load a file from disc, set the setas attribute and column headers
d=Data("filename.txt",setas="xy", column_headers=["X-Data","Y-Data"])
# Create a DataFile from a dictionary:
d=Data({"Temperature":temp_array,"Resistance":res_data})
# The same, but set metadata too
d=Data({"Temperature":temp_array,"Resistance":res_data},{"User":"Fred","Sample":"X234_a","Field":2.4})
# From a pandas DataFrame
df=pd.DataFrame(...)
d=Data(df)
Examining and Basic Manipulations of Data¶
Data Structure¶
Data, Column headers and metadata¶
Having loaded some data, the next stage might be to take a look at it. Internally, data is represented as a 2D numpy masked array of floating point numbers, along with a list of column headers and a dictionary that keeps the metadata and also keeps track of the expected type of the metadata (ie the meta-metadata). These can be accessed like so:
d.data
d.column_headers
d.metadata
Masked Data and Why You Care¶
Masked data arrays differ from normal data arrays in that they include an option to mask or hide individual data elements. This can be useful to temporarily discount parts of your data when, for example, fitting a curve, calculating a mean value or plotting some data. One could, of course, simply ignore the masking option and use the data as is; however, masking does have a number of practical uses.
The data mask can be accessed via the DataFile.mask attribute of DataFile:
import numpy
import numpy.ma as ma
print(d.mask)
d.mask=False
d.mask=ma.nomask
d.mask=numpy.array([[True, True, False,...False],...,[False,True,...True]])
d.mask=lambda x: x[0]<50
d.mask=lambda x:[y<50 for y in x]
The first two lines are simply the import statements for numpy and for numpy masked arrays, the latter in order to get the nomask symbol. The third line will simply print the current mask. The next two examples will unmask all the data, i.e. make the values visible and usable. The next example illustrates using a numpy array of booleans to set the mask: every element in the mask array that evaluates as a boolean True will be masked and every False value unmasked. So far the semantics here are the same as if one had accessed the mask directly on the data via d.data.mask, but the final two examples illustrate an extension that setting the DataFile mask attribute allows. If you pass a callable object to the mask attribute it will be executed, passing each row of the data array to the user supplied function as a numpy array. The user supplied function can then either return a single boolean value, in which case it will be used to mask the entire row, or a list of boolean values to mask individual cells in the current row.
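The two kinds of callable described above can be illustrated with a short sketch. It uses plain Python lists rather than numpy arrays, and build_mask is a hypothetical helper written for this example, not a Stoner function.

```python
def build_mask(rows, func):
    """Apply func to each row: a single bool masks the whole row,
    a list of bools masks the row cell by cell."""
    mask = []
    for row in rows:
        result = func(row)
        if isinstance(result, bool):
            mask.append([result] * len(row))
        else:
            mask.append([bool(r) for r in result])
    return mask


rows = [[10, 60], [70, 20]]
row_mask = build_mask(rows, lambda x: x[0] < 50)             # whole-row test
cell_mask = build_mask(rows, lambda x: [y < 50 for y in x])  # per-cell test
```

Here row_mask hides the whole first row because its first value is below 50, while cell_mask hides only the individual cells that are below 50.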
By default, when the DataFile object is printed or saved, data values that have been masked are replaced with a “fill” value of 10^20.
Warning
This is somewhat dangerous behaviour. Be very careful to remove a mask before saving data if there is any chance that you will need the masked data values again later!
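The fill behaviour can be seen with a plain numpy masked array. This sketch uses numpy directly rather than a DataFile, and assumes that numpy's default fill value for floating point data is the one in play.

```python
import numpy.ma as ma

values = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

filled = values.filled()   # masked element is replaced by the fill value
original = values.data     # the underlying values are still held in memory
```

Once only the filled values have been written out, the original value behind the mask (2.0 here) cannot be recovered from the output.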
Note
Strictly speaking, the DataFile.data attribute is a sub-class of the numpy masked array, DataArray. This works the same way as a masked array, but supports some additional magic indexing and attributes discussed below.
Marking Columns as Dimensions: the magic setas attribute¶
Often in a calculation with some data you will be using one column for ‘x’ values and one or more ‘y’ columns, or indeed have ‘z’ column data and uncertainties in all of these (conventionally we call these ‘d’, ‘e’ and ‘f’ columns so that ‘e’ data is the error in the y data). DataFile has a concept of marking a column as containing such data and will then use these by default in many methods when appropriate to have ‘x’ and ‘y’ data.
In addition to identifying columns as ‘x’,’y’, or ‘z’, for data that describes a vector field, you can mark the columns as containing ‘u’, ‘v’, ‘w’ data where (u,v,w) is the vector value at the point (x,y,z). There’s no support at present for uncertainties in (u,v,w) being marked.
Setting Column Types¶
To set which columns contain ‘x’, ‘y’ etc. data use the DataFile.setas attribute. This attribute can take a list of single character strings from the set ‘x’, ‘y’, ‘z’, ‘d’, ‘e’, ‘f’, ‘u’, ‘v’, ‘w’ or ‘.’ where each element of the list refers to the columns of data in order. To specify that a column has unmarked data use the ‘.’ string. The string ‘-’ can also be used: this indicates that the current assignment is to be left as is.
Alternatively, you can pass DataFile.setas a string. In the simplest case, the string is just read in the same way that the list would have been: each character corresponds to one column. However, if the string contains an integer, then the next non-numeric character will be repeated that many times, so the following are all equivalent:
d.setas="3.xy"
d.setas="...xy"
d.setas=['.','.','.','x','y']
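The repeat-count rule for setas strings can be sketched with a regular expression. The function expand_setas below is a hypothetical helper that only illustrates the rule as described; it is not how Stoner necessarily implements it.

```python
import re


def expand_setas(spec):
    """Expand counts such as '3.' into '...' so that each character
    of the result maps onto exactly one column."""
    return re.sub(r"(\d+)(\D)", lambda m: m.group(2) * int(m.group(1)), spec)


expanded = expand_setas("3.xy")
as_list = list(expanded)
```

After expansion, the string form and the list form of the assignment carry the same information, one character per column.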
There are still more ways of setting column types with the DataFile.setas attribute:
d.setas[3]="x"
d.setas["x"]=3
d.setas("3.xy")
d.setas(['.','.','.','x','y'])
d.setas(x=3,y=4)
All achieve the same effect of setting the same columns as containing x and y data.
Once you have identified columns for the various types, you also have access to utility attributes to access those columns:
d.setas="3.xye"
d.x == d.column(3)
d.y == d.column(4)
d.e == d.column(5)
Note that if DataFile.setas is not set then attempting to use the quick access column attributes will result in an exception. Once the DataFile.setas attribute is set, a further set of virtual or derived column attributes become available.:
d.setas="xyz"
d.r # The magnitude of the point x,y,z
d.q # The angle in the x-y plane relative to the x axis
d.p # The angle relative to the z axis
d.setas="xyz..uvw"
d.r # the magnitude of (u,v,w)
d.q # The angle in the x-y plane of (u,v) relative to +x
d.p # The angle of (u,v,w) relative to +z
Where fewer than 3 or 6 dimensions are specified, these virtual columns fall back to working with the appropriate reduced number of dimensions.
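The virtual columns correspond to a conversion to polar co-ordinates, which can be sketched for a single point as follows. The function polar is a hypothetical illustration; the use of radians and the exact angle conventions are assumptions and may differ from what Stoner actually returns.

```python
import math


def polar(x, y, z=0.0):
    """Return (r, q, p): magnitude, angle in the x-y plane from +x,
    and angle measured from the +z axis (all angles in radians)."""
    r = math.sqrt(x * x + y * y + z * z)
    q = math.atan2(y, x)
    p = math.acos(z / r) if r else 0.0
    return r, q, p


r, q, p = polar(1.0, 1.0, 0.0)
```

With the default z of zero, the same function covers the reduced two dimensional case mentioned above.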
There are some more convenience ways to set which columns to use as x,y,z etc.:
d.setas={"x":0,"y":"Y Column title"}
d.x=0
d.y="Y Column title"
d.setas["Temperature"]="y"
In each of these cases, the DataFile will try to work out what you intended to achieve for maximum flexibility and convenience when writing code. However, it can be fooled if one of your columns is called ‘x’ or ‘y’!
Reading Column Types¶
The normal representation of DataFile.setas is as a list, but it also has a string conversion available. You can also find which column has been assigned as ‘x’, ‘y’ etc. by treating the DataFile.setas as a dictionary:
d.column_headers=["One","Two","three","Four"]
d.setas="xy.z"
print(list(d.setas)) #['x','y','.','z']
print(d.setas) #'xy.z'
print(d.setas['x']) # "One"
print(d.setas["#x"]) # 0
Note that the DataFile.setas attribute supports reading keys that are either the single letter, to get the name of the column, or the letter preceded by a # character, to get the number of the column.
Alternatively, and equivalently, you can access the column indexes via attributes of DataFile.setas:
d.setas.has_xcol # True
d.setas.has_ucol # False
d.setas.ycol # [1]
d.setas.xcol # 0
Implied Column Assignments¶
If you do not specify the column types via the setas attribute, then DataFile will try to guess sensible column assignments based on the number of columns in your data file. These default assignments are only made at the point at which the DataFile.setas attribute is consulted. The default assignments are:
Number of Columns | Assignments |
---|---|
2 | x, y |
3 | x, y, e |
4 | x, d, y, e |
5 | x, y, u, v, w |
6 | x, y, z, u, v, w |
Swapping and Rotating Column Assignments¶
Finally, if the DataFile.setas attribute has been set with x, y (and z) columns then these assignments can be swapped around by the invert operator ~. This either swaps x and y with their associated error bars for 2-D datasets, or rotates x to y, y to z and z to x (again with their associated error bars).:
d.setas="xye"
print(d.setas) # ['x','y','e']
e=~d
print(e.setas) # ['y','x','d']
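The swap and rotate rules can be written out as two small lookup tables. The function invert below is a hypothetical sketch operating on a plain list of letters; choosing between the two tables by looking for a ‘z’ column is an assumption made for this example.

```python
SWAP_2D = {"x": "y", "y": "x", "d": "e", "e": "d"}
ROTATE_3D = {"x": "y", "y": "z", "z": "x", "d": "e", "e": "f", "f": "d"}


def invert(setas):
    """Swap x/y (2-D) or rotate x->y->z->x (3-D), carrying the
    matching error columns along; other letters are left unchanged."""
    mapping = ROTATE_3D if "z" in setas else SWAP_2D
    return [mapping.get(letter, letter) for letter in setas]


swapped = invert(["x", "y", "e"])
rotated = invert(["x", "d", "y", "e", "z", "f"])
```

The 2-D case reproduces the assignment shown in the example above, where the error on y becomes the error on x.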
Printing the Complete DataFile¶
If the optional tabulate package is installed, then a pretty formatted representation of the DataFile can be generated using:
print(repr(d))
This will give something like:
================================ ============= ============ ===========
TDI Format 1.5 Temperature Resistance Column 2
index 0 1 2
================================ ============= ============ ===========
Stoner.class{String}= b"'Data'" 291.6 4.7878 0.04687595
Measurement Title{String}= b"''" 291.6 4.78736 1.125022
Timestamp{String}= b"''" 291.6 4.78788 2.187542
User{String}= b"''" 291.6 4.78758 3.250062
TDI Format{Double Float}= b'1.5' 291.6 4.78782 4.312583
Loaded as{String}= b"'DataFile'" 291.6 4.7878 5.375103
291.6 4.78748 6.453249
291.6 4.7878 7.515769
291.6 4.78789 8.57829
If more columns exist in the DataFile then the repr method attempts to pick ‘interesting’ columns. The algorithm will prioritise showing columns that have been assigned a meaning with the DataFile.setas attribute. If there is space for further columns, then the last column will be shown and other columns that follow from any that are marked in DataFile.setas. If no columns are marked as interesting, then the first n-2 columns and the last column will be shown.:
==================== ===================== ===================== ==================== =================== ============= ==============
TDI Format 1.5 Magnetic Field (Oe) Moment (emu) M. Std. Err. (emu) Transport Action .... Map 16
index 3 (x) 4 (y) 5 6 71
==================== ===================== ===================== ==================== =================== ============= ==============
Stoner.class{String} 0.990880310535431 -5.83640043200129e-06 8.83351337362955e-09 1.0 nan
= b"'Data'" 0.965473115444183 -5.82915649064088e-06 1.23199616933718e-08 1.0 ... nan
Title,{String}= 1.16873073577881 -5.82160118818014e-06 8.92093033864879e-09 1.0 ... nan
b'None' 1.13061988353729 -5.8201045325249e-06 1.07142611311115e-08 1.0 ... nan
Fileopentime{String} 1.33387744426727 -5.83922456812945e-06 1.02539653165381e-08 1.0 ... nan
= b"'3540392668.062 1.49902403354645 -5.81961870971478e-06 1.04490646832536e-08 1.0 ... nan
The table header lists the column titles, numerical indices for each column and the assignment in the DataFile.setas attribute.
If the file has more than 256 rows, then the first 128 rows and last 128 rows will be shown with a row of … to show the split.
Many of the methods in the Stoner package return a copy of the current Stoner.Data object, and in ipython consoles and jupyter notebooks these will get printed out using the table formats above. This may be more than is required, in which case you can set options in the Stoner package to control the output format.:
from Stoner import Options
Options.short_data_repr = True # or "short_repr" for short representations of all objects.
print(repr(d))
>>> TDI_Format_RT.txt(<class 'Stoner.Data'>) of shape (1676, 3) (xy.) and 6 items of metadata
Working with columns of data¶
This is all very well, but often you want to examine a particular column of data or a particular row:
d.column(0)
d.column('Temperature')
d.column(['Temperature',0])
In the first example, the first column of numeric data will be returned. In the second example, the column headers will first be checked for one labelled exactly Temperature and then, if no column is found, the column headers will be searched using Temperature as a regular expression. This would then match Temperature (K) or Sample Temperature. The third example results in a 2 dimensional numpy array containing two columns in the order that they appear in the list (i.e. not the order that they are in the data file). For completeness, the DataFile.column() method also allows one to pass slices to select columns and should do the expected thing.
There are a couple of convenient shortcuts. Firstly, the floor division operator // is an alias for the DataFile.column() method. Secondly, where a column header is not the same as the name of any of the attributes of the DataFile object, the column can be accessed as an attribute named after its header:
d//"Temperature"
d.Temperature
d.column('Temperature')
all return the same data.
Whenever the Stoner package needs to refer to a column of data, you can specify it in a number of ways:
- As an integer, where the first column on the left is index 0.
- As a string. If the string matches a column header exactly then the index of that column is returned. If the string fails to match any column header it is compiled as a regular expression and then that is tried as a match. If multiple columns match then only the first is returned.
- As a regular expression directly. This is similar to the case above with a string, but allows you to pass a pre-compiled regular expression in directly with any extra options (like case insensitivity flags) set.
- As a slice object (e.g. 0:10:2), which is expanded to a list of integers.
- As a list of any of the above, in which case the column finding routine is called recursively in turn for each element of the list and the final result would be to use a list of column indices.
For example:
import re
col=re.compile('^temp',re.IGNORECASE)
d.column(col)
Working with complete rows of data¶
Rows don’t have labels, so are accessed directly by number:
d[1]
d[1:4]
The second example uses a slice to pull out more than one row. This syntax also supports the full slice syntax which allows one to, for example, decimate the rows, or directly pull out the last few rows in the file.
Special Magic When Working with Subsets of Data¶
As mentioned above, the data in a DataFile is a special subclass of numpy’s Masked Array, DataArray. A DataArray understands that columns can have names and can be assigned to hold specific types of data (x, y, z values etc.). In fact, the logic used for the column names and setas attribute in a DataFile is actually supplied by the DataArray. When you index a DataFile or its data, the resulting data remembers its column names and assignments and these can be used directly:
r=d[1:4]
print(r.x,r.y)
In addition to the column assignments, DataArray also keeps track of the row numbers and makes them available via the i attribute.:
d.data.i # [0,1,2,3...,len(d)]
r=d[10]
r.i # 10
r.column_headers
You can reset the row numbers by assigning a value to the i attribute.
A single column of data also gains a .name attribute that matches its column_header:
c=d[:,0]
c.name == c.column_headers[0] #True
Manipulating the metadata¶
What happens if you use a string and not a number in the above examples?:
d['User']
In this case, it is assumed that you meant the metadata with key User. To get a list of possible keys in the metadata, you can do:
d.dir()
d.dir(r'Option\:.*')
In the first case, all of the keys will be returned in a list. In the second, only keys matching the pattern will be returned, i.e. all keys containing Option:. For compatibility with normal Python semantics, DataFile.keys() is synonymous with DataFile.dir().
If the string you supply to get the metadata item does not exactly match an item of metadata, then it is interpreted as a regular expression to try and match against all the items of metadata. In this case, rather than returning a single item, all of the matching metadata is returned as a dictionary. Passing a compiled regular expression as the item name also has the same effect. This is useful if the regular expression you want to match is also an exact match to one particular metadata name.
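The lookup rules for metadata keys can be sketched with an ordinary dictionary. The function get_metadata is a hypothetical helper written for this example; using a search anywhere in the key name (rather than a match at the start) is an assumption.

```python
import re


def get_metadata(metadata, key):
    """Exact string key returns one value; otherwise the key is used as a
    regular expression and every matching item is returned in a dict."""
    if isinstance(key, str) and key in metadata:
        return metadata[key]
    pattern = re.compile(key)  # accepts a string or an already compiled pattern
    matches = {name: value for name, value in metadata.items()
               if pattern.search(name)}
    if not matches:
        raise KeyError(key)
    return matches


meta = {"User": "Fred", "Option:Speed": 5, "Option:Gain": 2}
single = get_metadata(meta, "User")               # exact match
several = get_metadata(meta, "Option")            # falls back to a pattern
forced = get_metadata(meta, re.compile("User"))   # compiled pattern gives a dict
```

The last call shows why passing a compiled pattern is useful: the text User is an exact key, but compiling it forces the dictionary form of the result.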
We mentioned above that the metadata also keeps a note of the expected type of the data. You can get at the metadata type for a particular key like this:
d.metadata.type('User')
To get a dictionary of all of the types associated with each key you could do:
dict(zip(d.dir(),d.metadata.type(d.dir())))
but an easier way would be to use the typeHintedDict.types attribute:
d.metadata.types
More on Indexing the data¶
There are a number of other forms of indexing supported for DataFile objects.:
d[10,0]
d[0:10,0]
d[10,'Temp']
d[0:10,['Voltage','Temp']]
The first variant just returns the data in the 11th row, first column (remember indexing starts at 0). The second variant returns the first 10 values in the first column. The third variant demonstrates that columns can be indexed by string as well as number, and the last variant demonstrates indexing multiple rows and columns, in this case the first 10 values of the Voltage and Temp columns.
You might think of the data as being a list of records, where each column is a field in the record. Numpy supports this type of structured record view of data and the DataFile object provides the DataFile.records attribute to do this. This read-only attribute just provides an alternative view of the same data.:
d.records
Finally, the DataFile.dict_records attribute does the same thing, but presents the data as an array of dictionaries, where the keys are the column names and each dictionary represents a single row.
Selecting Individual rows and columns of data¶
Many of the functions in the Stoner module index columns by searching the column headings. If one wishes to find the numeric index of a column then the DataFile.find_col() method can be used:
import re
index=d.find_col(1)
index=d.find_col('Temperature')
index=d.find_col('Temp.*')
index=d.find_col('1')
index=d.find_col(slice(1,10,2))
index=d.find_col(['Temperature',2,'Resistance'])
index=d.find_col(re.compile(r"^[A-Z]"))
DataFile.find_col() takes a number of different forms. If the argument is an integer then it returns (trivially) the same integer. A string argument is first checked to see if it exactly matches one of the column headers, in which case the number of the matching column heading is returned. If no exact match is found then a regular expression search is carried out on the column headings. In both cases, only the first match is returned. If the string still doesn’t match, then the string is checked to see if it can be cast to an integer, in which case the integer value is used.
The final three examples given above all return a list of indices. The first uses a slice construct, in which case the result is trivially the same as the slice itself, and the second passes a list of column headers to look for. The final example uses a compiled regular expression. Unlike passing a string which contains a regular expression, passing a compiled regular expression will return a list of all columns that match. This distinction allows you to use a unique partial string to match just one column, but if you really want all possible columns that would match the pattern, then you can compile the regular expression and pass that instead.
This is the function that is used internally by DataFile.column(), DataFile.search() etc. and for this reason the trivial integer and slice forms are implemented to allow these other functions to work with multiple columns.
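The column lookup rules described above can be gathered into one short function. This is a hypothetical re-implementation over a plain list of headers, written only to make the order of the checks concrete; it is not the Stoner source code.

```python
import re

PatternType = type(re.compile(""))


def find_col(headers, col):
    """Resolve col to a column index, or a list of indices."""
    if isinstance(col, int):
        return col                                   # trivial integer form
    if isinstance(col, slice):
        return list(range(*col.indices(len(headers))))
    if isinstance(col, PatternType):                 # compiled: all matches
        return [i for i, h in enumerate(headers) if col.search(h)]
    if isinstance(col, (list, tuple)):               # recurse over each element
        return [find_col(headers, c) for c in col]
    if col in headers:                               # exact string match first
        return headers.index(col)
    for i, header in enumerate(headers):             # then a regular expression
        if re.search(col, header):
            return i                                 # only the first match
    return int(col)                                  # finally try an integer


headers = ["Temperature (K)", "Resistance", "Sample Temperature"]
first = find_col(headers, "Temp")
every = find_col(headers, re.compile("Temp"))
```

The string "Temp" resolves to the first matching column only, whereas the compiled pattern returns every matching column.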
Sometimes you may want to iterate over all of the rows or columns in a data set. This can be done quite easily:
for row in d.rows():
    print(row)
for column in d.columns():
    print(column)
# ......
If there is no mask set, then the first example could also have been written more compactly as:
for row in d:
    print(row)
# ......
Note
DataFile.rows() and DataFile.columns() both take an optional parameter not_masked. If this is True then these iterator methods will skip over any rows/columns with masked out data values. When iterating over the DataFile instance directly the masked rows are skipped over.
Searching, sectioning and filtering the data¶
Searching¶
In many cases you do not know which rows in the data file are of interest - in this case you want to search the data.:
d.search('Temperature',4.2,accuracy=0.01)
d.search('Temperature',4.2,['Temperature','Resistance'])
d.search('Temperature',lambda x,y: x>10 and x<100)
d.search('Temperature',lambda x,y: x>10 and
x<1000 and y[1]<1000,['Temperature','Resistance'])
The general form is DataFile.search(<search column>,<search term>[,<list of return columns>])
The first example will return all the rows where the value of the Temperature column is 4.2. The second example is the same, but only returns the values from the Temperature and Resistance columns. The rules for selecting the columns are the same as for the DataFile.column method above: strings are matched against column headers and integers select column by number.
The third and fourth examples above demonstrate the use of a function as the
search value. This allows quite complex search criteria to be used. The function
passed to the search routine should take two parameters – a floating point
number and a numpy array of floating point numbers and should return either
True or False. The function is evaluated for each row in the
data file and is passed the value corresponding to the search column as the
first parameter while the second parameter contains a list of all of the values
in the row to be returned. If the search function returns True, then the row is
returned, otherwise it isn’t. In the last example, the final parameter can
either be a list of columns or a single column. The rules for indexing columns
are the same as used for the DataFile.find_col()
method.
The ‘accuracy’ keyword parameter sets the level of accuracy to accept when testing equality or ranges (i.e. when the value parameter is a float or a tuple). This avoids the problem of rounding errors with floating point arithmetic. The default accuracy is 0.0.
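A simplified version of the search logic, covering a numeric search value with an accuracy and a callable search value, might look like the sketch below. It works on a plain list of rows with an integer column index, and the way accuracy is applied (as an absolute tolerance) is an assumption.

```python
def search(rows, col, value, accuracy=0.0):
    """Return the rows whose entry in column col matches value.

    value may be a number (equality within accuracy) or a function taking
    the column value and the whole row and returning True or False.
    """
    if callable(value):
        return [row for row in rows if value(row[col], row)]
    return [row for row in rows if abs(row[col] - value) <= accuracy]


rows = [[4.195, 10.0], [4.2, 11.0], [77.0, 250.0]]
near = search(rows, 0, 4.2, accuracy=0.01)
exact = search(rows, 0, 4.2)
ranged = search(rows, 0, lambda x, y: x > 10 and y[1] < 1000)
```

With the default accuracy of 0.0 only the exactly equal row is returned, which is why a small tolerance is usually wanted for measured data.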
Filtering¶
Sometimes you may not want to get the rows of data that you are looking for as a separate array, but merely mark them for inclusion (or exclusion) from subsequent operations. This is where the masked array (see ‘Masked Data and Why You Care’) comes into its own. To select which rows of data are masked off, use the DataFile.filter() method.:
d.filter(lambda r:r[0]>5)
d.filter(lambda r:r[0]>5,['Temp'])
With just a single argument, the filter method takes a complete row at a time and passes it to the first argument, expecting to get a boolean response (or list of booleans equal in length to the number of columns). With a second argument, as in the second example, you can specify which columns are passed to the filtering function and in what order. The second argument must be a list of things which can be used to index a column (i.e. strings, integers, regular expressions).
Selecting¶
A very powerful way to get at just the data rows you want is to make use of the DataFile.select() method. This offers a simple way to query which rows have columns matching some criteria.:
d.select(Temp=250)
d.select(Temp__ge=150)
d.select(T1__lt=4,T2__lt=5).select(Res__between=(100,200))
The general form is to provide keyword arguments whose names are something that can be used to index a column, optionally followed by a double
underscore and an operator. Where more than one keyword argument is supplied, the results of testing each row are logically
ORed. Chaining together two separate calls to select will, however, logically AND the two tests. So, in the examples above,
the first line assumes an implicit equality test and gives only those rows with a column Temp equal to 250. The second line gives an
explicit greater than or equal to test for the same column. The third line first selects those rows that have column T1 less than 4 or
column T2 less than 5, and then from those selects the rows which have a column Res between 100 and 200. The full list of operators is given in
Stoner.Folders.baseFolder.select()
.
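The OR-within-a-call and AND-across-calls behaviour described above can be sketched with numpy boolean arrays. The column values are invented for illustration, and the between test is assumed here to include both bounds, which this section does not state.

```python
import numpy as np

# Hypothetical columns standing in for T1, T2 and Res
t1 = np.array([3.0, 5.0, 6.0, 3.5])
t2 = np.array([6.0, 4.0, 7.0, 4.5])
res = np.array([150.0, 250.0, 120.0, 180.0])

# One call with two keyword tests: the tests are ORed together
first_call = (t1 < 4) | (t2 < 5)

# A chained second call only sees the surviving rows, which amounts to an AND
second_call = (res >= 100) & (res <= 200)
selected = first_call & second_call

print(np.flatnonzero(selected))
```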
Sectioning¶
Another option is to construct a new DataFile object from a section of the data - this is particularly useful where the DataFile represents data corresponding to a set of (x,y,z) points. For this case the DataFile.section() method can be used:
d.setas="x..y.z."
slab=d.section(x=5.2)
line=d.section(x=4.7,z=2)
thick_slab=d.section(z=(5.0,6.0))
arbitrary=d.section(r=lambda x,y,z:3*x-2*y+z-4==0)
After the x, y, z data columns are identified, the DataFile.section()
method works with
‘x’, ‘y’ and ‘z’ keyword arguments which are then used to search for matching data rows (the values passed to
these keyword arguments follow the same rules as the DataFile.search()
method).
A final way of searching data is to look for the closest row to a given value. For this the eponymous method may be used:
r=d.closest(10.3,xcol="Search Col")
r=d.closest(10.3)
If the xcol parameter is not supplied, the value from the DataFile.setas
attribute is used. Since the returned row
is an instance of the DataArray
that has been taken from the original data, it will know what row number it was and
will make that available via its i attribute.
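Conceptually, finding the closest row amounts to minimising the absolute difference in the search column. A minimal numpy sketch follows, with invented values and no claim about how DataFile.closest() is implemented.

```python
import numpy as np

x = np.array([2.0, 7.5, 10.0, 10.9, 15.0])  # hypothetical search column
target = 10.3

i = int(np.argmin(np.abs(x - target)))  # row number of the nearest value
closest_value = x[i]
print(i, closest_value)
```

Here i plays the same role as the i attribute of the returned row: it records where in the original data the match was found.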
Find out more about the data¶
Another question you might want to ask is: what are all the unique values of data in a given column (or set of columns)? The Python numpy package has a function to do this and we have a direct pass through from the DataFile object for this:
d.unique('Temp')
d.unique(column,return_index=False, return_inverse=False)
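The two optional keywords are passed through to the numpy routine: return_index additionally returns the
indices of the first occurrence of each unique value, and return_inverse additionally returns the indices that rebuild the original values from the unique ones. The
column is specified in the same way as for the DataFile.column()
method.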
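The effect of the two keywords is easiest to see by calling numpy.unique directly on a small array (the temperatures below are invented):

```python
import numpy as np

temps = np.array([250, 100, 250, 300, 100])

values, first_index, inverse = np.unique(
    temps, return_index=True, return_inverse=True
)
# values      -> [100 250 300]   sorted unique values
# first_index -> [1 0 3]         where each unique value first appears
# inverse     -> [1 0 1 2 0]     position of each original entry in values

rebuilt = values[inverse]  # reconstructs the original array
```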
Copying Data¶
One of the characteristics of Python that can confuse those used to other
programming languages is that assignment and argument passing are by reference
and not by value. This can lead to unexpected results, as you can end up modifying variables
you were not expecting to! To help with creating genuine copies of data, Python provides the copy module.
Whilst this works with DataFile objects, for convenience, the DataFile.clone
attribute is
provided to make a deep copy of a DataFile object.
Note
This is an attribute, not a method, so there are no brackets here!
t=d.clone
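The difference between an alias, a shallow copy and a deep copy can be seen with the standard copy module alone. The dictionary below is only a stand-in for an object holding data and metadata; it is not a DataFile.

```python
import copy

original = {"data": [[1.0, 2.0], [3.0, 4.0]], "metadata": {"T": 250}}

alias = original                 # same object, no copy is made
shallow = copy.copy(original)    # new dict, but the nested objects are shared
deep = copy.deepcopy(original)   # fully independent copy

alias["metadata"]["T"] = 300     # changes original and shallow as well

print(original["metadata"]["T"], shallow["metadata"]["T"], deep["metadata"]["T"])
```

Only the deep copy is unaffected by the change, which is the behaviour the text above describes for DataFile.clone.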
Modifying Data¶
Appending data¶
The simplest way to modify some data might be to append some columns or rows.
The Stoner module redefines two standard operators, + and &, to
have special meanings:
a=Stoner.DataFile('some_new_data.txt')
add_rows=d+a
add_columns=d&a
In these examples, a is a second DataFile object that contains some data. In the first example, a new DataFile object is created where the contents of a are added as new rows after the data in d. Any metadata that is in a and not in d is added to the metadata as well. There is a requirement, however, that the column headers of d and a are the same – i.e. that the two DataFile objects appear to represent similar data.
In the second example, the data in a is added as new columns after the data from d. In this case, there is a requirement that the two DataFile objects have the same number of rows.
These operators are not limited just to DataFile objects; you can also add numpy arrays to the DataFile object to append additional data:
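import numpy as np
x=np.array([1,2,3])
new_data=d+x # Append a single row
y=np.array([[1,2,3],[11,12,13],[21,22,23],[31,32,33]])
new_data=d+y # Append several rows from a 2D array
z={"X":1.0,"Y":2.1,"Z":7.5}
new_data=d+z # Append a row, matching dictionary keys to column headers
new_data=d+[x,y,z] # Append each element of the list in turn
column=d.column(0)
new_data=d&column # Append a numpy array as a new column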
In the first example above, we add a single row of data to d. This
assumes that the number of elements in the array matches the number of columns
in the data file. The second example is similar but this time appends a 2
dimensional numpy array to the data. The third example demonstrates adding data from a dictionary.
In this case the keys of the dictionary are used to determine which column the values are added
to. If there are columns that don’t match any of the dictionary keys, then a NaN
is inserted. If there
are keys that don’t match column labels, then new columns are added to the data set, filled with NaN
.
In the fourth example, each element in the list is added in turn to d. A similar effect would be achieved by doing
new_data=d+x+y+z
.
The last example appends a numpy array as a column to d. In this case the requirement is that the numpy array has the same or fewer rows of data as d.
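The shape requirements described above mirror those of ordinary numpy stacking. The sketch below uses a bare numpy array as a stand-in for the data in d; it illustrates the idea only and does not reproduce the metadata handling or the NaN padding that the operators perform.

```python
import numpy as np

d = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # stand-in for existing data: 2 rows, 3 columns

new_row = np.array([7.0, 8.0, 9.0])        # length must equal the column count
with_row = np.vstack([d, new_row])         # analogous to d + x

new_col = np.array([10.0, 20.0])           # length must equal the row count here
with_col = np.column_stack([d, new_col])   # analogous to d & column

print(with_row.shape, with_col.shape)
```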
Working with Columns of Data¶
Changing Individual Columns of Data¶
The DataFile.data
attribute is not simply a 2D numpy array but an instance of a special subclass, DataArray, which
can still be modified directly like any other numpy array. If, however, the DataFile.setas
attribute has
been used to identify columns as containing x,y,z,u,v,w,d,e or f type data, then the corresponding attributes can be written
to as well as read, so the data can be modified directly without having to keep track of which column(s) are indexed.
Thus the following will work:
d.setas="x..y..z"
d.x=d.x-5.0*d.y/d.z
d.y=d.z**2
d.z=np.ones(len(d))
When writing to the column attributes you must supply a numpy array with the correct number of elements (although DataFile will try to spot and correct the case where the array needs to be transposed first). If you have assigned more than one column to a particular type, then you should supply a 2D array with the corresponding number of columns of data when setting the attribute.
In order to preserve the behaviour that allows you to set the column assignments by setting the attribute to an index type, the
DataFile
checks to see if you are setting something that might be a column index or a numpy array. Thus the following
also works:
d.x="Temp" # Set the Temp column to be x data
d.x=np.linspace(0,300,len(d)) # And set the column to contain a linear array of values from 0 to 300.
You cannot set the p, q, or r attributes like this as they are calculated on the fly from the Cartesian coordinates. On the other hand you can do an efficient conversion to polar coordinates with:
d.setas="xyz"
(d.x,d.y,d.z)=(d.p,d.q,d.r)
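For reference, a Cartesian to spherical polar conversion can be sketched in numpy as below. The angle conventions are an assumption made for illustration: this section does not state which of p and q is the azimuthal angle or how Stoner defines them, so the variable names here are deliberately generic.

```python
import numpy as np

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
z = np.array([0.0, 0.0, 2.0])

radius = np.sqrt(x**2 + y**2 + z**2)
azimuth = np.arctan2(y, x)            # angle in the x-y plane (assumed convention)
inclination = np.arccos(z / radius)   # angle from the +z axis (assumed convention)
```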
Rearranging Columns of Data¶
Sometimes it is useful to rearrange columns of data. DataFile
offers a couple of methods to help with this.:
d.swap_column('Resistance','Temperature')
d.swap_column('Resistance','Temperature',headers_too=False,setas_too=False)
d.swap_column([(0,1),('Temp','Volt'),(2,'Curr')])
d.reorder_columns([1,3,'Volt','Temp'])
d.reorder_columns([1,3,'Volt','Temp'],headers_too=False,setas_too=False)
The DataFile.swap_column()
method takes either a tuple (or just a pair of arguments) of column names or indices, or a list of such
tuples and swaps the columns accordingly, whilst the DataFile.reorder_columns()
method takes a
list of column labels or indices and constructs a new data matrix out of those columns in the new order.
The headers_too=False
option, as the name suggests, causes the column headers not to be rearranged.
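What swapping and reordering do to the data matrix and its headers can be sketched with numpy indexing. The array and header names below are invented, and the sketch always moves the headers with the data, i.e. the behaviour with headers_too left at its default.

```python
import numpy as np

data = np.array([[1, 10, 100],
                 [2, 20, 200]])
headers = ["Temp", "Volt", "Curr"]

# Swap columns 0 and 1, moving the headers along with the data
data[:, [0, 1]] = data[:, [1, 0]]
headers[0], headers[1] = headers[1], headers[0]

# Reorder: build a new matrix from the listed columns, in the listed order
order = [2, 0]
reordered = data[:, order]
reordered_headers = [headers[i] for i in order]

print(reordered_headers)
```

Note that reordering with a shorter list also drops any column that is not named, since the new matrix is built only from the columns given.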
Note
The addition of the setas_too keyword to swap the column assignments around as well is new in 0.5rc1
Renaming Columns of Data¶
As a convenience, DataFile
also offers a useful method to rename data columns:
d.rename('old_name','new_name')
d.rename(0,'new_name')
Alternatively, of course, one could just edit the column_headers attribute.
Inserting Columns of Data¶
The append columns operator & will only add columns to the end of a
dataset. If you want to add a column of data in the middle of the data set then
you should use the DataFile.add_column()
method.:
d.add_column(np.arange(100),header='Column Header')
d.add_column(np.arange(100),header='Column Header',index=Index)
d.add_column(lambda x: x[0]-x[1],header='Column Header',func_args=None)
The first example simply adds a column of data to the end of the dataset and
sets the new column headers. The second variant inserts the new column before
column Index. Index follows the same rules as for the
DataFile.column()
method. In the third example, the new column data is
generated by applying the specified function. The function is passed a single
row as a 1D numpy array, together with any keyword-argument pairs passed in a
dictionary to the optional func_args argument.
The DataFile.add_column()
method returns a copy of the DataFile object
itself as well as modifying the object. This is to allow the method to be chained
up with other methods for more compact code writing.
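The second and third variants can be pictured with plain numpy: compute a column by applying a function to each row, then insert it before a chosen index. This is a sketch of the idea on an invented array, not the DataFile.add_column() implementation.

```python
import numpy as np

data = np.array([[5.0, 1.0],
                 [7.0, 2.0],
                 [9.0, 3.0]])

# Build the new column by applying a function to each row (axis 1)
diff = np.apply_along_axis(lambda row: row[0] - row[1], 1, data)

# Insert it before column index 1 rather than appending it at the end
inserted = np.insert(data, 1, diff, axis=1)

print(inserted)
```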
Deleting Rows of Data¶
Removing complete rows of data is achieved using the DataFile.del_rows()
method.:
d.del_rows(10)
d.del_rows(slice(1,100,2))
d.del_rows('X Col',value)
d.del_rows('X Col',lambda x,y:x>300)
d.del_rows('X Col',(100,200))
d.del_rows('X Col',(100,200),invert=True)
The first variant will delete row 10 from the data set (where the first row will
be row 0). You can also supply a list or slice (as in the second example) to
DataFile.del_rows()
to delete multiple rows.
If you do not know in advance which row to delete, then the remaining
variants provide more advanced options. The third variant searches for and
deletes all rows in which the specified column contains value. The
fourth variant selects which rows to delete by calling a user supplied function
for each row. The user supplied function is the same in form and definition as
that used for the DataFile.search()
method:
def user_func(x_val,row_as_array):
    # Return True to select the row, False to leave it alone
    return x_val>300
The final two variants above use a tuple to select the data. The final example makes
use of the invert keyword argument to reverse the sense used to select rows. In both cases
rows are deleted (or kept, for invert=True) if the value in the specified column lies between the minimum and maximum
values of the tuple. The test is done inclusively. Any iterable object of length two can be used
for specifying the bounds. Finally, if you call DataFile.del_rows()
with no arguments at all, then
it removes all rows where at least one column of data is masked out.:
d.filter(lambda r:r[0]>50) # Mask all rows where the first column is greater than 50
d.del_rows() # Then delete them.
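The inclusive range test used by the tuple variants, and the effect of invert, can be sketched with a numpy boolean mask. The array is invented and the variable names are illustrative only.

```python
import numpy as np

data = np.array([[ 50.0, 1.0],
                 [100.0, 2.0],
                 [150.0, 3.0],
                 [200.0, 4.0],
                 [250.0, 5.0]])
low, high = 100, 200

# Inclusive test on the first column: both bounds count as inside the range
in_range = (data[:, 0] >= low) & (data[:, 0] <= high)

after_delete = data[~in_range]   # like del_rows(col, (100, 200))
after_invert = data[in_range]    # like del_rows(col, (100, 200), invert=True)

print(after_delete[:, 0], after_invert[:, 0])
```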
For simple cases where the row to be deleted can be expressed as an integer or list of integers, the subtraction operator can be used:
e=d-2
e=d-[1,2,3]
d-=-1
The final form looks stranger than it is - it simply deletes the last row of data in place.
Deleting Columns of Data¶
Deleting whole columns of data can be done by referring to a column by index or
column header - the indexing rules are the same as used for the
DataFile.column()
method.:
d.del_column('Temperature')
d.del_column(1)
Again, there is an operator that can be used to achieve the same effect, in this case the modulus operator %.:
e=d%"temperature"
e=d%1
d%=-1
Sorting Data¶
Data can be sorted by one or more columns, specifying the columns as a number or string for a single column, or as a list or tuple of strings or numbers for multiple columns. Currently only ascending sorts are supported:
d.sort('Temp')
d.sort(['Temp','Gate'])
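A multi-column ascending sort can be sketched in numpy with lexsort. This assumes that the first column named is the primary sort key, which seems the natural reading of the second example but is not stated explicitly here; the data are invented.

```python
import numpy as np

data = np.array([[300, 2],
                 [100, 1],
                 [300, 1],
                 [100, 2]])   # hypothetical columns: Temp, Gate

# np.lexsort treats the LAST key as the primary one, so Temp goes last
order = np.lexsort((data[:, 1], data[:, 0]))
sorted_data = data[order]

print(sorted_data)
```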
Saving Data¶
Only saving data in the TDI format and as comma or tab delimited formats is supported.
For example:
d.save()
d.save(filename)
d=Stoner.CSVFile(d)
d.save()
d.save(filename,'\t')
In the first case, the filename used to save the data is determined from the filename attribute of the DataFile object. This will have been set when the file was loaded from disc.
If the filename attribute has not been set, e.g. if the DataFile object was
created from scratch, then the DataFile.save()
method will cause a dialogue
box to be raised so that the user can supply a filename.
In the second variant, the supplied filename is used and the filename attribute
is changed to match this, i.e. d.filename
will always return the last
filename used for a load or save operation.
The third is similar but converts the file to CSV
format, while the fourth also
specifies that the delimiter is a tab character.
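For comparison, a tab delimited text file with a single header row can be written from a bare numpy array as below. This is only a sketch of what a simple delimited format looks like: it does not produce the TDI format, it stores no metadata, and the headers are invented.

```python
import io
import numpy as np

data = np.array([[1.0, 2.5],
                 [2.0, 3.5]])
headers = ["Temp", "Res"]   # hypothetical column headers

buffer = io.StringIO()      # stands in for a file on disc
np.savetxt(buffer, data, delimiter="\t", header="\t".join(headers),
           comments="", fmt="%g")
text = buffer.getvalue()
print(text)
```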
Exporting Data to pandas¶
The Stoner.Data.to_pandas()
method can be used to convert a Stoner.Data
object to
a pandas.DataFrame. The numerical data will be transferred directly, with the DataFrame columns being set up
as a two level index of column headers and column assignments. The Stoner library registers an additional
metadata extension attribute for DataFrames that provides a thin sub-class wrapper around the same regular expression
based and type hinting dictionary that is used to store metadata in Stoner.Data.metadata
.
The pandas.DataFrame produced by the Stoner.Data.to_pandas()
method is reversibly convertible back to an identical
Stoner.Data
object by passing the DataFrame into the constructor of the Stoner.Data
object.
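To give a feel for a two level column index of headers and assignments, the sketch below builds one by hand in pandas. It does not call Stoner.Data.to_pandas(); the level names, headers and assignments are all hypothetical, and the real method may order or name the levels differently.

```python
import numpy as np
import pandas as pd

data = np.array([[1.0, 10.0],
                 [2.0, 20.0]])
headers = ["Temp", "Res"]   # hypothetical column headers
setas = ["x", "y"]          # hypothetical column assignments

# Two level column index: header on the first level, assignment on the second
columns = pd.MultiIndex.from_arrays([headers, setas], names=["header", "setas"])
frame = pd.DataFrame(data, columns=columns)

# Select by the first level alone, or by the full (header, assignment) pair
temp = frame["Temp"]
res_y = frame[("Res", "y")]
print(frame)
```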