Working with Lots of Files

A common case is that you have measured lots of data and now have a large stack of data files sitting in a tree of directories on disc, all of which need to be processed with some code. The Stoner.folders package contains classes to make this job much easier.

For the end-user, the top level classes are DataFolder for collections of Stoner.Data objects and Stoner.Image.ImageFolder for collections of Stoner.Image.ImageFile objects. These are designed to complement the corresponding data classes Stoner.Data and Stoner.ImageFile. Like Stoner.Core.Data, Stoner.folders.DataFolder is exported directly from the Stoner package, whilst Stoner.Image.ImageFolder is exported from the Stoner.Image sub-package.

DataFolder and its friends are essentially containers for Stoner.Data objects (or similar classes from the Stoner.Image package) and for other instances of DataFolder, allowing a nested hierarchy to be built up. The DataFolder supports both sequence-like and mapping-like interfaces to both the Stoner.Core.Data objects and the ‘sub’-DataFolder objects (meaning that they work like both a list and a dictionary). DataFolder is also lazy about loading files from disc: if an operation doesn’t need to load a file, it generally won’t, which keeps memory usage down and speeds things up.

There are further variants that can work with compressed zip archives (Stoner.Zip.ZipFolder) and that store multiple files in a single HDF5 file (Stoner.HDF5.HDF5Folder).

Finally, for the case of image files, there is a specialised Stoner.Image.ImageStack class that is optimised for image files of the same dimension and stores the images in a single 3D numpy array to allow much faster operations (at the expense of taking more RAM).

In the documentation below, except where noted explicitly, you can use a Stoner.Image.ImageFolder in place of the DataFolder, working with Stoner.Image.ImageFile instead of Stoner.Data.

Basic Operations

Building a (virtual) Folder of Data

The first thing you probably want to do is to get a list of data files in a directory (possibly including its subdirectories), probably matching some sort of filename pattern:

from Stoner import DataFolder
f = DataFolder(pattern='*.dat')

In this very simple example, the DataFolder class is imported in the first line and then a new instance f is created. The optional pattern keyword is used to only collect the files with a .dat extension. In this example, it is assumed that the files are readable by the general Stoner.Core.Data class; if they are in some other format then the type keyword can be used:

from Stoner.FileFormats import XRDFile
f = DataFolder(type=XRDFile)

Strictly, the class given by the type keyword should be a subclass of Stoner.Core.metadataObject and should have a constructor that understands the initial string parameter to be a filename to load the object from. The class is then available via the DataFolder.type attribute and a default instance of the class is available via the DataFolder.instance attribute.

Additional parameters needed for the class’s constructor can be passed via a dictionary to the extra_args keyword of the DataFolder constructor.

To specify a particular directory to look in, simply give the directory as the first argument; otherwise the current directory will be used.


If you pass False into the constructor as the first argument then the DataFolder will display a dialog box to let you choose a directory. If you add the multifile keyword argument and set it to True then you can use the dialog box to select multiple individual files.

More Options on Reading the Files on Disk

The pattern argument for DataFolder can also take a list of multiple patterns if there are different filename types in the directory tree.


Sometimes a more complex filename matching mechanism than simple ‘globbing’ is useful. The pattern keyword can also be a compiled regular expression:

import re

The second case illustrates a useful feature of regular expressions: they can be used to capture parts of the matched pattern, and in the Python version one can name the capturing groups. In both cases above the DataFolder has the same file members (basically these would be runs produced by the i10 beamline at Diamond), but in the second case the run number (which comes after ‘i10-’) would be captured and presented as the run parameter in the metadata when the file was read.
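As a minimal sketch of how a named capturing group behaves, the following uses only the standard-library re module. The filename is a made-up example and the code is not the DataFolder implementation; it only shows how a group called run picks out the digits after ‘i10-’.

```python
import re

# Pattern with a named group "run" capturing the digits after "i10-".
pattern = re.compile(r"i10-(?P<run>\d+)\.dat")

# Hypothetical filename, used purely for illustration.
match = pattern.search("i10-12345.dat")

# groupdict() maps each named group to the text it captured.
captured = match.groupdict()
print(captured)
```

The captured value is a string; any conversion to a number would be a separate step.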


Note that the files are not modified - the extra metadata is only added as the file is read by the DataFolder.

The loading process will also add the metadata key ‘Loaded From’ to the file, which gives you a note of the filename used to read the data. If the attribute DataFolder.read_means is set to True then additional metadata is set for each file that contains the mean value and standard deviation of each column of data. If you don’t want the file listing to be recursive, this can be suppressed by using the recursive keyword argument, and the file listing can be suppressed altogether with the nolist keyword.


If you don’t want to create groups for each sub-directory, then set the keyword parameter flat to True.

Dealing With Revision Numbers

The Leeds CM Physics LabVIEW measurement software (aka ‘The One Code’) has a feature that adds a revision number into the filename when it is asked to overwrite a saved data file. This revision number is incremented until a non-colliding filename is created, thus ensuring that data isn’t accidentally overwritten. The downside of this is that sometimes only the latest revision number actually contains the most useful data. In this case the option discard_earlier in the DataFolder.__init__() constructor can be useful, or equivalently the DataFolder.keep_latest() method:

# is equivalent to....

More Goodies for DataFolders

Since a Stoner.Data represents data in named columns, the DataFolder offers a couple of additional options for actions to take when reading the files in from disk. It is possible to have the mean and standard deviation of each column of data calculated and added as metadata as each file is loaded. The read_means boolean parameter enables this.

Other Options

Setting the debug parameter will cause additional debugging information to be sent as the code runs.

Any other keyword arguments that are not attributes of DataFolder are instead kept and used to set attributes on the individual Stoner.Data instances as they are loaded from disc. This, for example, can allow one to set the default Stoner.Data.setas attribute for each file.


A particularly useful parameter to set in the DataFolder constructor is the setas parameter. This will ensure that the Stoner.Data.setas attribute is set to identify columns of data as x, y etc. as the data files are loaded into the folder, thus allowing subsequent calls to Stoner.Data methods to run without needing to explicitly set the columns each time.

All of these keywords to the constructor will set corresponding attributes on the created DataFolder, so it is possible to redo the process of reading the list of files from disk by directly manipulating these attributes.

The current root directory and pattern are set in the directory and pattern keywords and stored in the similarly named attributes. The DataFolder.getlist() method can be used to force a new listing of files.


Manipulating the File List in a Folder

The DataFolder.flatten() method will do the same as passing the flat keyword argument when creating the DataFolder: although the search for folders on disk is recursive, the resulting DataFolder contains a flat list of files.

You can also use the Stoner.folders.groups.GroupsDict.prune() method (which is aliased as DataFolder.prune()) to remove groups (including nested groups) that have no data files in them. If you supply a name keyword to the Stoner.folders.groups.GroupsDict.prune() method, it will instead remove any sub-folder with a matching name (and all sub-folders within it):

Root---> (0 files)
     |-> A--> (0 files)
     |    |
     |    |--> B--> (5 files)
     |    |     |
     |    |     |--> C--> (0 files)
     |    |     |     |
     |    |     |     |--> D (0 files)
     |    |     |
     |    |     |--> E--> (0 files)
     |    |
     |    |--> F--> (0 files)
     |-->G--> (2 files)

root.groups.prune() will have the effect of removing sub-folders C, D, E, and F

Root---> (0 files)
     |-> A--> (0 files)
     |    |
     |    |--> B--> (5 files)
     |-->G--> (2 files)

root.groups.prune(name="B") will have the effect of removing sub-folder B and everything within it, that is B, C, D and E

Root---> (0 files)
     |-> A--> (0 files)
     |    |
     |    |--> F--> (0 files)
     |-->G--> (2 files)
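The two pruning modes can be sketched on a toy tree that mirrors the diagrams above. This is a minimal illustration under stated assumptions, not the GroupsDict implementation: the Group class, its files counter and the paths() helper are all hypothetical, and name matching here is plain equality.

```python
class Group:
    """Toy stand-in for a folder: a file count plus named sub-groups."""

    def __init__(self, files=0, **groups):
        self.files = files
        self.groups = dict(groups)

    def total_files(self):
        """Count the files in this group and in everything below it."""
        return self.files + sum(g.total_files() for g in self.groups.values())

    def prune(self, name=None):
        """Drop branches holding no files, or every sub-group called *name*."""
        for key in list(self.groups):
            sub = self.groups[key]
            if name is None:
                remove = sub.total_files() == 0
            else:
                remove = key == name
            if remove:
                del self.groups[key]
            else:
                sub.prune(name=name)
        return self

    def paths(self, prefix="Root"):
        """Return the sorted paths of all groups in the tree."""
        found = [prefix]
        for key, sub in self.groups.items():
            found.extend(sub.paths(prefix + "/" + key))
        return sorted(found)


def build():
    """Rebuild the example tree used in the diagrams."""
    return Group(
        A=Group(B=Group(5, C=Group(D=Group()), E=Group()), F=Group()),
        G=Group(2),
    )


print(build().prune().paths())
print(build().prune(name="B").paths())
```

Note that group A survives the first call even though it holds no files itself, because one of its descendants does.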

In contrast, the Stoner.folders.groups.GroupsDict.keep() method will retain the tree branches that contain the groups that match the name parameter. For example,

root.groups.keep("B") will have the effect of deleting everything except the folders A, B, C, D and E.

Root---> (0 files)
     |-> A--> (0 files)
          |--> B--> (5 files)
                |--> C--> (0 files)
                |     |
                |     |--> D (0 files)
                |--> E--> (0 files)

The Stoner.folders.groups.GroupsDict.compress() method is useful when a DataFolder contains a chain of sub-folders that have only one sub-folder in them, as can result when reading one specific directory from a deep directory tree. The DataFolder.compress() method adjusts the virtual tree so that the root group is at the first level that contains more than just a single sub-folder.

Root---> (0 files)
     |-> A--> (0 files)
          |--> B--> (0 files)
                |--> C--> (5 files)

root.groups.compress() will reformat the DataFolder to:

Root/A/B/C---> (5 files)

Stoner.folders.groups.GroupsDict.compress() takes a keyword argument keep_terminal which will keep the final group if set to True. In the example above, root.compress(keep_terminal=True) gives:

Root/A/B--> (0 files)
        |-->C--> (5 files)

You can also use the sorted filenames in a DataFolder to reconstruct the directory structure as groups by using the DataFolder.unflatten() method. Alternatively the invert operator ~ will flatten and unflatten a DataFolder:

g=~f # Flatten (if f has groups) or unflatten (if f has no groups)


The unary invert operator ~ will always create a new DataFolder before doing the DataFolder.flatten() or DataFolder.unflatten() - so that the original DataFolder is left unchanged. In contrast the DataFolder.flatten() and DataFolder.unflatten() methods will change the DataFolder as well as return a copy of the changed DataFolder.

If you need to combine multiple DataFolder objects or add Stoner.Core.Data objects to an existing DataFolder then the arithmetic addition operator can be used:



This will firstly combine all the files and then recursively merge the groups. If each DataFolder instance has the same groups, then they are merged with the addition operator.


Strictly, the last example is adding an instance of the DataFolder.type to the DataFolder - type checking is carried out to ensure that this is so.

Getting a List of Files

To get a list of the names of the files in a DataFolder, you can use the attribute. Sub-DataFolders also have a name (essentially a string key to the dictionary that holds them); these can be accessed via the DataFolder.lsgrp generator function.



Both this attribute and DataFolder.lsgrp are generators, so they only return entries as they are iterated over. This is (roughly) in line with the Python 3 way of doing things - if you actually want the whole list then you should wrap them in a list().

If you just need the actual filename part and not the directory portion of the filename, the generator DataFile.basenames will do this.

As well as the list of filenames, you can get at the underlying stored objects through the DataFolder.files attribute. This will return a list of either instances of the stored Stoner.Core.Data type if they have already been loaded or the filename if they haven’t been loaded into memory yet.


The various subfolders are stored in a dictionary in the DataFolder.groups attribute.


Both the files and groups in a DataFolder can be accessed either by integer index or by name. If a string name is used and doesn’t exactly match, then it is interpreted as a regular expression and that is matched instead. This only applies for retrieving items - for setting items an exact name or integer index is required.
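The lookup order described above (integer index, then exact name, then regular expression) can be sketched in a few lines. This is only an illustration: the lookup function and the filenames are hypothetical, and the choice of re.search and of returning the first match are assumptions rather than documented behaviour.

```python
import re


def lookup(names, key):
    """Return the entry of *names* selected by an integer index or a string key."""
    if isinstance(key, int):
        return names[key]
    if key in names:  # an exact match takes priority
        return key
    regex = re.compile(key)  # otherwise treat the key as a regular expression
    for name in names:
        if regex.search(name):
            return name  # first name matching the regular expression
    raise KeyError(key)


names = ["run_001.dat", "run_002.dat", "calibration.dat"]  # hypothetical filenames
print(lookup(names, 1))
print(lookup(names, "calibration.dat"))
print(lookup(names, r"cal.*"))
```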

Doing Something With Each File

A DataFolder is an object that you can iterate over, loading the Stoner.Core.Data type object for each of the files in turn. This provides an easy way to run through a set of files, performing the same operation on each:

for f in folder:

or even more compactly:

[f.normalise('mac116','macc119').save() for f in DataFolder(pattern='*.tdi',type=Stoner.Data)]

or even (!):


This last example illustrates a special ability of a DataFolder to use the methods of the Stoner.Data type stored inside the DataFolder. The special DataFolder.each attribute (which is actually a Stoner.folders.each_item instance) provides hooks to let you call methods of the underlying DataFolder.type class on each file in the DataFolder in turn. When you access a method on DataFolder.each that is actually a method of the stored class, you get back a wrapper that calls that method on each Stoner.Data in turn. If the method on Stoner.Data returns the Stoner.Data back, then this is stored in the DataFolder, and the result returned to the user is the revised DataFolder. If, on the other hand, the method returns some other value, then the user is returned a list of all of those return values. For example:

# ret will be a copy of folder as Data.interpolate returns a copy of itself.

# ret is a list of tuples as the return value of Data.span() is a tuple
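The forwarding behaviour can be sketched with a small proxy built on __getattr__. Everything here is a toy stand-in under stated assumptions: Item, Each and Folder are hypothetical classes, not the Stoner ones, and this sketch returns the same folder object rather than a copy.

```python
class Item:
    """Toy data object with one method that returns itself and one that does not."""

    def __init__(self, values):
        self.values = list(values)

    def scale(self, factor):
        self.values = [v * factor for v in self.values]
        return self  # returns the object, like a chainable data method

    def span(self):
        return (min(self.values), max(self.values))  # returns a plain tuple


class Each:
    """Proxy that forwards a method call to every item in a folder."""

    def __init__(self, folder):
        self._folder = folder

    def __getattr__(self, name):
        def call_on_all(*args, **kwargs):
            results = [getattr(item, name)(*args, **kwargs) for item in self._folder.items]
            if all(isinstance(r, Item) for r in results):
                self._folder.items = results  # store the returned objects
                return self._folder
            return results  # anything else comes back as a list

        return call_on_all


class Folder:
    def __init__(self, items):
        self.items = list(items)

    @property
    def each(self):
        return Each(self)


folder = Folder([Item([1, 2, 3]), Item([4, 5])])
ret = folder.each.scale(2)   # ret is the folder, since scale() returns an Item
spans = folder.each.span()   # spans is a list of tuples
print(ret is folder, spans)
```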

What happens if the analysis routine you want to run through all the items in DataFolder is not a method of the Stoner.Data class, but a function written by you? In this case, so long as you write your custom analysis function so that the first positional argument is the Stoner.Data to be analysed, then the following syntax can be used:

def my_analysis(data,arg1,arg2,karg=True):
    """Some sort of analysis function with some arguments and keyword argument that works
    on some data *data*."""
    return data.modified()


(or alternatively using the matrix multiplication operator @):


(my_analysis@f) creates the callable object that iterates my_analysis over f; the second set of parentheses above just calls this iterating object.
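One way a container can support function @ folder is through the reflected operator __rmatmul__, since a plain Python function has no @ handling of its own. The sketch below assumes that mechanism purely for illustration; the Folder class and the body of my_analysis are hypothetical, and the real DataFolder may be implemented differently.

```python
class Folder:
    """Toy container that supports ``function @ folder``."""

    def __init__(self, items):
        self.items = list(items)

    def __rmatmul__(self, func):
        # Python falls back to this reflected method because the left
        # operand (a plain function) does not implement ``@`` itself.
        def iterate(*args, **kwargs):
            return [func(item, *args, **kwargs) for item in self.items]

        return iterate


def my_analysis(data, offset, scale=1.0):
    """Hypothetical analysis: the data item is the first positional argument."""
    return (data + offset) * scale


f = Folder([1, 2, 3])
results = (my_analysis @ f)(10, scale=2.0)
print(results)
```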

If the return value of the function is another instance of Stoner.Data (or whatever is being stored as the items in the DataFolder) then it will replace the items inside the DataFolder. The call to DataFolder.each will also return a simple list of the return values. If the function returns something else, then you can have it added to the metadata of each item in the DataFolder by adding a _return keyword that can either be True to use the function name as the metadata name or a string to specify the name of the metadata to store the return value explicitly.

Thus, if your analysis function calculates some parameter that you want to call beta you might use the following:


DataFolder is also indexable and has a length:


For the second case of indexing, the code will search the list of file names for a matching file and return that (roughly equivalent to doing f.files.index("filename")). See Sorting, Filtering and Grouping Data Files for creating a sub DataFolder with a named index.

Working on the Metadata

Since each object inside a DataFolder will be some form of Stoner.Core.metadataObject, the DataFolder provides a mechanism to access the combined metadata of all of the Stoner.Core.metadataObject s it is storing via a DataFolder.metadata attribute. Like DataFolder.each this is actually a special class (in this case combined_metadata_proxy) that manages the process of iterating over the contents of the DataFolder to get and set metadata on the individual Stoner.Data objects.

Indexing the DataFolder.metadata will return an array of the requested metadata key, with one element from each data file in the folder. If the metadata key is not present in all files, then the array is a masked array and the mask is set for the files where it is missing.:

>>> masked_array(data=['Er', --, 'None', 'FeNi'],
         mask=[False,  True, False, False],

Writing to the contents of the DataFolder.metadata will simply set the corresponding metadata value on all the files in the folder:

>>> array([12.56, 12.56, 12.56, 12.56])

The combined_metadata_proxy.slice() method provides more control over how the metadata stored in the data folder can be returned:

>>> [{'Startupaxis-X': 2},
     {'Startupaxis-X': 2},
     {'Startupaxis-X': 2},
     {'Startupaxis-X': 2}]

>>> [{'Datatype,Comment': 1, 'Startupaxis-X': 2},
     {'Startupaxis-X': 2, 'Datatype,Comment': 1},
     {'Datatype,Comment': 1, 'Startupaxis-X': 2},
     {'Datatype,Comment': 1, 'Startupaxis-X': 2}]
>>> [2, 2, 2, 2]

==========================  ===============
TDI Format 1.5                Startupaxis-X
index                                     0
==========================  ===============
Stoner.class{String}= Data                2
==========================  ===============

As can be seen from these examples, the combined_metadata_proxy.slice() method will default to returning either a list of dictionaries or, if values_only is True, just a list of values, but the output parameter can change this. The options for output are:

  • “dict” or dict (the default if values_only is False)

    return a list of dictionary subsets of the metadata

  • “list” or list (the default if values_only is True)

    return a list of values of each item of the metadata. If only one item of metadata is requested, then it just returns a list.

  • “array” or np.array

    return a single array - like list above, but returns as a numpy array. This can create a 2D array from multiple keys

  • “Data” or Stoner.Data

    returns the metadata in a Stoner.Data object where the column headers are the metadata keys.

  • “smart”

    switch between dict and list depending whether there is one or more keys.

The combined_metadata_proxy.slice() method will search for matching metadata names by string, including using glob patterns: root.metadata.slice("Model:*") will return all metadata items in all files in the DataFolder that start with ‘Model:’. Since one of the common uses of DataFolder is to fit a series of data files with a model, combined_metadata_proxy.slice() will also accept an lmfit.Model and will use it to pull the fitting parameters after using a Stoner.DataFolder.curve_fit() or similar method:

from Stoner.analysis.fitting.models.generic import Gaussian
fldr.each.lmfit(Gaussian,result=True)
summary=fldr.metadata.slice(Gaussian,output="data")

Since combined_metadata_proxy implements a collections.abc.MutableMapping, it supplies the standard dictionary-like methods such as combined_metadata_proxy.keys(), combined_metadata_proxy.values() and combined_metadata_proxy.items(), each of which works with the set of keys common to all the data files in the DataFolder. If you instead want to work with all the keys defined in any of the data files, then there are versions combined_metadata_proxy.all_keys(), combined_metadata_proxy.all_values() and combined_metadata_proxy.all_items(). The combined_metadata_proxy.all attribute provides a list of all the metadata dictionaries for all the data files in the DataFolder.
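The difference between the common keys and all keys amounts to a set intersection versus a set union over the per-file metadata. A minimal sketch using plain dictionaries (the metadata values are invented for illustration, and this is not how the proxy class itself is written):

```python
# Hypothetical metadata dictionaries, one per data file.
metadata = [
    {"Temperature": 4.2, "Field": 1.0, "Sample": "Er"},
    {"Temperature": 77.0, "Field": -1.0},
    {"Temperature": 300.0, "Sample": "FeNi"},
]

# Keys present in every file (set intersection).
common_keys = set(metadata[0]).intersection(*metadata[1:])

# Keys present in at least one file (set union).
all_keys = set().union(*metadata)

print(sorted(common_keys), sorted(all_keys))
```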

Using output="Data" is particularly powerful as it can be used to gather the results from e.g. a curve fitting across lots of data files into a single Stoner.Data object ready for plotting or further analysis:

result=fldr.metadata.slice(["Temperature:T1","PowerLaw:A","PowerLaw:A error"],output="Data")

In this example all the text files in the current directory tree are read in, a power-law is fitted to the first two columns and the result of the fit is plotted versus a temperature parameter.

Sorting, Filtering and Grouping Data Files


The order of the files in a DataFolder is arbitrary. If it is important to process them in a given order then the DataFolder.sort() method can be used:

f.sort(lambda x:len(x))

The first variant simply sorts the files by filename. The second and third variants both look at the ‘temperature’ metadata in each file and use that as the sort key. In the third variant, the reverse keyword is used to reverse the order of the sort. In the final variant, each file is loaded in turn and the supplied function is called and evaluated to find a sort key.


The DataFolder.filter() method can be used to prune the list of files to be used by the DataFolder:

import re
f.filter(lambda x: x['Temperature']>150)
f.filter(lambda x: x['Temperature']>150,invert=True,copy=True)
f.filterout(lambda x: x['Temperature']>150,copy=True)

The first form performs the filter on the filenames (using the standard Python fnmatch module). One can also use a regular expression as illustrated in the second example, although unlike using the pattern keyword in DataFolder.getlist(), there is no option to capture metadata (although one could then subsequently set the pattern to achieve this). The third variant calls the supplied function, passing the current file in as a Stoner.Data object each time. If the function evaluates to True then the file is kept. The invert keyword is used to invert the sense of the filter (a particularly silly example here, since the greater than sign could simply be replaced with a less than or equals sign!). The copy keyword argument causes the DataFolder to be duplicated before the duplicate is filtered; without this, the filtering will modify the current DataFolder in place. Finally, the DataFolder.filterout() method is an alias for the DataFolder.filter() method with the invert keyword set.
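The three kinds of filter test (glob string, compiled regular expression, callable) and the invert flag can be sketched as a small dispatch function. This is an illustrative stand-in, not the DataFolder.filter() code: filter_names and the filenames are hypothetical, and to stay self-contained the callable here receives the filename rather than a loaded Stoner.Data object.

```python
import fnmatch
import re


def filter_names(names, test, invert=False):
    """Return the names that pass *test* (or fail it when *invert* is True)."""
    if isinstance(test, str):
        check = lambda name: fnmatch.fnmatch(name, test)  # glob pattern
    elif isinstance(test, re.Pattern):
        check = lambda name: test.search(name) is not None  # regular expression
    elif callable(test):
        check = test  # user-supplied function
    else:
        raise TypeError("test must be a glob string, a compiled regex or a callable")
    # Keep a name when the truth of the check differs from the invert flag.
    return [name for name in names if bool(check(name)) != invert]


names = ["a_100K.txt", "a_200K.txt", "b_100K.dat"]  # hypothetical filenames
print(filter_names(names, "*.txt"))
print(filter_names(names, re.compile(r"_100K")))
print(filter_names(names, lambda n: n.startswith("a"), invert=True))
```

Returning a new list corresponds loosely to the copy behaviour described above; an in-place version would reassign the container's own list instead.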

Selecting Data

Selecting data from the DataFolder is somewhat similar to filtering, but allows an easy way to build complex selection rules based on metadata values.


The basic pattern of the method is that each keyword argument determines both the name of the metadata to use as the basis of the selection and also the operation to be performed. The value of the keyword argument is the value used in the check. The operation is separated from the column name by a double underscore.

In the first example, only those files with a metadata value “temperature_T1” which is 4.2 will be selected. Here there is no operator specified, so for a single scalar value it is assumed to be ‘__eq’ for equals. For a tuple it would be ‘__between’ and for a longer list ‘__in’. In the second example, the ‘__gt’ (greater than) operator is used and in the third it is ‘__between’, but in addition, this is inverted with ‘__not’. The fourth option illustrates a test with metadata whose values are strings. In addition, the use of the two keyword arguments is the logical OR of testing for either. The equivalent process for a logical AND is shown in the sixth example with successive selects (the ‘__icontains’ operator is a case insensitive match). The final example uses a dictionary passed as a non-keyword argument to show how to select metadata keys that are not valid Python identifiers.
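The keyword convention described above (a metadata name, an optional ‘__not’, and an operator separated by double underscores) can be sketched as follows. This is a simplified illustration under stated assumptions, not the DataFolder.select() implementation: the matches and select functions, the operator table and the metadata values are hypothetical, and only a handful of operators are covered.

```python
import operator

# Assumed operator table for this sketch; the real method may support more.
OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "between": lambda value, limits: limits[0] <= value <= limits[1],
    "in": lambda value, options: value in options,
    "icontains": lambda value, text: text.lower() in str(value).lower(),
}


def matches(metadata, **tests):
    """Return True if *metadata* passes any one of the keyword tests (logical OR)."""
    for spec, target in tests.items():
        parts = spec.split("__")
        key, ops = parts[0], parts[1:]
        invert = "not" in ops
        ops = [op for op in ops if op != "not"]
        if ops:
            op = ops[0]
        elif isinstance(target, tuple):
            op = "between"  # default for a tuple
        elif isinstance(target, list):
            op = "in"  # default for a list
        else:
            op = "eq"  # default for a scalar
        if key in metadata and OPERATORS[op](metadata[key], target) != invert:
            return True
    return False


def select(files, **tests):
    """Return the metadata dictionaries that satisfy the tests."""
    return [m for m in files if matches(m, **tests)]


files = [
    {"temperature_T1": 4.2, "user": "Alice"},
    {"temperature_T1": 77.0, "user": "Bob"},
    {"temperature_T1": 300.0, "user": "alice"},
]
print(select(files, temperature_T1=4.2))
print(select(files, temperature_T1__not__between=(50, 100)))
# Logical AND by chaining two selects.
print(select(select(files, user__icontains="ali"), temperature_T1__gt=50))
```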


One of the more common tasks is to group a long list of data files into separate groups according to some logical test, for example gathering files with magnetic field sweeps in a positive direction together and those with magnetic field in a negative direction together. The DataFolder.group() method provides a powerful way to do this. Suppose we have a series of data curves taken at a variety of temperatures and with three different magnetic fields:

f.group('temperature')
f.group(lambda x:"positive" if x['B-Field']>0 else "negative")
f.group(['temperature',lambda x:"positive" if x['B-Field']>0 else "negative"])

The method splits the files in the DataFolder into several groups, each of which shares a common value of the argument supplied to the method. A group is itself another instance of the DataFolder class. As explained above, each DataFolder object maintains a dictionary called DataFolder.groups whose keys are the distinct values of the argument of the method and whose values are DataFolder objects. So, if our DataFolder f contained files measured at 4.2, 77 and 300K and at fields of 1T and -1T then the first variant would create 3 groups: 4.2, 77 and 300, each one of which would be a DataFolder object containing the files measured at those temperatures. The second variant would produce 2 groups: ‘positive’ containing the files measured with magnetic field of 1T and ‘negative’ containing the files measured at -1T. The third variant then goes one stage further and would produce 3 groups, each of which in turn had 2 groups. The groups are accessed via the DataFolder.groups attribute:


would return a list of the files measured at 4.2K and 1T.
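The grouping logic itself (one level per key, where a key is either a metadata name or a callable) can be sketched with plain dictionaries. This is an illustration only: the group and group_nested functions are hypothetical helpers operating on invented metadata dictionaries, whereas the real method builds nested DataFolder objects.

```python
def group(files, key):
    """Split *files* (metadata dicts) into a dict of lists keyed by *key*."""
    groups = {}
    for metadata in files:
        # A callable computes the group name; a string looks it up directly.
        value = key(metadata) if callable(key) else metadata[key]
        groups.setdefault(value, []).append(metadata)
    return groups


def group_nested(files, keys):
    """Apply several keys in turn to build nested groups."""
    if not keys:
        return files
    return {
        value: group_nested(members, keys[1:])
        for value, members in group(files, keys[0]).items()
    }


# Hypothetical files at three temperatures and two field directions.
files = [{"temperature": t, "B-Field": b} for t in (4.2, 77, 300) for b in (1.0, -1.0)]
sign = lambda x: "positive" if x["B-Field"] > 0 else "negative"

nested = group_nested(files, ["temperature", sign])
print(nested[4.2]["positive"])
```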

If you try indexing a DataFolder with a string, it first checks to see if there is a matching group with a key of the same string; if there is, the DataFolder will return the corresponding group. This allows a more compact navigation through an extended group structure:

f.group(['project','sample','device']) # group will take a list
f['ASF']['ASF038']['A'] # Successive indexing
f['ASF','ASF038','A'] # index with a tuple

The last variant will index through multiple levels of groups and then index for a file with a matching name and then finally index metadata in that file.

If you just want to create a new empty group in your DataFolder, you can use the DataFolder.add_group() method.


which will create the new group with a key of ‘key_value’.

Reducing Data

An important driver for the development of the DataFolder class has been to aid data reduction tasks. The simplest form of data reduction would be to gather one or more columns from each of a folder of files and return it as a single large table or matrix. This task is easily accomplished by the DataFolder.gather() method:

f.gather("X Data","Y Data")
f.gather("X Data",["Ydata 1","Y Data 2"])

In the first two forms you specify the x column and one or more y columns. In the third form, the x and y columns are determined by the values from the Stoner.Data.setas attribute. (you can set the value of this attribute for all files in the DataFolder by setting the DataFolder.setas attribute.)

A similar operation to DataFolder.gather() is to build a new set of data where each row corresponds to a set of metadata values from each file in the DataFolder. This can be achieved with the DataFolder.extract() method.


The argument to the DataFolder.extract() method is a list of metadata values to be extracted from each file. The metadata should be convertible to an array type so that it can be included in the final result matrix. Any metadata that doesn’t appear to be so convertible in the first file in the DataFolder is ignored. The column headings of the final results table are the names of the metadata that were used in the extraction.

One task you might want to do would be to work through all the groups in a DataFolder and run some function either with each file in the group or on the whole group. This is further complicated if you want to iterate over all the sub-groups within a group. The DataFolder.walk_groups() method is useful here.


This will iterate over the complete hierarchy of groups and sub-groups in the folder and execute the function func once for each group. If the group parameter is False then it will execute func once for each file. The function func should be defined something like:

def func(group,list_of_group_keys,arg1,arg2,...):

The first parameter should expect an instance of Stoner.Data if group is False or an instance of DataFolder if group is True. The second parameter will be given a list of strings representing the group key values from the topmost group to the lowest (terminal) group.

The replace_terminal parameter applies when group is True and the function returns a Stoner.Core.DataFile object. This indicates that the group on which the function was called should be removed from the list of groups and the returned Stoner.Data object should be added to the list of files in the folder. This operation is useful when one is processing a group of files to combine them into a single dataset. Combining a multi-level grouping operation and successive calls to DataFolder.walk_groups() can rapidly reduce a large set of data files representing a multi-dimensional data set into a single file with minimal coding.
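The basic recursion, calling a function with either a group or a file plus the list of group keys leading to it, can be sketched on nested dictionaries. This is a rough illustration, not the DataFolder.walk_groups() code: the tree layout and the function below are hypothetical, replace_terminal is not modelled, and whether the real method also visits the root or non-terminal groups is an assumption made here.

```python
def walk_groups(tree, func, group=False, trail=()):
    """Call *func* on every group (or every file) in a nested tree."""
    results = []
    if group:
        results.append(func(tree, list(trail)))  # once per group
    else:
        results.extend(func(item, list(trail)) for item in tree["files"])  # once per file
    for key, sub in tree["groups"].items():
        results.extend(walk_groups(sub, func, group=group, trail=trail + (key,)))
    return results


# Hypothetical tree: each node holds a list of files and a dict of sub-groups.
tree = {
    "files": [],
    "groups": {
        "4.2": {
            "files": [],
            "groups": {
                "positive": {"files": ["a.txt", "b.txt"], "groups": {}},
                "negative": {"files": ["c.txt"], "groups": {}},
            },
        }
    },
}

per_file = walk_groups(tree, lambda item, keys: "/".join(keys + [item]))
per_group = walk_groups(tree, lambda g, keys: (keys, len(g["files"])), group=True)
print(per_file)
print(per_group)
```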

In some cases you will want to work with sets of files coming from different groups in order. For example, suppose that above we had a sequence of 10 data files for each field and temperature and we wanted to process the positive and negative field curves together for a given temperature in turn. In this case the DataFolder.zip_groups() method can be useful.


This would return a list of tuples of Stoner.Data objects where the tuples would be the first positive and first negative field files, then the second of each, then the third of each and so on. This presupposes that the files started off sorted by some suitable parameter (e.g. a gate voltage).
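The pairing described here is the same idea as Python's built-in zip applied to the file lists of two groups. A minimal sketch with invented filenames standing in for the data objects (the dictionary layout is an assumption for illustration):

```python
# Hypothetical groups, each already sorted by gate voltage.
groups = {
    "positive": ["p_0V.txt", "p_1V.txt", "p_2V.txt"],
    "negative": ["n_0V.txt", "n_1V.txt", "n_2V.txt"],
}

# Pair the first file of each group, then the second of each, and so on.
pairs = list(zip(groups["positive"], groups["negative"]))
print(pairs)
```

As with zip itself, the pairing only makes sense if both lists are in a matching order before they are combined.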