The comprehensive documentation of the O2 Analysis Framework is located at https://aliceo2group.github.io/analysis-framework/.
This document is WIP and provides an idea of what kind of API to expect from the DPL-enabled analysis framework. APIs are neither final nor fully implemented in O2.
In order to simplify analysis we have introduced an extension to DPL which allows one to describe an analysis in the form of a collection of `AnalysisTask`s.

To create your own task, derive your own type from `AnalysisTask`. Such a task can then be added to a workflow via the `adaptAnalysisTask` helper. A full-blown example can be put together as shown in the sketch below.
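(A minimal sketch, assuming the usual DPL headers; the struct name, the task name string, and the exact `adaptAnalysisTask` signature are illustrative and may have evolved.)

```cpp
#include "Framework/AnalysisTask.h"
#include "Framework/runDataProcessing.h"

using namespace o2::framework;

// an analysis task is just a struct deriving from AnalysisTask
struct MyTask : AnalysisTask {
};

// the workflow definition: the task is added via the adaptAnalysisTask helper
WorkflowSpec defineDataProcessing(ConfigContext const&)
{
  return WorkflowSpec{
    adaptAnalysisTask<MyTask>("my-task-unique-name"),
  };
}
```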
Implementation details:

- `AnalysisTask` is simply a `struct`. Since the default inheritance policy of a `struct` is `public`, we can omit specifying it when declaring `MyTask`.
- `AnalysisTask` does not actually provide any virtual method, as the `adaptAnalysis` helper relies on template argument matching to discover the properties of the task. It will become clear in the next paragraph how this is used to avoid the proliferation of data subscription methods.
Once you have an `AnalysisTask`-derived type, the most generic way to process data is to provide a `process` method for it. Depending on the arguments of such a method, you get to iterate on different parts of the AOD content. For example:
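(A sketch; `o2::aod::Tracks` is taken from the standard data model and the body is left to the user.)

```cpp
struct MyTask : AnalysisTask {
  // subscribes to the full Tracks table of each time frame
  void process(o2::aod::Tracks const& tracks)
  {
    // ... use the tracks collection here
  }
};
```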
This will allow you to get a per-time-frame collection of tracks. You can then iterate on the tracks using the syntax:
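(A sketch; the `eta()` getter is an illustrative column accessor from the standard data model.)

```cpp
for (auto& track : tracks) {
  auto eta = track.eta(); // each column is accessed via its getter
}
```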
Alternatively you can subscribe to tracks one by one via (notice the missing `s`):
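(A sketch; note the singular `Track`.)

```cpp
struct MyTask : AnalysisTask {
  // invoked once per track rather than once per time frame
  void process(o2::aod::Track const& track)
  {
    // ... use the single track here
  }
};
```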
This has the advantage that you might be able to benefit from vectorization / parallelization.
Implementation notes: as mentioned before, the arguments of the `process` method are inspected using template argument matching. This way the system knows at compile time what data types are requested by a given `process` method and can create the relevant DPL data descriptions. The distinction between `Tracks` and `Track` above is simply that one refers to the whole collection, while the second is an alias to `Tracks::iterator`. Notice that we assume that each collection is of type `o2::soa::Table`, which carries metadata about the dataOrigin and dataDescription to be used by DPL to subscribe to the associated data stream.
For performance reasons, data is organized in a set of flat tables and navigation between objects of different tables has to be expressed explicitly in the `process` method. So if you want to get all the tracks for a specific collision, you will have to implement:
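(A sketch of such a grouped subscription.)

```cpp
struct MyTask : AnalysisTask {
  // invoked once per collision; tracks contains only the tracks
  // associated to that collision
  void process(o2::aod::Collision const& collision, o2::aod::Tracks const& tracks)
  {
    // ... iterate on the tracks of this collision
  }
};
```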
The above will be called once per collision found in the time frame, and `tracks` will allow you to iterate on all the tracks associated to the given collision.
Alternatively, you might not require all the tracks at once, in which case you could do with:
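(A sketch; singular `Track` again.)

```cpp
struct MyTask : AnalysisTask {
  // invoked once per (collision, track) association
  void process(o2::aod::Collision const& collision, o2::aod::Track const& track)
  {
    // ... use the single associated track here
  }
};
```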
Also in this case, the advantage is that your code might be up for parallelization and vectorization.
Notice that you are not limited to two different collections, but you could specify more. E.g.:
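(A sketch; the `o2::aod::V0` type name follows the naming convention of the standard data model and is an assumption.)

```cpp
struct MyTask : AnalysisTask {
  void process(o2::aod::Collision const& collision, o2::aod::V0 const& v0,
               o2::aod::Tracks const& tracks)
  {
    // ... tracks associated to the given v0 of the given collision
  }
};
```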
This will be invoked for each v0 associated to a given collision, and you will be given the tracks associated to it.
This means that each subsequent argument is associated to all the ones preceding it.
For performance reasons, it is sometimes a good idea to split data in separate tables, so that one can request only the subset which is required for a given task. For example, so far the track-related information is split in three tables: `Tracks`, `TrackCovs`, `TrackExtras`.
However, you might need to get all the information at once. This can be done by asking for a `Join` table in the `process` method:
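(A sketch; the `soa::Join` spelling follows current O2 conventions and is an assumption relative to this WIP text.)

```cpp
struct MyTask : AnalysisTask {
  // a joined view: each row exposes the columns of all three tables
  void process(soa::Join<aod::Tracks, aod::TrackCovs, aod::TrackExtras> const& fullTracks)
  {
    // ... use the joined rows here
  }
};
```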
In order to create new collections of objects, you need two things. First of all, you need to define a datatype for them; then you need to specify that your analysis task will create such objects. Notice that in a given workflow, only one task is allowed to create a given type of object.
In order to define the datatype you need to use the `DEFINE_SOA_COLUMN` and `DEFINE_SOA_TABLE` helpers, defined in `ASoA.h`. Assuming you want to extend the standard AOD format, you will also need `Framework/AnalysisDataModel.h`. For example, to define an extra table providing phi and eta, you first need to define the two columns:
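(A sketch; note that in current O2 the helpers are spelled `DECLARE_SOA_COLUMN` / `DECLARE_SOA_TABLE`, so the exact macro names and signatures may differ from the text above.)

```cpp
#include "Framework/ASoA.h"

namespace o2::aod
{
namespace etaphi
{
DECLARE_SOA_COLUMN(Eta, eta, float); // column type, getter name, value type
DECLARE_SOA_COLUMN(Phi, phi, float);
} // namespace etaphi
} // namespace o2::aod
```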
and then you put them together in a table:
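(Continuing the sketch above; `"AOD"` and `"ETAPHI"` are the illustrative dataOrigin and dataDescription.)

```cpp
namespace o2::aod
{
// a table is just a collection of columns plus origin/description metadata
DECLARE_SOA_TABLE(EtaPhi, "AOD", "ETAPHI",
                  etaphi::Eta, etaphi::Phi);
} // namespace o2::aod
```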
Notice that tables are actually just collections of columns.
Once you have the new datatype defined, you can have a task producing it by using the `Produces` helper:
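(A sketch; `calculateEta` and `calculatePhi` are hypothetical user helpers.)

```cpp
struct MyTask : AnalysisTask {
  Produces<aod::EtaPhi> etaphi;

  void process(o2::aod::Track const& track)
  {
    // each invocation of the cursor appends one row to the EtaPhi table
    etaphi(calculateEta(track), calculatePhi(track));
  }
};
```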
The `etaphi` object is a functor that will effectively act as a cursor which allows one to populate the `EtaPhi` table. Each invocation of the functor will create a new row in the table, using the arguments as contents of the given columns. By default the arguments must be given in order, but one can give them in any order by using the correct column type. E.g. in the example above:
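(A hypothetical sketch of the out-of-order invocation; the exact wrapping syntax is an assumption.)

```cpp
// hypothetical: wrapping each value in its column type makes the order irrelevant
etaphi(etaphi::Phi(calculatePhi(track)), etaphi::Eta(calculateEta(track)));
```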
Sometimes columns are not backed by actual persisted data, but are merely derived from it. For example you might want to have different representations (e.g. spherical, cylindrical) for a given persistent representation. You can do that by using the `DECLARE_SOA_DYNAMIC_COLUMN` macro.
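(A sketch along the lines of the macros above; the lambda-based signature is an assumption.)

```cpp
namespace o2::aod
{
namespace point
{
DECLARE_SOA_COLUMN(X, x, float);
DECLARE_SOA_COLUMN(Y, y, float);
// a dynamic column: computed on the fly from the columns it is bound to
DECLARE_SOA_DYNAMIC_COLUMN(R2, r2,
                           [](float x, float y) { return x * x + y * y; });
} // namespace point

// R2 binds to X and Y only here, when attached to the table
DECLARE_SOA_TABLE(Points, "AOD", "POINT",
                  point::X, point::Y, point::R2<point::X, point::Y>);
} // namespace o2::aod
```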
Notice how the dynamic column is defined as a stand alone column and binds to X and Y only when you attach it as part of a table.
Sometimes it's handy to perform an action when all the data has been processed, for example executing a fit on a histogram filled during the processing. This can be done by implementing the `postRun` method.
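(A sketch; only the `postRun` name is taken from the text.)

```cpp
struct MyTask : AnalysisTask {
  void process(o2::aod::Tracks const& tracks)
  {
    // ... fill a histogram here
  }

  // invoked once, after all the data has been processed
  void postRun()
  {
    // ... e.g. fit the accumulated histogram
  }
};
```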
New tables are not the only kind of objects you want to create: most likely you would like to fill histograms associated to the objects you have computed. You can do so by using the `Histogram` helper:
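(A hypothetical sketch; only the `Histogram` helper name comes from the text, while the member declaration and the `fill()` call are assumptions.)

```cpp
struct MyTask : AnalysisTask {
  Histogram etaHistogram; // the Histogram helper mentioned above

  void process(o2::aod::EtaPhi const& etaphi)
  {
    etaHistogram.fill(etaphi.eta()); // hypothetical fill() call
  }
};
```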
Besides the `Produces` helper, which allows you to create a new table which can be reused by others, there is another way to define a single column, via the `Defines` helper.
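(A purely hypothetical sketch; only the `Defines` name comes from the text, and the expression syntax is an assumption.)

```cpp
struct MyTask : AnalysisTask {
  // a single derived column, defined from an expression on existing columns;
  // the expression itself is illustrative
  Defines<aod::etaphi::Eta> eta = track::tgl * 2.f;
};
```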
Given a process function, one can of course define a filter using an `if` condition:
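(A sketch of the naive approach.)

```cpp
struct MyTask : AnalysisTask {
  void process(o2::aod::Tracks const& tracks)
  {
    for (auto& track : tracks) {
      if (track.pt() > 1) { // in-code filtering: repeated in every task
        // ... process the selected track
      }
    }
  }
};
```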
However, this has the disadvantage that the filtering will be done for every task which has similar or more restrictive conditions. By declaring your filters upfront you can not only simplify your code, but allow the framework to optimize your processing. To do so, we provide two helpers: `Filter` and `Partition`.

The most common kind of filtering is when you process objects only if one of their properties passes a certain criterion. This can be specified with the `Filter` helper.
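(A sketch; the condition comes from the text below, while the `soa::Filtered` subscription type follows current O2 and is an assumption.)

```cpp
struct MyTask : AnalysisTask {
  // declared upfront: the framework can apply it before process is invoked
  Filter ptFilter = track::pt > 1;

  void process(soa::Filtered<aod::Tracks> const& filteredTracks)
  {
    // ... only tracks with pt > 1 arrive here
  }
};
```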
`filteredTracks` will contain only the tracks in the table which pass the condition `track::pt > 1`.

You can specify multiple filters which will be applied in sequence, effectively resulting in the intersection of all of them.
You can also specify filters on associated quantities:
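(A sketch; the `max()` aggregate in a filter expression is an assumption.)

```cpp
struct MyTask : AnalysisTask {
  // select collisions having at least one track with pt > 1
  Filter collisionFilter = max(track::pt) > 1;

  void process(soa::Filtered<aod::Collisions> const& filteredCollisions)
  {
    // ... only matching collisions arrive here
  }
};
```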
This will process all the collisions which have at least one track with `pt > 1`.
Filtering is not the only kind of conditional processing one wants to do. Sometimes you need to divide your data in two or more partitions. This is done via the `Partition` helper:
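(A sketch; the `leftTracks(tracks)` calling convention is an assumption.)

```cpp
using namespace o2::aod;

struct MyTask : AnalysisTask {
  // two complementary partitions of the same table
  Partition<Tracks> leftTracks = track::eta < 0;
  Partition<Tracks> rightTracks = track::eta >= 0;

  void process(Tracks const& tracks)
  {
    for (auto& left : leftTracks(tracks)) {
      for (auto& right : rightTracks(tracks)) {
        // ... process the (left, right) pair
      }
    }
  }
};
```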
That is, `Filter` is applied to the objects before passing them to the `process` method, while `Select` objects can be used to do further reduction inside the `process` method itself.
Of course it should be possible to filter and partition data in the same task. The way this works is that multiple `Filter`s are logically ANDed together, and the result is then ANDed with the OR of all the specified `Select` selections.
One of the features of the current framework is the ability to customize cuts and selections on the fly. The idea is to allow that by having a `configurable("mnemonic-name-of-the-parameter")` helper which can be used to refer to configurable options. The previous example would then become:
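(A sketch, reusing the filter example above; how the configurable is typed is an assumption.)

```cpp
struct MyTask : AnalysisTask {
  // the cut is now referred to by a mnemonic name and can be set on the fly
  Filter ptFilter = track::pt > configurable("pt-cut");

  void process(soa::Filtered<aod::Tracks> const& filteredTracks)
  {
    // ... as before, with a customizable threshold
  }
};
```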
To get combinations of distinct tracks, helper functions from `ASoAHelpers.h` can be used. Presently, there are three combination policies available: strictly upper, upper, and full. `CombinationsStrictlyUpperPolicy` is applied by default if all tables are of the same type, otherwise `FullIndexPolicy` is applied.
The number of elements in a combination is deduced from the number of arguments passed to the `combinations()` call. For example, to get pairs of tracks from the same source, one must specify the `tracks` table twice:
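(A sketch; the structured-binding iteration follows current O2 conventions.)

```cpp
#include "Framework/ASoAHelpers.h"

void process(aod::Tracks const& tracks)
{
  // strictly upper pairs: each unordered pair (t0, t1) is visited once
  for (auto& [t0, t1] : combinations(tracks, tracks)) {
    float dEta = t0.eta() - t1.eta(); // illustrative use of the pair
  }
}
```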
The combination can consist of elements from different tables (of different kinds):
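(A sketch; `aod::Jets` is an illustrative second table.)

```cpp
void process(aod::Tracks const& tracks, aod::Jets const& jets)
{
  // with tables of different kinds the full index policy applies by default
  for (auto& [track, jet] : combinations(tracks, jets)) {
    // ... every (track, jet) pair is visited
  }
}
```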
One can get combinations of elements with the same value in a given column. Input tables do not need to be the same but each table must contain the column used for categorizing. Additionally, you can specify a value to be skipped for grouping as well as the number of elements to be matched with the first element in a combination. Again, full, strictly upper and upper policies are available:
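(A sketch; the policy constructor takes the category column name, the number of category neighbours, the value to skip, and the tables, but the exact argument order is an assumption.)

```cpp
// pair collisions of the same category, here defined by the run number;
// entries with value -1 are skipped, and each element is combined with at
// most 5 subsequent elements of its category
for (auto& [c0, c1] : combinations(CombinationsBlockStrictlyUpperSameIndexPolicy(
       "fRunNumber", 5, -1, collisions, collisions))) {
  // ... same-category pair (c0, c1)
}
```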
For better performance, if the same table is used, the `Block{Full,StrictlyUpper,Upper}SameIndex` policies should be preferred. `selfCombinations()` is a shortcut to apply the `StrictlyUpperSameIndex` policy:
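(A sketch, with the same illustrative arguments as above.)

```cpp
// equivalent to the block strictly-upper same-index policy above
for (auto& [t0, t1] : selfCombinations("fRunNumber", 5, -1, tracks, tracks)) {
  // ... same-category pair (t0, t1)
}
```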
It will be possible to specify a filter for a combination as a whole, so that only matching combinations are output. Currently, the filter is applied to each element separately. Note that for the filtered version the input tables are mentioned twice, both in the policy constructor and in the `combinations()` call itself:
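(A sketch; the exact placement of the filter argument is an assumption.)

```cpp
// note the tables appear twice: in the policy constructor and in the
// combinations() call itself
Filter pairFilter = track::pt > 1; // illustrative per-element condition
for (auto& [t0, t1] : combinations(CombinationsFullIndexPolicy(tracks, tracks),
                                   pairFilter, tracks, tracks)) {
  // ... only pairs whose elements pass the condition are visited
}
```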
Produced tables can be saved to file as TTrees. This process is customized by various command line options of the internal-dpl-aod-writer. The options allow one to specify which columns of which table are saved to which tree in which file. Please be aware that the functionality of these options is preliminary and might change in the future.
The options to consider are:
`aod-writer-keep` is a comma-separated list of `DataOutputDescriptors`:

`DataOutputDescriptor1,DataOutputDescriptor2,...`

Each `DataOutputDescriptor` is a colon-separated list of 4 items:

`table:tree:columns:file`

It instructs the internal-dpl-aod-writer to save the columns `columns` of table `table` as TTree `tree` in folder `TF_x` of file `file.root`. The selected columns are saved as separate TBranches of TTree `tree`.
By default `x` is incremented with every time frame. This behavior can be modified with the command line option `--aod-writer-ntfmerge`. The value of `aod-writer-ntfmerge` specifies the number of time frames to merge into one `TF_x` folder.
The first item of a `DataOutputDescriptor` (`table`) is mandatory and needs to be specified, otherwise the `DataOutputDescriptor` is ignored. The other three items are optional and are filled with default values if missing.

The format of `table` is `AOD/tablename/0`, where `tablename` is the name of the table as defined in the workflow definition.
The format of `tree` is a simple string which names the TTree the table is saved to. If `tree` is not specified then `O2tablename` is used as TTree name.
`columns` is a slash (/)-separated list of column names, e.g. `col1/col2/col3`. The column names are expected to match the column names of table `tablename` as defined in the respective workflow. Non-matching columns are ignored. The selected table columns are saved as separate TBranches with the same names as the corresponding table columns. If `columns` is not specified then all table columns are saved.
`file` finally specifies the base name of the file the tables are saved to. The actual file name is `file.root`. If `file` is not specified the default file name is used. The default file name can be set with the command line option `--aod-writer-resfile`. However, if `aod-writer-resfile` is missing then the default file name is set to `AnalysisResults_trees`.
The `aod-writer-keep` option also accepts the string "dangling" (or any leading sub-string of it). In this case all dangling output tables are saved. For the parameters `tree`, `columns`, and `file` the default values (see table below) are used.
`aod-writer-ntfmerge` specifies the number of time frames which are merged into a given folder `TF_x`. By default this value is set to 1. `x` is incremented by 1 every `aod-writer-ntfmerge` time frames.
`aod-writer-resfile` specifies the default base name of the results files to which tables are saved. If in any of the `DataOutputDescriptors` the `file` value is missing, it will be set to this default value.
`aod-writer-json` specifies the name of a json file which contains the full information needed to customize the behavior of the internal-dpl-aod-writer. It can replace the other three options completely. Nevertheless, currently all options are supported (see also the discussion below).
An example file is shown in the highlighted field below. The relevant information is contained in a json object `OutputDirector`. The `OutputDirector` can include three different items:

1. `resfile` is a string and corresponds to the `aod-writer-resfile` command line option.
2. `ntfmerge` is an integer and corresponds to the `aod-writer-ntfmerge` command line option.
3. `OutputDescriptors` is an array of objects and corresponds to the `aod-writer-keep` command line option. The objects are equivalent to the `DataOutputDescriptors` of the `aod-writer-keep` option and are composed of 4 items which correspond to the 4 items of a `DataOutputDescriptor`:

   a. `table` is a string
   b. `treename` is a string
   c. `columns` is an array of strings
   d. `filename` is a string
Example json file for the internal-dpl-aod-writer
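(A sketch following the structure described above, with illustrative table, tree, column, and file names.)

```json
{
  "OutputDirector": {
    "resfile": "defresults",
    "ntfmerge": 10,
    "OutputDescriptors": [
      {
        "table": "AOD/UNO/0",
        "columns": ["col1", "col2"],
        "treename": "uno",
        "filename": "unoresults"
      },
      {
        "table": "AOD/DUE/0",
        "treename": "due",
        "filename": "dueresults"
      }
    ]
  }
}
```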
The information provided with the json file and the information which can be provided with the other command line options is obviously redundant. Nevertheless, currently all options can be used together. In practice, the json file, if provided, is read first; parameters are then overridden by values specified with other command line options. If any parameter value is still unset, its default value is used.

This hierarchy of the options is summarized in the following table. The columns represent the command line options and the rows the parameters which can be set. The table elements specify the priority a given command line option has in setting the value of a given parameter. The last column in the table is the default, which always has the lowest priority. The actual default value is shown in brackets.
| parameter \ option | aod-writer-keep | aod-writer-resfile | aod-writer-ntfmerge | aod-writer-json | default |
|---|---|---|---|---|---|
| default file name | - | 1. | - | 2. | 3. (`AnalysisResults_trees`) |
| ntfmerge | - | - | 1. | 2. | 3. (1) |
| tablename | 1. | - | - | 2. | - |
| tree | 1. | - | - | 2. | 3. (`tablename`) |
| columns | 1. | - | - | 2. | 3. (all columns) |
| file | 1. | 2. | - | 3. | 4. (default file name) |
If in any case two `DataOutputDescriptors` are provided which have equal combinations of the `tree` and `file` parameters, then the processing is stopped! It is not possible to save two trees with the same name to a given file.
The internal-dpl-aod-reader reads trees from root files and provides them as arrow tables to the requesting workflows. Its behavior is customized with the following command line options:
`aod-file` takes a string as option value, which either is the name of the input root file or, if starting with an `@` character, the name of an ASCII file which contains a list of input files.
`aod-reader-json` is a string and specifies a json file, which contains the customization information for the internal-dpl-aod-reader. An example file is shown in the highlighted field below. The relevant information is contained in a json object `InputDirector`. The `InputDirector` can include the following three items:

1. `resfiles` is a string or an array of strings and corresponds to the `aod-file` command line option. Like the `aod-file` option it can specify a single input file or, when the option value starts with an `@` character, an ASCII file with a list of input files. In addition `resfiles` can be an array of strings which contains a list of input files.
2. `fileregex` is a regex string which is used to select the input files from the file list specified by `resfiles`.
3. `InputDescriptors` is an array of objects, the so-called `DataInputDescriptors`, which are composed of 4 items:

   a. `table` is a string and specifies the table to fill. The `table` needs to be provided in the format `AOD/tablename/0`, where `tablename` is the name of the table as defined in the workflow definition.
   b. `treename` is a string and specifies the tree which is to be used to fill `table`.
   c. `resfiles` is either a string or an array of strings. It specifies a list of possible input files (see the discussion of `resfiles` above).
   d. `fileregex` is a regular expression string which is used to select the input files from the file list specified by `resfiles`.
The information contained in a `DataInputDescriptor` instructs the internal-dpl-aod-reader to fill table `table` with the values from the tree `treename` in the folders `TF_x` of the files which are defined by `resfiles` and whose names match the regex `fileregex`.
Of the four items of a `DataInputDescriptor`, `table` is the only required information. If one of the other items is missing, its value will be set as follows:

- `treename` is set to `O2tablename` of the respective `table` item.
- `resfiles` is set to `resfiles` of the `InputDirector` (1st item of the `InputDirector`). If that is missing, then the value of the `aod-file` option is used. If that is also missing, then `AnalysisResults_trees.root` is used.
- `fileregex` is set to `fileregex` of the `InputDirector` (2nd item of the `InputDirector`). If that is missing, then `.*` is used.

Example json file for the internal-dpl-aod-reader
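(A sketch with illustrative names, following the structure described above.)

```json
{
  "InputDirector": {
    "resfiles": "@inputfiles.txt",
    "fileregex": ".*",
    "InputDescriptors": [
      {
        "table": "AOD/UNO/0",
        "treename": "uno",
        "resfiles": ["unoresults_1.root", "unoresults_2.root"]
      }
    ]
  }
}
```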
When the internal-dpl-aod-reader receives the request to fill a given table `tablename`, it searches in the provided `InputDirector` for the corresponding `InputDescriptor` and proceeds as defined there. However, if there is no corresponding `InputDescriptor`, it falls back to the information provided by the `resfiles` and `fileregex` options of the `InputDirector` and uses `O2tablename` as `treename`.
The `aod-reader-json` option allows one to set up the reading of tables in a rather flexible way. Here a few presumably practical cases are discussed:

1. Tables `tableA` and `tableB` need to be processed together. Table `tableA` was previously saved as tree `tableA` to files `tableAResults_n.root`, where `n` is a number, and `tableB` was saved as tree `tableB` to files `tableBResults_n.root`. A json file along the lines of the sketch shown after this list could be used to read these tables.
2. All tables are saved with default settings to files `tableResult_n.root`, except for one table, namely `tableA`, which is saved as tree `treeA` in files `tableAResult_n.root`.
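(A sketch for the first case; the `@inputfiles.txt` list, assumed to contain all candidate files, is illustrative.)

```json
{
  "InputDirector": {
    "resfiles": "@inputfiles.txt",
    "InputDescriptors": [
      {
        "table": "AOD/tableA/0",
        "treename": "tableA",
        "fileregex": "tableAResults_[0-9]+\\.root"
      },
      {
        "table": "AOD/tableB/0",
        "treename": "tableB",
        "fileregex": "tableBResults_[0-9]+\\.root"
      }
    ]
  }
}
```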
Some general remarks:

- All `InputDescriptors` need to have the same number of selected input files. This is internally checked and the processing is stopped if it turns out that this is not the case.
- It is the user's responsibility to ensure that the file lists of the various `InputDescriptors` correspond to each other.
- `fileregex` is evaluated with the C++ Regular expressions library, so check there for the proper syntax of regexes.

We could add a `template <typename C...> reshuffle()`
method to the Table class which allows you to reduce the number of columns or attach new dynamic columns. A template wrapper could even be used to specify whether a given dynamic column should be precalculated (or not). This would come in handy to optimize the creation of a RowView, which could bind only the required (dynamic) columns. E.g.:
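(A purely hypothetical sketch of this idea; neither `reshuffle()` nor the `Precalculate` wrapper exist.)

```cpp
// hypothetical API: narrow the Points table defined earlier to X and Y and
// attach the dynamic column R2, asking for it to be precalculated
auto slimmed = points.reshuffle<point::X, point::Y, Precalculate<point::R2>>();
for (auto& row : slimmed) {
  // the row view binds only the requested columns
}
```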