QualityControl 1.5.1
O2 Data Quality Control Framework
This is a resource meant for the developers of the QC. Whenever we learn something useful we put it here. It is not sanitized or organized. Just a brain dump.
```shell
git checkout v0.26.1
git checkout -b branch_v0.26.2
git cherry-pick b187ddbe52058d53a9bbf3cbdd53121c6b936cd8
git push upstream -u branch_v0.26.2
git tag -a v0.26.2 -m "v0.26.2"
git push upstream v0.26.2
```
The config file is stored in git in the branch repo_cleaner (careful not to update master instead!). Check out the branch, update the file Framework/script/RepoCleaner/config.yaml and commit it. A PR is necessary, but in case of emergency force-merge it. As soon as it is merged, it will be used by the script.
The config file used to be in aldaqci@aidrefflp01:~/alice-ccdb/config.yaml but this is no longer the case.
The different cleaning policies available at the moment are: 1_per_hour, 1_per_run, last_only, none_kept, skip
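As an illustration, a rule in config.yaml might map an object path to one of these policies. The exact schema is defined by the RepoCleaner script, so the field names below are assumptions; check the RepoCleaner README for the real format.

```yaml
# Hypothetical sketch of a repo_cleaner rule; field names are illustrative,
# see Framework/script/RepoCleaner/README.md for the actual schema.
Rules:
  - object_path: qc/TST/.*   # pattern matching objects in the repository
    delay: 60                # grace period before cleaning (unit assumed)
    policy: 1_per_hour       # one of: 1_per_hour, 1_per_run, last_only, none_kept, skip
```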
The repo_cleaner is launched every 5 minutes by Jenkins.
Documentation of the repo_cleaner can be found [here](../Framework/script/RepoCleaner/README.md).
Until version 3 of the class MonitorObject, objects were stored in the repository directly. They are now stored within TFiles. The issue with the former approach is that the StreamerInfos are lost. To be able to load old data, the StreamerInfos have been saved in a root file, "streamerinfos.root". The CcdbDatabase access class loads this file and the StreamerInfos upon creation, which allows for a smooth reading of the old objects. The day we are certain nobody will add objects in the old format and that the old objects have been removed from the database, we can delete this file and remove the loading from CcdbDatabase. Moreover, the following lines can be removed:
To generate the doxygen doc locally, do:

```shell
cd sw/BUILD/QualityControl-latest/QualityControl
make doc
```

It will be available in doc/html; to open it quickly, do `[xdg-]open doc/html/index.html`.
When we don't see the monitoring data in grafana, here is what to do to pinpoint the source of the problem.

```shell
ssh root@aido2mon-gpn.cern.ch
systemctl stop influxdb
nc -u -l 8087        # make sure to use the proper port (cf. your monitoring url)
influx
show databases
use qc               # use the proper database as found in influxdb.conf
show series
select count(*) from cpuUsedPercentage   # use the correct metric name
```

The monitoring url is set in the config file, e.g. "url": "influxdb-udp://flptest2.cern.ch:8089".
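For context, that url line lives in the QC configuration file. A minimal sketch, with the surrounding keys assumed from typical QC configs:

```json
{
  "qc": {
    "config": {
      "monitoring": {
        "url": "influxdb-udp://flptest2.cern.ch:8089"
      }
    }
  }
}
```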
In case of a need to avoid writing QC objects to a repository, one can choose the "Dummy" database implementation in the config file. This might be useful when one expects very large amounts of data that would be stored but not actually needed (e.g. benchmarks).
The QCG server for qcg-test.cern.ch is hosted on qcg-qcg. The config file is /home/qcg/QualityControlGui/config.js.
Anyone in alice-member has access. We use the egroup alice-o2-qcg-access to grant or deny access; this egroup contains alice-member plus a few extras. This allows non-ALICE members to access the QCG.
```shell
systemctl restart qcg
```
We use the infologger. There is a utility class, QcInfoLogger, that can be used. It is a singleton. See the header for its usage.
Related issues: https://alice.its.cern.ch/jira/browse/QC-224
Service discovery (Online mode) is used to list currently published objects by running QC tasks and checkers. It uses Consul to store:
Both lists are updated from within the QC task using the Service Discovery C++ API:
- register - when a task starts
- deregister - when a task ends

If the "health check" is failing, make sure the ports 7777 (Tasks) and 7778 (CheckRunners) are open.
When a QC task starts, it registers its presence in Consul by calling the register endpoint of the Consul HTTP API. The request needs the following fields:
- Id - Task ID (must be unique)
- Name - Task name; tasks can have the same name when they run on multiple machines
- Tags - List of published objects
- Checks - Array of health check details for Consul; each should contain Name, Interval, the type of check with the endpoint to be checked by Consul (e.g. "TCP": "localhost:1234") and DeregisterCriticalServiceAfter, which defines the timeout after which a service failing its health checks is automatically deregistered (minimum value 1m).

In order to deregister a service, the deregister/:Id endpoint of the Consul HTTP API needs to be called. It does not need any additional parameters.
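The register payload described above can be sketched as follows. Only the field names come from the text; all the values (task id, object names, check details) are made up for the example.

```python
import json

# Illustrative service definition for the Consul register endpoint;
# the values below are invented for the example.
service = {
    "Id": "qc-task-example",         # must be unique
    "Name": "ExampleQcTask",         # tasks may share a name across machines
    "Tags": ["object1", "object2"],  # list of published objects
    "Checks": [{
        "Name": "tcp-health",
        "Interval": "10s",
        "TCP": "localhost:7777",     # Tasks use 7777, CheckRunners 7778
        "DeregisterCriticalServiceAfter": "1m",  # minimum value is 1m
    }],
}

payload = json.dumps(service)
print(payload)
# Deregistering is a call to deregister/<Id> with no additional parameters.
```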
What are the QC integration tests in the FLP Pipeline doing?
Those object names are configurable from Ansible so that we do not have to release a new QCG rpm when we need to update the objects we check. So, if you know something will change, modify the following file: https://gitlab.cern.ch/AliceO2Group/system-configuration/-/blob/dev/ansible/roles/flp-deployment-checks/templates/qcg-test-config.js.j2
If this test fails and one wants to investigate, first resume the VM in OpenStack; the normal web interfaces are then available.
When working on the ansible recipes and deploying with o2-flp-setup, the recipes to modify are in .local/share/o2-flp-setup/system-configuration/.
https://alice.its.cern.ch/jira/browse/O2-169