Rater bias removal and Weak Ground Truth determination tool

This tool estimates the Weak Ground Truth (WGT) of your annotated data, together with improved estimates of rater reliability and agreement. The tool outputs separate CSV files containing the WGT, reliability, and agreement. The implemented algorithms are described in the paper: Weak Ground Truth Determination of Continuous Human-Rated Data.

Instructions on the input and output data formats and on configuring the procedure are given below.

License. For research purposes only. Please cite the paper.

Format of input and output data

Input and output data are formatted as comma-delimited CSV files with uniformly sampled data in rows. Prepare and upload one CSV file for each rated scalar quantity.

Input format

Comma-delimited CSV file with rater labels in columns, one column for each rater, and no time stamps. Integers and decimal numbers in the form -1.234-01 are accepted. An example of an input file is given here.
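As an illustration of the layout described above, the following Python sketch writes a small ratings table with one row per sample and one column per rater. The function name write_ratings_csv and the rating values are made up for this example, and the absence of a header row is an assumption, since the format description does not mention one.

```python
import csv
import io


def write_ratings_csv(ratings, stream):
    """Write one row per sample and one column per rater, comma-delimited.

    Assumption: no header row and no time stamps are written.
    """
    n_raters = len(ratings[0])
    writer = csv.writer(stream, lineterminator="\n")
    for row in ratings:
        if len(row) != n_raters:
            raise ValueError("every sample needs exactly one rating per rater")
        writer.writerow(row)


# Three uniformly sampled time points rated by four raters (made-up values).
ratings = [
    [0.10, 0.25, 0.05, 0.20],
    [0.15, 0.30, 0.10, 0.20],
    [0.40, 0.45, 0.35, 0.50],
]

buffer = io.StringIO()
write_ratings_csv(ratings, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file for upload, the same function can be called with a stream opened via open("ratings.csv", "w", newline="").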

Output format

Comma-delimited CSV file 1, containing WGT values where they are well defined (reliability over the threshold) and the string “nan” otherwise.
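A WGT file with “nan” entries can be read in Python without special handling, because float("nan") parses to a not-a-number value. The sketch below assumes one WGT value per row in the first column, which the format description does not state explicitly. The helper names read_wgt and defined_fraction and the sample values are hypothetical.

```python
import csv
import io
import math


def read_wgt(stream):
    """Return WGT values as floats; the string "nan" becomes float NaN.

    Assumption: one value per row, stored in the first column.
    """
    values = []
    for row in csv.reader(stream):
        if not row:
            continue  # skip blank lines
        values.append(float(row[0]))
    return values


def defined_fraction(values):
    """Share of samples where the WGT is defined (not NaN)."""
    defined = [v for v in values if not math.isnan(v)]
    return len(defined) / len(values)


# Made-up file content: two defined samples and two undefined ones.
sample = io.StringIO("0.12\nnan\n0.30\nnan\n")
wgt = read_wgt(sample)
print(defined_fraction(wgt))
```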

Comma-delimited CSV file 2 – Summary, containing the following data:

  • Overall ICC before and after optimization
  • ICC after optimization
  • The ratio of samples above ICC threshold before and after optimization
  • Rating entropy before and after optimization
  • Rating min-max before and after optimization

Examples of the output files are given here – WGT and here – Summary.

The tool also prepares figures of WGT before and after optimization, ICC before and after optimization, the histogram of rating values, and min-max ranges of ratings before and after optimization, all in .png and .pdf format. Sample images are available in the .zip archive here.

How to use the tool?

Follow this procedure:

  1. Fill out the registration form and get access to the tool
  2. Format your data as an input CSV file and select it using the [Choose file button]
  3. Provide the ICC threshold
  4. Provide the sampling time of your rating procedure (the time interval at which raters provided ratings)
  5. Provide the length of the reliability estimation time interval in number of data points: minimum 40, recommended 80
  6. Provide regularization term for the optimization (if uncertain, leave 0.1)
  7. Provide a range of the rating scale (lower and upper limit)
  8. Name your values (to appear on graphs)
  9. Upload input file: [Submit button]
  10. Be patient. It takes some time to process the data. For the sample provided, containing about 10K ratings from four raters, it takes about 30 seconds to obtain the results.
  11. Download output files to the local computer
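Before submitting the form, it can help to check the settings from steps 3 to 8 locally. The following Python sketch is only an illustration: the class ToolSettings and its field names are hypothetical and are not part of the tool. The checks encode the minimum of 40 data points and the default regularization of 0.1 from the steps above, plus the general fact that an ICC cannot exceed 1. The default scale limits are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ToolSettings:
    """Hypothetical container for the values entered in the form."""
    icc_threshold: float
    sampling_time: float          # interval between consecutive ratings
    window_length: int = 80       # recommended number of data points
    regularization: float = 0.1   # suggested value if uncertain
    scale_min: float = 0.0        # placeholder lower limit of the rating scale
    scale_max: float = 1.0        # placeholder upper limit of the rating scale
    value_name: str = "rating"    # label shown on the graphs

    def problems(self):
        """Return a list of human-readable issues; empty if none are found."""
        issues = []
        if self.icc_threshold > 1:
            issues.append("ICC threshold above 1 can never be reached")
        if self.sampling_time <= 0:
            issues.append("sampling time must be positive")
        if self.window_length < 40:
            issues.append("window length must be at least 40 data points")
        if self.scale_min >= self.scale_max:
            issues.append("lower scale limit must be below the upper limit")
        return issues


ok_settings = ToolSettings(icc_threshold=0.7, sampling_time=0.5)
short_window = ToolSettings(icc_threshold=0.7, sampling_time=0.5, window_length=20)
print(ok_settings.problems())
print(short_window.problems())
```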

Issues and questions: andrej.kosir[at]

Access to the tool

Please fill out the registration form to access the tool.

How to cite?

Plain text:
A. Košir, G. Strle and M. Meža, “Weak Ground Truth Determination of Continuous Human-Rated Data,” in IEEE Access, vol. 9, pp. 4594-4606, 2021, doi: 10.1109/ACCESS.2020.3046293.

BibTeX:
@article{kosir2021weak,
  author={A. {Košir} and G. {Strle} and M. {Meža}},
  journal={IEEE Access},
  title={Weak Ground Truth Determination of Continuous Human-Rated Data},
  year={2021},
  volume={9},
  pages={4594-4606},
  keywords={Reliability;Annotations;Correlation;Machine learning;Distortion;Probabilistic logic;Optimization;Bias removal;continuous data;inter-rater reliability and agreement;weak ground truth},
  doi={10.1109/ACCESS.2020.3046293}
}



