CrowdTruth/CrowdTruth-core

CrowdTruth

This library processes crowdsourcing results from Amazon Mechanical Turk and CrowdFlower following the CrowdTruth methodology. For more information see http://crowdtruth.org.

Installation

Download the library and install it using python setup.py crowdtruth

Getting Started

Create a folder anywhere on your machine and fill it with raw result files from Amazon Mechanical Turk or CrowdFlower. These files should be the unaltered CSV files generated by either platform, with one collected judgment per row. A folder may contain files from both platforms, as long as they belong to the same task. All files in the same folder are aggregated together, so if you have multiple tasks, put the results for each task in a separate sub-folder. An example of this layout can be seen in the /examples folder.

Once the files are in place, the code can be called from the command line with crowdtruth. The code detects sub-folders, so you can run it from the main folder to compute the results for every sub-folder, or run it from within a single sub-folder to process only that one. The results for each folder are saved in results.xlsx, which contains a separate tab for the crowdsourcing jobs, units, workers, judgments and annotations.
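
Before running the tool on a large folder tree, it can help to check which folders hold result files. The following is a minimal sketch, not part of CrowdTruth: the helper name find_task_folders is hypothetical, and it assumes that any folder directly containing at least one .csv file counts as one task. The actual detection logic in the library may differ.

```python
from pathlib import Path


def find_task_folders(root):
    """List folders under root (root included) that directly contain .csv files.

    Hypothetical helper for inspecting a folder layout; it assumes one task
    per folder that holds at least one CSV file.
    """
    root = Path(root)
    # Consider the root itself plus every nested directory, in sorted order.
    candidates = [root] + sorted(p for p in root.rglob("*") if p.is_dir())
    task_folders = []
    for folder in candidates:
        has_csv = any(
            f.is_file() and f.suffix.lower() == ".csv" for f in folder.iterdir()
        )
        if has_csv:
            task_folders.append(folder)
    return task_folders
```

Calling find_task_folders(".") from the main folder returns the folders you would expect to see processed, which makes it easier to spot result files that ended up in the wrong place.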

Configuration

Custom configuration can be added using a config.py file. Currently the following configuration options are available:

  • name: a label to identify the type of configuration.
  • inputColumns: a list of columns that contain original input data. Setting this option allows you to filter out columns you are not interested in. If empty the columns will be identified automatically.
  • outputColumns: a list of columns that contain judgments. Setting this option allows you to filter out columns you are not interested in. If empty the columns will be identified automatically.
  • units: a list of units to use. If empty all units are used.
  • workers: a list of workers to use. If empty all workers are used.
  • jobs: a list of jobs to use. If empty all jobs are used.
  • processJudgments(self, judgments): a function to alter the judgments before they are processed in CrowdTruth. The judgments variable is a Pandas dataframe with all judgments of one input file. This function should always return the same dataframe, with only the input or output columns altered. The identified input columns are stored in self.input.keys() and the output columns in self.output.keys().
  • processResults(self, results): a function to alter the results after they are processed in CrowdTruth. This allows custom metrics to be run, or additional visualizations to be generated. The results variable is a dictionary with a Pandas dataframe for the jobs, units, workers, judgments and annotations. Each of these dataframes may be altered and new dataframes may be added to the dictionary. Each of the dataframes is saved as a tab in the results.xlsx file. Additionally, plots can be generated, which will be saved into the folder that is being processed.
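
The options above could be combined in a config.py along the following lines. This is a sketch under stated assumptions, not the library's documented layout: the class name Configuration, the dictionary keys "workers" and "summary", and the choice to return the results dictionary from processResults are all assumptions made for illustration. Only the option names and the use of self.output.keys() come from the description above.

```python
import pandas as pd


class Configuration(object):
    """Illustrative config.py; class name and loading mechanism are assumptions."""

    name = "example-task"   # label to identify this configuration
    inputColumns = []       # empty: input columns are identified automatically
    outputColumns = []      # empty: output columns are identified automatically
    units = []              # empty: use all units
    workers = []            # empty: use all workers
    jobs = []               # empty: use all jobs

    def processJudgments(self, judgments):
        # Normalise free-text answers in every identified output column.
        # Assumes the output columns hold strings; self.output is expected
        # to be filled in by CrowdTruth before this function is called.
        for column in self.output.keys():
            if column in judgments.columns:
                judgments[column] = judgments[column].str.strip().str.lower()
        # Return the same dataframe, with only output columns altered.
        return judgments

    def processResults(self, results):
        # Add a new dataframe; each entry in the dictionary is described
        # as becoming a tab in results.xlsx. The key "workers" is assumed.
        worker_count = len(results["workers"])
        results["summary"] = pd.DataFrame({"worker_count": [worker_count]})
        return results
```

Editing the judgments dataframe in place and returning it keeps to the requirement that the same dataframe comes back with only input or output columns changed. Adding a new entry to the results dictionary, rather than modifying an existing one, leaves the standard tabs untouched.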