Crowdsourcing Multi-Label Classification for Taxonomy Creation

Crowdsourcing Multi-Label Classification for Taxonomy Creation by Jonathan Bragg, Mausam and Daniel S. Weld.


Recent work has introduced CASCADE, an algorithm for creating a globally-consistent taxonomy by crowdsourcing microwork from many individuals, each of whom may see only a tiny fraction of the data (Chilton et al. 2013). While CASCADE needs only unskilled labor and produces taxonomies whose quality approaches that of human experts, it uses significantly more labor than experts. This paper presents DELUGE, an improved workflow that produces taxonomies with comparable quality using significantly less crowd labor. Specifically, our method for crowdsourcing multi-label classification optimizes CASCADE’s most costly step (categorization) using less than 10% of the labor required by the original approach. DELUGE’s savings come from the use of decision theory and machine learning, which allow it to pose microtasks that aim to maximize information gain.
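The abstract's closing idea, posing the microtask expected to yield the most information, can be illustrated with a small sketch. This is not DELUGE's actual model. It assumes a single belief per candidate label, a worker who answers yes/no questions with a fixed symmetric accuracy, and hypothetical names (`expected_information_gain`, `pick_question`, the example labels) chosen only for illustration.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a belief p that a label applies."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def posterior(p, answer_yes, accuracy):
    """Bayes update of belief p after one worker answer (assumed noise model)."""
    like_true = accuracy if answer_yes else 1 - accuracy
    like_false = 1 - accuracy if answer_yes else accuracy
    evidence = p * like_true + (1 - p) * like_false
    if evidence == 0.0:
        return p  # this answer cannot occur under the model; belief unchanged
    return p * like_true / evidence


def expected_information_gain(p, accuracy):
    """Expected drop in entropy from asking one worker about this label."""
    p_yes = p * accuracy + (1 - p) * (1 - accuracy)
    expected_after = (
        p_yes * entropy(posterior(p, True, accuracy))
        + (1 - p_yes) * entropy(posterior(p, False, accuracy))
    )
    return entropy(p) - expected_after


def pick_question(beliefs, accuracy=0.8):
    """Return the label whose yes/no question has the highest expected gain."""
    return max(
        beliefs,
        key=lambda label: expected_information_gain(beliefs[label], accuracy),
    )


# Hypothetical beliefs that each category applies to a single item.
beliefs = {"animals": 0.95, "pets": 0.5, "vehicles": 0.05}
next_label = pick_question(beliefs)
```

Under these assumptions the most uncertain label is asked about first, since a question whose answer is already nearly certain tells the system little.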

This paper extends the work reported in Cascade: Crowdsourcing Taxonomy Creation.

While the reduction in required work is interesting, the ability to sustain more complex workflows looks like the more important result.

That will require developing workflows that can be optimized, at least for subject identification.

Or should I say validation of subject identification?

What workflow do you use for subject identification and/or validation of subject identification?
