Exploring the Use of Deep Learning with Crowdsourcing to Annotate Images

Authors

  • Samreen Anjum, University of Texas at Austin
  • Ambika Verma, Cognex Corporation
  • Brandon Dang, Amazon
  • Danna Gurari, University of Texas at Austin

DOI:

https://doi.org/10.15346/hc.v8i2.121

Keywords:

Crowdsourcing, Computer Vision, Deep Learning, Human Machine Collaboration

Abstract

We investigate what, if any, benefits arise from employing hybrid algorithm-crowdsourcing approaches over the conventional approaches of relying exclusively on algorithms or exclusively on crowds to annotate images. We introduce a framework that enables users to investigate different hybrid workflows for three popular image analysis tasks: image classification, object detection, and image captioning. The framework includes three hybrid approaches, in which workers (i) verify predicted labels, (ii) correct predicted labels, or (iii) annotate images for which algorithms have low confidence in their predictions. Deep learning algorithms are employed in these workflows since they offer high performance for image annotation tasks. Each workflow is evaluated with respect to annotation quality and worker time to completion on images from three diverse datasets (VOC, MSCOCO, and VizWiz). Inspired by our findings, we offer recommendations regarding when and how to employ deep learning with crowdsourcing to achieve desired quality and efficiency for image annotation.
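The third hybrid approach, sending only low-confidence predictions to workers, can be illustrated with a minimal sketch. This is not the authors' framework: the `Prediction` class, the `route_by_confidence` function, and the 0.5 threshold are hypothetical names and values chosen only to show the routing idea.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    image_id: str
    label: str
    confidence: float  # model score in [0, 1]; assumed, not from the paper


def route_by_confidence(predictions, threshold=0.5):
    """Split predictions into those kept as-is and those sent to crowd workers.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    keep, send_to_crowd = [], []
    for p in predictions:
        if p.confidence >= threshold:
            keep.append(p)           # trust the algorithm's label
        else:
            send_to_crowd.append(p)  # ask a worker to annotate this image
    return keep, send_to_crowd


preds = [Prediction("img1", "dog", 0.92), Prediction("img2", "cat", 0.31)]
auto, crowd = route_by_confidence(preds, threshold=0.5)
print([p.image_id for p in auto], [p.image_id for p in crowd])
```

Raising the threshold in such a sketch would send more images to workers, trading worker time for a larger share of human-produced annotations.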

Published

2021-07-27

How to Cite

Anjum, S., Verma, A., Dang, B., & Gurari, D. (2021). Exploring the Use of Deep Learning with Crowdsourcing to Annotate Images. Human Computation, 8(2), 76-106. https://doi.org/10.15346/hc.v8i2.121