Improving Audio Tagging on Imperfect Datasets

이돈문, 2022
Thesis details
Topic-level paper impact of 'Improving Audio Tagging on Imperfect Datasets'
Paper impact summary
Topics
  • Audio tagging
  • knowledge transfer
  • neural networks
  • semi-supervised learning
  • weakly-supervised learning
Total papers on the same topics: 123
Total citations of this paper: 0
Average topic-level paper impact: 0.0%

References of 'Improving Audio Tagging on Imperfect Datasets'

  • [9] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, Oct 2015.
    [2015]
  • [99] A. Van Den Oord, S. Dieleman, and B. Schrauwen, “Transfer learning by supervised pre-training for audio-based music classification,” in Conference of the International Society for Music Information Retrieval (ISMIR 2014), 2014.
    [2014]
  • [98] K. Choi, G. Fazekas, M. Sandler, and K. Cho, “Transfer learning for music classification and regression tasks,” arXiv preprint arXiv:1703.09179, 2017.
    [2017]
  • [97] K. Sohn et al., “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
    [2020]
  • [96] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, “Mixmatch: A holistic approach to semi-supervised learning,” arXiv preprint arXiv:1905.02249, 2019.
    [2019]
  • [95] X. Zhou and M. Belkin, “Semi-supervised learning,” in Academic Press Library in Signal Processing. Elsevier, 2014, vol. 1, pp. 1239–1269.
    [2014]
  • [94] T. Xiao, X. Wang, A. A. Efros, and T. Darrell, “What should not be contrastive in contrastive learning,” arXiv preprint arXiv:2008.05659, 2020.
    [2020]
  • [93] L. Wang and A. v. d. Oord, “Multi-format contrastive learning of audio representations,” arXiv preprint arXiv:2103.06508, 2021.
  • [92] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Byol for audio: Self-supervised learning for general-purpose audio representation,” arXiv preprint arXiv:2103.06695, 2021.
  • [91] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, “Learning problem-agnostic speech representations from multiple self-supervised tasks,” Proc. Interspeech 2019, pp. 161–165, 2019.
    [2019]
  • [90] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” arXiv preprint arXiv:2006.10029, 2020.
    [2020]
  • [8] J. Fu, Y. Wu, T. Mei, J. Wang, H. Lu, and Y. Rui, “Relaxing from vocabulary: Robust weakly-supervised deep learning for vocabulary-free image tagging,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1985–1993.
    [2015]
  • [89] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
    [2020]
  • [88] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 609– 617.
    [2017]
  • [86] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in ICML, 2011.
    [2011]
  • [85] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, “Multi-level attention model for weakly supervised audio classification,” arXiv preprint arXiv:1803.02353, 2018.
    [2018]
  • [83] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 131–135.
  • [82] M. Won, A. Ferraro, D. Bogdanov, and X. Serra, “Evaluation of CNN-based automatic music tagging models,” arXiv preprint arXiv:2006.00751, 2020.
    [2020]
  • [81] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, “Speech recognition using deep neural networks: A systematic review,” IEEE Access, vol. 7, pp. 19143–19165, 2019.
    [2019]
  • [80] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural network architectures and their applications,” Neurocomputing, vol. 234, pp. 11–26, 2017.
    [2017]
  • [79] A. Mesaros, T. Heittola, and T. Virtanen, “Acoustic scene classification: An overview of DCASE 2017 challenge entries,” in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), September 2018, pp. 411–415.
    [2018]
  • [78] K. Choi, G. Fazekas, and M. Sandler, “Automatic tagging using deep convolutional neural networks,” in The 17th International Society of Music Information Retrieval Conference, New York, USA. International Society of Music Information Retrieval, 2016.
    [2016]
  • [76] T. Komatsu, T. Toizumi, R. Kondo, and Y. Senda, “Acoustic event detection method using semi-supervised non-negative matrix factorization with a mixture of local dictionaries,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016, pp. 45–49.
    [2016]
  • [75] D. Stowell and D. Clayton, “Acoustic event detection for multiple overlapping similar sources,” in 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2015, pp. 1–5.
    [2015]
  • [74] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Sparse representation based on a bag of spectral exemplars for acoustic event detection,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6255–6259.
    [2014]
  • [73] L. Su, C.-C. M. Yeh, J.-Y. Liu, J.-C. Wang, and Y.-H. Yang, “A systematic evaluation of the bag-of-frames representation for music information retrieval,” IEEE Transactions on Multimedia, vol. 16, no. 5, pp. 1188– 1200, 2014.
    [2014]
  • [72] A. Plinge, R. Grzeszick, and G. A. Fink, “A bag-of-features approach to acoustic event detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 3704–3708.
    [2014]
  • [70] A. Klapuri and M. Davy, “Signal processing methods for music transcription,” 2007.
    [2007]
  • [6] Y. Jin, L. Khan, L. Wang, and M. Awad, “Image annotations by combining multiple evidence & wordnet,” in Proceedings of the 13th annual ACM international conference on Multimedia, 2005, pp. 706–715.
    [2005]
  • [69] A. Mesaros, T. Heittola, and T. Virtanen, “Tut database for acoustic scene classification and sound event detection,” in Signal Processing Conference (EUSIPCO), 2016 24th European. IEEE, 2016, pp. 1128–1132.
    [2016]
  • [68] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 1041–1044.
    [2014]
  • [67] Y. Han and K. Lee, “Convolutional neural network with multiple-width frequency-delta data augmentation for acoustic scene classification,” IEEE AASP challenge on detection and classification of acoustic scenes and events, 2016.
    [2016]
  • [66] Y. Petetin, C. Laroche, and A. Mayoue, “Deep neural networks for audio scene recognition,” in 2015 23rd European Signal Processing Conference (EUSIPCO). IEEE, 2015, pp. 125–129.
    [2015]
  • [65] A. Rabaoui, M. Davy, S. Rossignol, and N. Ellouze, “Using one-class svms and wavelets for audio surveillance,” IEEE Transactions on information forensics and security, vol. 3, no. 4, pp. 763–775, 2008.
    [2008]
  • [63] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.
    [2020]
  • [62] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim, “Fast autoaugment,” Advances in Neural Information Processing Systems, vol. 32, pp. 6665–6675, 2019.
    [2019]
  • [61] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 113–123.
    [2019]
  • [60] S. Mun, S. Park, D. Han, and H. Ko, “Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,” DCASE2017 Challenge, Tech. Rep., September 2017.
    [2017]
  • [5] J. Fu, T. Mei, K. Yang, H. Lu, and Y. Rui, “Tagging personal photos with transfer deep learning,” in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 344–354.
    [2015]
  • [59] A. Chatziagapi, G. Paraskevopoulos, D. Sgouropoulos, G. Pantazopoulos, M. Nikandrou, T. Giannakopoulos, A. Katsamanis, A. Potamianos, and S. Narayanan, “Data augmentation using gans for speech emotion recognition.” in Interspeech, 2019, pp. 171–175.
    [2019]
  • [57] J. Yoon, D. Jarrett, and M. van der Schaar, “Time-series generative adversarial networks,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 5508–5518.
    [2019]
  • [56] C. Esteban, S. L. Hyland, and G. Rätsch, “Real-valued (medical) time series generation with recurrent conditional GANs,” arXiv preprint arXiv:1706.02633, 2017.
    [2017]
  • [55] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
    [2017]
  • [54] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” Proc. Interspeech 2019, pp. 2613–2617, 2019.
    [2019]
  • [53] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multichannel acoustic noise database (demand): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics ICA2013, vol. 19, no. 1. Acoustical Society of America, 2013, p. 035081.
    [2013]
  • [51] A. Zheng and A. Casari, Feature engineering for machine learning: principles and techniques for data scientists. O’Reilly Media, Inc., 2018.
    [2018]
  • [50] J. O. Smith, Digital Audio Resampling Home Page, http://www-ccrma.stanford.edu/~jos/resample/, January 28, 2002.
    [2002]
  • [4] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei, “What does classifying more than 10,000 image categories tell us?” in European conference on computer vision. Springer, 2010, pp. 71–84.
    [2010]
  • [49] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
    [2016]
  • [48] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” Proc. Interspeech 2019, pp. 3465–3469, 2019.
    [2019]
  • [47] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
    [2018]
  • [46] J. Pons, O. Nieto, M. Prockup, E. Schmidt, A. Ehmann, and X. Serra, “End-to-end learning for music audio tagging at scale,” arXiv preprint arXiv:1711.02520, 2017.
    [2017]
  • [45] J. Lee, J. Park, K. L. Kim, and J. Nam, “Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms,” arXiv preprint arXiv:1703.01789, 2017.
    [2017]
  • [43] B. Mulgrew, P. Grant, and J. Thompson, Digital signal processing: concepts and applications. Macmillan International Higher Education, 1999.
    [1999]
  • [42] R. N. Bracewell, The Fourier transform and its applications. McGraw-Hill, New York, 1986.
    [1986]
  • [41] J. Hammersley, “Probability and statistics: The Harald Cramér volume,” 1960.
    [1960]
  • [40] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2018.
    [2018]
  • [38] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 292–301.
    [2018]
  • [37] R. Milner, M. A. Jalal, R. W. Ng, and T. Hain, “A cross-corpus study on speech emotion recognition,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 304–311.
    [2019]
  • [36] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, and G. Hofer, “Analysis of deep learning architectures for cross-corpus speech emotion recognition.” in INTERSPEECH, 2019, pp. 1656–1660.
    [2019]
  • [34] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098, 2017.
    [2017]
  • [33] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision. Springer, 2016, pp. 694–711.
    [2016]
  • [32] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2414–2423.
    [2016]
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
    [2017]
  • [30] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
    [2014]
  • [2] B. Berendt and C. Hanser, “Tags are not metadata, but ‘just more content’ – to some people,” in ICWSM, 2007.
    [2007]
  • [29] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” 2014.
    [2014]
  • [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
    [2014]
  • [25] C. Schörkhuber and A. Klapuri, “Constant-Q transform toolbox for music processing,” in 7th Sound and Music Computing Conference, Barcelona, Spain, 2010, pp. 3–64.
    [2010]
  • [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
    [2009]
  • [23] E. Law and L. Von Ahn, “Input-agreement: a new mechanism for collecting data using human computation games,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2009, pp. 1197– 1206.
    [2009]
  • [22] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, “The million song dataset,” in ISMIR, vol. 2, no. 9, 2011, p. 10.
    [2011]
  • [21] Y. Panagakis and C. Kotropoulos, “Automatic music tagging via PARAFAC2,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 481–484.
    [2011]
  • [20] E. Fonseca, M. Plakal, F. Font, D. P. Ellis, X. Favory, J. Pons, and X. Serra, “General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,” arXiv preprint arXiv:1807.09902, 2018.
    [2018]
  • [1] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
  • [19] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM International Conference on Multimedia (ACM-MM’14), Orlando, FL, USA, Nov. 2014, pp. 1041–1044.
    [2014]
  • [18] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM Press, 2015, pp. 1015–1018. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2733373.2806390
    [2015]
  • [17] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, pp. 85–92.
    [2017]
  • [16] P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley, “Chime-home: A dataset for sound source recognition in a domestic environment,” in 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2015, pp. 1–5.
    [2015]
  • [164] S. Du, S. You, X. Li, J. Wu, F. Wang, C. Qian, and C. Zhang, “Agree to disagree: Adaptive ensemble knowledge distillation in gradient space,” Advances in Neural Information Processing Systems, vol. 33, 2020.
    [2020]
  • [163] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “Fma: a dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2016.
    [2016]
  • [162] U. Marchand and G. Peeters, “The extended ballroom dataset,” 2016.
    [2016]
  • [161] M. Soleymani, M. N. Caro, E. M. Schmidt, C.-Y. Sha, and Y.-H. Yang, “1000 songs for emotional analysis of music,” in Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia. ACM, 2013, pp. 1–6.
    [2013]
  • [160] F. Chollet et al., “Keras,” https://keras.io, 2015.
    [2015]
  • [15] A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, “Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp. 379–393, Feb 2018.
  • [159] K. Choi, D. Joo, and J. Kim, “Kapre: On-gpu audio preprocessing layers for a quick implementation of deep neural network models with keras,” in Machine Learning for Music Discovery Workshop at 34th International Conference on Machine Learning. ICML, 2017.
    [2017]
  • [158] K. Choi, G. Fazekas, M. Sandler, and K. Cho, “Convolutional recurrent neural networks for music classification,” in 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017. Institute of Electrical and Electronics Engineers Inc., 2017.
    [2017]
  • [157] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
    [2018]
  • [156] D. Lee, S. Lee, Y. Han, and K. Lee, “Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input,” Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
    [2017]
  • [155] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 2008, pp. 263–272.
    [2008]
  • [154] J. Lee, K. Lee, J. Park, J. Park, and J. Nam, “Deep content-user embedding model for music recommendation,” arXiv preprint arXiv:1807.06786, 2018.
    [2018]
  • [153] A. Van den Oord, S. Dieleman, and B. Schrauwen, “Deep content-based music recommendation,” in Advances in neural information processing systems, 2013, pp. 2643–2651.
    [2013]
  • [152] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging.” in ISMIR, 2009, pp. 387–392.
    [2009]
  • [150] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” arXiv preprint arXiv:2004.11362, 2020.
    [2020]
  • [14] T. Heittola, A. Mesaros, and T. Virtanen, “Acoustic scene classification in dcase 2020 challenge: generalization across devices and low complexity solutions,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 2020, submitted. [Online]. Available: https://arxiv.org/abs/2005.14623
    [2020]
  • [149] D. Lee, J. Lee, J. Park, and K. Lee, “Enhancing music features by knowledge transfer from user-item log data,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 386–390.
    [2019]
  • [147] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
  • [146] L. Cances and T. Pellegrini, “Comparison of deep co-training and mean-teacher approaches for semi-supervised audio tagging,” in IEEE 46th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021.
  • [145] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” arXiv preprint arXiv:2006.07733, 2020.
    [2020]
  • [144] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
    [2016]
  • [143] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Esresnet: Environmental sound classification based on visual domain models,” arXiv preprint arXiv:2004.07301, 2020.
    [2020]
  • [142] K. Palanisamy, D. Singhania, and A. Yao, “Rethinking cnn models for audio classification,” arXiv preprint arXiv:2007.11154, 2020.
    [2020]
  • [141] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multichannel acoustic noise database (demand): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics ICA2013, vol. 19, no. 1. Acoustical Society of America, 2013, p. 035081.
    [2013]
  • [140] A. Jansen, M. Plakal, R. Pandya, D. P. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, “Unsupervised learning of semantic audio representations,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 126–130.
    [2018]
  • [13] J.-J. Aucouturier, B. Defreville, and F. Pachet, “The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music,” The Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881–891, 2007.
    [2007]
  • [139] S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and Y. Han, “Neural audio fingerprint for high-specific audio retrieval based on contrastive learning,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3025–3029.
  • [138] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
    [2015]
  • [137] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742.
    [2006]
  • [136] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, “Signature verification using a “siamese” time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 04, pp. 669–688, 1993.
    [1993]
  • [135] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes, “Ensemble selection from libraries of models,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 18.
    [2004]
  • [134] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
    [2014]
  • [133] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
    [2015]
  • [132] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
    [2016]
  • [130] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, pp. 85–92.
    [2017]
  • [129] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision. Springer, 2014, pp. 346–361.
    [2014]
  • [128] J. Lee and J. Nam, “Multi-level and multi-scale feature aggregation using pre-trained convolutional neural networks for music auto-tagging,” arXiv preprint arXiv:1703.01793, 2017.
    [2017]
  • [127] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic sound event detection using multi label deep neural networks,” in Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp. 1–7.
    [2015]
  • [126] Y. Han, J. Kim, and K. Lee, “Deep convolutional neural networks for predominant instrument recognition in polyphonic music,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 1, pp. 208–221, 2017.
    [2017]
  • [124] J. Schlüter and S. Böck, “Improved musical onset detection with convolutional neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 6979–6983.
    [2014]
  • [123] H. Zhang, I. McLoughlin, and Y. Song, “Robust sound event recognition using convolutional neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 559–563.
    [2015]
  • [122] K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on. IEEE, 2015, pp. 1–6.
    [2015]
  • [121] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neural networks for polyphonic sound event detection in real life recordings,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 6440–6444.
    [2016]
  • [120] A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, “Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 151–155.
    [2015]
  • [11] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, “Audio-based context recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321–329, 2005.
    [2005]
  • [119] T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, “Sound event detection in multisource environments using source separation,” in Machine Listening in Multisource Environments, 2011.
    [2011]
  • [118] X. Zhuang, X. Zhou, M. A. Hasegawa-Johnson, and T. S. Huang, “Real-world acoustic event detection,” Pattern Recognition Letters, vol. 31, no. 12, pp. 1543–1551, 2010.
    [2010]
  • [117] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, “Acoustic event detection in real life recordings,” in Signal Processing Conference, 2010 18th European. IEEE, 2010, pp. 1267–1271.
    [2010]
  • [116] K. J. Anstey, J. Wood, S. Lord, and J. G. Walker, “Cognitive, sensory and physical factors enabling driving safety in older adults,” Clinical psychology review, vol. 25, no. 1, pp. 45–65, 2005.
    [2005]
  • [114] D. Zhang and D. Ellis, “Detecting sound events in basketball video archive,” Dept. Electronic Eng., Columbia Univ., New York, 2001.
    [2001]
  • [113] Y.-T. Peng, C.-Y. Lin, M.-T. Sun, and K.-C. Tsai, “Healthcare audio event classification using hidden markov models and hierarchical hidden markov models,” in Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on. IEEE, 2009, pp. 1218–1221.
    [2009]
  • [112] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, “Scream and gunshot detection and localization for audio-surveillance systems,” in Advanced Video and Signal Based Surveillance, 2007. AVSS 2007. IEEE Conference on. IEEE, 2007, pp. 21–26.
    [2007]
  • [111] C. Clavel, T. Ehrette, and G. Richard, “Events detection for an audio-based surveillance system,” in Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on. IEEE, 2005, pp. 1306–1309.
    [2005]
  • [110] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “On acoustic surveillance of hazardous situations,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on. IEEE, 2009, pp. 165–168.
    [2009]
  • [109] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events: An IEEE AASP challenge,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. IEEE, 2013, pp. 1–4.
    [2013]
  • [108] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, “Adversarial teacher-student learning for unsupervised domain adaptation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5949–5953.
    [2018]
  • [107] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-supervised knowledge transfer for deep learning from private training data,” arXiv preprint arXiv:1610.05755, 2016.
    [2016]
  • [106] X. Li, J. Wu, H. Fang, Y. Liao, F. Wang, and C. Qian, “Local correlation consistency for knowledge distillation,” in European Conference on Computer Vision. Springer, 2020, pp. 18–33.
    [2020]
  • [105] S. Kong, T. Guo, S. You, and C. Xu, “Learning student networks with few data,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4469–4476.
  • [104] T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born again neural networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 1607–1616.
    [2018]
  • [103] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson, “Do deep convolutional nets really need to be deep and convolutional?” arXiv preprint arXiv:1603.05691, 2016.
    [2016]
  • [102] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
    [2014]
  • [101] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
    [2015]
  • [100] J. S. Gómez, J. Abeßer, and E. Cano, “Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning,” in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018.
    [2018]
  • R. Pasnau, “What is sound?” vol. 49, no. 196, pp. 309–324, 1999.
    [1999]
  • S. S. Stevens and J. Volkmann, “The relation of pitch to frequency: A revised scale,” vol. 53, no. 3, pp. 329–353, 1940.
    [1940]
  • C. Lavandier and B. Defréville, “The contribution of sound source characteristics in the assessment of urban soundscapes,” vol. 92, no. 6, pp. 912–921, 2006.
    [2006]
  • P. Rafferty, “Tagging,” vol. 45, no. 6, pp. 500–516, 2018.
    [2018]
  • Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” vol. 29, pp. 892–900, 2016.
    [2016]
  • G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” vol. 10, no. 5, pp. 293–302, 2002.
    [2002]
  • R. Caruana, “Multitask learning,” vol. 28, no. 1, pp. 41–75, 1997.
    [1997]
  • F. Murtagh, “Multilayer perceptrons for classification and regression,” vol. 2, no. 5–6, pp. 183–197, 1991.
    [1991]
  • A. W. Moore Jr and J. W. Jorgenson, “Median filtering for removal of low-frequency background drift,” vol. 65, no. 2, pp. 188–191, 1993.
    [1993]
  • M. K. S. Khan and W. G. Al-Khatib, “Machine-learning based classification of speech and music,” vol. 12, no. 1, pp. 55–67, 2006.
    [2006]
  • “Learning to recognize transient sound events using attentional supervision.”
  • “Improved music genre classification with convolutional neural networks.”
  • S. Chandrakala and S. Jayalakshmi, “Environmental audio scene and sound event recognition for autonomous surveillance: A survey and comparative studies,” vol. 52, no. 3, pp. 1–34, 2019.
    [2019]
  • Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” vol. 521, no. 7553, pp. 436–444, 2015.
    [2015]
  • J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” vol. 24, no. 3, pp. 279–283, 2017.
    [2017]
  • Y. Qian, H. Hu, and T. Tan, “Data augmentation using generative adversarial networks for robust speech recognition,” vol. 114, pp. 1–9, 2019.
    [2019]
  • A. L. McIlraith and H. C. Card, “Birdsong recognition using backpropagation and multivariate statistics,” vol. 45, no. 11, pp. 2740–2748, 1997.
    [1997]
  • D. Dufournet, P. Jouenne, and A. Rozwadowski, “Automatic noise source recognition,” vol. 103, no. 5, p. 2950, 1998.
    [1998]
  • “Audio keywords generation for sports video analysis.”
  • M. Wang, B. Ni, X.-S. Hua, and T.-S. Chua, “Assistive tagging: A survey of multimedia tagging with human-computer joint exploration,” vol. 44, no. 4, pp. 1–24, 2012.
    [2012]
  • B. McFee, J. Salamon, and J. P. Bello, “Adaptive pooling operators for weakly labeled sound event detection,” vol. 26, no. 11, pp. 2180–2193, 2018.
    [2018]