Yi-Jen (Ian) Shih

UT Austin CS Ph.D.

Email: yjshih [AT] utexas.edu

I'm a second-year Ph.D. student at UT Austin CS (UTCS), advised by Prof. David Harwath. Before joining UT, I was advised by Prof. Hung-yi Lee at NTU and Prof. Yi-Hsuan Yang at Academia Sinica.
I'm interested in Speech Foundation Models, Self-supervised Representation Learning, and Multimodal Representation Learning.


Recent Publications (* indicates equal contribution)
  • Self-supervised Speech Models for Word-Level Stuttered Speech Detection
    Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath
    IEEE Spoken Language Technology Workshop (SLT) 2024
    arXiv 

  • Measuring Sound Symbolism in Audio-Visual Models
    Wei-Cheng Tseng*, Yi-Jen Shih*, David Harwath, Raymond Mooney
    IEEE Spoken Language Technology Workshop (SLT) 2024
    arXiv 

  • Interface Design for Self-Supervised Speech Models
    Yi-Jen Shih, David Harwath
    Interspeech 2024
    arXiv  code 

  • SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
    Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath
    ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
    arXiv 

  • Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
    Hung-Chieh Fang*, Nai-Xuan Ye*, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath
    ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
    arXiv 

  • AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
    Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
    International Conference on Acoustics, Speech, & Signal Processing (ICASSP) 2024
    arXiv 

  • M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval
    Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath
    International Conference on Acoustics, Speech, & Signal Processing (ICASSP) 2023
    arXiv 

  • SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
    Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
    IEEE Spoken Language Technology Workshop (SLT) 2022
    arXiv  blog  code  present@JSALT22  poster 

  • Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer
    Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang
    IEEE Transactions on Multimedia (TMM) 2022
    arXiv  blog  code  demo  slides@MILA  talk@MILA