I'm a second-year Ph.D. student at UTCS, supervised by Prof. David Harwath.
Before joining UT, I was supervised by Prof. Hung-yi Lee at NTU and Prof. Yi-Hsuan Yang at Academia Sinica.
I'm interested in Speech Foundation Models, Self-supervised Representation Learning, and Multimodal Representation Learning.
Recent Publications
* indicates equal contribution
- Self-supervised Speech Models for Word-Level Stuttered Speech Detection
Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath
IEEE Spoken Language Technology Workshop (SLT) 2024
arXiv
- Measuring Sound Symbolism in Audio-Visual Models
Wei-Cheng Tseng*, Yi-Jen Shih*, David Harwath, Raymond Mooney
IEEE Spoken Language Technology Workshop (SLT) 2024
arXiv
- SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath
ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
arXiv
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
Hung-Chieh Fang*, Nai-Xuan Ye*, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath
ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
arXiv
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
International Conference on Acoustics, Speech, & Signal Processing (ICASSP) 2024
arXiv
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval
Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath
International Conference on Acoustics, Speech, & Signal Processing (ICASSP) 2023
arXiv
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
IEEE Spoken Language Technology Workshop (SLT) 2022
arXiv blog code present@JSALT22 poster
- Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer
Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang
IEEE Transactions on Multimedia (TMM) 2022
arXiv blog code demo slides@MILA talk@MILA