Massive amounts of images and texts are emerging on the Internet, creating a growing demand for effective cross-modal retrieval. To eliminate the heterogeneity between the modalities of images and texts, existing subspace learning methods try to learn a common latent subspace in which cross-modal matching can be performed. However, these methods usually require fully paired samples (images with corresponding texts) and also ignore the class label information that accompanies the paired samples. Indeed, class label information can reduce the semantic gap between different modalities and explicitly guide the subspace learning procedure. In addition, the large quantities of unpaired samples (images or texts) may provide useful side information to enrich the representations in the learned subspace. Thus, in this paper we propose a novel model for the cross-modal retrieval problem. It consists of 1) a semi-supervised coupled dictionary learning step that generates homogeneous sparse representations for the different modalities based on both paired and unpaired samples; and 2) a coupled feature mapping step that projects the sparse representations of the different modalities into a common subspace defined by class label information, in which cross-modal matching is performed. Experiments on the large-scale web image dataset MIRFlickr-1M, under both fully paired and unpaired settings, show the effectiveness of the proposed model on the cross-modal retrieval task.
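To make the two-step pipeline concrete, the following is a minimal sketch, not the authors' implementation: step 1 is approximated by learning an independent dictionary per modality on toy paired data (the coupling constraint and the use of unpaired samples are omitted for brevity), and step 2 maps each modality's sparse codes onto a label-indicator subspace via ridge-regularized least squares. All data shapes, variable names, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Hypothetical toy data: paired image/text features sharing class labels.
rng = np.random.default_rng(0)
n_pairs, d_img, d_txt, n_classes = 200, 64, 32, 5
labels = rng.integers(0, n_classes, size=n_pairs)
X_img = rng.standard_normal((n_pairs, d_img)) + labels[:, None] * 0.5
X_txt = rng.standard_normal((n_pairs, d_txt)) + labels[:, None] * 0.5

# Step 1 (simplified): learn a dictionary per modality and encode samples as
# sparse codes, giving both modalities a homogeneous sparse representation.
# The semi-supervised coupling across modalities is omitted in this sketch.
n_atoms = 40
dl_img = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=20, random_state=0)
dl_txt = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=20, random_state=0)
S_img = dl_img.fit_transform(X_img)  # sparse codes for images
S_txt = dl_txt.fit_transform(X_txt)  # sparse codes for texts

# Step 2: map each modality's sparse codes into a common subspace defined by
# class labels (one-hot indicators), via ridge-regularized least squares.
Y = np.eye(n_classes)[labels]  # label indicator matrix, shape (n_pairs, n_classes)

def fit_mapping(S, Y, lam=1e-2):
    # Closed-form solution of min_W ||S W - Y||^2 + lam ||W||^2.
    return np.linalg.solve(S.T @ S + lam * np.eye(S.shape[1]), S.T @ Y)

W_img = fit_mapping(S_img, Y)
W_txt = fit_mapping(S_txt, Y)

# Cross-modal matching: project both modalities into the common subspace and
# rank candidates by cosine similarity.
def normalize(Z):
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)

P_img, P_txt = normalize(S_img @ W_img), normalize(S_txt @ W_txt)
sim = P_img @ P_txt.T  # image-to-text similarity matrix
print("top-1 text match for image 0:", sim[0].argmax())
```

In this simplified setting the common subspace is spanned by the class indicators, so two samples match when their projected codes predict similar label distributions; the full model described above additionally couples the dictionaries and exploits unpaired samples during dictionary learning.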