In this paper, we investigate the problem of modeling images and their associated text for cross-modal retrieval tasks such as text-to-image and image-to-text search. To make data from the image and text modalities comparable, previous cross-modal retrieval methods directly learn two projection matrices that map the raw features of the two modalities into a common subspace, in which cross-modal matching can be performed. However, the heterogeneous feature representations and correlation structures of the two modalities prevent such methods from effectively modeling the relationships across modalities through a common subspace. To handle this heterogeneity, we first leverage coupled dictionary learning to generate homogeneous sparse representations for the two modalities by associating and jointly updating their dictionaries. We then use a coupled feature mapping scheme to project the derived sparse representations into a common subspace in which cross-modal retrieval is performed. Experiments on a variety of cross-modal retrieval tasks demonstrate that the proposed method outperforms state-of-the-art approaches.
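To make the two-stage pipeline concrete, the following is a minimal illustrative sketch, not the paper's exact optimization: "coupled" dictionary learning is approximated here by learning a joint dictionary over concatenated image/text features so that paired samples share one sparse code, and the coupled feature mapping is stood in for by CCA on the resulting sparse codes. The data, dimensions, and scikit-learn components (`DictionaryLearning`, `SparseCoder`, `CCA`) are assumptions for illustration only.

```python
# Illustrative sketch of: (1) coupled dictionary learning -> homogeneous sparse
# codes per modality, (2) coupled mapping of the codes into a common subspace,
# (3) nearest-neighbor cross-modal retrieval. Not the authors' algorithm.
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Toy paired data: n samples with image features (d_img) and text features (d_txt).
n, d_img, d_txt, n_atoms, k = 200, 64, 32, 48, 10
X_img = rng.standard_normal((n, d_img))
X_txt = rng.standard_normal((n, d_txt))

# --- Stage 1: simplified coupled dictionary learning -------------------------
# Learn one dictionary over the concatenated modalities, then split it, so the
# per-modality dictionaries are linked through shared training sparse codes.
X_joint = np.hstack([normalize(X_img), normalize(X_txt)])
dl = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=50, random_state=0)
dl.fit(X_joint)
D_img = dl.components_[:, :d_img]   # image dictionary (n_atoms, d_img)
D_txt = dl.components_[:, d_img:]   # text dictionary  (n_atoms, d_txt)

# Each modality is then sparse-coded against its own dictionary.
code_img = SparseCoder(dictionary=normalize(D_img),
                       transform_algorithm='lasso_lars', transform_alpha=0.1)
code_txt = SparseCoder(dictionary=normalize(D_txt),
                       transform_algorithm='lasso_lars', transform_alpha=0.1)
A_img = code_img.transform(normalize(X_img))   # homogeneous sparse codes (n, n_atoms)
A_txt = code_txt.transform(normalize(X_txt))

# --- Stage 2: coupled feature mapping into a common subspace -----------------
# CCA stands in for the coupled projections: a pair of linear maps that make
# the two sets of sparse codes maximally correlated in a k-dim shared space.
cca = CCA(n_components=k)
Z_img, Z_txt = cca.fit_transform(A_img, A_txt)

# --- Retrieval: rank text items for an image query by cosine similarity ------
def rank_text_for_image(query_idx):
    q = normalize(Z_img[query_idx:query_idx + 1])
    sims = normalize(Z_txt) @ q.T
    return np.argsort(-sims.ravel())

print(rank_text_for_image(0)[:5])  # indices of the top-5 retrieved text items
```

In this sketch the shared sparse codes play the role of the homogeneous representations described above, while the learned canonical directions play the role of the two projection matrices applied to sparse codes rather than to raw features.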