Functional neuroimaging research provides detailed observations of the response patterns that natural sounds (e.g. human voices and speech, animal cries, environmental sounds) evoke in the human brain. The computational and representational mechanisms underlying these observations, however, remain largely unknown. Here we combine high spatial resolution (3 and 7 Tesla) functional magnetic resonance imaging (fMRI) with computational modeling to reveal how natural sounds are represented in the human brain. We compare competing models of sound representations and select the model that most accurately predicts fMRI response patterns to natural sounds. Our results show that the cortical encoding of natural sounds entails the formation of multiple representations of sound spectrograms with different degrees of spectral and temporal resolution. The cortex derives these multi-resolution representations through frequency-specific neural processing channels and through the combined analysis of the spectral and temporal modulations in the spectrogram. Furthermore, our findings suggest that a spectral-temporal resolution trade-off may govern the modulation tuning of neuronal populations throughout the auditory cortex. Specifically, our fMRI results suggest that neuronal populations in posterior/dorsal auditory regions preferably encode coarse spectral information with high temporal precision. Vice-versa, neuronal populations in anterior/ventral auditory regions preferably encode fine-grained spectral information with low temporal precision. We propose that such a multi-resolution analysis may be crucially relevant for flexible and behaviorally-relevant sound processing and may constitute one of the computational underpinnings of functional specialization in auditory cortex.