Height estimation from single remote sensing images remains a challenging problem due to limited labeled data and the significant domain shift from general-purpose imagery. This paper systematically examines the effectiveness of fully transferring pretrained monocular depth estimation models, in particular Depth Anything V2, to pixel-wise height prediction in aerial imagery. Experiments on the Data Fusion Contest 2018, ISPRS Potsdam, and ISPRS Vaihingen datasets demonstrate that retaining the depth-specific decoder substantially improves performance: Depth Anything V2, pretrained on 62M+ images, achieves a 7.2% MAE reduction over its backbone-only variant even without depth-specific fine-tuning, and produces markedly sharper digital surface models. These findings indicate that depth-pretrained models learn viewpoint-invariant geometric priors, enabling effective cross-domain transfer to height estimation despite the perspective shift from ground-level to overhead views.
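
To make the two transfer strategies concrete, the sketch below contrasts full transfer (keeping the pretrained DPT depth decoder) with a backbone-only variant (keeping just the DINOv2 encoder and learning a fresh height head). This is not the authors' code: it assumes the Hugging Face `transformers` port of Depth Anything V2, and the checkpoint id and the lightweight height head are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two transfer strategies,
# assuming the Hugging Face `transformers` port of Depth Anything V2.
# The checkpoint id and the height-head design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForDepthEstimation

CKPT = "depth-anything/Depth-Anything-V2-Small-hf"  # assumed checkpoint id

# Variant 1: full transfer -- keep the pretrained DPT depth decoder and
# fine-tune the entire network to regress height instead of depth.
full_model = AutoModelForDepthEstimation.from_pretrained(CKPT)

# Variant 2: backbone-only -- keep just the pretrained encoder and train a
# randomly initialized height head on top of its features.
class BackboneOnlyHeight(nn.Module):
    def __init__(self, ckpt: str):
        super().__init__()
        # `.backbone` is the DINOv2 encoder inside the HF implementation.
        self.backbone = AutoModelForDepthEstimation.from_pretrained(ckpt).backbone
        hidden = self.backbone.config.hidden_size
        # Hypothetical head: one height value per patch, then upsample
        # to full pixel resolution.
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Last feature map, reshaped to (B, C, h, w) by the backbone.
        feats = self.backbone(pixel_values).feature_maps[-1]
        return F.interpolate(self.head(feats), size=pixel_values.shape[-2:],
                             mode="bilinear", align_corners=False)

x = torch.randn(1, 3, 518, 518)  # Depth Anything V2's native input size
with torch.no_grad():
    full_pred = full_model(pixel_values=x).predicted_depth  # (1, 518, 518)
    bb_pred = BackboneOnlyHeight(CKPT)(x)                   # (1, 1, 518, 518)
```

Fine-tuning either variant against height targets such as digital surface models (e.g., with an L1 regression loss) then yields the comparison reported above; in the full-transfer case, the decoder's pretrained geometric priors are what carry over.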