A deep-learning-based closure model is introduced to address the energy loss in low-dimensional surrogate models built on proper-orthogonal-decomposition (POD) modes. Using a transformer-encoder block with an easy-attention mechanism, the model predicts the spatial probability density function of the fluctuations not captured by the truncated POD modes. The methodology is demonstrated on the wake of the Windsor body at yaw angles of
$\delta = [2.5^\circ ,5^\circ ,7.5^\circ ,10^\circ ,12.5^\circ ]$, with
$\delta = 7.5^\circ$ as a test case, and in a realistic urban environment at wind directions of
$\delta = [-45^\circ ,-22.5^\circ ,0^\circ ,22.5^\circ ,45^\circ ]$, with
$\delta = 0^\circ$ as a test case. Key coherent modes are identified by clustering the modes according to their dominant-frequency dynamics, applying Hotelling's
$T^2$ statistic to the spectral properties of the temporal coefficients. These coherent modes account for nearly
$60 \,\%$ and
$75 \,\%$ of the total energy for the Windsor body and the urban environment, respectively. For each case, a common POD basis is created by concatenating the coherent modes from the training angles and orthonormalising the resulting set without loss of information. Transformers with attention layers of different sizes (64, 128 and 256) are trained to model the missing fluctuations in the Windsor body case. Larger attention sizes consistently improve the predictions on the training set, but the transformer with an attention size of 256 slightly overshoots the fluctuation predictions on the Windsor body test set, where the fluctuations have lower intensity than in the training cases. A single transformer with an attention size of 256 is trained for the urban flow. In both cases, adding the predicted fluctuations closes the energy gap between the reconstruction and the original flow field, improving the predictions of energy, root-mean-square velocity fluctuations and instantaneous flow fields. For instance, in the Windsor body case, the deepest architecture reduces the mean energy error from
$37 \,\%$ to
$12 \,\%$ and decreases the Kullback–Leibler divergence of velocity distributions from
${\mathcal{D}}_{\mathcal{KL}}=0.2$ to below
${\mathcal{D}}_{\mathcal{KL}}=0.026$.
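The common-basis construction described above (concatenating coherent modes from the training angles, then orthonormalising without losing information) can be sketched in a few lines of NumPy. This is an illustrative sketch under our own assumptions, not the authors' implementation; the function and variable names are hypothetical. A reduced QR decomposition yields orthonormal columns spanning the same subspace as the concatenated modes, and near-zero diagonal entries of $R$ flag numerically redundant directions (modes from different angles that are nearly parallel).

```python
import numpy as np

def common_pod_basis(mode_sets, rel_tol=1e-10):
    """Concatenate coherent POD modes from several training angles and
    orthonormalise the result (illustrative sketch only).

    mode_sets : list of (n_points, n_modes_i) arrays, each with
                orthonormal columns (coherent modes of one angle).
    Returns an (n_points, r) array with orthonormal columns spanning
    the union of the input mode subspaces.
    """
    stacked = np.hstack(mode_sets)  # concatenate modes column-wise
    # Reduced QR: Q has orthonormal columns spanning range(stacked),
    # so no information in the concatenated set is lost.
    q, r = np.linalg.qr(stacked)
    # Drop numerically redundant directions (near-zero diagonal of R).
    diag = np.abs(np.diag(r))
    keep = diag > rel_tol * diag.max()
    return q[:, keep]

# Toy check: coherent modes from two hypothetical angles in R^6.
rng = np.random.default_rng(0)
a, _ = np.linalg.qr(rng.standard_normal((6, 2)))
b, _ = np.linalg.qr(rng.standard_normal((6, 3)))
basis = common_pod_basis([a, b])
# Verify the combined basis is orthonormal.
err = np.linalg.norm(basis.T @ basis - np.eye(basis.shape[1]))
print(basis.shape, err)
```

In this sketch the tolerance `rel_tol` and the QR-based orthonormalisation are our choices; a singular-value decomposition of the stacked modes would serve equally well for rank truncation.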