Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool R, a new chapter on using R for statistical analysis, and a new chapter that demonstrates how to use R within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool Python, a new chapter on using Python for statistical analysis, and a new chapter that demonstrates how to use Python within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Bridge the gap between theoretical concepts and their practical applications with this rigorous introduction to the mathematics underpinning data science. It covers essential topics in linear algebra, calculus and optimization, and probability and statistics, demonstrating their relevance in the context of data analysis. Key application topics include clustering, regression, classification, dimensionality reduction, network analysis, and neural networks. What sets this text apart is its focus on hands-on learning. Each chapter combines mathematical insights with practical examples, using Python to implement algorithms and solve problems. Self-assessment quizzes, warm-up exercises and theoretical problems foster both mathematical understanding and computational skills. Designed for advanced undergraduate students and beginning graduate students, this textbook serves as both an invitation to data science for mathematics majors and as a deeper excursion into mathematics for data science students.
This chapter describes the important role of artificial intelligence (AI) in Big Data psychology research. First, we discuss the main goals of AI, and then delve into an example of machine learning and what is happening under the hood. The chapter then describes the Perceptron, a classic simple neural network, and how this has grown into deep learning AI which has become increasingly popular in recent years. Deep learning can be used both for prediction and generation, and has a multitude of applications for psychology and neuroscience. This chapter concludes with the ethical quandaries around fake data generated by AI and biases that exist in how we train systems, as well as some exciting clinical applications of AI relevant to psychology and neuroscience.
Transonic buffet presents time-dependent aerodynamic characteristics associated with the shock, the turbulent boundary layer and their interactions. Despite strong nonlinearities and a large number of degrees of freedom, there exists a dominant dynamic pattern over a buffet cycle, suggesting the low dimensionality of transonic buffet phenomena. This study seeks a low-dimensional representation of transonic airfoil buffet at a high Reynolds number with machine learning. Wall-modelled large-eddy simulations of flow over the OAT15A supercritical airfoil at two Mach numbers, $M_\infty = 0.715$ and 0.730, respectively producing non-buffet and buffet conditions, at a chord-based Reynolds number of ${Re} = 3\times 10^6$ are performed to generate the present datasets. We find that the low-dimensional nature of transonic airfoil buffet can be extracted as a single three-dimensional latent representation through lift-augmented autoencoder compression. The resulting low-order representation not only describes the shock movement but also captures the moment when separation occurs near the trailing edge. We further show that sensor-based reconstruction is possible through the present low-dimensional expression while identifying the sensitivity with respect to aerodynamic responses. The present model trained at ${Re} = 3\times 10^6$ is lastly evaluated at ${Re} = 3\times 10^7$, a level representative of real aircraft operation, showing that the phase dynamics of the lift is reasonably estimated from sparse sensors. The current study may provide a foundation towards data-driven real-time analysis of transonic buffet conditions during aircraft operation.
With ambitious action required to achieve global climate mitigation goals, climate change has become increasingly salient in the political arena. This article presents a dataset of climate change salience in 1,792 political manifestos of 620 political parties across different party families in forty-five OECD, European, and South American countries from 1990 to 2022. Importantly, our measure uniquely isolates climate change salience, avoiding the conflation with general environmental and sustainability content found in other work. Exploiting recent advances in supervised machine learning, we developed the dataset by fine-tuning a pre-trained multilingual transformer with human coding, employing a resource-efficient and replicable pipeline for multilingual text classification that can serve as a template for similar tasks. The dataset unlocks new avenues of research on the political discourse of climate change, on the role of parties in climate policy making, and on the political economy of climate change. We make the model and the dataset available to the research community.
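The classification step described above — assigning a salience label to each manifesto passage — can be illustrated with a minimal runnable sketch. The article fine-tunes a pre-trained multilingual transformer; the tf-idf plus logistic-regression baseline below is a simplified stand-in so the pipeline runs without GPU resources, and all sentences and labels are invented for illustration.

```python
# Minimal sketch of supervised text classification for climate-salience
# labelling. The study fine-tunes a multilingual transformer; this
# tf-idf + logistic-regression baseline is a lightweight stand-in.
# All manifesto sentences and labels below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy manifesto sentences labelled 1 (climate salience) or 0 (other).
texts = [
    "We will cut carbon emissions to meet the Paris climate targets.",
    "Global warming demands an immediate phase-out of coal power.",
    "Our party supports lower income taxes for working families.",
    "We propose new investment in rural road infrastructure.",
    "Climate change adaptation must guide our flood defence policy.",
    "Pension reform is the cornerstone of our social programme.",
]
labels = [1, 1, 0, 0, 1, 0]

# Word and bigram tf-idf features feed a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

pred = clf.predict(["Reducing greenhouse gas emissions is our priority."])[0]
print(pred)
```

In the study itself, the classifier is a fine-tuned transformer and the labels come from human coding; the structure of the pipeline (labelled text in, per-passage salience label out) is the same.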
The epidemiology and age-specific patterns of lifetime suicide attempts (LSA) in China remain unclear. We aimed to examine age-specific prevalence and predictors of LSA among Chinese adults using machine learning (ML).
Methods
We analyzed 25,047 adults in the 2024 Psychology and Behavior Investigation of Chinese Residents (PBICR-2024), stratified into three age groups (18–24, 25–44, ≥ 45 years). Thirty-seven candidate predictors across six domains—sociodemographic, physical health, mental health, lifestyle, social environment, and self-injury/suicide history—were assessed. Five ML models—random forest, logistic regression, support vector machine (SVM), Extreme Gradient Boosting (XGBoost), and Naive Bayes—were compared. SHapley Additive exPlanations (SHAP) were used to quantify feature importance.
Results
The overall prevalence of LSA was 4.57% (1,145/25,047), with significant age differences: 8.10% in young adults (18–24), 4.67% in adults aged 25–44, and 2.67% in older adults (≥45). SVM achieved the best test-set performance across all ages [area under the curve (AUC) 0.88–0.94, sensitivity 0.79–0.87, specificity 0.81–0.88], showing superior calibration and net clinical benefit. SHAP analysis identified both shared and age-specific predictors. Suicidal ideation, adverse childhood experiences, and suicide disclosure were consistent top predictors across all ages. Sleep disturbances and anxiety symptoms stood out in young adults; marital status, living alone, and perceived stress in mid-life; and functional limitations, poor sleep, and depressive symptoms in older adults.
Conclusions
LSA prevalence in Chinese adults is relatively high, with a clear age gradient peaking in young adulthood. Risk profiles revealed both shared and age-specific predictors, reflecting distinct life-stage vulnerabilities. These findings support age-tailored suicide prevention strategies in China.
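The model-comparison protocol in the Methods — several classifiers evaluated with stratified cross-validated AUC on an imbalanced binary outcome — can be sketched as follows. Synthetic data stand in for the PBICR-2024 survey, which is not reproduced here, and only three of the five compared model families are shown.

```python
# Sketch of the model-comparison step: classifiers scored by stratified
# cross-validated AUC on an imbalanced binary outcome. Synthetic data
# stand in for the PBICR-2024 survey; ~5% positives mimic the 4.57%
# LSA prevalence, and 37 features mirror the candidate-predictor count.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=37, weights=[0.95],
                           random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(random_state=0),  # AUC uses the decision function
}

# Stratified folds preserve the rare-positive proportion in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = {name: cross_val_score(m, X, y, cv=cv, scoring="roc_auc").mean()
        for name, m in models.items()}
for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.3f}")
```

The study additionally applies SHAP to the winning model to rank predictors; that step requires the `shap` library and the fitted model and is omitted from this sketch.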
The simulation of turbulent flow requires many degrees of freedom to resolve all the relevant time and length scales. However, due to the dissipative nature of the Navier–Stokes equations, the long-term dynamics is expected to lie on a finite-dimensional invariant manifold with fewer degrees of freedom. In this study, we build low-dimensional data-driven models of pressure-driven flow through a circular pipe. We impose the ‘shift-and-reflect’ symmetry to study the system in a minimal computational cell (i.e. the smallest domain size that sustains turbulence) at a Reynolds number of 2500. We build these models by using autoencoders to parametrise the manifold coordinates and neural ordinary differential equations to describe their time evolution. Direct numerical simulations (DNSs) typically require $\mathcal{O}(10^5)$ degrees of freedom, while our data-driven framework enables the construction of models with fewer than 20 degrees of freedom. Remarkably, these reduced-order models quantitatively capture crucial features of the flow, including the streak breakdown and regeneration cycle. In short-time tracking, they accurately follow the true trajectory for one Lyapunov time and reproduce the leading Lyapunov exponent, while at long times they capture key aspects of the dynamics such as Reynolds stresses and energy balance. Additionally, we report new exact coherent states found in the DNS with the aid of these low-dimensional models. This approach leads to the discovery of seventeen previously unknown solutions within the turbulent pipe flow system, notably featuring relative periodic orbits characterised by the longest reported periods for such flow conditions.
With the growing amount of historical infrastructure data available to engineers, data-driven techniques have been increasingly employed to forecast infrastructure performance. In addition to algorithm selection, the data preprocessing strategy used in a machine learning implementation plays an equally important role in ensuring accuracy and reliability. The present study focuses on pavement infrastructure and identifies four categories of strategies for preprocessing data to train machine-learning-based forecasting models. The Long-Term Pavement Performance (LTPP) dataset is employed to benchmark these categories. Employing random forest as the machine learning algorithm, the comparative study examines the impact of data preprocessing strategy, the volume of historical data, and the forecast horizon on the accuracy and reliability of performance forecasts. The strengths and limitations of each implementation strategy are summarized. Multiple pavement performance indicators are also analysed to assess the generalizability of the findings. Based on the results, several findings and recommendations are provided for short- to medium-term infrastructure management and decision-making: (i) in data-scarce scenarios, strategies that incorporate both explanatory variables and historical performance data provide better accuracy and reliability, (ii) to achieve accurate forecasts, the volume of historical data should span at least a time duration comparable to the intended forecast horizon, and (iii) for the International Roughness Index and transverse crack length, a forecast horizon of up to five years is generally achievable, but forecasts beyond a three-year horizon are not recommended for longitudinal crack length. These quantitative guidelines ultimately support more effective and reliable application of data-driven techniques in infrastructure performance forecasting.
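One of the preprocessing strategies the study compares — combining lagged historical performance with explanatory variables before fitting a random forest — can be sketched in a few lines. The deterioration series below is synthetic; the study itself uses LTPP records, and the feature choices here are purely illustrative.

```python
# Sketch of one preprocessing strategy from the study: lagged historical
# performance combined with an explanatory variable as inputs to a
# random forest forecaster. The roughness series is synthetic; the
# study uses the LTPP dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic International Roughness Index (IRI) histories for 200
# pavement sections: roughness drifts upward with age, faster under
# heavier traffic, plus measurement noise.
n_sections, n_years = 200, 10
traffic = rng.uniform(0.5, 2.0, size=(n_sections, 1))
age = np.arange(n_years)
iri = 1.0 + 0.1 * traffic * age + rng.normal(0, 0.02, (n_sections, n_years))

# Strategy: predict the next year's IRI from the two most recent
# observations (historical performance) plus traffic (explanatory).
X = np.column_stack([iri[:, -3], iri[:, -2], traffic.ravel()])
y = iri[:, -1]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
print("train R^2:", round(model.score(X, y), 3))
```

Extending the lag window, or dropping either the explanatory variable or the historical lags, reproduces the other strategy categories the study benchmarks.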
This study aimed to evaluate an artificial intelligence-assisted tool for psychiatric case formulation compared with human trainees. Twenty trainees and an artificial intelligence system produced formulations for three simulated psychiatric cases. Formulations were scored using the integrated case formulation scale (ICFS), assessing content, integration and total quality. Time taken was recorded, and assessor predictions of formulation origin were analysed.
Results
Artificial intelligence produced formulations significantly faster (<10 s) than trainees (mean 52.1 min). Trainees achieved higher ICFS total scores (mean difference 8.3, P < 0.001), driven by superior content scores, while integration scores were comparable. The assessor identified artificial intelligence-generated formulations with 71.4% sensitivity, but overall accuracy in identifying who produced each formulation was only 58.3%.
Clinical implications
Artificial intelligence shows promise as a time-saving adjunct in psychiatric training and practice, but requires improvements in generating detailed content. Optimising teaching methods for trainees and refining artificial intelligence systems can enhance the integration of artificial intelligence into clinical workflows.
Synthetic datasets, artificially generated to mimic real-world data while maintaining anonymization, have emerged as a promising technology in the financial sector, attracting support from regulators and market participants as a solution to data privacy and scarcity challenges limiting machine learning (ML) deployment. This article argues that synthetic data’s effects on financial markets depend critically on how these technologies are embedded within existing ML infrastructural ‘stacks’ rather than on their intrinsic properties. We identify three key tensions that will determine whether adoption proves beneficial or harmful: (1) data circulability versus opacity, particularly the ‘double opacity’ problem arising from stacked ML systems, (2) model-induced scattering versus model-induced herding in market participant behavior, and (3) flattening versus deepening of data platform power. These tensions directly correspond to core regulatory priorities around model risk management, systemic risk, and competition policy. Using financial audit as a case study, we demonstrate how these tensions interact in practice and propose governance frameworks, including a synthetic data labeling regime to preserve contextual information when datasets cross organizational boundaries.
A deep-learning-based closure model is introduced to address the energy loss in low-dimensional surrogate models based on proper-orthogonal-decomposition (POD) modes. Using a transformer-encoder block with an easy-attention mechanism, the model predicts the spatial probability density function of the fluctuations not captured by the truncated POD modes. The methodology is demonstrated on the wake of the Windsor body at yaw angles of $\delta = [2.5^\circ ,5^\circ ,7.5^\circ ,10^\circ ,12.5^\circ ]$, with $\delta = 7.5^\circ$ as a test case, and in a realistic urban environment at wind directions of $\delta = [-45^\circ ,-22.5^\circ ,0^\circ ,22.5^\circ ,45^\circ ]$, with $\delta = 0^\circ$ as a test case. Key coherent modes are identified by clustering them based on dominant frequency dynamics, using Hotelling’s $T^2$ on the spectral properties of the temporal coefficients. These coherent modes account for nearly $60 \,\%$ and $75 \,\%$ of the total energy for the Windsor body and the urban environment, respectively. For each case, a common POD basis is created by concatenating the coherent modes from the training angles and orthonormalising the set without losing information. Transformers with attention layers of different sizes (64, 128 and 256) are trained to model the missing fluctuations in the Windsor body case. Larger attention sizes consistently improve predictions for the training set, but the transformer with an attention layer of size 256 slightly overshoots the fluctuation predictions in the Windsor body test set, because the fluctuations there have lower intensity than in the training cases. A single transformer with an attention size of 256 is trained for the urban flow. In both cases, adding the predicted fluctuations closes the energy gap between the reconstruction and the original flow field, improving predictions for the energy, root-mean-square velocity fluctuations and instantaneous flow fields.
For instance, in the Windsor body case, the deepest architecture reduces the mean energy error from $37 \,\%$ to $12 \,\%$ and decreases the Kullback–Leibler divergence of velocity distributions from ${\mathcal{D}}_{\mathcal{KL}}=0.2$ to below ${\mathcal{D}}_{\mathcal{KL}}=0.026$.
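The POD step underlying this closure model can be sketched with a plain SVD: snapshot data are decomposed into spatial modes and modal energies, and the energy fraction retained by a truncated basis quantifies what remains for the closure to supply. The snapshot matrix below is synthetic, with three planted coherent modes standing in for the clustered coherent modes of the abstract.

```python
# Sketch of the POD decomposition behind the closure model: snapshots
# are decomposed with an SVD and modal energy fractions show how much
# a truncated basis retains. The snapshot matrix here is synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic snapshot matrix: 3 coherent spatial modes with time-varying
# coefficients of decreasing amplitude, plus small-scale fluctuations.
n_space, n_time = 500, 200
modes = rng.normal(size=(n_space, 3))
coeffs = rng.normal(size=(3, n_time)) * np.array([[10.0], [5.0], [2.0]])
snapshots = modes @ coeffs + 0.1 * rng.normal(size=(n_space, n_time))

# Subtract the temporal mean; POD = SVD of the fluctuation matrix.
fluct = snapshots - snapshots.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(fluct, full_matrices=False)
energy = s**2 / np.sum(s**2)

# Energy captured by the leading 3 modes (the truncated basis); the
# remainder is the part a closure model must account for.
captured = energy[:3].sum()
print(f"energy in 3 modes: {captured:.3f}")
```

In the abstract's cases the coherent modes retain roughly 60% and 75% of the energy, and the transformer supplies the truncated remainder; here the planted modes dominate by construction, so the retained fraction is much higher.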
This Element provides a comprehensive guide to deep learning in quantitative trading, merging foundational theory with hands-on applications. It is organized into two parts. The first part introduces the fundamentals of financial time-series and supervised learning, exploring various network architectures, from feedforward to state-of-the-art. To ensure robustness and mitigate overfitting on complex real-world data, a complete workflow is presented, from initial data analysis to cross-validation techniques tailored to financial data. Building on this, the second part applies deep learning methods to a range of financial tasks. The authors demonstrate how deep learning models can enhance both time-series and cross-sectional momentum trading strategies, generate predictive signals, and be formulated as an end-to-end framework for portfolio optimization. Applications include a mixture of data from daily data to high-frequency microstructure data for a variety of asset classes. Throughout, they include illustrative code examples and provide a dedicated GitHub repository with detailed implementations.
Politicians’ presentation of self is central to election efforts. For these efforts to be successful, they need voters to receive and believe the messages they communicate. We examine the relationship between politicians’ communications and voters’ perceptions of their ideology. Using the content of politicians’ ideological presentation of self through social media communications, we create a measure of messaging ideology for all congressional candidates between 2018 and 2022 and all congressional officeholders between 2012 and 2022 along with voter perceptions of candidate ideology during the same time period. Using these measures, our work shows voters’ perceptions of candidate ideology are strongly related to messaging, even after controlling for incumbent voting behavior. We also examine how the relationship between politician messaging and voter perceptions changes relative to other information about the politician and in different electoral contexts. On the whole, voters’ perceptions of candidate ideology are strongly correlated with politician communications.
Despite their widespread use, purely data-driven methods often suffer from overfitting, lack of physical consistency, and high data dependency, particularly when physical constraints are not incorporated. This study introduces a novel data assimilation approach that integrates Graph Neural Networks (GNNs) with optimization techniques to enhance the accuracy of mean flow reconstruction, using Reynolds-averaged Navier–Stokes (RANS) equations as a baseline. The method leverages the adjoint approach, incorporating RANS-derived gradients as optimization terms during GNN training, ensuring that the learned model adheres to physical laws and maintains consistency. Additionally, the GNN framework is well-suited for handling unstructured data, which is common in the complex geometries encountered in computational fluid dynamics. The GNN is interfaced with the finite element method for numerical simulations, enabling accurate modeling in unstructured domains. We consider the reconstruction of mean flow past bluff bodies at low Reynolds numbers as a test case, addressing tasks such as sparse data recovery, denoising, and inpainting of missing flow data. The key strengths of the approach lie in its integration of physical constraints into the GNN training process, leading to accurate predictions with limited data, making it particularly valuable when data are scarce or corrupted. Results demonstrate significant improvements in the accuracy of mean flow reconstructions, even with limited training data, compared to analogous purely data-driven models.
Bridging theory and practice in network data analysis, this guide offers an intuitive approach to understanding and analyzing complex networks. It covers foundational concepts, practical tools, and real-world applications using Python frameworks including NumPy, SciPy, scikit-learn, graspologic, and NetworkX. Readers will learn to apply network machine learning techniques to real-world problems, transform complex network structures into meaningful representations, leverage Python libraries for efficient network analysis, and interpret network data and results. The book explores methods for extracting valuable insights across various domains such as social networks, ecological systems, and brain connectivity. Hands-on tutorials and concrete examples develop intuition through visualization and mathematical reasoning. The book will equip data scientists, students, and researchers in applications using network data with the skills to confidently tackle network machine learning projects, providing a robust toolkit for data science applications involving network-structured data.
Within the context of machine-learning-based closure mappings for Reynolds-averaged Navier–Stokes turbulence modelling, physical realisability is often enforced using ad hoc postprocessing of the predicted anisotropy tensor. In this study, we address the realisability issue via a new physics-based loss function that penalises non-realisable results during training, thereby embedding a preference for realisable predictions into the model. Additionally, we propose a new framework for data-driven turbulence modelling which retains the stability and conditioning of optimal eddy-viscosity-based approaches while embedding equivariance. Several modifications to the tensor basis neural network to enhance training and testing stability are proposed. We demonstrate the conditioning, stability and generalisation of the new framework and model architecture on three flows: flow over a flat plate, flow over periodic hills and flow through a square duct. The realisability-informed loss function is demonstrated to significantly increase the number of realisable predictions made by the model when generalising to a new flow configuration. Altogether, the proposed framework enables the training of stable and equivariant anisotropy mappings, with more physically realisable predictions on new data. We make our code available for use and modification by others. Moreover, as part of this study, we explore the applicability of Kolmogorov–Arnold networks to turbulence modelling, assessing their potential to address nonlinear mappings in the anisotropy tensor predictions and demonstrating promising results for the flat-plate case.
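A realisability-informed penalty of the kind described can be sketched directly from one standard necessary condition: the eigenvalues of the (symmetric, trace-free) anisotropy tensor must lie in [-1/3, 2/3]. The function below sums eigenvalue excursions outside that interval; it is an illustrative penalty built on that condition, not the paper's exact loss function.

```python
# Sketch of a realisability penalty for predicted anisotropy tensors:
# eigenvalues of the trace-free anisotropy tensor b must lie in
# [-1/3, 2/3] (a standard necessary condition); excursions outside the
# interval add to the loss. Illustrative only, not the paper's loss.
import numpy as np

def realisability_penalty(b):
    """Penalty for a batch of symmetric trace-free anisotropy tensors
    b with shape (n, 3, 3): total eigenvalue excursion outside
    [-1/3, 2/3]."""
    eigvals = np.linalg.eigvalsh(b)                   # shape (n, 3)
    low = np.clip(-1.0 / 3.0 - eigvals, 0.0, None)    # below lower bound
    high = np.clip(eigvals - 2.0 / 3.0, 0.0, None)    # above upper bound
    return float(np.sum(low + high))

# Isotropic turbulence (b = 0) is realisable and incurs no penalty...
b_ok = np.zeros((1, 3, 3))
# ...while a component carrying more than all the energy (eigenvalue
# 1.0 > 2/3, with -0.5 < -1/3 in the other directions) is penalised.
b_bad = np.diag([1.0, -0.5, -0.5])[None, :, :]

print(realisability_penalty(b_ok), realisability_penalty(b_bad))
```

Added to a data-mismatch loss with a weighting factor, a differentiable version of such a term steers training toward realisable predictions instead of clipping them after the fact.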
This chapter introduces the network machine learning landscape, bridging traditional machine learning with network-specific approaches. It defines networks, contrasts them with tabular data structures, and explains their ubiquity in various domains. The chapter outlines different types of network learning systems, including single vs. multiple network, attributed vs. non-attributed, and model-based vs. non-model-based approaches. It also discusses the scope of network analysis, from individual edges to entire networks. The chapter concludes by addressing key challenges in network machine learning, such as imperfect observations, partial network visibility, and sample limitations. Throughout, it emphasizes the importance of statistical learning in generalizing findings from network samples to broader populations, setting the stage for more advanced concepts in subsequent chapters.
This research paper aimed to develop a supervised machine learning (ML) approach that learns and predicts data-based culling from farm information reflecting the criteria a farm manager uses when deciding to cull a cow. Data containing the features milk yield, days in milk, lactation number, pregnancy status, days open and days pregnant were obtained from January to December 2020 from dairy cows on a large dairy farm in northern Mexico. The cows were labelled as those that were data-based culled (Cull) and those that were not culled (Stay). Six supervised ML algorithms were evaluated in a binary classification task: logistic regression (LR), Gaussian naïve Bayes (GNB), k-nearest neighbors (k-NN), support vector machine (SVM), random forest (RF) and multilayer perceptron (MLP). Each model was subjected to hyperparameter optimization using a grid search combined with tenfold stratified cross-validation, which ensured that the class imbalance (Cull vs. Stay) was accounted for during model evaluation. The best-performing model for each algorithm was selected based on cross-validated accuracy. To evaluate the prediction performance of the ML algorithms on both labels, the metrics accuracy, precision, recall, F1-score and the Matthews correlation coefficient (MCC) were employed. Accuracy among all classifiers was >0.90. The poorest prediction performance was observed for GNB (MCC = 0.50) and LR (MCC = 0.72). Conversely, the remaining classifiers achieved superior performance in learning the specific culling criteria, reaching MCC scores >0.91. Overall, culling criteria can be learned and predicted by ML algorithms, and performance varies among classifiers. This study identified RF as the best-performing algorithm, but k-NN, SVM and MLP are possible candidates for use under on-farm conditions.
To increase their reliability, these approaches need to be tested on multiple farms, under different scenarios and with a wider variety of features.
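The evaluation protocol described above — classifiers compared on a binary Cull/Stay task using accuracy alongside the class-imbalance-aware MCC — can be sketched as follows. The data are synthetic stand-ins for the farm records, and only two of the six compared algorithms are shown.

```python
# Sketch of the culling-prediction evaluation: classifiers compared on
# a binary Cull/Stay task with accuracy and the Matthews correlation
# coefficient (MCC), which is robust to class imbalance. Synthetic data
# stand in for the farm records; six features mimic the study's inputs
# (milk yield, days in milk, lactation number, etc.).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Imbalanced classes: the minority plays the role of culled cows.
X, y = make_classification(n_samples=3000, n_features=6, n_informative=4,
                           weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for name, model in [("RF", RandomForestClassifier(random_state=0)),
                    ("k-NN", KNeighborsClassifier())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = (accuracy_score(y_te, pred), matthews_corrcoef(y_te, pred))
    print(f"{name}: accuracy={results[name][0]:.3f}, MCC={results[name][1]:.3f}")
```

Reporting MCC alongside accuracy matters here because, with a Cull prevalence well below 50%, a classifier that labels every cow Stay can still post high accuracy while its MCC collapses to zero.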