To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure no-reply@cambridge.org
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool R, a new chapter on using R for statistical analysis, and a new chapter that demonstrates how to use R within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool Python, a new chapter on using Python for statistical analysis, and a new chapter that demonstrates how to use Python within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Bridge the gap between theoretical concepts and their practical applications with this rigorous introduction to the mathematics underpinning data science. It covers essential topics in linear algebra, calculus and optimization, and probability and statistics, demonstrating their relevance in the context of data analysis. Key application topics include clustering, regression, classification, dimensionality reduction, network analysis, and neural networks. What sets this text apart is its focus on hands-on learning. Each chapter combines mathematical insights with practical examples, using Python to implement algorithms and solve problems. Self-assessment quizzes, warm-up exercises and theoretical problems foster both mathematical understanding and computational skills. Designed for advanced undergraduate students and beginning graduate students, this textbook serves as both an invitation to data science for mathematics majors and as a deeper excursion into mathematics for data science students.
This chapter discusses the broader role and impact of analytics science in improving various aspects of society. It introduces what the book is about, and what the reader should expect to learn from reading this book. It also discusses the analytics revolution in the private and public sector, and introduces a key element of the book — insight-driven problem solving — by highlighting its vital role in addressing various societal problems.
This chapter is devoted to data analysis and its critical role in analytics science. The reader is introduced to the science of inference from observations and experiments and learns about the main ideas in data analysis that have been influential in addressing societal problems. Real-world examples are used throughout to convey the main ideas and illustrate why data analyses performed without sufficient care can yield wrong insights. Successful examples of insight-driven problem solving approaches in data analysis are contrasted with those that can yield wrong insights, and the reader is taken on an engaging yet educational journey that depicts how and why successful insight-driven problem solving approaches using data can have significant public impact.
This chapter introduces the reader to the big picture of what analytics science is. What is analytics science? What types does it have, and what is its scope? How can analytics science be used to improve various tasks that society needs to carry out? Is analytics science all about using data? Or can it work without data? What is the role of data versus models? How can one develop and rely on a model to answer essential questions when the model can be wrong due to its assumptions? What is ambiguity in analytics science? Is that different from risk? And how do analytics scientists address ambiguity? What is the role of simulation in analytics science? These are some of the questions that the chapter addresses. Finally, the chapter discusses the notion of "centaurs" and how a successful use of analytics science often requires combining human intuition with the power of strong analytical models.
This textbook introduces the fundamentals of MATLAB for behavioral sciences in a concise and accessible way. Written for those with or without computer programming experience, it works progressively from fundamentals to applied topics, culminating in in-depth projects. Part I covers programming basics, ensuring a firm foundation of knowledge moving forward. Difficult topics, such as data structures and program flow, are then explained with examples from the behavioral sciences. Part II introduces projects for students to apply their learning directly to real-world problems in computational modelling, data analysis, and experiment design, with an exploration of Psychtoolbox. Accompanied by online code and datasets, extension materials, and additional projects, with test banks, lecture slides, and a manual for instructors, this textbook represents a complete toolbox for both students and instructors.
Numerous symposia and conferences have been held to discuss the promise of Artificial Intelligence (AI). Many center on its potential to transform fields like health and medicine, law, education, business, and more. Further, while many AI-focused events include those data scientists involved in developing foundational models, to our knowledge, there has been little attention on AI’s role for data science and the data scientist. In a new symposium series with its inaugural debut in December 2024 titled AI for Data Science, thought leaders convened to discuss both the promises and challenges of integrating AI into the workflows of data scientists. A keynote address by Michael Pencina from Duke University together with contributions from three panels covered a wide range of topics including rigor, reproducibility, the training of current and future data scientists, and the potential of AI’s integration in public health.
Bridging theory and practice in network data analysis, this guide offers an intuitive approach to understanding and analyzing complex networks. It covers foundational concepts, practical tools, and real-world applications using Python frameworks including NumPy, SciPy, scikit-learn, graspologic, and NetworkX. Readers will learn to apply network machine learning techniques to real-world problems, transform complex network structures into meaningful representations, leverage Python libraries for efficient network analysis, and interpret network data and results. The book explores methods for extracting valuable insights across various domains such as social networks, ecological systems, and brain connectivity. Hands-on tutorials and concrete examples develop intuition through visualization and mathematical reasoning. The book will equip data scientists, students, and researchers in applications using network data with the skills to confidently tackle network machine learning projects, providing a robust toolkit for data science applications involving network-structured data.
Chapter 13 presents the second application of MATLAB to behavioral sciences: data analysis. Students review previously-learned data structures often encountered in practice before applying their programming knowledge from Chapters 1 to 11 to manage each. Starting with tabular data, tables from Chapter 8 are reviewed, with students learning common data science tasks for managing one or more tabular data sets, before applying their knowledge to real experimental data. Next, hierarchical data are reviewed, connecting students’ knowledge of structure arrays from Chapter 8 to a popular internet-based data format (JSON), with students applying their newfound knowledge to analyze data on the behavior of European monarchs.
Recent advancements in data science and artificial intelligence have significantly transformed plant sciences, particularly through the integration of image recognition and deep learning technologies. These innovations have profoundly impacted various aspects of plant research, including species identification, disease detection, cellular signaling analysis, and growth monitoring. This review summarizes the latest computational tools and methodologies used in these areas. We emphasize the importance of data acquisition and preprocessing, discussing techniques such as high-resolution imaging and unmanned aerial vehicle (UAV) photography, along with image enhancement methods like cropping and scaling. Additionally, we review feature extraction techniques like colour histograms and texture analysis, which are essential for plant identification and health assessment. Finally, we discuss emerging trends, challenges, and future directions, offering insights into the applications of these technologies in advancing plant science research and practical implementations.
Ambient air pollution remains a global challenge, with adverse impacts on health and the environment. Addressing air pollution requires reliable data on pollutant concentrations, which form the foundation for interventions aimed at improving air quality. However, in many regions, including the United Kingdom, air pollution monitoring networks are characterized by spatial sparsity, heterogeneous placement, and frequent temporal data gaps, often due to issues such as power outages. We introduce a scalable data-driven supervised machine learning model framework designed to address temporal and spatial data gaps by filling missing measurements within the United Kingdom. The machine learning framework used is LightGBM, a gradient boosting algorithm based on decision trees, for efficient and scalable modeling. This approach provides a comprehensive dataset for England throughout 2018 at a 1 km2 hourly resolution. Leveraging machine learning techniques and real-world data from the sparsely distributed monitoring stations, we generate 355,827 synthetic monitoring stations across the study area. Validation was conducted to assess the model’s performance in forecasting, estimating missing locations, and capturing peak concentrations. The resulting dataset is of particular interest to a diverse range of stakeholders engaged in downstream assessments supported by outdoor air pollution concentration data for nitrogen dioxide (NO2), Ozone (O3), particulate matter with a diameter of 10 μm or less (PM10), particulate matter with a diameter of 2.5 μm or less PM2.5, and sulphur dioxide (SO2), at a higher resolution than was previously possible.
The nineteenth century was the first era of “big data” in the modern world, and American literary texts published during this time, such as Herman Melville’s Moby-Dick (1851), offer an aesthetic reframing of how individuals and institutions within a culture of data use information at scale to claim authority over knowledge and, by extension, power over people. Moby-Dick also gestures toward the ways that African and African American bodies were subjected to the most brutal regimes of quantification that the nineteenth century had to offer in the form of the transatlantic and intra-American slave trade. One of the major problems facing American literary studies and digital humanities today is the question of how to excavate and explicate the quantitative turn of earlier centuries as we seek to better understand the cultures of data we live in today. The best initial response to this problem is not to begin with a specific digital tool per se, but to build a set of guiding principles for how to critically approach data, media, and power from within a context that recognizes the distinctive contributions of literary texts as aesthetic objects. This essay models one such approach to do so.
Machine learning (ML) techniques have emerged as a powerful tool for predicting weather and climate systems. However, much of the progress to date focuses on predicting the short-term evolution of the atmosphere. Here, we look at the potential for ML methodology to predict the evolution of the ocean. The presence of land in the domain is a key difference between ocean modeling and previous work looking at atmospheric modeling. Here, we look to train a convolutional neural network (CNN) to emulate a process-based General Circulation Model (GCM) of the ocean, in a configuration which contains land. We assess performance on predictions over the entire domain and near to the land (coastal points). Our results show that the CNN replicates the underlying GCM well when assessed over the entire domain. RMS errors over the test dataset are low in comparison to the signal being predicted, and the CNN model gives an order of magnitude improvement over a persistence forecast. When we partition the domain into near land and the ocean interior and assess performance over these two regions, we see that the model performs notably worse over the near land region. Near land, RMS scores are comparable to those from a simple persistence forecast. Our results indicate that ocean interaction with land is something the network struggles with and highlight that this is may be an area where advanced ML techniques specifically designed for, or adapted for, the geosciences could bring further benefits.
Originating from a unique partnership between data scientists (datavaluepeople) and peacebuilders (Build Up), this commentary explores an innovative methodology to overcome key challenges in social media analysis by developing customized text classifiers through a participatory design approach, engaging both peace practitioners and data scientists. It advocates for researchers to focus on developing frameworks that prioritize being usable and participatory in field settings, rather than perfect in simulation. Focusing on a case study investigating the polarization within online Christian communities in the United States, we outline a testing process with a dataset consisting of 8954 tweets and 10,034 Facebook posts to experiment with active learning methodologies aimed at enhancing the efficiency and accuracy of text classification. This commentary demonstrates that the inclusion of domain expertise from peace practitioners significantly refines the design and performance of text classifiers, enabling a deeper comprehension of digital conflicts. This collaborative framework seeks to transition from a data-rich, analysis-poor scenario to one where data-driven insights robustly inform peacebuilding interventions.
The application of data analytics to product usage data has the potential to enhance engineering and decision-making in product planning. To achieve this effectively for cyber-physical systems (CPS), it is necessary to possess specialized expertise in technical products, innovation processes, and data analytics. An understanding of the process from domain knowledge to data analysis is of critical importance for the successful completion of projects, even for those without expertise in these areas. In this paper, we set out the foundation for a toolbox for data analytics, which will enable the creation of domain-specific pipelines for product planning. The toolbox includes a morphological box that covers the necessary pipeline components, based on a thorough analysis of literature and practitioner surveys. This comprehensive overview is unique. The toolbox based on it promises to support and enable domain experts and citizen data scientists, enhancing efficiency in product design, speeding up time to market, and shortening innovation cycles.
Behavioral Network Science explains how and why structure matters in the behavioral sciences. Exploring open questions in language evolution, child language learning, memory search, age-related cognitive decline, creativity, group problem solving, opinion dynamics, conspiracies, and conflict, readers will learn essential behavioral science theory alongside novel network science applications. This book also contains an introductory guide to network science, demonstrating how to turn data into networks, quantify network structure across scales, and hone one's intuition for how structure arises and evolves. Online R code allows readers to explore the data and reproduce all the visualizations and simulations for themselves, empowering them to make contributions of their own. For data scientists interested in gaining a professional understanding of how the behavioral sciences inform network science, or behavioral scientists interested in learning how to apply network science from the ground up, this book is an essential guide.
Maximise student engagement and understanding of matrix methods in data-driven applications with this modern teaching package. Students are introduced to matrices in two preliminary chapters, before progressing to advanced topics such as the nuclear norm, proximal operators and convex optimization. Highlighted applications include low-rank approximation, matrix completion, subspace learning, logistic regression for binary classification, robust PCA, dimensionality reduction and Procrustes problems. Extensively classroom-tested, the book includes over 200 multiple-choice questions suitable for in-class interactive learning or quizzes, as well as homework exercises (with solutions available for instructors). It encourages active learning with engaging 'explore' questions, with answers at the back of each chapter, and Julia code examples to demonstrate how the mathematics is actually used in practice. A suite of computational notebooks offers a hands-on learning experience for students. This is a perfect textbook for upper-level undergraduates and first-year graduate students who have taken a prior course in linear algebra basics.
Drawing examples from real-world networks, this essential book traces the methods behind network analysis and explains how network data is first gathered, then processed and interpreted. The text will equip you with a toolbox of diverse methods and data modelling approaches, allowing you to quickly start making your own calculations on a huge variety of networked systems. This book sets you up to succeed, addressing the questions of what you need to know and what to do with it, when beginning to work with network data. The hands-on approach adopted throughout means that beginners quickly become capable practitioners, guided by a wealth of interesting examples that demonstrate key concepts. Exercises using real-world data extend and deepen your understanding, and develop effective working patterns in network calculations and analysis. Suitable for both graduate students and researchers across a range of disciplines, this novel text provides a fast-track to network data expertise.