Policy Significance Statement
This work highlights the impact of adversarial attacks on natural language processing (NLP) systems, especially in high-stakes application domains such as healthcare. As these artificial intelligence (AI) methods become more powerful, policymakers must ensure that they are used fairly, securely, and transparently. Key concerns include preventing bias, protecting privacy, and managing the high energy demands of large-scale models. This paper explores attacks, defenses, and the growing role of Bayesian methods in improving robustness and decision-making. These advances also raise concerns about data protection and algorithmic bias, so policymakers should promote transparency, ethical standards, and sustainable AI practices. Balanced regulation will allow NLP technologies to remain trustworthy, effective, and aligned with the public interest across sectors and international boundaries.
1. Introduction
Natural language processing (NLP) has evolved significantly over the years, transitioning from rule-based systems to statistical and machine learning models. More recently, deep learning and the introduction of transformer architectures marked a revolution for NLP (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2018). Fueled by the emergence of generative AI (GenAI) and large language models (LLMs), this shift has redefined human-computer interaction and broadened the scope of NLP applications across domains. Models such as bidirectional encoder representations from transformers (BERT) and generative pre-trained transformers (GPT) have yielded cutting-edge performance in tasks such as language understanding, generation, translation, and summarization (Johri et al., Reference Johri, Khatri, Al-Taani, Sabharwal, Suvanov and Kumar2021). NLP-based technologies enhance various fields such as healthcare, customer service, education, and entertainment (Esmradi et al., Reference Esmradi, Yip and Chan2023). This rapid progress and the growing number of NLP applications in human-facing domains underscore the importance of addressing the relevant policy implications to ensure equitable, ethical, and effective deployment.
Moreover, the increasing use of NLP systems amplifies security concerns, since manipulated data can alter NLP outcomes. For instance, adversarial attacks can flip the sentiment assigned to a text, manipulate translation results, or generate misleading content in automated systems. The consequences of adversarial attacks on NLP systems can be severe, ranging from security and privacy risks to reduced reliability, and they can erode trust in NLP systems, especially in critical applications like legal document analysis, medical diagnosis, or autonomous vehicles. To mitigate the impact of adversarial threats, defensive techniques such as adversarial training (AdvT) and input preprocessing are utilized (Goyal et al., Reference Goyal, Doddapaneni, Khapra and Ravindran2023). Integrating adversarial testing and evaluation into the development lifecycle helps uncover and address vulnerabilities before NLP systems are deployed in real-world scenarios. However, securing these systems remains a persistent challenge, demanding ongoing innovation and adaptability. Emerging vulnerabilities, especially in high-stakes domains like healthcare, law, and cybersecurity, raise complex policy issues. To mitigate risks from misleading or harmful outputs, a strong regulatory framework is essential, including safeguards such as secure model training, continuous adversarial testing, and transparency throughout model deployment.
Several review papers have surveyed adversarial attacks and defenses in NLP, each focusing on different aspects. Zhang et al. (Reference Zhang, Sheng, Alhazmi and Li2020b) provided a comprehensive overview of adversarial attack techniques on deep learning models in NLP, covering convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. They emphasized the diversity of attack methods and their impact on various NLP tasks. Li et al. (Reference Li, Qiu, Qian and Zhao2020) focused on the vulnerability of RNNs to adversarial attacks, exploring how spatial and temporal dependencies in text data can be exploited, and highlighting potential mitigation strategies. Dong et al. (Reference Dong, Dong, Yuan and Guan2022) covered both adversarial attacks and defenses on NLP in deep learning. Alsmadi et al. (Reference Alsmadi, Aljaafari, Nazzal, Alhamed, Sawalmeh, Vizcarra, Khreishah, Anan, Algosaibi, al-Naeem, Aldalbahi and al-Humam2022) presented a survey of methods for text generation. Cheng et al. (Reference Cheng, Jiang and Macherey2019) examined the susceptibility of neural machine translation (NMT) models, particularly transformers, to adversarial perturbations. These reviews focus on specific models or types of attacks without providing a unified overview. In addition, recent advancements in adversarial attack methods and defense mechanisms are not fully captured in these surveys, particularly empirical comparisons involving ensemble techniques developed in the last few years. They also often lack guidance on practical implementation, making it challenging for practitioners to apply these methods. In terms of LLM defenses, Esmradi et al. (Reference Esmradi, Yip and Chan2023) analyzed LLM security vulnerabilities and reviewed effective defense strategies, including data sanitization, encryption-based methods, differential privacy, and filtering. The review by Qiu et al. (Reference Qiu, Liu, Zhou and Huang2022) covered attack methods such as character-level, word-level, and sentence-level perturbations, as well as defense strategies ranging from data augmentation and AdvT to recent innovations like certified defenses, alongside a discussion of evaluation metrics. Despite this broad coverage, practical guidance, emerging techniques such as Bayesian methods, and overarching policy implications remain insufficiently explored in this fast-moving field.
We address these limitations by extending Shaw et al. (Reference Shaw, Ansari and Ekin2025) with an overview of policy implications related to adversarial NLP. In particular, we review adversarial attacks in NLP, examining attack methods, exploited vulnerabilities, and defense strategies. We identify key trends, gaps, and future directions, with a focus on Bayesian methods, in addition to policy implications. The contributions include a holistic overview that integrates recent attack and defense techniques, such as LLM attacks and zero-shot defenses, practical guidance including some empirical comparisons to assess method effectiveness, and coverage of policy implications and emerging techniques such as Bayesian methods.
This manuscript proceeds as follows. Section 2 provides an overview of the literature of NLP methods with an emphasis on Bayesian methods. Section 3 presents an overview of adversarial attacks and existing defenses, while providing practical guidance. Section 4 discusses the policy implications and the relevance of adversarial NLP. Section 5 presents emerging areas and future research directions. The manuscript concludes with final remarks in Section 6.
2. Related literature
2.1. Overview of NLP techniques
The main NLP methodologies include rule-based approaches, statistical methods, machine learning (ML), deep learning (DL), and transformer models. The choice of algorithm depends on the application because of the varying strengths and weaknesses of these approaches. Rule-based approaches use predefined linguistic rules for tasks like tokenization and parsing; while they are interpretable and preferred for simple tasks, they may lack adaptability to the complexity of natural language. Statistical methods, such as hidden Markov models (HMMs) and conditional random fields, are utilized for tasks like part-of-speech (POS) tagging and named entity recognition (NER). They are robust and handle uncertainty, but they require extensive feature engineering and may not capture long-range dependencies. Supervised ML models, e.g., support vector machines and naive Bayes classifiers, learn from labeled data for tasks like text classification, while unsupervised ML methods, like clustering and topic modeling, uncover hidden structures in text. Although ML algorithms may overfit to training data and need large labeled datasets and feature engineering, they are versatile and generalize well to new data. DL techniques, such as RNNs, CNNs, and long short-term memory networks (LSTMs), learn hierarchical representations from raw text. They excel at sequence modeling, text classification, and machine translation, and they can capture complex patterns and dependencies in text, but they require large datasets, are computationally expensive, and can suffer from vanishing or exploding gradients. Transformer architectures like BERT and GPT use self-attention mechanisms to capture contextual relationships in text. They excel at language understanding, generation, translation, and summarization but need massive computational resources and extensive pre-training on large text corpora (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez and Polosukhin2017). While they achieve state-of-the-art performance across NLP tasks, computing and data needs may limit their use.
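To illustrate how readily pre-trained transformers can now be applied to a downstream task, the following minimal sketch uses the Hugging Face transformers library; the checkpoint name is illustrative and any compatible fine-tuned sentiment model could be substituted.

```python
# Minimal sketch: applying a pre-trained transformer to sentiment classification.
# Assumes the Hugging Face transformers library is installed; the model name is
# illustrative, not a recommendation specific to this paper.
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("The product exceeded my expectations."))
# Expected output shape: [{'label': 'POSITIVE', 'score': ...}]
```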
The emerging use of NLP emphasizes challenges and policy implications related to data privacy, fairness, bias mitigation, and ethical AI governance. First, the reliance of transformer-based models like BERT and GPT on large-scale datasets raises concerns regarding user data privacy, especially when training on sensitive information (Carlini et al., Reference Carlini, Tramer, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea and Raffel2021). Regulatory frameworks such as the General Data Protection Regulation (GDPR) emphasize the need for data anonymization, consent-based collection, and user control over personal information, which may conflict with the vast, unregulated data sources often used in pretraining these models. Furthermore, bias in training data can propagate into model outputs, leading to unfair or discriminatory outcomes in applications such as automated hiring, content moderation, and sentiment analysis (Bender et al., Reference Bender, Gebru, McMillan-Major and Shmitchell2021). Additionally, the increasing carbon footprint of large-scale NLP models due to the computational needs raises sustainability concerns. This encourages the use of training techniques such as pruning, quantization, and knowledge distillation that could reduce energy consumption (Strubell et al., Reference Strubell, Ganesh and McCallum2020). Addressing these challenges requires a balanced regulatory approach that promotes innovation while ensuring ethical AI deployment in real-world applications. The use of NLP in adversarial environments amplifies some of these challenges, resulting in elevated policy implications (Schlarmann and Hein, Reference Schlarmann and Hein2023).
2.2. Bayesian methods for NLP
Bayesian approaches are versatile for NLP applications due to their natural ability to model uncertainty, embed prior knowledge, and support decision making under incomplete information (Cohen, Reference Cohen2022). In NLP tasks such as text classification, sentiment analysis, or machine translation, Bayesian inference can be used to estimate the posterior distribution of model parameters given observed data. This allows for the incorporation of prior knowledge, which can help regularize models to avoid overfitting and improve generalization to unseen data. For instance, naive Bayes classifiers are foundational in text classification tasks. They are based on Bayes’ theorem and make the simplifying assumption that the features are conditionally independent given the class label. Despite this simplification, naive Bayes often performs surprisingly well, especially in spam detection, sentiment analysis, and topic classification, as illustrated in the sketch below.
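The following minimal sketch, assuming scikit-learn is available and using a tiny toy corpus for illustration only, shows the naive Bayes text classifier just described.

```python
# Minimal sketch of a naive Bayes text classifier with scikit-learn.
# The toy corpus and labels are purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting agenda attached",
         "free cash offer", "project deadline reminder"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed a multinomial naive Bayes model, which applies
# Bayes' theorem under the conditional independence assumption described above.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize meeting"]))        # predicted class label
print(model.predict_proba(["free prize meeting"]))  # posterior class probabilities
```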
For probabilistic clustering of text data, a Bayesian hierarchical method, Latent Dirichlet Allocation (LDA) (Blei et al., Reference Blei, Ng and Jordan2003), and the subsequent development of topic models have become popular for document summarization and information retrieval (Abdelrazek et al., Reference Abdelrazek, Eid, Gawish, Medhat and Hassan2023). Bayesian methods can also be used for sequence labeling tasks such as POS tagging. For example, HMMs, which are probabilistic models that assume a sequence of observed words is generated by a sequence of hidden states (POS tags), can be trained using Bayesian inference. Bayesian networks and HMMs can be used to model the probabilistic relationships that arise in machine translation, word sense disambiguation, information retrieval, parsing, and named entity recognition. Finally, language models, including LLMs, where the goal is to predict the probability of a sequence of words, benefit from Bayesian n-gram models and neural networks in capturing the probabilistic relationships between words (Chien, Reference Chien2019). They can be useful for speech recognition and text generation. A minimal topic-modeling sketch follows.
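As a brief illustration of LDA, the sketch below fits a two-topic model with scikit-learn; the corpus, topic count, and printed term lists are illustrative only.

```python
# Minimal sketch of LDA topic modeling with scikit-learn; corpus and settings
# are illustrative. The Dirichlet hyperparameters act as the Bayesian priors
# discussed above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the court ruled on the new privacy law",
        "the patient was treated for a chronic condition",
        "the judge reviewed the legal contract",
        "the doctor prescribed a new treatment plan"]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)          # per-document topic proportions

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-3:]]
    print(f"Topic {k}: {top_terms}")
print(doc_topics.round(2))
```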
While Bayesian methods provide robust and interpretable solutions to a wide range of NLP tasks, their complexity and computational cost may limit real-world adoption. Bayesian methods in NLP can also present policy challenges related to fairness and transparency. Bayesian inference inherently depends on prior probabilities, which can introduce bias if the training data are not representative (Evans and Guo, Reference Evans and Guo2021). This is particularly relevant in adversarial NLP, where the robustness of models against misinformation, spam, and adversarial attacks is crucial.
3. Adversarial NLP
In conducting the following literature survey, we have followed the guidelines listed by Webster and Watson (Reference Webster and Watson2002) and Brocke et al. (Reference Jv, Simons, Niehaves, Riemer, Plattfaut and Cleven2009). Our coverage combines exhaustive and selective citation, focusing centrally on adversarial attacks and defenses in natural language processing. In particular, we used the keywords “Natural language processing,” “Adversarial attack NLP,” and “Adversarial defense NLP” for queries in Google Scholar, IEEE Xplore, and ScienceDirect. We conducted backward and forward searches focusing on attack and defense methods published between 2018 and 2024. However, synonyms for the term “natural language processing,” such as “text mining” or “computational linguistics,” were not used as search terms, so the coverage is not exhaustive. In addition, we omitted preprints with fewer than 50 citations. Figure 1 presents the taxonomy of adversarial attack and defense methods in NLP.

Figure 1. Taxonomy of adversarial attacks and defenses in NLP.
3.1. Adversarial attacks in NLP
Adversarial attacks with varying complexity and knowledge levels have been developed to weaken the performance and reliability of NLP systems. These methods usually involve changing text data at different levels, such as characters (char), words, or sentences, to trick NLP models into producing incorrect outputs, as in misclassification. Character-level attacks alter individual characters in a text, such as changing “cat” to “c at”; this minor modification can disrupt the model’s understanding and cause incorrect predictions (Huang et al., Reference Huang, Kajiwara and Arase2021). In word-level attacks, individual words in a text are strategically altered to mislead NLP models without changing the overall meaning. An example involves changing “The product is excellent” to “The product is terrible” by substituting “excellent” with “terrible”; this subtle change can cause the model to misclassify the sentiment (Gao et al., Reference Gao, Lanchantin, Soffa and Qi2018). Sentence-level (paraphrasing) attacks rephrase sentences to confuse NLP models while keeping the meaning the same, for example rephrasing “The cat sat on the mat” as “The mat was where the cat sat,” which can lead the model to misclassify the sentence (Zeng et al., Reference Zeng, Li, Song, Gao, Lyu and King2018). A simple illustration of such perturbations is sketched below.
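The following minimal sketch, written in plain Python with no attack library, illustrates the character-level and word-level perturbations described above; the substitution dictionary and perturbation rate are illustrative.

```python
# Illustrative character- and word-level perturbations (not a published attack).
import random

def char_split_attack(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Insert a space inside a fraction of words, e.g., 'cat' -> 'c at'."""
    rng = random.Random(seed)
    words = []
    for w in text.split():
        if len(w) > 2 and rng.random() < rate:
            i = rng.randint(1, len(w) - 1)
            w = w[:i] + " " + w[i:]
        words.append(w)
    return " ".join(words)

def word_substitution_attack(text: str, substitutions: dict) -> str:
    """Replace selected words (e.g., sentiment-bearing terms) with chosen alternatives."""
    return " ".join(substitutions.get(w.lower(), w) for w in text.split())

print(char_split_attack("The cat sat on the mat", rate=0.5))
print(word_substitution_attack("The product is excellent", {"excellent": "terrible"}))
```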
In terms of knowledge of the attacked model, adversarial attacks range between white-box and black-box settings. In white-box attacks, the attacker has complete information about the target model, including its architecture, parameters, and training data. A black-box attacker has little or no knowledge of the target model and relies on querying the model and observing its outputs to create adversarial examples. Binary attacks target binary classification tasks. Attacks can also be targeted, aiming for a particular output, or non-targeted and more general.
Attack methods broadly range from basic attacks that add manually crafted inputs or alter words to iterative gradient-based refinements. Our work reflects the advances in NLP models, where attackers have used heuristic and gradient-based methods to exploit model vulnerabilities. Heuristic methods are mostly specific to certain models and lack generalizability. Gradient-based techniques like the Fast Gradient Sign Method (FGSM) create perturbations that mislead models while remaining undetectable to humans (Wang, Reference Wang2022). These gradient-based adversarial perturbations can be iteratively refined, as in iterative FGSM and Projected Gradient Descent (PGD), resulting in higher success rates than one-shot methods (Chao et al., Reference Chao, Robey, Dobriban, Hassani, Pappas and Wong2023). Guo et al. (Reference Guo, Sablayrolles, Jégou and Kiela2021) present an overview of adversarial attack techniques on deep learning models, while Hartl et al. (Reference Hartl, Bachl, Fabini and Zseby2020) and Cheng et al. (Reference Cheng, Jiang and Macherey2019) focus on RNNs and transformer-based NMT models, respectively. Smaller variants of transformer models (e.g., BERT, GPT-3) are more susceptible to adversarial attacks. A sketch of the core FGSM step follows, and Table 1 presents highlights of adversarial attacks, ranging from char-level to sentence-level, against various NLP models.
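The following hedged sketch shows a single FGSM step in the embedding space of a text classifier; the model, embedding layer, inputs, and label are placeholders (the model is assumed to accept embedded inputs and return class logits), not a specific published attack implementation.

```python
# Hedged sketch of an FGSM-style perturbation on token embeddings (PyTorch).
# In text attacks, the perturbed embeddings are typically mapped back to the
# nearest valid tokens or used directly for adversarial training.
import torch
import torch.nn.functional as F

def fgsm_embedding_attack(model, embed, input_ids, label, epsilon=0.01):
    emb = embed(input_ids).detach()      # token embeddings of the clean input
    emb.requires_grad_(True)
    logits = model(inputs_embeds=emb)    # assumed: model maps embeddings to class logits
    loss = F.cross_entropy(logits, label)
    loss.backward()
    # One-step FGSM: move each embedding in the direction of the loss gradient's sign.
    adv_emb = emb + epsilon * emb.grad.sign()
    return adv_emb.detach()
```

Iterative variants such as PGD repeat this step with a projection back onto a small norm ball around the original embeddings.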
Table 1. Review highlights of adversarial attacks in NLP

3.2. Defense mechanisms in NLP
The increasing effectiveness of adversarial attacks makes robust countermeasures necessary for NLP security. Defenses against adversarial attacks in NLP are based either on (reactive) detection or on (proactive) model enhancement. Detection and filtering methods may have limited power against sophisticated and dynamic adversarial attacks; therefore, model enhancement methods such as AdvT, functional improvement, and certification may be preferred. Among these, AdvT proactively includes adversarial examples in the training data. For instance, SmoothLLM uses adversarial examples during training to improve the robustness of LLMs, with remarkable results at the cost of additional computational resources. Phrasing is a tailored method that trains models to recognize resilient phrases. The zero-shot defender for adversarial sample detection and restoration (ZDDR) combines AdvT with zero-shot learning to detect and restore adversarial inputs. As shown in Table 2, several other methods enhance model robustness against specific attacks at the cost of additional fine-tuning. Input preprocessing and data transformation methods include synonym encoding, which replaces words with synonyms to reduce sensitivity to specific word choices. Duplicate text filtering removes duplicate text to improve generalization; despite its effectiveness, it may also discard useful data. Data sanitization removes sensitive information from data, protecting privacy at the potential cost of altering semantics. Finally, knowledge expansion methods augment training data with external knowledge sources; they enhance understanding, but their performance depends on the quality and relevance of the added knowledge. While these defense techniques help improve NLP model robustness, increased computational complexity and vulnerability to specific attacks are among their current limitations. Table 2 presents an overview of NLP defenses against adversarial attacks ranging from the char to the sentence level.
Table 2. Review highlights of adversarial defenses in NLP

3.3. Practical guidance
Practitioners can use various techniques to design adversarial attacks that evaluate and stress-test model robustness. These include open-source libraries like nlpaug, TextAttack, Foolbox, and CleverHans that provide algorithms for crafting adversarial examples. One widely used option is to employ genetic algorithm frameworks that evolve adversarial examples through selection-based search. Their standard steps include initialization (generation of initial adversarial examples), evaluation (assessing their effectiveness), selection (choosing high-performing examples), and crossover and mutation (creating new examples); a minimal sketch of this loop follows.
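The sketch below outlines the genetic attack loop just described. The victim_score function is a placeholder for any black-box model query returning the probability assigned to the correct label, and synonym_candidates is an assumed lookup function; neither corresponds to a specific library API.

```python
# Hedged sketch of a genetic-algorithm word-substitution attack:
# initialization, evaluation, selection, crossover, and mutation.
import random

def genetic_attack(sentence, victim_score, synonym_candidates,
                   pop_size=20, generations=10, seed=0):
    rng = random.Random(seed)
    words = sentence.split()

    def mutate(candidate):
        i = rng.randrange(len(candidate))
        options = synonym_candidates(candidate[i])   # assumed synonym lookup
        if options:
            candidate = candidate.copy()
            candidate[i] = rng.choice(options)
        return candidate

    def crossover(a, b):
        return [a[i] if rng.random() < 0.5 else b[i] for i in range(len(a))]

    # Initialization: a population of randomly mutated copies of the input.
    population = [mutate(words) for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation: a lower score on the true label means a stronger attack.
        scored = sorted(population, key=lambda c: victim_score(" ".join(c)))
        # Selection: keep the best half, then refill via crossover and mutation.
        survivors = scored[: pop_size // 2]
        children = [mutate(crossover(rng.choice(survivors), rng.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return " ".join(min(population, key=lambda c: victim_score(" ".join(c))))
```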
In this study, we focused on gradient-based adversarial attacks, following the general framework of model selection, gradient calculation and perturbation generation, and evaluation, using the open-source Python library nlpaug. This data augmentation library offers methods such as synonym replacement, contextual word substitution, and back translation, making it versatile for different NLP tasks, such as sentiment analysis and topic classification. Augmenting text data has been shown to enhance the performance of NLP models, particularly for classification tasks (Bayer et al., Reference Bayer, Kaufhold, Buchhold, Keller, Dallmeyer and Reuter2023). In particular, we generated adversarial attacks on text data by applying random attacks, e.g., adding spelling errors, word splitting, and random perturbations. The relevant code can be accessed on GitHub. Figure 2 displays a practical example of a word substitution (synonym) attack.

Figure 2. Adversarial sentiment analysis example on IMDB dataset (Shaw et al., Reference Shaw, Ansari and Ekin2024).
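The following minimal sketch shows nlpaug-based perturbations of the kind used here (spelling errors, word splitting, synonym substitution, and character swaps); the exact configuration in our released code may differ, and depending on the nlpaug version augment() returns either a string or a list of strings.

```python
# Illustrative nlpaug perturbations; the input sentence is a toy example.
# SynonymAug with WordNet requires the NLTK WordNet data to be downloaded.
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw

text = "The movie was surprisingly good and the acting felt genuine."

spelling_aug = naw.SpellingAug()                 # word-level spelling errors
split_aug = naw.SplitAug()                       # splits words, e.g., 'cat' -> 'c at'
synonym_aug = naw.SynonymAug(aug_src="wordnet")  # WordNet-based synonym substitution
char_aug = nac.RandomCharAug(action="swap")      # random character swaps

for aug in (spelling_aug, split_aug, synonym_aug, char_aug):
    print(type(aug).__name__, "->", aug.augment(text))
```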
Table 3 presents the impact of such adversarial attacks on different combinations of NLP methods for both the IMDB and Twitter datasets. The attacks have varying impacts on accuracy and F1 scores. For instance, accuracy dropped by 30% under attack, illustrating the practical risk to deployed NLP systems.
Table 3. Select empirical results before and after attacks against ensembles of CNN and BiLSTM

In terms of defenses, Table 2 indicates the popularity of adversarial training (AdvT). In our practical implementations (Shaw and Ekin, Reference Shaw and Ekin2024), we have also found that AdvT not only enhances the robustness of the model but also improves generalization to unseen adversarial examples. This offers practitioners a practical pathway to fortify NLP models against increasingly sophisticated attacks, ensuring more secure and reliable performance; the sketch below outlines the basic training loop.
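The following hedged sketch captures the AdvT idea of augmenting each training pass with perturbed copies of the inputs; the attack() function, the model's fit() interface, and the data lists are placeholders rather than the exact implementation used in our experiments.

```python
# Hedged sketch of adversarial training: mix clean and adversarial copies of the
# training data and retrain for several passes.
def adversarial_training(model, train_texts, train_labels, attack, epochs=3):
    for _ in range(epochs):
        adv_texts = [attack(t) for t in train_texts]   # craft adversarial copies
        texts = train_texts + adv_texts                 # combine clean and adversarial inputs
        labels = train_labels + train_labels            # adversarial copies keep their labels
        model.fit(texts, labels)                        # one pass over the augmented data
    return model
```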
Datasets with adversarial examples, e.g., word substitutions and character-level perturbations, are crucial for training and testing the robustness of NLP models. Evaluating models on diverse datasets helps researchers understand their robustness and improve defense strategies and model architectures. Therefore, Tables 1 and 2 list the datasets utilized in the adversarial NLP literature. Several key metrics are used to evaluate the impact of adversarial attacks and defenses in NLP. These metrics help assess how well different attack techniques, defense mechanisms, and models work against adversarial threats. While accuracy quantifies the proportion of correctly classified examples, attack success rate measures the effectiveness of adversarial attacks. Robustness measures a model’s ability to maintain performance when facing adversarial perturbations. Transferability assesses whether adversarial examples crafted for one model can deceive other models, indicating shared vulnerabilities. While the text domain lacks universal benchmarks or datasets comparable to those in the image domain, recent work such as Adversarial GLUE (Wang et al., Reference Wang, Xu, Wang, Gan, Cheng, Gao, Awadallah and Li2021a) and the MITRE ATLAS matrix aims to address this gap. The sketch below illustrates how accuracy and attack success rate can be computed.
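The following minimal sketch computes clean and adversarial accuracy and the attack success rate (the fraction of originally correct predictions flipped by the perturbation); the prediction and label lists are placeholders.

```python
# Illustrative evaluation metrics for adversarial NLP experiments.
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(clean_preds, adv_preds, labels):
    flipped = sum(1 for c, a, y in zip(clean_preds, adv_preds, labels)
                  if c == y and a != y)
    originally_correct = sum(1 for c, y in zip(clean_preds, labels) if c == y)
    return flipped / originally_correct if originally_correct else 0.0

clean_preds = ["pos", "neg", "pos", "pos"]
adv_preds   = ["neg", "neg", "pos", "neg"]
labels      = ["pos", "neg", "pos", "pos"]
print(accuracy(clean_preds, labels),          # clean accuracy
      accuracy(adv_preds, labels),            # accuracy under attack
      attack_success_rate(clean_preds, adv_preds, labels))
```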
4. Policy implications
NLP has revolutionized the way we interact with technology, offering transformative applications in domains such as healthcare (Wong et al., Reference Wong, Plasek, Montecalvo and Zhou2018; Jerfy et al., Reference Jerfy, Selden and Balkrishnan2024; Khattak and Rabbi, Reference Khattak and Rabbi2023) and legal systems (Hovy and Spruit, Reference Hovy and Spruit2016), with profound societal and economic implications. In healthcare, NLP powers applications like clinical text analysis, electronic health record processing, and patient sentiment analysis. Inaccuracies or biases in these systems can lead to incorrect diagnoses or compromised patient care (Schopow et al., Reference Schopow, Osterhoff and Baur2023). Policies should ensure rigorous validation of NLP models used in healthcare and establish guidelines for their deployment. For instance, requiring official approvals for NLP-powered medical devices can help maintain high standards of safety and efficacy. NLP tools are also increasingly employed in legal document analysis, contract review, and case law research. While these applications enhance efficiency, they also raise concerns about accountability and interpretability (Aletras et al., Reference Aletras, Tsarapatsanis, Preoţiuc-Pietro and Lampos2016; Doshi-Velez and Kim, Reference Doshi-Velez and Kim2017; Ariai and Demartini, Reference Ariai and Demartini2024). Governments may establish guidelines to ensure that NLP systems in legal contexts remain interpretable and unbiased. Collaboration with legal professionals can help create standards for ethical usage (Quevedo et al., Reference Quevedo, Cerny, Rodriguez, Rivas, Yero, Sooksatra and Taibi2023).
Adversarial NLP attacks, such as phishing or data poisoning, heighten the cybersecurity risks of these systems (Story et al., Reference Story, Zimmeck, Ravichander, Smullen, Wang, Reidenberg, Cameron Russell and Sadeh2019). These attacks include, but are not limited to, data extraction, bias manipulation, and the monetization of misinformation. For instance, subtle adversarial prompts targeting LLMs integrated into customer service systems could trick models into revealing masked personal data (e.g., partial identification, account information) through carefully crafted follow-up queries. In educational or medical domains, adversarial inputs could distort model outputs to favor certain commercial products, institutions, or treatments, nudging user decisions toward sponsored or malicious ends. Misinformation could be monetized by manipulating an NLP assistant into redirecting users to unofficial, monetized third-party services. Lastly, adversaries may craft synthetic personas or requests to overwhelm public service chatbots, preventing vulnerable users from getting timely help.
These examples emphasize how seemingly minor adversarial manipulations can lead to tangible societal risks and reinforce the necessity of proactive defense mechanisms. The resulting policy implications should be considered for equitable, ethical, and effective deployment. Policies that require regular updates to NLP systems to address emerging threats and that encourage the development of secure architectures may be needed. International coordination on cybersecurity policies, such as information-sharing agreements, is critical to improve global resilience.
4.1. Ethical challenges and governance
The applications of NLP systems may often result in ethical challenges related to fairness and privacy. One of the most pressing challenges in NLP is the presence of bias in training data. Models trained on biased data can inadvertently replicate and amplify discriminatory patterns. For instance, sentiment analysis systems may misclassify language from minority dialects, and automated hiring tools can exhibit gender or racial biases. Governments and organizations should possibly create standards for bias detection and mitigation. Policies could advocate for transparency in dataset composition and require regular audits of NLP models to evaluate fairness. Robust governance frameworks can help ensure that NLP technologies do not perpetuate or exacerbate societal inequities. Frameworks such as the European Union’s AI Act, which emphasizes accountability and fairness, serve as important precedents for global adoption.
Moreover, privacy is a critical consideration in NLP applications, especially in domains like healthcare or legal services, where sensitive information is processed. LLMs trained on vast amounts of data may inadvertently memorize and reproduce private or confidential information. Hence, policy interventions may be essential to safeguard user data and enforce compliance with privacy regulations like the GDPR. Adopting differential privacy techniques during model training and ensuring end-to-end encryption for NLP-powered systems can help address these concerns.
These ethical challenges become even more crucial in adversarial environments, and policymakers need to be aware of the implications. Accountability mechanisms, which are crucial to prevent the misuse of NLP technologies, should consider the potential manipulation of data and models. Organizations can increase transparency, document the decision-making processes of their models, and provide avenues for recourse in cases of harm caused by algorithmic decisions.
4.2. Policy and regulatory needs for adversarial robustness
The threat posed by adversarial attacks to the reliability and security of NLP systems creates a potential need for regulatory frameworks. Policies that encourage secure model development are especially crucial for high-stakes applications. Some of these challenges can be addressed by embedding safeguard regulations in the development lifecycle of NLP systems. Similar to required cybersecurity certifications (Alawida et al., Reference Alawida, Mejri, Mehmood, Chikhaoui and Isaac Abiodun2023), NLP models could undergo adversarial stress tests to confirm their resilience against attacks on data and models (Singh et al., Reference Singh, Grover and Kumar2022). Governments and public agencies can incentivize the development of more robust NLP technologies by funding research and offering tax incentives to companies that prioritize security. Public-private partnerships can also facilitate research into more effective techniques to secure NLP systems. For instance, frameworks could mandate transparency in how models handle adversarial examples, ensuring public trust and accountability (Al-Maliki et al., Reference Al-Maliki, Qayyum, Ali, Abdallah, Qadir, Hoang, Niyato and Al-Fuqaha2024).
Adversarial threats transcend borders, particularly when NLP models are deployed in multilingual or cross-cultural contexts (e.g., global customer service chatbots or translation systems). Such a global nature of NLP development and deployment may require international collaboration on policy frameworks and standardization. Differences in data privacy laws, ethical standards, and regulatory requirements between countries can hinder progress and create compliance challenges for organizations. Efforts to harmonize data privacy regulations, such as aligning GDPR with U.S. frameworks like the California Consumer Privacy Act, can streamline compliance for NLP applications operating across jurisdictions. In addition, policies promoting open data standards and interoperability can facilitate cross-border collaboration in NLP research. For example, shared datasets for adversarial training and multilingual NLP can help advance the field while adhering to ethical standards. As nations become more protective of their digital resources, policies should balance the need for digital sovereignty with the benefits of international collaboration. Agreements on data sharing and joint research initiatives can ensure mutual benefits while respecting national interests.
As LLMs integrate Bayesian techniques for probabilistic reasoning, regulatory guidelines should ensure that these models maintain ethical standards, transparency, and responsible AI governance. Encouraging open-source Bayesian NLP frameworks and federated learning approaches could further enhance privacy-preserving AI applications while maintaining model robustness and scalability. Future policies should focus on balancing innovation and regulation, ensuring that Bayesian NLP methods contribute to ethical, fair, and secure AI systems.
5. Emerging areas and future directions
The increasing popularity of NLP applications and the extent of adversarial attacks emphasize the motivation for adversarial NLP frameworks with increasing policy implications. This section provides an overview of the several emerging areas and potential future directions.
The integrity, reliability, and robustness of responsible NLP-based frameworks, in both development and deployment, are fundamental areas of interest. Adversarial robustness frameworks are being developed to evaluate and improve NLP model robustness. In addition to efforts to scale AdvT, emerging defense methods include functional improvement, which enhances the model’s architecture, and certification, which provides formal guarantees of a model’s robustness against specific types of adversarial attacks. For instance, refining word embeddings or incorporating attention mechanisms can make a model more robust by improving its ability to discern and mitigate adversarial perturbations. By employing methods such as randomized smoothing or robust optimization, certification techniques ensure that the model’s predictions remain stable within predefined bounds, offering practitioners a verifiable level of security; a simple smoothing-style sketch follows this paragraph. Ethical guidelines focusing on transparency, fairness, and accountability are also being incorporated while reducing the impact of adversarial attacks. With respect to model interpretability, techniques such as attention visualization and saliency maps improve understanding of model behavior and help identify vulnerabilities. In terms of tools, more adversarial attack detection and robustness testing frameworks are being made public (Bird et al., Reference Bird, Day, Garofolo, Henderson, Laprun and Liberman2000; Wymberry and Jahankhani, Reference Wymberry and Jahankhani2024). These efforts aim to enhance the security, reliability, and trustworthiness of NLP systems, mitigating the risks posed by adversarial attacks. Increasing academia-industry collaboration presents opportunities in terms of research partnerships and data-sharing initiatives.
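The following hedged sketch illustrates the voting idea behind randomized-smoothing-style defenses: classify several randomly perturbed copies of an input and return the majority label. The perturb() and classify() functions are placeholders; certified defenses add formal guarantees on top of this scheme.

```python
# Illustrative smoothing-style prediction by majority vote over perturbed copies.
from collections import Counter
import random

def smoothed_predict(text, classify, perturb, num_samples=11, seed=0):
    rng = random.Random(seed)
    # Classify several randomly perturbed copies and return the majority label.
    votes = [classify(perturb(text, rng)) for _ in range(num_samples)]
    return Counter(votes).most_common(1)[0][0]
```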
The use of contextual information to generate adversarial attacks is an emerging area of interest, powered by the emergence of LLMs. For instance, semantic attacks such as synonym substitution manipulate the meaning of text inputs without changing their coherence, leading to plausible but incorrect predictions. Methods based on word embeddings and syntax trees, as well as genetic algorithms and reinforcement learning, are used to craft sophisticated adversarial examples that are harder to detect. In particular, black-box attacks have become more popular, given that attackers typically lack knowledge of the target model. This emphasizes the importance of the transferability and generalization of attacks across models and domains. On the defense side, understanding generalization characteristics and developing tailored, context-aware strategies is crucial for countering these attacks. In terms of domain adaptation, robustness against attacks in different real-world scenarios with diverse language characteristics is an emerging area.
5.1. Bayesian methods for adversarial NLP
Bayesian methods are increasingly relevant in adversarial machine learning (AML) due to their robustness and ability to quantify uncertainty (Rios Insua et al., Reference Rios Insua, Naveiro, Gallego and Poulos2023). They can be used to generate attacks as well as to detect and defend against adversarial attacks. Bayesian sequence models can generate text while maintaining a measure of uncertainty, helping to avoid nonsensical or adversarially induced outputs. Similarly, Monte Carlo dropout can be used to approximate Bayesian inference in deep learning while providing uncertainty estimates. High uncertainty in Bayesian predictions may indicate possible adversarial manipulation (Zhao et al., Reference Zhao, Su and Ji2020); a sketch of this idea follows this paragraph. For instance, using Bayesian neural networks, where weights are treated as distributions rather than point estimates, can produce more reliable confidence estimates, making the system more robust to attacks that exploit overconfident but incorrect predictions. By continuously updating the model with new data, Bayesian approaches can also adapt to evolving threats, maintaining the security of the NLP system over time. Bayesian optimization can be used to tune hyperparameters or model architectures to find configurations that are less susceptible to adversarial attacks. In addition, Bayesian methods inherently provide regularization, which can make models more robust to perturbations. Bayesian generative models, such as variational autoencoders (Doersch, Reference Doersch2016), can generate adversarial text samples by sampling from the latent space, providing a range of examples for AdvT. Lastly, Bayesian decision-theoretic models such as adversarial risk analysis (Banks et al., Reference Banks, Gallego, Naveiro and Ríos Insua2022) could be used for AML (Ekin et al., Reference Ekin, Naveiro, Insua and Torres-Barrán2023; Caballero et al., Reference Caballero, Camacho, Ekin and Naveiro2024). These could be beneficial for modeling the interactions among decision makers in adversarial NLP contexts (Shaw, Reference Shaw2025).
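The following hedged sketch shows the Monte Carlo dropout idea for flagging possibly adversarial inputs: keep dropout active at inference, sample several stochastic forward passes, and treat high predictive variance as a warning signal. The model is assumed to be a Hugging Face style PyTorch classifier containing dropout layers; the variance threshold is illustrative.

```python
# Hedged sketch of Monte Carlo dropout for adversarial-input flagging (PyTorch).
import torch

def mc_dropout_flag(model, inputs, num_samples=20, threshold=0.05):
    model.train()  # keep dropout layers stochastic; no weights are updated
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(**inputs).logits, dim=-1)
                             for _ in range(num_samples)])
    mean_probs = probs.mean(dim=0)               # averaged class probabilities
    variance = probs.var(dim=0).mean(dim=-1)     # average predictive variance per input
    return mean_probs, variance > threshold      # flag high-uncertainty (suspect) inputs
```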
Overall, Bayesian methods offer a principled framework for decision-making under uncertainty, an essential aspect of developing robust adversarial defenses. Their ability to incorporate prior knowledge, provide calibrated uncertainty estimates, and apply natural regularization makes them well-suited for identifying and mitigating adversarial inputs. These strengths can guide models toward safer predictions when faced with perturbed or ambiguous data. However, the high computational demands and the implementation complexity have limited their adoption relative to more conventional approaches. As interest in large-scale NLP systems, particularly LLMs, continues to grow, there is a renewed need to reassess these trade-offs and explore the broader application of Bayesian approaches in adversarial NLP settings.
5.2. Emerging policy challenges
The emergence of large-scale NLP models, particularly LLMs, has introduced significant new policy challenges and amplified existing concerns. These challenges span domains, such as national security, public trust, misinformation, access, sustainability, and regulatory oversight.
First, adversarial attacks can significantly undermine the reliability of government-run IT systems that use NLP models. For instance, LLMs deployed in public service portals, e.g., immigration, social benefits, and tax filing systems, could be manipulated into giving incorrect or misleading guidance. An adversarial input might alter eligibility decisions, misroute applications, or subtly redirect users to malicious third-party websites. In addition to affecting operational efficiency, such attacks may also result in privacy breaches, denial of service, or legal liability for government agencies. These risks highlight the urgent need for adversarial robustness in publicly deployed AI systems.
Second, many autonomous systems (e.g., self-driving cars, automated drones) increasingly rely on NLP for tasks like interpreting commands, processing sensor data, or interacting with humans. Adversarial threats could disrupt these systems, leading to potential safety risks. This eventually could diminish public trust in NLP. Potential governmental intervention can help establish minimum robustness and transparency standards, incentivize industry practices like “adversarial audits” and robustness certification, and coordinate international policy to prevent cross-border adversarial attacks.
Third, training and deploying large NLP models consume substantial energy resources, contributing to environmental concerns. Policies can encourage the development of energy-efficient model architectures, sustainable deployment and environmental reporting standards in alignment with global sustainability goals.
Finally, the generative capabilities of large-scale models result in powerful tools for misinformation and propaganda. Governments may support transparency when it comes to detecting and mitigating the spread of (false) information generated by NLP systems. Public awareness campaigns can complement these efforts. This also has consequences on trust in NLP systems. As NLP systems become increasingly sophisticated, their accessibility to under-resourced communities and languages remains a concern. Policies can support initiatives to develop NLP tools for low-resource languages and to enhance equitable access to these technologies.
These implications emphasize the importance of detection mechanisms and of using them for public benefit while keeping systems accessible. Policymakers face the complex task of balancing technical defenses and accessibility with legislative oversight. Ensuring secure model development is essential, especially when models are used in sensitive domains like legal aid, medical triage, or education. Prevention and early detection require investment in both tooling and organizational workflows, including training for civil servants to identify anomalies. Legislation might also be necessary, especially in creating (international) standards for responsible deployment, redressal mechanisms for affected users, and penalties for malicious adversarial input crafting. International collaboration and proactive regulation may also help navigate the complexities of NLP systems. Establishing a globally shared knowledge hub that gathers and publishes real-time adversarial threat intelligence for NLP systems could be beneficial. Such a hub could help standardize adversarial robustness practices and support responses to adversarial incidents or failures (e.g., LLM misuse, model degradation under attack), in the spirit of cybersecurity breach notification laws. Ultimately, an integrated approach managing technical, procedural, and legal aspects is likely to be most effective for NLP governance.
6. Conclusion
This critical review examines adversarial attacks and defenses in NLP, describing the challenges and policy implications and outlining potential future directions. The rapidly evolving landscape of NLP, driven by sophisticated machine-learning models, has simultaneously seen an increase in the complexity and efficacy of adversarial attacks. These attacks exploit vulnerabilities in NLP systems, leading to significant concerns regarding their reliability and security. The review highlights some of the attacks and defenses while providing practical guidance and coverage of emerging techniques such as Bayesian methods.
Our review identifies several key challenges in addressing adversarial threats. The diversity of attack methods underscores the complexity of developing robust defenses. Moreover, the trade-off between model accuracy and robustness remains a critical issue, with many defensive strategies potentially degrading model performance. Another major challenge is the lack of standardized evaluation metrics and benchmarks, making it difficult to comprehensively assess and compare the effectiveness of different defensive techniques. Looking ahead, we have identified several future directions as critical for advancing the field. There is a pressing need to develop more resilient NLP models that perform well under ideal conditions while remaining reliable when challenged by adversarial attacks. This includes research into hybrid models that combine multiple defense strategies to cover a broader range of attack vectors. Another promising area is the integration of human-in-the-loop approaches, where human expertise is leveraged to detect and mitigate adversarial threats in real time.
The increasing reliance on NLP systems introduces critical policy challenges that need to be addressed for equitable and secure deployment. Adversarial attacks highlight the need for robust regulatory frameworks. Policymakers may focus on enforcing standards for adversarial testing to improve resilience against such attacks before deployment. International coordination can help create cross-border agreements addressing the global nature of adversarial threats, promoting a collective defense. Furthermore, initiatives to incentivize private organizations and research institutions to develop secure NLP systems, such as through grants or tax benefits, can accelerate innovation in this space. These measures are pivotal for maintaining trust in NLP technologies, especially in high-stakes sensitive domains like healthcare, legal systems, and cybersecurity. As NLP continues to reshape industries and societies, a strong policy foundation will be essential to maximize its benefits while mitigating its risks.
While major progress has been made in understanding and mitigating adversarial attacks in NLP, the field remains in its nascent stages both in terms of methodological developments and policy guidelines. Continued interdisciplinary research, combining insights from ML, cybersecurity, policy and linguistics, will be essential in developing robust NLP systems capable of withstanding adversarial challenges. AI risk management and trustworthy AI frameworks may benefit from consideration of Bayesian methods. Bayesian decision-making offers a powerful approach for context-aware risk mitigation by managing uncertainty and integrating expert knowledge through priors. These frameworks apply broadly across NLP methods, from topic models to large language models (LLMs). However, balancing model complexity, interpretability, computational efficiency, and policy implications remains an active and important area of research.
Abbreviations
- AI: Artificial intelligence
- BERT: Bidirectional encoder representations from transformers
- CNN: Convolutional neural network
- GenAI: Generative AI
- GPT: Generative pre-trained transformer
- LLMs: Large language models
- NLP: Natural language processing
- RNN: Recurrent neural network
Data availability statement
The data that support the findings of this study are open-source and made available in Shaw (Reference Shaw2025).
Author contribution
Conceptualization: LS and TE. Methodology: LS and TE. Formal analysis and investigation: LS, MWA and TE. Writing - original draft: LS and TE. Writing - review and editing: TE. Funding acquisition: TE. Resources: TE. Supervision: LS and TE. All authors approved the final submitted draft.
Funding statement
This work was supported by the National Science Foundation under research grant 2334268 and the Air Force Office of Scientific Research awards FA-9550-21-1-0239 and FA8655-21-1-7042. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests
The authors declare none.