
Analyzing Political Text at Scale with Online Tensor LDA

Published online by Cambridge University Press: 04 December 2025

Sara Kangaslahti
Affiliation:
Department of Computer Science, Harvard University, USA
Danny Ebanks
Affiliation:
Institute for Quantitative Social Science, NVIDIA Corp., USA
Jean Kossaifi
Affiliation:
Johns Hopkins University, USA
Anqi Liu
Affiliation:
Department of Computer Science, California Institute of Technology, USA
R. Michael Alvarez*
Affiliation:
Division of Humanities and Social Science, California Institute of Technology, USA
Animashree Anandkumar
Affiliation:
Division of Engineering and Applied Sciences
*Corresponding author: R. Michael Alvarez; Email: rma@caltech.edu

Abstract

This article proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: (i) we present a topic modeling method, tensor latent Dirichlet allocation (LDA), that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; (ii) we show that this method is computationally and memory efficient (achieving speeds over 3–4 times those of prior parallelized LDA methods) and that it scales linearly to text datasets with over a billion documents; and (iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we perform two new, real-world, large-scale studies of interest to political scientists: the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation, and a detailed study of social media conversations about election fraud in the 2020 presidential election. The method thus gives social scientists the ability to study very large corpora and to answer important, theoretically relevant questions about salient issues in near real time.
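
To make the linear-scaling claim concrete, the following minimal sketch accumulates the first and second word moments used by spectral (tensor) LDA methods in a single pass over document batches. It is not the authors' implementation: the function name `streaming_lda_moments`, the dense NumPy representation, and the choice of `alpha0` (the sum of the Dirichlet concentration parameters) are assumptions made for illustration. The sketch also forms the full vocabulary-by-vocabulary second moment, which a memory-efficient GPU implementation would presumably avoid.

```python
import numpy as np


def streaming_lda_moments(batches, vocab_size, alpha0):
    """One-pass estimate of the LDA moments M1 and M2 (illustrative only).

    batches: iterable of (batch_size, vocab_size) word-count arrays.
    alpha0:  assumed sum of the Dirichlet concentration parameters.
    """
    m1_sum = np.zeros(vocab_size)
    pair_sum = np.zeros((vocab_size, vocab_size))
    n_docs = 0

    for counts in batches:
        counts = np.asarray(counts, dtype=float)
        lengths = counts.sum(axis=1)

        # Word-pair statistics need at least two tokens per document.
        keep = lengths >= 2
        counts, lengths = counts[keep], lengths[keep]

        # First moment: per-document word frequencies.
        m1_sum += (counts / lengths[:, None]).sum(axis=0)

        # Per-document estimate of E[x1 x2^T] over distinct token pairs:
        # (c c^T - diag(c)) / (l * (l - 1)), summed over the batch.
        weighted = counts / (lengths * (lengths - 1.0))[:, None]
        pair_sum += weighted.T @ counts - np.diag(weighted.sum(axis=0))

        n_docs += counts.shape[0]

    m1 = m1_sum / n_docs
    # Second LDA moment: subtract the alpha0-weighted outer product of M1.
    m2 = pair_sum / n_docs - (alpha0 / (alpha0 + 1.0)) * np.outer(m1, m1)
    return m1, m2
```

Because each batch only adds to running sums, the cost grows linearly with the number of documents, and splitting the corpus into batches differently leaves the estimates unchanged up to floating-point error. A full tensor LDA pipeline would go on to whiten the data with the second moment and decompose a third-order moment tensor, steps this sketch omits.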

Information

Type
Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Political Methodology

Footnotes

Edited by: Jeff Gill

Supplementary material

Kangaslahti et al. supplementary material (File, 1.5 MB)

Kangaslahti et al. Dataset (Link)