In numerous posts, we have discussed synthetic populations and their use in agent-based modeling, but many other modeling styles also utilize synthetic populations. In our own work we often spend a significant amount of time creating such synthetic populations, especially those grounded in data, because of the effort needed to collect and preprocess the data and to generate the final synthetic population. To alleviate this, we (Na (Richard) Jiang, Fuzhen Yin, Boyu Wang and myself) have a new paper published in Scientific Data, entitled "A Large-Scale Geographically Explicit Synthetic Population with Social Networks for the United States." Our aim in this paper is to build and provide a geographically explicit synthetic population, along with its social networks, using open data (including data from the latest 2020 U.S. Census) that can be used in a variety of geo-simulation models.
Summary of the Resulting Datasets.
Specifically, in the paper we outline how we created a synthetic population of 330,526,186 individuals representing America's 50 states and Washington D.C. Each individual has a set of geographical locations that represent their home, work or school addresses. Additionally, these individuals are not isolated: they are embedded in a larger social setting based on their household, working and studying relationships (i.e., social networks).
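To make the idea of individuals embedded in household, work and school networks more concrete, here is a minimal Python sketch. It is not the code from the paper: the records, the field names (`id`, `household`, `work`, `school`) and the `build_network` helper are all hypothetical, and linking every pair of people who share a place is a simplifying assumption that may differ from how the published networks were generated.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: the field names are illustrative only and are
# not the schema of the published datasets.
people = [
    {"id": 1, "household": "h1", "work": "w1", "school": None},
    {"id": 2, "household": "h1", "work": "w2", "school": None},
    {"id": 3, "household": "h1", "work": None, "school": "s1"},
    {"id": 4, "household": "h2", "work": "w1", "school": None},
    {"id": 5, "household": "h2", "work": None, "school": "s1"},
]


def build_network(records, key):
    """Link every pair of people who share the same non-null value for `key`.

    Assumption: each shared place forms a fully connected group.
    """
    groups = defaultdict(list)
    for person in records:
        if person[key] is not None:
            groups[person[key]].append(person["id"])

    edges = set()
    for members in groups.values():
        # Each unordered pair within a group becomes one undirected edge.
        edges.update(combinations(sorted(members), 2))
    return edges


# One edge set per relationship type.
networks = {key: build_network(people, key) for key in ("household", "work", "school")}

for name, edges in networks.items():
    print(name, sorted(edges))
```

In this toy example, person 1 is tied to persons 2 and 3 through their shared household and to person 4 through a shared workplace, which is the sense in which one individual sits in several overlapping networks at once.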
The work (e.g., data collection, data preprocessing and generation processes) was coded using Python 3.12, and all the scripts used are available at: https://github.com/njiang8/geo-synthetic-pop-usa. The resulting datasets (85 GB uncompressed) are available at OSF: https://osf.io/fpnc2/.
To give you a sense of the paper, below we provide its abstract, along with some results and our efforts to validate the synthetic population. The full reference and a link to the paper can be found at the bottom of the post.
Abstract:
Within the geo-simulation research domain, micro-simulation and agent-based modeling often require the creation of synthetic populations. Creating such data is a time-consuming task and often lacks social networks, which are crucial for studying human interactions (e.g., disease spread, disaster response) while at the same time impacting decision-making. We address these challenges by introducing a Python based method that uses the open data including that from 2020 U.S. Census data to generate a large-scale realistic geographically explicit synthetic population for America's 50 states and Washington D.C. along with the stylized social networks (e.g., home, work and schools). The resulting synthetic population can be utilized within various geo-simulation approaches (e.g., agent-based modeling), exploring the emergence of complex phenomena through human interactions and further fostering the study of urban digital twins.
Keywords: Synthetic Population, U.S. Census 2020, Agent-Based Modeling, Geo-Simulation, Social Networks.
Data Generation Workflow and Resulting Datasets.
A Sample of the Social Networks for One Household and their Home, Work and Educational Social Networks from the Generated Data.
Sample of Generated Social Networks Extracted from the City of Buffalo, New York: (a) Household; (b) Work; (c) School; (d) Daycare.
Validation of the Synthetic Population at Different Levels: (a) Population across 18 Different Age Groups; (b) Households across Different Household Types.
Full Reference:
Jiang, N., Yin, F., Wang, B. and Crooks, A.T. (2024), A Large-Scale Geographically Explicit Synthetic Population with Social Networks for the United States, Scientific Data, 11, 1204. https://doi.org/10.1038/s41597-024-03970-1