Event Date:
Monday, November 11, 2024 - 1:00pm to 2:00pm
Event Location:
- Sobel lecture room (South Hall 5607 F)
Event Price:
FREE
Event Contact:
Prof. Xiaotong Shen
University of Minnesota
- Department Seminar
Abstract: Synthetic data generation heralds a paradigm shift in data science, addressing the challenges of data scarcity and privacy and enabling unprecedented performance. As synthetic data gains prominence, questions arise about how accurate statistical methods are when applied to synthetic data rather than to raw data alone. To address this, we introduce the Synthetic Data Generation for Analytics framework, which applies statistical methods to high-fidelity synthetic data produced by advanced generative models, such as tabular diffusion models, through knowledge transfer. These models, trained on raw data, are enriched with insights from relevant studies. A significant finding within this framework is the generational effect: the error of a statistical method initially decreases as synthetic data is integrated but may subsequently increase. This phenomenon, rooted in the complexities of replicating raw data distributions, introduces the "reflection point," an optimal threshold of synthetic data defined by specific error metrics. Through one data example, we demonstrate the effectiveness of this framework.
This work is joint with Yifei Liu and Rex Shen.
Speaker's bio: Xiaotong T. Shen is the John Black Johnston Distinguished Professor in the College of Liberal Arts at the University of Minnesota. His areas of interest include machine learning and data science, high-dimensional inference, non/semi-parametric inference, causal relations, graphical models, explainable machine intelligence (MI), personalization, recommender systems, natural language processing, generative modeling, and nonconvex minimization. His current research is devoted to further developing causal and constrained inference, generative inference and prediction for black-box learners, as well as diffusion, normalizing flows, and summarization. The targeted application areas are biomedical sciences, artificial intelligence, and engineering.
October 16, 2024 - 8:57am