Posted: Nov 29, 2023
The proliferation of big data across scientific, commercial, and social domains places heavy demands on computational systems. Analytical approaches that process entire datasets have become increasingly untenable because of their computational cost, storage requirements, and processing time. This burden has stimulated research into sampling methodologies that provide statistically valid insights from subsets of the data. Conventional sampling techniques, however, typically fix the sample size a priori, which often results in either excessive computational overhead (when the sample is larger than needed) or insufficient statistical power (when it is too small). This research addresses that limitation by introducing a sequential sampling framework that determines the sample size dynamically from real-time statistical convergence metrics. The approach moves from static to adaptive sampling: sampling continues only until predetermined statistical stability criteria are met. It challenges the conventional assumption that sample size must be set in advance and instead posits that sampling should be guided by the statistical properties of the data stream itself. The core innovation is a multi-dimensional convergence monitoring system that simultaneously tracks variance stabilization, distributional consistency, and parameter estimation stability. By integrating these metrics into a unified stopping criterion, the method achieves significant computational savings while maintaining statistical rigor. The approach is particularly valuable where data streams are continuous and computational resources are constrained, such as edge computing, real-time analytics, and resource-limited research settings.
Our research questions are whether sequential sampling can substantially reduce computational complexity without compromising analytical accuracy, how this reduction varies across data domains and analytical tasks, and what statistical guarantees can be provided for the convergence-based stopping criteria. We examine these questions through experiments across three diverse big data domains.
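As an illustration of the kind of convergence-based stopping rule described above, the following minimal Python sketch consumes a data stream and stops once the running mean and variance have stabilized. It is not the authors' implementation: the function name `sequential_sample`, the parameters `batch_size`, `rel_tol`, `patience`, and `max_samples`, and their default values are all assumptions made for this example. It covers only two of the three monitored quantities (parameter estimation stability and variance stabilization); distributional consistency is omitted for brevity.

```python
import math
import random
from typing import Iterable, Tuple


def sequential_sample(
    stream: Iterable[float],
    batch_size: int = 500,
    rel_tol: float = 0.01,
    patience: int = 3,
    max_samples: int = 1_000_000,
) -> Tuple[int, float, float]:
    """Consume `stream` until the running mean and variance stabilize.

    Returns (n_used, mean, sample_variance). All names and defaults here are
    hypothetical choices for illustration, not values from the paper.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")

    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    prev_mean = None
    prev_var = None
    stable_checks = 0

    for x in stream:
        # Welford's online update of mean and squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

        if n % batch_size == 0:
            var = m2 / (n - 1)
            if prev_mean is not None:
                # Mean is stable if it moved by at most rel_tol standard deviations.
                mean_ok = abs(mean - prev_mean) <= rel_tol * math.sqrt(var)
                # Variance is stable if its relative change is at most rel_tol.
                var_ok = abs(var - prev_var) <= rel_tol * prev_var
                stable_checks = stable_checks + 1 if (mean_ok and var_ok) else 0
                if stable_checks >= patience:
                    break  # unified stopping criterion met
            prev_mean, prev_var = mean, var

        if n >= max_samples:
            break  # hard cap so the loop always terminates

    var = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, var


# Usage: a synthetic Gaussian stream stands in for a real data source.
rng = random.Random(0)
synthetic_stream = (rng.gauss(10.0, 2.0) for _ in range(200_000))
n_used, est_mean, est_var = sequential_sample(synthetic_stream)
print(f"stopped after {n_used} samples: mean={est_mean:.3f}, var={est_var:.3f}")
```

Requiring several consecutive stable checks (`patience`) is one simple guard against stopping on a chance coincidence between two batches. A stopping rule of this kind carries no formal guarantee by itself, which is the gap the third research question targets.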