October 14 | Hansheng Wang: Subsampling and Jackknifing: A Practical Solution for Large Data Analysis with Limited Computational Resources

Posted: 2020-10-12

Time: Wednesday, October 14, 2020, 10:00-11:00 a.m.

Title: Subsampling and Jackknifing: A Practical Solution for Large Data Analysis with Limited Computational Resources

Speaker: Hansheng Wang, Professor, Guanghua School of Management, Peking University

Tencent Meeting ID: 764 890 264

Livestream URL: https://meeting.tencent.com/l/z9g4k2I9giL4 (the livestream can also be joined by scanning the QR code)


Abstract:

Modern statistical analysis often involves very large datasets. Conventional estimation methods can rarely be applied to such datasets directly, because practitioners typically have limited computational resources; in most cases they lack access to powerful distributed computing systems (e.g., Hadoop or Spark). How to analyze large datasets in practice with limited computational resources is therefore a problem of great importance. To solve it, we propose a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample as if it were the population. Multiple subsamples of greatly reduced size are then drawn by simple random sampling with replacement. Notably, we do not recommend sampling without replacement, because it would incur a significant cost for data processing on the hard drive; no such cost arises if the data are processed in memory. Because the subsamples are relatively small, each can be read into computer memory as a whole and processed easily. From each subsample, a jackknife-debiased estimator of the target parameter is obtained. The resulting estimators are statistically consistent, with extremely small bias. Finally, the jackknife-debiased estimators from the different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal, and that its asymptotic statistical efficiency can be as good as that of the whole-sample estimator under very mild conditions. The proposed method is simple enough to be implemented on most practical computer systems and should therefore have very wide applicability.
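The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation, only one reading of the steps: draw several small subsamples with replacement, compute a jackknife-debiased estimate on each, and average the results. The function names, the plug-in variance used as the target estimator, the subsample size `n_sub`, and the number of subsamples `n_rep` are all assumptions chosen for illustration. A simulated in-memory array stands in for the large dataset, which in the setting of the talk would sit on the hard drive.

```python
import numpy as np

def jackknife_debiased(sample, estimator):
    """Classical jackknife bias correction: n*theta - (n-1)*mean(leave-one-out)."""
    n = len(sample)
    theta_full = estimator(sample)
    # Re-estimate with each observation left out in turn.
    loo = np.array([estimator(np.delete(sample, i)) for i in range(n)])
    return n * theta_full - (n - 1) * loo.mean()

def subsample_jackknife(data, estimator, n_sub, n_rep, rng):
    """Average jackknife-debiased estimates over n_rep subsamples of size n_sub."""
    estimates = []
    for _ in range(n_rep):
        # Simple random sampling WITH replacement from the whole sample,
        # which is treated as if it were the population.
        idx = rng.integers(0, len(data), size=n_sub)
        estimates.append(jackknife_debiased(data[idx], estimator))
    return float(np.mean(estimates))

def plugin_variance(x):
    """Plug-in variance (divisor n), a simple estimator with O(1/n) bias."""
    return float(np.mean((x - np.mean(x)) ** 2))

# Hypothetical "large" dataset, simulated here for illustration only.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=100_000)

estimate = subsample_jackknife(data, plugin_variance, n_sub=200, n_rep=50, rng=rng)
print(estimate)  # should be close to the whole-sample variance of `data`
```

The plug-in variance is used because its bias on a small subsample is easy to see, and the jackknife correction removes it exactly in this case. Any other estimator that maps a one-dimensional array to a number could be passed in its place.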


Speaker Biography:

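Hansheng Wang is Chair of the Department of Business Statistics and Econometrics at the Guanghua School of Management, Peking University. He has held the titles of Jiamao Distinguished Professor (嘉茂荣聘教授, 2014-2015) and Lantian Huanbao Chair Professor (蓝天环保讲席教授, 2015-2016). He is a doctoral supervisor, Director of the Peking University Business Intelligence Research Center, founder of the WeChat public account "狗熊会", a Fellow of the American Statistical Association (2014), and a recipient of the National Science Fund for Distinguished Young Scholars (2016). He received his bachelor's degree from the Department of Probability and Statistics, School of Mathematical Sciences, Peking University, in 1998, and his Ph.D. from the Department of Statistics, University of Wisconsin-Madison, in 2001. He joined the Guanghua School of Management in 2003 and has been there since. He has published more than seventy articles in leading domestic and international academic journals and has authored or co-authored one monograph in Chinese and one in English. He has served as an associate editor of the following international journals: AOS, JASA, Sinica, JBES, CSDA, and SII. His main theoretical research interests are high-dimensional data analysis, variable selection, dimension reduction, extreme value theory, and semiparametric models. His main applied research interests are search engine marketing and social networks.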

