# SGSA Student Seminar Fall 2012

## Friday 4:10 - 5PM, 1011 Evans

### Organizers: Ngoc Tran and Angie Zhu

The Statistics Graduate Students Association (SGSA) holds a weekly seminar every Friday, where students present their research to their peers. Three of our seminars are dedicated to productivity topics, such as Git, LaTeX, and programming. The events are sponsored by the Graduate Assembly.

| Date | Speaker | Title |
|---|---|---|
| September 28 | Wenpin Tang | Asset Liability Management via Stochastic Target Approach |
| October 5 | Productivity seminar discussion, led by Chris Paciorek | Discussion on Computing/Productivity Topics in Demand |
| October 12 | Jonathan Hermon | Domination by product measures and applications |
| October 19 | Tamara Broderick | Basic Version Control via GIT |
| October 26 | Derek Bean | The role of the explanatory variables in high-dimensional regression problems: a theoretical perspective |
| November 2 | Jonathan Terhost | Collecting Data via Web and Intro to SQL (with R code) |
| November 9 | Ngoc Tran | Functional principal component analysis in an asymmetric norm |
| November 16, 3-4PM | Hongwei Li | Labeling by Crowdsourcing |
| November 16, 4-5PM | Jack Kamm | Approximate Conditional Sampling Distributions for Inferring Demography and Local Genealogies |
| November 30 | Miles Lopes | How quickly do ensembles converge? |

### Wenpin Tang: Asset Liability Management via Stochastic Target Approach

Under the law on the sustainable management of radioactive waste, some nuclear plants must invest to hedge their future expenses. We are therefore interested in the problem of conditional asset liability management, which allows one to calculate the provision needed for these future expenses. We look for the best provision when the constraint must be satisfied with a given probability (quantile hedging). We show that this problem can be converted into a stochastic target problem. We then derive the dynamic programming equation together with the associated boundary conditions. Finally, we propose an analytic method (a duality approach) as well as some numerical schemes to solve the PDEs, and analyse the performance of these approaches.

### Jonathan Hermon: Domination by product measures and applications

Domination by product measures is a machinery for dealing with local dependencies, which in many cases allows one to transfer to a probability space in which everything is independent. This is a very strong generalization of the celebrated "Lovász Local Lemma" (which is often used in combinatorics and C.S. theory). We will see some applications to particle systems, percolation processes and statistical mechanics models (as time permits).

### Tamara Broderick: Basic Version Control via GIT

Git "is a distributed revision control and source code management (SCM) system". We'll learn how to set up a git repository from scratch and get started using version control from the command line. We'll also learn about the new git repository space in statistics. For best results, come with your laptop and have git installed (e.g. for Mac OS X: http://code.google.com/p/git-osx-installer). Or at least come with your laptop and the ability to ssh into the stats computers.

### Derek Bean: The role of the explanatory variables in high-dimensional regression problems: a theoretical perspective

I will survey a number of high-dimensional regression problems, emphasizing examples from my own research. I will pay special attention to the structural assumptions made on the explanatory variables (interchangeably the design matrix or predictor variables) which drive the statistical behavior of the estimators proposed for these problems. Without attempting to provide any sort of coherent or unifying conclusions, my goal is to drive home the point that the structure of the predictors plays a crucial role in these problems. Finally, I speculate on an open question: do real high-dimensional data conform to these assumptions?

### Jonathan Terhost: Collecting Data via Web and Intro to SQL

Statisticians need data. One way to get some is by scraping the web. I'll give a basic tutorial on how to scrape data from a web site, and then show you how to store and query the data in a SQL database.
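The scrape-then-store workflow described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the R code from the talk. The HTML snippet, the `items` table, its column names, and all values are made up for the example; a real scraper would first download the page.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content with made-up values; a real scraper would
# download the HTML first (for example with urllib.request).
HTML = """
<table>
  <tr><td>a</td><td>100</td></tr>
  <tr><td>b</td><td>300</td></tr>
  <tr><td>c</td><td>200</td></tr>
</table>
"""


class CellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self.rows[-1].append(data.strip())


def scrape_and_query(html=HTML, min_value=150):
    """Parse the table, load it into an in-memory SQLite database, and query it."""
    parser = CellParser()
    parser.feed(html)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, value INTEGER)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(name, int(value)) for name, value in parser.rows],
    )
    result = conn.execute(
        "SELECT name, value FROM items WHERE value >= ? ORDER BY value DESC",
        (min_value,),
    ).fetchall()
    conn.close()
    return result
```

Calling `scrape_and_query()` returns the rows whose value is at least `min_value`, sorted from largest to smallest. The `?` placeholders keep scraped text out of the SQL string itself, which matters once the data comes from an untrusted page.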

### Ngoc Tran: Functional principal component analysis in an asymmetric norm

In a recent paper titled "Functional Data Analysis of Generalized Quantile Regressions", Guo et al. studied an analogue of FPCA for quantiles and expectiles. Typical applications involve functional data (such as time series, spatial data, or more specifically, weather data), where one is interested in estimating and predicting extreme values over time, such as the 99th percentile of rainfall, or the 90th percentile house price in California over the last five years. In the classical setting, FPCA is computationally efficient, easy to interpret, and has nice mathematical properties.

In this project, I study the abstract statistical problems arising from the work of Guo et al. In other words, I ask when and how one can find "the" best $k$-dimensional subspace that approximates $n$ observed curves in an asymmetric norm instead of the $\ell_2$-norm. Preliminary results show that there are multiple ways to generalize FPCA to handle quantiles and expectiles, one of which is the approach in Guo et al. Given the applicability of this method, it is a highly interesting statistical problem to study all possible generalizations of FPCA to quantiles and apply them to real-life data, such as the prediction of extreme weather events.
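To make "asymmetric norm" concrete, one common choice (assumed here, and not necessarily the exact norm used in the project) is the expectile loss, which weights a squared residual $r$ by $\tau$ when $r \ge 0$ and by $1 - \tau$ otherwise. The sketch below covers only the simplest case, finding the single constant that best approximates a list of numbers in this loss (the $\tau$-expectile), by iteratively reweighted averaging. It is not an implementation of FPCA; the function names and the tolerance are illustrative.

```python
def asymmetric_loss(residuals, tau):
    """Sum of squared residuals, weighted by tau if r >= 0 and 1 - tau otherwise."""
    return sum((tau if r >= 0 else 1 - tau) * r * r for r in residuals)


def expectile(ys, tau, iters=100):
    """Constant m minimizing asymmetric_loss([y - m for y in ys], tau).

    Each step recomputes the weights at the current m and then takes the
    weighted mean, which is the exact minimizer for those fixed weights.
    """
    m = sum(ys) / len(ys)  # start from the ordinary mean (the tau = 0.5 case)
    for _ in range(iters):
        weights = [tau if y >= m else 1 - tau for y in ys]
        m_new = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        if abs(m_new - m) < 1e-12:
            break
        m = m_new
    return m
```

With `tau = 0.5` the weights are all equal and the result is the ordinary mean, which recovers the symmetric $\ell_2$ setting. Values of `tau` closer to 1 pull the fitted constant towards the upper tail of the data.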

### Hongwei Li: Labeling by Crowdsourcing

Crowdsourcing systems are an effective tool for human-powered computation on many tasks that can be very challenging for computers, and they are becoming increasingly popular. In this talk, I will discuss conditions under which the error rate can be bounded with high probability under the one-coin and two-coin models of crowdsourcing. Based on this analysis, we verify our theory on both synthetic data and some public datasets.
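As a rough illustration of what an error-rate bound can look like in the one-coin model, the sketch below assumes that each of $m$ workers independently labels a binary item correctly with the same probability $p > 1/2$, and that labels are aggregated by majority vote. It compares the exact error probability with the standard Hoeffding bound $\exp(-2m(p - 1/2)^2)$. These assumptions and function names are illustrative and are not the conditions or bounds from the talk.

```python
import math


def majority_vote_error(m, p):
    """Exact probability that majority vote over m workers is wrong.

    Each worker is correct independently with probability p. A tie
    (possible only for even m) is counted as an error.
    """
    return sum(
        math.comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(m + 1)
        if 2 * k <= m  # correct votes do not form a strict majority
    )


def hoeffding_bound(m, p):
    """Hoeffding upper bound on the majority-vote error, valid for p > 1/2."""
    return math.exp(-2 * m * (p - 0.5) ** 2)
```

For a fixed $p > 1/2$, both quantities shrink as the number of workers grows, and the bound always sits above the exact error.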

### Jack Kamm: Approximate Conditional Sampling Distributions for Inferring Demography and Local Genealogies

The Coalescent with Recombination, Migration, and Population Growth is a population genetics model for the genealogy of a sample under general demographic scenarios. While simulating under this model is easy, computing likelihoods for parameter inference is hard. I discuss the use of approximate likelihoods for population genetic inference. In particular, I discuss current work on generalizing approximate conditional likelihoods to handle complex demographic scenarios, and on applications of these conditional likelihoods to estimating the genealogy and the ancestral alleles of a sample.

### Miles Lopes: How quickly do ensembles converge?

Many of the most widely used methods in machine learning are based on the principle that a collection of weak learners can be aggregated to form a single strong learner. This set of approaches is commonly referred to as "ensemble methods", and includes random forests, boosting, and bagging as well-known examples. A fundamental issue that determines both the statistical performance and computational cost of ensemble methods is the choice of the number of weak learners. Despite intense study over the last 10 to 15 years, there has been little theoretical understanding of how performance grows as a function of the number of weak learners. In particular, one would like to know how quickly the error rate $\mathrm{err}_n$ of an ensemble of $n$ classifiers converges to its limiting value $e^*$. In this talk, I will show that for many ensembles, the rate of convergence is given by $\mathrm{err}_n - e^* = c/n + o(1/n)$, where the constant $c$ has an exact formula, and can be estimated from data. As a consequence, this result offers a principled and data-driven way to choose the number of classifiers.
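To see how a $c/n$ rate could guide the choice of ensemble size, here is a minimal sketch that ignores the $o(1/n)$ term and fits $c$ and the limiting error from error rates measured at two ensemble sizes. This two-point extrapolation is a simplification for illustration only; it is not the exact formula or the estimator for $c$ from the talk, and the function names are hypothetical.

```python
import math


def fit_rate(n1, err1, n2, err2):
    """Solve err_n = e_star + c / n from two (size, error) measurements."""
    c = (err1 - err2) / (1.0 / n1 - 1.0 / n2)
    e_star = err1 - c / n1
    return c, e_star


def ensemble_size_for(c, tol):
    """Smallest n with c / n <= tol, i.e. within tol of the limiting error."""
    return max(1, math.ceil(c / tol))
```

Given the fitted `c`, `ensemble_size_for(c, tol)` suggests how many classifiers are needed before the excess error over the limit falls below `tol`, under the assumption that the leading $c/n$ term dominates.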