SGSA Student Seminar Fall 2012
Friday 4:10 - 5 PM, 1011 Evans
The Statistics Graduate Students Association (SGSA) holds a seminar every Friday, where students present their
research to their peers. Three of our seminars are dedicated to productivity topics, such as using Git, LaTeX, and programming. The events are sponsored by the Graduate Assembly.
Title and Abstracts
Under the law on the sustainable management of radioactive waste, some nuclear plants must invest in order to hedge their future expenses. We are therefore interested in the problem of conditional asset liability management, which makes it possible to calculate the provision needed to cover those expenses. We look for the best provision when the constraint must be satisfied with a given probability (quantile hedging). We show that this problem can be converted into a stochastic target problem. We then derive the dynamic programming equation together with the associated boundary conditions. Finally, we propose an analytic method (a duality approach) as well as some numerical schemes to solve the PDEs, and analyze the performance of these approaches.
Domination by product measures is a machinery for dealing with
local dependencies in many cases, which allows one to transfer to a
probability space in which everything is independent. This is a very strong
generalization of the celebrated "Lovász Local Lemma" (which is often
used in combinatorics and C.S. theory). We will see some applications to
particle systems, percolation processes and statistical mechanics models
(as time permits).
Git "is a distributed revision control and source code
management (SCM) system". We'll learn how to
set up a git repository from scratch and get started using version
control from the command line. We'll also learn about the new git
repository space in statistics. For best results, come with your
laptop and have git installed (e.g. for Mac OS X:
http://code.google.com/p/git-osx-installer). Or at least come with
your laptop and the ability to ssh into the stats computers.
I will survey a number of high-dimensional regression problems, emphasizing examples from my own research. I will pay special attention to the structural assumptions made on the explanatory variables (interchangeably the design matrix or predictor variables), which drive the statistical behavior of the estimators proposed for these problems. Without attempting to provide any sort of coherent or unifying conclusions, my goal is to drive home the point that the structure of the predictors plays a crucial role in these problems. Finally, I speculate on an open question: do real high-dimensional data conform to these assumptions?
Statisticians need data. One way to get some is by scraping the web. I'll
give a basic tutorial on how to scrape data from a web site, and then show
you how to store and query the data in a SQL database.
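As a rough illustration of the scrape-store-query workflow described above, here is a minimal Python sketch using only the standard library. The HTML snippet, the station names and values, and the `readings` table are all made up for illustration; a real page would be fetched over the network (for example with `urllib.request`), and this is not necessarily the approach used in the talk.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content with made-up values; a real scraper would
# download this text from a web site instead of hard-coding it.
SAMPLE_HTML = """
<table>
  <tr><td>Station A</td><td>61.5</td></tr>
  <tr><td>Station B</td><td>63.0</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)

# Store the scraped rows in an in-memory SQLite database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(station, float(value)) for station, value in scraper.rows],
)

# Query: which stations reported a value above 62?
high = conn.execute(
    "SELECT station FROM readings WHERE value > 62 ORDER BY station"
).fetchall()
print(high)
```

Using `:memory:` keeps the sketch self-contained; passing a file name to `sqlite3.connect` would persist the data between sessions.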
In a recent paper titled "Functional Data Analysis of Generalized Quantile Regressions", Guo et al. studied an analogue of FPCA for quantiles and expectiles. Typical applications involve functional data (such as time series, spatial data, or more specifically, weather data), where one is interested in estimating and predicting extreme values over time, such as the 99th percentile of rainfall, or the 90th percentile house price in California over the last five years. In the classical setting, FPCA is computationally efficient, easy to interpret, and has nice mathematical properties.
In this project, I study the abstract statistical problems arising from the work of Guo et al. In other words, I ask when and how one can find "the" best $k$-dimensional subspace that approximates $n$ observed curves in an asymmetric norm instead of the $\ell_2$-norm. Preliminary results show that there are multiple ways to generalize FPCA to handle quantiles and expectiles, one of which is the approach in Guo et al. Given the applicability of this method, it is a highly interesting statistical problem to study all possible generalizations of FPCA to quantiles and apply them to real-life data, such as the prediction of extreme weather events.
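To make the idea of an asymmetric norm concrete, the sketch below computes a scalar expectile, i.e. the point that minimizes an asymmetrically weighted squared loss. This is only the simplest one-dimensional case, not the subspace method of Guo et al. or of this project; the function name `expectile` and the iteratively reweighted mean used to solve it are choices made here for illustration.

```python
def expectile(values, tau, tol=1e-10, max_iter=1000):
    """Return the tau-expectile of `values` for 0 < tau < 1.

    The tau-expectile minimizes sum_i w_i * (x_i - m)**2 over m, where
    w_i = tau if x_i > m and w_i = 1 - tau otherwise. It is found here
    by repeatedly recomputing a weighted mean until it stops moving.
    """
    m = sum(values) / len(values)  # start from the ordinary mean
    for _ in range(max_iter):
        weights = [tau if x > m else 1.0 - tau for x in values]
        new_m = sum(w * x for w, x in zip(weights, values)) / sum(weights)
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m

sample = [1.0, 2.0, 3.0, 10.0]
middle = expectile(sample, 0.5)  # symmetric weights: the ordinary mean
upper = expectile(sample, 0.9)   # penalizes under-estimation more heavily
print(middle, upper)
```

With `tau = 0.5` the loss is the usual squared error, so the result is the sample mean; larger `tau` pulls the estimate toward the upper tail.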
Crowdsourcing is an effective tool for human-powered computation on many tasks that are very challenging for computers, and it is becoming increasingly popular.
In this talk, I will discuss conditions under which the error rate can be bounded with high probability under the one-coin and two-coin models of crowdsourcing. We then verify the theory on both synthetic data and some public datasets.
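For readers unfamiliar with the setting, here is a small simulation sketch of the one-coin model as it is commonly described: each worker labels a binary task correctly with a single fixed probability, regardless of the true label. (In the two-coin model a worker would instead have one such probability per true class.) Aggregating by majority vote is an assumption made here for illustration, and the function name and parameters are hypothetical; the talk's own analysis may use a different estimator.

```python
import random

def simulate_majority_vote(accuracies, n_tasks, seed=0):
    """Empirical error rate of majority voting under a one-coin model.

    accuracies[i] is the probability that worker i labels a binary task
    correctly. Use an odd number of workers so that votes never tie.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_tasks):
        truth = rng.choice([0, 1])
        # Each worker reports the truth with probability p, else flips it.
        votes = [truth if rng.random() < p else 1 - truth for p in accuracies]
        guess = 1 if 2 * sum(votes) > len(votes) else 0
        errors += guess != truth
    return errors / n_tasks

# Five hypothetical workers who are each right 70% of the time.
rate = simulate_majority_vote([0.7] * 5, n_tasks=2000)
print(rate)
```

Varying the accuracies or the number of workers gives a quick empirical feel for how the aggregated error rate behaves before turning to theoretical bounds.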
The Coalescent with Recombination, Migration, and Population Growth is
a population genetics model for the genealogy of a sample under
general demographic scenarios. While simulating under this model is
easy, computing likelihoods for parameter inference is hard. I discuss
the use of approximate likelihoods for population genetic inference.
In particular, I discuss current work on generalizing approximate
conditional likelihoods to handle complex demographic scenarios, and
on applications of these conditional likelihoods to estimating the
genealogy and the ancestral alleles of a sample.
Many of the most widely used methods in machine learning are based
on the principle that a collection of weak learners can be
aggregated to form a single strong learner. This set of approaches
is commonly referred to as "ensemble methods", and includes
random forests, boosting, and bagging as well-known examples. A
fundamental issue that determines both the statistical performance
and computational cost of ensemble methods is the choice of the
number of weak learners. Despite intense study over the last 10 to
15 years, there has been little theoretical understanding of how
performance grows as a function of the number of weak learners. In
particular, one would like to know how quickly the error rate
err_n of an ensemble of n classifiers converges to its limiting
value e*. In this talk, I will show that for many ensembles,
the rate of convergence is given by err_n - e* = c/n + o(1/n),
where the constant c has an exact formula, and can be estimated from data.
As a consequence, this result offers a principled and data-driven way to
choose the number of classifiers.
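As a rough sketch of how a rate of the form err_n - e* = c/n + o(1/n) could guide the choice of n, the code below fits e* and c by ordinary least squares of observed error rates on 1/n, then picks the smallest n for which c/n falls below a tolerance. This regression is a stand-in chosen for illustration, not the exact formula for c mentioned in the abstract, and the input error rates are synthetic values generated from the model itself.

```python
import math

def fit_inverse_n(ns, errs):
    """Least-squares fit of errs ~ e_star + c / n; returns (e_star, c)."""
    xs = [1.0 / n for n in ns]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(errs) / len(errs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, errs))
    c = sxy / sxx
    return y_bar - c * x_bar, c

def ensemble_size(c, tol):
    """Smallest n with c / n <= tol, assuming c > 0 and tol > 0."""
    return max(1, math.ceil(c / tol))

# Synthetic error rates generated exactly from err_n = 0.1 + 0.5 / n.
ns = [10, 20, 50, 100, 200]
errs = [0.1 + 0.5 / n for n in ns]

e_star, c = fit_inverse_n(ns, errs)
print(e_star, c, ensemble_size(c, tol=0.002))
```

On real data the observed error rates would be noisy, so the fitted c carries estimation error and the resulting n should be read as a guideline rather than an exact answer.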