«Analysis methods of heavy-tailed data»
The duration of the course is approximately 25-30 hours plus 20 hours of exercises.
Heavy-tailed distributions are typical for phenomena in complex multi-component systems arising in biometry, economics, ecology, sociology, Web access statistics and Internet traffic, bibliometrics, finance and business. Typical examples are the Pareto, Weibull (with shape parameter less than 1), Cauchy and Zipf-Mandelbrot laws. The analysis of heavy-tailed distributions requires special estimation methods because of their specific features: the tails decay to zero more slowly than at an exponential rate, Cramér's condition is violated, and observations in the tail domain of the distribution are sparse. Because no information is available beyond the range of the empirical sample, nonparametric estimates rely essentially on the asymptotic distributions of the sample maximum as models of the distribution's behavior at infinity. The course will provide a detailed survey of classical results and some recent developments in the theory of nonparametric estimation of the density, tail index, high quantiles, hazard rate, renewal function and time series, assuming that the data are heavy-tailed. Both asymptotic results, such as convergence rates of the estimates, and results for samples of moderate size, supported by Monte Carlo studies, will be considered. The exposition will be accompanied by numerous illustrations and examples motivated by applications.
- Students of mathematics and statistics, computer science and electrical engineering who are interested in learning about practical applications in the area of heavy-tailed data analysis, and who are looking for new approaches and fundamental results supported by proofs.
- Practitioners who wish to analyze heavy-tailed empirical data and are interested in rough methodology and algorithms for numerical calculations related to the analysis of heavy-tailed data.
The course will assume prior knowledge of probability and basic statistical techniques.
The course will be taught in English.
Introduction: definitions and basic properties of classes of heavy-tailed distributions. Tail index estimation. Methods for the selection of the number of the largest order statistics in Hill's estimator. Rough methods for the detection of heavy tails and the number of finite moments. (2-3 hours)
(Section 1 contains the introduction with necessary definitions, basic properties and examples of heavy-tailed data. The tail index indicates the shape of the tail and therefore it is the basic characteristic of heavy-tailed data. Methods for tail index estimation are presented. Finally, several rough tools for the detection of heavy-tailedness, the dependence and the number of finite moments are considered.)
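As an illustration of the tail index estimation mentioned above, the following is a minimal sketch of Hill's estimator, not the course's own material. The function name `hill_estimator`, the simulated Pareto sample and the values of k are assumptions chosen for illustration. The sketch returns an estimate of gamma = 1/alpha computed from the k largest order statistics, and prints it for several k to show how the result depends on that choice.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of gamma = 1/alpha from the k largest observations."""
    x = sorted(sample, reverse=True)  # x[0] >= x[1] >= ...
    if not 1 <= k < len(x):
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold order statistic must be positive")
    # Mean log-excess of the k largest values over the threshold.
    return sum(math.log(x[i] / threshold) for i in range(k)) / k


# Simulated data (an assumption for illustration): Pareto with tail index 2.
rng = random.Random(0)
alpha = 2.0
sample = [rng.paretovariate(alpha) for _ in range(5000)]

for k in (50, 200, 500, 1000):
    gamma_hat = hill_estimator(sample, k)
    print(f"k={k:5d}  gamma_hat={gamma_hat:.3f}  alpha_hat={1 / gamma_hat:.3f}")
```

For an exact Pareto sample the estimate is fairly stable in k; for real data the choice of k typically matters much more, which is why selection methods for k are part of this lecture.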
Density estimation. Main principles of the estimation. Nonparametric estimation of the densities of light-tailed distributions. Smoothing methods. (2 hours)
(In Section 2 the main principles of density estimation, such as Lebesgue's theorem, Fisher's scheme, the L_1, L_2 and \chi^2 approaches, the exponent method and the estimation of the density as a solution of an ill-posed problem, are considered. The links between these approaches are established. Classical methods of density estimation such as kernel estimators, projection estimators, the histogram and the polygram, and their smoothing tools like cross-validation, the discrepancy method and others are presented.)
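A kernel estimator with a data-driven bandwidth can be sketched as follows. This is only an illustration under stated assumptions: it uses a Gaussian kernel and leave-one-out likelihood cross-validation over a small grid, which is one simple variant of cross-validation and not necessarily the one used in the course. The names `gaussian_kde`, `loo_log_likelihood` and `select_bandwidth`, the grid and the simulated normal sample are all hypothetical.

```python
import math
import random


def gaussian_kde(sample, h):
    """Return the Gaussian kernel density estimate with bandwidth h."""
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return density


def loo_log_likelihood(sample, h):
    """Leave-one-out log-likelihood of the kernel estimate with bandwidth h."""
    n = len(sample)
    norm = 1.0 / ((n - 1) * h * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i, xi in enumerate(sample):
        s = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2)
                for j, xj in enumerate(sample) if j != i)
        total += math.log(max(norm * s, 1e-300))  # guard against log(0)
    return total


def select_bandwidth(sample, grid):
    """Pick the grid bandwidth that maximizes the leave-one-out likelihood."""
    return max(grid, key=lambda h: loo_log_likelihood(sample, h))


# Light-tailed test data (an assumption for illustration): standard normal.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(300)]
grid = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]

h_cv = select_bandwidth(sample, grid)
f_hat = gaussian_kde(sample, h_cv)
print(f"selected h = {h_cv}, f_hat(0) = {f_hat(0.0):.3f}")
```

Likelihood cross-validation is known to behave poorly when the tails are heavy, which is one motivation for the specialized smoothing methods of the following lectures.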
Heavy-tailed density estimation. Combined parametric-nonparametric methods, Barron's estimate and \chi^2-optimality. Kernel estimates with variable bandwidth and their smoothing methods: the integrated squared error cross-validation (ISE), weighted version of squared error cross-validation (WISE), discrepancy method. Re-transformed nonparametric estimates. (2-3 hours)
(In Section 3 the problems of heavy-tailed density estimation are discussed. Three approaches to heavy-tailed density estimation are considered. The first relates to combined parametric-nonparametric methods, where the tail domain of the density is fitted by some parametric model and the main part of the density (the body) is fitted by some nonparametric method like a histogram. A similar approach realized by Barron's estimator is considered. The second approach is devoted to kernel estimates with variable bandwidth. The optimal accuracy of these estimates as well as their disadvantages for heavy-tailed density estimation are discussed. The last approach contains the preliminary transformation of the empirical sample to a new one, whose density is more convenient for restoration.)
Transformation choice: finite and adapted transformations. Re-transformed kernel estimates. Boundary kernels. Accuracy measuring: L_1, L_2 approaches, decay rate at infinity. (2-3 hours)
(In Section 4 specific transformations are presented. The quality of re-transformed kernel estimates with regard to the metrics in spaces L_1, L_2 is considered. To improve the fitting at the tail domain, special boundary kernels are presented.)
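The re-transformation idea can be sketched as follows: map the sample to [0, 1] by a transformation T, estimate the density g of the transformed data, and return to the original scale by f(x) = g(T(x)) T'(x). The sketch assumes the fixed transformation T(x) = (2/pi) arctan(x) for nonnegative data and a Gaussian kernel with reflection at the endpoints as a simple stand-in for boundary kernels. These choices, the bandwidth and the function names are assumptions for illustration and are not necessarily the transformations or kernels presented in the course.

```python
import math
import random


def T(x):
    """Fixed transformation of [0, inf) onto [0, 1) (assumed for illustration)."""
    return (2.0 / math.pi) * math.atan(x)


def T_prime(x):
    """Derivative of T, needed for the change of variables."""
    return (2.0 / math.pi) / (1.0 + x * x)


def reflected_kde(points, h):
    """Gaussian kernel estimate on [0, 1] with reflection at both endpoints."""
    n = len(points)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def kernel(u):
        return math.exp(-0.5 * u * u)

    def density(y):
        if not 0.0 <= y <= 1.0:
            return 0.0
        total = 0.0
        for yi in points:
            total += kernel((y - yi) / h)          # original point
            total += kernel((y + yi) / h)          # reflection at 0
            total += kernel((y - (2.0 - yi)) / h)  # reflection at 1
        return norm * total

    return density


def retransformed_density(sample, h):
    """Estimate f on [0, inf) as g_hat(T(x)) * T'(x)."""
    g_hat = reflected_kde([T(x) for x in sample], h)

    def f_hat(x):
        return g_hat(T(x)) * T_prime(x) if x >= 0 else 0.0

    return f_hat


# Heavy-tailed test data (an assumption): density 1.5 * (1 + x) ** -2.5 on x >= 0.
rng = random.Random(2)
sample = [rng.paretovariate(1.5) - 1.0 for _ in range(2000)]
f_hat = retransformed_density(sample, h=0.05)

for x in (0.5, 1.0, 5.0, 50.0):
    true_f = 1.5 * (1.0 + x) ** -2.5
    print(f"x={x:5.1f}  f_hat={f_hat(x):.5f}  true={true_f:.5f}")
```

Because T'(x) decays like x^-2, the re-transformed estimate inherits a polynomial decay at infinity, whereas a Gaussian kernel estimate on the original scale would decay exponentially fast beyond the largest observation.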
Re-transformed density estimates and Bayesian classification algorithm. Risk of the misclassification. (2 hours)
(In Section 5 the empirical Bayesian classifier constructed by means of re-transformed density estimates is considered. The quality of the classifier is assessed both theoretically and by a Monte Carlo study.)
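The plug-in principle behind an empirical Bayesian classifier can be sketched in a few lines: estimate each class density, weight it by the empirical class prior, and assign an observation to the class with the largest product. For brevity the sketch uses an ordinary Gaussian kernel estimate with a fixed bandwidth instead of the re-transformed estimates of the course; the names, the bandwidth and the simulated two-class data are assumptions for illustration.

```python
import math
import random


def gaussian_kde(sample, h):
    """Return the Gaussian kernel density estimate with bandwidth h."""
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return density


def make_classifier(samples_by_class, h):
    """Return classify(x): the label maximizing prior * density estimate."""
    n_total = sum(len(s) for s in samples_by_class.values())
    priors = {c: len(s) / n_total for c, s in samples_by_class.items()}
    densities = {c: gaussian_kde(s, h) for c, s in samples_by_class.items()}

    def classify(x):
        return max(priors, key=lambda c: priors[c] * densities[c](x))

    return classify


# Two simulated classes (an assumption for illustration).
rng = random.Random(3)
training = {
    "A": [rng.gauss(-2.0, 1.0) for _ in range(200)],
    "B": [rng.gauss(2.0, 1.0) for _ in range(200)],
}
classify = make_classifier(training, h=0.4)

# Monte Carlo estimate of the misclassification risk on fresh data.
test_points = [(rng.gauss(-2.0, 1.0), "A") for _ in range(200)]
test_points += [(rng.gauss(2.0, 1.0), "B") for _ in range(200)]
errors = sum(1 for x, label in test_points if classify(x) != label)
error_rate = errors / len(test_points)
print(f"estimated misclassification rate: {error_rate:.3f}")
```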
Estimation of high quantiles, endpoints, excess functions. (2-3 hours)
(In Section 6 several classical methods for quantile estimation are considered. The methods of estimating high quantiles, endpoints, excess functions for heavy-tailed distributions are presented. An application to WWW-traffic data is considered.)
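One classical way to estimate a quantile beyond the range where the empirical distribution is reliable is Weissman's estimator, which extrapolates from an intermediate order statistic using an estimate of the tail index. The sketch below is an illustration under assumptions: it uses the Hill estimate of gamma, a simulated Pareto sample, and hypothetical values of k and p. Here p denotes the exceedance probability, so the target is the (1 - p)-quantile.

```python
import math
import random


def weissman_quantile(sample, k, p):
    """Weissman-type estimate of the (1 - p)-quantile for small p."""
    x = sorted(sample, reverse=True)  # x[0] >= x[1] >= ...
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    if not 0.0 < p < k / n:
        raise ValueError("p must lie in (0, k/n) for extrapolation")
    threshold = x[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold order statistic must be positive")
    # Hill estimate of gamma = 1/alpha from the k largest observations.
    gamma_hat = sum(math.log(x[i] / threshold) for i in range(k)) / k
    # Extrapolate along the fitted Pareto-type tail.
    return threshold * (k / (n * p)) ** gamma_hat


# Simulated data (an assumption): Pareto with tail index 2, so the true
# (1 - p)-quantile equals p ** (-1 / 2).
rng = random.Random(4)
sample = [rng.paretovariate(2.0) for _ in range(5000)]

p = 0.001
q_hat = weissman_quantile(sample, k=500, p=p)
q_true = p ** -0.5
print(f"estimated quantile: {q_hat:.2f}, true quantile: {q_true:.2f}")
```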
Nonparametric estimation of hazard rate function in light- and heavy-tailed cases. (2 hours)
(In Section 7 the estimation of a hazard rate function is considered both for light- and heavy-tailed distributions. For the heavy-tailed case a transformation approach is presented. For the light-tailed case the hazard rate is evaluated as the solution of an integral equation. Such tasks are ill-posed and hence, the solution is obtained by Tikhonov's regularization method. Regularized estimates are presented.)
Estimation of the renewal function within the finite time interval and for infinite time. Histogram-type non-parametric estimator, its asymptotical properties and smoothing methods. (2-3 hours)
(Section 8 contains estimates of the renewal function at infinite time. The nonparametric estimation of the renewal function, that is, the mean number of events of interest in a finite time interval, is also considered. Smoothing of the histogram-type estimate is discussed. Several known methods and original methods of the author are presented. An application to WWW-traffic data is considered.)
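The renewal function can be written as H(t) = sum over n >= 1 of P(tau_1 + ... + tau_n <= t), where the tau_i are the inter-arrival times. A rough sketch of a nonparametric estimate replaces each probability by the empirical frequency over non-overlapping blocks of n consecutive inter-arrival times and truncates the sum at n = k. This is a simplified illustration and may differ from the histogram-type estimator presented in the course; the function name, the truncation level k and the simulated exponential data are assumptions.

```python
import random


def renewal_function_estimate(interarrivals, t, k):
    """Estimate H(t) by summing empirical n-fold convolutions for n = 1..k."""
    total = 0.0
    for n in range(1, k + 1):
        m = len(interarrivals) // n  # number of non-overlapping blocks
        if m == 0:
            break
        count = 0
        for b in range(m):
            # Sum of n consecutive inter-arrival times: time of the n-th event.
            if sum(interarrivals[b * n:(b + 1) * n]) <= t:
                count += 1
        total += count / m
    return total


# Simulated data (an assumption): exponential inter-arrival times with rate 1,
# for which the renewal function is known to be H(t) = t.
rng = random.Random(5)
interarrivals = [rng.expovariate(1.0) for _ in range(2000)]

for t in (0.5, 1.0, 2.0):
    h_hat = renewal_function_estimate(interarrivals, t, k=15)
    print(f"t={t:.1f}  H_hat={h_hat:.3f}  true={t:.3f}")
```

The truncation level k acts as a smoothing parameter: too small a k drops terms that still contribute at time t, while large n leaves only a few blocks per term.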
Dependence detection for univariate and bivariate data. (2-3 hours)
(Section 9 contains the various mixing conditions, the autocorrelation function, portmanteau tests, extremal index estimation for the univariate case. The example of video traffic data analysis is given. For the bivariate case, the classical measures of dependence like Kendall's tau and Spearman's rho as well as the Pickands A-function (that reflects the dependence of two maxima) and copulas are given. The application to TCP-flow data control and Web-data is presented.)
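For the bivariate case, the two classical rank-based measures of dependence can be computed directly from their definitions. The sketch below is a minimal illustration that assumes there are no ties in the data; the function names and the small example are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant pairs) / total pairs, no ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def ranks(values):
    """Ranks 1..n of the values (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
print(f"tau = {kendall_tau(x, y):.3f}, rho = {spearman_rho(x, y):.3f}")
```

Both measures depend only on ranks, so they remain well defined for heavy-tailed data even when the variances needed for the ordinary correlation coefficient do not exist.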
Exercises are provided.
- Aivazyan SA, Buchstaber VM, Yenyukov IS and Meshalkin LD (1989). Applied statistics. Classification and reduction of dimensionality. Financy i statistika. Moscow (in Russian). Relevant for Lecture 5.
- Beirlant J, Goegebeur Y, Teugels J and Segers J (2004) Statistics of Extremes: Theory and Applications. Wiley, Chichester, West Sussex. Relevant for Lecture 1.
- Devroye L, Gyorfi L (1985). Nonparametric density estimation. The L_1 view, John Wiley & Sons, New York. Relevant for Lectures 1-4.
- Embrechts P, Kluppelberg C and Mikosch T (1997). Modelling Extremal Events for Finance and Insurance. Springer, Berlin. Relevant for Lectures 1 & 6.
- Gnedenko BW and Kowalenko IN (1971). Einführung in die Bedienungstheorie. Oldenbourg Verlag, München. Useful for Lecture 8.
- Markovich NM (2007). Nonparametric Analysis of Univariate Heavy-Tailed data: Research and Practice, Wiley, Chichester, West Sussex. Useful for Lectures 1-10.
- Silverman BW (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, New York. Essential for Lectures 1, 2 & 4.
- Simonoff JS (1996). Smoothing Methods in Statistics. Springer, New York. Essential for Lectures 1, 2 & 4.
- Tikhonov AN, Arsenin VY (1977). Solution of Ill-posed Problems. John Wiley, New York. Useful for Lecture 7.
- Wand MP, Jones MC (1995). Kernel Smoothing. Chapman & Hall, New York. Essential for Lectures 2 & 4.