A brief introduction to robust statistics


This video is a 10 minute introduction to the basics of robust statistics. We will start by looking at estimating location and correlation parameters to see how robust estimators behave relative to their classical, non-robust counterparts. I will then discuss some of my own research into estimating a sparse precision matrix in the presence of cellwise contamination. At the end are some references you can use to find out more about the various techniques presented, so let's get started.

The aim of robust statistics is to develop estimators that model the bulk of the data and are not unduly influenced by outlying observations, that is, observations that are not representative of the true underlying data generating process.

To explore this idea, we consider the simple case of location estimation. We will look at three estimators: the mean, the median and the Hodges-Lehmann estimator of location, which is just the median of the pairwise means.
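As a quick, purely illustrative Python sketch of the three estimators (my own code, not from the video; the Hodges-Lehmann helper uses the standard Walsh-average definition, which also pairs each point with itself):

```python
import numpy as np
from itertools import combinations_with_replacement

def hodges_lehmann(x):
    """Hodges-Lehmann location estimate: the median of the pairwise means
    (Walsh averages (x[i] + x[j]) / 2 over all pairs i <= j)."""
    walsh = [(x[i] + x[j]) / 2
             for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.median(walsh)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10)      # 10 observations, true location parameter is 5

print("mean           :", np.mean(x))
print("median         :", np.median(x))
print("Hodges-Lehmann :", hodges_lehmann(x))
```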

In this example we have 10 observations drawn from a uniform distribution over the range 0 to 10. All three estimates of location start off close to the true parameter value, which is 5 in this case. What we will do is observe how they behave when we artificially corrupt some of the observations. We start by taking the largest observation and moving it to the right.

When we do this, the mean starts to increase, which is what you would expect from a non-robust estimator. In contrast, the Hodges-Lehmann estimator and the median remain unchanged. It is this resilience to contamination that makes them what we call robust estimators. If we take the two largest observations and move them to the right, the median stays the same, while the Hodges-Lehmann estimate jumps from 5 to about 5.5 but then stays constant, and the mean reacts as before by increasing as the contaminated observations increase.

We observe similar behaviour when the three largest observations are contaminated. When the four largest observations are contaminated, the Hodges-Lehmann estimate now behaves just like the mean, in that both estimators are in breakdown, that is to say, they are no longer representative of the bulk of the data.

And when five observations are contaminated, even the median is no longer sure which observations represent the original data and which are the contaminated observations, so it also returns an estimate that is no longer representative of the location parameter of the original data generating process.
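As a rough sketch of this breakdown behaviour (again illustrative code, reusing the hodges_lehmann helper above), we can push the k largest observations far to the right and watch which estimates survive:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=10))

for k in range(6):                       # contaminate the k largest observations
    xc = x.copy()
    if k > 0:
        xc[-k:] += 1000                  # move them far to the right
    print(f"k={k}: mean={np.mean(xc):7.1f}  median={np.median(xc):7.1f}  "
          f"HL={hodges_lehmann(xc):7.1f}")

# Roughly: the mean drifts as soon as k = 1, the Hodges-Lehmann estimate breaks
# down around k = 4, and the median holds out until k = 5 (half the sample).
```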

We can observe similar behaviour in multivariate data sets. In this example we are looking to estimate the correlation between variables. In the lower half of the matrix we have scatter plots of 30 observations: for example, this scatter plot shows variable 2 on the y axis and variable 1 on the x axis, whereas this one has variable 2 on the x axis and variable 3 on the y axis. In the upper half of the matrix we have the true parameter values that I used to generate the data, the classical correlation estimates and the robust correlation estimates between each of the pairs of variables.

The robust estimator that I have used is the MCD estimator, and you can find links to further details at the end of the presentation. We are going to observe what happens to our estimates when we take three observations and move them away from the main data cloud. Let's focus on the relationship between variable 1 and variable 2.
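The video does not say which MCD implementation was used; as one readily available option, here is a hedged sketch with scikit-learn's MinCovDet on simulated data, dragging three of 30 observations away from the cloud (all names and numbers are placeholders):

```python
import numpy as np
from sklearn.covariance import MinCovDet

def corr_from_cov(cov):
    """Convert a covariance matrix into a correlation matrix."""
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8],
                     [0.8, 1.0]])                 # true correlation is 0.8
X = rng.multivariate_normal([0.0, 0.0], true_cov, size=30)
X[:3] = [6.0, -6.0]                               # three points pulled away from the cloud

classical = np.corrcoef(X, rowvar=False)[0, 1]
robust = corr_from_cov(MinCovDet(random_state=0).fit(X).covariance_)[0, 1]
print(f"classical correlation: {classical:.2f}   robust (MCD) correlation: {robust:.2f}")
```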

As the observations move further away from the data cloud, the classical correlation estimate decreases and eventually becomes zero, suggesting that there is no correlation between variable 1 and variable 2. As the contamination moves even further away, the classical estimate starts giving a negative value, suggesting a negative relationship between variable 1 and variable 2. In contrast, the robust estimates have all stayed reasonably close to the true parameter values, as they are modelling the core of the data and are not as influenced by these outlying values.

Let's look at a real example. Some work I do for the industry body Meat and Livestock Australia is concerned with predicting the eating quality of beef.

They run a large number of consumer tests where consumers are asked to rate pieces of meat on their tenderness, juiciness and flavour, and to give an overall score. The aim is to develop a predictive model that rates each cut of meat as 3, 4 or 5 star based on what we know about consumer preferences. Here are scatter plots of the ratings for almost 3000 pieces of meat. There's obvious structure in the data set, with tenderness, juiciness, flavour and overall being highly positively related. However, there is also a lot of noise, with some consumers giving very high scores for some variables and low scores for other variables. For example, up here you have some observations where consumers have given very high tenderness scores but very low overall scores.

If you calculate classical covariances, you can see there are reasonably strong positive linear relationships between the variables. However, robust techniques such as the minimum covariance determinant can be used to highlight the tightest core of the data, which represents the relationship between the variables for the majority of consumers. This suggests that the true underlying correlations between the variables are much stronger than the classical method would otherwise suggest: for example, the relationship between flavour and overall goes from 0.86 to 0.99 when you restrict attention to the tightest half of the data.
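The "tightest half" idea can be mimicked with MinCovDet's support_fraction parameter. The beef data are not public, so the sketch below uses simulated ratings purely for intuition; the 0.86 and 0.99 figures above are from the video, not from this code:

```python
import numpy as np
from sklearn.covariance import MinCovDet

def corr_from_cov(cov):
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

rng = np.random.default_rng(2)
core = rng.multivariate_normal([60, 60], [[100, 95], [95, 100]], size=2100)
noise = rng.uniform(0, 100, size=(900, 2))        # inconsistent raters
X = np.vstack([core, noise])                      # columns: flavour, overall (illustrative)

classical = np.corrcoef(X, rowvar=False)[0, 1]
mcd = MinCovDet(support_fraction=0.5, random_state=0).fit(X)   # fit to the tightest half
robust = corr_from_cov(mcd.covariance_)[0, 1]
print(f"classical: {classical:.2f}   tightest-half (MCD): {robust:.2f}")
```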

Something that I've looked at in my research is estimation in the presence of cellwise contamination. Traditional robust techniques, such as the minimum covariance determinant we used earlier, assume that contamination happens within the rows of a data set. Furthermore, even the most robust estimators can only cope with at most 50% of the rows being contaminated before they no longer work effectively, as we saw earlier with the median breaking down when there were 5 contaminated and 5 uncontaminated observations. This assumption of rowwise contamination may not be appropriate for large data sets.

Not be appropriate for large data sets. What we have here is a heat map of a data nmatrix. The white rectangles represent corrupted cells in this situation with a small amount nof cellwise corruption that affects less than half of the rows classical robust methods nwill still perform adequately. However as the proportion of scattered contamination.

Nincreases or the contamination is allowed to spread over all the rows. Then you might nend up in a situation..


Where all observations have at least on variable that is contaminated. There is still a lot of about the core of the data. Without your estimates nbeing. Overwhelmed by the contaminating cells in particular.
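A back-of-the-envelope calculation (my own illustration, not from the video) shows why scattered cells are so damaging for rowwise methods: if each cell is independently corrupted with probability eps, a row of p variables is completely clean with probability (1 - eps)^p, so even a small eps contaminates most rows once p is large.

```python
# Probability that a row has at least one contaminated cell when each of its
# p cells is independently corrupted with probability eps.
for eps in (0.01, 0.05):
    for p in (10, 100, 452):
        print(f"eps = {eps:.2f}, p = {p:3d}: P(row affected) = {1 - (1 - eps) ** p:.3f}")
```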

In particular, I have looked at estimating precision matrices in the presence of cellwise contamination. A precision matrix is just the inverse of the covariance matrix. However, often in large data sets you also want to assume that the precision matrix is sparse.

That is, there are a number of zero elements in the precision matrix. This is particularly useful when modelling Gaussian Markov random fields, where the zeros correspond to conditional independence between the variables.
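In symbols (standard Gaussian graphical model notation, not spelled out in the video): for a Gaussian vector X = (X_1, ..., X_p) with covariance matrix Sigma,

```latex
\Theta = \Sigma^{-1}, \qquad
\Theta_{jk} = 0 \;\Longleftrightarrow\; X_j \perp\!\!\!\perp X_k \mid \{X_\ell : \ell \neq j, k\},
```

so a sparse precision matrix corresponds to a sparse conditional independence graph.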

We'll briefly apply these ideas to a real data set to finish off. We have 452 stocks over 1258 trading days. We observe the closing prices, which we convert to daily return series, and we want to estimate a sparse precision matrix to identify clusters of stocks, that is, groups of stocks that behave similarly. To do this, we use a robust covariance matrix as an input to the graphical lasso, details of which can be found in the references at the end of the presentation.
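One way to prototype this pipeline (a sketch only: the paper uses its own robust covariance estimator, here I simply substitute scikit-learn's MCD, the data are placeholders, and the penalty alpha is arbitrary):

```python
import numpy as np
from sklearn.covariance import MinCovDet, graphical_lasso

# Placeholder closing prices of shape (n_days, n_stocks); the real data set has
# 452 stocks over 1258 trading days.
rng = np.random.default_rng(3)
prices = 100 * np.cumprod(1 + 0.01 * rng.standard_normal((1258, 20)), axis=0)

returns = np.diff(np.log(prices), axis=0)         # daily (log) return series

robust_cov = MinCovDet(random_state=0).fit(returns).covariance_
# Graphical lasso: L1-penalised estimate of a sparse precision matrix,
# taking a covariance matrix as its input.
_, precision = graphical_lasso(robust_cov, alpha=0.01)

n_nonzero = int((np.abs(precision) > 1e-8).sum() - precision.shape[0])
print("non-zero off-diagonal entries in the precision matrix:", n_nonzero)
```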

If we look at the return series for the first 6 stocks in the data set, we can see that there are a number of unusual observations scattered throughout the data. For example, you have this negative return for 3M Co, again for Adobe, and AMD also has a number of unusual observations. This suggests that perhaps there is a need for robust estimators when analysing this data set.

We're going to visualise the results as a network of stocks: if there is a non-zero entry in the estimated precision matrix between two stocks, then they will both appear in the graph with a line linking them. Furthermore, I have coloured the stocks by industry.
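Building that network from the estimated precision matrix is mechanical; here is a hedged sketch with networkx (the video does not say which tool was used, and the tickers and industry colours are omitted):

```python
import networkx as nx

def precision_to_graph(precision, labels, tol=1e-8):
    """Link two stocks whenever their off-diagonal precision entry is non-zero."""
    g = nx.Graph()
    g.add_nodes_from(labels)
    p = precision.shape[0]
    for i in range(p):
        for j in range(i + 1, p):
            if abs(precision[i, j]) > tol:
                g.add_edge(labels[i], labels[j])
    return g

# `precision` comes from the graphical lasso sketch above; labels are placeholders.
labels = [f"stock_{k}" for k in range(precision.shape[0])]
graph = precision_to_graph(precision, labels)
print(graph.number_of_nodes(), "stocks,", graph.number_of_edges(), "links")
```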

We can see here in the classical approach that we have some clusters of stocks. However, if we use the robust approach, these clusters are much more densely populated, so it has identified linkages between more stocks than the classical approach did, reflecting the fact that the robust method is not as influenced by those unusual outlying observations. So we've got a financials cluster over here, we've got an information technology cluster down here, a utilities cluster and an energy cluster.

If we add some additional contamination, the classical approach is no longer able to identify those clusters of stocks. It gives you just this soup of linkages with no clear structure.

The robust approach, on the other hand, gives you essentially the same result as what we had before I added the extra contamination. So you've still got the information technology cluster, you've still got a utilities cluster over here and a financials cluster down there. What this tells you is that the robust methods are modelling the core of the data and are relatively unaffected by unusual or extreme observations.

So that's been an extremely brief introduction to the idea of robust statistics. The take-home message is that robust methods are designed to model the core of the data without being unduly influenced by outlying observations that are not representative of the true data generating process.

If you'd like to know more about any of the methods presented here, you can check out these links: for the Hodges-Lehmann estimator, for the minimum covariance determinant (this is a nice recent review paper) and for the graphical lasso. Here's some of my work: there's a paper to appear in Computational Statistics and Data Analysis on the robust estimation of precision matrices under cellwise contamination. To understand that, it would perhaps help to go back to an earlier paper on robust scale estimation, or you could check out my PhD thesis, which is available at that link there.

If you would like to get in touch, there are my details, and you can find a copy of these slides.
