Way to filter certain % of Data out of a data series?


    Howdy.

    I have a set of data stored in an array. Say I wanted to filter out the data at the extremes, i.e. anything that falls outside the middle 90% of the data. Is there an easy way to go about it?

    For example, I am trying to do this in a rather exhaustive way with a for loop, and I must be doing something wrong as it is locking up with only a few bars on my chart. But you can see the two 'for' loops there, and the intent was to replace the two highest values with the average value.

    Example.....
    Code:
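    private void FilterExtremes()
    {
        // Get the median of the last 21 numbers in the sD and lD DataSeries.
        medsD = System.Convert.ToInt32(GetMedian(sD, lookBack));
        medlD = System.Convert.ToInt32(GetMedian(lD, lookBack));

        // Two passes, so that the two highest outliers get replaced.
        // Both counters must count up (j++, i++): with j-- and i-- they move
        // away from the loop limit and walk into negative indexes.
        for (int j = 0; j < 2; j++)
        {
            // Look up the current maxima once per pass.
            double maxsD = MAX(sD, 21)[0];
            double maxlD = MAX(lD, 21)[0];

            // Replace the highest value in each series with its median.
            for (int i = 0; i < 21; i++)
            {
                if (sD[i] == maxsD)
                {
                    sD[i] = medsD;
                }
                if (lD[i] == maxlD)
                {
                    lD[i] = medlD;
                }
            }
        }
    }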

    #2
    Hello,

If you add in print lines, what do they show is holding this up?

I know it's going to be two nested loops, 2 x 21 = 42 iterations in total. Are you running this in OnBarUpdate? Is COBC true or false?

Furthermore, there may be a more efficient way to code this. Looping back through data series on a tick-by-tick basis is usually not a good idea.
Brett, NinjaTrader Product Management


      #3
      Well, adding in the print lines, I was just checking the values.

      Well, this loop was called from OnBarUpdate(). I'm sure there is a much better way to code this to do what I want; I'm just not sure how at the moment.


        #4
        Forgot to attach a picture of what I am wanting to do. This is a representation of a data series of a given lookback with values in it.

        I am wanting to somehow remove the statistical outliers.
        Attached Files


          #5
          forrestang,

          I'm responding on behalf of Brett.

          Do you want the dataseries to remain intact or can it be reordered? I would not recommend using for loops on dataseries, since they keep getting larger and you will possibly reduce performance.

          I can possibly make some suggestions if you provide me with the above information. I look forward to helping you.
          Adam P., NinjaTrader Customer Service


            #6
            Originally posted by NinjaTrader_AdamP View Post
            forrestang,

            I'm responding on behalf of Brett.

            Do you want the dataseries to remain intact or can it be reordered? I would not recommend using for loops on dataseries, since they keep getting larger and you will possibly reduce performance.

            I can possibly make some suggestions if you provide me with the above information. I look forward to helping you.
            Thanks for checking it out Adam.

            I would imagine it should remain intact, since I will want to find these values on each new bar that forms. So if a new bar forms, I want to go back, and sort those values over again.


              #7
              forrestang,

              Thank you for clarifying.

              Ok, so let me see if I understand this correctly. You want to replace all data outside of the middle 90 percent with the median, or only the values that are largest? Are you running with CalculateOnBarClose=true or false? What sort of period charts are you going to run this on? Would you rather use the mean or the median?

              The main issue with using the median is that it's tough to compute without sorting an array or dataseries. That sorting is probably what causes efficiency issues if you use it on tick charts, or with COBC=false. There are ways of estimating the median using various tactics, but I'd have to do some research to see whether that's a good idea or not.
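To make the sorting cost concrete, here is a minimal sketch of a sort-based median. The original GetMedian() from the first post is not shown in the thread, so MedianSketch.Median is a hypothetical stand-in, and a plain double[] (newest value first) stands in for the DataSeries.

```csharp
using System;

public static class MedianSketch
{
    // Hypothetical helper: median of the first `count` entries of `values`.
    // `values` is a plain array standing in for a DataSeries, newest value first.
    public static double Median(double[] values, int count)
    {
        if (values == null || count <= 0 || count > values.Length)
            throw new ArgumentException("count must be between 1 and values.Length");

        // Sort a copy so the original ordering stays intact.
        double[] window = new double[count];
        Array.Copy(values, window, count);
        Array.Sort(window);

        int mid = count / 2;
        // Odd count: the middle element. Even count: average of the two middle elements.
        return (count % 2 == 1) ? window[mid] : (window[mid - 1] + window[mid]) / 2.0;
    }
}
```

Every call copies and sorts the whole window, which is the per-tick cost being discussed here.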

              Filtering 2 out of 20 is indeed 10 percent of the data set, but not necessarily the upper 10 percent if you think about your data as being normally distributed (which isn't necessarily the case with market data, but again would require more research to find a better distribution).

              Anyway, I look forward to helping you with this.
              Adam P., NinjaTrader Customer Service


                #8
                Thanks for the response.

                Here is the idea. This is based on a COBC = true for now. Practically though, running it I would have it on false, as this will simply plot levels on a 240min chart. The calculation should remain constant on each open of a 4 hour bar.

                I have two DataSeries of Data. In one set is a set of values that are grouped into a 'smaller' set of data. The other set contains a set of larger data. So that part in the program is working. You can see a sample of that data in the attached text file.

                Here is a pic of the distribution of the 'smaller' set that I plotted based on that.


                And here is a distribution of the 'larger' data set.


                So basically, what I am wanting to do, is based on some logic, 'trim' the outer edges of this distribution off. For example, on the smaller set trim off anything above 50 pips. On the larger, trim off anything above 160. Both of those are just visual guesstimations btw based on eyeballing those distributions.

                Also, you asked about using median vs. mean, and yes, mean was what I meant. I was just trying to think of a way to manipulate the data, so I figured I would replace any outliers in the DataSeries with the average, which would essentially make them meaningless. I either want to REMOVE them altogether or, if I can't do that, replace them with an average.

                Does all that make sense?
                Attached Files


                  #9
                  forrestang,

                  Thanks for the charts and explanation.

                  The histograms in those charts look somewhat like Poisson Distributions, or maybe one is an Exponential Distribution. In my days working at a lab I ran into something very similar. The reason I mention the different distributions is that setting something to the mean may bias it over time, as you can see the mean of those curves is to the right of the densest portions of the histogram. Perhaps the mode would be a better choice, but it still might bias the mean value of those curves. If you are using abs() for the differences, then you may want to do your cutting before you take the absolute value.

                  So, are you using these frequency distributions to determine the cut-off point for a secondary data series? Or are you trying to do this dynamically with the dataseries of differences? Are you using absolute value in your calculation of the differences?

                  If you calculate the mean and standard deviation off those histograms, you could set anything above the Mean + 1.281552 * StandardDeviation to the Mean in a second dataseries. I got the value in the calculation above from the following link.
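A minimal sketch of that rule, assuming the data sits in a plain double[] rather than a DataSeries and that the population standard deviation (dividing by n) is acceptable. UpperTailFilter and ReplaceUpperTail are hypothetical names, not NinjaTrader methods.

```csharp
using System;

public static class UpperTailFilter
{
    // z-score quoted in the thread: for a normal distribution, about 10% of
    // the values lie above mean + 1.281552 * standard deviation.
    public const double Z90 = 1.281552;

    // Returns a second array in which every value above mean + z * sd
    // is set to the mean. The input array is left untouched.
    public static double[] ReplaceUpperTail(double[] values, double z)
    {
        int n = values.Length;
        if (n == 0) return new double[0];

        double sum = 0.0;
        foreach (double v in values) sum += v;
        double mean = sum / n;

        // Population standard deviation (divide by n); this is an assumption.
        double sumSq = 0.0;
        foreach (double v in values) sumSq += (v - mean) * (v - mean);
        double sd = Math.Sqrt(sumSq / n);

        double cutoff = mean + z * sd;
        double[] filtered = new double[n];
        for (int i = 0; i < n; i++)
            filtered[i] = values[i] > cutoff ? mean : values[i];
        return filtered;
    }
}
```

The 10% figure only holds if the data is roughly normal, which may not be the case for this data.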



                  Just looking at those frequency charts it appears like you have a ton of data, easily above 500 points. You could use the SMA and StdDev indicators as part of the calculations in the indicator. That shouldn't be a problem if you use a sufficiently large period. However, I am a little confused because originally your code is using only 21 data points.

                  Anyway, I guess I am still not exactly sure as to how you want to engineer this indicator. This is my understanding, correct me if I am wrong.
                  1. You want to cut the upper 10 percent of data from a normal distribution of two dataseries of differences.
                  2. You want to determine this value from the dataseries' themselves.
                  3. You need it to be computationally efficient.
                  4. There will be no frequency histogram to calculate the cut-off from.


                  Please let me know if these are correct assumptions.
                  Last edited by NinjaTrader_AdamP; 10-14-2011, 03:40 PM.
                  Adam P., NinjaTrader Customer Service


                    #10
                    Originally posted by NinjaTrader_AdamP View Post
                    forrestang,

                    Thanks for the charts and explanation.

                    The histogram in those charts sort of look like Poisson Distributions, or maybe one is an Exponential Distribution. In my days working at a lab I ran into something very similar. The reason I mention the different distributions is because setting something to the mean may bias it over time as you can see the mean of those bell curves is to the right of the densest portions of the histogram. Perhaps the mode would be a better choice.

                    So, are you using these frequency distributions to determine the cut-off point for a secondary data series? Or are you trying to do this dynamically with the dataseries of differences? Are you using absolute value in your calculation of the differences?

                    If you calculate the mean and standard deviation off those histograms, you could set anything above the Mean + 1.281552 * StandardDeviation to the Mean in a second dataseries. I got the value in the calculation above from the following link.



                    Just looking at those frequency charts it appears like you have a ton of data, easily above 500 points. You could use the SMA and StdDev indicators as part of the calculations in the indicator. That shouldn't be a problem if you use a sufficiently large period. However, I am a little confused because originally your code is using only 21 data points.

                    Anyway, I guess I am still not exactly sure as to how you want to engineer this indicator. This is my understanding, correct me if I am wrong.
                    1. You want to cut the upper 10 percent of data from a normal distribution of two dataseries of differences.
                    2. You want to determine this value from the dataseries' themselves.
                    3. You need it to be computationally efficient.
                    4. There will be no frequency histogram to calculate the cut-off from.


                    Please let me know if these are correct assumptions.
                    Using the mode versus the mean is a possibility; it wouldn't hurt to try it out. I just plotted the frequency distribution to get an initial idea of how the data points were occurring. I.e., I can clearly see in the first graph that the majority of the distances are less than 50 pips. And in the second pic, I can see that most data fell between 10 and 110. These were done in MATLAB based on raw data exported from NT.

                    The purpose of the distribution would be to 'tune out' the outliers. I.e., for sorting the two DataSeries I mentioned earlier, I would want to remove any statistical outliers in some way. That could be by simply taking out anything that fell outside of 90% of 20 data points, which would mean I would remove 4 statistical outliers, 2 from each group. I could use 80% or 95%; I would want to experiment with that number.

                    But none of these distributions would be available to me in NT. It was just meant to be a visual representation of the data itself.

                    As for your four assumptions, I would say they are correct. For the first one though, I am open to various ways of removing the outliers.


                      #11
                      Forrestang,

                      You don't actually need the distributions themselves if you have the equations. For example, it appears you are using Abs( x1 - x2 ) where x1 and x2 are the prices at two different times in pips.

                      I would suggest keeping a data series that does not take the absolute value, as it will let you estimate the point to cut out a bit better without having to use the mode or median or worry too much about biasing it over time. I believe it will look more normally distributed without absolute value.

                      Now, if you want to keep with the 500+ data points for determining the cut-off, do a SMA and StdDev on the data series that does not have absolute value with a period of 500.

                      Now, estimate your cut-off at 1/2 * ( abs( SMA[0] + 1.644854*StdDev[0] ) + abs( SMA[0] - 1.644854*StdDev[0]) ) , then use this cut-off on your second data series that does use the absolute value. It won't be exactly 10 percent of the larger values, but hopefully pretty close. Then you simply set the outliers in the first data series that does use abs( x1 - x2 ) to SMA[0] just bar by bar. You may run into issues with COBC=false, but you can make it so it won't change the current bar's calculation and still update bar-by-bar.

                      You may want to first plot the histograms of the non-absolute value series to see if it looks normally distributed and also to ensure that the abs( SMA[0] + 1.644854*StdDev[0] ) and abs( SMA[0] - 1.644854*StdDev[0] ) aren't drastically different.
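The cut-off estimate above can be sketched as a small helper, assuming the SMA and StdDev values are passed in as plain numbers. SymmetricCutoff and Estimate are hypothetical names used only for illustration.

```csharp
using System;

public static class SymmetricCutoff
{
    // z-score quoted in the thread: for a normal distribution, about 90% of
    // the values lie within mean +/- 1.644854 * standard deviation.
    public const double Z95 = 1.644854;

    // Average of the absolute values of the upper and lower band edges:
    // 1/2 * ( abs(mean + z*sd) + abs(mean - z*sd) ).
    public static double Estimate(double mean, double sd)
    {
        double upper = Math.Abs(mean + Z95 * sd);
        double lower = Math.Abs(mean - Z95 * sd);
        return 0.5 * (upper + lower);
    }
}
```

Since (|a + b| + |a - b|) / 2 equals the larger of |a| and |b|, the estimate works out to the larger of |SMA| and 1.644854 * StdDev. For example, a mean of 2 with a standard deviation of 10 gives band edges of about 18.45 and 14.45 in absolute value and a cut-off of about 16.45.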

                      Please let me know if that makes sense.
                      Adam P., NinjaTrader Customer Service


                        #12
                        I should have mentioned that those aren't absolute value data series. They are just the High - Open and the Open - Low. Each is then sorted and put into the small array and the big array.

                        I'm not sure I understand your solution. Are you saying that you would take an SMA of the two arrays, then apply a StdDev to it, and if a value is greater than a certain number of standard deviations, put the prior value in its place?


                          #13
                          Originally posted by forrestang View Post
                          Thanks for checking it out Adam.

                          I would imagine it should remain intact, since I will want to find these values on each new bar that forms. So if a new bar forms, I want to go back, and sort those values over again.
                          In that case, you have to copy the data that you want into an ArrayList, and then manipulate the data as necessary. Here is a description of what you may want to do.
                          1. Copy the necessary values into an ArrayList. You can populate this at the same time as you Set the DataSeries
                          2. Sort the ArrayList
                          3. Determine the index for the lower limit 5% of the ArrayList (5% of ArrayList.Count)
                          4. Determine the index of the upper limit 5% of the ArrayList
                          5. (Re)Initialize a new ArrayList.
                          6. Copy the items of the ArrayList, from the low index to the high index into another ArrayList.

                          You have the values that you want, sorted in the second ArrayList.
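The steps above can be sketched roughly as follows. This version swaps the ArrayList for a generic List<double> (same idea, no casting), and TrimSketch.TrimTails is a hypothetical name. The percentage is passed as a whole number so the index arithmetic stays in integers.

```csharp
using System;
using System.Collections.Generic;

public static class TrimSketch
{
    // Returns the sorted middle of `values`, dropping `percentPerTail` percent
    // of the items from the low end and the same share from the high end.
    public static List<double> TrimTails(IEnumerable<double> values, int percentPerTail)
    {
        if (percentPerTail < 0 || percentPerTail >= 50)
            throw new ArgumentOutOfRangeException("percentPerTail", "must be in [0, 50)");

        // Steps 1 and 2: copy the values, then sort the copy.
        List<double> sorted = new List<double>(values);
        sorted.Sort();

        // Steps 3 and 4: number of items to drop at each end (rounded down).
        int drop = sorted.Count * percentPerTail / 100;

        // Steps 5 and 6: copy the items between the two limits into a new list.
        return sorted.GetRange(drop, sorted.Count - 2 * drop);
    }
}
```

With 20 values and 5 percent per tail, one item is dropped from each end and 18 remain. The source data is never reordered, which matches the requirement that the DataSeries stay intact.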


                            #14
                            Forrestang,

                            My apologies, I assumed your data series was some sort of difference between prices as an absolute value. Basically, my idea was that you could have two data series in your indicator storing the standard deviation as well as the simple moving average. Then use these to determine whether a new incoming value for the primary data series falls in an extreme region, which you set by multiplying the standard deviation by some constant and adding the result to the mean. If it does, you can simply set it to the mean, mode or median.

                            The issue is when you run sorts on COBC=false with large data series it can be quite computationally inefficient especially if you run it on multiple charts. My suggestion is a work-around that isn't necessarily optimal since clearly your two data series are not normal distributions. The main issue is that setting some incoming value to the mean will possibly cause a bias in the SMA data series over time. This is why the median or mode may be a better choice. You could calculate the mode/median only on bar close while the rest of the indicator works on COBC=false using some custom flags.

                            The main advantage of my approach is that you will only be updating the SMA and StdDev data series as well as your two large difference and small difference data series rather than performing sorts every tick. If you run COBC=true, then the sorts shouldn't affect you quite as much.
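As a rough illustration of why this avoids per-tick sorting, here is a sketch that keeps running sums so each new value costs a constant amount of work. RollingOutlierFilter is a hypothetical class, not a NinjaTrader indicator, and two choices in it are assumptions: it uses the population standard deviation, and it keeps the raw value in the statistics window so that replacements do not feed back into the mean.

```csharp
using System;
using System.Collections.Generic;

public class RollingOutlierFilter
{
    private readonly int period;
    private readonly double z;
    private readonly Queue<double> window = new Queue<double>();
    private double sum;
    private double sumSq;

    public RollingOutlierFilter(int period, double z)
    {
        if (period < 2) throw new ArgumentOutOfRangeException("period");
        this.period = period;
        this.z = z;
    }

    // Feeds one raw value and returns the value to store in the filtered series.
    public double Next(double value)
    {
        double result = value;

        // Only filter once a full window of earlier values is available.
        if (window.Count == period)
        {
            double mean = sum / period;
            // Clamp at zero to guard against tiny negative rounding errors.
            double variance = Math.Max(0.0, sumSq / period - mean * mean);
            double cutoff = mean + z * Math.Sqrt(variance);
            if (value > cutoff) result = mean;
        }

        // Update the running sums with the raw value and drop the oldest one.
        window.Enqueue(value);
        sum += value;
        sumSq += value * value;
        if (window.Count > period)
        {
            double oldest = window.Dequeue();
            sum -= oldest;
            sumSq -= oldest * oldest;
        }
        return result;
    }
}
```

With a period of 4 and z = 1.281552, feeding 10, 12, 10, 12 returns each value unchanged; a following 100 is above the cut-off of about 12.28 and comes back as the window mean, 11.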

                            I'm not sure which sorting algorithm lists use; perhaps someone could comment. However, a comparison of sorting algorithm time complexities is at the following link.



                            So say in the best case you have O(n log n) time complexity; then a 500-element array/list would take approximately 500 * log(500) steps to sort. If we assume they mean log base 2, that is approximately 4483 steps per sort. Running on multiple charts and tick by tick, that adds up.

                            I made a simple-moving-median indicator, using merge sort I believe, which you are welcome to look at and experiment with. I didn't use lists so that I could write my own sort as an exercise. Please find it attached. I put the merge-sort algorithm in UserDefinedMethods, so if you already have anything in UserDefinedMethods it would be better to copy the merge-sort code into yours rather than replace it.
                            Attached Files
                            Last edited by NinjaTrader_AdamP; 10-15-2011, 10:23 AM.
                            Adam P., NinjaTrader Customer Service


                              #15
                              Adam, I am a bit lost, I have to admit.

                              So if I have two dataSeries populated with data, and I would like to filter out the extremes and replace with something else, are you suggesting that I filter anything out of that list with the SMA+stdDev combo, or to sort that list?
