papaya
Class Descriptive

java.lang.Object
  extended by papaya.Descriptive

public class Descriptive
extends Object

Basic descriptive statistics class for exploratory data analysis. Methods for computing correlations and covariances are in the Correlation class. Where appropriate, methods with similar functions are grouped into static nested classes. For example, the Descriptive.Sum class contains methods for computing various sums of datasets, while the Descriptive.Mean class contains the methods for computing means.


Nested Class Summary
static class Descriptive.Mean
          Contains methods for computing the arithmetic, geometric, harmonic, trimmed, and winsorized means (among others).
static class Descriptive.Pooled
          Class for computing the pooled mean and variance of data sequences
static class Descriptive.Sum
          Methods for computing various sums of datasets, such as the sum of inversions, logs, products, power deviations, squares, etc.
static class Descriptive.Weighted
          Contains methods related to weighted datasets.
 
Method Summary
static void frequencies(float[] sortedData, ArrayList<Float> distinctValues, ArrayList<Integer> frequencies)
          Computes the frequency (number of occurrences, count) of each distinct value in the given sorted data.
static float kurtosis(float[] data, float mean, float standardDeviation)
          Returns the kurtosis (aka excess) of a data sequence, which is -3 + moment(data,4,mean) / standardDeviation^4.
static float kurtosis(float moment4, float standardDeviation)
          Returns the kurtosis (aka excess) of a data sequence.
static double max(double[] data)
          Returns the largest member of a data sequence.
static double max(double a, double b)
           
static float max(float[] data)
          Returns the largest member of a data sequence.
static float max(float[][] data)
          Returns the largest member of a matrix.
static float max(float a, float b)
           
static int max(int[] data)
          Returns the largest member of a data sequence.
static int max(int[][] data)
          Returns the largest member of a matrix.
static int max(int a, int b)
           
static float mean(float[] data)
          Returns the arithmetic mean of a data sequence; that is, Sum( data[i] ) / data.length.
static float meanDeviation(float[] data, float mean)
          Returns the mean deviation of a dataset.
static float median(float[] data, boolean isSorted)
          Returns the median of a data sequence.
static double min(double[] data)
          Returns the smallest member of a data sequence.
static double min(double a, double b)
           
static float min(float[] data)
          Returns the smallest member of a data sequence.
static float min(float[][] data)
          Returns the smallest member of a matrix.
static float min(float a, float b)
           
static int min(int[] data)
          Returns the smallest member of a data sequence.
static int min(int[][] data)
          Returns the smallest member of a matrix.
static int min(int a, int b)
           
static float[] mod(float[] data)
          Returns the array containing the elements that appear the most in a given dataset.
static float moment(float[] data, int k, float c)
          Returns the moment of k-th order with constant c of a data sequence, which is Sum( (data[i]-c)^k ) / data.length.
static float[] outliers(float[] data, float lowerLimit, float upperLimit, boolean isSorted)
          Returns the array containing all elements in the dataset that are less than or equal to the lowerLimit or greater than or equal to the upperLimit.
static float product(float[] data)
          Returns the product of a data sequence, which is Prod( data[i] ).
static float product(int size, float sumOfLogarithms)
          Returns the product, which is Prod( data[i] ).
static float quantile(float[] sortedData, float phi)
          Returns the phi-quantile; that is, an element elem such that a fraction phi of the data elements is less than elem.
static float quantileInverse(float[] sortedList, float element)
          Returns the fraction (between 0 and 1) of the elements in the sorted list that are <= element.
static float[] quantiles(float[] sortedData, float[] percentages)
          Returns the quantiles of the specified percentages.
static float[] quartiles(float[] data, boolean isSorted)
          Returns the quartiles of the input data array (not necessarily sorted).
static float rankInterpolated(float[] sortedList, float element)
          Returns the linearly interpolated number of elements in an array that are ≤ a given element.
static float rms(int size, double sumOfSquares)
          Returns the RMS (Root-Mean-Square) of a data sequence.
static float skew(float[] data, float mean, float standardDeviation)
          Returns the skew of a data sequence, which is moment(data,3,mean) / standardDeviation^3.
static float skew(float moment3, float standardDeviation)
          Returns the skew of a data sequence when the 3rd moment has already been computed.
static float[] std(float[][] data, boolean unbiased)
          Returns an array with each element of the array corresponding to the standard deviations of each column of the input matrix.
static float std(float[] data, boolean unbiased)
          Returns the standard deviation of a dataset.
static float stdUnbiased(int size, float sampleVariance)
          Returns the unbiased sample standard deviation assuming the sample is normally distributed.
static float[] tukeyFiveNum(float[] data)
          Returns the Tukey five-number summary of a dataset, consisting of the minimum, maximum, and three quartile values.
static float[] var(float[][] data, boolean unbiased)
          Returns an array containing the variance of each column of the input matrix X.
static float var(float[] data, boolean unbiased)
          Returns the variance of a dataset, V.
static float[][] zScore(float[][] X, float[] means, float[] standardDeviations)
          Computes the standardized version of the input matrix.
static float[] zScore(float[] x, float mean, float standardDeviation)
          Returns the array of z-scores for a given data array.
static float zScore(float x, float mean, float standardDeviation)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

frequencies

public static void frequencies(float[] sortedData,
                               ArrayList<Float> distinctValues,
                               ArrayList<Integer> frequencies)
Computes the frequency (number of occurrences, count) of each distinct value in the given sorted data. After this call, distinctValues and frequencies both have the same new size, which is the number of distinct values in the sorted data.

Distinct values are filled into distinctValues, starting at index 0. The frequency of each distinct value is filled into frequencies, starting at index 0. As a result, the smallest distinct value (and its frequency) can be found at index 0, the second smallest distinct value (and its frequency) at index 1, ..., the largest distinct value (and its frequency) at index distinctValues.size()-1.

Example:
sortedData = (5,6,6,7,8,8) --> distinctValues = (5,6,7,8), frequencies = (1,2,1,2)

Code-wise, you would write:
 
 ArrayList<Float> distinctValues = new ArrayList<Float>(); 
 
 ArrayList<Integer> frequencies = new ArrayList<Integer>();  
 
 Descriptive.frequencies(sortedData, distinctValues, frequencies);
 

Parameters:
sortedData - the data; must be sorted ascending.
distinctValues - an ArrayList to be filled with the distinct values; can have any size.
frequencies - an ArrayList to be filled with the frequencies; can have any size; set this parameter to null to ignore it.
See Also:
Frequency, Unique
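
The run-counting behaviour described above can be sketched as follows. This is a standalone illustration that does not call the library; the class name FrequenciesSketch is hypothetical, the lists are assumed to be non-null, and the library's actual implementation may differ.

```java
import java.util.ArrayList;

// Hypothetical standalone sketch; not the papaya implementation.
final class FrequenciesSketch {

    // Fills distinctValues and frequencies from data sorted in ascending order.
    static void frequencies(float[] sortedData,
                            ArrayList<Float> distinctValues,
                            ArrayList<Integer> frequencies) {
        distinctValues.clear();
        frequencies.clear();
        int i = 0;
        while (i < sortedData.length) {
            float value = sortedData[i];
            int count = 1;
            i++;
            // Extend the run while the next elements equal the current value.
            while (i < sortedData.length && sortedData[i] == value) {
                count++;
                i++;
            }
            distinctValues.add(value);
            frequencies.add(count);
        }
    }

    public static void main(String[] args) {
        ArrayList<Float> values = new ArrayList<Float>();
        ArrayList<Integer> counts = new ArrayList<Integer>();
        frequencies(new float[] {5, 6, 6, 7, 8, 8}, values, counts);
        System.out.println(values + " " + counts);
    }
}
```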

kurtosis

public static float kurtosis(float moment4,
                             float standardDeviation)
Returns the kurtosis (aka excess) of a data sequence.

Parameters:
moment4 - the fourth central moment, which is moment(data,4,mean).
standardDeviation - the standardDeviation.

kurtosis

public static float kurtosis(float[] data,
                             float mean,
                             float standardDeviation)
Returns the kurtosis (aka excess) of a data sequence, which is -3 + moment(data,4,mean) / standardDeviation^4.
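
A minimal sketch of this formula, written as standalone code rather than a call into the library. The class name KurtosisSketch and its helper are hypothetical, and the standard deviation passed in is assumed to be the biased (divide-by-N) one.

```java
// Hypothetical standalone sketch; not the papaya implementation.
final class KurtosisSketch {

    // k-th moment about c: Sum( (data[i]-c)^k ) / data.length
    static float moment(float[] data, int k, float c) {
        double sum = 0.0;
        for (float x : data) {
            sum += Math.pow(x - c, k);
        }
        return (float) (sum / data.length);
    }

    // Excess kurtosis: -3 + fourth central moment / standardDeviation^4
    static float kurtosis(float[] data, float mean, float standardDeviation) {
        double sd4 = Math.pow(standardDeviation, 4);
        return (float) (-3.0 + moment(data, 4, mean) / sd4);
    }

    public static void main(String[] args) {
        float[] data = {1, 2, 3, 4, 5};
        // The mean is 3 and the biased variance is 2 for this data.
        System.out.println(kurtosis(data, 3f, (float) Math.sqrt(2.0)));
    }
}
```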


max

public static int max(int a,
                      int b)

max

public static float max(float a,
                        float b)

max

public static double max(double a,
                         double b)

max

public static double max(double[] data)
Returns the largest member of a data sequence.


max

public static float max(float[] data)
Returns the largest member of a data sequence.


max

public static int max(int[] data)
Returns the largest member of a data sequence.


max

public static float max(float[][] data)
Returns the largest member of a matrix.


max

public static int max(int[][] data)
Returns the largest member of a matrix.


mean

public static float mean(float[] data)
Returns the arithmetic mean of a data sequence; that is, Sum( data[i] ) / data.length. Similar to Descriptive.Mean.arithmetic(float[]).


meanDeviation

public static float meanDeviation(float[] data,
                                  float mean)
Returns the mean deviation of a dataset. That is, Sum( Math.abs(data[i]-mean) ) / data.length.
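
The formula can be illustrated with a short standalone sketch. The class name MeanDeviationSketch is hypothetical and the code does not call the library.

```java
// Hypothetical standalone sketch; not the papaya implementation.
final class MeanDeviationSketch {

    // Sum( |data[i] - mean| ) / data.length
    static float meanDeviation(float[] data, float mean) {
        double sum = 0.0;
        for (float x : data) {
            sum += Math.abs(x - mean);
        }
        return (float) (sum / data.length);
    }

    public static void main(String[] args) {
        System.out.println(meanDeviation(new float[] {1, 2, 3, 4, 5}, 3f));
    }
}
```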


median

public static float median(float[] data,
                           boolean isSorted)
Returns the median of a data sequence.

Parameters:
data - the data sequence.
isSorted - true if the data sequence is sorted (in ascending order), else false.
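
A rough standalone sketch of how the isSorted flag might be used. The class name MedianSketch is hypothetical, the input is assumed to be non-empty, and averaging the two middle values for even-length input is an assumption; the library may use a different convention.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not the papaya implementation.
final class MedianSketch {

    static float median(float[] data, boolean isSorted) {
        // Sort a copy so that the caller's array is left untouched.
        float[] sorted = data;
        if (!isSorted) {
            sorted = data.clone();
            Arrays.sort(sorted);
        }
        int n = sorted.length;
        int mid = n / 2;
        // Odd length: middle element. Even length: mean of the two middle elements (assumed).
        return (n % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2f;
    }

    public static void main(String[] args) {
        System.out.println(median(new float[] {3, 1, 2}, false));
        System.out.println(median(new float[] {1, 2, 3, 4}, true));
    }
}
```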

min

public static int min(int a,
                      int b)

min

public static float min(float a,
                        float b)

min

public static double min(double a,
                         double b)

min

public static double min(double[] data)
Returns the smallest member of a data sequence.


min

public static float min(float[] data)
Returns the smallest member of a data sequence.


min

public static float min(float[][] data)
Returns the smallest member of a matrix.


min

public static int min(int[] data)
Returns the smallest member of a data sequence.


min

public static int min(int[][] data)
Returns the smallest member of a matrix.


mod

public static float[] mod(float[] data)
Returns the array containing the elements that appear the most in a given dataset. (The return type has to be an array since a dataset can have more than one mode.)

Parameters:
data - the data array
Returns:
the array containing the (distinct) elements that appear the most.
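
To show why an array is returned, here is a standalone sketch that finds every value tied for the highest count. The class name ModeSketch is hypothetical, and the sort-then-scan approach is an assumption rather than the library's method.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not the papaya implementation.
final class ModeSketch {

    static float[] mod(float[] data) {
        float[] sorted = data.clone();
        Arrays.sort(sorted);

        // First pass: find the length of the longest run of equal values.
        int best = 0;
        int run = 0;
        for (int i = 0; i < sorted.length; i++) {
            run = (i > 0 && sorted[i] == sorted[i - 1]) ? run + 1 : 1;
            best = Math.max(best, run);
        }

        // Second pass: collect each value whose run reaches that length.
        float[] modes = new float[sorted.length];
        int count = 0;
        run = 0;
        for (int i = 0; i < sorted.length; i++) {
            run = (i > 0 && sorted[i] == sorted[i - 1]) ? run + 1 : 1;
            if (run == best) {
                modes[count++] = sorted[i];
            }
        }
        return Arrays.copyOf(modes, count);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mod(new float[] {3, 2, 2, 1, 3})));
    }
}
```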

moment

public static float moment(float[] data,
                           int k,
                           float c)
Returns the moment of k-th order with constant c of a data sequence, which is Sum( (data[i]-c)^k ) / data.length.


outliers

public static float[] outliers(float[] data,
                               float lowerLimit,
                               float upperLimit,
                               boolean isSorted)
Returns the array containing all elements in the dataset that are less than or equal to the lowerLimit or greater than or equal to the upperLimit.

Parameters:
data - the data array
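lowerLimit - the lower limit; elements less than or equal to this value are returned.
upperLimit - the upper limit; elements greater than or equal to this value are returned.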
isSorted - true if the data array has been sorted in ascending order, else set to false.
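
The selection rule can be sketched as a simple filter. This standalone example uses a hypothetical class name, OutliersSketch, and omits the isSorted parameter because a linear scan does not need it; the library may exploit sorted input differently.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not the papaya implementation.
final class OutliersSketch {

    // Keeps every element that is <= lowerLimit or >= upperLimit.
    static float[] outliers(float[] data, float lowerLimit, float upperLimit) {
        float[] kept = new float[data.length];
        int count = 0;
        for (float x : data) {
            if (x <= lowerLimit || x >= upperLimit) {
                kept[count++] = x;
            }
        }
        return Arrays.copyOf(kept, count);
    }

    public static void main(String[] args) {
        float[] data = {1, 2, 3, 4, 10};
        System.out.println(Arrays.toString(outliers(data, 1f, 5f)));
    }
}
```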

product

public static float product(int size,
                            float sumOfLogarithms)
Returns the product, which is Prod( data[i] ). In other words: data[0]*data[1]*...*data[data.length-1]. This method uses the equivalent definition: prod = pow( exp( Sum( log(x[i]) ) / size ), size ).
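
A standalone sketch of the log-based definition. The class name ProductSketch and the sumOfLogarithms helper are hypothetical, and all data values are assumed to be strictly positive so that their logarithms exist.

```java
// Hypothetical standalone sketch; not the papaya implementation.
final class ProductSketch {

    // Sum( log(data[i]) ); assumes every element is > 0.
    static float sumOfLogarithms(float[] data) {
        double sum = 0.0;
        for (float x : data) {
            sum += Math.log(x);
        }
        return (float) sum;
    }

    // pow( exp( sumOfLogarithms / size ), size )
    static float product(int size, float sumOfLogarithms) {
        return (float) Math.pow(Math.exp(sumOfLogarithms / size), size);
    }

    public static void main(String[] args) {
        float[] data = {1, 2, 3, 4};
        System.out.println(product(data.length, sumOfLogarithms(data)));
    }
}
```

The inner term exp( Sum( log(x[i]) ) / size ) is the geometric mean, so the product is the geometric mean raised to the power size.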


product

public static float product(float[] data)
Returns the product of a data sequence, which is Prod( data[i] ). In other words: data[0]*data[1]*...*data[data.length-1]. Note that you may easily get numeric overflows. Use product(int,float) instead to avoid that.


quartiles

public static float[] quartiles(float[] data,
                                boolean isSorted)
Returns the quartiles of the input data array (not necessarily sorted).

Details:
The first quartile, or lower quartile (Q[0]), is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median (Q[1]), is the value that cuts off the first 50%. The third quartile, or upper quartile (Q[2]), is the value that cuts off the first 75%.

Parameters:
data - The data array.
isSorted - true if the data array has been sorted in ascending order, else set to false.
Returns:
The 3 quartile values Q[0], Q[1], and Q[2].

quantile

public static float quantile(float[] sortedData,
                             float phi)
Returns the phi-quantile; that is, an element elem such that a fraction phi of the data elements is less than elem. The quantile need not be contained in the data sequence; it can be a linear interpolation.

Parameters:
sortedData - the data sequence; must be sorted ascending.
phi - the percentage; must satisfy 0 <= phi <= 1.
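
The following standalone sketch shows one common way to interpolate a quantile linearly. The class name QuantileSketch is hypothetical, and placing the quantile at fractional index phi * (n - 1) is an assumption; the library's interpolation rule may differ.

```java
// Hypothetical standalone sketch; not the papaya implementation.
final class QuantileSketch {

    // Linear interpolation at fractional index phi * (n - 1) (assumed convention).
    static float quantile(float[] sortedData, float phi) {
        int n = sortedData.length;
        float index = phi * (n - 1);
        int lo = (int) Math.floor(index);
        int hi = Math.min(lo + 1, n - 1);
        float frac = index - lo;
        return sortedData[lo] + frac * (sortedData[hi] - sortedData[lo]);
    }

    public static void main(String[] args) {
        float[] sorted = {10, 20};
        // The result lies between the two elements and is not contained in the data.
        System.out.println(quantile(sorted, 0.5f));
    }
}
```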

quantileInverse

public static float quantileInverse(float[] sortedList,
                                    float element)
Returns the fraction of the elements in the sorted list that are <= element. Linear interpolation is used if the element is not contained in the list but lies between two contained elements.

Parameters:
sortedList - the list to be searched (must be sorted ascending).
element - the element to search for.
Returns:
the percentage phi of elements <= element (0.0 <= phi <= 1.0).

quantiles

public static float[] quantiles(float[] sortedData,
                                float[] percentages)
Returns the quantiles of the specified percentages. The quantiles need not be contained in the data sequence; they can be linear interpolations.

Parameters:
sortedData - the data sequence; must be sorted ascending.
percentages - the percentages for which quantiles are to be computed. Each percentage must be in the interval [0.0f,1.0f].
Returns:
the quantiles.

rankInterpolated

public static float rankInterpolated(float[] sortedList,
                                     float element)
Returns the linearly interpolated number of elements in an array that are ≤ a given element. The rank is the number of elements ≤ element. Ranks are of the form {0, 1, 2,..., sortedList.length}.

If no element is ≤ element, then the rank is zero.

If the element lies between two contained elements, then linear interpolation is used and a non-integer value is returned.

Parameters:
sortedList - the list to be searched (must be sorted ascending).
element - the element to search for.
Returns:
the rank of the element.

rms

public static float rms(int size,
                        double sumOfSquares)
Returns the RMS (Root-Mean-Square) of a data sequence. That is, Math.sqrt(Sum( data[i]*data[i] ) / data.length). The RMS of a data sequence is the square root of the mean of the squares of the elements in the data sequence. It is a measure of the average "size" of the elements of a data sequence.

Parameters:
sumOfSquares - sumOfSquares(data) == Sum( data[i]*data[i] ) of the data sequence.
size - the number of elements in the data sequence.
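
A small standalone sketch of the two-step computation, in which the sum of squares is computed first and then passed in. The class name RmsSketch and the sumOfSquares helper are hypothetical and do not call the library.

```java
// Hypothetical standalone sketch; not the papaya implementation.
final class RmsSketch {

    // Sum( data[i]*data[i] )
    static double sumOfSquares(float[] data) {
        double sum = 0.0;
        for (float x : data) {
            sum += (double) x * x;
        }
        return sum;
    }

    // sqrt( sumOfSquares / size )
    static float rms(int size, double sumOfSquares) {
        return (float) Math.sqrt(sumOfSquares / size);
    }

    public static void main(String[] args) {
        float[] data = {1, 7};
        System.out.println(rms(data.length, sumOfSquares(data)));
    }
}
```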

skew

public static float skew(float moment3,
                         float standardDeviation)
Returns the skew of a data sequence when the 3rd moment has already been computed.

Parameters:
moment3 - the third central moment, which is moment(data,3,mean).
standardDeviation - the standardDeviation.

skew

public static float skew(float[] data,
                         float mean,
                         float standardDeviation)
Returns the skew of a data sequence, which is moment(data,3,mean) / standardDeviation^3.
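
As a rough illustration, the skew formula can be written out directly. The class name SkewSketch is hypothetical, the code is standalone, and the standard deviation passed in is assumed to be the biased (divide-by-N) one.

```java
// Hypothetical standalone sketch; not the papaya implementation.
final class SkewSketch {

    // Third central moment divided by standardDeviation^3.
    static float skew(float[] data, float mean, float standardDeviation) {
        double sum = 0.0;
        for (float x : data) {
            double d = x - mean;
            sum += d * d * d;
        }
        double moment3 = sum / data.length;
        return (float) (moment3 / Math.pow(standardDeviation, 3));
    }

    public static void main(String[] args) {
        // The mean is 2 and the biased variance is 2; the large value 4 pulls the skew positive.
        System.out.println(skew(new float[] {1, 1, 4}, 2f, (float) Math.sqrt(2.0)));
    }
}
```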


std

public static float std(float[] data,
                        boolean unbiased)
Returns the standard deviation of a dataset. There are two definitions of the standard deviation:
 sigma_1 = sqrt( 1/(N-1) * Sum( (x[i] - mean(x))^2 ) )
 sigma_2 = sqrt( 1/N * Sum( (x[i] - mean(x))^2 ) )
 
sigma_1 is the square root of an unbiased estimator of the variance of the population from which x is drawn, as long as x consists of independent, identically distributed samples. sigma_2 is the square root of the second moment of the set of values about their mean.

std(data,unbiased==true) returns sigma_1 above, while std(data,unbiased==false) returns sigma_2.

Parameters:
data - the dataset
unbiased - set to true to return the unbiased standard deviation, false to return the biased version.
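
The effect of the unbiased flag can be sketched as below. This is a standalone illustration with a hypothetical class name, StdSketch; falling back to division by N when N = 1 is an assumption borrowed from the var documentation.

```java
// Hypothetical standalone sketch; not the papaya implementation.
final class StdSketch {

    static float std(float[] data, boolean unbiased) {
        int n = data.length;
        double mean = 0.0;
        for (float x : data) {
            mean += x;
        }
        mean /= n;

        double sumSq = 0.0;
        for (float x : data) {
            sumSq += (x - mean) * (x - mean);
        }
        // Divide by N-1 for the unbiased form (when N > 1, assumed), otherwise by N.
        double divisor = (unbiased && n > 1) ? n - 1 : n;
        return (float) Math.sqrt(sumSq / divisor);
    }

    public static void main(String[] args) {
        float[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(std(data, false));
        System.out.println(std(data, true));
    }
}
```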

std

public static float[] std(float[][] data,
                          boolean unbiased)
Returns an array with each element of the array corresponding to the standard deviations of each column of the input matrix. Each column of the matrix corresponds to a dataset, and each row an observation. There are two definitions of the standard deviation:
 sigma_1 = sqrt( 1/(N-1) * Sum( (x[i] - mean(x))^2 ) )
 sigma_2 = sqrt( 1/N * Sum( (x[i] - mean(x))^2 ) )
 
sigma_1 is the square root of an unbiased estimator of the variance of the population from which x is drawn, as long as x consists of independent, identically distributed samples. sigma_2 is the square root of the second moment of the set of values about their mean.

std(data,unbiased==true) returns sigma_1 above, while std(data,unbiased==false) returns sigma_2.

Parameters:
data - the dataset
unbiased - set to true to return the unbiased standard deviation, false to return the biased version.

stdUnbiased

public static float stdUnbiased(int size,
                                float sampleVariance)
Returns the unbiased sample standard deviation assuming the sample is normally distributed. Ref: R.R. Sokal, F.J. Rohlf, Biometry: the principles and practice of statistics in biological research (W.H. Freeman and Company, New York, 1998, 3rd edition) p. 53.

See also this entry on wikipedia.org

Parameters:
size - the number of elements of the data sequence.
sampleVariance - the sample variance.

tukeyFiveNum

public static float[] tukeyFiveNum(float[] data)
Returns the Tukey five-number summary of a dataset, consisting of the minimum, maximum, and three quartile values.

Parameters:
data - the data array
Returns:
the array of five numbers.

var

public static float var(float[] data,
                        boolean unbiased)
Returns the variance of a dataset, V. For matrices, var(X,unbiased) returns an array containing the variance of each column of X.

var(x,true) normalizes V by N - 1 if N > 1, where N is the sample size. This is an unbiased estimator of the variance of the population from which X is drawn, as long as X consists of independent, identically distributed samples. For N = 1, V is normalized by 1.

V = var(x,false) normalizes by N and produces the second moment of the sample about its mean.

Reference:
Algorithms for calculating variance, Wikipedia.org
Incremental calculation of weighted mean and variance, Tony Finch

Parameters:
data - the data sequence.
unbiased - set to true to return the unbiased variance (division by (N-1)), false to return the biased value (division by N).
Returns:
the variance in the data.
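
The references above concern incremental variance algorithms. The standalone sketch below uses Welford's one-pass update, one of the methods described there. Whether the library uses this exact algorithm is an assumption, and the class name VarianceSketch is hypothetical.

```java
// Hypothetical standalone sketch using Welford's one-pass algorithm;
// not necessarily how papaya computes the variance.
final class VarianceSketch {

    static float var(float[] data, boolean unbiased) {
        int n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean
        for (float x : data) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        // Normalize by N-1 if unbiased and N > 1, otherwise by N.
        double divisor = (unbiased && n > 1) ? n - 1 : n;
        return (float) (m2 / divisor);
    }

    public static void main(String[] args) {
        float[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(var(data, false));
        System.out.println(var(data, true));
    }
}
```

A one-pass update avoids storing the mean from a separate first pass and is less prone to cancellation error than the naive sum-of-squares formula.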

var

public static float[] var(float[][] data,
                          boolean unbiased)
Returns an array containing the variance of each column of the input matrix X. Each column of the matrix corresponds to a dataset, and each row to an observation.

var(X,unbiased=true) normalizes V by N - 1 if N > 1, where N is the sample size. This is an unbiased estimator of the variance of the population from which X is drawn, as long as X consists of independent, identically distributed samples. For N = 1, V is normalized by 1.

V = var(X,unbiased=false) normalizes by N and produces the second moment of the sample about its mean.

Reference:
Algorithms for calculating variance, Wikipedia.org
Incremental calculation of weighted mean and variance, Tony Finch

Parameters:
data - the input matrix; each column is treated as a separate dataset.
unbiased - set to true to return the unbiased variance (division by (N-1)), false to return the biased value (division by N).
Returns:
the variance in the data.

zScore

public static float zScore(float x,
                           float mean,
                           float standardDeviation)
Returns:
the z-score of that element computed as (x-mean)/standardDeviation

zScore

public static float[] zScore(float[] x,
                             float mean,
                             float standardDeviation)
Returns the array of z-scores for a given data array, with each element given by z[i] = ( x[i] - mean ) / standardDeviation.

Returns:
the standardized array, z

zScore

public static float[][] zScore(float[][] X,
                               float[] means,
                               float[] standardDeviations)
Computes the standardized version of the input matrix. That is, each element of column j of the output is given by Z[i][j] = ( X[i][j] - mean(X[,j]) ) / standardDeviation(X[,j]), where mean(X[,j]) is the mean value of column j of X.
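
A minimal standalone sketch of the column-wise standardization. The class name ZScoreSketch is hypothetical, and the means and standard deviations are assumed to hold one entry per column with no zero standard deviations.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not the papaya implementation.
final class ZScoreSketch {

    // Z[i][j] = ( X[i][j] - means[j] ) / standardDeviations[j]
    static float[][] zScore(float[][] X, float[] means, float[] standardDeviations) {
        float[][] Z = new float[X.length][];
        for (int i = 0; i < X.length; i++) {
            Z[i] = new float[X[i].length];
            for (int j = 0; j < X[i].length; j++) {
                Z[i][j] = (X[i][j] - means[j]) / standardDeviations[j];
            }
        }
        return Z;
    }

    public static void main(String[] args) {
        float[][] X = {{1, 10}, {3, 30}};
        float[][] Z = zScore(X, new float[] {2, 20}, new float[] {1, 10});
        System.out.println(Arrays.deepToString(Z));
    }
}
```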



Processing library papaya by Adila Faruk. (C) 2014