java.lang.Object
  papaya.Descriptive
public class Descriptive
Basic descriptive statistics class for exploratory data analysis.
Methods for computing correlations and covariances are in the Correlation
class. Where appropriate, methods with similar functions are grouped into
static nested classes. For example, the Descriptive.Sum
class contains the following
methods:
Descriptive.Sum.inversions(float[], int, int)
Descriptive.Sum.logs(float[], int, int)
Descriptive.Sum.products(float[], float[])
Descriptive.Sum.powerDeviations(float[], int, float)
Descriptive.Sum.powers(float[], int)
Descriptive.Sum.squaredDeviations(int, float)
Descriptive.Sum.squares(float[])
Descriptive.Sum.sum(float[])
Similarly, the Descriptive.Mean
class has the following methods:
Descriptive.Mean.columnMean(float[][])
Descriptive.Mean.rowMean(float[][])
Descriptive.Mean.arithmetic(float[])
Descriptive.Mean.geometric(int, float)
Descriptive.Mean.harmonic(int, float)
Descriptive.Mean.trimmed(float[], float, int, int)
Nested Class Summary | |
---|---|
static class |
Descriptive.Mean
Contains methods for computing the arithmetic, geometric, harmonic, trimmed, and winsorized means (among others). |
static class |
Descriptive.Pooled
Class for computing the pooled mean and variance of data sequences |
static class |
Descriptive.Sum
Methods for computing various different sums of datasets such as sum of inversions, logs, products, power deviations, squares, etc. |
static class |
Descriptive.Weighted
Contains methods related to weighted datasets. |
Method Summary | |
---|---|
static void |
frequencies(float[] sortedData,
ArrayList<Float> distinctValues,
ArrayList<Integer> frequencies)
Computes the frequency (number of occurrences, count) of each distinct value in the given sorted data. |
static float |
kurtosis(float[] data,
float mean,
float standardDeviation)
Returns the kurtosis (aka excess) of a data sequence, which is -3 + moment(data,4,mean) / standardDeviation^4. |
static float |
kurtosis(float moment4,
float standardDeviation)
Returns the kurtosis (aka excess) of a data sequence. |
static double |
max(double[] data)
Returns the largest member of a data sequence. |
static double |
max(double a,
double b)
|
static float |
max(float[] data)
Returns the largest member of a data sequence. |
static float |
max(float[][] data)
Returns the largest member of a matrix. |
static float |
max(float a,
float b)
|
static int |
max(int[] data)
Returns the largest member of a data sequence. |
static int |
max(int[][] data)
Returns the largest member of a matrix. |
static int |
max(int a,
int b)
|
static float |
mean(float[] data)
Returns the arithmetic mean of a data sequence; that is, Sum( data[i] ) / data.length. |
static float |
meanDeviation(float[] data,
float mean)
Returns the mean deviation of a dataset. |
static float |
median(float[] data,
boolean isSorted)
Returns the median of a data sequence. |
static double |
min(double[] data)
Returns the smallest member of a data sequence. |
static double |
min(double a,
double b)
|
static float |
min(float[] data)
Returns the smallest member of a data sequence. |
static float |
min(float[][] data)
Returns the smallest member of a matrix. |
static float |
min(float a,
float b)
|
static int |
min(int[] data)
Returns the smallest member of a data sequence. |
static int |
min(int[][] data)
Returns the smallest member of a matrix. |
static int |
min(int a,
int b)
|
static float[] |
mod(float[] data)
Returns the array containing the elements that appear the most in a given dataset. |
static float |
moment(float[] data,
int k,
float c)
Returns the moment of k-th order with constant c of a data sequence,
which is Sum( (data[i]-c)^k ) / data.length. |
static float[] |
outliers(float[] data,
float lowerLimit,
float upperLimit,
boolean isSorted)
Returns the array containing all elements in the dataset that are less than or equal to the lowerLimit
or more than or equal to the upperLimit. |
static float |
product(float[] data)
Returns the product of a data sequence, which is Prod( data[i] ) . |
static float |
product(int size,
float sumOfLogarithms)
Returns the product, which is Prod( data[i] ) . |
static float |
quantile(float[] sortedData,
float phi)
Returns the phi-quantile; that is, an element elem
such that a fraction phi of the data elements is less than
elem. |
static float |
quantileInverse(float[] sortedList,
float element)
Returns the fraction of the elements contained in sortedList that are <= element. |
static float[] |
quantiles(float[] sortedData,
float[] percentages)
Returns the quantiles of the specified percentages. |
static float[] |
quartiles(float[] data,
boolean isSorted)
Returns the quartiles of the input data array (not necessarily sorted). |
static float |
rankInterpolated(float[] sortedList,
float element)
Returns the linearly interpolated number of elements in an array that are ≤ a given element. |
static float |
rms(int size,
double sumOfSquares)
Returns the RMS (Root-Mean-Square) of a data sequence. |
static float |
skew(float[] data,
float mean,
float standardDeviation)
Returns the skew of a data sequence, which is moment(data,3,mean) / standardDeviation^3. |
static float |
skew(float moment3,
float standardDeviation)
Returns the skew of a data sequence when the 3rd moment has already been computed. |
static float[] |
std(float[][] data,
boolean unbiased)
Returns an array containing the standard deviation of each column of the input matrix. |
static float |
std(float[] data,
boolean unbiased)
Returns the standard deviation of a dataset. |
static float |
stdUnbiased(int size,
float sampleVariance)
Returns the unbiased sample standard deviation assuming the sample is normally distributed. |
static float[] |
tukeyFiveNum(float[] data)
Returns the Tukey five-number summary of a dataset, consisting of the minimum, maximum, and three quartile values. |
static float[] |
var(float[][] data,
boolean unbiased)
Returns an array containing the variance of each column of the input matrix X. |
static float |
var(float[] data,
boolean unbiased)
Returns the variance of a dataset, V. |
static float[][] |
zScore(float[][] X,
float[] means,
float[] standardDeviations)
Computes the standardized version of the input matrix. |
static float[] |
zScore(float[] x,
float mean,
float standardDeviation)
Returns the array of z-scores for a given data array. |
static float |
zScore(float x,
float mean,
float standardDeviation)
|
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Method Detail |
---|
public static void frequencies(float[] sortedData, ArrayList<Float> distinctValues, ArrayList<Integer> frequencies)
Computes the frequency (number of occurrences, count) of each distinct value in the given sorted data.
After the call, distinctValues and frequencies have a new size (which is equal for both),
which is the number of distinct values in the sorted data.
Distinct values are filled into distinctValues, starting at index 0.
The frequency of each distinct value is filled into frequencies, starting at index 0.
As a result, the smallest distinct value (and its frequency) can be found at index 0,
the second smallest distinct value (and its frequency) at index 1, ...,
the largest distinct value (and its frequency) at index distinctValues.size()-1.
Example:
sortedData = (5,6,6,7,8,8) --> distinctValues = (5,6,7,8), frequencies = (1,2,1,2)
Code-wise, you would write:
ArrayList<Float> distinctValues = new ArrayList<Float>();
ArrayList<Integer> frequencies = new ArrayList<Integer>();
Descriptive.frequencies(sortedData, distinctValues, frequencies);
sortedData - the data; must be sorted ascending.
distinctValues - an ArrayList to be filled with the distinct values; can have any size.
frequencies - an ArrayList to be filled with the frequencies; can have any size; set this parameter to null to ignore it.
See also: Frequency, Unique
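The counting described above can be sketched as a single pass over the sorted array. This is only an illustration under stated assumptions, not papaya's implementation: the class name FrequenciesSketch and the parameter name counts are hypothetical, and the lists are typed as List rather than ArrayList for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class FrequenciesSketch {
    // Hypothetical helper: walks the ascending data once and starts a new
    // entry every time the value changes. Passing null for counts skips them.
    static void frequencies(float[] sortedData, List<Float> distinctValues, List<Integer> counts) {
        distinctValues.clear();
        if (counts != null) counts.clear();
        int i = 0;
        while (i < sortedData.length) {
            float value = sortedData[i];
            int runStart = i;
            // Float.compare treats NaN as equal to itself, so the loop always advances.
            while (i < sortedData.length && Float.compare(sortedData[i], value) == 0) i++;
            distinctValues.add(value);
            if (counts != null) counts.add(i - runStart);
        }
    }

    public static void main(String[] args) {
        float[] sortedData = {5, 6, 6, 7, 8, 8};
        List<Float> distinctValues = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        frequencies(sortedData, distinctValues, counts);
        System.out.println(distinctValues + " " + counts); // [5.0, 6.0, 7.0, 8.0] [1, 2, 1, 2]
    }
}
```

Because the input is sorted, equal values are adjacent, so no hash map or second pass is needed.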
public static float kurtosis(float moment4, float standardDeviation)
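Returns the kurtosis (aka excess) of a data sequence.
moment4 - the fourth central moment, which is moment(data,4,mean).
standardDeviation - the standard deviation.
public static float kurtosis(float[] data, float mean, float standardDeviation)
Returns the kurtosis (aka excess) of a data sequence, which is -3 + moment(data,4,mean) / standardDeviation^4.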
public static int max(int a, int b)
public static float max(float a, float b)
public static double max(double a, double b)
public static double max(double[] data)
public static float max(float[] data)
public static int max(int[] data)
public static float max(float[][] data)
public static int max(int[][] data)
public static float mean(float[] data)
Returns the arithmetic mean of a data sequence; that is, Sum( data[i] ) / data.length.
Similar to Descriptive.Mean.arithmetic(float[]).
public static float meanDeviation(float[] data, float mean)
Returns the mean deviation of a dataset, which is Sum( Math.abs(data[i]-mean) ) / data.length.
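As a worked instance of the mean deviation formula, the following minimal sketch averages the absolute distances from a supplied mean. The class name MeanDeviationSketch is hypothetical, and accumulating in double is a choice made here, not something the documentation states.

```java
class MeanDeviationSketch {
    // Sum( |data[i] - mean| ) / data.length, accumulated in double to limit rounding.
    static float meanDeviation(float[] data, float mean) {
        double sum = 0.0;
        for (float v : data) {
            sum += Math.abs(v - mean);
        }
        return (float) (sum / data.length);
    }

    public static void main(String[] args) {
        float[] data = {1, 2, 3, 4};
        // Deviations from 2.5 are 1.5, 0.5, 0.5, 1.5, so the mean deviation is 1.0.
        System.out.println(meanDeviation(data, 2.5f));
    }
}
```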
public static float median(float[] data, boolean isSorted)
Returns the median of a data sequence.
data - the data sequence.
isSorted - true if the data sequence is sorted (in ascending order), else false.
public static int min(int a, int b)
public static float min(float a, float b)
public static double min(double a, double b)
public static double min(double[] data)
public static float min(float[] data)
public static float min(float[][] data)
public static int min(int[] data)
public static int min(int[][] data)
public static float[] mod(float[] data)
Returns the array containing the elements that appear most often in a given dataset.
data - the data array
public static float moment(float[] data, int k, float c)
Returns the moment of k-th order with constant c of a data sequence,
which is Sum( (data[i]-c)^k ) / data.length.
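The moment formula, together with the skew and kurtosis formulas documented on this page, can be sketched as follows. This is an illustration rather than papaya's code: the class name MomentSketch is hypothetical, and the example assumes the population standard deviation (the square root of the second central moment) is the one passed in.

```java
class MomentSketch {
    // Sum( (data[i] - c)^k ) / data.length
    static float moment(float[] data, int k, float c) {
        double sum = 0.0;
        for (float v : data) {
            sum += Math.pow(v - c, k);
        }
        return (float) (sum / data.length);
    }

    // moment3 / standardDeviation^3
    static float skew(float moment3, float standardDeviation) {
        return (float) (moment3 / Math.pow(standardDeviation, 3));
    }

    // -3 + moment4 / standardDeviation^4
    static float kurtosis(float moment4, float standardDeviation) {
        return (float) (-3.0 + moment4 / Math.pow(standardDeviation, 4));
    }

    public static void main(String[] args) {
        float[] data = {1, 2, 3, 4, 5};
        float mean = 3f;
        // Assumption: population standard deviation, i.e. sqrt of the second central moment.
        float sd = (float) Math.sqrt(moment(data, 2, mean));
        System.out.println("skew = " + skew(moment(data, 3, mean), sd));
        System.out.println("kurtosis = " + kurtosis(moment(data, 4, mean), sd));
    }
}
```

For this symmetric sample the third central moment is zero, so the skew is zero, and the excess kurtosis is close to -1.3 (fourth moment 6.8 divided by variance squared 4, minus 3).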
public static float[] outliers(float[] data, float lowerLimit, float upperLimit, boolean isSorted)
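Returns the array containing all elements in the dataset that are less than or equal to the lowerLimit
or more than or equal to the upperLimit.
data - the data array.
lowerLimit - the lower limit.
upperLimit - the upper limit.
isSorted - true if the data array has been sorted in ascending order, else set to false.
public static float product(int size, float sumOfLogarithms)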
Returns the product, which is Prod( data[i] ).
In other words: data[0]*data[1]*...*data[data.length-1].
This method uses the equivalent definition:
prod = pow( exp( Sum( Log(x[i]) ) / length ), length ).
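The log-based definition above can be checked with a short sketch. It assumes strictly positive data (the logarithm is undefined otherwise); the class name ProductSketch and the helper sumOfLogs are hypothetical and stand in for whatever computes the sum of logarithms in practice.

```java
class ProductSketch {
    // Hypothetical helper: Sum( log(data[i]) ); only meaningful for positive values.
    static float sumOfLogs(float[] data) {
        double sum = 0.0;
        for (float v : data) {
            sum += Math.log(v);
        }
        return (float) sum;
    }

    // pow( exp( sumOfLogarithms / size ), size )
    static float product(int size, float sumOfLogarithms) {
        return (float) Math.pow(Math.exp(sumOfLogarithms / size), size);
    }

    public static void main(String[] args) {
        float[] data = {2, 4, 8};
        // 2 * 4 * 8 = 64; the log-based route agrees up to floating-point rounding.
        System.out.println(product(data.length, sumOfLogs(data)));
    }
}
```

The inner term exp( Sum / size ) is the geometric mean, so the expression is the geometric mean raised to the number of elements.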
public static float product(float[] data)
Returns the product of a data sequence, which is Prod( data[i] ).
In other words: data[0]*data[1]*...*data[data.length-1].
Note that you may easily get numeric overflows. Use product(int,float) instead to avoid that.
public static float[] quartiles(float[] data, boolean isSorted)
Returns the quartiles of the input data array (not necessarily sorted).
Details:
The first quartile, or lower quartile (Q[0]), is the value that cuts off
the first 25% of the data when it is sorted in ascending order.
The second quartile, or median (Q[1]), is the value that cuts off the first 50%.
The third quartile, or upper quartile (Q[2]), is the value that cuts off the first 75%.
data - the data array.
isSorted - true if the data array has been sorted in ascending order, else set to false.
public static float quantile(float[] sortedData, float phi)
Returns the phi-quantile; that is, an element elem
such that a fraction phi of the data elements is less than elem.
The quantile need not be contained in the data sequence;
it can be a linear interpolation.
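One common way to realize such an interpolated quantile is to map phi onto the index range [0, n-1] and interpolate between the two neighbouring elements. The sketch below uses that convention as an assumption; the documentation does not say which interpolation rule papaya applies, and the class name QuantileSketch is hypothetical.

```java
class QuantileSketch {
    // Assumed convention: fractional index = phi * (n - 1), linear interpolation
    // between the elements on either side of that index.
    static float quantile(float[] sortedData, float phi) {
        int n = sortedData.length;
        if (n == 0) throw new IllegalArgumentException("data must not be empty");
        if (phi < 0f || phi > 1f) throw new IllegalArgumentException("phi must be in [0,1]");
        double index = (double) phi * (n - 1);
        int lower = (int) Math.floor(index);
        if (lower >= n - 1) return sortedData[n - 1];
        double fraction = index - lower;
        return (float) (sortedData[lower] + fraction * (sortedData[lower + 1] - sortedData[lower]));
    }

    public static void main(String[] args) {
        float[] sortedData = {1, 2, 3, 4};
        System.out.println(quantile(sortedData, 0.5f));  // 2.5, halfway between 2 and 3
        System.out.println(quantile(sortedData, 0.25f)); // 1.75
    }
}
```

Other conventions (for example, ones based on phi * n or phi * (n + 1)) give slightly different values on small samples, so results should be compared against the library before relying on this sketch.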
sortedData - the data sequence; must be sorted ascending.
phi - the percentage; must satisfy 0 <= phi <= 1.
public static float quantileInverse(float[] sortedList, float element)
Returns the fraction of the elements contained in sortedList that are <= element.
Does linear interpolation if the element is not contained but lies in between
two contained elements.
sortedList - the list to be searched (must be sorted ascending).
element - the element to search for.
Returns: the fraction phi of elements <= element (0.0 <= phi <= 1.0).
public static float[] quantiles(float[] sortedData, float[] percentages)
Returns the quantiles of the specified percentages.
sortedData - the data sequence; must be sorted ascending.
percentages - the percentages for which quantiles are to be computed.
Each percentage must be in the interval [0.0f,1.0f].
public static float rankInterpolated(float[] sortedList, float element)
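Returns the linearly interpolated number of elements in an array that are ≤ a given element.
The rank is the number of elements ≤ element, so exact ranks are of the form {0, 1, 2,..., sortedList.length}.
If no element is ≤ element, then the rank is zero.
If the element lies in between two contained elements, then linear interpolation is used and a non-integer value is returned.
sortedList - the list to be searched (must be sorted ascending).
element - the element to search for.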
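The rank rules above, and the inverse quantile that follows from dividing the rank by the number of elements, can be sketched like this. The linear scan and the exact interpolation rule are assumptions made for illustration, not a statement of how papaya implements these methods; the class name RankSketch is hypothetical.

```java
class RankSketch {
    // Counts elements <= element; if element falls strictly between two
    // neighbours, adds the fractional position between them (assumed rule).
    static float rankInterpolated(float[] sortedList, float element) {
        int n = sortedList.length;
        int count = 0;
        while (count < n && sortedList[count] <= element) count++;
        if (count == 0 || count == n) return count;   // below all, or at/above the largest
        float below = sortedList[count - 1];
        if (below == element) return count;           // exact hit: integer rank
        float above = sortedList[count];
        return count + (element - below) / (above - below);
    }

    // Fraction of elements <= element, in [0, 1].
    static float quantileInverse(float[] sortedList, float element) {
        return rankInterpolated(sortedList, element) / sortedList.length;
    }

    public static void main(String[] args) {
        float[] sortedList = {1, 2, 4};
        System.out.println(rankInterpolated(sortedList, 0f)); // 0.0, no element is <= 0
        System.out.println(rankInterpolated(sortedList, 2f)); // 2.0, exact hit
        System.out.println(rankInterpolated(sortedList, 3f)); // 2.5, halfway between 2 and 4
    }
}
```

A binary search would be the natural replacement for the linear scan on large arrays; the scan is used here only to keep the sketch short.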
public static float rms(int size, double sumOfSquares)
Returns the RMS (Root-Mean-Square) of a data sequence, which is Math.sqrt(Sum( data[i]*data[i] ) / data.length).
The RMS of a data sequence is the square root of the mean of the squares of its elements.
It is a measure of the average "size" of the elements of a data sequence.
sumOfSquares - sumOfSquares(data) == Sum( data[i]*data[i] ) of the data sequence.
size - the number of elements in the data sequence.
public static float skew(float moment3, float standardDeviation)
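Returns the skew of a data sequence when the 3rd moment has already been computed.
moment3 - the third central moment, which is moment(data,3,mean).
standardDeviation - the standard deviation.
public static float skew(float[] data, float mean, float standardDeviation)
Returns the skew of a data sequence, which is moment(data,3,mean) / standardDeviation^3.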
public static float std(float[] data, boolean unbiased)
Returns the standard deviation of a dataset. There are two definitions:
sigma_1 = sqrt( 1/(N-1) * Sum( (x[i] - mean(x))^2 ) )
sigma_2 = sqrt( 1/N * Sum( (x[i] - mean(x))^2 ) )
sigma_1 is the square root of an unbiased estimator of the variance of the population from which x is drawn, as long as x consists of independent, identically distributed samples. sigma_2 is the square root of the second moment of the set of values about their mean.
std(data, true) returns sigma_1 above, while std(data, false) returns sigma_2.
data - the dataset.
unbiased - set to true to return the unbiased standard deviation, false to return the biased version.
public static float[] std(float[][] data, boolean unbiased)
Returns an array containing the standard deviation of each column of the input matrix. For each column x there are two definitions:
sigma_1 = sqrt( 1/(N-1) * Sum( (x[i] - mean(x))^2 ) )
sigma_2 = sqrt( 1/N * Sum( (x[i] - mean(x))^2 ) )
sigma_1 is the square root of an unbiased estimator of the variance of the population from which x is drawn, as long as x consists of independent, identically distributed samples. sigma_2 is the square root of the second moment of the set of values about their mean.
std(data, true) returns sigma_1 above, while std(data, false) returns sigma_2.
data - the dataset.
unbiased - set to true to return the unbiased standard deviation, false to return the biased version.
public static float stdUnbiased(int size, float sampleVariance)
Returns the unbiased sample standard deviation assuming the sample is normally distributed.
See also this entry on wikipedia.org
size - the number of elements of the data sequence.
sampleVariance - the sample variance.
public static float[] tukeyFiveNum(float[] data)
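Returns the Tukey five-number summary of a dataset, consisting of the minimum, maximum, and three quartile values.
data - the data array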
public static float var(float[] data, boolean unbiased)
Returns the variance of a dataset, V.
V = var(x,true) normalizes by N - 1 if N > 1, where N is the sample size. This is an unbiased estimator of the variance of the population from which x is drawn, as long as x consists of independent, identically distributed samples. For N = 1, V is normalized by 1.
V = var(x,false) normalizes by N and produces the second moment of the sample about its mean.
Reference:
Algorithms for calculating variance, Wikipedia.org
Incremental calculation of weighted mean and variance, Tony Finch
data - the data sequence.
unbiased - set to true to return the unbiased variance (division by (N-1)), false to return the biased value (division by N).
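The normalization rules for the variance, and the standard deviation as its square root, can be illustrated with a plain two-pass computation. This is a sketch under stated assumptions: the class name VarianceSketch is hypothetical, and the two-pass approach is chosen for readability; it is not claimed to be the algorithm papaya uses (the references above describe incremental alternatives).

```java
class VarianceSketch {
    // Two-pass variance: first the mean, then the summed squared deviations.
    // unbiased == true divides by N - 1 when N > 1 (and by 1 when N == 1);
    // unbiased == false divides by N.
    static float var(float[] data, boolean unbiased) {
        int n = data.length;
        double mean = 0.0;
        for (float v : data) mean += v;
        mean /= n;
        double sumSq = 0.0;
        for (float v : data) sumSq += (v - mean) * (v - mean);
        int divisor = (unbiased && n > 1) ? n - 1 : n;
        return (float) (sumSq / divisor);
    }

    // Standard deviation as the square root of the corresponding variance.
    static float std(float[] data, boolean unbiased) {
        return (float) Math.sqrt(var(data, unbiased));
    }

    public static void main(String[] args) {
        float[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        // Mean is 5 and the squared deviations sum to 32.
        System.out.println(var(data, false)); // 32 / 8 = 4.0
        System.out.println(std(data, false)); // 2.0
        System.out.println(var(data, true));  // 32 / 7, about 4.571
    }
}
```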
public static float[] var(float[][] data, boolean unbiased)
Returns an array containing the variance of each column of the input matrix X.
V = var(X,true) normalizes by N - 1 if N > 1, where N is the sample size. This is an unbiased estimator of the variance of the population from which X is drawn, as long as X consists of independent, identically distributed samples. For N = 1, V is normalized by 1.
V = var(X,false) normalizes by N and produces the second moment of the sample about its mean.
Reference:
Algorithms for calculating variance, Wikipedia.org
Incremental calculation of weighted mean and variance, Tony Finch
data - the data sequence.
unbiased - set to true to return the unbiased variance (division by (N-1)), false to return the biased value (division by N).
public static float zScore(float x, float mean, float standardDeviation)
Returns (x-mean)/standardDeviation.
public static float[] zScore(float[] x, float mean, float standardDeviation)
Returns the array of z-scores for a given data array, where z[i] = ( x[i] - mean ) / standardDeviation.
public static float[][] zScore(float[][] X, float[] means, float[] standardDeviations)
Computes the standardized version of the input matrix. Each column, j, of the output is given by
Z[i][j] = ( X[i][j] - mean(X[,j]) ) / standardDeviation(X[,j]),
where mean(X[,j]) is the mean value of column j of X.
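A minimal sketch of the column-wise standardization follows. It assumes X is stored row-major, so X[i][j] is row i, column j, and that the column means and standard deviations have already been computed; the class name ZScoreSketch is hypothetical.

```java
import java.util.Arrays;

class ZScoreSketch {
    // Z[i][j] = (X[i][j] - means[j]) / standardDeviations[j], one mean and
    // one standard deviation per column. The input matrix is left unchanged.
    static float[][] zScore(float[][] X, float[] means, float[] standardDeviations) {
        float[][] Z = new float[X.length][];
        for (int i = 0; i < X.length; i++) {
            Z[i] = new float[X[i].length];
            for (int j = 0; j < X[i].length; j++) {
                Z[i][j] = (X[i][j] - means[j]) / standardDeviations[j];
            }
        }
        return Z;
    }

    public static void main(String[] args) {
        float[][] X = {{1, 10}, {3, 30}};
        float[] means = {2, 20};
        float[] sds = {1, 10};
        // Prints [[-1.0, -1.0], [1.0, 1.0]]
        System.out.println(Arrays.deepToString(zScore(X, means, sds)));
    }
}
```

A column whose standard deviation is zero would produce NaN or infinite entries here; how the library handles that case is not stated in the documentation.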