Aggregate Functions
count
Counts the number of rows or not-NULL values.
ClickHouse supports the following syntaxes for count
:
count(expr)
orCOUNT(DISTINCT expr)
.count()
orCOUNT(*)
. Thecount()
syntax is ClickHouse-specific.
Arguments
The function can take:
Zero parameters.
One expression.
Returned value
If the function is called without parameters it counts the number of rows.
If the expression is passed, then the function counts how many times this expression returned not null. If the expression returns a Nullable-type value, then the result of
count
stays notNullable
. The function returns 0 if the expression returnedNULL
for all the rows.
In both cases the type of the returned value is UInt64.
Details
ClickHouse supports the COUNT(DISTINCT ...)
syntax. The behavior of this construction depends on the count_distinct_implementation setting. It defines which of the uniq* functions is used to perform the operation. The default is the uniqExact function.
The SELECT count() FROM table
query is optimized by default using metadata from MergeTree. If you need to use row-level security, disable optimization using the optimize_trivial_count_query setting.
However SELECT count(nullable_column) FROM table
query can be optimized by enabling the optimize_functions_to_subcolumns setting. With optimize_functions_to_subcolumns = 1
the function reads only null subcolumn instead of reading and processing the whole column data. The query SELECT count(n) FROM table
transforms to SELECT sum(NOT n.null) FROM table
.
Examples
Example 1:
Example 2:
This example shows that count(DISTINCT num)
is performed by the uniqExact
function according to the count_distinct_implementation
setting value.
min
Aggregate function that calculates the minimum across a group of values.
Example:
If you need non-aggregate function to choose a minimum of two values, see least
:
max
Aggregate function that calculates the maximum across a group of values.
Example:
If you need non-aggregate function to choose a maximum of two values, see greatest
:
sum
Calculates the sum. Only works for numbers.
avg
Calculates the arithmetic mean.
Syntax
Arguments
Returned value
The arithmetic mean, always as Float64.
NaN
if the input parameterx
is empty.
Example
Query:
Result:
Example
Create a temp table:
Query:
Get the arithmetic mean:
Query:
Result:
any
Selects the first encountered (non-NULL) value, unless all rows have NULL values in that column. The query can be executed in any order and even in a different order each time, so the result of this function is indeterminate. To get a determinate result, you can use the ‘min’ or ‘max’ function instead of ‘any’.
In some cases, you can rely on the order of execution. This applies to cases when SELECT comes from a subquery that uses ORDER BY.
When a SELECT
query has the GROUP BY
clause or at least one aggregate function, ClickHouse (in contrast to MySQL) requires that all expressions in the SELECT
, HAVING
, and ORDER BY
clauses be calculated from keys or from aggregate functions. In other words, each column selected from the table must be used either in keys or inside aggregate functions. To get behavior like in MySQL, you can put the other columns in the any
aggregate function.
stddevPop
The result is equal to the square root of varPop.
NOTE
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevPopStable
function. It works slower but provides a lower computational error.
stddevSamp
The result is equal to the square root of varSamp.
NOTE
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevSampStable
function. It works slower but provides a lower computational error.
varPop(x)
Calculates the amount Σ((x - x̅)^2) / n
, where n
is the sample size and x̅
is the average value of x
.
In other words, dispersion for a set of values. Returns Float64
.
NOTE
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the varPopStable
function. It works slower but provides a lower computational error.
varSamp
Calculates the amount Σ((x - x̅)^2) / (n - 1)
, where n
is the sample size and x̅
is the average value of x
.
It represents an unbiased estimate of the variance of a random variable if passed values from its sample.
Returns Float64
. When n <= 1
, returns +∞
.
NOTE
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the varSampStable
function. It works slower but provides a lower computational error.
covarPop
Syntax: covarPop(x, y)
Calculates the value of Σ((x - x̅)(y - y̅)) / n
.
NOTE
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarPopStable
function. It works slower but provides a lower computational error.
covarSamp
Calculates the value of Σ((x - x̅)(y - y̅)) / (n - 1)
.
Returns Float64. When n <= 1
, returns +∞.
NOTE
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarSampStable
function. It works slower but provides a lower computational error.
anyHeavy
Selects a frequently occurring value using the heavy hitters algorithm. If there is a value that occurs more than in half the cases in each of the query’s execution threads, this value is returned. Normally, the result is nondeterministic.
Arguments
column
– The column name.
Example
Take the OnTime data set and select any frequently occurring value in the AirlineID
column.
anyLast
Selects the last value encountered. The result is just as indeterminate as for the any function.
argMin
Calculates the arg
value for a minimum val
value. If there are several different values of arg
for minimum values of val
, returns the first of these values encountered.
Syntax
Arguments
arg
— Argument.val
— Value.
Returned value
arg
value that corresponds to minimumval
value.
Type: matches arg
type.
Example
Input table:
Query:
Result:
avgWeighted
Calculates the weighted arithmetic mean.
Syntax
Arguments
x
— Values.weight
— Weights of the values.
x
and weight
must both be Integer, floating-point, or Decimal, but may have different types.
Returned value
NaN
if all the weights are equal to 0 or the supplied weights parameter is empty.Weighted mean otherwise.
Return type is always Float64.
Example
Query:
Result:
Example
Query:
Result:
Example
Query:
Result:
Example
Query:
Result:
topK
Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves).
Implements the Filtered Space-Saving algorithm for analyzing TopK, based on the reduce-and-combine algorithm from Parallel Space Saving.
This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren’t the most frequent values.
We recommend using the N < 10
value; performance is reduced with large N
values. Maximum value of N = 65536
.
Arguments
N
– The number of elements to return.
If the parameter is omitted, default value 10 is used.
Arguments
x
– The value to calculate frequency.
Example
Take the OnTime data set and select the three most frequently occurring values in the AirlineID
column.
topKWeighted
Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves). Additionally, the weight of the value is taken into account.
Syntax
Arguments
N
— The number of elements to return.x
— The value.weight
— The weight. Every value is accountedweight
times for frequency calculation. UInt64.
Returned value
Returns an array of the values with maximum approximate sum of weights.
Example
Query:
Result:
See Also
groupArray
Syntax: groupArray(x)
or groupArray(max_size)(x)
Creates an array of argument values. Values can be added to the array in any (indeterminate) order.
The second version (with the max_size
parameter) limits the size of the resulting array to max_size
elements. For example, groupArray(1)(x)
is equivalent to [any (x)]
.
In some cases, you can still rely on the order of execution. This applies to cases when SELECT
comes from a subquery that uses ORDER BY
.
Example
Query:
Result:
The groupArray function will remove ᴺᵁᴸᴸ value based on the above results.
groupUniqArray
Syntax: groupUniqArray(x)
or groupUniqArray(max_size)(x)
Creates an array from different argument values. Memory consumption is the same as for the uniqExact function.
The second version (with the max_size
parameter) limits the size of the resulting array to max_size
elements. For example, groupUniqArray(1)(x)
is equivalent to [any(x)]
.
groupArrayInsertAt
Inserts a value into the array at the specified position.
Syntax
If in one query several values are inserted into the same position, the function behaves in the following ways:
If a query is executed in a single thread, the first one of the inserted values is used.
If a query is executed in multiple threads, the resulting value is an undetermined one of the inserted values.
Arguments
x
— Value to be inserted. Expression resulting in one of the supported data types.pos
— Position at which the specified elementx
is to be inserted. Index numbering in the array starts from zero. UInt32.default_x
— Default value for substituting in empty positions. Optional parameter. Expression resulting in the data type configured for thex
parameter. Ifdefault_x
is not defined, the default values are used.size
— Length of the resulting array. Optional parameter. When using this parameter, the default valuedefault_x
must be specified. UInt32.
Returned value
Array with inserted values.
Type: Array.
Example
Query:
Result:
Query:
Result:
Query:
Result:
Multi-threaded insertion of elements into one position.
Query:
As a result of this query you get random integer in the [0,9]
range. For example:
groupArrayMovingAvg
Calculates the moving average of input values.
The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of rows in the column.
Arguments
numbers_for_summing
— Expression resulting in a numeric data type value.window_size
— Size of the calculation window.
Returned values
Array of the same size and type as the input data.
The function uses rounding towards zero. It truncates the decimal places insignificant for the resulting data type.
Example
The sample table b
:
The queries:
sumWithOverflow
Computes the sum of the numbers, using the same data type for the result as for the input parameters. If the sum exceeds the maximum value for this data type, it is calculated with overflow.
Only works for numbers.
sumMap
Syntax: sumMap(key, value)
or sumMap(Tuple(key, value))
Totals the value
array according to the keys specified in the key
array.
Passing tuple of keys and values arrays is a synonym to passing two arrays of keys and values.
The number of elements in key
and value
must be the same for each row that is totaled.
Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
Example:
minMap
Syntax: minMap(key, value)
or minMap(Tuple(key, value))
Calculates the minimum from value
array according to the keys specified in the key
array.
Passing a tuple of keys and value arrays is identical to passing two arrays of keys and values.
The number of elements in key
and value
must be the same for each row that is totaled.
Returns a tuple of two arrays: keys in sorted order, and values calculated for the corresponding keys.
Example:
maxMap
Syntax: maxMap(key, value)
or maxMap(Tuple(key, value))
Calculates the maximum from value
array according to the keys specified in the key
array.
Passing a tuple of keys and value arrays is identical to passing two arrays of keys and values.
The number of elements in key
and value
must be the same for each row that is totaled.
Returns a tuple of two arrays: keys and values calculated for the corresponding keys.
Example:
skewSamp
Computes the sample skewness of a sequence.
It represents an unbiased estimate of the skewness of a random variable if passed values form its sample.
Arguments
expr
— Expression returning a number.
Returned value
The skewness of the given distribution. Type — Float64. If n <= 1
(n
is the size of the sample), then the function returns nan
.
Example
skewPop
Computes the skewness of a sequence.
Arguments
expr
— Expression returning a number.
Returned value
The skewness of the given distribution. Type — Float64
Example
kurtSamp
Computes the sample kurtosis of a sequence.
It represents an unbiased estimate of the kurtosis of a random variable if passed values form its sample.
Arguments
expr
— Expression returning a number.
Returned value
The kurtosis of the given distribution. Type — Float64. If n <= 1
(n
is a size of the sample), then the function returns nan
.
Example
kurtPop
Computes the kurtosis of a sequence.
Arguments
expr
— Expression returning a number.
Returned value
The kurtosis of the given distribution. Type — Float64
Example
uniq
Calculates the approximate number of different values of the argument.
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses an adaptive sampling algorithm. For the calculation state, the function uses a sample of element hash values up to 65536. This algorithm is very accurate and very efficient on the CPU. When the query contains several of these functions, using
uniq
is almost as fast as using other aggregate functions.Provides the result deterministically (it does not depend on the query processing order).
We recommend using this function in almost all scenarios.
uniqExact
Calculates the exact number of different argument values.
Use the uniqExact
function if you absolutely need an exact result. Otherwise use the uniq function.
The uniqExact
function uses more memory than uniq
, because the size of the state has unbounded growth as the number of different values increases.
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
uniqCombined
Calculates the approximate number of different argument values.
The uniqCombined
function is a good choice for calculating the number of different values.
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
HLL_precision
is the base-2 logarithm of the number of cells in HyperLogLog. Optional, you can use the function as uniqCombined(x[, ...])
. The default value for HLL_precision
is 17, which is effectively 96 KiB of space (2^17 cells, 6 bits each).
Returned value
A number UInt64-type number.
Implementation details
Function:
Calculates a hash (64-bit hash for
String
and 32-bit otherwise) for all parameters in the aggregate, then uses it in calculations.Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.
Provides the result deterministically (it does not depend on the query processing order).
NOTE
Since it uses 32-bit hash for non-String
type, the result will have very high error for cardinalities significantly larger than UINT_MAX
(error will raise quickly after a few tens of billions of distinct values), hence in this case you should use uniqCombined64
Compared to the uniq function, the uniqCombined
:
Consumes several times less memory.
Calculates with several times higher accuracy.
Usually has slightly lower performance. In some scenarios,
uniqCombined
can perform better thanuniq
, for example, with distributed queries that transmit a large number of aggregation states over the network.
uniqCombined64
Same as uniqCombined, but uses 64-bit hash for all data types.
uniqHLL12
Calculates the approximate number of different argument values, using the HyperLogLog algorithm.
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses the HyperLogLog algorithm to approximate the number of different argument values.
Provides the determinate result (it does not depend on the query processing order).
We do not recommend using this function. In most cases, use the uniq or uniqCombined function.
quantile
Computes an approximate quantile of a numeric data sequence.
This function applies reservoir sampling with a reservoir size up to 8192 and a random number generator for sampling. The result is non-deterministic. To get an exact quantile, use the quantileExact function.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.
Note that for an empty numeric sequence, quantile
will return NaN, but its quantile*
variants will return either NaN or a default value for the sequence type, depending on the variant.
Syntax
Alias: median
.
Arguments
level
— Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
— Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
Approximate quantile of the specified level.
Type:
Float64 for numeric data type input.
Date if input values have the
Date
type.DateTime if input values have the
DateTime
type.
Example
Input table:
Query:
Result:
quantiles
Syntax: quantiles(level1, level2, …)(x)
All the quantile functions also have corresponding quantiles functions: quantiles
, quantilesDeterministic
, quantilesTiming
, quantilesTimingWeighted
, quantilesExact
, quantilesExactWeighted
, quantilesTDigest
, quantilesBFloat16
. These functions calculate all the quantiles of the listed levels in one pass, and return an array of the resulting values.
quantilesExactExclusive
Exactly computes the quantiles of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is a number of values that were passed. However, for a small number of values, the function is very effective.
This function is equivalent to PERCENTILE.EXC Excel function, (type R6).
Works more efficiently with sets of levels than quantileExactExclusive.
Syntax
Arguments
expr
— Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
— Levels of quantiles. Possible values: (0, 1) — bounds not included. Float.
Returned value
Array of quantiles of the specified levels.
Type of array values:
Float64 for numeric data type input.
Date if input values have the
Date
type.DateTime if input values have the
DateTime
type.
Example
Query:
Result:
quantilesExactInclusive
Exactly computes the quantiles of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is a number of values that were passed. However, for a small number of values, the function is very effective.
This function is equivalent to PERCENTILE.INC Excel function, (type R7).
Works more efficiently with sets of levels than quantileExactInclusive.
Syntax
Arguments
expr
— Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
— Levels of quantiles. Possible values: [0, 1] — bounds included. Float.
Returned value
Array of quantiles of the specified levels.
Type of array values:
Float64 for numeric data type input.
Date if input values have the
Date
type.DateTime if input values have the
DateTime
type.
Example
Query:
Result:
quantileExactWeighted
Exactly computes the quantile of a numeric data sequence, taking into account the weight of each element.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Each value is counted with its weight, as if it is present weight
times. A hash table is used in the algorithm. Because of this, if the passed values are frequently repeated, the function consumes less RAM than quantileExact. You can use this function instead of quantileExact
and specify the weight 1.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.
Syntax
Alias: medianExactWeighted
.
Arguments
level
— Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
— Expression over the column values resulting in numeric data types, Date or DateTime.weight
— Column with weights of sequence members. Weight is a number of value occurrences.
Returned value
Quantile of the specified level.
Type:
Float64 for numeric data type input.
Date if input values have the
Date
type.DateTime if input values have the
DateTime
type.
Example
Input table:
Query:
Result:
quantileTiming
With the determined precision computes the quantile of a numeric data sequence.
The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like loading web pages times or backend response times.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.
Syntax
Alias: medianTiming
.
Arguments
level
— Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
— Expression over a column values returning a Float*-type number.If negative values are passed to the function, the behavior is undefined.
If the value is greater than 30,000 (a page loading time of more than 30 seconds), it is assumed to be 30,000.
Accuracy
The calculation is accurate if:
Total number of values does not exceed 5670.
Total number of values exceeds 5670, but the page loading time is less than 1024ms.
Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.
NOTE
For calculating page loading time quantiles, this function is more effective and accurate than quantile.
Returned value
Quantile of the specified level.
Type: Float32
.
NOTE
If no values are passed to the function (when using quantileTimingIf
), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero. See ORDER BY clause for notes on sorting NaN
values.
Example
Input table:
Query:
Result:
quantileTimingWeighted
With the determined precision computes the quantile of a numeric data sequence according to the weight of each sequence member.
The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like loading web pages times or backend response times.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.
Syntax
Alias: medianTimingWeighted
.
Arguments
level
— Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
— Expression over a column values returning a Float*-type number.weight
— Column with weights of sequence elements. Weight is a number of value occurrences.
Accuracy
The calculation is accurate if:
Total number of values does not exceed 5670.
Total number of values exceeds 5670, but the page loading time is less than 1024ms.
Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.
NOTE
For calculating page loading time quantiles, this function is more effective and accurate than quantile.
Returned value
Quantile of the specified level.
Type: Float32
.
NOTE
If no values are passed to the function (when using quantileTimingIf
), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero. See ORDER BY clause for notes on sorting NaN
values.
Example
Input table:
Query:
Result:
quantilesTimingWeighted
Same as quantileTimingWeighted
, but accept multiple parameters with quantile levels and return an Array filled with many values of that quantiles.
Example
Input table:
Query:
Result:
Last updated