I'm profiling some numeric time measurements that cluster extremely closely. I would like to obtain mean, standard deviation, etc. Some inputs are large, so I thought I could avoid creating lists of millions of numbers and instead use Python collections.Counter objects as a compact representation.
Example: one of my small inputs yields a collections.Counter whose items look
like [(48, 4082), (49, 1146)],
which means 4,082 occurrences of the value 48 and 1,146 occurrences of the value 49. For this data set I manually calculate the mean to be something like 48.2192042846.
Of course if I had a simple list of 4,082 + 1,146 = 5,228 integers I would just feed it to numpy.mean().
My question: how can I calculate descriptive statistics from the values in a collections.Counter
object just as if I had a list of numbers? Do I have to create the full list or is there a shortcut?
While you can offload everything to numpy
after making a list of values, this will be slower than needed. Instead, you can use the actual definitions of what you need.
The mean is just the sum of all numbers divided by their count, so that's very simple:
sum_of_numbers = sum(number*count for number, count in counter.items())
count = sum(counter.values())
mean = sum_of_numbers / count
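As a quick sanity check, running these lines against the sample Counter from the question (rebuilt here as a Counter literal) reproduces the mean the asker computed by hand:

```python
from collections import Counter

# Sample data from the question: 4,082 occurrences of 48, 1,146 of 49
counter = Counter({48: 4082, 49: 1146})

sum_of_numbers = sum(number * count for number, count in counter.items())
count = sum(counter.values())  # total number of observations: 5,228
mean = sum_of_numbers / count
print(mean)  # ≈ 48.2192042846
```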
Standard deviation is a bit more involved. It's the square root of the variance, and the variance is the "mean of the squares minus the square of the mean" of your collection. So:
total_squares = sum(number * number * count for number, count in counter.items())
mean_of_squares = total_squares / count
variance = mean_of_squares - mean * mean
std_dev = math.sqrt(variance)  # requires: import math
A bit more manual work, but it should also be much faster than building the full list whenever the data has a lot of repetition.
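Putting the two pieces together, here's a self-contained sketch of the whole recipe (the function name `counter_stats` is just an illustrative choice, not from the answer; note it computes the population standard deviation, i.e. it divides by N, matching numpy.std's default):

```python
import math
from collections import Counter

def counter_stats(counter):
    """Return (mean, population std dev) computed directly from a Counter."""
    count = sum(counter.values())
    sum_of_numbers = sum(number * cnt for number, cnt in counter.items())
    mean = sum_of_numbers / count
    # Variance = mean of squares minus square of the mean
    total_squares = sum(number * number * cnt for number, cnt in counter.items())
    variance = total_squares / count - mean * mean
    return mean, math.sqrt(variance)

mean, std_dev = counter_stats(Counter({48: 4082, 49: 1146}))
print(mean, std_dev)
```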