1093. Statistics from a Large Sample

MediumArrayMathProbability and Statistics
Leetcode Link

Problem Description

You are provided with an array called count representing a large sample of integers where each element's index (k) corresponds to an integer, and the value at each index (count[k]) corresponds to how many times k occurs in the sample. The range of integers is from 0 to 255. Your task is to calculate various statistics about the sample: the minimum value (minimum), maximum value (maximum), mean, median, and mode of the sample. The mean is the sum of all the integer occurrences multiplied by their respective values divided by the total number of elements. The median is the middle value once the sample is sorted, or if the sample size is even, then it's the average of the two middle values. The mode is the value that appears most frequently in the sample, and it's given that there will be a single, unique mode. The result should be returned as an array of floating-point numbers containing the statistics in the order they were mentioned.

Intuition

The solution first involves finding the minimum and maximum values present in the sample. Since the count array represents how many times each integer in the range [0, 255] occurs, we iterate through this array and look for the first and last indexes that have a non-zero count which represents the minimum and maximum, respectively.

Next, we calculate the mean by summing up each integer multiplied by its occurrences (k * count[k]) and then dividing by the total number of elements in the sample (cnt).

Calculating the median is a bit trickier because we need to consider whether the total number of elements is odd or even. We use a helper function find(i) to find the i-th smallest value in the sample according to its sorted position. If the total number of elements is odd, we want the element that is in the middle position. If it is even, we average the two middle elements, which involves finding the cnt // 2-th and the cnt // 2 + 1-th elements if you sort the sample.

Lastly, the mode is the integer value that has the highest count in the count array. As we iterate through the count array to perform other operations, we also keep track of the mode by comparing the current integers' count.

Combining all these steps, the function calculates and returns the statistics as a list [minimum, maximum, mean, median, mode].

Learn more about Math patterns.

Not Sure What to Study? Take the 2-min Quiz to Find Your Missing Piece:

Which of the following is equvalent to O(3*2^n + n^3 + n!+ log n)?

Solution Approach

The sampleStats function in the provided solution uses a straightforward algorithm to calculate the various statistics for the given sample.

  1. Initialization of Variables: The variables mi (minimum), mx (maximum), s (sum), cnt (count), and mode are initialized. The inf keyword in Python represents an infinite number, used here to make the initial minimum as high as possible. The mx is initialized to -1 to later find the maximum which will certainly be higher than -1. The sum s and count cnt are initialized to 0 to calculate the mean. The mode is initialized to 0.

  2. Finding Minimum, Maximum, Sum, Count, and Mode: A single loop iterates through all possible integer values (0 to 255) as indices in the count array. If an index k has a non-zero count (x), the algorithm does several things:

    • Updates minimum and maximum using min(mi, k) and max(mx, k) functions.
    • Adds k * x to the sum s.
    • Increases the total count cnt by x.
    • Checks if the current count x is greater than the count of the current mode and if so, updates the mode.
  3. Finding the Median: The median calculation depends on whether the total count cnt is odd or even. A helper nested function find(i) is defined which returns the i-th smallest value in the sample (if the sample were sorted). For an odd total count, it finds the middle value. For an even total count, it finds the average of the two middle values. This is done by calling the find function accordingly.

The helper function find works by iterating through the count array again. It accumulates the total number of elements seen so far in a temporary count t and checks if it is greater than or equal to the target i. When it reaches or passes the target, it returns the current integer value k as the i-th smallest value.

  1. Returning the Result: Finally, the function returns a list with the calculated statistics: [mi, mx, s / cnt, median, mode], corresponding to minimum, maximum, mean, median, and mode.

The pattern used here is mainly iterating through the count array to gather all the needed statistics, utilizing a single pass wherever possible to optimize performance and then leveraging aggregated data to derive mean and median.

Understanding this solution approach is mainly about recognizing that the array index itself represents the integer value in the sample, and the increment in the array represents its frequency or count in the sample.

Discover Your Strengths and Weaknesses: Take Our 2-Minute Quiz to Tailor Your Study Plan:

A heap is a ...?

Example Walkthrough

Let's consider a simple example to illustrate the solution approach. Suppose we have an array count which has non-zero values at indices 2, 3, and 4, which represent the integers 2, 3, and 4 in the sample. The count array looks like this:

1count = [0, 0, 1, 2, 1, 0, 0, ... 0]

This means the integer 2 occurs once, 3 occurs twice, and 4 occurs once in our sample. Given that the rest of the count array contains zeros, we will ignore them for brevity. Now, let's walk through the solution approach:

  1. Initialization of Variables: The variables mi, mx, s, cnt, and mode are initialized. We start with mi set to infinity, mx to -1, s and cnt to 0, and mode to 0.

  2. Finding Minimum, Maximum, Sum, Count, and Mode: We iterate through the indices 0 to 255 of the count array. Here's the breakdown:

    • At index 2, count[2] is 1. We update mi to min(inf, 2) which is 2, mx to max(-1, 2) which is 2, add 2 * 1 to s to make it 2, add 1 to cnt to make it 1, and update the mode to 2 since 1 > 0.
    • At index 3, count[3] is 2. We update mi (remains 2), mx to max(2, 3) which is 3, add 3 * 2 to s to make it 8, add 2 to cnt to make it 3, and update mode to 3 since 2 > 1.
    • At index 4, count[4] is 1. We update mi (remains 2), mx to max(3, 4) which is 4, add 4 * 1 to s to make it 12, add 1 to cnt to make it 4, and mode remains 3 (since count of 4 is not greater than count of 3).
  3. Finding the Median: The total count cnt is 4, which is even. We use our find function to locate the middle values. We need the average of the 2nd and 3rd smallest values.

    • Invoking find(2) will iterate through count and sum up the counts until it equals or surpasses 2. It surpasses 2 at index 3, thus the 2nd smallest value is 3.
    • Invoking find(3) will iterate again until the sum surpasses 3, which will happen at index 3 as well. Thus, the 3rd smallest value is also 3.
    • The average of these is 3, so the median is 3.
  4. Returning the Result: We calculate the mean as s / cnt which is 12 / 4 equals 3. Our final statistics array is [mi, mx, mean, median, mode], which in this case is [2, 4, 3, 3, 3].

From this example, you see how the count array index's value is used as an integer from the sample and the array's value at that index is its frequency. By iterating through the count array, we can gather all the information we need for our statistics using well-organized loops and conditional statements.

Solution Implementation

1from math import inf
2from typing import List
3
4class Solution:
5    def sampleStats(self, count: List[int]) -> List[float]:
6        # Helper function to find the index of the i-th number in the data stream
7        def find(i: int) -> int:
8            total = 0
9            for index, frequency in enumerate(count):
10                total += frequency
11                if total >= i:
12                    return index
13
14        # Initialize minimum and maximum to represent the range of the data stream
15        minimum = inf
16        maximum = -1
17        # Initialize variables for sum and count to compute mean
18        sum_values = 0
19        total_count = 0
20        # Initialize the mode value
21        mode = 0
22      
23        # Loop through count array to compute min, max, sum, total_count, and mode
24        for value, frequency in enumerate(count):
25            if frequency:
26                minimum = min(minimum, value)
27                maximum = max(maximum, value)
28                sum_values += value * frequency
29                total_count += frequency
30                # Update mode if the current frequency is greater than the max frequency found so far
31                if frequency > count[mode]:
32                    mode = value
33
34        # Compute median
35        if total_count & 1:  # If the total number of elements is odd
36            median = find(total_count // 2 + 1)
37        else:
38            # If the number of elements is even, average the middle two elements
39            median = (find(total_count // 2) + find(total_count // 2 + 1)) / 2
40
41        # Compute mean
42        mean = sum_values / total_count
43      
44        # Return a list containing the minimum, maximum, mean, median, and mode
45        return [float(minimum), float(maximum), mean, median, float(mode)]
46
1class Solution {
2    private int[] elementCounts; // This array holds the count for each number.
3
4    public double[] sampleStats(int[] count) {
5        this.elementCounts = count;
6        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
7        long sum = 0; // Used to calculate the mean
8        int totalCount = 0; // Total number of elements
9        int mode = 0; // The number that appears the most frequently
10      
11        // Iterate through the count array to find minimum, maximum, sum, total count, and mode
12        for (int number = 0; number < elementCounts.length; ++number) {
13            if (elementCounts[number] > 0) {
14                min = Math.min(min, number);
15                max = Math.max(max, number);
16                sum += (long) number * elementCounts[number];
17                totalCount += elementCounts[number];
18                if (elementCounts[number] > elementCounts[mode]) {
19                    mode = number;
20                }
21            }
22        }
23      
24        // Calculate median
25        double median = totalCount % 2 == 1 
26            ? findKthElement(totalCount / 2 + 1)
27            : (findKthElement(totalCount / 2) +
28               findKthElement(totalCount / 2 + 1)) / 2.0;
29      
30        // Calculate mean
31        double mean = sum * 1.0 / totalCount;
32
33        // Return the results as an array: [min, max, mean, median, mode]
34        return new double[] {min, max, mean, median, mode};
35    }
36
37    // Helper function to find the kth element when the count array is treated like an expanded array
38    private int findKthElement(int k) {
39        int elementsSoFar = 0;
40      
41        // Iterate through elementCounts and keep adding until we reach or pass the kth element
42        for (int number = 0;; ++number) {
43            elementsSoFar += elementCounts[number];
44            if (elementsSoFar >= k) {
45                return number;
46            }
47        }
48    }
49}
50
1class Solution {
2public:
3    vector<double> sampleStats(vector<int>& count) {
4        // Lambda to find the value at the ith position when the array is considered sorted.
5        auto findValueAtPosition = [&](int position) -> int {
6            for (int value = 0, accumulated = 0;; ++value) {
7                accumulated += count[value];
8                if (accumulated >= position) {
9                    return value;
10                }
11            }
12        };
13
14        // Initialize minimum and maximum values to extreme values.
15        int minimumValue = INT_MAX, maximumValue = INT_MIN;
16
17        // Initialize variables to calculate the sum and mode.
18        long long sum = 0;
19        int totalCount = 0, modeValue = 0;
20
21        // Loop through the count array to find the minimum, maximum, total count and mode.
22        for (int value = 0; value < count.size(); ++value) {
23            if (count[value] > 0) {
24                minimumValue = min(minimumValue, value);
25                maximumValue = max(maximumValue, value);
26                sum += static_cast<long long>(value) * count[value];
27                totalCount += count[value];
28                if (count[value] > count[modeValue]) {
29                    modeValue = value;
30                }
31            }
32        }
33
34        // Calculate median using the findValueAtPosition lambda.
35        double median;
36        if (totalCount % 2 == 1) {
37            // If the total count is odd, select the middle value directly.
38            median = findValueAtPosition(totalCount / 2 + 1);
39        } else {
40            // If the total count is even, take the average of the two middle values.
41            median = (findValueAtPosition(totalCount / 2) + findValueAtPosition(totalCount / 2 + 1)) / 2.0;
42        }
43
44        // Calculate mean by dividing the sum by the total count.
45        double mean = sum / static_cast<double>(totalCount);
46
47        // Store the stats results as a vector of doubles and return.
48        return vector<double>{static_cast<double>(minimumValue),
49                              static_cast<double>(maximumValue),
50                              mean,
51                              median,
52                              static_cast<double>(modeValue)};
53    }
54};
55
1// Helper function to find the ith smallest number based on the cumulative frequency
2function findIthNumber(numbersCount: number[], ith: number): number {
3    for (let k = 0, total = 0; ; ++k) {
4        total += numbersCount[k];
5        if (total >= ith) {
6            return k;
7        }
8    }
9}
10
11// Main function to calculate statistics from a sample provided as an array of counts for each value
12function sampleStats(numbersCount: number[]): number[] {
13    let min = Number.MAX_SAFE_INTEGER; // Initialize minimum to a large number
14    let max = -1;                      // Initialize maximum to a small number
15    let sum = 0;                       // Sum of all numbers
16    let totalCount = 0;                // Total count of all numbers
17    let mode = 0;                      // Mode value of the numbers
18
19    // Iterate over each count index to calculate min, max, sum, and mode
20    for (let k = 0; k < numbersCount.length; ++k) {
21        if (numbersCount[k] > 0) {
22            min = Math.min(min, k);
23            max = Math.max(max, k);
24            sum += k * numbersCount[k];
25            totalCount += numbersCount[k];
26            if (numbersCount[k] > numbersCount[mode]) {
27                mode = k;
28            }
29        }
30    }
31
32    // Calculate median based on whether the count is odd or even
33    let median = 
34        totalCount % 2 === 1
35            ? findIthNumber(numbersCount, (totalCount >> 1) + 1)
36            : (findIthNumber(numbersCount, totalCount >> 1) + findIthNumber(numbersCount, (totalCount >> 1) + 1)) / 2;
37
38    // Return an array containing min, max, mean, median, mode
39    return [min, max, sum / totalCount, median, mode];
40}
41
Not Sure What to Study? Take the 2-min Quiz:

Which two pointer technique does Quick Sort use?

Time and Space Complexity

Time Complexity:

The time complexity of the code can be analyzed by looking at the number of operations performed relative to the number of elements n in the count list:

  • The loop to find the minimum (mi) and maximum (mx) values, the total count (cnt), the sum (s), and the mode runs once for each of the n elements: O(n).
  • The find function is called either 2 or 3 times depending on whether cnt is odd or even. Each call to find runs in O(n) because, in the worst-case scenario, it iterates over the entire count list. Thus, the worst case for finding the median is O(n).
  • In summary, the overall time complexity for this piece of code is O(n) + O(n) for median calculation, which simplifies to O(n) since both are linear operations.

Space Complexity:

  • The space complexity is O(1) because the space used does not depend on the input size n. The variables mi, mx, s, cnt, and mode use a constant amount of space.
  • The find function uses a constant amount of space as well since the variables t and k are of fixed size.

In conclusion, the code exhibits a linear time complexity (O(n)) and constant space complexity (O(1)).

Learn more about how to find time and space complexity quickly using problem constraints.

Fast Track Your Learning with Our Quick Skills Quiz:

Which one best describes the time complexity of the following code?

1int factorial(int n) {
2  if (n < 0) {
3    return -1;
4  } else if (n == 0) {
5    return 1;
6  } else {
7    return n * factorial(n - 1);
8  }
9}

Recommended Readings


Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.


TA 👨‍🏫