Background Information
This lab focuses on a simple data processing problem: calculating some basic statistics for data stored in a file. The catch is that you don't know how much data the file contains ahead of time.
This problem is available courtesy of Professor Jason James (James, 2022).
Mean
The mean ($\bar{x}$) of a set of $n$ data points $x_1, \dots, x_n$ is given by:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
In other words, we sum up all of the data points and then divide by the total number of data points in the set.
Variance
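The traditional formula for sample variance (v) of $n$ data points $x_i$ with mean $\bar{x}$ is given by:
$$v = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
But this formula has a problem: it requires us to work out the average ($\bar{x}$) first and then run back through the data to find the variance. This might not be feasible or desirable if the data set is very large.
Thus, we often use a reworked version of the formula, obtained by expanding the square:
$$v = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right)$$
Since $\bar{x}$ has no dependence on $i$, we can pull it out of each sum:
$$v = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2\right)$$
From the definition of the mean, the sum over all $x_i$ can be rewritten in terms of $\bar{x}$ as $\sum_{i=1}^{n} x_i = n\bar{x}$:
$$v = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$
With this expression, we don't need the average until after we have summed the squares of the data points. We can accumulate the squares at the same time as the values themselves, then calculate the average and variance at the end.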
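The single-pass idea can be sketched as a small accumulator. This is only an illustration, assuming the program is written in C++ (the lab does not name a language); the names `RunningStats`, `add`, `mean`, `variance`, and `stddev` are hypothetical, and returning 0 when there are too few data points is an arbitrary choice made for this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical accumulator; none of these names come from the lab handout.
struct RunningStats {
    std::size_t count = 0;  // n
    double sum = 0.0;       // running sum of x_i
    double sum_sq = 0.0;    // running sum of x_i^2

    // Fold one data point into the running totals.
    void add(double x) {
        ++count;
        sum += x;
        sum_sq += x * x;
    }

    // Mean of the points seen so far (0 for an empty set, by assumption).
    double mean() const {
        return count == 0 ? 0.0 : sum / static_cast<double>(count);
    }

    // Sample variance: (sum_sq - n * mean^2) / (n - 1).
    // Returns 0 for fewer than two points (an assumption of this sketch).
    double variance() const {
        if (count < 2) return 0.0;
        const double n = static_cast<double>(count);
        const double m = sum / n;
        const double v = (sum_sq - n * m * m) / (n - 1.0);
        return std::max(0.0, v);  // guard against a tiny negative from rounding
    }

    // Standard deviation: square root of the variance.
    double stddev() const { return std::sqrt(variance()); }
};
```

Calling `add` once per value and reading `mean()` and `stddev()` at the end needs only three running totals, no matter how long the file is. One caveat: subtracting two large, nearly equal quantities can lose floating-point precision when the mean is large compared with the spread of the data, and the clamp to zero only protects the square root from a slightly negative result.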
Standard Deviation
The standard deviation is given by the square root of the variance.
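As a quick check of these definitions, here is a worked example. It assumes a data set made up of the integers 1 through 10, which is an illustrative choice rather than something the lab specifies, and it writes $s$ for the standard deviation (a symbol introduced here for convenience).

$$
\begin{aligned}
n &= 10, \qquad \sum_{i=1}^{n} x_i = 55, \qquad \sum_{i=1}^{n} x_i^2 = 385 \\
\bar{x} &= \frac{55}{10} = 5.5 \\
v &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \frac{385 - 10(5.5)^2}{9} = \frac{82.5}{9} \approx 9.1667 \\
s &= \sqrt{v} \approx 3.02765
\end{aligned}
$$

These values are consistent with the Count, Mean, and StdDev shown in the example output later in this lab.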
Problem Statement
Write a program that calculates the following basic statistics for a set of numbers stored in a file:
- Count
- Mean
- Standard Deviation
- Maximum
- Minimum
Acceptance Criteria
- The program should print a welcome message when the user executes it.
- The user should be able to enter the name of their file.
- If the file does not open successfully or is not found, print out a warning and exit gracefully.
- If the file is opened successfully and contains space-separated numerical data, calculate the following:
  - Count
  - Mean
  - Standard deviation
  - Maximum
  - Minimum
- If the file is empty, the program should print out a warning saying that there is no data in the file and exit gracefully.
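The file-opening criterion could be handled along the following lines. This is a minimal sketch, assuming a C++ program; `open_data_file` is a hypothetical helper name, and the messages simply reuse the wording from the lab's example output. What the caller does on failure (exit, or ask for another name) is left open here.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: tries to open `filename` for reading through `in`.
// Prints a message either way and reports success to the caller.
bool open_data_file(const std::string& filename, std::ifstream& in) {
    in.open(filename);
    if (!in.is_open()) {
        std::cout << "I'm sorry, I could not open '" << filename << "'.\n";
        return false;
    }
    std::cout << "File '" << filename << "' opened successfully!\n";
    return true;
}
```

An empty file opens successfully, so it has to be detected separately, for example by noticing that the count is still zero after reading.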
Dev Notes
- The data files will be a space-separated list of numbers.
- You can assume that there is one space in between each number and no newlines.
- You won't know the length of the list ahead of time. In such a situation, you should never try to read the entire file into memory. Process the file element-by-element without holding onto more than a single piece of data at a time.
- Test your program with a variety of files that you generate (0, 1, 10, 100, 1000+ data points).
Example Output
$ ./statnumbs.out
Welcome to the Number Statistics Program!
Please enter the name of your data file: does-not-exist.dat
I'm sorry, I could not open 'does-not-exist.dat'.
Please enter another name: values
File 'values' opened successfully!
Reading data from 'values'...
Calculating...
Done processing data!
For your data, the statistics are as follows:
Count: 10
Minimum: 1
Mean: 5.5
Maximum: 10
StdDev: 3.02765
Thank you for using the Number Statistics Program!
Thought-Provoking Questions
- Do the numbers in the file have to be in a specific order to calculate the mean, standard deviation, maximum, and minimum?
- What strategies might you use to make sure that your program does not run out of memory space as you read a large data file?
- When finding the largest or smallest item in a list, what value should you start with as your assumed value?
Add-Ons For the Portfolio
(One Credit) Variable Number of Spaces
Update your program so that it can handle files with a variable number of spaces between each number. The output of the program should be identical to the original. For example, the data file might contain:
1  2   3 4    5  6 7   8 9    0
(One Credit) Newlines
Update your program so that it can handle files with newlines between numbers. The output of the program should be identical to the original. For example, the data file might contain:
1 2 3 4
5 6 7 8
9 0
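If the program is written in C++ (an assumption, since the lab does not name a language), formatted extraction with `>>` skips leading whitespace by default, and that includes runs of spaces, tabs, and newlines. The sketch below uses a hypothetical helper, `sum_numbers`, to experiment with different layouts held in a string; it is a way to explore the behavior, not a required part of the solution.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper for experimenting: sums every number found in `text`.
double sum_numbers(const std::string& text) {
    std::istringstream in(text);
    double total = 0.0;
    double value = 0.0;
    while (in >> value) {  // leading whitespace is skipped before each number
        total += value;
    }
    return total;
}
```

With this helper, `"1 2 3 4"`, `"1   2  3 4"`, and `"1 2\n3 4"` all produce the same sum, which suggests that a reading loop built on `>>` may already cope with both add-ons.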
Useful Links
N/A