Before looking for anomalous values, define the range of values the dataset is expected to contain. A common way to do this is with the mean and the standard deviation: the mean is the average value of the dataset, and the standard deviation measures how widely the data points are spread around the mean. In a normal distribution, which is a symmetrical bell-shaped curve, about 68% of the data points lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.
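The 68%, 95% and 99.7% figures can be checked directly. The following is a minimal sketch using Python's standard library; the function name coverage_within is made up for this illustration.

```python
from statistics import NormalDist


def coverage_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    # Area under the curve between -k and +k standard deviations.
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

These percentages hold only for data that is at least roughly normally distributed.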
With the mean and standard deviation in hand, anomalous values can be identified by measuring how far each data point lies from the mean, expressed in multiples of the standard deviation. Values more than three standard deviations from the mean are commonly treated as outliers, because in a normal distribution only about 0.3% of data points fall that far out.
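As a rough illustration of the expected range, the sketch below computes the bounds three standard deviations either side of the mean and lists the values outside them. The readings are made-up sample data, the function names are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form to use.

```python
from statistics import mean, stdev


def expected_range(values, k=3.0):
    """Return (lower, upper) bounds lying k standard deviations around the mean."""
    center = mean(values)
    spread = stdev(values)  # assumption: sample standard deviation (n - 1)
    return center - k * spread, center + k * spread


def outside_range(values, k=3.0):
    """Return the values that fall outside the expected range."""
    lower, upper = expected_range(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up readings: twenty values near 50 and one suspicious value of 95.
readings = [48, 49, 50, 51, 52] * 4 + [95]

lower, upper = expected_range(readings)
print(f"expected range: {lower:.1f} to {upper:.1f}")
print(outside_range(readings))  # [95]
```

With very small datasets no value can ever lie three standard deviations from the mean, so this check needs a reasonable number of data points to be useful.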
There are several methods for identifying anomalous values; we will discuss two of them:
– Z-score method: The Z-score method is commonly used to identify outliers in a dataset. A Z-score measures how many standard deviations a data point lies from the mean. It is calculated using the formula:
Z-score = (Value – Mean) / Standard deviation
If the Z-score of a data point is greater than 3 or less than -3, it is considered an outlier.
– Modified Z-score method: The Z-score method may not be effective in identifying outliers, especially when the dataset is skewed, because the mean and standard deviation are themselves pulled toward extreme values. The Modified Z-score method was developed to overcome this issue by relying on the median instead. The formula for the Modified Z-score is:
Modified Z-score = 0.6745 * (Value – Median) / MAD
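where Median is the median value of the dataset, and MAD (Median Absolute Deviation) is a measure of variability that is less sensitive to outliers than the standard deviation. MAD is calculated by taking the absolute deviation of each data point from the median and then finding the median of those deviations. The constant 0.6745 scales the result so that, for normally distributed data, a Modified Z-score is comparable to an ordinary Z-score.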
Similar to the Z-score method, if the Modified Z-score of a data point is greater than 3.5 or less than -3.5, it is considered an outlier.
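The two formulas above can be sketched side by side. This is a minimal illustration and not a definitive implementation: the function names and the sample data are made up, the sample standard deviation is an assumption, and raising an error when MAD is zero is one possible way to handle a case the formula leaves undefined.

```python
from statistics import mean, median, stdev


def z_scores(values):
    """Z-score = (Value - Mean) / Standard deviation."""
    center = mean(values)
    spread = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - center) / spread for v in values]


def modified_z_scores(values):
    """Modified Z-score = 0.6745 * (Value - Median) / MAD."""
    med = median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption: treat this case as an error, since the formula divides by MAD.
        raise ValueError("MAD is zero; the Modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, scores, threshold):
    """Return the values whose score is above threshold or below -threshold."""
    return [v for v, s in zip(values, scores) if abs(s) > threshold]


# Made-up data: seven values near 11 and one extreme value of 50.
data = [10, 11, 12, 11, 10, 12, 11, 50]

print(flag_outliers(data, z_scores(data), 3.0))           # []
print(flag_outliers(data, modified_z_scores(data), 3.5))  # [50]
```

In this small sample the value 50 inflates the mean and standard deviation so much that its own Z-score is only about 2.5, and the Z-score method misses it. The median and MAD are barely affected, so the Modified Z-score flags it clearly.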
It is important to note that the above methods are not foolproof and may not identify all anomalous values. Therefore, visual inspection of the data is also recommended to detect outliers. Plotting the data points on a scatter plot or a box plot can help identify any extreme values that fall outside the expected range.
In conclusion, anomalous values can have a significant impact on data analysis, so identifying them is an important step in data cleaning and preprocessing. The two methods discussed in this article, the Z-score and the Modified Z-score, are effective ways to flag outliers, but neither is guaranteed to catch every one, so visual inspection of the data is also recommended. Handling outliers appropriately leads to more reliable and accurate analysis results.