For the past few weeks, I have been getting a few duplicate emails. I use an anti-spam redirecting email service for things like online purchases so I don't have to give out my real email address. This service accounts for something like 10% of all emails, but 100% of the duplicates. When I told them about this, they said that their logs show that they are only sending one copy to me, so the problem is not with them.

I would like to calculate the odds that all of the duplicates would come through them if the duplicates were totally random. I'm not sure if I am doing this right. Let:

N = Total number of emails
D = Number of duplicates
P = Probability of a duplicate email = D/N
Q = 1-P
S = Percentage of emails coming through the anti-spam service
Ns = Number of emails coming through the anti-spam service = N*S
Nn = Number of emails not coming through the anti-spam service = N-Ns

Is this the right formula for the odds that all of the duplicate emails would come through the service?

Pall = (P^Ns) x (Q^Nn) x Combin(N,Ns)
Suppose I receive 5 emails and 2 of them are duplicates. (Note that I am counting each pair of duplicate emails as a single event. There would actually be 7 physical emails.) Let D represent a duplicate email and S represent a solo email. There are 10 combinations of how these 5 emails can be received.

DDSSS DSDSS DSSDS DSSSD SDDSS SDSDS SDSSD SSDDS SSDSD SSSDD

At first I thought this mattered and I had to add up the probabilities for all 10 combinations. But then I thought about the case where I get just 2 emails and they are both duplicates. Now there is just one case: DD

Let:

A = Emails received through the anti-spam channel.
B = Emails received through the regular channel.
Pa = The percentage of emails that come through the anti-spam channel (about 10%).
Pb = The percentage of emails that come through the regular channel = 1-Pa (about 90%).
Nd = The total number of duplicate emails.

The probability that both of these (duplicate) emails came through the A channel would be Pa^2. For these 2 emails, there are 4 possible channel assignments with these probabilities:

AA = Pa*Pa
AB = Pa*Pb = Pa*(1-Pa)
BA = Pb*Pa = (1-Pa)*Pa
BB = Pb*Pb = (1-Pa)^2

And if I add up these probabilities, they sum to 1.

Pa^2 + Pa*(1-Pa) + (1-Pa)*Pa + (1-Pa)^2
= Pa^2 + Pa - Pa^2 + Pa - Pa^2 + 1 - 2Pa + Pa^2
= 1

Now, if I get 3 emails and 2 of them are duplicates, there are 3 combinations:

DDS DSD SDD

But the order doesn't seem to matter. For each of these possibilities, the probability that both duplicate emails came through the A channel appears to be the same as above. And the probability for the S (solo) email is "1" because we don't care which channel it took. If there were 5 emails and 2 duplicates, each combination, such as SSDSD, would still have the same probabilities because the probabilities of all of the S's = 1.

If this is correct, then the total number of emails doesn't seem to matter. And the probability that all of the duplicate emails came through Channel A is just the probability that any email came through Channel A raised to the power of the number of duplicate emails:

Pall = Pa^Nd

Can anyone tell me if this is correct or, if not, why? Thanks
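The claim that the solo emails drop out can be checked by brute force. The sketch below is only an illustration, assuming each email picks its channel independently with probability Pa for channel A; the function name and the choice to treat the first Nd emails as the duplicates are made up for the example. It lists every possible channel assignment, adds up the ones where all duplicates used channel A, and compares the total with Pa^Nd.

```python
from itertools import product


def prob_all_duplicates_via_a(n_emails, n_dups, pa):
    """Sum the probability of every channel assignment in which the
    duplicate emails all came through channel A.

    Assumption for this sketch: the first n_dups emails are the
    duplicates, and every email picks its channel independently.
    """
    total = 0.0
    for channels in product("AB", repeat=n_emails):
        # Probability of this exact assignment of channels.
        p = 1.0
        for c in channels:
            p *= pa if c == "A" else 1.0 - pa
        # Keep it only if every duplicate used channel A.
        if all(c == "A" for c in channels[:n_dups]):
            total += p
    return total


if __name__ == "__main__":
    pa, n_dups = 0.1, 2
    for n_emails in (2, 3, 5, 8):
        print(n_emails, prob_all_duplicates_via_a(n_emails, n_dups, pa), pa ** n_dups)
```

Under that independence assumption, the enumerated total should match Pa^Nd for every total number of emails, up to floating-point rounding.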
This looks correct. If the chance of a particular email coming through the anti-spam service is 1/10 and ALL duplicates are coming through the anti-spam service, then the probability of this occurring randomly is (0.1)^N, where N is the number of duplicates.
So it doesn't matter how many non-duplicate emails I get. If I get 2 duplicate emails and both of them come through the anti-spam channel, it doesn't matter whether I got 20 emails or 20,000 as long as 10% of them come through the anti-spam channel. Is that right? Maybe I am asking the wrong question. Maybe I should be asking what the probability is that the anti-spam channel is causing the duplicates. It's like asking what are the odds that smoking causes cancer based on some data about the rate of cancer in smokers and non-smokers. How would I calculate the odds that the duplicates are not caused by the anti-spam channel if N of N duplicates come through that channel?
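On the 20 versus 20,000 question, one possible refinement: Pa^Nd treats each email's channel as an independent draw. If the counts are instead taken as fixed (exactly Na of the N emails came through the anti-spam channel, and the Nd duplicates are picked at random from all N), the chance that every duplicate lands among the Na is Combin(Na,Nd) / Combin(N,Nd). That is a different model from the one above, and Na and the function name below are introduced only for this sketch.

```python
from math import comb


def prob_all_dups_fixed_counts(n_total, n_via_a, n_dups):
    """Chance that n_dups emails drawn at random (without replacement)
    from n_total all fall among the n_via_a that used channel A.

    Assumption for this sketch: channel counts are fixed in advance.
    """
    return comb(n_via_a, n_dups) / comb(n_total, n_dups)


if __name__ == "__main__":
    # 10% of emails through channel A, 2 duplicates, two mailbox sizes.
    print(prob_all_dups_fixed_counts(20, 2, 2))
    print(prob_all_dups_fixed_counts(20000, 2000, 2))
    print(0.1 ** 2)
```

With this model the total does matter a little for small mailboxes: with 20 emails the result is 1/190, roughly half of 0.1^2, while with 20,000 emails it is very close to 0.1^2.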
I think you're still making the case. If 100% of duplicates come through the anti-spam service and you have enough of them to show that it cannot be a coincidence, then the anti-spam folks shouldn't deny it. The fact that some of their emails are not duplicates is not a defense for them.
Yes, I agree that the case is strong. I'm just curious about the statistics. I'd like to know how to calculate the odds that it isn't caused by that channel.
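One common way to frame that question is Bayes' rule, which needs more than the data: it needs a prior guess at how likely the channel is to be the cause before looking at the duplicates. The sketch below is only an illustration under loud assumptions: there are just two explanations ("random" and "caused by the channel"), under "caused" every duplicate comes through channel A with probability 1, and the prior of 0.5 is an arbitrary placeholder. The function name is hypothetical.

```python
def posterior_not_caused(pa, n_dups, prior_caused=0.5):
    """Posterior probability that the duplicates are NOT caused by
    channel A, given that all n_dups duplicates came through it.

    Assumptions for this sketch: only two hypotheses exist, and under
    the 'caused' hypothesis all duplicates use channel A for certain.
    """
    like_random = pa ** n_dups   # P(all via A | random)
    like_caused = 1.0            # P(all via A | caused by A), assumed
    num = (1.0 - prior_caused) * like_random
    return num / (num + prior_caused * like_caused)


if __name__ == "__main__":
    for n_dups in (1, 2, 3, 5):
        print(n_dups, posterior_not_caused(0.1, n_dups))
```

With Pa = 0.1, two duplicates, and the 50/50 prior, this gives 0.01/1.01, just under 1%. A more skeptical prior raises that number, so the answer is only as good as the prior and the two-hypothesis setup.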