Probability all errors in one path

Discussion in 'Physics & Math' started by Jennifer Murphy, Dec 10, 2014.

  1. Jennifer Murphy Registered Senior Member

    Messages:
    239
    For the past few weeks, I have been getting a few duplicate emails. I use an anti-spam redirecting email service for things like online purchases so I don't have to give out my real email address. This service accounts for something like 10% of all emails, but 100% of the duplicates. When I told them about this, they said that their logs show that they are only sending one copy to me, so the problem is not with them.

    I would like to calculate the odds that all of the duplicates would come through them if the duplicates were totally random. I'm not sure if I am doing this right.

    Let:

    N = Total number of emails
    D = Number of duplicates
    P = Probability of a duplicate email = D/N
    Q = 1-P
    S = Percentage of emails coming through the anti-spam service
    Ns = Number of emails coming through the anti-spam service = N*S
    Nn = Number of emails not coming through the anti-spam service = N-Ns

    Is this the right formula for the odds that all of the duplicate emails would come through the service?

    Pall = (P^Ns) x (Q^Nn) x Combin(N,Ns)
     
  3. Jennifer Murphy Registered Senior Member

    Messages:
    239
    Suppose I receive 5 emails and 2 of them are duplicates. (Note that I am counting each pair of duplicate emails as a single event. There would actually be 7 physical emails.)

    Let D represent a duplicate email and S represent a solo email. There are 10 combinations of how these 5 emails can be received.
    1. DDSSS
    2. DSDSS
    3. DSSDS
    4. DSSSD
    5. SDDSS
    6. SDSDS
    7. SDSSD
    8. SSDDS
    9. SSDSD
    10. SSSDD
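
    As a quick check on the count above, here is a minimal Python sketch that enumerates the arrangements. The function name `arrangements` is made up for this illustration; it simply picks which positions hold the duplicates.

```python
from itertools import combinations

def arrangements(total, duplicates):
    """Return every ordering of `duplicates` D's and (total - duplicates) S's."""
    result = []
    # Choose which positions hold a duplicate (D); the rest are solo (S).
    for d_positions in combinations(range(total), duplicates):
        result.append("".join("D" if i in d_positions else "S" for i in range(total)))
    return result

patterns = arrangements(5, 2)
for number, pattern in enumerate(patterns, start=1):
    print(number, pattern)
print("count:", len(patterns))
```

    The count is the binomial coefficient "5 choose 2", which is where the 10 comes from.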

    At first I thought this mattered and I had to add up the probabilities for all 10 combinations. But then I thought about the case where I get just 2 emails and they are both duplicates. Now there is just one case:

    DD

    Let

    A = Emails received through the anti-spam channel.
    B = Emails received through the regular channel.
    Pa = The percentage of emails that come through the anti-spam channel (about 10%).
    Pb = The percentage of emails that come through the regular channel = 1-Pa (about 90%).
    Nd = The total number of duplicate emails.

    The probability that both of these (duplicate) emails came through the A channel would be Pa^2. For these 2 emails, there are 4 possibilities with these probabilities:
    1. AA = Pa*Pa
    2. AB = Pa*Pb = Pa*(1-Pa)
    3. BA = Pb*Pa = (1-Pa)*Pa
    4. BB = Pb*Pb = (1-Pa)^2
    And if I add up these probabilities, they sum to 1.

    Pa^2 + Pa*(1-Pa) + (1-Pa)*Pa + (1-Pa)^2 = Pa^2 + Pa - Pa^2 + Pa - Pa^2 + 1 - 2Pa + Pa^2 = 1

    Now, if I get 3 emails and 2 of them are duplicates, there are 3 combinations:
    1. DDS
    2. DSD
    3. SDD
    But the order doesn't seem to matter. For each of these possibilities, the probability that both duplicate emails came through the A channel appears to be the same as above. And the probability for the S (solo) email is "1" because we don't care which channel it took. If there were 5 emails and 2 duplicates, each combination, such as SSDSD, would still have the same probabilities because the probabilities of all of the S's = 1.

    If this is correct, then the total number of emails doesn't seem to matter. And the probability that all of the duplicate emails came through Channel A is just the probability that any email came through Channel A raised to the power of the number of duplicate emails:

    Pall = Pa^Nd
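
    A rough way to sanity-check this is to brute-force every channel assignment and add up the ones where all the duplicates used channel A. This is only a sketch, and it assumes that each email's channel is chosen independently of the others and of whether the email gets duplicated. The function name `prob_all_duplicates_via_a` is made up for this illustration.

```python
from itertools import product

def prob_all_duplicates_via_a(pattern, pa):
    """Sum the probability of every channel assignment in which each
    duplicate (D) email arrived through channel A. Solo (S) emails may
    use either channel."""
    total = 0.0
    for channels in product("AB", repeat=len(pattern)):
        # Keep only assignments where every D email used channel A.
        if all(ch == "A" for kind, ch in zip(pattern, channels) if kind == "D"):
            p = 1.0
            for ch in channels:
                p *= pa if ch == "A" else (1.0 - pa)
            total += p
    return total

pa = 0.1  # assumed share of emails using the anti-spam channel
for pattern in ("DD", "DDS", "SSDSD", "SSSSSSDSSD"):
    brute_force = prob_all_duplicates_via_a(pattern, pa)
    formula = pa ** pattern.count("D")
    print(pattern, brute_force, formula)
```

    Under that independence assumption the two columns should agree up to rounding for every pattern, since the solo emails contribute Pa + Pb = 1 each.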

    Can anyone tell me if this is correct or, if not, why?

    Thanks
     
  5. RJBeery Natural Philosopher Valued Senior Member

    Messages:
    4,222
    This looks correct. If the chance of a particular email coming through the anti-spam service is 1/10 and ALL duplicates are coming through the anti-spam service then the probability of this occurring randomly is (.1)^N where N is the number of duplicates.
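
    To get a feel for how quickly (.1)^N shrinks, here is a small Python sketch. The helper name `chance_all_via_service` is made up for this illustration, and the 0.1 share is the rough figure quoted in the thread.

```python
def chance_all_via_service(share, duplicates):
    """Chance that every one of `duplicates` independent emails
    used a channel carrying `share` of all traffic."""
    return share ** duplicates

for n in range(1, 7):
    p = chance_all_via_service(0.1, n)
    print(f"{n} duplicates: {p:.6f} (about 1 in {round(1 / p):,})")
```

    Each additional duplicate that arrives through the service makes the coincidence ten times less likely.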
     
  7. Jennifer Murphy Registered Senior Member

    Messages:
    239
    So it doesn't matter how many non-duplicate emails I get. If I get 2 duplicate emails and both of them come through the anti-spam channel, it doesn't matter whether I got 20 emails or 20,000 as long as 10% of them come through the anti-spam channel. Is that right?

    Maybe I am asking the wrong question. Maybe I should be asking what the probability is that the anti-spam channel is causing the duplicates. It's like asking what are the odds that smoking causes cancer based on some data about the rate of cancer in smokers and non-smokers.

    How would I calculate the odds that the duplicates are not caused by the anti-spam channel if N of N duplicates come through that channel?
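
    One possible way to frame that question is Bayes' rule with just two competing explanations: either duplicates strike emails at random regardless of channel, or the anti-spam channel is the sole cause. The sketch below rests on assumptions that are not in this thread: only those two explanations are considered, and `prior_coincidence` (how plausible the "coincidence" explanation was before looking at the data) has to be chosen by the reader. The function name is made up for this illustration.

```python
def posterior_coincidence(pa, n_duplicates, prior_coincidence):
    """Posterior probability that the duplicates are unrelated to the
    anti-spam channel, given that all n_duplicates arrived through it.

    Assumes only two hypotheses: pure coincidence, or the channel is
    the sole cause of duplicates.
    """
    # Likelihood of the observation if duplicates strike emails at random.
    like_random = pa ** n_duplicates
    # Likelihood if the channel is the sole cause: every duplicate must use it.
    like_channel = 1.0
    numerator = prior_coincidence * like_random
    denominator = numerator + (1.0 - prior_coincidence) * like_channel
    return numerator / denominator

# Assumed 50/50 prior, purely for illustration.
for n in (1, 2, 3, 5):
    print(n, posterior_coincidence(0.1, n, 0.5))
```

    The answer depends on the prior, so this gives a way to see how fast the evidence accumulates rather than a single definitive number.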
     
  8. RJBeery Natural Philosopher Valued Senior Member

    Messages:
    4,222
    I think you're still making the case. If 100% of duplicates come through the anti-spam and you have sufficient numbers to show that it cannot be a coincidence then the anti-spam folks shouldn't deny it. The fact that some of their emails are not duplicates is not a defense for them.
     
  9. Jennifer Murphy Registered Senior Member

    Messages:
    239
    Yes, I agree that the case is strong. I'm just curious about the statistics. I'd like to know how to calculate the odds that it isn't caused by that channel.
     
