The conversion latencies do not affect sound reproduction, since the D/A converters run at a fixed clock rate (usually 44.1 kHz). Either the algorithm works and delivers data at that rate, or it doesn't work and you hear dropouts (or, more commonly, you get no sound at all).
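To put that in code: a minimal Python sketch of the fixed-clock model (the names SAMPLE_RATE, BLOCK, and audio_callback are made up for illustration). The hardware pulls a block of samples on a fixed schedule no matter what; a decoder that falls behind produces a gap, not slower or degraded audio.

    import numpy as np
    from collections import deque

    SAMPLE_RATE = 44_100   # the DAC clock is fixed; it never waits for the decoder
    BLOCK = 512            # samples the hardware pulls per callback

    pcm_queue = deque()    # PCM blocks the decoder has produced so far

    def audio_callback():
        # Invoked by the sound hardware every BLOCK / SAMPLE_RATE seconds.
        # If the decoder kept up, hand over its data; if not, the DAC still
        # ticks along and we emit silence instead: an audible dropout.
        if pcm_queue:
            return pcm_queue.popleft()
        return np.zeros(BLOCK, dtype=np.int16)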
And the dropouts would be "significant" enough that pretty much any listener who is paying attention would notice, correct? That is, it would be more than just a few ms--it would be glaringly obvious.
Because they wish to hear a difference. There have been plenty of examples where people are double-blind tested and couldn't tell the difference between various audio formats. In a recent test where listeners heard music encoded as 128 kbps MP3, 320 kbps MP3, and WAV (i.e. lossless), the average score was about 60% accurate. Only 3% could accurately tell the difference across all six songs - and that's essentially identical to random chance (pure guessing would give you about 3.1%).
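For what it's worth, the chance level is easy to sanity-check. A quick Python sketch, assuming each trial is an independent two-way pick (an assumption about the test's design, not something stated in the writeup):

    # Probability that a pure guesser gets every trial right, assuming
    # each trial is an independent two-way choice (p = 0.5 per trial).
    # Five such trials give the quoted ~3.1%; six give ~1.6%.
    def p_all_correct(n_trials, p_per_trial=0.5):
        return p_per_trial ** n_trials

    print(p_all_correct(5))  # 0.03125
    print(p_all_correct(6))  # 0.015625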
I've also seen studies of varying legitimacy in which listeners preferred 128 or 192 kbps MP3 to WAV files, though that speaks less to qualitative difference and more to plain personal preference.
Heck, there has been talk recently about the value of 24 bit recordings (256x, or 48 dB, more dynamic range than the standard 16 bit), so a website ran a test between a larger-word-size audio sample and a smaller one. Most people could not tell the difference. The website then revealed that it had actually been comparing 16 bits to 8 bits - and people STILL couldn't tell the difference.
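If anyone wants to measure what word size does by itself, requantizing a signal is a one-liner. A rough Python sketch (quantize is a made-up helper); the measured SNR tracks the usual ~6 dB-per-bit rule:

    import numpy as np

    def quantize(x, bits):
        # Requantize a float signal in [-1, 1] to the given word size.
        levels = 2 ** (bits - 1) - 1          # 32767 for 16-bit, 127 for 8-bit
        return np.round(x * levels) / levels

    t = np.linspace(0, 1, 44_100, endpoint=False)
    x = 0.8 * np.sin(2 * np.pi * 440 * t)     # a 440 Hz test tone

    for bits in (16, 8):
        err = x - quantize(x, bits)
        snr = 10 * np.log10(np.mean(x**2) / np.mean(err**2))
        print(f"{bits}-bit: ~{snr:.0f} dB SNR")  # near 6.02*bits + 1.76 dB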
16 bit to 8 bit? That's kind of extreme. I sometimes make devices with things like the old ISD chips (1600, 2500, used in answering machines and such) or the HT8950 (I think - they're used in toys for voice modulation and suchlike) precisely because they sound like crap (and that is what I'm aiming for on certain occasions). Granted, those also have something like 8 kHz sampling rates (or even lower), but still... I suspect the difference would be more apparent with longer samples containing something more than just a human voice.
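For the curious, that "answering machine" sound is easy to fake in software. A rough sketch (crunch is a made-up name, and audio_44k is assumed to be a float array in [-1, 1] sampled at 44.1 kHz):

    import numpy as np
    from scipy.signal import decimate

    def crunch(audio_44k, factor=5, bits=8):
        # Anti-alias filter and downsample 44.1 kHz audio to ~8.8 kHz, then
        # requantize to 8 bits, roughly what the old ISD-style record/playback
        # chips do to a signal.
        low_rate = decimate(audio_44k, factor)   # nothing survives above ~4.4 kHz
        levels = 2 ** (bits - 1) - 1             # 127 for 8-bit signed audio
        return np.round(np.clip(low_rate, -1, 1) * levels) / levels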
99% of what people hear is what they think they SHOULD hear. Put a cheap amplifier in a $2000 tube amplifier case, and people will think it sounds better than a high-spec Carver amp in an old Onkyo case. (If they can see the cases, that is.)
Since you brought it up (well, not really, but kind of): should there be any discernible difference when replacing a germanium transistor (in an oscillator, within a transistor combo organ) with a comparable silicon transistor? I would think not, but, again, there are people who claim otherwise.