The recorded human samples vary more from one to the next: a sample as a whole may carry a bit of extra stress, or a slightly raised or lowered tone, compared to the others. Within a sample there is also a bit more emphasis on a word or two, or slight variance in the length of pauses, mostly appropriate in context (that is, similar to what I, a non-professional, would have emphasized if I had been asked to record these). The generated samples are more consistent as a group and more even in quality, with few instances of emphasis that seem (however slightly) out of place. I suspect that without evaluating the samples as a correlated group, distinguishing the generated samples from those recorded by a human would be little better than a coin toss.

And even when evaluating these samples as a group, I may be imagining the distinctions I am drawing from a relatively small selection that might be cherry-picked. This is a step change in quality compared to SOTA.