More than kind of like; you nailed it: exactly like. This has been an infamous issue in machine learning for decades, and researchers and developers can fall into it quite accidentally if they're not careful.
The thing is, training data is very often hard to come by due to monetary or other costs, so it's extremely tempting to share some data between training and testing -- yet it's a cardinal sin for exactly the reason you said.
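For anyone who wants the concrete version, here's a minimal sketch of the discipline in question, using scikit-learn on a made-up dataset (X, y, and the model choice are placeholders, not anything from this thread). The point is that the test set is carved off once and never touched until the final evaluation:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Made-up data standing in for a real dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    # Split once, up front. The test set is held out from all training and tuning.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = LogisticRegression().fit(X_train, y_train)  # fit on training data only
    print("held-out accuracy:", model.score(X_test, y_test))  # evaluate exactly once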
Historically, a number of the best machine learning packages (OCR, speech recognition, and more) owed their edge more to the quality of their data (including rigorous separation of training from test) than to the underlying algorithms themselves.
Anyway, it's important to note that developers can and do fall into this trap naively, not only when they're trying to cheat -- they can have perfectly good intentions and still get it wrong.
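A classic example of the accidental version (a hypothetical, well-known pattern, not something from this thread): fitting preprocessing on the full dataset before splitting, which silently leaks test-set statistics into the training features:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] > 0).astype(int)

    # LEAKY: the scaler's mean/std are computed over ALL rows, test set included,
    # so information about the test data sneaks into training.
    X_leaky = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, random_state=42)

    # SAFE: split first, then fit the scaler on the training portion only.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    scaler = StandardScaler().fit(X_train)   # statistics from training data only
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)

With a simple scaler the damage is usually small, but the same pattern applied to feature selection, target encoding, or deduplication can inflate test scores dramatically, and nobody involved ever intended to cheat.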
Therefore, discussions of methodology like the one above are pretty much always in order, and aren't inherently some kind of challenge to the honesty and honor of the devs.