Kudurru, the new tool from the creator of Have I Been Trained?, can help artists block web scrapers and even “poison” the scraping by sending back the wrong image.
For those who can’t get through the paywall, this is an article about a system called Kudurru that is monitoring a bunch of websites with images listed in the LAION-5B metadata set. When it sees the same IP address downloading images from those websites simultaneously, it assumes that it must be a bot that’s scraping the data in order to train an AI with it and either blocks them or “poisons” the scrape by sending incorrect images back.
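To make that detection idea concrete, here is a rough sketch of how it could work. This is a guess at the general approach, not Kudurru's actual code: the class name, the 10-second window, the three-site threshold, and the `respond` helper are all made up for illustration.

```python
from collections import defaultdict, deque


class ScrapeDetector:
    """Flags an IP that requests images from several monitored sites
    within a short time window. All thresholds are illustrative assumptions."""

    def __init__(self, window_seconds=10.0, site_threshold=3):
        self.window_seconds = window_seconds
        self.site_threshold = site_threshold
        # Per-IP history of (timestamp, site) pairs, oldest first.
        self._hits = defaultdict(deque)

    def observe(self, ip, site, timestamp):
        """Record one image request and return True if the IP looks like a scraper.

        Assumes timestamps arrive in non-decreasing order.
        """
        hits = self._hits[ip]
        hits.append((timestamp, site))
        # Discard requests that have fallen out of the time window.
        while hits and timestamp - hits[0][0] > self.window_seconds:
            hits.popleft()
        distinct_sites = {s for _, s in hits}
        return len(distinct_sites) >= self.site_threshold


def respond(flagged, real_image, decoy_image, poison=True):
    """Pick what to send back: the real image, a decoy ("poison"), or nothing (block)."""
    if not flagged:
        return real_image
    return decoy_image if poison else None
```

Something like `detector.observe(ip, site, time.time())` would run on every image request across the participating sites, and the result would decide whether `respond` hands back the real file or a decoy. A real system would also need shared state between sites and some handling for proxies and shared IPs, which this sketch ignores.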
Frankly, I don’t see this having much impact. AI training has moved beyond simply using LAION-5B; we’re discovering that a smaller, higher-quality dataset works better than just throwing mountains of data at the AI in training. So anything a trainer downloads is going to be extensively curated before being used for training, and this sort of obstruction will be fixed or filtered out.
But the main result is achieved anyway, right? The picture that the system tried to download did not make it into the training set.
Unless the “this sort of obstruction will be fixed” part means the image is downloaded anyway. This is the weakest sort of DRM.
Thanks