Common Crawl: a large-scale theft of intellectual property?

I was doing some research yesterday and discovered the Common Crawl dataset - from which GPT3 sources about 60% of its data.

It’s released under a non-sublicensable, non-assignable, non-transferable limited licence. I know the limited licence probably means something like the GPL but I don’t understand the other bits.

This matters because the data is made available in the States under fair use protections; however, the dataset is, essentially, ripped wholesale off the clearnet regardless of the copyright status of the original websites.

As mentioned, this dataset makes up the larger part of the data used to train GPT3.5, of which ChatGPT is a fine-tuned version. It also appears that use of the dataset is illegal in the EU without modification.

It is interesting to note that fair use within the data mining sector has a disputed legal precedent in the realm of digitisation of books, but that precedent relates to the digitisation of physical literature rather than the bulk processing of published, copyrighted digital works.

Why all this discussion of copyright? Because commercial services are being built off the back of machine learning products. In my eyes this looks like a flimsy excuse to rip off the public wholesale and sell our work back to us.

It almost smells like a class action lawsuit. Thoughts?

@PrivacyDingus provided some very valuable research on this front. I’m yet to analyse it, but an initial skim makes it seem incredibly relevant to this topic.

Okay, it doesn’t constitute thoughts, but I feel like you’re currently in thoughtsponge mode and so I’m just going to throw out something else which is relevant(ish). Late 2021 in search we were all talking about Australia’s quite bold/wild move to pressure Google/Facebook to pay for links (a “link tax”), with news publishers claiming that their work was being disseminated through these (ad-laden) networks and they weren’t being compensated for it. Essentially, the news was an attraction to people, a reason to visit Google or Facebook, and this product wasn’t providing any revenue stream back to the publishers.

We all thought this was quite out there, especially as the websites themselves run ads, and really they desperately need the two beasts, which they very much helped to create, to provide them with their traffic. We fundamentally didn’t think this was going to happen, then…

For a long while people have been able to brush off these claims by the MSM (using that term in the least conspiracy-theorist way possible) because what was happening wasn’t like what you’re looking to dig into. Really, the relationship between a search engine and a newspaper was symbiotic, as was that between a social media network and a newspaper.

This being said, the process which you’re talking about here is fundamentally different. Before, people were building products and services based off of data gleaned from the open web, but there was abstraction between that process and the data itself. I can see a lot of legal issues for sure, especially as a state was willing to class snippets from an article which was viewed in its goddamn entirety by the end user a lot of the time (I’d wager, but then again, as my grandad said during Christmas, “doesn’t anyone read these days?”) as deserving of a revenue stream… who is to say this doesn’t go the same way?

Hard to attribute, mind. I’d reckon that @MediaActivist probably has something on this particular case if they are across it. Even just a take if not :smiley:

I wonder if attribution matters in a class action lawsuit? As long as you can prove a significant proportion of the work is based on copyright infringement, you don’t necessarily have to make direct attribution. Also, with regard to platforms like Midjourney it’s a lot easier, because art style is a lot more practical to trace. Not easy, mind you, but easier.

Especially as, if you’re arguing the creation of an original work, you cross into the territory of AI and consciousness, imo. If it’s a machine learning algorithm then it’s surely impossible to argue that the resultant work is not derivative? If it is conscious (a very high bar to pass) we have bigger problems.