@t0k I think you're right that they wanted the data, but I also think they wanted the potential customer base to sell into.
I don't understand what you mean by breaking copyright law, though. Plenty of the code on GitHub isn't owned by the people uploading it, and therefore they cannot give Microsoft any more rights to it than they would have had if Microsoft had just downloaded it from GitHub without needing to buy GitHub.
@freakazoid Circumventing copyright law is probably a better wording.
There are two somewhat independent things that nevertheless play together:
1) Microsoft got access to an enormous code base. Even though many repos on GH are public, accessing and indexing ALL of it as a third party is difficult (for example, GH blocks your IP).
2) Circumventing copyright law: I believe that #Copilot basically does some context-sensitive statistical copy-pasting. In some sense this automatically rewrites existing code (protected by copyright) into another piece of code.
Like a compiler translates code into binary. There's an automated transformation process. The question is: Is the output also protected by copyright? In the case of compilers it is.
I speculate that in the view of MS this transformation strips away copyright.
@t0k That's interesting, given that in the past MS has had a fairly maximalist interpretation of GPL, where they would not allow their engineers to even LOOK at GPL code for fear of subjecting their other code to potential copyright claims. But perhaps they believe that training algorithms is the equivalent of "clean room" reverse engineering, where one team documents the thing, then leaves, and another team uses the documentation to implement a new thing. But I would find that surprising.
@t0k Google Books got shut down because it wasn't considered permissible to even make copyrighted works searchable without permission. But in that case copying wasn't allowed at all.
I suspect that there will be a big legal fight over just this issue when companies start using AIs to produce works that would be copyrightable if they were produced by a human. Is AI-generated music trained on pop songs a derivative work of every song?
@t0k If that's the case, wouldn't every piece produced by a composer who had ever listened to copyrighted music also constitute a derivative work? For music, the standard has been the amount of the song that sounds the same, and it's primarily focused on "sampling."
Maybe something similar could be applied to software, but I would personally oppose expanding IP even further. I don't think we need IP at all.