StarCoderData: a large-scale code dataset derived from The Stack (v1.2) (Kocetkov et al., 2022), a collection of permissively licensed GitHub code to which deduplication and filtering of opted-out files have been applied. In addition to source code, the dataset includes supplementary resources such as GitHub Issues and Jupyter Notebooks (Li et al., 2023).
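The two preprocessing steps mentioned above (deduplication and removal of opted-out files) can be sketched roughly as follows. This is a minimal illustration, not The Stack's actual pipeline; the function name, the `(repo, content)` input format, and the opt-out list are all assumptions for the example.

```python
import hashlib

def preprocess(files, opted_out_repos):
    """Filter a crawl: drop opted-out repos, then drop exact duplicate files.

    files           -- iterable of (repo_name, file_content) pairs
    opted_out_repos -- set of repo names whose owners requested removal
    """
    seen_hashes = set()
    kept = []
    for repo, content in files:
        if repo in opted_out_repos:
            # Honor opt-out requests before anything else.
            continue
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            # Exact deduplication by content hash: keep only the first copy.
            continue
        seen_hashes.add(digest)
        kept.append((repo, content))
    return kept
```

In practice large-scale pipelines also apply near-deduplication (e.g. MinHash over token shingles) rather than only exact hashing, but the exact-hash version above captures the basic idea.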
That’s not random GitHub accounts or “delicensing” anything. People had to opt IN to be part of “The Stack”. Apertus isn’t training itself from community code.
I’m tired of arguing with you about this, and you’re still wrong. It was opt-out, not opt-in, based initially on a GitHub crawl of 137M repos and 52B files before filtering & dedup.
But again, you’d have to set your project to public and your license to “anyone can take my code and do whatever they want with it” before it’d even be added to that list. That’s opt-in, not opt-out. I don’t see the ethical dilemma here. I’m pretty sure I’ve found ethical AI that produces good value for me and society, and I’m going to keep telling people about it and how to use it.
Apertus is most certainly trained on source code hosted on GitHub. It is laid out here in their technical report:
https://github.com/swiss-ai/apertus-tech-report
It uses a large dataset called The Stack, among others.