{"id":2410,"date":"2025-06-07T15:15:35","date_gmt":"2025-06-07T15:15:35","guid":{"rendered":"https:\/\/musictechohio.online\/site\/ai-built-ethical-data\/"},"modified":"2025-06-07T15:15:35","modified_gmt":"2025-06-07T15:15:35","slug":"ai-built-ethical-data","status":"publish","type":"post","link":"https:\/\/musictechohio.online\/site\/ai-built-ethical-data\/","title":{"rendered":"The Tech Industry Said It Was &#8220;Impossible&#8221; to Create AI Based Entirely on Ethically-Sourced Data, So These Scientists Proved Them Wrong in Spectacular Fashion"},"content":{"rendered":"<div>\n<div><img loading=\"lazy\" width=\"1200\" height=\"630\" src=\"https:\/\/wordpress-assets.futurism.com\/2025\/06\/ai-built-ethical-data.jpg\" class=\"attachment-full size-full wp-post-image\" alt=\"A team of researchers have created a dataset using only openly licensed or public domain content and used it to train an AI model.\" style=\"margin-bottom: 15px;\" decoding=\"async\"><\/div>\n<p><span style=\"font-weight: 400;\">A team of more than two dozen AI researchers from MIT, Cornell University, the University of Toronto, and other institutions have trained a large language model only using data that was openly licensed or in the public domain, the <\/span><a href=\"https:\/\/www.washingtonpost.com\/politics\/2025\/06\/05\/tech-brief-ai-copyright-report\/\"><i>Washington Post <\/i><\/a><a href=\"https:\/\/www.washingtonpost.com\/politics\/2025\/06\/05\/tech-brief-ai-copyright-report\/\">reports<\/a>, providing a blueprint for ethically developing the technology.<\/p>\n<p>But, as the creators readily admit, it was <i>far<\/i> from easy.<\/p>\n<p>As they describe in a yet-to-be-peer-reviewed <a href=\"https:\/\/github.com\/r-three\/common-pile\/blob\/main\/paper.pdf\">paper<\/a> published this week, it quickly became apparent that it wouldn&#8217;t be computing power holding them back, but personpower.<\/p>\n<p>That&#8217;s because the text in the over eight terabyte dataset they put 
together, which they&#8217;re calling the Common Pile v0.1, had to be manually cleaned up and reformatted to make it suitable for AI training, <i>WaPo<\/i> explains. Then there was the enormous amount of extra legwork involved in double-checking the copyright status of all the data, since many online works are improperly licensed.<\/p>\n<p>&#8220;This isn&#8217;t a thing where you can just scale up the resources that you have available,&#8221; like access to more computer chips and a fancy web scraper, study coauthor Stella Biderman, a computer scientist and executive director of the nonprofit EleutherAI, told <i>WaPo<\/i>. &#8220;We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that&#8217;s just really hard.&#8221;<\/p>\n<p>Still, Biderman and her colleagues <i>did<\/i> get the job done.<\/p>\n<p>Once the painstaking odyssey of creating the Common Pile was over, they used their guilt-free dataset to train a seven-billion-parameter LLM. The result? An AI that admirably stacks up against industry models like Meta&#8217;s Llama 1 and Llama 2 7B \u2014 which is impressive, but those were versions released over two years ago. That&#8217;s practically a lifetime in the AI race.<\/p>\n<p>Of course, this was accomplished by a\u00a0more or less ragtag team, not a corporation with billions of dollars in resources, and the group had to make up for that gap with scrappiness. One particularly resourceful find was a set of over 130,000 English-language books in the Library of Congress that&#8217;d been overlooked.<\/p>\n<p>Copyright remains one of the biggest ethical and legal questions looming over AI. 
Leaders like OpenAI and Google burned through <a href=\"https:\/\/futurism.com\/the-byte\/ai-training-data-shortage\">unfathomable amounts of data<\/a><span style=\"font-weight: 400;\"> on the surface web to get to where they are, devouring everything from news articles to stuff as invasive as your social media posts. And Meta has been <\/span><a href=\"https:\/\/futurism.com\/meta-copyrighted-books-no-value\"><span style=\"font-weight: 400;\">sued by authors<\/span><\/a><span style=\"font-weight: 400;\"> who allege that it illegally used <\/span><a href=\"https:\/\/futurism.com\/zuckerberg-books-train-meta-ai-libgen\"><span style=\"font-weight: 400;\">seven million copyrighted books<\/span><\/a><span style=\"font-weight: 400;\"> that it pirated to train its AIs.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The tech industry has justified its rapacious data demands by <\/span><a href=\"https:\/\/www.theverge.com\/news\/630079\/openai-google-copyright-fair-use-exception\"><span style=\"font-weight: 400;\">arguing<\/span><\/a><span style=\"font-weight: 400;\"> that it all counts as fair use \u2014 and more existentially, that it would be &#8220;<\/span><a href=\"https:\/\/futurism.com\/openai-content-new-york-times-lawsuit\"><span style=\"font-weight: 400;\">impossible<\/span><\/a><span style=\"font-weight: 400;\">&#8221; to develop this technology without vacuuming everyone&#8217;s content up for free.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This latest work is a rebuff to that Silicon Valley line, though it doesn&#8217;t obviate all ethical concerns. 
This is still a large language model, a technology fundamentally intended to destroy jobs, and perhaps not everyone whose work has ended up in the public domain would be happy with it being regurgitated by AI \u2014 if they aren&#8217;t dead artists whose copyright has elapsed, of course.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Even if AI firms are reined in and made to use only works with permission or compensation \u2014 a big if \u2014 the fact remains that as long as these companies stick around, there will be significant pressure on copyright holders to allow AI training.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Biderman herself doesn&#8217;t have any illusions that the likes of OpenAI will suddenly turn over a new leaf and start being paragons of ethical data sourcing. But she hopes her work will at least get them to stop hiding what they&#8217;re using to train their AI models.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&#8220;Even partial transparency has a huge amount of social value and a moderate amount of scientific value,&#8221; she told <\/span><i><span style=\"font-weight: 400;\">WaPo<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><strong>More on AI: <\/strong><em><a href=\"https:\/\/futurism.com\/meta-moderators-ai-lgbtq\">If You Thought Facebook Was Toxic Already, Now It&#8217;s Replacing Its Human Moderators with AI<\/a><\/em><\/p>\n<p>The post <a href=\"https:\/\/futurism.com\/ai-built-ethical-data\">The Tech Industry Said It Was &#8220;Impossible&#8221; to Create AI Based Entirely on Ethically-Sourced Data, So These Scientists Proved Them Wrong in Spectacular Fashion<\/a> appeared first on <a href=\"https:\/\/futurism.com\/\">Futurism<\/a>.<\/p>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>A team of more than two dozen AI researchers from MIT, Cornell University, the 
University of Toronto, and other institutions have trained a large language model only using data that&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[182,177,183,1453],"tags":[],"class_list":["post-2410","post","type-post","status-publish","format-standard","hentry","category-ai-chatbots","category-artificial-intelligence","category-generative-ai","category-large-language-models"],"_links":{"self":[{"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/posts\/2410","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/comments?post=2410"}],"version-history":[{"count":0,"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/posts\/2410\/revisions"}],"wp:attachment":[{"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/media?parent=2410"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/categories?post=2410"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/musictechohio.online\/site\/wp-json\/wp\/v2\/tags?post=2410"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}