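A new report from the Stanford Internet Observatory found that LAION-5B, a massive open-source AI dataset used to train popular AI text-to-image generators such as Stable Diffusion and Google’s Imagen, contains at least 1,008 instances of child sexual abuse material, along with thousands more suspected cases.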
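The LAION-5B dataset, released in March 2022, contains more than 5 billion images and associated captions from the internet, and may also include thousands of additional pieces of suspected child sexual abuse material (CSAM), according to the report.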
The report warned that CSAM in the dataset could enable AI products built on this data to output new and potentially realistic child abuse content.
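In response, LAION told 404 Media on Tuesday that, out of “an abundance of caution,” it was temporarily taking down its datasets “to ensure they are safe before republishing them.”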
LAION datasets have come under criticism before
But this is not the first time LAION’s image datasets have come under fire.
As early as October 2021, cognitive scientist Abeba Birhane, now a senior fellow in trustworthy AI at Mozilla, published a paper that examined LAION-400M, an earlier image dataset.
In September 2022, an artist discovered that private medical record photos taken by her doctor in 2013 were referenced in the LAION-5B image dataset.
The artist, Lapine, found the photos on the Have I Been Trained website, which lets people search for their work in popular AI training datasets.
And a class-action lawsuit, Andersen et al. v. Stability AI LTD et al., was brought by visual artists Sarah Andersen, Kelly McKernan, and Karla Ortiz against Stability AI, Midjourney, and DeviantArt in January 2023. While LAION was not sued, it was named in the lawsuit, which said that “Stability is alleged to have ‘downloaded or otherwise acquired copies of billions of copyrighted images without permission to create Stable Diffusion’ known as ‘training images.’ Over five billion images were scraped (and thereby copied) from the internet for training purposes for Stable Diffusion through the services of an organization (LAION, Large-Scale Artificial Intelligence Open Network) paid by Stability.”
Ortiz, an award-winning artist who has worked for Industrial Light & Magic (ILM), Marvel Film Studios, Universal Studios and HBO, spoke at a virtual FTC panel in October and discussed the LAION-5B dataset.
“LAION-5B is a dataset that contains 5.8 billion text and image pairs, which…includes the entirety of my work and the work of almost everyone I know,” she said. “Beyond intellectual property, data sets like LAION-5B also contain deeply concerning material like private medical records, non consensual pornography, images of children, even social media pictures of our actual faces.”
As VentureBeat reported in September, Andrew Ng, former co-founder and head of Google Brain, has made no bones about the fact that the latest advances in machine learning have depended on free access to large quantities of data, much of it scraped from the open internet.
In an issue of his DeepLearning.ai newsletter, The Batch, titled “It’s Time to Update Copyright for Generative AI,” he wrote that a lack of access to massive popular datasets such as Common Crawl, The Pile, and LAION would put the brakes on progress or at least radically alter the economics of current research.
“This would degrade AI’s current and future benefits in areas such as art, education, drug development, and manufacturing, to name a few,” he said.
And in the June 7 edition of The Batch, Ng admitted that the AI community is entering an era in which it will be called upon to be more transparent in our collection and use of data. “We shouldn’t take resources like LAION for granted, because we may not always have permission to use them,” he wrote.
Hamburg, Germany-based high school teacher and trained actor Christoph Schuhmann helped found LAION, short for “Large-scale AI Open Network.” According to an April 2023 Bloomberg article, Schuhmann was hanging out on a Discord server for AI enthusiasts and was inspired by the first iteration of OpenAI’s DALL-E to make sure there would be an open-source dataset to help train text-to-image diffusion models.
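“Within a few weeks, Schuhmann and his colleagues had 3 million image-text pairs. After three months, they released a dataset with 400 million pairs,” the Bloomberg article said. “That number is now over 5 billion, making LAION the largest free dataset of images and captions.”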
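Since then, LAION, a nonprofit, has spoken out publicly on open-source AI. For example, after a March 2023 open letter calling for an AI “pause” set off a fierce debate over risk versus hype, LAION called for accelerating research and establishing a joint international computing cluster for large-scale open-source AI models.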
LAION was scraped in part from the visual data of shopping sites
Part of LAION was built by scraping visual data from online shopping services such as Shopify, eBay and Amazon.
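In a recent paper from the Allen Institute for AI titled “What’s In My Big Data?”, researchers examined LAION-2B-en, a subset of LAION-5B that contains 2.32 billion images with English captions.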
For example, it found that 6% of the documents in LAION-2B-en came from Shopify.
Jesse Dodge, a research scientist at the Allen Institute for AI, told VentureBeat last November that the finding was surprising because no one had examined this before.