Microsoft Explores Rewarding Contributors to AI Training Data
Microsoft is launching a research project to study how specific training examples influence the outputs of generative AI models, including text, images, and other media types. The effort came to light through a job listing that surfaced on LinkedIn.
Overview of Microsoft’s Research Initiative
The listing seeks a research intern for the project, which aims to demonstrate that models can be trained to efficiently estimate the influence of particular data sources, such as photos and books, on their generated outputs. The listing states:
“Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this.”
This initiative aims to create incentives and recognition for individuals who contribute valuable data, potentially reshaping the future of AI model training.
Legal Challenges and Copyright Issues
Generative AI technologies, including those that produce text, images, and music, are at the center of numerous intellectual property lawsuits. Many AI companies utilize vast amounts of publicly available data for training their models, often without explicit permission from copyright holders. While these companies argue that their practices fall under the fair use doctrine, many creatives—including artists, programmers, and authors—strongly oppose this justification.
Microsoft, in particular, is facing at least two legal challenges from copyright holders:
- The New York Times has sued Microsoft and its collaborator, OpenAI, alleging that the companies infringed its copyright by using millions of its articles for model training.
- Several software developers have filed lawsuits against Microsoft, claiming that the GitHub Copilot AI coding assistant was unlawfully trained using their copyrighted works.
The Concept of “Data Dignity”
Microsoft’s new research effort, referred to as “training-time provenance,” reportedly involves the expertise of Jaron Lanier, a prominent technologist at Microsoft Research. In an April 2023 op-ed in The New Yorker, Lanier discussed the idea of “data dignity,” which connects digital content with its human creators.
Lanier posits:
“A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output.”
For example, if a user requests an animated movie based on their children in a fantastical setting, the key contributors—such as artists and writers—could be acknowledged and compensated for their influence on the creation.
Existing Efforts in Data Compensation
Several companies are already exploring similar concepts:
- Bria, an AI model developer, recently raised $40 million and claims to compensate data owners based on their overall influence.
- Adobe and Shutterstock provide regular payouts to dataset contributors, though the exact payment amounts are not always clear.
Despite these efforts, most large AI labs have not established payout programs for individual contributors. Instead, they typically allow copyright holders only to opt out of training, a process that can be cumbersome and applies only to future models.
Potential Implications of Microsoft’s Project
While Microsoft’s research could serve as a proof of concept, similar initiatives have stalled before. OpenAI, for instance, announced a tool to let creators specify how their works are included in training data, but nearly a year later the tool has yet to ship.
Critics speculate that Microsoft may be attempting to “ethics wash” its practices to preemptively address regulatory or legal challenges that could threaten its AI operations. This is especially relevant given recent calls from major AI labs, including Google and OpenAI, to weaken copyright protections related to AI development.
Microsoft has not yet responded to requests for comment on this developing story.