OpenAI wants to work with organizations to build new AI training datasets

The company says the effort will produce public and private databases.


OpenAI is rolling out a new partnership program to collect datasets from third parties that it intends to use to train its AI models. The initiative, OpenAI Data Partnerships, will seek large-scale private and public information that it says is “not already easily accessible online to the public.” The company says the data it will collect doesn't necessarily have to be quantitative or in text formats — the program will also accept images, audio or video.

Notably, the company says it's on the lookout for data on “any topic” and in “any language” so long as it “expresses human intention,” which it likens to long-form essays or transcribed conversations. Human-centric data collected by OpenAI is expected to help the company improve tools like its automatic speech recognition technology, which is used to transcribe spoken words. This initiative also lines up with ChatGPT’s recent expansion to support voice queries, which lets it engage with users in a conversational manner. Exposing its AI models to more information that teaches them how to hold human-like conversations should further improve this feature and similar tools that follow.

The model testing conducted throughout the data partnership program will also naturally expand the capabilities of OpenAI’s consumer-facing GPT-4 Turbo, which has been updated to provide users with more complex and meaningful responses. OpenAI says it has already started working with interested organizations, including authoritative bodies like the Icelandic government. Through curated datasets, OpenAI says it's working to improve GPT-4’s ability to comprehend queries made in the Icelandic language.

If a private or public organization wants to participate in the program, a representative can submit a form on the company’s website and share information on the type and size of the data they intend to share. There are two pathways for datasets. The first is the Open-Source archive, which is ideal for datasets relevant to training language models. However, submissions made to it will be public for anyone to use. Alternatively, OpenAI says a company can submit information through its private dataset pathway, which will be used to train proprietary AI models, including its “foundation models” and “fine-tuned and custom models.” This is recommended for companies or institutions that want to keep their data confidential. Even so, OpenAI says it is not looking for datasets that contain sensitive or personal information.

ChatGPT has already set records for its soaring user base. It has about 100 million weekly active users around the world, meaning privacy will only continue to be a focal point for the tool. Previously, Samsung employees were put in the hot seat for leaking sensitive data to the AI model. While OpenAI claims it does not use data generated by its API to train its models unless a user explicitly submits information through an opt-in form, all eyes will be on how the company handles the data collected through this initiative, especially the private datasets.