AI data training companies like Scale AI are hiring poets

Silicon Valley’s biggest generative artificial intelligence developers are looking for a new kind of data worker: poets.

A string of job postings from high-profile training data companies, such as Scale AI and Appen, seeks poets, novelists, playwrights, or writers with a PhD or master’s degree. Dozens more seek general annotators with humanities degrees, or years of work experience in literary fields. The listings aren’t limited to English: Some are looking specifically for poets and fiction writers in Hindi and Japanese, as well as writers in languages less represented on the internet.

The companies say contractors will write short stories on a given topic, which will then be fed into AI models. The workers will also give feedback on the literary quality of existing AI-generated text.

The listings illustrate the often-obscured connection between generative AI’s impressive capabilities and the invisible annotation work that powers them. When ChatGPT launched in November 2022, observers were particularly impressed by its ability to write poems in English. Now, annotation firms are collecting creative writing data samples that could extend those powers into other languages. It is a sign that AI developers have flagged fluency in poetic forms as a priority, while refining their generative writing products.

The investment could have dividends for AI firms, according to Dan Brown, a professor at the University of Waterloo who researches computational creativity. “If you can properly generate tabloid headlines in French, that’s one thing. But if [a product] can replicate [Victor] Hugo’s style or somebody famous, that gets a different kind of credibility,” he told Rest of World. “Replicating classical language forms is a way of looking prestigious.”

Scale AI and Appen’s client rosters include some of the biggest names in AI development, including OpenAI, Meta, Google, and Microsoft. These are companies that are trying to take the lead in an increasingly competitive generative AI race. “The first company advantage in this space is incredibly big,” Brown said. “If there are countries and languages for which companies are failing and somebody can come in and snap those spaces up, it’s an opportunity for them to wrap up the market before any new players can come in.”

In a statement to Rest of World, an Appen spokesperson said the demand for writing contractors has increased significantly since the end of 2022, including in languages other than English. “When hiring for contributor roles like this one, we identify the types of skills required to develop high-quality training data for a particular use case and client,” the spokesperson said. “In this case, creative writers have a unique expertise that enables us to develop high-quality training data for creative AI generation like poetry, song lyrics and narrative writing,” they said.

A spokesperson for Scale AI declined to answer any specific questions about their recruitment efforts for competitive reasons. “Our work has and always will include humans in the loop as it’s critical for developing responsible, safe, and accurate AI,” they wrote in a statement to Rest of World.

Training an AI tool to generate high-quality literary writing, like poetry, is no small challenge. Many large language models (LLMs) are not trained to be creative. One of the criteria used by AI researchers to judge creativity is novelty — how different the writing generated by a model is from what already exists in the world. But tools like ChatGPT were built to mimic human writing, not to innovate on it.

“They are trained to reproduce. They are not designed to be great, they try to be as close as possible to what exists,” Fabricio Goes, who teaches informatics at the University of Leicester, told Rest of World, explaining a popular stance among AI researchers. “So, by design, many people argue that those systems are not creative.”

There is a reason many of the first regularly published stories written by AI were football recaps and financial news reports. These are types of writing that often follow easily replicable formats, and rarely require originality. Poetry, meanwhile, is often judged by its ability to weave imagery in surprising ways or conjure a certain mood.

“When human beings [write poetry], it’s very, very difficult for human beings to do it well,” said Brown, noting that most poets go through rounds of editing and revision that LLMs are not trained to do. “Even now, after this LLM revolution has started, these machines are not machines for novelty.”

ChatGPT, for example, even struggles to imitate the structure and rhythm of well-established poets in English, especially when the poets are famous for breaking literary norms. A recent study found ChatGPT largely fails to produce English-language poems in the style of Walt Whitman, one of the more easily accessible poetry catalogs in the American canon. Whitman’s style features fluid and unstructured verse, but ChatGPT often wrongly defaulted to the rigid norm of four-line stanzas. It continued to do this even when prompted not to.

These issues are often exacerbated when ChatGPT is asked to produce poetic writing in languages other than English. The same researchers struggled to imitate common Polish styles of poetry, according to Goes. Earlier this year, researchers attempted to refine models to address shortcomings in AI-generated Japanese poetry, such as haiku and waka.

Rest of World observed similar problems when we tested ChatGPT’s ability to write a poem in Tamil. The poems were incoherent at best.

To date, there is evidence that major AI developers have been relying on easily scrapable databases to train their models for literary writing. That includes Project Gutenberg, an open-source database with tens of thousands of literary works in the public domain. Some researchers also speculate developers have been scraping Archive of Our Own, commonly known as AO3, a platform hosting over 5 million works of fan fiction. The copyrighted works of famous authors including Stephen King, Zadie Smith, and George Saunders were recently reported by The Atlantic to be part of the popular LLM data set Books3.

Like most data assembled by scraping the internet, many of these databases are largely dominated by the English language.

Scale AI and Appen’s clients are paying a clear premium for creative writers to help fill this literary language gap. In Japanese, for example, Scale AI only offers $13.98 per hour for a standard data worker. But for an expert Japanese-language poet, book editor, or creative writer, the company has rates as high as $50 per hour. The requirement that applicants have a graduate school degree likely contributes to this pay bump.

Rest of World previously reported that Scale AI pays a mere fraction of $50 per hour for standard data workers in underrepresented languages. Telugu-speaking contractors, for example, can only earn $1.43 per hour.

There is precedent for these companies to lean on experts for data work — whether that be clinicians annotating medical images, or former military personnel working on defense-related AI products. Milagros Miceli, a researcher at the Distributed AI Research Institute (DAIR), told Rest of World this trend towards professionalization has only picked up in the last six months. Companies are shifting from building LLMs from scratch, to fine-tuning them for specific applications.

“It’s not enough now that someone just speaks the language. It’s not enough that someone is native,” said Miceli, noting rising standards for crowd-based data work. “They have to have a very broad vocabulary and be in total command of the language.”

Julian Posada, an assistant professor at Yale University, and a member of the law school’s Information Society Project, questions whether creatives will accept this work as a sustainable source of employment. But he told Rest of World it may sidestep one of the main criticisms of AI coming from creative industries: copyright infringement.

In recent months, workers in creative industries including manga illustrators in Japan, musical artists in India, and TV writers in the U.S. have been protesting AI developers’ blasé approach to copyright law. Most recently, several class action lawsuits have been filed against OpenAI by prominent authors and playwrights, including Pulitzer Prize winner Michael Chabon. They claim their copyrighted work was included in ChatGPT’s training data without permission, since the tool can accurately summarize their work and imitate their style. Any text written for Scale AI or Appen, however, is likely to be owned in full by the training data company or its clients.

“We could be going to the point where you cannot fit copyrighted material into many models,” Posada said, forecasting a change in the industry if this recent wave of copyright litigation is successful. “This could be a solution that the tech sector is considering: just purchasing creative writing to feed AI models.”
