21 June 2023
Open-source AI chatbots are booming — what does this mean for researchers?

The craze for generative artificial intelligence (AI) that began with the release of OpenAI’s ChatGPT shows no sign of abating. But while large technology companies such as OpenAI and Google have captured the attention of the wider public — and are finding ways to monetize their AI tools — a quieter revolution is under way among researchers and software engineers at smaller organizations.


Whereas most large technology companies have become increasingly secretive, these smaller actors have stuck to the field’s ethos of openness. They span the spectrum from small businesses and non-profit organizations to individual hobbyists, and some of their activity is motivated by social goals, such as democratizing access to technology and reducing its harms.



Such open-source activity has been “exploding”, says computer scientist Stella Biderman, head of research at EleutherAI, an AI research institute in New York City. This is particularly true for large language models (LLMs), the data-hungry artificial neural networks that power a range of text-oriented software, including chatbots and automated translators. Hugging Face, a New York City-based company that aims to expand access to AI, lists more than 100 open-source LLMs on its website.


LLaMA leak


Last year, Hugging Face led BigScience, a coalition of volunteer researchers and academics, to develop and release one of the largest LLMs yet. The model, called BLOOM, is a multilingual, open-source system designed for researchers. It continues to be an important tool: the paper that described it has since amassed more than 300 citations, mostly in computer-science research.


In February, an even bigger push came for the open-source movement when Facebook’s parent company, Meta, made a model called LLaMA freely available to selected external developers. Within a week, the model’s weights had been leaked and published online for anyone to download.


The availability of LLaMA has been a game-changer for AI researchers. It is much smaller than other LLMs, meaning that it doesn’t require large computing facilities to host the pretrained model or to adapt it for specialized applications, such as acting as a mathematics assistant or a customer-service chatbot. The biggest version of LLaMA consists of 65 billion parameters: the variables set during the neural network’s initial, general-purpose training. That is less than half of BLOOM’s 176 billion parameters, and a fraction of the 540 billion parameters of Google’s PaLM.
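A rough calculation shows why the parameter count matters so much for hardware. Stored at 16-bit precision, each parameter occupies two bytes, so the memory needed just to hold a model’s weights grows linearly with its size. The Python sketch below illustrates the arithmetic; the figures are ballpark estimates for the weights alone, because serving a model also needs memory for activations and other state:

```python
# Rough memory needed just to hold a model's weights
# (activations, caches and other buffers would add more).
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Gibibytes required to store n_params at the given precision."""
    return n_params * bits_per_param / 8 / 2**30

for name, n_params in [("LLaMA 65B", 65e9),
                       ("BLOOM", 176e9),
                       ("PaLM", 540e9)]:
    print(f"{name}: ~{weight_memory_gib(n_params, 16):.0f} GiB at 16-bit, "
          f"~{weight_memory_gib(n_params, 4):.0f} GiB at 4-bit")
```

Even the 65-billion-parameter LLaMA needs roughly 120 GiB for its weights at 16-bit precision, still beyond a single consumer machine, which is why the shrinking techniques described below matter.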


“With LLaMA, some of the most interesting innovation is on the side of efficiency,” says Joelle Pineau, vice-president of AI research at Meta and a computer scientist at McGill University in Montreal, Canada.


Open-source developers have been experimenting with ways of shrinking LLaMA down even further. Some of these techniques keep the number of parameters the same but reduce the parameters’ precision — an approach known as quantization that, perhaps surprisingly, does not cause unacceptable drops in performance. Others reduce the number of parameters themselves: in knowledge distillation, for example, a separate, smaller neural network is trained on the responses of a large, pretrained network, rather than directly on the data.
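Both ideas can be illustrated in a few lines. The PyTorch sketch below shows the simplest form of each: symmetric 8-bit weight quantization, which stores each weight as an integer plus a shared scale factor, and a knowledge-distillation loss, which trains a small ‘student’ network to match a large ‘teacher’ network’s softened outputs. Production systems use more sophisticated per-channel and group-wise schemes, but the principle is the same:

```python
import torch
import torch.nn.functional as F

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization: store each weight as an
    8-bit integer plus one shared floating-point scale factor."""
    scale = w.abs().max() / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate floating-point weights for computation."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)        # stand-in weight matrix
q, scale = quantize_int8(w)        # 4x smaller than 32-bit floats
error = (dequantize(q, scale) - w).abs().mean()
print(f"mean reconstruction error: {error:.4f}")

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Knowledge distillation: train the student to match the
    teacher's softened output distribution, not the raw data."""
    T = temperature
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
```

The quantized matrix occupies a quarter of the memory of its 32-bit original, yet the average reconstruction error stays small relative to the spread of the weights, which is why aggressive precision cuts cost so little accuracy in practice.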


Within weeks of the LLaMA leak, developers managed to produce versions that could fit onto laptops and even a Raspberry Pi, the bare-bones, credit-card-sized computer that is a favourite of the ‘maker’ community. Hugging Face is now primarily using LLaMA, and is not planning to push for a BLOOM-2.


Shrinking down AI tools could help to make them more widely accessible, says Vukosi Marivate, a computer scientist at the University of Pretoria in South Africa. For example, it could help organizations such as Masakhane, a community of African researchers led by Marivate that is trying to make LLMs work for languages for which there isn’t a lot of existing written text that can be used to train a model. But the push towards expanding access still has some way to go: for some researchers in low-income countries, even a top-of-the-range laptop can be out of reach. “It’s been great,” says Marivate, “but I would also ask you to define ‘cheap’.”


Looking under the hood


For many years, AI researchers routinely made their code open source and posted their results on preprint repositories such as arXiv. “People collectively understood that the field would progress more quickly if we agreed to share things with each other,” says Colin Raffel, a computer scientist at the University of North Carolina at Chapel Hill. The transformer architecture that underlies current state-of-the-art LLMs, for example, was created at Google and released as open source.


Making neural networks open source enables researchers to look ‘under the hood’: to try to understand why the systems sometimes answer questions in unpredictable ways, and how they carry biases and toxic content over from the data on which they were pretrained, says Ellie Pavlick, a computer scientist at Brown University in Providence, Rhode Island, who collaborated with the BigScience project and also works for Google AI. “One benefit is allowing many people — especially from academia — to work on mitigation strategies,” she says. “If you have a thousand eyes on it, you’re going to come up with better ways of doing it.”


Pavlick’s team has analysed open-source systems such as BLOOM and found ways to identify and fix biases that are inherited from the training data — the prototypical example being how language models tend to associate ‘nurse’ with the female gender and ‘doctor’ with the male gender.
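This kind of probing is straightforward once a model is open. As an illustrative sketch — not the team’s own methodology — the snippet below uses the Hugging Face transformers library and a small open masked language model to compare the probabilities assigned to gendered pronouns in otherwise identical sentences:

```python
from transformers import pipeline

# Illustrative only: bert-base-uncased is a small open masked language
# model, not BLOOM, and this is not the BigScience team's own method.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for profession in ("nurse", "doctor"):
    sentence = f"The {profession} said that [MASK] would be late."
    for prediction in fill_mask(sentence, targets=["he", "she"]):
        print(f"{profession}: P({prediction['token_str']}) "
              f"= {prediction['score']:.3f}")
```

A model that puts much more probability on ‘she’ after ‘nurse’ than after ‘doctor’ is exhibiting exactly the kind of inherited association the researchers look for.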


Pretraining bottleneck


Even if the open-source boom goes on, the push to make language AI more powerful will continue to come from the largest players. Only a handful of companies are able to create language models from scratch that can truly push the state of the art. Pretraining an LLM requires massive resources — researchers estimate that OpenAI’s GPT-4 and Google’s PaLM 2 took tens of millions of dollars’ worth of computing time — and also plenty of ‘secret sauce’, researchers say.


“We have some general recipes, but there are often small details that are not documented or written down,” says Pavlick. “It’s not like someone gives you the code, you push a button and you get a model.”


“Very few organizations and people can pretrain,” says Louis Castricato, an AI researcher at open-source software company Stability AI in New York. “It’s still a huge bottleneck.”


Other researchers warn that making powerful language models broadly accessible increases the chances that they will end up in the wrong hands. Connor Leahy, chief executive of the AI company Conjecture in London, who was a co-founder of EleutherAI, thinks that AI will soon be intelligent enough to put humanity at existential risk. “I believe we shouldn’t open-source any of this,” he says.