Hi everyone-
One final bank holiday in the rear-view mirror and June is here… and hopefully some warmer weather. We’re mixing things up this month with an Open Source special… we still have our regular sections on bias and ethics and generative AI, but then focus on recent developments in the open source community, specifically around LLMs… if you’re curious about finetuning your own 30 billion parameter large language model on one GPU, don’t miss this!
Following is the June edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. NOTE: If the email doesn’t display properly click on the “Open Online” link at the top right.
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here
handy new quick links: committee; ethics; generative ai; open source explosion; reader updates; jobs
Committee Activities
We are continuing our work with the Alliance for Data Science Professionals on expanding the previously announced individual accreditation (Advanced Data Science Professional certification) into university course accreditation. Remember also that the RSS is now accepting applications for the Advanced Data Science Professional certification- more details here.
Our recent post from Martin Goodson (CEO and Chief Scientist at Evolution AI) “The Alan Turing Institute has failed to develop modern AI in the UK”, highlighting the lack of government support for the open source AI community in the UK, generated significant interest and reached the front page of hacker news.
Various members of the committee were involved with an engaging conversation with RealWorldDataScience on ChatGPT and Large Language Models: check out part 1 (How is ChatGPT changing Data Science) and part 2 (LLMs - do we need to understand the maths or simply recognise the limitations).
The section (together with the RSS South Wales Group) hosted an engaging and thought-provoking online event on May 23rd- “What is the future of data science and AI?”, chaired by Dr Nicola Brassington (Deputy Director of Digital Analytics and Transformation, Dept for Health and Social Care and NHSE), and including Professor Wendy Dearing (Dean, Institute of Management and Health, University of Wales) and Osama Rahman (Director of the Data Science Campus).
Martin also continues to run the excellent London Machine Learning meetup and is very active with events. The next event is a special one on June 7th- the first in-person event since 2020- with a great speaker, Jane Dwivedi-Yu (a researcher at Meta AI), and a great topic, “Language Models Can Teach Themselves to Use Tools” - this is Meta’s Toolformer. (RSVPs will close on Monday 5th June at noon). Videos are posted on the meetup YouTube channel - and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics...
Bias, ethics and diversity continue to be hot topics in data science...
We are starting to see examples of generative AI created political ad content, with the Republicans in the US reportedly exploring opportunities
And Sam Altman (CEO of OpenAI) has warned of the potential for systems like ChatGPT to manipulate users
“The general ability of these models to manipulate and persuade, to provide one-on-one interactive disinformation is a significant area of concern. Regulation would be quite wise: people need to know if they’re talking to an AI, or if content that they’re looking at is generated or not. The ability to really model … to predict humans, I think is going to require a combination of companies doing the right thing, regulation and public education.”
Meanwhile Palantir, the big data company with links to defence and the intelligence community, has showcased new AI-driven military capabilities - on the back of which its stock price soared
Lots of activity on the regulatory front this month..
In the US, the FTC and a number of other regulatory bodies released a joint statement on AI
“We already see how AI tools can turbocharge fraud and automate discrimination, and we won’t hesitate to use the full scope of our legal authorities to protect Americans from these threats,”
Meanwhile some see China as ahead of the US on AI regulation, having released a draft of novel rules around training data and the accuracy of generated media
However, it does appear that European regulators are taking the most aggressive approach:
Behind EU lawmakers' challenge to rein in ChatGPT and generative AI
"Under new proposals targeting "foundation models," companies like OpenAI, which is backed by Microsoft Corp (MSFT.O), would have to disclose any copyrighted material - books, photographs, videos and more - used to train their systems."
Rumours are even beginning to circulate of EU regulators targeting open source systems
"While the act includes open source exceptions for traditional machine learning models, it expressly forbids safe-harbor provisions for open source generative systems."
With all this going on, it is not surprising that Google’s new AI functionality is available pretty much anywhere apart from Europe…
Interesting commentary from Stephanie Hare (interviewed by RealWorldDataScience) on AI and ethics.
“Another generation or two, when we’re older, might look at some of what technology we’ve built or our behaviour on climate change, our track record – did we do what we could have done to slow global warming, to improve biodiversity? – and they might hold us to account, saying, ‘You could have stopped this and you didn’t, right? It’s not just what you did. It’s what you did not do.’ So we have to be super careful when we think about ethics, because ethics change, values change over time. And what seems okay today may not be okay in 10, 20, 30 years time. That is on my mind all the time. It’s not very relaxing.”
We are certainly seeing real world impact of the new generative AI models and their take up.
Stack Overflow’s traffic is apparently down 14% in March, coinciding with the huge increase in ChatGPT usage
We have workers taking on multiple jobs fuelled by ChatGPT productivity gains
Others who were already working multiple jobs have used recent advancements in AI to turbocharge their situation, like one Ohio-based technology worker who upped his number of jobs from two to four after he started to integrate ChatGPT into his work process. “I think five would probably just be overkill,” he said.
Spotify is having to eject thousands of AI-made songs in an effort to purge fake streams
News of a photographer trying to get his photos removed from the training sets… with limited success so far
"Lawyers replied that he owes $979 for making an unjustified copyright claim."
Although perhaps there are technical solutions on the way for artists: GLAZE: Protecting Artists from Style Mimicry by Text-to-Image Models
And maybe an alternative could be to embrace the change? - Grimes Unveils Software to Mimic Her Voice, Offering 50-50 Royalties for Commercial Use
A useful primer on the dark art of Prompt Injection and some good commentary on how dangerous it could be
"Increasingly though, people are granting LLM applications additional capabilities. The ReAct pattern, Auto-GPT, ChatGPT Plugins—all of these are examples of systems that take an LLM and give it the ability to trigger additional tools—make API requests, run searches, even execute generated code in an interpreter or a shell. This is where prompt injection turns from a curiosity to a genuinely dangerous vulnerability."
How dangerous is dangerous? Well, there is an often-quoted statistic doing the rounds that AI researchers believe there’s a 10% chance AI will kill us all… but is that claim really true? Perhaps not…
Either way, when one of the OGs of Deep Learning (Geoff Hinton) resigns and warns of dangers ahead… it’s definitely worth taking seriously
“The idea that this stuff could actually get smarter than people — a few people believed that,” he said. “But most people thought it was way off. And I thought it was way off. I thought it was 30 to 50 years or even longer away. Obviously, I no longer think that.”
Generative AI ... oh my!
Still such a hot topic it feels in need of its own section, for all things DALL-E, Imagen, Stable Diffusion, ChatGPT...
The pace of innovation does not seem to be letting up… let’s start with another ‘Wes Anderson’ AI-generated trailer (a sort of barometer for capability improvements…) - “The Lord of the Rings by Wes Anderson”
A month or two ago we talked about how Google had been asleep at the wheel on GenAI, prompting the ‘code red’ … well, they’ve been busy!
Google I/O 2023 was wall-to-wall AI, packed with announcements and new tools
Underpinning the majority of new capabilities is Google’s new Large Language Model, PaLM 2 - although Google has sadly not released many details of the new model (there is at least a technical report), and it is by no means clear that it is better than GPT-4, OpenAI’s current LLM.
Bard, Google’s ChatGPT competitor, gets a big upgrade through PaLM 2 - it is clearly much improved (and very impressive) but again the jury is still out on whether it beats ChatGPT
However, Google is now actively competing on many fronts…
What look to be significant improvements in a healthcare-specific LLM - with Med-PaLM 2 setting new state-of-the-art results on medical question-answering datasets (paper and more commentary here)
"Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations"
Pushing into utilising GenAI for security applications
Exploring new creative and artistic applications, including Haiku Imagined
And launching a brand new AI music generator - MusicLM
We are continuing to see the emergence of ‘vertical’ LLM applications- foundation models trained or tuned for a specific industry vertical or application
Futuri Launches RadioGPT™, The World’s First AI-Driven Localized Radio Content
Ask Skift - a chatbot trained on decades of travel-industry-specific information
ChatBots specifically for academic researchers to help scan the literature: elicit.org and consensus
And of course, we are still barely scratching the surface of potential applications and use cases… here are some pretty amazing examples of analysing data using the new ‘Code Interpreter’ plugin, which enables ChatGPT to upload and download files and run Python code…
This new research from OpenAI is very elegant and potentially groundbreaking… Language models can explain neurons in language models
"We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2."
Interesting behind the scenes post from the GitHub Copilot team on how they have developed the model over time
Finally, some amazing releases from Anthropic (remember, Google invested $400m…) this month, talking about the functionality and approach taken with their chatbot, Claude
One of the challenges of training Large Language Models is making the ‘human in the loop’ final tuning (RLHF) fair and unbiased - recent research shows that LLMs do have opinions…
"Across topics, we find substantial misalignment between the views reflected by current LMs and those of US demographic groups: on par with the Democrat-Republican divide on climate change. Notably, this misalignment persists even after explicitly steering the LMs towards particular demographic groups"
However, the team at Anthropic have come up with a very elegant approach to countering this called Constitutional AI - well worth a read, paper here
"Constitutional AI responds to these shortcomings by using AI feedback to evaluate outputs. The system uses a set of principles to make judgments about outputs, hence the term “Constitutional.” At a high level, the constitution guides the model to take on the normative behavior described in the constitution – here, helping to avoid toxic or discriminatory outputs, avoiding helping a human engage in illegal or unethical activities, and broadly creating an AI system that is helpful, honest, and harmless."
And in addition, Anthropic released a 100,000 token context window - amazingly, this allows you to include, for instance, the whole of The Great Gatsby as part of the prompt (about 75,000 words)! Not perfect, but impressive results
"The average person can read 100,000 tokens of text in ~5+ hours[1], and then they might need substantially longer to digest, remember, and analyze that information. Claude can now do this in less than a minute. For example, we loaded the entire text of The Great Gatsby into Claude-Instant (72K tokens) and modified one line to say Mr. Carraway was “a software engineer that works on machine learning tooling at Anthropic.” When we asked the model to spot what was different, it responded with the correct answer in 22 seconds."
An Open Source Explosion
Specially for June… digging into the explosion of open source activity that has been stimulated by the development of Generative AI… We all know and love numpy, scipy, pandas and scikit-learn… but what about open source GenAI?
To start with, a leaked internal document from Google (‘We have no moat’) caused quite a stir, pointing out that Google’s competition is not from OpenAI but from the rapid development of open source capabilities…
"But the uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI. While we’ve been squabbling, a third faction has been quietly eating our lunch.
I’m talking, of course, about open source. Plainly put, they are lapping us. Things we consider “major open problems” are solved and in people’s hands today. Just to name a few:
- LLMs on a Phone: People are running foundation models on a Pixel 6 at 5 tokens / sec.
- Scalable Personal AI: You can finetune a personalized AI on your laptop in an evening.
- Responsible Release: This one isn’t “solved” so much as “obviated”. There are entire websites full of art models with no restrictions whatsoever, and text is not far behind.
- Multimodality: The current multimodal ScienceQA SOTA was trained in an hour."
The great Andrej Karpathy summed up the situation best (also a great “how to” guide here)
So how do we stay on top of all this development…
Well, there are lots of articles like this one from LightningAI, comparing various open source implementations directly to the likes of GPT-3/4
Various people are now attempting to keep track of the individual models, their size and scope and their performance - here’s a good one from Eugene Yan.
We also now have an open-sourced, Elo-style rating of ‘LLMs in the wild’ - ChatBot Arena (week 2 update here), produced by LMSYS.org
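Elo here is the same rating scheme used in chess: after each head-to-head ‘battle’ between two models, ratings move in proportion to how surprising the outcome was. The standard update, which Arena adapts for pairwise model comparisons, is only a few lines:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update for one head-to-head comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: an upset win by a 1000-rated model over a 1200-rated one --
# the winner gains ~24 points because the result was unexpected.
print(elo_update(1000, 1200, score_a=1.0))
```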
While it may seem like this open source explosion of activity has suddenly materialised, its roots lie in the BigScience effort initiated in 2021, and driven by leading open source players like HuggingFace, Idris.fr, NaverLabs, LAION etc.
"The BigScience project takes inspiration from scientific creation schemes such as CERN and the LHC, in which open scientific collaborations facilitate the creation of large-scale artefacts that are useful for the entire research community."
The BigScience initiative created arguably the first serious open source LLM - BLOOM
Since then, many other open source AI outfits have sprung up (such as Eleuther.ai, stability.ai, lightning.ai), as well as amazing researchers like conceptOfMind - who has re-created Google’s PaLM model - all contributing to the community at a rapid pace.
HuggingFace has developed into a leading AI community of researchers and you can now access a formidable array of ready-to-use open source capabilities through https://huggingface.co/
To start with, they publish their own ‘Open LLM’ leaderboard (and hot off the press, FalconLM is now top of the leaderboard - open source and commercially usable, developed by the Technology Innovation Institute (TII) in Abu Dhabi, United Arab Emirates)
"With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art. The 🤗 Open LLM Leaderboard aims to track, rank and evaluate LLMs and chatbots as they are released. We evaluate models on 4 key benchmarks from the Eleuther AI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks. A key advantage of this leaderboard is that anyone from the community can submit a model for automated evaluation on the 🤗 GPU cluster, as long as it is a 🤗 Transformers model with weights on the Hub"
And all this is based on their original Transformers library, which allows anyone to easily download and train state-of-the-art pretrained models
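To give a flavour of just how little code that takes, here is a minimal sketch (the first call downloads a default pretrained model from the Hub):

```python
from transformers import pipeline

# Downloads a default pretrained sentiment model from the Hub on first use
classifier = pipeline("sentiment-analysis")

print(classifier("Open source LLMs are improving at a remarkable pace!"))
# [{'label': 'POSITIVE', 'score': 0.99...}]
```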
You can access state-of-the-art text-to-video capabilities
They have just released StarCoder, an open source AI coding assistant (together with a simple implementation guide)
Also GPT-JT, a fully fledged open source ChatGPT alternative
And most recently, Transformers Agents, which allow you to plug LLMs into other tools and services
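Usage, per the launch announcement, looked roughly like this at the time (the API was explicitly experimental, so expect it to have changed):

```python
from transformers import HfAgent

# An agent backed by an open model (StarCoder) via Hugging Face's inference API
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

# The agent generates and executes code that calls curated Hub "tools"
# (image generation, captioning, text-to-speech, ...)
agent.run("Draw me a picture of rivers and lakes.")
```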
The underlying challenge for cost-constrained open source researchers is how to improve performance with less money, less data and less compute (OpenAI apparently lost $540m developing ChatGPT)
And this has driven open research into improving the efficiency of the training process-
“Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes” (tweet)
"We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation"
“FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance”
"As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost"
And these innovations are leading to ever improving and increasingly efficient open source LLMs:
lit-parrot - “Hackable implementation of state-of-the-art open-source large language models”
LIMA: Less Is More for Alignment
"LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling"
A new (and simpler) approach to RLHF - SLiC-HF
Sophia - “a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M”
A new efficient approach to fine tuning using stacked LLMs
Gorilla - “Large Language Model Connected with Massive APIs”
Finally, QLoRA … absolutely amazing!
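This is the trick behind the ‘30 billion parameter model on one GPU’ teaser at the top: quantize the frozen base model to 4-bit and train only small low-rank adapter matrices on top of it. A minimal sketch using Hugging Face’s transformers, peft and bitsandbytes (the checkpoint name is just an example - see the QLoRA repo for tested configurations):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 -- this is the QLoRA quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",          # example 30B checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapters; the 4-bit base stays frozen
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters
```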
It’s all very well having an open source model available, but how do you deploy and fine-tune it for your particular use case? Fear not, lots of open source activity here as well (if you are delving into this, some “numbers every LLM developer should know”):
Finetuning Redpajama - great step by step tutorial
Introducing Lamini, the LLM Engine for Rapidly Customizing Models
"Lamini is an LLM engine that allows any developer, not just machine learning experts, to train high-performing LLMs, as good as ChatGPT, on large datasets with just a few lines of code"
MLC LLM - “a universal solution that allows any language models to be deployed natively on a diverse set of hardware backends and native applications, plus a productive framework for everyone to further optimize model performance for their own use cases”
"Ask questions to your documents without an internet connection, using the power of LLMs. 100% private, no data leaves your execution environment at any point"
And then when you want to build on top of the models, open source provides again, with fantastic libraries such as LangChain - now providing their new Plan-and-Execute agents for AutoGPT-style applications; and more keeps arriving in this space, such as superagent.
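As a taste of the new Plan-and-Execute agents, the announcement wired one up roughly like this (LangChain’s API moves quickly, so treat this as a sketch of the interface at the time, not current documentation):

```python
from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools
from langchain.experimental.plan_and_execute import (
    PlanAndExecute, load_agent_executor, load_chat_planner,
)

llm = ChatOpenAI(temperature=0)
tools = load_tools(["llm-math"], llm=llm)  # give the agent a calculator

planner = load_chat_planner(llm)            # LLM that writes the plan
executor = load_agent_executor(llm, tools)  # agent that executes each step

agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)
agent.run("What is 2.5 raised to the power of 10, then divided by 7?")
```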
Finally, plenty of open source “AI Inside” turbo-charged developer applications to experiment with:
pandas-ai - “make Pandas conversational, allowing you to ask questions about your data and get answers back, in the form of pandas DataFrames”
Kanaries RATH - “autopilot and copilot for exploratory data analysis”
Scikit-LLM - “Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks” (see the sketch after this list)
Even simple command line tools
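As promised above, here is a short sketch of Scikit-LLM’s zero-shot classifier, following the project’s README at the time (the API may well have evolved since):

```python
# Zero-shot text classification with Scikit-LLM, per the early README.
from skllm.config import SKLLMConfig
from skllm import ZeroShotGPTClassifier

SKLLMConfig.set_openai_key("sk-...")  # your OpenAI API key

X = ["The model finetuned beautifully overnight.",
     "Training crashed with an out-of-memory error again."]
candidate_labels = ["positive", "negative"]

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(None, candidate_labels)  # zero-shot: no labelled training data needed
print(clf.predict(X))            # e.g. ['positive', 'negative']
```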
Of course the open source community needs support, which is why Martin’s post was so important, and why it’s great to see similar calls beginning to be made for more UK Government assistance
“This is a hugely important technology, arguably the most transformative in the next few decades, and the UK risks being left behind,” said Haydn Belfield, associate fellow at the University of Cambridge’s Leverhulme Centre for the Future of Intelligence.
Bigger picture ideas
Longer, thought-provoking reads - lean back and pour a drink! ... Some excellent commentary this month!
"Like an improv actor dropped into a scene, a language model-driven chatbot is simply trying to produce plausible-sounding outputs. Whatever has happened in the interaction up to that point is the script of the scene so far: perhaps just the human user saying “Hi,” perhaps a long series of back-and-forths, or perhaps a request to plan a science experiment. Whatever the opening, the chatbot’s job—like that of any good improv actor—is to find some fitting way to continue the scene.”
“Thus I'm seeing more and more teams use a process for development and deployment that looks like this:
- Use prompting to develop a model. This can take minutes to hours.
- Deploy the model to production and run it on live data quickly but safely, perhaps by running in “shadow mode,” where the model’s inferences are stored and monitored but not yet used. (More on this below.)
- If the model’s performance is acceptable, let it start making real decisions.
- Only after the model is in production, and only if we need to benchmark more carefully (say, to eke out a few percentage points of performance improvement), collect test data to create a more careful benchmark for further experimentation and development. But if the system is doing well enough, don’t bother with this.”
"However, LLMs are not a direct solution to most of the NLP use-cases companies have been working on. They are extremely useful, but if you want to deliver reliable software you can improve over time, you can’t just write a prompt and call it a day. Once you’re past prototyping and want to deliver the best system you can, supervised learning will often give you better efficiency, accuracy and reliability than in-context learning for non-generative tasks — tasks where there is a specific right answer that you want the model to find. Applying rules and logic around your models to do data transformations or handle cases that can be fully enumerated is also extremely important."
“Usually, there are clear answers in mathematics—especially if the tasks are not too complicated. But when it comes to the Sleeping Beauty problem, which became popular in 2000, there is still no universal consensus. Experts in philosophy and mathematics split into two camps and ceaselessly cite—often quite convincingly—arguments for their respective side. More than 100 technical publications exist on this puzzle, and almost every person who hears about the Sleeping Beauty thought experiment develops their own strong opinion."
Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:
Migration - between search and reality: a visual exploration of the gap between the reality of the world’s migrants and search interest
Updates from Members and Contributors
Sam Young, Practice Manager (Data Science & AI) at Catapult, highlights an interesting new report on applying reinforcement learning in the energy sector
Jobs!
The job market is a bit quiet - let us know if you have any openings you'd like to advertise
“At Muzz, we are hiring Product Data Scientists to join my Data Science team at the largest Muslim dating app and help us connect 2 billion Muslims around the world. We are looking for ambitious and product-focused data scientists/analysts/engineers to help us understand our members and all the metrics that matter. Feel free to contact ryan.jessop@muzz.com for more information and set up an intro call.”
EvolutionAI are looking to hire someone for applied deep learning research. Must like a challenge. Any background, but needs to know how to do research properly. Remote. Apply here
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here
- Piers
The views expressed are our own and do not necessarily represent those of the RSS