August Newsletter
Hi everyone-
That was quick, August already, but at least we have had the occasional day when it properly feels like summer - and now we have the Olympics to watch, which is always entertaining! ... How about a few curated data science materials to read while watching the marathon?
Following is the August edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity ... We are continuing with our move of Covid Corner to the end to change the focus a little.
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.
Industrial Strength Data Science August 2021 Newsletter
RSS Data Science Section
Committee Activities
We are all conscious that times are incredibly hard for many people and are keen to help however we can - if there is anything we can do to help those who have been laid off (networking and introductions, advice on development, etc.), don't hesitate to drop us a line.
We are still working on releasing the video and a summary of the latest in our 'Fireside chat' series - an engaging and enlightening conversation with Anthony Goldbloom, founder and CEO of Kaggle. Sorry for the delay - we will post a link when it is available.
Thank you all for taking the time to fill in our survey responding to the UK Government's proposed AI Strategy (If you haven't already, you can still contribute here). We are passionate about making sure the government focuses on the right things in this area, and are now analysing the results which we will publish shortly.
The full programme for this year's RSS Conference, which takes place in Manchester from 6-9 September, has been confirmed. The programme includes keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers. Registration is open.
Speaking of the RSS Conference, we are running a session there, and we need your help! We would like to hear stories about your worst mistakes in data science. From these, we will select common themes and topics, and create a crowd-sourced compilation of the deadliest sins of data science. These will be presented - anonymously - to our panel, for a live, interactive discussion in front of an audience, at our session on Tuesday 7 September, 11:40 - 13:00. We hope this will both entertain and inform. Maybe your pain can help save someone else’s (data science) soul... CONFESS YOUR SINS HERE – the survey is anonymous, we won’t embarrass anyone!
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. The most recent event was on July 14th when Xavier Bresson, Associate Professor in the Department of Computer Science at the National University of Singapore, discussed "The Transformer Network for the Traveling Salesman Problem". Videos are posted on the meetup's YouTube channel - and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics...
Bias, ethics and diversity continue to be hot topics in data science...
As we have discussed previously, voice cloning, or deep fake audio, is now pretty easily accessible.
The BBC reported on growing interest from both actors and cybercriminals, highlighting how the technology can be used for both good and bad.
A recent film documenting the life of celebrity chef Anthony Bourdain, who died in 2018, has generated a fair amount of controversy. A deep fake of Bourdain's voice was used in the film without disclosure to the audience, and this has prompted commentary about the ethics of such a practice as well as broader discussion about AI in journalism.
In a similar vein, the SF Chronicle reports on the story of Joshua Barbeau who managed to, in some sense, recreate the personality of his long-dead fiancée by training a GPT-3-based chatbot on her old text messages
Facial recognition continues to be in the news with another example of more authoritarian regimes taking advantage of the broad capabilities now available. This time it's Russia and reports of a system that conducts racial profiling from video streams.
There are increasing instances of the use of AI in human resources departments, with problematic outcomes.
Bloomberg reports on Amazon delivery drivers being fired by algorithm
A recent research paper from Cornell University talks through different frameworks for assessing algorithmic hiring systems while MIT Technology Review has conducted an in-depth analysis of AI driven interview assessment software, highlighting a number of shortfalls
"One gave our candidate a high score for English proficiency when she spoke only in German."
We talk about bias a fair amount, and it's always good to define terms - this summary from the ACM (Association for Computing Machinery) gives a good overview. They split biases in AI systems into four sensible high-level areas (as well as splitting out more specific types in each area):
Data-creation bias
Biases related to problem formulation
Biases related to the algorithm/data analysis
Biases related to evaluation/validation
It's easy to overlook the first area highlighted above - data-creation bias. Often we train supervised learning models based on hand-labeled examples which we assume to be 'correct' but may not be. This article from O'Reilly talks through this issue and discusses different approaches (such as semi-supervised learning and weak supervision), while this article (from Sandeep Uttamchandani) gives some practical tips on data set selection for ML model building.
There is no such thing as gold labels: even the most well-known hand labeled datasets have label error rates of at least 5% (ImageNet has a label error rate of 5.8%!).
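To make the weak supervision idea a little more concrete, here is a minimal sketch (the task, labelling functions and examples are entirely made up for illustration and are not taken from the article): several cheap, noisy heuristics each vote on an example, and a simple combination of their votes stands in for a hand label.

```python
# A minimal sketch of the weak supervision idea: rather than trusting a single
# set of hand labels, combine several cheap, noisy "labelling functions" by
# (here) simple majority vote, and abstain when none of them fire.
# The task, heuristics and examples below are hypothetical.
from collections import Counter

ABSTAIN, SPAM, NOT_SPAM = None, 1, 0

def lf_mentions_free_offer(text):      # crude keyword heuristic
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_many_exclamations(text):        # punctuation heuristic
    return SPAM if text.count("!") >= 3 else ABSTAIN

def lf_short_and_plain(text):          # short, plain messages are usually fine
    return NOT_SPAM if len(text) < 40 and "!" not in text else ABSTAIN

LABELLING_FUNCTIONS = [lf_mentions_free_offer, lf_many_exclamations, lf_short_and_plain]

def weak_label(text):
    """Majority vote over the labelling functions that did not abstain."""
    votes = []
    for lf in LABELLING_FUNCTIONS:
        vote = lf(text)
        if vote is not ABSTAIN:
            votes.append(vote)
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(weak_label("FREE OFFER!!! Click now!!!"))   # -> 1 (labelled spam)
print(weak_label("See you at 7pm"))               # -> 0 (labelled not spam)
```

Real weak supervision frameworks go further and model the accuracies and correlations of the labelling functions rather than taking a flat vote, but the principle is the same: treat labels as a modelling problem rather than ground truth.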
More positively, Apple has released information about their approach to face detection in photos, highlighting positive aspects such as on-device scoring and fairness.
And this analysis charting the 'data-for-good' landscape shows it's not all doom and gloom...
Developments in Data Science...
As always, lots of new developments...
When the 'founding fathers' of Deep Learning (Bengio, Hinton and LeCun) get together it's always worth reading... here they discuss the future of Deep Learning and key research directions. They highlight key issues with existing approaches (the large volumes of data needed for supervised learning, or the large numbers of iterations needed for reinforcement learning) but are not convinced by hybrid approaches that incorporate symbolic reasoning, believing instead that research into more efficient learning from fewer examples will bear fruit.
“Humans and animals seem to be able to learn massive amounts of background knowledge about the world, largely by observation, in a task-independent manner. This knowledge underpins common sense and allows humans to learn complex tasks, such as driving, with just a few hours of practice.”
And there is certainly lots of research going on in this area:
'Deep Learning on a Data Diet' (on arXiv) proposes an approach that identifies the most influential training examples early in the training of ML models
While this paper, also on arXiv, brings together image and language approaches to move towards 'zero-shot' detection, identifying novel objects without bounding box or mask annotations
This looks like another innovative approach to identifying objects in unseen scenarios, by using a combination of real and synthetic data
Time series modelling has been explored for hundreds of years but, despite the proliferation of data and techniques, is still far from 'solved'. This looks like an interesting new approach out of MIT (SeqCLR: self-supervised learning of features for time-series data) utilising contrastive learning
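For a flavour of how contrastive learning works in general - this is not the SeqCLR method itself, just a toy illustration of the underlying InfoNCE-style loss with a made-up augmentation and encoder - the core recipe is: embed two augmented views of the same series close together, and push views of different series apart.

```python
# A toy NumPy sketch of contrastive self-supervised learning for time series
# (illustrative only - not the method from the linked paper). The augmentation
# and encoder are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    """Toy augmentation: jitter the series with a little noise."""
    return x + 0.05 * rng.standard_normal(x.shape)

def encode(x, w):
    """Toy encoder: a single linear projection followed by L2 normalisation."""
    z = x @ w
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE: each series' other view is its positive; all other series are negatives."""
    logits = (z1 @ z2.T) / temperature                        # pairwise cosine similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_softmax).mean()                       # positives sit on the diagonal

batch = rng.standard_normal((8, 50))            # 8 toy series of length 50
w = rng.standard_normal((50, 16))               # hypothetical encoder weights
loss = info_nce_loss(encode(augment(batch), w), encode(augment(batch), w))
print(f"contrastive loss: {loss:.3f}")
```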
Some fun research into the language of colours and how and why it varies around the globe
Interestingly, the ways that languages categorize color vary widely. Nonindustrialized cultures typically have far fewer words for colors than industrialized cultures. So while English has 11 words that everyone knows, the Papua New Guinean language Berinmo has only five, and the Bolivian Amazonian language Tsimane' has only three words that everyone knows, corresponding to black, white and red
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!
DeepMind previously released details of their research into modelling protein structures. Great to see practical applications of the underlying model now starting to emerge - in this case in the hunt for a molecule to break down plastic (with more advances in the underlying model happening as well, based on further progress from DeepMind).
Fantastic to see advanced AI research utilised in life-affirming use cases - in this case decoding speech from the brain-waves of a paralysed person.
Utilising AI out "in the field" in the hands of robots adds additional complexity to already hard problems:
Here robots are able to identify and categorise small invertebrates (which make up more than 90% of species-level diversity)
Although OpenAI have called time on their robotics research team
"we’ve found that other approaches, such as reinforcement learning with human feedback, lead to faster progress in our reinforcement learning research"
Can computers write code? This is a topical question this month...
First of all, research assesses what is possible, with a GPT-based model called Codex trained on GitHub code...
And then GitHub brings Codex to life by releasing 'Copilot: your AI pair programmer' ...
Lots of commentary ensues! Some background and general assessment here, with more in-depth analysis from fast.ai here, and from Vlad Iliescu here
"GitHub Copilot has been described as ‘magical’, ‘god send’, ‘seriously incredible work’, et cetera. I agree, it’s a pretty impressive tool, something I see myself using daily ... In my experience, Copilot excels at writing repetitive, tedious, boilerplate-y code. With minimal context, it can whip up a function that slices and dices a dataset, trains and evaluates several ml models, and, if you ask it nicely, also makes a nice batch of french fries"
Ok, so maybe not quite so practical, but still great fun - AI-driven art out of Berkeley ('Alien Dreams')
"this CLIP method is more like a beautifully hacked together trick for using language to steer existing unconditional image generating models"
A useful rundown from DoorDash on how they use ML models to balance supply and demand, including some interesting discussion of optimisation approaches, which are often how an ML model's predictions get turned into something that is actually used in decision making.
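To illustrate that general pattern (forecasts in, decisions out) rather than DoorDash's actual system, here is a toy sketch: hypothetical forecasts of the courier shortfall per region are fed into a small linear programme that decides how to allocate an incentive budget. All the numbers and names are made up for illustration.

```python
# A minimal sketch of using an optimisation step to turn ML forecasts into a
# decision (not DoorDash's actual system). Hypothetical shortfall forecasts per
# region are turned into an incentive-budget allocation via a linear programme.
import numpy as np
from scipy.optimize import linprog

# Hypothetical model outputs: expected courier shortfall per region at peak time
shortfall = np.array([40.0, 10.0, 25.0])            # couriers needed in regions A, B, C
drivers_per_pound = np.array([0.05, 0.08, 0.03])    # assumed response to incentive spend
budget = 600.0                                       # total incentive budget (GBP)

# Decision variables: incentive spend per region.
# Maximise recruited couriers = sum(drivers_per_pound * spend); linprog minimises, so negate.
c = -drivers_per_pound
A_ub = np.ones((1, 3))                               # total spend <= budget
b_ub = np.array([budget])
bounds = [(0, s / r) for s, r in zip(shortfall, drivers_per_pound)]  # don't over-recruit a region

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("spend per region:", np.round(result.x, 1))
print("expected extra couriers:", round(-result.fun, 1))
```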
How does that work?
A new section on understanding different approaches and techniques
For those looking for a bit more understanding on Deep Learning and different architectures...
Some elegant insight into why Deep Learning works when it feels like it shouldn't ("how can a billion parameters ever converge?!")
The Weights and Biases team dig into the 'MLP-Mixer' architecture and compare it to traditional Convolutional Neural Networks.
Diffusion models are a new class of generative model, flexible enough to learn arbitrarily complex data distributions while remaining tractable to evaluate analytically
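For a quick flavour of the mechanics (a toy illustration of the standard DDPM-style forward process, not the full method from any particular paper linked above): data is gradually corrupted with Gaussian noise according to a fixed variance schedule, and the generative model is trained to reverse that corruption step by step.

```python
# A toy illustration of the fixed forward (noising) process in diffusion models.
# The variance schedule and 'data' below are made up; real models then learn to
# reverse this process, denoising from pure noise back to data.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # common linear variance schedule
alphas_bar = np.cumprod(1.0 - betas)        # lets us jump to any noise level directly

def noise_to_step(x0, t):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.standard_normal(8)                 # a made-up 'data' sample
print("t=0   :", np.round(x0, 2))
print("t=100 :", np.round(noise_to_step(x0, 100), 2))   # mostly signal
print("t=999 :", np.round(noise_to_step(x0, 999), 2))   # essentially pure noise
```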
A good tutorial on JAX, and how to use it to speed up neural network training, from Will Whitney.
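As a taster of the kind of speed-up the tutorial covers (a minimal sketch, not taken from the tutorial itself - the toy linear-regression loss is made up): jit compiles a pure function with XLA, and grad produces the gradients.

```python
# A minimal JAX sketch: jit-compile a training step and get gradients via grad.
# The toy linear-regression task here is purely illustrative.
import jax
import jax.numpy as jnp

def loss(params, x, y):
    w, b = params
    pred = x @ w + b
    return jnp.mean((pred - y) ** 2)

@jax.jit                                    # compile the whole update step with XLA
def update(params, x, y, lr=0.1):
    grads = jax.grad(loss)(params, x, y)    # gradients w.r.t. the params tuple
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

key, k1, k2 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(k1, (256, 3))
true_w = jnp.array([1.0, -2.0, 0.5])
y = x @ true_w + 0.1 * jax.random.normal(k2, (256,))

params = (jnp.zeros(3), jnp.array(0.0))
for _ in range(200):
    params = update(params, x, y)
print("learned weights:", params[0])        # should be close to [1.0, -2.0, 0.5]
```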
Some useful commentary on the process around building compelling visualisations from Loris Mattioni
Getting it live
How to drive ML into production
Andrew Ng brings to life the challenges of building an AI product...
An interesting 'retro' from Shay Palachy after a year of running more formal data science review processes with a data science team
"Unsurprisingly, things did not go exactly as planned. Thus, this post is about what worked and what didn’t. I have focused on the most challenging aspects of trying to get data scientists to get review from their peers. I hope this helps others who wish to formalize peer review processes in data science"
Useful insight into building a data/data science team in a growing start-up from Erik Bernhardsson
Good overview of the building blocks in a modern data stack from the team at Monte Carlo
Not sure what I think of this, but worth a read - Kedro: a new framework for writing reproducible, maintainable and modular data science code
Correlation or Causation?
A deep dive into causal analysis in machine learning
You have a machine learning model and it seems to perform great, not only on the training set but even on hold-out test sets - sorted, right? It's worth considering how you are going to use the model: if you are making predictions and using the output as is, then maybe you are ok; but if you are going to use the model for scenario planning and counterfactual assessment ('what-ifs?'), it is worth thinking about causal analysis. Here's a good starting point, from Jane Huang.
The technique often relies on something called 'Double Machine Learning'
As with any great technology, Double Machine Learning for causal inference has the potential to become pretty ubiquitous. But let's calm the enthusiasm of this writer down and go back to our task
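For those who like to see the mechanics, here is a stripped-down sketch of the partialling-out idea behind Double Machine Learning, on synthetic data made up purely for illustration (not taken from the linked article): predict both treatment and outcome from the confounders with cross-fitted ML models, then regress the outcome residuals on the treatment residuals.

```python
# A simplified sketch of Double Machine Learning (partialling-out with
# cross-fitted nuisance models). The synthetic data and true effect of 2.0
# are hypothetical, chosen just to show the mechanics.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 5))                                     # confounders
T = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)           # treatment depends on X
Y = 2.0 * T + np.sin(X[:, 2]) + X[:, 3] + rng.normal(size=n)    # true effect of T is 2.0

# Step 1: cross-fitted predictions of treatment and outcome from the confounders
t_hat = cross_val_predict(GradientBoostingRegressor(), X, T, cv=5)
y_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)

# Step 2: regress the outcome residuals on the treatment residuals
t_res, y_res = T - t_hat, Y - y_hat
effect = (t_res @ y_res) / (t_res @ t_res)
print(f"estimated treatment effect: {effect:.2f}  (true value in this simulation: 2.00)")
```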
Finally, an intriguing approach for time series and econometrics... causal forests
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:
Analyse your Apple Music streaming behaviour...?
Use ML to detect what can be built from your pile of Lego bricks!
Finally ... a very worthwhile cause and potential riches!
How to get involved in the IRCAI AI Award 2021?
The International Research Centre in Artificial Intelligence under the auspices of UNESCO is launching an AI Award for individuals who have dedicated their work to solving problems related to the United Nations Sustainable Development Goals (SDGs) by means of the application of Artificial Intelligence.
Covid Corner
Not sure what to say here... vaccinations keep progressing in the UK, which is good news, but the Delta variant has driven case levels back up to some of the highest we have seen over the whole of the pandemic...
The latest ONS Coronavirus Infection Survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 65 people, up from 1 in 75 the week before and an almost unbelievable increase from June, when the estimate was 1 in 1100.
More or Less gives an excellent review of the Delta variant and how it has come to dominate other strains of coronavirus the world over
One of the core findings about Delta, as discussed by More or Less, is its apparent ability to transmit through vaccinated individuals (or those with antibodies from prior infections) - in other words vaccinations, while still protecting against the worst outcomes, are not as effective at reducing transmission.
This definitely raises the stakes of the recent UK government re-opening and relaxation of restrictions on July 19th (symbolically welcomed by the prime minister from self-isolation...), which has been roundly condemned by much of the scientific community
In addition, in a recent article in the Guardian, SAGE committee member Professor Robert West states that the government's express intention is to allow infections to rip through the younger population - a very worrying statement.
“What we are seeing is a decision by the government to get as many people infected as possible, as quickly as possible, while using rhetoric about caution as a way of putting the blame on the public for the consequences”
Updates from Members and Contributors
Marco Gorelli announces the first official release (1.0.0) of his highly acclaimed nbQA repo, full of very useful code formatting features and pre-commit hooks for Jupyter notebooks
Alex Spanos will be presenting TrueLayer's data science work at the RSS conference in Manchester ("An end-to-end Data Science workflow for building scalable and performant data enrichment APIs in Open Banking") - another great reason to attend in September!
Mark Baillie highlights an upcoming special issue of the Biometrical Journal
"Data scientists are frequently faced with an array of methods to choose from; often this makes selection difficult especially beyond one’s own particular interests and expertise. Neutral comparison studies are an essential cornerstone towards the improvement of this situation, providing evidence to help guide practitioners. For the special issue of Biometrical Journal we are interested in submissions that define, develop, discuss or illustrate concepts related to practical issues and improvement of neutral method comparison studies, as well as articles reporting well-designed neutral comparison studies of methods"
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.
In memoriam
With great sadness I announce the untimely death of Rebecca Nettleship, a valued colleague and talented data scientist, on 22nd July 2021. She will be sorely missed. Our deepest condolences go out to her family and friends.
- Piers
The views expressed are our own and do not necessarily represent those of the RSS