September Newsletter
Hi everyone-
I don't know about you, but that didn't feel particularly August-like.... I miss the sun! Perhaps September will save the summer, together with some inspiration from the Paralympics ... How about a few curated data science materials for perusing during the lull in the wheelchair rugby final?
Following is the September edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity ... As before, we have moved Covid Corner to the end to shift the focus a little.
As always, any and all feedback is most welcome! If you like these, do please send them on to your friends - we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.
Industrial Strength Data Science September 2021 Newsletter
RSS Data Science Section
Committee Activities
We are all conscious that times are incredibly hard for many people and are keen to help however we can - if there is anything we can do to help those who have been laid off (help with networking and introductions, advice on development, etc.), don't hesitate to drop us a line.
Thank you all for taking the time to fill in our survey responding to the UK Government's proposed AI Strategy. We are working on a series of posts digging into the results, which we hope will be thought-provoking.
This year's RSS Conference is almost here (Manchester from 6-9 September, register here), with some great keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers. There is online access to over 40 hours of content at the conference covering a wide variety of topics. The full list of the online content can be found here. We really hope to see you all there, particularly at "Confessions of a Data Scientist" (11:40-13:00 Tuesday, 7 September), chaired by Data Science Section committee member Louisa Nolan.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with events. The next talk is on September 7th, when Thomas Kipf, Research Scientist in the Brain Team at Google Research in Amsterdam, will discuss "Relational Structure Discovery". Videos are posted on the meetup YouTube channel - and future events will be posted here.
Many congratulations to Martin and the team at evolution.ai for winning the Leading Innovators in Data Extraction Award at the FinTech Awards 2021!
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics...
Bias, ethics and diversity continue to be hot topics in data science...
As we continue to highlight, AI based capabilities across many use-cases are increasingly powerful, and increasingly accessible to everyone to use for good or bad:
Wired digs into Ernst & Young's use of DeepFakes to help in business pitches, "presented openly as synthetic, not as real videos intended to fool viewers" ... but for how long?
Ars Technica highlights new 'cheat' capabilities available that provide computer-vision auto-aim in "any game"
We even have an AI system being recognised as an inventor in a patent application in Australia, challenging a fundamental assumption in the law that only human beings can be inventors
As we know, it is very hard to build robust, unbiased AI applications and to think through all potential consequences of their use. So it is not surprising that we see more and more examples of unintended consequences and misuse:
Wired explores a drug-addiction risk algorithm used in the US to help doctors decide who should be prescribed opioids, and the pain caused to patients in genuine need who were turned down on the basis of its recommendations.
This in-depth article highlights how hard it is to actually opt out of some of these systems: in this case, removing your image from the controversial Clearview facial recognition system
It's also clear that models sometimes learn representations from data that we do not want from a societal perspective - in this case, research shows how racial identity can be inferred from medical images. As Andrew Ng commented:
"The fact that diagnostic models recognize race in medical scans is startling. The mystery of how they do it only adds fuel to worries that AI could magnify existing racial disparities in health care"
The Stanford Institute for Human-Centered Artificial Intelligence released a comprehensive review of the opportunities and risks of what it calls "Foundation Models" - these are models (such as BERT, DALL-E, and GPT-3) that are trained on "broad data at scale and are adaptable to a wide range of downstream tasks"
The research paper is a weighty tome (available here) but definitely worth a look
A good review can be found here
"They create a single point of failure, so any defects, any biases which these models have, any security vulnerabilities . . . are just blindly inherited by all the downstream tasks"
Of course the models and algorithms could be perfect, but still cause harm if they are not solving the right problem, or the outputs are not used in the right way
Motherboard reports that police are apparently attempting to have evidence generated by a gunshot-detecting AI system altered
And a short but well reasoned piece in defence of algorithms:
"These algorithms aren’t “mutant” in any meaningful sense – their outcomes are the inevitable consequence of decisions made during their design"
Harvard Business Review highlights the importance of discussing and understanding the implications of AI Ethics, throughout organisations
Finally, some progress on transparency - Google has released a feature in search which allows you to better understand how the specific results are generated
And new regulatory efforts, intriguingly from China, on recommendation algorithms
Developments in Data Science...
As always, lots of new developments...
All sorts of activity in the reinforcement learning/robotics space this month:
Facebook released droidlet, "a one-stop shop for modularly building intelligent agents". This combines cutting edge ML models in NLP and Computer Vision and allows for rapid prototyping in either simulated or real world environments.
DeepMind released a groundbreaking paper showing how generally capable agents can emerge from open-ended play - the key here is that no human interaction data is needed at all! There is a great article here detailing how important this is, and how the agents were trained.
“As far as I know, this is an entirely unprecedented level of generality for a reinforcement-learning agent"
As always, lots of research is going on in the deep learning architecture space:
Researchers from Google, Facebook and Berkeley have shown how a pre-trained transformer can perform vision, mathematical and logical tasks without fine-tuning its core layers
A team from the Universities of Cambridge, Siena, Florence and Côte d'Azur have developed a new interpretable deep learning architecture called Logic Explained Networks (LENs), which could be very useful for explainability
Google and DeepMind researchers have shown how Deep Learning architectures can be used for computationally expensive mixed-integer optimisation problems
Similarly, investigation into methods that learn from smaller datasets continues:
Researchers at Facebook, PSL Research and NYU have developed an elegant self-supervised pre-training method called VICReg, whose loss combines three terms: variance (keeping embedding dimensions spread out, so different inputs don't collapse to identical representations), invariance (pulling together representations of inputs that should be similar) and covariance (decorrelating dimensions to remove redundancy). This shows great promise for making more efficient use of pre-training and data augmentation - a rough sketch of the loss follows this list.
This paper also gives a good survey of data augmentation methods for Deep Learning
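For the curious, here is what such a loss can look like in PyTorch. This is illustrative only and not the authors' implementation; the term weights follow the paper's reported defaults, and z_a / z_b are assumed to be the embeddings of two augmented views of the same batch.

```python
# A rough sketch of a VICReg-style loss (illustrative only, not the authors' code).
# z_a and z_b are assumed to be embeddings of two augmented views of the same batch.
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, inv_weight=25.0, var_weight=25.0, cov_weight=1.0, eps=1e-4):
    n, d = z_a.shape

    # Invariance: pull the two views' representations together
    invariance = F.mse_loss(z_a, z_b)

    # Variance: keep the std of each embedding dimension above 1, so different
    # inputs do not collapse to identical representations
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    variance = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))

    # Covariance: penalise off-diagonal covariance to decorrelate (de-redundify) dimensions
    z_a_c, z_b_c = z_a - z_a.mean(dim=0), z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    covariance = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d

    return inv_weight * invariance + var_weight * variance + cov_weight * covariance
```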
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!
One of the criticisms of Reinforcement Learning is that it has limited applicability outside of simulations and toy examples - Maze looks like a useful framework to address some of the underlying issues
Following on from their insightful article last month on using ML to balance supply and demand, DoorDash talk about solving the dispatch problem (how to get each order from the store to the customer as efficiently as possible)
Topical application of computer vision - recognising face mask usage
Increasing use of AI in agriculture and the natural world
John Deere (tractors) has acquired Bear Flag Robotics for autonomous machinery on the farm
Demetria is launching an AI based approach to optimising coffee yields
Progress in the early detection of dementia using a single brain scan
"If we intervene early, the treatments can kick in early and slow down the progression of the disease and at the same time avoid more damage"
Finally, more great progress in leveraging satellite imagery - this time mapping buildings in Africa.
"Another method that we found to be effective was the use of unsupervised self-training. We prepared a set of 100 million satellite images from across Africa, and filtered these to a subset of 8.7 million images that mostly contained buildings. This dataset was used for self-training using the Noisy Student method, in which the output of the best building detection model from the previous stage is used as a ‘teacher’ to then train a ‘student’ model that makes similar predictions from augmented images."
How does that work?
A new section on understanding different approaches and techniques
Following on from our causal inference section last month, a new resource billed as "A light-hearted yet rigorous approach to learning impact estimation and sensitivity analysis. Everything in Python and with as many memes as I could find"
"ML is notoriously bad at this inverse causality type of problems. They require us to answer “what if” questions, what Economists call counterfactuals. What would happen if instead of this price I’m currently asking for my merchandise, I use another price?"
An excellent repo of PyTorch implementations of well-known research paper architectures
Continuing our reinforcement learning theme - a great repo of course materials
Good visualisations can really help in understanding concepts. A couple of strong examples this month:
Finally a couple of very specific practical applications in python:
Practical tips
How to drive analytics and ML into production
Some useful tips and tricks to avoid the common pitfalls in building machine learning models
The founders of Tractable tell the story of how they went from recent undergrads to AI unicorn in 6 years and what they learned along the way
Don't over-complicate your models ... says HBR ("AI doesn't have to be too complicated or expensive")
Some useful tips on how specialised to go in your data science career
What makes a good analyst? Interesting perspective on the pitfalls of too narrow a recruitment strategy
"Analytics isn’t primarily technical. While technical skills are useful, they’re not what separate average analysts from great ones."
Bigger picture ideas
Longer thought provoking reads
What does it actually mean to count? Interesting article digging into the surprising complexity
Can AI systems learn from analogies? Melanie Mitchell (Davis Professor of Complexity at the Santa Fe Institute) thinks so, and believes it's the key to more efficient learning.
"If you tell me a story and I say, ‘Oh, the same thing happened to me,’ literally the same thing did not happen to me that happened to you, but I can make a mapping that makes it seem very analogous. It’s something that we humans do all the time without even realizing we’re doing it. We’re swimming in this sea of analogies constantly."
We use gradient descent almost everywhere in machine learning - but are there limits to its performance?
"There’s a slightly humorous stereotype about computational complexity that says what we often end up doing is taking a problem that is solved a lot of the time in practice and proving that it’s actually very difficult"
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:
Instantly identify birds from their song - Merlin
The Art Machine - put in text, get out AI art ... complete with code!
More computer generated imagery - this time from a story
"All of the images in this post were synthesized by a combination of several machine learning models, directed by text that I provided, VQGAN for generation, and CLIP for directing the image to match the text."
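For anyone tempted to experiment, below is a heavily stripped-down sketch of the CLIP-guided idea: optimise an image so that CLIP rates it as matching a text prompt. It assumes the openai/CLIP package is installed, and it optimises raw pixels rather than VQGAN latent codes (which is what the real pipelines do to get coherent images), so treat it purely as an illustration of the mechanism; the prompt is hypothetical.

```python
# Stripped-down sketch of CLIP-guided image generation (illustration only).
# Assumes the openai/CLIP package is installed; real VQGAN+CLIP pipelines
# optimise VQGAN latent codes rather than raw pixels.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for a simple example

prompt = "a watercolour painting of a lighthouse at dusk"  # hypothetical prompt
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize([prompt]).to(device))

# Start from random pixels and let CLIP "direct the image to match the text"
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimiser = torch.optim.Adam([image], lr=0.05)

# CLIP's expected input normalisation
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(300):
    optimiser.zero_grad()
    image_features = model.encode_image((image.clamp(0, 1) - mean) / std)
    # Maximise cosine similarity between image and text embeddings
    loss = -torch.cosine_similarity(image_features, text_features).mean()
    loss.backward()
    optimiser.step()
```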
Covid Corner
Still lots of uncertainty on the Covid front... vaccinations keep progressing in the UK, which is good news, but we still have very high community Covid case levels due to the Delta variant...
The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 70 people, worsening again after a couple of weeks of slight improvement.
One area where statistics and data science have been very useful is in modelling - another example, this time on large-scale mobility-based modelling
We have seen many proposed applications of AI throughout the pandemic, particularly in the identification of Covid - this in-depth article from the MIT Technology Review paints a damning picture of the actual successes.
“In the end, many hundreds of predictive tools were developed. None of them made a real difference, and some were potentially harmful.”
Updates from Members and Contributors
David Higgins has recently published two excellent articles on AI in healthcare, and would be really interested in discussing UK regulatory plans for medical AI with members of relevant agencies (MHRA, NICE, etc.)
"OnRAMP for Regulating Artificial Intelligence in Medical Products" is a best-practices guide to the development of regulatory compatible 'AI' in the medical field, a direct response to the call from the US FDA for good machine learning practices guidelines in this area.
"Artificial Intelligence in Healthcare: Lost In Translation?", a pre-print overview of AI in Healthcare, with the two authors' combined views on what is going wrong with translation in this field.
Mani Sarkar has added an excellent tutorial on stats and modelling to his section on Kaggle
Ronald Richman has recently published an interesting paper on interpretable deep learning for tabular data, a very relevant topic for many applications.
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.
- Piers
The views expressed are our own and do not necessarily represent those of the RSS