July Newsletter
Hi everyone-
Not sure what happened to June - it seemed to fly by - I know there were some lovely sunny days but then it got cold again... fingers crossed summer isn't over already! How about a few curated data science reading materials to enjoy in the garden, rain or shine?
Following is the July edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity ... We are continuing with our move of Covid Corner to the end to change the focus a little.
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.
Industrial Strength Data Science July 2021 Newsletter
RSS Data Science Section
Committee Activities
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid off (networking and introductions, advice on development etc.) don't hesitate to drop us a line.
We are working on releasing the video and a summary of the latest in our 'Fireside chat' series- an engaging and enlightening conversation with Anthony Goldbloom, founder and CEO of Kaggle. We will post a link when it is available.
We have released a survey to our readers and members focused on the UK Government's proposed AI Strategy. We are passionate about making sure the government focuses on the right things in this area, and feel that, as the organisation representing technical Data Science and AI practitioners, we need to make sure our voice is heard. If you haven't already, please give us your thoughts by participating here.
The full programme for this year's RSS Conference, which takes place in Manchester from 6-9 September, has been confirmed. The programme includes keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers. Registration is open with early-bird discounts available until Friday 4 June.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. On June 30th, the meetup hosted Frank Willet (Research Scientist at Stanford University) for a talk titled "High-performance brain-to-text communication via handwriting". Videos are posted on the meetup YouTube channel - and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics...
Bias, ethics and diversity continue to be hot topics in data science...
Rightly or wrongly, the military can often be at the forefront in adopting new technology- it seems we have a new first, with fully automated drones attacking humans for the first time, revealed in a recent UN report.
Of course autonomous drones rely on accurate mapping information to function - what happens if that underlying information is falsified? Deep fakes are now being applied to satellite imagery, making "portions of Seattle look more like Beijing".
Imagine a world where a state government, or other actor, can realistically manipulate images to show either nothing there, or a different layout entirely.
Capuchin, a leading behavioural science firm, are releasing their 'Techno Telepathy' study which highlights how seemingly innocuous data points can be linked to intimate personal information.
Not surprisingly, given all the activity we see every month, the current state of ethical AI is not great:
Wired has an in-depth review of what really happened with the Timnit Gebru firing at Google, and the challenges of running an ethical AI program within a corporate environment
Recent research from Corinium highlights that "most companies don't know what their AI is doing"
And a Pew Research Center study based on a survey of 'AI experts and advocates around the world' found:
"68% chose the option declaring that ethical principles focused primarily on the public good will not be employed in most AI systems by 2030"
However, there is plenty of development going on to combat bias and keep AI aligned to the public good
OpenAI are working on making their groundbreaking GPT-3 language model "less racist and terrible", or as OpenAI put it, "improving language model behaviour by training on curated data sets"
And Facebook have announced that they can now reverse engineer (and thus identify) deep-fakes with just a single image
Our method will facilitate deepfake detection and tracing in real-world settings, where the deepfake image itself is often the only information detectors have to work with.
Developments in Data Science...
As always, lots of new developments...
Reinforcement Learning (RL) has driven many breakthroughs such as DeepMind's AlphaGo. However, RL sometimes struggles in more 'real-world' applications because it requires accurate simulations to work (the 'sim-to-real' gap). Google/DeepMind are actively working to improve this and make RL more widely applicable:
they have released 'AndroidEnv' which allows Reinforcement Learning agents to interact with real-world apps
they have also released research on more accurate physics simulators for use by robotic systems
This looks like a very elegant approach from Oscar Manas to labelling satellite imagery using Self Supervised Learning, leveraging seasonal changes at the same location:
In remote sensing images, we can use temporal information to obtain pairs of images from the same location at different points in time, which we call seasonal positive pairs. Seasonal changes provide more semantically meaningful content than artificial transformations, and remote sensing images provide this natural augmentation for free.
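For the curious, here's a minimal sketch (not from the paper itself, and with entirely hypothetical embeddings) of the contrastive idea behind positive pairs: an image and its "seasonal positive" from the same location should score higher against each other than against other scenes in the batch, which an InfoNCE-style loss rewards.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor should match its own positive,
    with the other positives in the batch acting as negatives."""
    # L2-normalise embeddings so dot products are cosine similarities
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    positives = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = anchors @ positives.T / temperature          # (batch, batch)
    # cross-entropy with the diagonal (matching pairs) as the correct class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
# toy embeddings of 8 locations: a "summer" view and two candidate pairings
summer = rng.normal(size=(8, 32))
winter_aligned = summer + 0.05 * rng.normal(size=(8, 32))  # same place, new season
winter_random = rng.normal(size=(8, 32))                   # unrelated scenes

# seasonal positive pairs should give a much lower loss than random pairings
print(info_nce(summer, winter_aligned) < info_nce(summer, winter_random))  # True
```

The real method trains an encoder so that genuine seasonal pairs end up in this "aligned" regime; the snippet only shows why such pairs make a useful training signal.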
Facebook have released 'TextStyleBrush' allowing you to emulate a text style in an image using just a single word
Generating realistic synthetic video is computationally intensive - new work out of UC Berkeley, called VideoGPT, uses novel approaches to make the whole process more efficient, allowing anyone to generate video on a standalone computer.
A Chinese Lab is challenging the supremacy of Google and OpenAI in the language model space with a model containing 1.7 trillion parameters. Interestingly, the original article seems to have been removed - although copies are still available online, with more technical details:
The Chinese lab claims that Wudao's sub-models achieved better performance than previous models, beating OpenAI’s CLIP and Google’s ALIGN on English image and text indexing in the Microsoft COCO dataset
With the drive for larger and larger models, requiring more and more computational power and cost (and data) to train, this looks like an interesting approach to understanding whether we are actually improving the underlying learning efficiency. The measure of interest is "reductions over time in the compute needed to reach past capabilities", and the analysis finds that the improvements correspond to "algorithmic efficiency doubling every 16 months over a period of 7 years".
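Taking the quoted doubling rate at face value, a quick back-of-envelope calculation shows what it implies over the 7-year window:

```python
# Efficiency "doubling every 16 months over a period of 7 years":
# how much less compute is needed to reach the same capability?
months = 7 * 12                          # 84 months
doubling_period = 16                     # months per doubling
doublings = months / doubling_period     # 5.25 doublings
total_gain = 2 ** doublings
print(round(total_gain))                 # ~38x less compute
```

So a steady 16-month doubling compounds to roughly a 38-fold reduction in compute for the same capability - a substantial counterweight to the headline growth in model sizes.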
Bringing us down to earth a little, this insightful piece in Quanta magazine highlights the difficulty ML algorithms have in understanding whether or not two things are the same or different.
"Will better engineering produce CNNs [Convolutional Neural Networks] that understand sameness and difference in the generalizable way that children do? Or are CNNs’ abstract-reasoning powers fundamentally limited, no matter how cleverly they’re built and trained?"
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!
I'm not familiar with the underlying challenge, but I understand this is a big breakthrough (Nature paper here): a team at Google has automated the design of the physical layout of computer chips using deep reinforcement learning.
This is pretty compelling- well worth a read: Facebook AI have released details of their advanced object recognition system which allows consumers to shop items from images. It uses an elegant compound approach, modelling the objects and attributes separately as well as multi-modal signals. Also good to see they are attempting to avoid bias by building and monitoring the models appropriately:
"As part of our ongoing efforts to improve the algorithmic fairness of models we build, we trained and evaluated our AI models across subgroups, including 15 countries and four age buckets."
Feels like an excellent use-case: Sonoma County in California is using AI-based applications to monitor real-time video feeds and alert on wildfires.
Real-time AI-based re-routing of flight plans for Alaska Airlines - impressive.
Not sure how practical the application is at this point, but National Novel Generation Month (NaNoGenMo) highlights the progress (or lack thereof) in automated "novels" - 50,000-word documents written by computers. Great summary from Greg Kennedy here
“Welcome to Hardcore High School” bellowed the script kiddo. We had just gotten to the kindergarten level when the music and lights began to blink. I frowned. “What is that?”
“Beats me” said the A.I. As he walked down the halls, mimicking the sounds of the various musical instruments, he fiddled with the script kiddo a bit. “Welcome to Hardcore High School” He said again, a bit more softly this time.
Insight into how the Instagram recommender actually works... as is the case in many real-world systems, it's a combination of a number of different 'algorithms'...
An open source AutoML option - PyCaret
How does that work?
A new section on understanding different approaches and techniques
Multi-task learning is difficult - useful primer here from The Gradient.
These days there are excellent machine learning libraries and you would rarely build from scratch in a production environment. However, the process of doing so can really help with understanding how the different approaches work. This is a nice repository from Oleksii Trekhleb with hand-crafted examples of how things work.
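In that spirit, here's a tiny from-scratch example (my own illustration, not taken from the repository): fitting linear regression by plain gradient descent, the sort of exercise that makes library internals feel much less mysterious.

```python
import numpy as np

# Synthetic data: y = 3x + 1 plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=200)

# Gradient descent on mean squared error
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    error = w * X[:, 0] + b - y
    w -= lr * 2 * np.mean(error * X[:, 0])  # dMSE/dw
    b -= lr * 2 * np.mean(error)            # dMSE/db

print(round(w, 1), round(b, 1))  # recovers roughly 3.0 and 1.0
```

Fifteen lines, and you've re-derived the core loop hiding inside far grander training frameworks.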
Another good tutorial on Transformers.
An excellent tutorial on Bayesian Hierarchical Modelling at scale from Florian Wilhelm, with some useful pointers on the difference between MCMC and Variational Inference approaches.
Understanding Gaussian processes - a useful interactive visualisation.
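If you'd rather poke at one in code than in a browser, here's a minimal sketch of the core idea: a Gaussian process prior is just a multivariate normal whose covariance comes from a kernel, so "sampling a function" is sampling one big Gaussian vector.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sqdist / length_scale**2)

x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

# Draw three random functions from the GP prior N(0, K)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)  # (3, 100): three smooth curves over the grid
```

Plot the three rows of `samples` against `x` and you get the smooth wiggly curves the interactive visualisation animates; shrink `length_scale` and the curves get wigglier.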
Getting it live
How to drive ML into production
Facebook research discussing how standardising their ML modelling framework on PyTorch has driven dramatic benefits to their productivity ("bridging the research to production gap").
"On a daily average, there are over 4,000 models at Facebook running on PyTorch"
The importance of Data preparation and curation in the ML lifecycle is highlighted in this piece on Data Cascades from Google Research.
"One of the most common causes of data cascades is when models that are trained on noise-free datasets are deployed in the often-noisy real world. For example, a common type of data cascade originates from model drifts, which occur when target and independent variables deviate, resulting in less accurate models"
In a similar vein, Andrew Ng has launched the first data centric Kaggle style competition. Instead of keeping the data constant, and iterating over models, the modelling approach is kept static, and the competition is about generating the best data set - very interesting to see where this leads.
Good retrospective on the Netflix prize (the 'first' Kaggle competition) from Xavier Amatriain (who was in charge at Netflix at the time) and whether it was worth the $1m ... yes!
More discussion of the emerging role of the Analytics Engineer and the benefits it brings to organisations.
From Prediction to Decision
The art and science of decision making
Lovely extended essay from Hannah Fry on the history of graphs and how they help us understand data and make decisions
An excellent article published in HBR from Michael Ross on why company investments in AI often don't generate the gains they expect (the asymmetric cost function is particularly interesting)
(1) They don’t ask the right question, and end up directing AI to solve the wrong problem.
(2) They don’t recognize the differences between the value of being right and the costs of being wrong, and assume all prediction mistakes are equivalent.
(3) They don’t leverage AI’s ability to make far more frequent and granular decisions, and keep following their old practices
Thought provoking article from Scott Lundberg on the risks in interpreting causal connections from predictive models
Fun article highlighting game theory in action - "what's 2/3rds of the average"
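The fun of that game is the iterated reasoning, which a few lines make concrete: if everyone guesses uniformly the average is 50, so you should guess 2/3 of 50; but if everyone reasons that far, you should guess 2/3 of *that*, and so on down.

```python
# "Guess 2/3 of the average": each level of reasoning shrinks the
# sensible guess by a factor of 2/3, starting from a naive average of 50.
guess = 50.0
for level in range(1, 11):
    guess *= 2 / 3
    print(f"level {level}: {guess:.2f}")
# After 10 levels the guess is below 1 - the iteration is heading
# toward 0, the game's unique Nash equilibrium.
```

In practice, of course, real players stop after a level or two - which is exactly what makes the game a nice probe of how people actually reason.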
Bayesian statistics can often be a good way of thinking about decisions- what is my prior understanding, and how does new information change that understanding? Here is a useful 30-minute tutorial on Bayesian Decision Science from Ravin Kumar at PyData Global, with code and tutorials here.
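As a flavour of that prior-plus-data updating (a textbook conjugate-prior illustration, not taken from the tutorial): suppose we hold a weak prior that a conversion rate is near 50%, then observe 18 conversions in 30 trials.

```python
# Beta-Binomial updating: prior Beta(alpha, beta), Binomial data
alpha, beta = 2, 2            # weak prior centred on 0.5
successes, trials = 18, 30    # new evidence: 18 conversions in 30 trials

# Conjugacy makes the posterior another Beta, updated by simple counts
alpha_post = alpha + successes
beta_post = beta + (trials - successes)
posterior_mean = alpha_post / (alpha_post + beta_post)
print(round(posterior_mean, 3))  # 0.588: pulled from 0.5 toward 18/30 = 0.6
```

The posterior mean sits between the prior (0.5) and the data (0.6), weighted by how much evidence each carries - the essence of the decision-making framing in the talk.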
Finally, Michael Mullany steps back and looks at 20 years of "hype cycles" - what technology trends have we successfully predicted (not many!) and what predictions have not come to fruition (a lot!). Niels Bohr comes to mind: “it is difficult to predict, especially the future”...
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:
Emulate any guitar pedal with NeuralPi - plenty to get stuck into there!
Covid Corner
Again, more positive progress in the UK on the Covid front with 45m people now having received their first vaccine dose and over 30m fully vaccinated. However, the new Delta variant originating in India is cause for concern and case rates and hospitalisations are now rising again.
The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 440 people, a significant increase from last month's estimate (1 in 1100). Recent research shows that we are really dealing with two different pandemics right now- a significant decline in infection rates of the original Coronavirus strains, but a significant increase in the new Delta variant (B.1.617.2)
The Gradient has a good review of where AI has been able to help in combating the pandemic - bottom line, quite a lot of different areas although perhaps not quite as successfully as hoped.
Apparently, China has led the way in using AI to stem the spread
However, this is intriguing - impressive sleuthing from Bloom Lab highlighting the case of missing sequencing data from around the time of the original outbreak in Wuhan.
Without a doubt, progress on the vaccine front has been astounding, and the rise of mRNA based approaches should be very beneficial for the future.
Updates from Members and Contributors
Everyone must be out enjoying themselves as no specific updates from members and contributors this month- let me know if you'd like to include anything here next month.
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.
- Piers
The views expressed are our own and do not necessarily represent those of the RSS