October Newsletter
Hi everyone-
As the rain pours down it definitely feels like winter has arrived- all the more reason to spend some time indoors huddled up with some good data science reading materials!
Following is the October edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity while figuring out the difference between second waves and spikes...
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
Industrial Strength Data Science October 2020 Newsletter
RSS Data Science Section
Covid Corner
As Trump tests positive, the inevitable seems to be happening, with COVID-19 cases on the rise again in many areas of the world. As always numbers, statistics and models are front and centre in all sorts of ways.
It is clear that cases are rising in the UK. However, there is a fair amount of confusion as to how the current levels compare to earlier in the year given the increased amount of testing going on. The false positive rate of the tests has been quoted at "under 1%" by Secretary of State for Health, Matt Hancock, which sounds good, but depending on the prevalence of actual cases, could be a problem, as David Spiegelhalter pointed out. This was then widely misunderstood and so he had to reiterate that the rise in cases could not be explained away by this (the excellent "More or Less" podcast covers the topic in more detail).
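To make the prevalence point concrete, here is a minimal sketch of the arithmetic using Bayes' rule. The 1% false positive rate is the quoted upper bound; the 80% sensitivity and the two prevalence levels are purely illustrative assumptions, not figures from any of the sources above.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability that a positive test result is a true positive (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Assumed values for illustration only: 80% sensitivity, 1% false positive rate.
    for prevalence in (0.001, 0.05):
        ppv = positive_predictive_value(prevalence, 0.8, 0.01)
        print(f"prevalence {prevalence:.1%}: share of positives that are real = {ppv:.1%}")
```

Under these assumed numbers, only about 7% of positives are real when 0.1% of those tested are infected, rising to about 81% at 5% prevalence. The same test can look poor or respectable depending on who is being tested.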
In a similar vein, Professor Jon Deeks kindly articulated 21 ways to spin Covid test results accuracy: a particular favourite...
#19 Publish your results in a newspaper first (all criticism
of the study by scientists will be old news and sour grapes
by the time they get a chance to make it, and government policy
will already have been made)
All this has caused sufficient concern for the RSS to convene a working group on diagnostic tests
The RSS has been concerned that, during the Covid-19 outbreak,
many new diagnostic tests for SARS-CoV-2 antigen or antibodies
have come to market for use both in clinical practice and for
surveillance without adequate provision for statistical
evaluation of their clinical and analytical performance.
Of course, this is all rather undermined when you discover the official national case tracking data is being managed in Excel...
Elsewhere, Wired gives a good analysis of the different approaches being taken by the various vaccine research groups to show whether or not their vaccine actually works. A recent paper, Machine Learning for Clinical Trials in the Era of COVID-19 in the Statistics in Biopharmaceutical Research Journal, highlights how machine learning can help with some of these issues.
On the epidemiological front, a recent article in Nature highlights how anonymised mobile phone data can be used in innovative ways to track the spread of the virus.
Is dispersion (k) the overlooked variable in our quest to understand the spread of the virus? Breaking down the distribution of infection events (rather than using the average, as with R) could help better explain super-spreaders and inform test and trace programs. Really interesting article from the Atlantic.
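For a feel of why k matters, here is a rough simulation sketch. It assumes the common modelling choice of a negative binomial distribution for the number of people each case infects (drawn here as a gamma-Poisson mixture), with mean R and dispersion k. The values R = 2.5, k = 0.1 and k = 10 are illustrative assumptions, not estimates taken from the article.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def secondary_cases(r, k, n, seed=0):
    """Simulate how many people each of n cases infects.

    Each case gets its own infectiousness drawn from Gamma(shape=k, scale=r/k),
    so the resulting counts are negative binomial with mean r and dispersion k.
    """
    rng = random.Random(seed)
    return [poisson(rng.gammavariate(k, r / k), rng) for _ in range(n)]


def share_from_top(cases, fraction=0.2):
    """Share of all infections caused by the most infectious `fraction` of cases."""
    ranked = sorted(cases, reverse=True)
    top = ranked[: int(len(ranked) * fraction)]
    return sum(top) / sum(ranked)


if __name__ == "__main__":
    for k in (0.1, 10.0):  # small k = highly over-dispersed, large k = close to Poisson
        cases = secondary_cases(r=2.5, k=k, n=20000)
        print(f"k={k}: mean={sum(cases) / len(cases):.2f}, "
              f"top 20% cause {share_from_top(cases):.0%} of infections")
```

Both runs share the same average R, but with small k most cases infect nobody and a handful of super-spreaders account for most transmission, which is exactly what an average hides.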
If anyone is keen to roll up their sleeves and dig in to the data, the c3.ai COVID-19 Grand Challenge might be of interest...
Finally, The Alan Turing Institute is convening a public conference "AI and Data Science in the Age of COVID-19" on November 24th. In addition to public discussion there will be a series of closed workshop sessions to assess the response of the UK's data science and AI community to the current pandemic- if you are interested in participating in the closed sessions you can apply here.
Committee Activities
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don't hesitate to drop us a line.
As previewed in our last newsletter, and our recent release, we are excited to be launching a new initiative: AI Ethics Happy Hours. If you have encountered or witnessed ethical challenges in your professional life as a data scientist that you think would make for an interesting discussion, we would love to hear from you at dss.ethics@gmail.com (deadline October 15th).
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and has been active in lockdown with virtual events. Next up, on Monday October 12th, is "From Machine Learning to Machine Reasoning", by Drew Hudson from Stanford University. Videos are posted on the meetup's YouTube channel - and future events will be posted here.
Anjali Mazumder is helping organise the Turing Institute event mentioned above in Covid Corner.
Elsewhere in Data Science
Lots of non-Covid data science going on, as always!
Bias and more bias...
The more we collectively dig into the underlying models driving our everyday activities, the more issues we uncover...
"AI researchers tried to gauge 'trust' by looking at faces. Surprise: it's racist." What is even more disconcerting about this one is the apparently mediocre statistics applied: the fit on these graphs does not look particularly tight. Gelman chimes in here... As the great xkcd puts it:
"I don't trust linear regressions when it's harder to guess the
direction of the correlation from the scatter plot than
to find new constellations on it"
Another disconcerting example- Colin Madland noticed that Zoom's background feature didn't work in certain circumstances (any guesses?); and on posting his findings on Twitter, found something similar with how Twitter crops images! More commentary on Twitter image cropping here...
So why is all this happening? Rachel Thomas of Fast.ai gave an excellent talk on "Analysing and Preventing Unconscious Bias in Machine Learning" which digs into this very question and is well worth listening to. One key take-away is that it is very much the data scientist's responsibility to check for these types of biases in the outcomes and think through potential issues- something that is best identified through more diverse teams.
To try and educate on these topics, an impressive set of AI researchers have set up the Trustworthy ML Initiative, with a set of bi-weekly seminars.
Recommenders Gone Wild ...
One example that Rachel Thomas discussed in the talk above is recommendation systems. With the proliferation of content and product choices now available online, we could all use some help curating and narrowing down the options available. When implemented well, recommendation systems can elegantly assist in this. Many work through some form of collaborative filtering, which really boils down to identifying similar behaviours and extrapolating:
If Alice likes oranges, pineapples and mangos,
and Bob likes oranges and pineapples,
maybe Bob will also like mangos...
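The Alice and Bob example can be sketched in a few lines. This is a toy user-based collaborative filter using Jaccard similarity, which is just one of many possible similarity measures and not a description of any production system; Carol and her tastes are a hypothetical addition to show that dissimilar users contribute nothing.

```python
def jaccard(a, b):
    """Overlap between two sets of liked items, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes):
    """Rank items the user has not liked yet by the similarity of the users who like them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        similarity = jaccard(likes[user], items)
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked if score > 0]


likes = {
    "Alice": {"oranges", "pineapples", "mangos"},
    "Bob": {"oranges", "pineapples"},
    "Carol": {"kale"},  # hypothetical user with nothing in common with Bob
}

if __name__ == "__main__":
    print(recommend("Bob", likes))
```

Bob is recommended mangos because Alice, who shares two of his likes, also likes them. Note that the only signal is past behaviour, so whatever gets recommended and then consumed feeds straight back into the next round of similarities.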
However, depending on how these similarities are codified and calculated, it has now been shown that feedback loops can quite easily be generated.
Wired dug into the YouTube recommender in 2019 with Guillaume Chaslot, one of the original engineers on the project, highlighting how much the metric chosen to optimise - in this case viewing time - drives the material recommended, and so consumed.
In a recent follow up, "YouTube's Plot to Silence Conspiracy Theories" , they highlight some of the changes that have been implemented to reduce the issues identified. Interestingly the focus seems to be on identifying potentially hazardous material that is then excluded from the recommender rather than changing the recommender itself.
DeepMind recently released research digging into these feedback loops ("Degenerate Feedback Loops in Recommender Systems") giving a theoretical grounding to the concepts of "echo chambers" and "filter bubbles" and why they occur.
In "Overcoming Echo Chambers in Recommender Systems", Ryan Millar digs into alternative methods using the fabled MovieLens data set, giving an example of how, through different objective functions, you can reduce the feedback loop effects in the recommender system itself. This feels similar to the "explore" vs "exploit" trade-off in multi-armed bandit methods such as Thompson sampling, and is an approach well worth considering if you are building a system yourself.
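For anyone unfamiliar with the explore/exploit idea, here is a minimal Thompson sampling sketch for a Bernoulli bandit. To be clear, this is not the method from Ryan Millar's article: it is a generic illustration, and the three "click-through" rates are made-up values.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Run a Beta-Bernoulli Thompson sampler and return how often each arm was chosen.

    true_rates are the hidden success probabilities of each arm (e.g. each
    item we could recommend); the sampler only ever sees successes and failures.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible rate for each arm from its Beta posterior (uniform prior).
        # Uncertain arms sometimes draw high values, which is what drives exploration.
        samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


if __name__ == "__main__":
    # Assumed rates for illustration only.
    print(thompson_sampling([0.05, 0.10, 0.20], rounds=5000))
```

Early on the choices are spread across all arms; as evidence accumulates the sampler concentrates on the best one while still occasionally trying the others, rather than locking in on whatever happened to do well first.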
Finally Eugene Yan gives a useful summary of RecSys 2020, highlighting a number of research papers on the topic of feedback loops and bias in recommender systems.
Yet more GPT-3 ...
Continuing our regular feature on GPT-3 (OpenAI's 175 billion parameter NLP model) as it continues to generate news and commentary.
VWO has initiated a "friendly" challenge for copy-writers to take on GPT-3 via A-B testing...
Some more digging reveals GPT-3 doesn't really understand analogies... but we knew that anyway
However, somewhat incredibly, it seems GPT-3 can prove mathematical theorems!
AI Trends and Business
The well regarded State of AI report is out: all 177 slides of it. A few takeaways:
NLP is suffering from huge training costs making research accessible to very few
"Biology is experiencing its AI moment"
As we have been discussing, bias and ethical issues are increasingly prevalent
Andreessen Horowitz, the highly respected Silicon Valley-based venture capital firm, does a good job of talking through the economics of AI-based companies in "The New Business of AI"
On a different level, Lambert Hogenhout talks through how to help realise the potential of Data Science and AI teams within businesses.
Practical Projects
As always here are a few potential practical projects to while away the socially distanced hours:
A couple of interesting NLP projects/tutorials
Tracking gender equality and representation over time through the NY Times: a thoughtful end-to-end tutorial in Python including topic modelling.
Sentiment analysis of US Presidential Speeches over time: again, a useful Python tutorial and libraries.
How about building a recommendation system?
TensorFlow Recommenders makes this relatively straightforward- but remember to look out for biases...
For a more home-grown approach, how about a stand-up comedy recommender based on NLP
Updates from Members and Contributors
David Higgins has published an excellent article on "Pharma's Data Problem" with a particular focus on addressing "wide" data.
Rafael Garcia-Navarro mentioned Ducit.ai's recent great success with implementing Metaflow: Python, H2O and Metaflow sounds like a strong ML platform.
Mani Sarkar's NLP Profiler, which we mentioned in a previous newsletter, is now on GitHub and PyPI
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
- Piers