Experience a day in the life of a data scientist at Malvern Panalytical

By Mark Nicholls, Thursday, 3rd September 2020

Background

Since the formation of the Data Science group at Malvern Panalytical, one line of questions keeps coming up: “My daughter is thinking of a career in data science. How will she know if it’s a good fit?” Or, “My wife used to be a software engineer and is now returning to the workplace. She’s intrigued by Data Science, but doesn’t know if it’s right for her.”

We figured a good way to help answer this kind of question was to give a glimpse into a day in the life of a Data Scientist at Malvern Panalytical via an interactive workshop.

So on August 11th, we held our inaugural Data Science Experience Workshop, inviting ten inquisitive minds to learn the answers first-hand.

Our data science team is currently comprised solely of male practitioners, which is unfortunate because gender bears no relevance to the required skills. Our team composition is simply a reflection of the cross-section of people who’ve applied for these roles. So as an additional goal of this workshop, we wanted to focus on addressing this imbalance by encouraging a more diverse cross-section of the population to learn about what we do. All ten of the attendees for this instance of the workshop were female.

Introductions

Umberto Esposito kicked off the event with a brief introduction to Malvern Panalytical, a bit of background on the kinds of precision instruments we develop for measuring material properties, and our recent journey through digital transformation. Following introductions, our three presenters talked about their personal journeys and how their different studies and careers led them to become Data Scientists.

Rowena Innocent – our VP of R&D – joined the session to help put our data science journey into context and to reinforce our commitment to employee diversity, explaining how we’re trying to break down the barriers and entice more women into engineering.

The morning session was brought to a close with a more in-depth presentation on Machine Learning. Here, we talked about some of the key technical concepts, comparing the three main categories of Machine Learning. Then, we gave a practical demonstration of the most basic algorithms such as k-Means and k-Nearest Neighbor. We also explored some of the tools we use in Machine Learning via a whistle-stop tour of Python and scikit-learn.
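To give a flavour of how simple k-Nearest Neighbor is, here is a minimal sketch in plain Python. It is not the code shown in the workshop: the function name `knn_predict` and the toy data are made up for illustration. In practice, scikit-learn provides ready-made versions of both algorithms (`KNeighborsClassifier` and `KMeans`).

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated groups of 2-D measurements
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (1.1, 1.0)))  # query near the first group
print(knn_predict(points, labels, (5.1, 5.0)))  # query near the second group
```

The whole idea fits in a few lines: measure the distance to every known example, then let the closest ones vote.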

Umberto Esposito explains algorithm choices

Interactive session

Despite lockdown and the need for the workshop to be virtual, we were keen to include an interactive element. We also wanted participants to experience something closer to real-world data science work than the typical examples found online. To that end, Richard Green walked the participants through a “movie recommender” example, using Jupyter notebooks and real-world movie ratings data to make movie recommendations. The notebooks were self-contained, so participants with little experience would be able to run them and get results. Throughout the notebooks were small exercises in which participants could alter the code and try things out for themselves. We set up a JupyterHub server so that participants were given a fully pre-configured environment to work in and needed only a browser to take part.

The hands-on element included:

Exploration of the available data.
A simple recommender system based on movie genres and tags.
A more advanced recommender based on user ratings (‘users that are similar to you also liked…’).
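The “users that are similar to you also liked…” idea can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the workshop notebook: the ratings, user names and the functions `cosine_similarity` and `recommend` are all made up, and it assumes cosine similarity as the measure of how alike two users are.

```python
import math

# Made-up ratings (user -> {movie: stars}); not the workshop's data set
ratings = {
    "ana":   {"Alien": 5, "Heat": 4, "Up": 1},
    "ben":   {"Alien": 5, "Heat": 5, "Up": 1, "Jaws": 5},
    "chloe": {"Alien": 1, "Heat": 2, "Up": 5, "Coco": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=1):
    """Rank movies the user has not seen by a similarity-weighted sum of ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for movie, stars in other_ratings.items():
            if movie not in ratings[user]:
                # Ratings from more similar users count for more
                scores[movie] = scores.get(movie, 0.0) + sim * stars
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


print(recommend("ana", ratings))
```

Here "ana" rates films much as "ben" does, so a film that "ben" loved outranks one that "chloe" loved. A real recommender would typically also adjust for users who rate everything high or low, which this sketch leaves out.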

This allowed us to give participants hands-on experience of several data science techniques, to see how they might be used in practice and to think critically about the results.

The session went well, with participants managing to run the notebooks successfully and appreciating the real-life example.

Advanced Deep Learning Example

As an indication of what can be achieved with state-of-the-art open-source software, Ed Morris presented an example of image classification using deep learning techniques. This example used Facebook AI Research’s neural network library, PyTorch. The problem posed was conceptually simple but hard to solve: classifying bird species from images. Ornithologists will tell you that identifying bird species from photographs is a challenging task for humans, let alone software algorithms.

Ed Morris delivers his bird-classification example

The workshop focused on the complete end-to-end workflow to construct, develop, train and evaluate a deep learning image classification algorithm. It takes in a picture of a bird at one end and produces a prediction of the bird species at the other. The workflow involved:

Setting up the data set
Investigating and visualizing it
Construction and training of the neural network
In-depth assessment of the model prediction performance.

In addition to this we explored the inner workings of these methods and visualized what they use to make the classification decisions.

We achieved a bird classifier with a near state-of-the-art average class accuracy of ~82%, compared with pre-deep-learning-era results in 2010 of just 10%. This is quite astonishing and a true testament to just what is achievable with open-source, state-of-the-art libraries.
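For readers unfamiliar with the metric, “average class accuracy” is assumed here to mean the usual thing: the accuracy is computed separately for each species and then averaged, so that rare species count as much as common ones. The sketch below uses made-up labels, and the function name `average_class_accuracy` is ours, not something taken from the workshop code.

```python
from collections import defaultdict


def average_class_accuracy(true_labels, predicted_labels):
    """Mean of the per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)


# Made-up example: 8 robins (all identified) and 2 wrens (one mistaken for a robin)
truth = ["robin"] * 8 + ["wren"] * 2
preds = ["robin"] * 8 + ["robin", "wren"]

plain = sum(t == p for t, p in zip(truth, preds)) / len(truth)
print(plain)                                  # overall accuracy: 9 of 10 correct
print(average_class_accuracy(truth, preds))   # robins 100%, wrens 50%
```

In this toy case the overall accuracy is 0.9, but the average class accuracy is only 0.75, because the classifier does poorly on the rarer species.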

Hopefully, this demonstrated just what can be achieved in the field with the latest tools and minimal coding. If you’re interested in the deep learning techniques behind image classification, Ed gives a deeper dive in a series of articles which you can find here.

Wrapping up

You might wonder what bird classification and movie recommenders have to do with the typical challenges of working in a company like Malvern Panalytical, but it’s not as far removed as you might think. Image classification is one of the most common applications of Deep Learning, and Ed and others are using these techniques internally on several of our products. The movie recommender also uses many of the same concepts that our data scientists use routinely in their work.

After the workshop we sent all participants a feedback survey. Two-thirds of respondents said the workshop was the right level of difficulty, with the remaining third saying it was too hard. All but one respondent said the workshop was the right length; the remaining one said it was too long. Not bad given the sweltering temperatures on the day!

Overall, the participants rated the workshop 4.5 out of 5 stars, and on our key aim of helping participants decide whether to pursue a career in data science, 100% said that it helped.

We’ll be using the feedback to help us improve the workshop further, and if you’re interested, we’re definitely planning on holding more events like this in future.