Machine Learning

12 Things I Learned During My First Year as a Machine Learning Engineer

Being your own biggest sceptic, the value in trying things which might not work and why communication problems are harder than technical problems.

Daniel Bourke

06 Jul 2019 • 10 min read

Machine learning and data science are both broad terms. What one data scientist does can be very different to another. The same goes for a machine learning engineer. What’s common is using the past (data) to understand or predict (build models) the future.

To put the points below in context, I’ll explain what my role was.

We had a small machine learning consulting team. And we did it all, from data collection to manipulation to model building to service deployment in every industry you can think of. So everyone wore many hats.

The past tense is because I’ve since left my role as a machine learning engineer to work on my own business. I made a video about it.

What my day looked like

9am, I’d walk in, say the good mornings, put my food in the fridge, pour a cup of joe and walk over to my desk. Then I’d sit down, look at my notes from the previous day and open up Slack. I’d read the messages and open up any links to papers or blog posts the team had shared, there’d be a few, this field moves fast.

Once the messages were cleared, I’d skim through the papers and blog posts and read the ones which stuck. Usually there was something which may have helped with what I was working on. Reading took up to an hour, sometimes more, depending on what it was.

Why so long?

Reading is the ultimate meta-skill, if there was a better way of doing what I was doing, I could save time and effort by learning it and implementing it.

It’s now 10am.

If there was a deadline approaching, reading would be cut short to push forward on the project(s). That’s where the biggest chunk of the day went. I’d review my work from the previous day and check my notepad for next steps I put down.

My notepad was a flowing journal of the day.

“I’ve manipulated the data to be the right shape, now I need to run it through the model, I’ll keep the training short to begin with, then step it up if progress has been made.”

And if I got stuck.

“There’s a data mismatch happening. Next will be to fix the mix match and getting a baseline before trying the new model.”

The majority of the time was making sure the data was in a form ready to be modelled.

4pm would roll around and it was time to start winding things down. Winding down involved cleaning the mess of code I’d created to make it legible, adding comments, restructuring. What if someone else had to read this? That’s the question I’d ask. Usually it was me. It’s amazing how quickly a train of thought can be forgotten.

By 5pm my code was on GitHub and notes for the next day were on my notepad.

This was an ideal day but not every day. Sometimes you’d have a beautiful idea pop into your head at 4:37pm and follow it.

Now you’ve got a rough idea of what happened day to day, we’ll get specific.

1. It’s always about the data

If you’re familiar with some data science first principles, this seems trite (fancy word for obvious). But it’s amazing how often I’d forget. Too many times you’d get focused on building a better model rather than improving the data you were building it on.

Building a bigger model and using more compute power can provide exciting short-term results. However, take enough shortcuts and you end up taking the longcut.

When first engaging with a project, spend an abnormal amount of time getting familiar with the data. I say abnormal because often you’ll have to take your first estimate and multiply it by 3. This will save you time in the long run.

This doesn’t mean you shouldn’t start small. We’ll get to that.

With any new dataset, your goal should be to become a subject matter expert. Check the distributions, find the different kinds of features, where are the outliers, why are they outliers? If you can’t tell a story about the data you’re working with, how do you expect your model to?

An example exploratory data analysis lifecycle (what you’ll do every time you encounter a new dataset). More on this in A Gentle Introduction to Exploratory Data Analysis.

2. Communication problems are harder than technical problems

Most of the major roadblocks I ran into were not technical, they were communicative. Sure, there were always technical challenges but that’s the role of an engineer, to fix the technical challenges.

Never underestimate the importance of communication, internal and external. There’s nothing worse than solving a technical challenge when it was the wrong technical challenge to be solved.

How does this happen?

Externally, it was a mismatch between what a client was after and not so much what we could offer but what machine learning could offer.

Internally, because many people wore many hats, it can get hard to make sure everyone is aligned with the same goal.

These challenges aren’t unique. Machine learning can seem magical. And in some cases it is. But in the cases it’s not, it’s important to acknowledge it.

How do you fix it?

Touch base. Regularly. Does your client understand what you can offer? Do you understand your client's problem? Do they understand what machine learning can offer and what it can’t? What’s a useful way you can communicate what you’ve found?

How about internally?

You can tell how hard of a problem internal communication is based on the number of software tools out there trying to solve it. Asana, Jira, Trello, Slack, Basecamp, Monday, Microsoft Teams.

One of the most effective ways I found was a simple message update in the relevant project channel at the end of the day.

Update:

3–4 points
about what I did
and why

What I’m going to do next based on the above

Was it perfect? No. But it seemed to work. It gave me an opportunity to reflect on what I did and wanted to do with the added benefit of being public, which meant my work could be criticised if it seemed off.

It doesn’t matter how good of an engineer you are, your ability to maintain and gain new business is correlated with your ability to communicate what your skills are and the benefits they bring.

3. Stable > state of the art (generally)

We had a natural language problem. Classifying text into different categories. The goal was for users to send a piece of text to a service and have it automatically classified into one of two categories. And if the model wasn’t confident on the predictions, pass the text to a human classifier. The load was about 1000–3000 request per day. Not massive, not small.

BERT had been the flavour of the year. But without Google scale compute, training a model with BERT to do what we needed required far too much massaging. And that was before taking it to production.

Instead, we used another method, ULMFiT, which on paper wasn’t the state of the art, but still produced more than adequate results and was far easier to work with.

Shipping something which works provides far more value than sitting on something you’re trying to push to perfection.

4. Two gaps in machine learning

There are two gaps in putting machine learning into practice. The gap of going from course work to project work and the gap going from models in a notebook to models in production (model deployment).

An internet search for machine learning courses returns a plethora of results. I used many of them to create my own AI Masters Degree.

But even after completing many of the best ones, when I started as a machine learning engineer, my skills were built on the structured backbone of courses. In reality, projects don’t come structured.

I lacked specific knowledge. The skills which can’t be taught in a course. How to question data, what to explore versus what to exploit.

Specific knowledge: skills which can’t be taught in a course but can be learned.

What was the fix?

I was lucky to be around the best talent in Australia. But I was also willing to learn and willing to be wrong. Of course, being wrong wasn’t the goal but in order to be right, you have to figure out what’s wrong.

If you’re learning machine learning through a course, keep going with the course but arm yourself with specific knowledge by putting what you’re learning to practice by working on your own projects.

What about deployment?

I’m still weak at this. But I did notice a trend. Machine learning engineering and software engineering are merging. With services like Seldon, Kubeflow and Kubernetes soon machine learning will be another part of the stack.

It’s one thing to build a model in a Jupyter Notebook but how do you make that model available to thousands, even millions of people. Judging by the view counts on recent talks at Cloud Native events, not many people outside of large companies know how to do this.

5. 20% time

We had a rule. 20% time. It meant 20% of our time could be spent on learning things. Things was a loose term, meaning things in the world of machine learning. There’s a lot.

This proved invaluable more than once. The ULMFiT usage over BERT came as a result of 20% time.

20% time meant 80% would be spent on core projects.

80% on core product (machine learning professional services).
20% on new things related to core product.

It didn’t always split like this but it’s a good target to have.

If your business advantage is being the best at what you do now, future business depends on you continuing to being the best at what you do. This means constantly learning.

6. 1 in 10 papers get read, less get used

This is a rough metric. But explore any dataset or phenomenon and you’ll soon learn it happens everywhere. It’s Zipf’s law or Price’s law, one of them, they’re both similar to me. Price’s law states that half the publications come from the square root of the number of all authors.

In other words, out of thousands of submissions per year, you may have 10 groundbreaking papers. And out of those 10 groundbreaking papers, 5 might come from the same institute or person.

The takeaway?

There’s no way you can keep up with every new breakthrough. Better to get a solid foundation of the basic principles and apply them. These are the ones which have stood the test of time. The original breakthroughs the new breakthroughs refer to.

But then comes the exploration versus exploitation problem.

7. Be your own biggest sceptic

You can deal with the exploration versus exploitation problem by being your own biggest sceptic.

The exploration versus exploitation problem is the dilemma between trying something new and reapplying what already works.

Exploitation

It’s easy to run a model you’ve already used and get a high accuracy number and then report it to the team as a new benchmark. But if you’re getting a good result, remember to check your work, and again, and again, get your team to do the same. You’re an engineer-scientist for a reason.

Exploration

20% time helps with this. But 70/20/10 could work better. Perhaps you spend 70% on core product, 20% on building upon core product and 10% on moonshots, things which might not (probably won’t) work.

I never practised this in my role but it’s something I’m moving towards.

8. The toy problem first thing works

Toy problems work. Especially to help understand a new concept. Build something small. It could be a small subset of your data or unrelated dataset.

Being in a small team, the trick was to get something working, then iterate, fast.

9. The rubber duck

Ron taught me this one. If you’re stuck on a problem, sitting and staring at code could may solve it, may not. Instead, talk it out in language with a teammate. Pretend they’re your rubber duck.

“Ron, I’m trying to loop through this array, and track the state whilst looping through this other array and track the state, then I want to combine the states into a list of tuples.”

“Loops in loops? Why don’t you vectorize it?”

“Can I do that?”

“Let’s find out.”

10. Models built from scratch are declining (or at least you don’t need them to start)

This ties back into the point about machine learning engineering merging with software engineering.

Unless your data problem is very specific, many of the major ones are quite similar, classification, regression, time series predictions, recommendations.

Services like Google’s and Microsoft’s AutoML and others are making world-class machine learning available to everyone who can upload a dataset and pick the target variable. It’s early days but they’re gaining momentum, fast.

On the developer side of things, you’ve got libraries like fast.ai which make state-of-the-art models available in a few lines of code and various model zoos (a collection of pre-built models) like PyTorch hub and TensorFlow hub which offer the same.

What does this mean?

Knowing the basic principles of data science and machine learning is still required. But knowing how to apply them to your problem is even more valuable.

Now there’s no reason your baselines shouldn’t be something close to, if not, state-of-the-art.

11. Math or code?

For the client problems I worked on, we were all code first. All the machine learning and data science code was Python. There were times I’d dabble in math through reading a paper and replicating it but 99.9% of the time, existing frameworks had the math covered.

This isn’t saying math is unnecessary, after all, machine learning and deep learning are both forms of applied math.

Knowing at a minimum matrix manipulation, some linear algebra and calculus, particularly the chain rule is good enough to be a practitioner.

Remember, my goal wasn’t to invent a new machine learning algorithm, it was to demonstrate to a client the potential machine learning had (or didn’t have) for their business.

Side note: In beautiful timing with this article, fast.ai have just released a new course, Deep Learning from the Foundations which goes through the math and code of deep learning from scratch. It’s designed for people like me who are familiar with applying deep learning and machine learning but lack a math background. To fix this, I’m going through it and it’s been immediately added to my list of favourite machine learning and data science resources. With solid foundations, you can build your own state-of-the-art rather than iterating on a previous one.

12. The work you did last year will probably be void next year

This is a given. And is becoming more so due to the merging of software engineering and machine learning engineering.

But it’s what you signed up for.

What stays the same?

Frameworks will change, libraries will change, but the underlying statistics, probability, math, these things have no expiration date (even better timing with the new fast.ai course).

The biggest challenge is still: how you apply them.

What now?

There could be more but 12 is enough.

Working at Max Kelsen was the best job I ever had. The problems were fun but the people were better. I learned a lot.

Leaving wasn’t an easy choice but I’ve decided to put what I learned to the test on my own. You’ll find me at the crossroads of health, technology and art. And live on YouTube.

If you have any questions, my Twitter DM’s are open or subscribe to my newsletter for updates on what I’m up to.