What roles do we need in a data science team?

When we think of building a data science team for a business, the first role we want to fill is data scientist. But is it enough to have only data scientists? What else do we need? And even within the data scientist role, what skills should we look for? Here I only talk about data science teams outside the companies that create data science algorithms themselves, like Google, Facebook, or BAT (Baidu, Alibaba, Tencent).

Data scientists specialized in structured data: if your business mainly deals with structured data, these are the people you need. They should know supervised and unsupervised machine learning algorithms, including but not limited to naive Bayes, linear regression models, logistic regression models, tree-based models, clustering algorithms, fully-connected neural networks, and topic modeling methods. With a good understanding of these models, they should be able to pick the right one for the problem.

Data scientists specialized in unstructured data: if your business mainly deals with unstructured data, like text or images, these are the people you need. Taking images as an example, their knowledge should cover convolutional neural networks, including typical architectures like ResNet, EfficientNet, DenseNet, and InceptionNet. There are also more basic architectures like VGG, which is less used nowadays. If they know these networks, and the commonly used activation functions, loss functions, and optimizers within a neural network, they should have the knowledge to select a good one (or several).

Data engineers: these are the programmers who know how to deploy a data science model. It is a very important but sometimes undervalued role. They should architect the CI/CD pipeline and find the best way to expose the model results to the users. They don’t need to know the statistics behind the model, but they do know the languages and frameworks, like Java, Node.js, or React, needed to build the front end and back end. If the model is going to be deployed to the cloud, they should be familiar with the services provided by AWS, Azure, or Google Cloud to host models.

Lead who understands business needs: it is all about managing the expectations of stakeholders! The gap between a business problem and a viable technical solution can sometimes be huge. Translating the business problem into a data science problem, explaining what can and cannot be done, and getting the stakeholders’ buy-in are as important as creating the right data science model.

Sometimes one person has several of these skills. It usually happens when this person is more senior. It is also rare to have all of these roles filled at the very beginning, so who to hire first is important. I will share some of my thoughts about who should be the first hire in my next post.

What are the top 10 machine learning questions I ask during a Data Science interview?

Over the last two years of my career, I have been part of strategy making and recruiting new team members. The role I interview for most often is Data Scientist. Here I have summarized the top 10 questions I always ask. In my next few posts, I will give my answers one by one.

  1. What is a logistic regression model? When do you use it?
  2. How do you interpret R^2 and p-value?
  3. Why is random forest called “random”?
  4. How is a decision tree built? How do you select the next node?
  5. What is bootstrapping / bagging? What is out-of-bag error?
  6. What is a support vector in SVM?
  7. What is PCA? How do you select the principal components? What are the limitations of PCA? What precautions should you take before applying PCA?
  8. What are the commonly used regularization methods?
  9. What is Bayes’ theorem? What is the assumption?
  10. What will you do to avoid overfitting?
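To make question 5 a bit more concrete, here is a minimal sketch of bootstrap sampling, not a full answer. It uses only the Python standard library; the helper names `bootstrap_sample` and `out_of_bag_indices`, the row count, and the seed are my own choices for illustration. A bootstrap sample draws as many rows as the dataset has, with replacement, and the rows that never get drawn are the "out-of-bag" rows that a bagged model can be evaluated on.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag_indices(n_rows, sampled):
    """Return the row indices that never appear in the bootstrap sample."""
    drawn = set(sampled)
    return [i for i in range(n_rows) if i not in drawn]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 1000           # hypothetical dataset size

sample = bootstrap_sample(n_rows, rng)
oob = out_of_bag_indices(n_rows, sample)
oob_fraction = len(oob) / n_rows

print(f"rows in sample: {len(sample)}, out-of-bag rows: {len(oob)}")
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For a large dataset the out-of-bag fraction comes out close to 1/e, roughly 37% of the rows. In bagging, each model is trained on its own bootstrap sample, and the out-of-bag error is the error measured on the rows that model did not see.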

These questions focus on the basics of statistics. I do believe it is important to be able to answer them. It shows that you not only know how to call a library but also understand why it works. At the same time, I would say more than half of the candidates who claim to have applied machine learning models successfully are not able to give me good answers. Feel free to share your thoughts with me.

As I promised, I will give my answers in the next few posts.

What does a typical day look like for a data scientist?

Before becoming a data scientist, maybe you have asked your friends or yourself this question: what does a typical day look like for a data scientist? My answer might not be representative or general enough, but at least it can give you some idea.

I work in a data science / consulting unit in the finance sector. My typical day looks like this:

9:00 – Come to the office (or to my home office during COVID). Get a coffee ready. Open Outlook and OneNote to go through my TODOs.

9:30 ~ 12:00 – Struggle with bugs in my code, switching between “git pull, git status, git commit, git push” and merge conflict error messages.

12:00 ~ 13:00 – Lunch. Grab food at restaurants near the office. In our company, we have a culture of making lunch appointments with colleagues from other units. This is a good chance to exchange ideas, share some work / life news, and socialise with other colleagues.

13:00 ~ 18:00 – Besides the normal coding work, there are more meetings (internal or with clients), because the afternoon is when working hours overlap across time zones.

18:00 ~ 19:00 – Reply to important and urgent emails before I call it a day for my office work.

19:00 ~ 21:00 – Dinner / take a break.

21:00 ~ 23:00 – Self-development activities. These are usually activities I set for myself, like reading books, writing blog posts, or refreshing some statistics basics. I have found it a good way to keep my knowledge fresh and to be ready to answer questions whenever I need to.