My New Take on Things from Six Months in Data Science

Since starting my job as a data scientist six months ago, I’ve had some profound experiences and a higher level of job satisfaction than I’ve had before. In celebration of my first half year in this new role, here are the lessons I’ve learned along the way.

#1 — Review papers on arXiv

You may already know that reviewing arXiv is good for your career. It’s a wellspring of amazing ideas and advancements in the field of data science.

I’ve been amazed by the volume of actionable insights I encounter on arXiv. As an example, I might not have resources like 16 TPUs and $7k to train BERT from scratch, but the suggested hyperparameter configurations from the Google Brain team are a good starting place for fine-tuning (check Appendix A.3 of the BERT paper).
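
For reference, Appendix A.3 suggests sweeping batch size, learning rate, and epochs over a small grid. Here’s a minimal sketch of that sweep; `fine_tune` is a hypothetical stand-in for your own training loop, not anything from the paper:

```python
from itertools import product

# Fine-tuning search space suggested in the BERT paper (Appendix A.3).
batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epochs_options = [2, 3, 4]

def fine_tune(batch_size, learning_rate, epochs):
    """Hypothetical placeholder: plug in your framework's training loop
    here and return a validation score."""
    print(f"batch_size={batch_size}, lr={learning_rate}, epochs={epochs}")
    return 0.0

# Try every configuration and keep the one with the best validation score.
best_config = max(
    product(batch_sizes, learning_rates, epochs_options),
    key=lambda cfg: fine_tune(*cfg),
)
print("best configuration:", best_config)
```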

Hopefully, your favorite new package will have a good read on arXiv to add a splash of color to its documentation and to your life. As another example, I was delighted to learn how to deploy BERT from the extremely readable and useful write-up on ktrain, a library that sits atop Keras and offers a streamlined machine learning interface for images, graphs, and text.
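
To give a flavor of how streamlined that interface is, here’s a minimal sketch of fine-tuning and deploying BERT with ktrain’s Transformer workflow, as I remember it from the docs (the toy data is a placeholder, and details may differ across versions):

```python
import ktrain
from ktrain import text

# Toy placeholder data; swap in your real corpus and labels.
x_train = ["great product", "terrible support", "works well", "total waste"]
y_train = ["pos", "neg", "pos", "neg"]

# Wrap a BERT model in ktrain's streamlined Transformer interface.
t = text.Transformer("bert-base-uncased", maxlen=128, class_names=["neg", "pos"])
trn = t.preprocess_train(x_train, y_train)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, batch_size=2)
learner.fit_onecycle(2e-5, 1)  # one-cycle learning rate schedule, 1 epoch

# Bundle the model and its preprocessing together for deployment.
predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.save("bert_predictor")
print(predictor.predict("surprisingly pleasant to use"))
```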

#2 — Listen to podcasts for tremendous situational awareness

Podcasts will not enhance your coding abilities, but they will add to your understanding of current developments in machine learning: trending packages and tools, open questions in the field, new methods for old problems, common psychological insecurities among those in the profession, and more.

The podcasts I enjoy daily have deepened my understanding of the field and kept me up to date on the rapid pace of change in data science.

Here are my current favorites for podcasts: “Resources to Supercharge Your Data Science Learning in 2020,” a helpful collection of journals, videos, and lectures to advance your understanding of machine learning (towardsdatascience.com).

Just recently, podcasts have gotten me hyped about advancements in NLP, helped me follow the latest developments in GPUs and cloud computing, and led me to question the potential symbiosis between advancements in artificial neural nets and neurobiology.

#3 — Study Issues on GitHub

From my experience trawling this ocean of issues for giant tuna of wisdom, here are three possible wins:

  1. I usually gather ideas from seeing how others are using (and misusing) a package
  2. It’s also helpful to learn about the situations in which a package tends to fail, so you can develop a sense for potential future issues in your own work
  3. While setting up your work environment and performing model selection, you can benefit from taking into account the responsiveness of the developers and the community before including an open source tool in your work (see the sketch after this list)
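
Here’s a hedged sketch of that last idea using GitHub’s public REST API to gauge how quickly a project’s maintainers close issues; the repo name is just an example:

```python
import requests
from datetime import datetime

# Repo chosen purely as an example; substitute the package you're vetting.
repo = "amaiya/ktrain"

# Overall issue load (note: GitHub counts open PRs in this number too).
info = requests.get(f"https://api.github.com/repos/{repo}").json()
print("open issues + PRs:", info["open_issues_count"])

# Sample the most recently closed issues to estimate time-to-close.
closed = requests.get(
    f"https://api.github.com/repos/{repo}/issues",
    params={"state": "closed", "per_page": 20},
).json()

for issue in closed:
    if "pull_request" in issue:  # this endpoint also returns pull requests
        continue
    opened = datetime.fromisoformat(issue["created_at"].rstrip("Z"))
    resolved = datetime.fromisoformat(issue["closed_at"].rstrip("Z"))
    print(f"#{issue['number']} closed in {(resolved - opened).days} days")
```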

#4 — Understand the algorithm-hardware link

I’ve done a lot of NLP in the last six months, so let’s talk about BERT again.

In October 2018, BERT was born and the world was shook, much like when Superman first leapt a tall building in a single bound (it’s crazy to think that Superman didn’t originally fly when he was first introduced!).

BERT was a big change in the capacity of machine learning to handle text-processing tasks. Its cutting-edge results rely on the parallelism of its transformer architecture running on Google’s TPU chips.

Knowing how TPU- and GPU-based machine learning impact your work differently is important for enhancing your capabilities as a data scientist. It’s also a crucial step in building intuition about the inextricable link between machine learning software and the physical constraints of the hardware it runs on.
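
As a concrete starting point, here’s a minimal TensorFlow 2.x sketch (my own illustration; exact APIs vary across versions) that picks a distribution strategy based on whichever accelerator is available:

```python
import tensorflow as tf

# Detect the available accelerator and choose a distribution strategy.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
    print("running on TPU")
except ValueError:  # raised when no TPU is reachable
    gpus = tf.config.list_physical_devices("GPU")
    strategy = tf.distribute.MirroredStrategy() if gpus else tf.distribute.get_strategy()
    print(f"running on {len(gpus)} GPU(s)" if gpus else "running on CPU")

# Variables created inside the scope are placed on the chosen hardware.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(2, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```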

With Moore’s law petering out around 2010, increasingly creative methods will be required to overcome hardware limitations and continue advancing toward more and more intelligent systems.

Chart from an Nvidia presentation showing transistors per square millimeter by year. This highlights the stagnation in transistor count around 2010 and the rise of GPU-based computing.

I’m bullish on the rise of co-design between ML models and computing hardware, increased reliance on sparsity and pruning, and even “no-specialized-hardware” machine learning that looks to disrupt the dominance of the current GPU-centric paradigm.

#5 — Study the Social Sciences

There’s much that this field can gain from studying the reproducibility crisis that hit the social sciences in the mid-2010s (and is still ongoing):

“p-value hacking” for data scientists. Comic by Randall Munroe of xkcd

In 2011, a crowdsourced academic collaboration set out to reproduce the results of 100 published psychology studies. It largely failed: only 36% of the replication attempts reported statistically significant results, versus 97% of the original studies.

The reproducibility crisis in psychology shows what happens when science is built on shaky methodology.

Data science requires testable, reproducible methods to solve problems. To get rid of p-hacking, data scientists must set limits in advance on how they mine their data for predictive features and on the number of tests they run when analyzing metrics.
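
To make the trap concrete, here’s a small simulation of my own (not from any of the studies above): run enough uncorrected tests on pure noise and some will look “significant”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 100, 0.05

# 100 t-tests between two groups of pure noise: no real effect anywhere.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
])

print("spurious 'discoveries' at alpha = 0.05:", int((p_values < alpha).sum()))

# A Bonferroni correction raises the bar to alpha / n_tests.
print("after Bonferroni correction:", int((p_values < alpha / n_tests).sum()))
```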

There are lots of tools to aid experiment management. I have used MLflow; this excellent article by Ian Xiao mentions six others, along with recommendations for four other parts of the machine learning workflow.
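
As a minimal sketch of what experiment tracking looks like with MLflow (the experiment name, parameter values, and metric here are placeholders):

```python
import mlflow

mlflow.set_experiment("bert-fine-tuning")  # hypothetical experiment name

with mlflow.start_run():
    # Record exactly what was tried, so results stay reproducible.
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("val_f1", 0.87)  # placeholder value
```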

We can also draw many lessons from the missteps and algorithmic malpractice that have occurred in data science in recent years:

As an example, interested parties need look no further than recommendation engines used for social engineering, discriminatory credit processes, and criminal justice systems that further entrench the status quo.

There’s good news: many smart and driven practitioners are striving to resolve these problems and prevent future erosion of public trust. Check out Google’s PAIR, Columbia’s FairTest, and IBM’s AI Explainability 360. Collaborations with researchers in the social sciences can also yield good results, such as the following project on auditing algorithms for discrimination.

When reporting modeling results (e.g., numbers representing accuracy, precision, recall, F1, etc.), data scientists must exercise special care to manage expectations. It can be beneficial to offer a degree of hand-waviness on a scale from “we are still working on the issue, and these metrics may change” to “this is the final product.”
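
For the metrics themselves, scikit-learn’s classification_report conveniently bundles precision, recall, and F1 in one place (the labels below are toy placeholders):

```python
from sklearn.metrics import classification_report

# Toy labels standing in for real model output.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```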

#6 — Connect data to business outcomes

Before starting a project, make sure the solution would actually be of value. The organization you work for is not paying you to spend time or money on something that doesn’t generate value for the business.
