tl;dr: Do a project you care about. Make it good and share it.
There’s a lot of interest in becoming a data scientist, and for good reasons: high impact, high job satisfaction, high salaries, high demand. A quick search yields a plethora of possible resources that could help — MOOCs, blogs, Quora answers to this exact question, books, Master’s programs, bootcamps, self-directed curricula, articles, forums and podcasts. Their quality is highly variable; some are excellent resources and programs, some are click-bait laundry lists. Since this is a relatively new role and there’s no universal agreement on what a data scientist does, it’s difficult for a beginner to know where to start, and it’s easy to get overwhelmed.
Many of these resources follow a common pattern: 1) here are the skills you need and 2) here is where you learn each of these. Learn Python from this link, R from this one; take a machine learning class and “brush up” on your linear algebra. Download the iris data set and train a classifier (“learn by doing!”). Install Spark and Hadoop. Don’t forget about deep learning — work your way through the TensorFlow tutorial (the one for ML beginners, so you can feel even worse about not understanding it). Buy that old orange Pattern Classification book to display on your desk after you give up two chapters in.
This makes sense; our educational institutions trained us to think that’s how you learn things. It might eventually work, too — but it’s an unnecessarily inefficient process. Some programs have capstone projects (often using curated, clean data sets with a clear purpose, which sounds good but isn’t). Many recognize there’s no substitute for ‘learning on the job’ — but how do you get that data science job in the first place?
Instead, I recommend building up a public portfolio of simple, but interesting projects. You will learn everything you need in the process, perhaps even using all the resources above. However, you will be highly motivated to do so and will retain most of that knowledge, instead of passively glossing over complex formulas and forgetting everything in a month. If getting a job as a data scientist is a priority, this portfolio will open many doors, and if your topic, findings or product are interesting to a broader audience, you’ll have more incoming recruiting calls than you can handle.
Here are the steps I recommend. They are optimized for maximizing your learning and your chances of landing a data job.
1. Pick a topic you’re passionate or curious about.
Cats, fitness, startups, politics, bees, education, human rights, heirloom tomatoes, labor markets. Research what datasets are available out there, or datasets you could create or obtain with minimal effort and expense. Perhaps you already work at a company that has unique data, or perhaps you can volunteer at a nonprofit that does. The goal is to answer interesting questions or build something cool in a week (it will take longer, but this will steer you towards something manageable).
Did you find enough to start digging in? Are you excited about the questions you could ask and curious about the answers? Could you combine this data with other datasets to produce original insights that others have not explored yet? Census data, zip-code or state level demographic data, weather and climate are popular choices. Are you giddy about getting started? If your answer is ‘meh’ or this feels like a chore already, start over with a different topic.
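Combining datasets usually comes down to a join on a shared key — a zip code, a state, a date. A minimal sketch with pandas, using entirely made-up example data (the column names and numbers here are hypothetical placeholders for whatever your project uses):

```python
import pandas as pd

# Hypothetical project data: activity levels by zip code (made-up numbers)
activity = pd.DataFrame({
    "zip": ["10001", "94103", "60601"],
    "avg_daily_steps": [7200, 8100, 6500],
})

# Hypothetical demographic data keyed by the same zip codes
demographics = pd.DataFrame({
    "zip": ["10001", "94103", "60601"],
    "median_income": [85000, 104000, 72000],
})

# Join on the shared key; the combined table is where original
# insights tend to come from
combined = activity.merge(demographics, on="zip", how="inner")
print(combined)
```

An inner join keeps only the zip codes present in both tables; if you want to see which rows fail to match (a common data-quality surprise), an outer join with `indicator=True` will show you.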
2. Write the tweet first.
(A 21st century, probabilistic take on the scientific method, inspired by Amazon’s “write the press release first” practice and, more broadly, the Lean Startup philosophy)
You’ll probably never actually tweet this, and you probably think tweets are a frivolous avenue to disseminate scientific findings. But it’s essential that you write 1-2 sentences about your (hypothetical) findings *before* you start. Be realistic (especially about being able to do this in a week) and optimistic (about actually having any findings, or them being interesting). Think of a likely scenario; it won’t be accurate (you can make things up at this point), but you’ll know if this is even worth pursuing.
Here are a few examples, with a conversational hook thrown in:
- “I used LinkedIn data to find out what makes entrepreneurs different — it turns out they’re older than you think, and they tend to major in physics but not in nursing or theology. I guess it’s hard to get VC funding to start your own religion.”
- “I used Jawbone data to see how weather affects activity levels — it turns out people in NY are less sensitive to weather variations than Californians. Do you think New Yorkers are tougher or just work out indoors?”
- “I combined BBC obituary data with Wikipedia entries to see if 2016 was as bad as we thought for celebrities.”
If your goal is to learn particular technologies or get a job, add them in.
- “I’ve used TensorFlow to automatically colorize and restore black and white photos. Made this giant collage for Grandma — best Christmas ever!”
Imagine yourself repeating this over and over at meetups and job interviews. Imagine this story in USA Today or the Wall Street Journal (without the exact technologies; a vague “algorithm” or “AI” will do). Are you boring yourself and having trouble explaining it, or do you feel proud and smart? If the answer is “meh”, repeat step 2 (and possibly 1) until you have 2-3 compelling ideas. Get feedback from others — does this sound interesting? Would you interview somebody who built this for a data job?
Remember, at this point you have not written any code or done any of the data work yet, beyond researching datasets and superficially understanding which technologies and tools are in demand and what they do, broadly speaking. It’s much easier to iterate at this stage. It sounds obvious, but people are eager to jump into a random tutorial or class to feel productive and soon sink months into a project that is going nowhere.
3. Do the work.
Explore the data. Clean it. Graph it. Repeat. Look at the top 10 most frequent values for each column. Study the outliers. Check the distributions. Group similar values if it’s too fragmented. Look for correlations and missing data. Try various clustering and classification algorithms. Debug. Learn why they worked or didn’t on your data. Build data pipelines on AWS if your data is big. Try various NLP libraries on your unstructured text data. Yes, you might learn Spark, numpy, pandas, nltk, matrix factorization and TensorFlow — not to check a box next to a laundry list, but because you *need* it to accomplish something you care about. Be a detective. Come up with new questions and unexpected directions. See if things make sense. Did you find a giant issue with how the data was collected? What if you bring in another data set? Ride the data wave. This should feel exciting and fun, with the occasional roadblock. Get help and feedback online, from Kaggle, from mentors if you have access to them, or from a buddy doing the same thing. If this does not feel like fun, go back to step 1. If the thought of that makes you hate life, reconsider being a data scientist: this is as fun as it gets, and you won’t be able to sustain the hard work and the 80% drudgery of a real data job if you don’t find this part energizing.
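The first few of those exploration steps — frequent values, missing data, distributions, outliers — take only a handful of pandas calls. A sketch on a tiny hypothetical table (substitute your own CSV for the inline data):

```python
import pandas as pd

# Hypothetical dataset; in practice: df = pd.read_csv("your_data.csv")
df = pd.DataFrame({
    "city": ["NY", "NY", "SF", "SF", "SF", None],
    "steps": [7000, 7500, 9000, 120000, 8800, 8100],  # note the outlier
})

# Top 10 most frequent values for each column
for col in df.columns:
    print(df[col].value_counts().head(10), "\n")

# Missing data per column
print(df.isna().sum())

# Basic distribution stats — the 120000 outlier shows up in max
print(df["steps"].describe())
```

Fifteen minutes of this kind of looking usually surfaces the surprises (a null city, a step count no human could produce) that shape the rest of the project.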
Write up your findings in simple language, with clean, compelling visualizations that are easy to grasp in seconds. You’ll learn several data viz tools in the process, which I highly recommend (it’s an underrated investment in your skills). Have a clean, interesting demo or video if you built a prototype. Technical details and code should be a link away. Send it around and get feedback. Working in public will hold you to a higher standard and will result in better-quality code, writing and visualizations.