Troy Sadkowsky: Tell us, what have you been up to (since you became a data scientist perhaps), what cool things are you doing now, and what are your big goals for the upcoming years?
Jake Porway: I actually only got into the official role of “data scientist” about a year ago, when I joined the New York Times R&D lab. My work before that was in machine learning, computer vision in particular, and I spent a lot of my time getting robots to identify landmines and fly planes. It was only once I got to NYT that I got to extend that work to broader data science tasks. I’ve primarily been focusing on a project at the labs called Project Cascade (http://nytlabs.com/projects/cascade.html), which tracks NYT links across social media. It’s a super cool project with one of the coolest datasets I’ve ever gotten to work with, and I get to do everything from exploring when @nytimes should tweet to finding the most influential people who tweet about politics. The lab’s a wonderful sandbox where we get to play with any projects related to data the Times has or that others create related to news and media. Going forward, I hope to keep building out new big data tools and algorithms with the Times while I build up my side project, Data Without Borders, which matches data scientists with social causes.
TS: Can you tell us the path you took to get to the data scientist role?
JP: I was originally very interested in artificial intelligence, though for most of my college career in computer science I thought of intelligence as sets of rules or heuristics to pull off little intelligent “tricks”. It was only when I took my first machine learning class my senior year that I realized where things were really heading. The ability to use vast amounts of data to have machines teach _themselves_ blew my mind. I went to do a Ph.D. in computer science but, thanks to some great foresight by my advisors, ended up in a Ph.D. program in Statistics at UCLA. It was really seeing the power of statistics combined with computing techniques that showed me how powerful data and data science could be. Add in a dash of luck that big data started to get big as I was coming out of the program and the rest is history.
TS: What are the core fundamentals that you would say are critical for performing your best at the data scientist role?
JP: Lots of people have weighed in on “what makes a data scientist” from across the web and there doesn’t seem to be a definitive answer yet. Just as the varied skills needed to build databases, write compilers, and do computer graphics eventually got united under “computer science”, the skills that make a data scientist are a wide array of individual techniques at present. To me, it boils down to three main things (that I think I’ve simply absorbed from other, smarter people’s answers) – 1) practical computing skills: you need to be able to write scripts to scrape data as well as code up the algorithms you come up with in your head; 2) statistical skills: you should know your basic stats (and more, ideally) if you’re going to really be able to assess whether the models you’re building or algorithms you’re writing are doing what you want; and 3) critical thinking skills: this one sounds obvious, but it really sets apart the hackers from the true scientists, for me. You’d be amazed at how many times I’ve seen someone build a model and report the results without having thought critically about where the data was coming from or whether their experiment was designed correctly. You must must MUST be able to question every step of your process and every number that you come up with.
TS: What are some common mistakes or misconceptions to watch out for when looking to become a data scientist?
JP: Hmm, not sure what big pitfalls come to mind offhand. Perhaps the only one I can think of is the idea that you need a Ph.D. to do good work. You don’t. It might help, but don’t think you have to go off and do another 5 years of school to be able to call yourself a “data scientist”. Perhaps the other is to simply remember that data science is still a relatively new term and to be aware that a data scientist at Foursquare is going to look a lot different from a data scientist at Goldman Sachs.
TS: If I were pursuing a career as a data scientist, what is the first thing I should do?
JP: Get on a data science project! Download some data, pick up some R, and start playing. I’m biased because I come from the stats world, but I’d say to focus on using something like R alongside a basic stats book to guide you through exploring some data. The machine learning + computing skills will come with that (of course this depends on your past experience – if you’re already a statistician, pick up some Python!). If there are local Meetups in your area, go and seek them out – being a part of the community is the fastest way to know what you don’t know. After that it’s just hunkering down with some books.