How to get into big data
A couple of weeks ago, I gave a talk at Philly Women in Tech on how to get into the field of big data to a diverse group of women both in tech fields and not.
The slides are here:
but I didn’t take my own advice from my last post and didn’t add any context to each slide so you could tell what was going on without having been at my talk.
Here’s a breakdown in a couple of bullets:
Slide 3: Data comes from every department in the organization. If you’re a data scientist or analyst, your goal is to make sense of the hundreds of disparate sources and translate them not only into information, but into actionable insights.
Slide 4: For example, let’s say you’re marketing Wonka Bars and you want to understand how to sell more of them. From your sales, you probably already have data points like weekly sales, the demographics of who buys the bars (much easier to get if you’re doing ecommerce), and which markets you should sell the bars in. And those are just three data points for one product in one department. Imagine how much data flows across the organization.
Slide 5: So why is it called big data? Because today, not only do we have those basic data points like markets and demographics, but we can also track purchases very closely, at the second-by-second level, and store all of that data for later analysis. The “big” is not only the size of the data, it’s also the scope.
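To make the jump from second-by-second tracking to a familiar data point like weekly sales a little more concrete, here’s a minimal sketch in plain Python. The purchase log, its timestamps, and the market names are all made up for illustration, and `weekly_sales` is just a hypothetical helper; a real pipeline would run this kind of rollup over far more rows than fit in a list.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical second-by-second purchase log: (timestamp, market, bars sold).
# These rows are invented for illustration only.
purchases = [
    ("2014-03-03 09:15:02", "Philadelphia", 2),
    ("2014-03-03 09:15:07", "Philadelphia", 1),
    ("2014-03-05 18:40:31", "New York", 4),
    ("2014-03-11 12:01:55", "Philadelphia", 3),
]


def weekly_sales(rows):
    """Sum bars sold per (ISO year, ISO week, market)."""
    totals = defaultdict(int)
    for stamp, market, bars in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        iso_year, iso_week, _ = when.isocalendar()
        totals[(iso_year, iso_week, market)] += bars
    return dict(totals)


summary = weekly_sales(purchases)
for (year, week, market), bars in sorted(summary.items()):
    print(f"{year}-W{week:02d} {market}: {bars} bars")
```

The fine-grained rows are the “scope” part of big data: the same log could just as easily be rolled up by hour, by day, or by customer instead of by week.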
Slide 6: So what kinds of careers can you have in big data? I tend to see it as a progression from less technical to more technical, from left to right, where the closer you are to being hands-on with the data, the more technical you need to be. Think of data as water flowing through a pipe. You need the water to get to a specific location so you can consume it (i.e., get the data to a specific place so you can analyze it). The people building the pipe are all the way to the left: the Java developers who build web services and set up Hadoop (a data analysis platform), ETL developers, and data architects who specify how the pipe should be built. Sometimes data scientists also take on some of this pipe-building role, depending on what kind of organization they work in.
Next, once you get the data to the right place, you need to analyze it. Here’s where the analysts and data scientists come in. There’s a lot of statistics, SQL, and R at this phase. Notice the Data Scientist is at both the less technical and the more technical ends of the spectrum. I’ll explain why in a couple of slides.
The final step is presenting all the data you’ve analyzed. The analyst/scientist usually does this through visualizations, especially if dealing with C-suite executives who need complicated concepts boiled down into concise, actionable language. These are the people taking actions and making decisions based on the data: How many servers should we buy? Where should we open the new plant? How should we structure our website?
Slide 7: Because there’s so much data, a whole industry has now sprung up around it, with lots of different providers. It’s easy to get lost in the landscape, but just imagine each of these providers on a different step of the spectrum in the last slide, with operational infrastructure and technologies being all the way on the left, business intelligence being in the middle, and analytics and visualization all the way to the right.
Slide 8: The same way that there’s a spectrum for roles, there’s also a spectrum of technologies used in the big data space. Think about the whole process of analyzing data as analogous to what you do with an Excel spreadsheet (which is terrible, by the way, because Excel is terrible). You get the spreadsheet from someone, do some analysis, and then present it. You can’t do that if the Excel file is too large. That’s where big data comes in.
If you work on capturing big data or moving it, you’ll be working primarily with high-level programming languages like Java, Scala, Python, Ruby, etc. Analysis usually happens in R, SQL, and, at healthcare companies, proprietary software like SAS. Presentation is usually done in Tableau or PowerPoint.
Slide 9: This one’s about how I got to where I am; the gist of it is that I’ve been working in data for my entire career and have gotten progressively more technical. But I’m interested in the intersection of tech and business, so in addition to learning tech I’m also currently working on my MBA. If you’re interested in learning more, shoot me an email.
Slide 10 - 12: So what do I do every day? I like to think of myself as Benedict Cumberbatch…the lone genius working on very complicated puzzles. But in reality, a lot of a data scientist’s daily work involves making sense of and cleaning data in order to get it ready for analysis. So my day-to-day involves a little of each of the three areas: programming, stats analysis, and domain expertise, i.e., knowing the business and the data associated with it. No one data scientist is good at all three; most will be very good at two. That’s why it’s good to have a team with complementary skills.
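If you’re wondering what “cleaning data” looks like in practice, here’s a small sketch in plain Python. The rows, the field names (`market`, `weekly_sales`), and the `clean_row` helper are all hypothetical; the point is only to show the kind of inconsistencies (stray whitespace, mixed capitalization, missing or unparseable numbers) that have to be sorted out before any analysis can happen.

```python
# Hypothetical raw rows as they might arrive from different source systems.
raw_rows = [
    {"market": " philadelphia ", "weekly_sales": "1,200"},
    {"market": "New York", "weekly_sales": "950"},
    {"market": "NEW YORK", "weekly_sales": ""},         # missing value
    {"market": "Philadelphia", "weekly_sales": "n/a"},  # unparseable value
]


def clean_row(row):
    """Return a normalized row, or None if the sales figure can't be parsed."""
    # Normalize whitespace and capitalization so the same market matches itself.
    market = row["market"].strip().title()
    # Strip thousands separators before checking for a whole number.
    sales_text = row["weekly_sales"].replace(",", "").strip()
    if not sales_text.isdigit():
        return None
    return {"market": market, "weekly_sales": int(sales_text)}


cleaned = [r for r in (clean_row(row) for row in raw_rows) if r is not None]
dropped = len(raw_rows) - len(cleaned)
print(f"kept {len(cleaned)} rows, dropped {dropped}")
```

Dropping bad rows is only one possible choice here. Depending on the business question, you might flag them or go back to the source system instead, which is where the domain expertise comes in.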
Slide 11: If you want to get into big data, you’ll need to be familiar with all the stuff on this slide. If you want a high-level overview, I recommend reading Data Tau every day (Hacker News also does the trick) and Googling stuff you’re not familiar with. O’Reilly’s Doing Data Science is also a great book. If you’re interested in stats, Khan Academy has great beginning lectures, and Learn Python the Hard Way is the hands-down best beginner’s guide to programming.
If you have any questions, feel free to email me!