Podcast: Real-time Data Science in a Streaming World with Dustin Garvey

June 19, 2020 in Eventador Streams Podcast




In this episode of the Eventador Streams podcast, Kenny and I got the chance to talk with Dustin Garvey, head of Machine Learning at Archipelago Analytics, about how data science and machine learning pipelines have changed—and should continue changing—with the growth of streaming data.

Dustin’s background working with ML pipelines at oil & gas companies, GE Global Research, Oracle and more led to a fascinating conversation about the struggles data scientists face getting access to the streaming data they need and why it’s so important that they get that access in this episode of the Eventador Streams podcast:

Eventador Streams · Real-time Data Science in a Streaming World with special guest Dustin Garvey

Want to make sure you never miss an episode? You can follow Eventador Streams on a variety of platforms including Soundcloud, Apple Podcasts, Google Play, Spotify, and more. Happy listening!

Episode 10 Transcript: Real-time Data Science in a Streaming World with Dustin Garvey

Leslie Denson: Have you ever dreamed of a world where all of your data is a stream? Kenny and I were joined by Dustin Garvey, head of machine learning at Archipelago Analytics, to chat about the current role of data streams in the data science world and just what the ideal future for those might look like in this episode of Eventador Streams, a podcast about All Things Streaming Data.

LD: Hey, everybody. Welcome back to another episode of Eventador Streams. Today, Kenny and I are joined by somebody we have talked to multiple times in the past just amongst us. So we’re really excited to have him on the podcast today and think that you guys are really going to enjoy this one. It is Dustin Garvey, who is currently the head of machine learning at Archipelago Analytics. So Dustin, welcome.

Dustin Garvey: Thank you. Thanks.

Kenny Gorman: Good morning. Welcome.

DG: Good morning.

LD: How’s it going today?

DG: It’s good, good, good considering.

LD: Good. We’re still in COVID times. Everybody’s still at home. Well, Dustin, thank you so much for joining us today. We really, really, really appreciate it. Why don’t we start off a little bit by you just giving a quick background on yourself, introduction to yourself to let everybody know who you are?

DG: Sure. Yeah. So name’s Dustin Garvey. I’ve worked in applied data science/machine learning for it’s going on a decade now. I’m a nuclear engineer by training. I took a first job in oil and gas, so went to an oil and gas services company called Baker Hughes. Spent three or so years doing analytics development for onboard tools. Basically, you drill a hole. You have this tool. You download its memory. You do analytics on that data that was captured to determine whether or not you want to use that tool again. So I did that for a couple years, and then spent a couple years at GE Global Research, where we … What’s nice about GE is that anything that spins, GE basically manufactures it. So we had a lot of research projects with wind turbines, oil pumps, trains, aircraft engines, so on, and so on.

DG: So I did that for a couple of years and then went to Oracle, where we did monitoring of enterprise cloud software, and then eventually made the transition to Archipelago, where it’s very much different use cases. Essentially, I’ve spent most of my career doing analytics about the monitoring and diagnostics of heavy equipment. It’s basically monitor all the data that comes off of those pieces of equipment, and make determinations about whether or not maintenance is required, and if maintenance is required, what type of maintenance should be performed, if that makes any sense.

KG: Yeah.

LD: Yeah.

KG: That’s very cool. Anything that spins. That just sounds fascinating. Your job is to work with anything that spins. Dang.

DG: Yeah, it was funny because I remember that day that I came into the office, and it was like, “Holy crap. It’s seriously everything that spins.” If you look at the portfolio, most of it is things that have little turbine blades. We had one research project with GE Aviation, and they walk you through the manufacturing plant. It’s fascinating because you see these little, bitty turbine blades. Some of the ones that are in where the combustion occurs in those aircraft engines, they’re as big as your thumb, but each one costs 12-plus-grand a piece because they’re very exotic materials.

KG: Yeah, like titanium or something, right?

DG: Yeah, yeah. They’re just fascinating. So you learn about these, all the details about those little blades. Then you go to GE Power, and they walk you through the manufacturing facility, and you’re like, “Huh. It’s the same thing except much, much bigger.” Yeah.

KG: Those are like hydro turbines for dams and generation of electricity, right?

DG: Oh, no. It’s mostly gas, gas turbines.

KG: Oh, okay.

DG: So it would be like liquid. It’s funny because GE, I’m sure, they make those too.

KG: Right.

DG: They do.

KG: They do spin.

DG: Yeah. They do spin, yeah.

KG: Interesting.

DG: Yeah, they make everything.

KG: That’s cool too because, right? Because there’s got to be something fascinating about traversing the physical, the blood-brain barrier between the physical/IoT world to a pure data application. That’s pretty cool.

DG: Yeah. It’s interesting though. The only downside was that there’s always… If you look at the literature and see all big data analytics, lots of data, lots, and lots, and lots like those sort of use cases, you never hit them. That was one of the reasons I went to Oracle because, with heavy equipment, you’re always constrained by sensors and bandwidth because those items, if they’re actually tethered to the ground, you can get a lot of data. But they often were installed so long ago that that wasn’t an option, right? Or disk space was expensive, or so on, and so forth.

KG: I was going to say that’s the problem with aircraft, right? It’s like only the thing that’s measuring what’s happening on the airplane is in the airplane. When it crashes, you’re like, “Hey, we should go find the data for this thing. Oh, it was also in the crash.” It’s like, “Crap, right? We didn’t want that to be in part of a crash.”

DG: It gets pretty tricky.

KG: That’s how it’s done. Yeah.

DG: Yeah. I mean when it comes to most of the data that’s used to do analytics on jet engines are… You’re probably going to hear my children in the background. The…

KG: Oh, do they know data science? Because that’d be great too.

DG: Yeah, I tried sitting my children down and saying, “Here’s a keyboard. Let’s see which one of you has aptitude.” I think it’s going to be my daughter. My son just threw it off the table. My daughter actually hit the keys, and then looked at the keys, and was like, “Oh, that’s pretty cool.”

DG: No, but for jet engines, it was kind of surprising when we did work with GE Aviation where it was like most of the data that’s used to make performance-based recommendations, as in you might have a blade that bent… And it’s these things that are not safety-critical. That’s the other flip side is we got exposure to all the stuff that they have to go through to certify a jet engine, and it is mind-blowing. It’s-

KG: Yeah, I can imagine.

DG: Yeah, it’s pretty substantial. Here, we’re talking about performance-based like it looked like a foreign object may have gone through the turbine blade. There’s a bent blade because it looks like the performance has dropped. Those assessments are made based on you get a snapshot of all the measurement at takeoff, cruise, and landing. When I was there, they were starting to expand, trying to get more high-frequency data, but it was kind of surprising to see that those sort of decisions are made… You can make them with that amount of data, but it’s sort of just you would have expected more.

KG: Right. It seems like there’s this future awaiting us where, finally, when the data starts streaming and flowing in a different way, that there’s going to be a whole world that unlocks.

DG: It’s going to be a whole world that unlocks, but probably… I don’t know who I was talking to, but there was some discussion about how… I think I brought up big data, and they were like, “Oh, that’s done.” I was like, “Oh, really?” I was like, “Big data’s not a thing anymore?”

KG: Oh, good. That just solves the problem. Perfect.

DG: Yeah, it was like, “Okay, cool. I missed it, so cool,” but-

LD: Glad to know that happened while I was sleeping.

DG: Yeah. Well, it was like everybody has the fear of missing out, right? The whole time, I’m like, “Well, I want to use big data, but our use cases never are that. The fleet size is 12,000 or you have 6,000 planes. The amount of data’s actually not that big, so you don’t need all these complicated things.” So you talk to colleagues and you’re like, “Oh, man. You guys are doing fun stuff with the-”

KG: I like that you’re sad that you missed big data. That’s the, quote, “Oh, I missed it.”

DG: Ha yeah.

KG: “Dammit. I wanted to be big data.”

DG: Yeah. I think it was in that discussion, they were talking about how a lot of companies built it, but they didn’t have a plan around what to do with it. So I have a feeling it’s going to be very similar for all these things where they get high-frequency data. You’re like, “You really got to have a plan. You might want to just think about it because it-”

KG: Oh, yeah. That’s a huge thing, right? It’s like people put data into these “data lakes,” but it’s file systems or databases or whatever. We’ve come across this. I mean it’s been years, and years, and years now where it’s like, “Well, what do you do with it?” “Oh, we run some reports,” or, “We’d really like to know what’s in there. We don’t even know what’s in there right now.” Being a data practitioner, to me, that is just like, “Oh my god. I cannot believe that the state of affairs is so that we stuff our data in closets, and we don’t open the door ever again.”

DG: Right.

KG: Yeah, agree. The next generation of data, we aren’t learning that lesson very fast. That’s true. As a collective group, using the data and making sense of it is something that is still very painful. That’s true.

DG: Well, it’s actually interesting because when you started talking through that, it reminded me of a lot of things that we kept hitting at pretty much every equipment company that I worked at. What’s interesting is that the infrastructure that’s built to stand up data is often extremely disconnected from whoever would use it. It’s just usually organizational split. One of the discussions that we kept hitting at those hardware vendors was that it costs money, and it takes design resources to put sensors into things. What was interesting is that the companies kept making decisions that were kind of just… If you were to look at the data analytics question, they made no sense. I think a major failure mode for locomotives is water leaks. One of the things we kept running into is certain models, they would remove the water pressure sensor and for legitimate reasons. You’re just like, “Eh.”

KG: Yeah.

LD: But why? Yeah.

DG: Yeah, I mean, it-

KG: I kind of needed that.

DG: Yeah. It’s really complicated because, in the hardware world, it’s they remove them because… I think I remember hearing one story about the sensor reliability and the harness reliability was much less than they expected. So they were spending a lot of money to maintain those, the sensors. It’s a fair point because if you detect a leak, there’s still, I think, it’s kilometers of pipe that you would have to inspect, so it’s like-

KG: Right. That’s totally believable.

DG: Yeah.

KG: Think about it. Think about it. There’s some maintenance guy in a rail yard, right? He’s been doing this for whatever many years. He knows when that thing, I don’t know, blinks or hisses or whatever the… It’s a very physical world. He’s seen a million of those things. He goes, “Oh, yeah. I know this thing fails. I’m just going to rip it out and replace it with…” I don’t know what you’d do, tape or whatever. Whatever he puts in its place. You, with your propeller hat, are back in the office saying, “Hey, something changed on this locomotive. All this model is now broken,” or it’s now coming up in your anomaly detection as…

DG: Yeah.

KG: It’s like-

DG: … sensor failure.

KG: Yeah. Fantastically interesting collision between a very physical world and a data-driven world.

DG: Yeah.

LD: Yeah.

DG: It’s funny. It continues to be disconnected though. I’ve heard a couple of use cases around just put a microphone on a piece of equipment or the door. So as the blah, blah, blah-drives by it… What’s funny is that it’s the same thought models are the things that manifest themselves in every other business. That’s the part that was kind of fascinating was you go… When I went from hardware to software, it’s very similar, but it’s slightly different. So a lot of it was around you couldn’t do certain things because this thing is collected in this group, and this thing is collected in this group, and they go. They’re actually measured in asynchronous… The timestamps don’t line up, and they’re actually collected in different groups. If they’re in different groups, they go through different pipelines. You’re like, “Well, do I need that?”
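The mismatch Dustin describes, where two groups collect related measurements on different clocks and through different pipelines, is often handled with an "as-of" alignment. The following is a minimal sketch under stated assumptions, not anything used at the companies mentioned: the function name `asof_join`, the `tolerance` parameter, and the sample pressure and temperature readings are all hypothetical, and both streams are assumed to be sorted by time.

```python
from bisect import bisect_right

def asof_join(left, right, tolerance):
    """Attach to each (time, value) in `left` the most recent `right` value
    whose timestamp is <= that time and no more than `tolerance` seconds older.
    Both inputs must be sorted by time. Unmatched rows get None."""
    right_times = [t for t, _ in right]
    joined = []
    for t, value in left:
        i = bisect_right(right_times, t) - 1  # index of the last right reading at or before t
        if i >= 0 and t - right_times[i] <= tolerance:
            joined.append((t, value, right[i][1]))
        else:
            joined.append((t, value, None))
    return joined

# Hypothetical readings as (seconds, value), collected by two different groups.
pressure = [(0.0, 101.2), (10.0, 101.5), (20.0, 99.8)]
temperature = [(1.5, 70.1), (9.0, 70.4), (35.0, 71.0)]

aligned = asof_join(pressure, temperature, tolerance=5.0)
for row in aligned:
    print(row)
```

Rows with no sufficiently recent partner are kept with `None` rather than dropped, so the gaps between the two pipelines stay visible to whoever uses the joined data.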

KG: Right.

DG: It’s sort of fascinating. That’s the part that, when I started with you all, that was very exciting was that a lot of these things, I don’t think the teams that stand up the infrastructure, they’re not aware of it because they’re solving other problems. I think those problems are equally, if not more valid, right? Because it’s like you have to stand up a product that does base capability, but then if you want to extend it via anything else, you just have to have access to what you’re collecting, right? If you don’t do it properly, you’re always going to be constrained.

DG: That’s the part when we started talking the first couple times that was very exciting was that you have a mechanism by which you can decouple the view on what’s available from the actual application because then you don’t have to worry about it first. You can worry about it second. If it comes to go time, and there’s a reason why it should be pushed lower, then, yeah, it’s a valid engineering… But if you don’t know, you don’t know, right? Then you get into this perpetual cycle of… It’s funny because I literally just had this conversation the other day. It was like, “This thing is meant to be a test too. We’re not going to productize anything until we’ve demonstrated it’s valuable.” Then you’re like, “Well, we can’t demonstrate it’s valuable unless you build it.”

KG: Right. You actually have to use real data. If you synthesize the data, then… You bring up a good point. In a lot of demos and stuff, we use a pretty simple piece of data, right? We’ll say, “Sensor name and temp.” It’s like, “Okay. That’s cool and super important, especially in aggregate.” If there’s millions of those things, it can be very, very interesting, but it’s almost like the something unit of data needs to have five times its size in context in metadata. The modern data payload is flying very fast, but the context for that piece of payload, while it’s a slowly changing dimension around that piece of data, right? It is super critical. It’s almost like the more important your data gets, the more metadata you should have about it, right? Or maybe conversely say, “The more metadata I have about it, the more I can prove that this data is valuable,” that kind of thing.

DG: That’s interesting. Is there a specific example that you might be able to talk through? Because I’m kind of curious about you’re talking about how it’s fully arrived.

KG: Well, I mean I’m just spit-balling here because you made me think of it. I’m not sure it’s a fully-formed thought. If you don’t know how a piece of data’s used, if we talk about event time versus capture time versus… There’s various timestamps you can use on a piece of data. In what relation does it have to your business?

DG: Oh, yeah, yeah, yeah.

KG: That kind of thing. That’s all I was thinking through. You mentioned the sensor. I think you mentioned things about data coming in in bulk, right? So at takeoff or whatever, and then the mid-flight. So you’ll get a payload or, even probably in that case, an aggregated piece of data, but the context around all that is almost more interesting than the actual values. That’s all I was musing about.

DG: Oh, no, no, no. Thanks for adding the example because that makes perfect sense. One concrete example that I’ve seen before is you would have this physical asset that moves and any time it comes to a certain type of shop, they will plug in this thing and do a full download. So you get high-frequency data, but if you don’t know all… If we’re just to talk about the timing because, most of that broad data, it’s like just relative time. So it would be one of those things where if you didn’t have all that information to be able to put it, you wouldn’t be able to build a timeline. You wouldn’t be able to actually align that with everything else, which is… Yeah, that’s a perfect example of that.
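To make the shop-download example concrete: if the device only records time relative to its own clock, the samples can be placed on a shared timeline only when the download carries two pieces of metadata, the wall-clock time of the download and the device clock reading at that moment. The sketch below assumes exactly that; the function `to_absolute`, its parameters, and the sample values are hypothetical and not taken from any system discussed in the episode.

```python
from datetime import datetime, timedelta, timezone

def to_absolute(samples, download_time, device_clock_at_download):
    """Convert (relative_seconds, value) samples to (absolute_datetime, value).

    `download_time` is the wall-clock time of the shop download and
    `device_clock_at_download` is the device's relative clock at that moment.
    Without both pieces of metadata the samples cannot be anchored."""
    return [
        (download_time - timedelta(seconds=device_clock_at_download - offset), value)
        for offset, value in samples
    ]

# Hypothetical high-frequency samples, timestamped in seconds since power-on.
samples = [(0.0, 0.12), (0.5, 0.15), (1.0, 0.11)]
download_time = datetime(2020, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

timeline = to_absolute(samples, download_time, device_clock_at_download=3600.0)
for stamp, value in timeline:
    print(stamp.isoformat(), value)
```

Once the samples carry absolute timestamps, they can be lined up against maintenance records or other streams that were never on the device clock.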

KG: Right. Time is only one dimension, right? There’s other dimensions, right? Yeah, exactly. That’s kind of where I was going with that.

DG: That’s actually interesting that you also bring this up too. That’s one aspect I’ve never actually thought much about because most of the stuff that’s been put that I’ve seen, it’s been all about naming. So there’s some code in the name of every attribute or tag, right?

KG: Right. Underscores are amazing for that.

DG: Yeah. Yeah, you need to have a decoder written to be able to say whether or not this is…

KG: Right, right.

DG: Well, I mean but it affects things because it’s like if you were to just look at a nuclear power plant in the US, it’s two loops. So it’s primary loop, secondary loop, right? The mechanics of the primary loop are where all the heat is being generated and eventually transferred to the secondary loop, but if you don’t actually have the metadata to be able to actually determine that you’re looking at the… The mechanics of those sequences are very different. It’s interesting.

KG: Right. Very cool. I’m not sure we figured anything out here, but we’ve-

DG: No, no, it’s funny though because that would be fun if you didn’t have to do this metadata every single project where you’re like, “What does this underscore mean? What do you mean by -1A?”
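The "decoder" for metadata packed into tag names can be pictured as a small parser. This is only a sketch: the naming convention `<LOOP>_<MEASUREMENT>_<UNIT>-<CHANNEL>`, the `LOOPS` lookup table, and the example tag `PRI_TEMP_DEGC-1A` are invented for illustration and do not come from any real plant or vendor.

```python
from dataclasses import dataclass

# Hypothetical lookup from a loop code embedded in the tag to a readable name.
LOOPS = {"PRI": "primary loop", "SEC": "secondary loop"}

@dataclass(frozen=True)
class TagInfo:
    loop: str
    measurement: str
    unit: str
    channel: str

def decode_tag(tag):
    """Decode a tag shaped like 'PRI_TEMP_DEGC-1A' into explicit metadata.
    Raises ValueError for anything that does not match the assumed convention."""
    body, sep, channel = tag.partition("-")
    parts = body.split("_")
    if not sep or len(parts) != 3 or parts[0] not in LOOPS:
        raise ValueError(f"unrecognized tag: {tag!r}")
    loop_code, measurement, unit = parts
    return TagInfo(LOOPS[loop_code], measurement.lower(), unit.lower(), channel)

info = decode_tag("PRI_TEMP_DEGC-1A")
print(info)
```

Failing loudly on tags that do not fit the convention is a deliberate choice here: silently guessing is how a secondary-loop signal ends up in a primary-loop model.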

KG: Right. When it starts with a Z, be really careful because that data whatever.

DG: I actually hit that. I remember looking at your accelerometer data, and it was like, “If you hit this, then you don’t want to use that.” I’m like, “Okay. Well, then what would I use? What about this one?” “No, no, no. You don’t want to use that because it’s blah, blah, blah. You want to use this one.” It’s one out of 20. You’re like, “If I didn’t know you, I would be screwed.”

KG: Right, right.

DG: Which is interesting because I think the last time we spoke, I think I had a lot of questions. I don’t know if I actually asked a lot of questions, but a lot of questions about what’s the mechanism by which schemas are exposed and if any documentation is exposed along with them because that usually is problem one. It’s funny because I think the first time we spoke, we spoke about how if you wanted to do analytics on data in a production environment, it’s kind of like a mess, right? You have to have the magic keys to everything and a developer resource to help you do the hookup because you can’t know everything, right?

KG: Right.

DG: The other part that just popped into my head is that often when you do these analytics use cases, part one is getting the data. I don’t want to say it’s equally as difficult, but there’s a significant amount of work required to understand what’s there. Have you all experienced that much or anything like what we’ve been…

KG: I think that’s been a thing all along, right? In an enterprise setting, one of the things that we found and, obviously, one of the reasons we built our product the way we did is that we see consistently that folks are trying to make sense of data. The contextual part of where’s it coming from, who owns it, what generated it, and it’s like, “Okay. We’re pulling this from S3, but that was processed by a Python from Logstash. Okay, okay. This is coming directly from clickstream Analytics, and it’s being dumped into Kafka. Okay. This is the historical database that lives on, I don’t know, Postgres or something. That’s where this comes from. Your job, if you choose to accept it, is super simple. All you got to do is join all those things together and make sense of it.”

KG: It’s sort of that’s a huge part of and has been a huge part of BI. I mean that’s why ETL even exists, right? In the streaming world, you can’t really ETL it because Kafka basically says, “There’s a new way to ETL. Sorry, folks, you can’t do an ETL in large batches. It’s got to be-”

DG: Start over, yeah.

KG: “Yeah, it’s got to be in a streaming context.” That kind of broke a lot of the historical stuff. I speak 10 years ago now or whatever. Yeah, so a huge part of everybody’s day presumably is how do I make sense of this data. Once I get this thing figured out, then don’t touch it because this is this magic thing that I figured out. It took me a month, and 10 people, and 100 meetings, that kind of thing.

DG: It was funny because while you were talking, it sort of made me start thinking about one thing that… I’ve worked on many projects, and a lot of them are of the… It’s funny because if you meet a bunch of data scientists, they kind of come in two flavors. Most of the time they come in the flavor where they sit down and be like, “We will not start until we understand fully the picture.” They get their pad of paper out. Then you end up having… You do the whole project, understanding the data world, right?

KG: Right.

DG: Then the other side, which I resonated very nicely with, was kind of like the bull in the china shop, which is, “Shut up. Give me the database. Tell me what table to start with. Don’t tell me anything. I’m just going to go in, and I’ll come back to you. In the next couple months, you will continue to tell me that I’m full of shit, that I don’t know what I’m doing, but that’s fine. That’s the point. I want you to correct me.” It’s funny because it makes me even think that if you have unfettered access to some things, and you just say… It’s like a culture thing too. It’s like just put it together, and then find the right people to verify, right? Is anything I’m saying making sense or are you all-

KG: Oh, totally. It’s super-

DG: Yeah.

KG: Totally. It’s super interesting because even going back into my Oracle days, data practitioners tend to be tribal because when you think… This has nothing to do with data science, so I apologize.

DG: No worries.

KG: People who work with data get tribal about it because they go, “Look. It took me a while to figure this out. Once I figured it out, now, it’s mine and my small little group, and we know about this stuff. Come to us, and we’ll tell you all about it, but we’re not going to socialize it broadly because it took us time and energy to figure it out, and, now, we’re the experts.” Especially in big corporations, people own these silos. It’s not like you’d think from the outside looking in, “Hey, I want to democratize this amongst the entire team. Wouldn’t that be great?” It’s really kind of the opposite. A lot of folks who’ve put in the energy to understand it are kind of like, “Come to me, and I’ll tell you what’s up. Then I’ll maybe give you a data feed. Maybe then you can make a product out of it or make a model off it or whatever.”

DG: Right. No, I am with you totally because it’s like that. What’s funny is you get into this… it’s kind of like a nasty… It’s funny because humans can be amazing, but also kind of vicious at the same time. You get into the perpetual gotchas like prove your worth, and then you can have access to this wonderful product that I have constructed.

KG: It’s a problem that plagues us all, right, Dustin? Because if we can’t make sense of the data as a group, right? At least my background is providing guys with data. Just to make a hypothetical team here, if you’re trying to perform some sort of advanced mathematics on a model, and I can’t get you the data for it, then cool, I’m glad I know how the data works. But I’ve blocked you from being successful for the company. It seems like as a group, if we can’t get that figured out as a… By a group, I meant the entire data community. Then I think that’s going to be a major roadblock for how to go forward. It seems like that should be solvable. We have all this data lake. It’s just in there. Yeah, but pulling it out requires specialized knowledge is really where I was going.

DG: Yeah. Yeah, that’s the part that, I mean, when you shared what you all were building, it started to resonate because… It was funny because also what I’ve sort of been thinking is that notion where data’s data, right? When it comes to usefulness, it’s mainly can you take it and apply it to a use case? It has to make basic sense, right? You shouldn’t be pulling something from sales if you’re doing something about monitoring. Those things are not connected at all. While you were talking, it also made me… because you were talking about the notion where you have compartmentalized knowledge, right?

KG: Right, right.

DG: What’s interesting is that I’ve seen a lot of scenarios where teams get… They get blocked because they don’t know the right people that get them access to that stuff or they have paranoia that if they don’t have access to the definition of how this data thing was created, then they’re going to be wrong, right?

KG: Interesting. Interesting.

DG: As things scale though, you can imagine a world where, and I think we’re already kind of there is like in some entities that there are so many of those that if you were to have complete knowledge, as in be able to go all the way down, and understand where something came from, and go all the way back up, you probably would shoot yourself in the foot. You wouldn’t be able to get to a POC, right? Because you’d just be terrified that you’d do something silly, but what’s interesting is that if you had the ability just to say, “Here it is. Here’s the general… This comes from this entity. This is what this stream or whatever topic is roughly.”

DG: Then you have the analytics use case of… There’s always going to be calibration. To imagine that you’re not going to get it, you’re just going to do stupid stuff in the first couple of iterations is just kind of silly because you always do that, right? You’ll have a couple of iterations to stabilize with respect to this is the use case we want to solve, and this is actually where you pull this data from, not this column because that’s that, not this column because that’s that, not this attribute because this is this, and blah, so on and so forth. It sort of makes me kind of curious about maybe the solution is just to relax a little bit. So if you’re able to demonstrate-

KG: Right.

DG: Yeah, I mean because if you can do math and statistics and you have firm grasp of your use case… Usually, if you have a firm grasp of the use case, you have a firm grasp of how you should measure it yourself, right? I’ll just give you one example is we had this one use case ranking. Let’s imagine you have a fleet of 100 devices, right? You want to pick the best one for the next job. Once you’re able to crystallize that use case, you have the ability to start calculating metrics. So you’d say, “What’s the probability of having one in the top 10? And it does not experience if it were in the next deployment,” right? So you would be able to capture that high-level statistic. You could see how that high-level statistic is connected to where you pull the data from. What’s interesting is that it makes me kind of curious if we’re stepping into a world where you don’t have to be so worried about where it comes from, as long as you can demonstrate that the thing is sufficiently…
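One way to read the ranking example is as a top-k success rate: rank the fleet by a model score, take the top k, and measure how many of those devices got through their next deployment without a problem. The transcript does not spell out the outcome being measured, so the `succeeded` flag below is an assumption, as are the function name `top_k_success_rate`, the device names, and the scores. This is a sketch of the idea, not the metric Dustin's team actually used.

```python
def top_k_success_rate(scores, succeeded, k=10):
    """Fraction of the k highest-scored devices whose next deployment succeeded.

    `scores` maps device id -> model score (higher is better).
    `succeeded` maps device id -> bool outcome of the next deployment
    (what counts as success is an assumption of this sketch)."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of devices")
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(succeeded[device] for device in ranked) / k

# Hypothetical five-device fleet; a real fleet in the example would have 100.
scores = {"dev-a": 0.9, "dev-b": 0.8, "dev-c": 0.4, "dev-d": 0.7, "dev-e": 0.1}
succeeded = {"dev-a": True, "dev-b": False, "dev-c": True, "dev-d": True, "dev-e": False}

rate = top_k_success_rate(scores, succeeded, k=3)
print(f"top-3 success rate: {rate:.2f}")
```

Tracking a single number like this across iterations gives a direct way to see whether swapping one data source for another helps or hurts the use case, which is the connection Dustin draws between the high-level statistic and where the data is pulled from.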

KG: Right. It essentially puts more of an onus on you to test your model, train your model, and validate and monitor your model. I mean maybe it’s not a pure machine learning model thing you’re talking about, but monitoring thing is a huge thing that keeps coming up when we talk to data scientists. I hear you putting an underscore on that part of it is what I’m gathering-

DG: Yeah. Yeah, I think I would take it a little past the validation because it sort of makes me think that it’s a set of practices. It’s a culture of we’re-

KG: Interesting.

DG: … not going to… Instead of worrying about getting access to very detailed things, it’s like, “Just stop it. Just normalize. Provide a really simple way to get access to the incoming streams, a simple way to get access to the data warehouse across the different lines of businesses.” Then just say, “Do a practice. Take a time to assert, assess the statistical viability of all the features that you’ve used.” I don’t know if that’s been something that… I know I have never done it, but it’s funny because just this discussion made me start thinking about that. Because everywhere else I’ve been, it literally is like walking through a minefield because there’s like 20 people there that are ready to poke their fingers into you and say, “Gotcha.”

KG: Yeah. It’s interesting. We see that. In my other previous lives and I think even now, we kind of see these two camps. We see the camp where there’s a really robust data team or data entity within a company. That team has categorized, and cataloged, and built a schema that means something for the business, and that’s great. Presumably, it has the appropriate business metadata to go with it. So you know that when a person buys something on a shopping cart, that’s a buy or you know if you do an experiment, real-time experiments for instance, that when they do X or Y, that means that’s a bad thing for the business metrics. So we want to not use that model. You have to have context of the business for that.

KG: So we’ve seen two camps, right? We’ve seen the rigid schema approach, and that has its pluses because you tend to have a data dictionary, so to speak, that you can speak to. But the downside of that is that it’s rigid. So if you want to add a field to that, it’s a six-month process. All the approvers, and the architectural committees, and everybody has to figure it out. You just can’t slap in a new clickstream and say, “Look, our project went live in two weeks because the data was laying there.”

KG: The flip side of that is that you have a very unstructured business. Sometimes that’s unstructured data, and sometimes it’s schemas that are unknown. The plus of that is I can just get going, and make things, and fail fast, and be nimble. The downside of that is I’m not sure it’s 100% accurate, that that’s actually a click or a buy or whatever. Maybe you know, and maybe it’s tribal and Bob or Karen down in the engineering department emailed you and said, “Yeah, that’s the field for when you buy,” or whatever. So you have a data confidence about that, and you build your thing, but there’s really two camps there that… You’re making me think through the profiles that we see. I think it’s either you’re highly structured, and you’ve kind of put a lot of process around it, and you’re probably slow or it’s kind of the Wild West, and then you need this decoder ring, to kind of tie it back to what you were saying earlier.

DG: Have you all hit an entity that had both, that had one…

KG: Yeah, they hate each other like it’s a warring faction.

DG: Okay.

KG: I have. I worked at Shutterfly, and if you’re not familiar, they make cool photo books. They’re really cool. I went there because it was-

DG: Yeah, we have a couple of them with the kids, yeah.

KG: Oh, yeah. Right, that’s the common… yeah. I really liked it because we had these high-speed printers that were going, I don’t know, 50 miles an hour. They make these prints really fast. It’s this big, physical, 100-foot-long printer. It’s amazing. Data’s a huge part of that business. I was really excited to go there. We had data camps. We had the rigid Oracle database and excellent folks around that, good DBAs. They knew their craft, but their world was Oracle. And we had the new-school feeds. I brought Mongo in there. We used Mongo there because we needed more nimble projects, and we needed to be able to go from A to B without alter tables, set column, and then wait a month for that to happen, then index it, and then all the approvals-

DG: And you forget. Mine always comes back to by the time it gets through the process, you’ve forgotten why. It’s like, “I actually don’t remember.”

KG: Right. What were we building in the first place? That’s totally true because you get to the end, and you’re like, “The business changed. The business now wants to do X.” It’s like, “Well, we’re halfway through doing the last thing you wanted.” Do you see that from the data science side? You’re looking into this data cloud going, “Really, guys? Is this really what it is?”

DG: No, it’s actually kind of-

KG: Is it that way for you guys?

DG: Yeah. What’s interesting is that I’ve been fortunate enough to live on both sides of the spectrum with respect to placement because, usually, it’s org placement. It’s where are you within the organization.

KG: Oh.

DG: So I’ve been on one side where it’s full-on R&D. When I was at GE, it was… that’s one of the reasons I left is you’re three or four level… I don’t want to say levels, but you’re three or four handshakes away from product, right? So you’re going to be interacting with somebody that is aware of the business, but isn’t connected to actual product that gets developed and deployed, or even if you’re connected to that, you’re still a handful of steps away from the scrum teams. So it’s like you always end up operating at this left side where it’s like everything’s given to you, right?

KG: Right, right.

DG: You’re not really tightly coupled. Then when you go to the other side-

KG: So is it pre-curated like some data engineering group is curating what you’re going to see? Is that how it looks?

DG: Somebody along the chain should be curating it because you’d rarely get access to a data feed because it’s-

KG: Oh, so you’re not in a Notebook, and you’re not using SQL and Pandas, and pulling in your… They’re just like, “Hey, here’s the database. No problem.”

DG: No.

KG: Okay, got it.

DG: No. I think it was like the whole time I was there, I think maybe I had one relationship where it got to that level, and I think the only reason it was that was because we had worked together for two or three years. It was just-

KG: You had to meet offsite. It was all cloak and dagger. They gave you a slip of paper that was like-

DG: Shh. Shh.

KG: Yeah. The production login.

DG: It never happened. It was always… I think it was the backup of the data warehouse, so it was always removed. What’s interesting is that I’ve also been at the other side where you’re deep in a product, and you run into what you’re talking about. It’s like everything moves slow because you’re in this structured environment. What’s been interesting is that I’ve been fortunate also to work with just kind of as advising roles, talking to different entities where you’re kind of in the middle, but what usually will happen is those data science teams tend to get pulled towards the structure because there’s this thought that because you write code, you’re an engineer, right? So you get pulled deep into the bowels of the beast.

DG: I don’t want to say bowels of the beast, but it feels like being pulled into the bowels of the beast as a data scientist because it’s like I sit in meetings, and I’m like, “It literally feels like I’m in…” Because I lived in Germany for a little while, it feels like when I would sit in a room with a bunch of Germans, and I’m like, “I understand maybe 10% of the words you’re saying. Oh, I don’t have anything to add to this conversation.”

KG: Right. Well, we feel that way when we look at data scientists too, so it’s-

DG: Oh, okay.

KG: Yeah. Right back at you.

DG: Yeah. It’s like a weird first date where you spend the whole time just like, “Uh.” What’s been interesting is even just this discussion, it’s made me start thinking about I think that all of those entities exist for a reason. Things are extremely structured, move slow. That’s fine. Then you have sort of more flexible environments to allow you to do other things. It makes me kind of curious if long-term, everybody will eventually… Maybe when data goes to rest, it goes to rest in a very structured way.

KG: It dies. It dies at rest.

DG: It dies.

KG: Is that what we figured out today? Put it to rest?

DG: Oh, no, no, no. I’m actually serious like-

KG: Rest in peace.

DG: Yeah, yeah, yeah. There’s part of me that’s curious about that. Does it make sense for us as humans to just say, “We don’t need to retain everything.” What if we just had streams everywhere and made it… I remember when we were talking, it was like, because to me as a data scientist, it’s like, “Give me access to all the streams and a domain expert to help guide me, right? And allow me a very easy way to publish back.”

DG: I think it was one of the questions that Leslie shared to set up for this. It was like, “What’s best practice?” What’s been kind of fascinating is as I get older… I used to think you need a job system. You need these tools. I can tell you the tools are A, B, C, D, E, and F. It was really not really. You just need access to data. You need the ability to run something, either triggered by time or an event and to publish results in a way that can be consumed somewhere else. You can imagine a scenario where you don’t need really any infrastructure. You could schedule a Python job to do that, right? Or a Java job or whatever.
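
[Editor’s note: The three needs Dustin lists here (access to data, a way to run something on a timer or on an event, and a way to publish results that something else can consume) can be sketched with nothing but the Python standard library. This is a minimal illustration, not anything described in the episode: the function names, the `type` field, the `buy` trigger, and the JSON-lines output format are all assumptions made for the example.]

```python
import io
import json
import time
from collections import Counter
from typing import Callable, Iterable, List, TextIO


def summarize(events: Iterable[dict]) -> dict:
    """Count events by their 'type' field (a hypothetical field name)."""
    counts = Counter(e.get("type", "unknown") for e in events)
    return {"total": sum(counts.values()), "by_type": dict(counts)}


def publish(result: dict, sink: TextIO) -> None:
    """Write one result as a JSON line so another process can consume it."""
    sink.write(json.dumps(result, sort_keys=True) + "\n")


def run_on_schedule(read_batch: Callable[[], List[dict]], sink: TextIO,
                    runs: int, interval_s: float = 0.0) -> None:
    """Time-triggered: read a batch, summarize it, publish, then wait."""
    for i in range(runs):
        publish(summarize(read_batch()), sink)
        if i < runs - 1:
            time.sleep(interval_s)


def run_on_event(stream: Iterable[dict], sink: TextIO,
                 trigger_type: str = "buy") -> None:
    """Event-triggered: publish a running summary whenever a trigger event arrives."""
    seen: List[dict] = []
    for event in stream:
        seen.append(event)
        if event.get("type") == trigger_type:
            publish(summarize(seen), sink)


if __name__ == "__main__":
    # Stand-in data; a real job would read from a stream or a database.
    events = [{"type": "click"}, {"type": "buy"}, {"type": "click"}]
    out = io.StringIO()
    run_on_schedule(lambda: events, out, runs=1)
    run_on_event(iter(events), out)
    print(out.getvalue(), end="")
```

[The in-memory list and `io.StringIO` sink stand in for a real data feed and a real destination. Swapping either one out would leave the trigger-and-publish shape unchanged.]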

KG: Or if you’re really sophisticated, you can actually send a message to a job to do it, right?

DG: Yeah, yeah. I mean it’s like but you can imagine a scenario where… I’ve been exposed to some sort of Fortune 500 companies where the data scientists are at such a high level, they’re not tied to any particular product. They’re looking across the business, fishing for use cases, right?

KG: Right, right, right, right.

DG: And there is no data lake or the data lake… This is the one thing I’ve also been hit a couple times is you talk to people that are in those entities, and they’re like, “Well, the data lake’s awesome. You should use it.” It took four years to build and, usually, there’s another data… I’m rambling at this point.

KG: No, I think you’re right. You know what’s interesting-

LD: It’s all good.

KG: Well, just to share a little bit to the listeners because you mentioned the list that Leslie put together. I’m just going to expose you, Leslie. Sorry. She puts together these well-thought-out, super well-researched bullet points and everything that we can all talk from. You, I guess, listeners going forward know that now. What’s funny is we’ve ignored every one of them so far, and we’ve just been talking.

LD: Those make for the best podcast episodes. I’m not going to complain.

KG: Okay. All right. I just want to make sure we basically give Leslie a shout out because she does all this hard work to prep us, and Dustin and I said, “Hey, turbines are cool.”

LD: “Forget that.”

KG: Yeah.

LD: It’s all good.

DG: Yeah. Well, it’s funny because I’m just reflecting now, and I think I’m in a flow zone where I think I might have transcended reality. Then there are no databases evermore. There’s just streams, and a unified view, and an easy way to fish and publish and prototype…

KG: Yeah, dude, we put data lakes to bed. So those are gone. We’ve decided that water pressure sensors matter. So that’s another thing that we have to think about. We figured lots of things out.

DG: Dude, I think that shit is good. Yeah, this is good.

LD: With what you guys are talking about and with this idea because it’s been a fascinating conversation, we kind of know, and we’ve talked a little bit about what has evolved in the last several years. We’ve talked a little bit about where we think this is going. Dustin, for you, my closing question for everybody, taking that one step further, what is it that you are excited about for the future of machine learning with the idea that as we say internally, all data is a stream at this point, and knowing that that’s coming? What is it that keeps you really excited about this space?

DG: It’s kind of interesting because the conversation today I’ve had with you all has been extremely engaging. The main aspect has been there’s often… It’s a lot of cultural issues. What’s interesting is if you were to rewind maybe 10 years ago, it was we had data, but it wasn’t all that much. Now, everything is transitioning to a data-rich world. Each one of those entities will be collecting their own, and you know that there are use cases that span, right? I think what’s interesting is that data is becoming more available, right? The one part that is going to be an issue is that complexity, right? You have it coming from multiple things, right? That’s the one part.

DG: Honestly, I think it’s like that’s the one part I haven’t really seen any movement on because I think it’s you all have been the first entity that I’ve talked to where there was some consideration of that complexity. Because data exists, but if there’s no easy way to get to it, and use it, and prototype… Keyword this is I need a scratchpad first, right? I need access and a scratchpad to be able to put something on the plate such that we can have a discussion about it and most of the time, those early discussions are going to be the other entity telling me that stuff is wrong, which is fine. That’s the beautiful arc of these things. Then incrementally, you iterate towards an analytic, and use case, and everything that makes sense.

DG: That really has been the part that’s been very engaging about talking with you all is that there is some energy being placed to that really big, big, hairy, messy part because, up to now, it’s been a human thing. It’s a squishy, human thing where the only way that you can make traction on it is if you have the skill set or are connected with a project owner that has the skill set that will do the squishy thing, which means walk around and talk to anybody until, eventually, you find the person that has the DB key, right? Or the credentials or knows where it is. That’s a part that I wasn’t aware of until just now that’s very exciting.

DG: If we were to talk in more general, it’s just that, right now, the amount of tools that are being made available for free is just kind of… It’s humbling. I don’t remember. Somebody asked me to make slides and say, “Can you give us a history of data science?” I was like, “Okay. That’d be curious.” So I went back and I took neural nets and stuff in college. They don’t use any of that stuff anymore. None of it and none of the stuff that I used in grad school and the first majority of my career is being used by anyone anymore, which is fascinating. The power of tools and the computational power to use those tools is just amazing these days.

DG: In my brain, it’s like you bring those two together, and I think one item that I think just has come up in this discussion is just that there’s the functionality. So the functionality of being able to get access to… the ease of access to more data, the fact that it exists. That part has been really energizing talking to you all. Is that if that part can be solved, then that’s the big, huge part of it. The other part is you have better tools these days.

DG: One part that I am curious to see how it gets figured out is the squishy part, which is where does data science fit in an org because I’ve been on… It’s been this weird yo-yo going back and forth. I think some people might think I just have some brain issues, like I’m perpetually paranoid about getting too close to product, and not too close to product, but I don’t want to be close. I don’t want to be too close. Don’t pull it, right? I would be curious to see how that gets worked out.

DG: I mean really, it’s been data, getting access to data. That is the thing that feeds everything else. Having that, in some ways, a bit more available because it’s like even just now as I’m talking, I’m sort of imagining a day, where when data systems are stood up or corporate data systems are stood up, there is attention paid to transparency and ease of access across the company. I know that you would have a security issue right there. Anybody that listens to this and is like, “Where’s the security? How it would be…” Uh-uh (negative).

KG: Yeah, right.

DG: You can imagine if you can’t do that then… but you have to. We have to because that really is the only way to make use of… Otherwise, it’s going to stay in one place, right? That’s going to be-

KG: It’s going to be yet another data lake type thing where you just aren’t seeing the value from the immense investment.

DG: Yeah. You build it, and then don’t come.

LD: Dustin, thank you so much for taking time today. This has been a very fun conversation. Again, this is-

DG: For sure.

LD: Topics like this are ones that are fascinating to me… I mean I would hope they’re fascinating to me considering what I do for a living, but they are super fascinating to me, and I know that they are fascinating to the listeners. So it’s always nice to also hear a new and different perspective and from the machine learning side. That’s one that we really enjoy, so thank you so much.

DG: Of course. Thank you for having me. I’m very curious to hear… I perpetually am looking at my inbox. I’m like, “They got any information? I’m curious how they… What’s going on?”

LD: Oh, we can make sure you have that information.

LD: Thanks so much.

DG: Thank you so much.

LD: The idea of giving data science and analytics teams plug and play access to all needed data in real time and as a stream is an idea that I can only see growing in popularity, especially as businesses need to react faster and faster to more and more data. A huge thank you to Dustin for joining us today. We always enjoy chatting with him, and we hope you enjoyed listening, and just maybe you learned a little something along the way just like we always do. As always, you can connect with us on Twitter @EventadorLabs or on LinkedIn or you can always reach out to us at hello@eventador.io. Happy streaming.
