078 - Machine teaching with Dr. Patrice Simard
Microsoft Research Podcast

Full episode transcript -


A lot of people have thought that the key to AI is the learning algorithm, and I actually don't believe it's the learning algorithm. I think teaching is what makes the difference. So from the philosophical standpoint, I believe that the machine learning algorithm is almost the easy part. It's the part that you can locally optimize. Teaching is the part that you have to optimize at a global level, at a societal level, and I think that may actually be the key to AI, the same way it was the key to human development.


You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga. Machine learning is a powerful tool that enables computers to learn by observing the world, recognizing patterns and self-training via experience, much like humans. But while machines perform well when they can extract knowledge from large amounts of labeled data, their learning outcomes remain vastly inferior to humans when data is limited. That's why Dr. Patrice Simard, distinguished engineer and head of the machine teaching group at Microsoft, is using actual teachers to help machines learn and enable them to extract knowledge from humans rather than just data. Today, Dr. Simard tells us why he believes any task you can teach to a human you should be able to teach to a machine, explains how machines can exploit the human ability to decompose and explain concepts to train ML models more efficiently and less expensively, and gives us an innovative vision of how, when a human teacher and a machine learning model work together in a real-time interactive process, domain experts can leverage the power of machine learning without machine learning expertise. That, and much more, on this episode of the Microsoft Research Podcast. Patrice Simard, welcome to the podcast.


Thank you. This is a pleasure to be here.


I have to start a little differently than I normally do because you and I are talking at a literal transition point for you. Until recently, I would have introduced you as distinguished engineer, research manager and deputy managing director of Microsoft Research. But you're just moving, along with a stellar team, from Microsoft Research to Microsoft Office, right? Yes. Well, we're gonna talk about that in a minute. But first, I always like to get a general sense of what my guests do for a living, and why, sort of in broad strokes. What are the problems you're trying to solve in general, and what gets you up in the morning and makes you want to come to work?


I want to do innovation. I think this is where I have my background and talent. So I am completely irreverent to the established wisdom, and I like to go and try to solve problems. And since I want to change things, I want to have an impact in terms of change. So I pick a problem and try to reformulate it and solve it in a different way, or change the question. I think this is usually the best way to have an impact.


So that irreverence, is that something you've had since you were young? Is it part of your, sort of, DNA?


Ah, yes. The issue there was that I was never really that good in classrooms, and I always somehow misunderstood the question and solved a different question. And since I didn't do my homework very well, I never knew what the standard answer was, and so I kept changing the problem, and that was not very successful in class. But when I moved to research, then changing the question was actually part of the job. And so I got far more successful after I got past the scholar program.


That's actually hilarious, because the school system rewards people who get answers right. But over in research, you want to turn things on their head.


Yeah, I mean, changing the question is far more useful in research than coming up with different answers, or a slightly better answer to an existing question.


You know, that's just a perfect setup for this podcast, because the entire topic is turning a particular discipline or field on its head a little bit. So let's set the stage for our conversation today by operationalizing the term machine teaching. Some people have said it's just another way of saying machine learning, but you characterize it as basically the next new field for both programming and machine learning. So tell us how machine teaching is a new paradigm, and what's so different about it that it qualifies as a new field.


Okay, so I'm going to characterize machine learning as extracting knowledge from data. Machine teaching is different, and the problem it tries to solve is: what if you start and you have no data? Then the task is about extracting knowledge from the teacher. And this is very similar to programming. Programming is about extracting knowledge from the programmer. So this is where the two fields are very close, and it's a very different paradigm, because now it's all about expressivity, recognizing what the teacher meant. And because you focus on the teacher, this is why HCI is so important. HCI is human-computer interaction, and so programming and teaching are absolutely the epitome of humans communicating with computers.


Listen, I wanna ask, because I'm still a little fuzzy. When you say you have no data but you have a teacher, is this teacher human? Is this teacher a person? Yes, yes. Explain what that looks like with no data. Is the teacher giving the data to the machine?


So let me give you a simple example. Good. Imagine I want to teach you how to buy a car. So I want to give you my personal feeling for how you buy a good car. I could bring you to the parking lot and point to good cars and bad cars. And at some point I may ask you, what is a good car? And you may say, oh, it's all the cars for which the second digit of the license plate is even. And that may fit the data perfectly, and obviously this is not what I expected. But this is not the way we do it human to human. The way we do it, human to human, is I will tell you that you should look at the price. You should look at the crash test. You should look at the gas mileage. Maybe you should buy electric. And these are features. They are what questions to ask to basically have the right answer about what is a good car and what is a bad car.

And that's very different. It's a little bit like Socrates teaching by asking the right questions, as opposed to enumerating positives and negatives for the task. So when humans teach a human, they teach in a very different way than they teach a machine. Now, if you have millions and millions of labels, then the task is about extracting the knowledge from that data. But if you start with no data, then you find out that labels are not efficient at all. And this is not the way humans teach other humans. So there must be another language. And the other language is what I call machine teaching.
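
The car example can be sketched in a few lines of Python. This is only an illustration under assumed data: the four cars, the thresholds, and the predicate names (`second_plate_digit_even`, `affordable`, `safe`, `efficient`) are all hypothetical and do not come from the episode or from any Microsoft tool. It shows that a rule about the license plate can fit a handful of labels perfectly, while the questions a teacher would actually ask carry over to a car that was never labeled.

```python
# Hypothetical toy data: four labeled cars (all values are made up).
CARS = [
    {"plate": "4821", "price": 19000, "crash_stars": 5, "mpg": 40, "good": True},
    {"plate": "7250", "price": 21000, "crash_stars": 5, "mpg": 36, "good": True},
    {"plate": "3917", "price": 45000, "crash_stars": 3, "mpg": 18, "good": False},
    {"plate": "6533", "price": 52000, "crash_stars": 4, "mpg": 22, "good": False},
]


def second_plate_digit_even(car):
    """A spurious rule: is the second digit of the license plate even?"""
    return int(car["plate"][1]) % 2 == 0


def affordable(car):
    """Teacher-supplied question: is the price reasonable? (assumed threshold)"""
    return car["price"] <= 30000


def safe(car):
    """Teacher-supplied question: did it do well in crash tests? (assumed threshold)"""
    return car["crash_stars"] >= 5


def efficient(car):
    """Teacher-supplied question: is the gas mileage good? (assumed threshold)"""
    return car["mpg"] >= 30


def consistent(predicates, cars):
    """Names of the predicates that agree with every label in `cars`."""
    return [p.__name__ for p in predicates
            if all(p(c) == c["good"] for c in cars)]


ALL_RULES = [second_plate_digit_even, affordable, safe, efficient]
TEACHER_FEATURES = [affordable, safe, efficient]

# Labels alone cannot rule out the license-plate rule: it fits all four cars.
fits_labels = consistent(ALL_RULES, CARS)

# A new, unlabeled car that is cheap, safe and efficient, with an odd second digit.
NEW_CAR = {"plate": "9145", "price": 20000, "crash_stars": 5, "mpg": 42}
plate_verdict = second_plate_digit_even(NEW_CAR)
teacher_verdict = all(feature(NEW_CAR) for feature in TEACHER_FEATURES)

print("rules that fit the labels:", fits_labels)
print("license-plate rule says good:", plate_verdict)
print("teacher features say good:", teacher_verdict)
```

With only four labels, nothing in the data separates the plate rule from the meaningful questions; the teacher's choice of features is what removes it from consideration.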

This is like a programming language, and just to give you an idea of how natural it is, what I see happen over and over in industry is that when people want to build a new machine learning model, they start by collecting a whole bunch of data. They write labeling directives. And then they outsource it, and then they get back their 50,000 labels, and then they have a machine learning algorithm try to extract that knowledge from those labels. But this is ironic, because the labeling directives contain all the information to do the labeling. So imagine now that the labeling directives could be inputted directly into the system. Now, when you look at the labeling directives, they are features. They're saying, oh, this is a cooking recipe because it has a list of ingredients. So if we can make that the teaching language, then we can skip the middleman and get the machine to do that.


I think that's exactly the word I was going to use, the middleman of the labelers, right? Drilling in a little, teachers are typically more expensive in terms of hours and so on. So what's the business model here, except for the fact that you're missing the middleman, which usually marks up the price? How is it more efficient or less expensive?


Okay, so this is exactly what happened with programming. At first, the programmers were scientists who would program in assembly code, and the name of the game in those days was performance: the biggest machine you could get and the fastest machine you could get. And over the years, the field has evolved to allow more and more people to program, and the cost became really the programmer. And so we wanted to scale with the number of programmers. This was The Mythical Man-Month, you know, how to reduce the cost of programmers, how to scale a single task to multiple programmers. And if you look at the literature for programming, it moved from a literature of performance to a literature of productivity. And I believe that machine learning is still a literature of performance. Generalization is everything. And if you have a lot of data, this is the right thing. There, basically, what makes the difference is the machine learning algorithm and how many GPUs you're gonna put on it, and this is what deep learning is.

And I've worked in that field for many years, and I absolutely loved that game. But I believe that we are changing. We are at the turning point where productivity and the teacher's time become more and more important. And for custom problems, you don't have a choice. You cannot extract the knowledge from the data because you don't have the data. It's too custom, or maybe it changes too fast. In that case, the more efficient way to communicate the knowledge is through features, through schemas, through other constraints, and I'm not sure we yet know what language it will be. It will still evolve.


As a former teacher myself, albeit teaching teens, not machines, I'm intrigued by your framework of what we in education call decomposition, or deconstructing down into smaller concepts to help people understand, and then scaffolding, or using building blocks to support underlying knowledge to build up for future learning. Talk about how those concepts will transfer from human teaching to machine teaching.


So in philosophy there is a movement called behaviorism that says that with stimulus-response you can teach everything. And of course, you can't. You won't be able to learn very complex things if you just do stimulus-response. Well, in machine learning, I find that very often machine learning experts are what I would call machine learning behaviorists. And basically, they believe that with a very large set of input-label pairs, they can teach anything to a machine, right? And if you have a tremendous amount of labels and you have a tremendous amount of computation, you can go very far. But there are tasks that you will never be able to do. If I were to give you a scenario and ask you to write a book, you could fill buildings with scenario-and-book pairs.

You will not be able to learn that task. The space of functions that are bad is too big compared to the space of functions that actually fulfill the desired goal, and because of that, there's absolutely no way you can select the right function from the original space in a time that's less than the age of the universe or something. But strangely enough, everyone in this building can perform the task. So we must have learned it somehow. Interesting. Right, and the way we've learned it is, we learn about characters. We learn about words. We learn about sentences. We learn about paragraphs. We learn about chapters. But we also learn about tenses. We learn about metaphors. We learn about character development. We learn about sarcasm.

sarcasm, right. So with all these things that we've learned in terms of scales, we were able to go to the next stage and learn new skills on top of the previous scales and in machine learning. There's this thing that we call the hypothesis space, which is the space of functions from which we're looking to find the right function. Either hyper space is too big, then it's too hard to filter it down to get the right function. But if there's space of function is small enough, then with a few labelled example, you can actually find the right function and the decomposition that allows you to break the problem into these smaller subspace. And once you've learned that the sub task, then you can compose it and keep going to build on top of previous skills. And now you can do very complex task, even though each of the sub task was simple.


So it is really more similar to how humans teach humans than the traditional model of how we teach machines to learn.


Yes, and decomposition is also the hallmark of programming. So the art of programming is decomposition, and if you don't get it right, you refactor. And we've developed all these design patterns to do programming right, and I believe that there will be a complete one-to-one correspondence. There will be the design patterns. The teaching language will correspond to the programming language. I would even say that what corresponds to the assembly language is the machine learning models, and they should be interchangeable. And if they are interchangeable, then the teachers are interchangeable, which is exactly what you want in terms of productivity, because you know how it works: the person that starts the project may not be the person that ends the project or maintains the project.


There's been a lot of talk within the computer science community about software 2.0, and the democratization of skills that usually require a significant amount of expertise. But you suggested that those terms don't do full justice to the kinds of things that you're trying to describe or define. So tell us why.


So you said Software 2.0, which is an evolution thing, right? Right. And I believe that we need something far more radical in the way we view the problem. So to me, software is something that's intentional. Very often, when people say Software 2.0, what they think about is dealing with large amounts of labeled data, and labeled data can be collected in a non-intentional way. So, for instance, for click prediction, the labeled data is whether you got a click or you didn't get a click, and you collect that automatically, so it's not really intentional. When you write a program, it's very intentional. You decompose it, you write each of the functions with a purpose, right?

And I think when you teach and you decompose the problem into subparts, the labels are intentional and they are costly, because they come from a human, and now we need to manage that. So if I decompose the problem, I want to be able to share the parts that I have decomposed. And now if I'm going to share it, I need to version it. And this is a discipline that is very well known in programming. So how do we manage, as efficiently as possible, knowledge that was created intentionally? It will need to be maintained. It will need to be versioned. It will need to be shared. It will need to be specified. And now all the problems of sharing software will translate to sharing and maintaining models that are taught by humans. So the parallel is very, very direct.


What about this word, democratization? You and I talked before, and I actually inserted the word into our conversation. You go, "I don't really like that word."


Yes. So I started using that word at the beginning, and I felt like the problem with the word is that everyone wants to democratize, and I want machine teaching to be more ambitious. So let's think about the guarantee you have when you program. If I ask you to code a function and you're a programmer, you will tell me, yes, I can do it. And then I'll say, well, how long is it gonna take? And you're gonna say, let's say, three months. And your estimate may be off. People say that we're usually off by a factor of two or three. But you were able to know that it was possible. And you can actually even specify the performance of your program. And you can say how long it's gonna take.

Right now, we don't have that kind of guarantee when we talk about machine learning. And the strange thing is that we have all these tools of structural risk minimization that give us guarantees on the accuracy given some distribution. But we don't really have guarantees on what the final performance is going to be before we have the data. And yet, in programming, we can have those guarantees. So what's different? Right? So we have to think about the problem of teaching differently. And if you start thinking about the problem of teaching in terms of decomposition, then you'll be able to reason in exactly the same way that you reason for programs. We actually do this when we teach humans, right? So if you wanted to teach a concept to a person, you would say, okay, this person doesn't know about this.

And this, and this. I'm gonna first have to teach those sub-skills, and then I'll be able to build on top of those skills. So the task of decomposition is a task that's very important for teaching. We humans do it all the time. We programmers do it all the time. The teacher has to spend a lot of time decomposing the problem into these sub-fields and sub-skills


and then laying the foundation


Foundation, and building on top of the foundation, and testing at each level whether you got the skill right. And that testing is exactly what I want to make a discipline in machine teaching.


Which is just, you know, music to my ears. As a former teacher, I don't think you ever stop being a teacher. I said they'll know I'm dead when the red pen falls out of my hand. Well, let's get a little more granular and specific. So you've developed a machine learning tool called PICL, which is spelled P-I-C-L, not like the pickle that you put on a burger, and it stands for Platform for Interactive Concept Learning. And it allows people, this is key, it allows people without ML expertise to build ML classifiers and extractors. Yes. Tell us more about PICL. What is it? How does it work? And why is that cool?


Okay, so I believe that if you want to do machine teaching, you need a few elements. And if it's okay, I'm gonna describe the three elements that I believe are essential.


It is okay.


All right. So the first thing is that you're gonna need a machine learning algorithm, and the machine learning algorithm has to be able to find the right function from the hypothesis space if there is such a function that fits the data. So that's the first requirement. And if we have that requirement, we can even interchange machine learning algorithms. It's not super important which one we're gonna use. The second element I call teaching completeness. And what I mean by that is that if the teacher can distinguish two examples of two different classes, you should be able to give the machine a feature that will distinguish these two classes. At the same time, you need to be able to compose functions in a way that you can always bring the function that you want into the hypothesis space. Now it may take several iterations. You may have to create sub-models that are fairly simple or complex.

You can always decompose it. But eventually you have to be able to bring the function that you want into the hypothesis space. And if that's possible, then I call the system teaching complete. The last thing is that you need to have access to an almost interminable pool of useful examples. So imagine I want to build a classifier for gardening, and I decide that the sub-concept botanical garden is important to decide whether it's gardening or not. Like, maybe if it's gardening, it talks about plants, but if it's a botanical garden, I say it's more about entertainment than gardening. But I need to be able to distinguish this. So now I need a sub-classifier to decide whether it's a botanical garden or not, and for that I need to find examples for this. So if my sampling space has an infinite supply of all the sub-concepts that I may have to learn, then basically I have the ability to find all the examples that I need that are relevant to the task that I want. So people tend to think that this is a very hard requirement, because how do you get an infinite supply of data?

But here is the key: that data doesn't need to be labeled. Because if I have access to this pool of unlabeled data, I can query it. I can use my classifier to search it. I can combine multiple classifiers and say, well, this classifier has to be very positive, and this one has to be very negative, and I can sample within that range, right? So if I have the right tools to discover the examples that I need, I'm good enough. So these are the three requirements. And if you have these three requirements, then it's all about finding the right combination of labels, features, and decomposition so that you can achieve your task. And this is what PICL does for text.

So right now, PICL only works on text. Interesting. And so in PICL, you can take documents and classify them. But you can also extract schemas from the documents, so you can find addresses. You can find the job history from a resume. You can find menus. I mean, you can extract products. You can extract quotes from email, all these things that humans can do very easily. So this is the vision of machine teaching: anything you can teach to a non-expert human, you should be able to teach to a machine. And hopefully we have the right language to do that easily. Now, let's be honest.

Not all humans are good teachers, and I believe that not all humans will be good machine teachers. It takes education and familiarity with the tool. But hopefully we can get better at this.


I'm delighted by some of the phrases you use, and maybe they're not unique to you, but I find them a little provocative. And I like that. One of them is something you call ML litter. It sounds bad. What is it, and what can be done about it?


Uh, okay. I'm gonna tell you a story that I've seen repeated over and over. So the story goes as follows. Some program manager decides they're going to use machine learning to solve a particular task, so they collect a bunch of data and then they write their labeling directions, and they send that to get labeled. It comes back and they look at it, and it's not exactly what they meant with the labeling directions. So they change it a little bit. They send it back. After a couple of iterations, they finally get a set that they are happy with. They consult with some machine learning expert who will recommend deep learning or support vector machines or boosted decision trees, and they will decide what parameters to use: cross-validation, k equals five, you know, all these hyperparameters.

And they will have some engineers that will do feature engineering and code some features. And then they finally build a model that they are happy with. They deploy it, and it's a catastrophe, because cooking recipes are confused with numerical recipes, and they're missing important subsets of recipes. So they go back. They do the iteration. They collect more data, get it labeled, blah, blah, blah, and eventually they get a function that does exactly what they want, and they're super happy. It's deployed and everything is fine. Six months later, the distribution has changed. The semantics of what is a recipe and what is not a recipe, or whatever they were trying to do, has changed. The features that are available are not the same. There are new features; some features are no longer available. And then they go back and they look for the machine learning expert. The machine learning expert now is at Facebook or Google or Amazon. They've moved on,

right? So you have models that are no longer reproducible, experts that are not the original experts, and so you can't reproduce the model. You can't answer the question when someone asks, can I remove that model? And you remove the model and then everyone screams, because they're using that model as a feature for another model, but they don't even know if that feature is performing to spec. And so you find all these models that are kind of, sort of, not defunct. They're still running. No one knows whether they're performing to spec, and no one dares remove them. And this is what I call ML litter, and the amount of resources that's wasted on this is enormous. Now, some people have identified the problem. There's a famous paper from the Google team on technical debt that has identified these problems. But I haven't seen a lot of solutions for how to deal with this.


Now, I was gonna ask.


So I actually think that machine teaching is bringing the discipline of programming to machine learning. So something as simple as using source control and including the data in the source control. Now, I mean the Software 2.0 data according to my definition, which is the intentional data. I don't think we need to version data that is collected automatically, but everything that is created by a human with a purpose should be versioned. And in the same check-in, in the same group of things that you save together, you should include the labels, the features, the schema, everything that's relevant to reproducing that model. And if you do this, and you bring all the discipline about decomposition and documentation and versioning, then suddenly that solves that problem. You will always be able to reproduce your models. If you bring all the discipline and design patterns that we've learned from programming, then I believe that will solve the problem of ML litter.
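
One way to picture "the same check-in" is a single identifier computed over the labels, the features, and the schema together, so that a change to any of them produces a new version. The Python sketch below is a minimal illustration under that assumption. It is not how PICL or any Microsoft system is described as working in this episode, and `snapshot_id`, the document IDs, and the feature names are all hypothetical.

```python
import hashlib
import json


def snapshot_id(labels, features, schema):
    """Return one content hash covering everything a human created on purpose.

    The three parts are serialized together in a canonical form (sorted keys,
    fixed separators) so the same teaching state always yields the same ID.
    """
    bundle = {"labels": labels, "features": features, "schema": schema}
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical teaching state for a "cooking recipe" concept.
labels = {"doc-001": "recipe", "doc-002": "not_recipe"}
features = ["has_ingredient_list", "mentions_cooking_time"]
schema = {"concept": "cooking_recipe", "subconcepts": ["ingredients", "steps"]}

v1 = snapshot_id(labels, features, schema)

# The teacher changes a single label; the snapshot ID changes with it.
relabeled = {**labels, "doc-002": "recipe"}
v2 = snapshot_id(relabeled, features, schema)

print("version 1:", v1[:12])
print("version 2:", v2[:12])
print("same inputs reproduce the ID:", v1 == snapshot_id(labels, features, schema))
```

In practice an ordinary source control system would play this role. The sketch only shows why labels, features, and schema belong in one unit: a model trained from a given snapshot can be traced back to exactly the human input that produced it.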


I do want to ask you about this interesting program you told me about, and this program is called the Rotation program, and it seems to me like a novel way to ensure what I would call organizational hybrid vigor to use an agricultural or animal husbandry term. What is it? Why do you think it's a good thing? And what results have you seen since you implemented it?


Okay, so Microsoft has this organization called Microsoft Research, where the primary goal is to move the theory forward, to discover new principles and innovate in all the fields. In the product groups, you have very different imperatives in terms of producing value, and the question is, how do you transfer from one organization to the other? And how do you lubricate producing value and doing innovation? And there is not a simple answer. So what I wanted to do is to help the product groups with the recruiting of research talent. I wanted the researchers to learn about the reality of the product groups, and you don't want to force people to do a move that they don't want to make. At the same time, you want to provide the right incentive. So the rotation program is for people above a certain level. They have the option to do a rotation in a product group, with an interesting constraint, which is that they cannot come back for two years.

And that constraint means that when they do that jump, they jump in the deep water and they have to basically swim for two years, and then they have the option to come back. And so two years is so that they have two review cycles. But it's also so that we can hire a postdoc during the same time. So it all fits perfectly, and what we find out is that sometimes people come back, sometimes they don't come back. Every time they do it, at first they say, you know, I hold on to this contract like it's my dear life. And after two years, they say, you know, I'm totally happy. They've given me a lot of resources and I'm having a huge amount of impact. And suddenly they are absolutely not worried about the future, and it's very, very different.


So that's the benefit for the product teams and the person who goes. What's the benefit for Microsoft Research?


So for Microsoft Research, there are many benefits. So first, having an impact is always very satisfying. The people that do the jump become advocates of research and collaborators from the product group, so that you can have both a theoretical impact in the academic community and an impact in the real world. Basically, the world becomes your customer. It creates movement across the org, so it brings fresh minds, fresh legs. Exactly. And so this is good in terms of diversity. It's good in terms of basically stirring the pot a little bit. And some people need to do this after a while, because we all need change.


That is such a good transition to talking about what you're doing right now. Moving from Microsoft Research to what I might call the Microsoft mothership. You're going into Microsoft Office. Tell us about what prompted that move. What's the goal? Who are you taking from here? Who are you getting? What's the deal?


Yeah, so I started the machine teaching effort about seven years ago, and at the time I had the choice of doing it in the product group or doing it in Microsoft Research. And to be frank, I worried that if it was done in the product group, it would be hard to protect it from the imperative of delivering value immediately. So I wanted to have a little bit of breathing room. And so I created this team in Microsoft Research, and it has been seven years, and I believe that there's no longer really any question of the value. We can actually both deliver value and continue the investment in innovation. And we can do that almost anywhere in the company.


So why Office?


So why Office? We started doing this in Azure. And now the group that started this in Azure has moved to Office. And basically, I'm rejoining a group that I have influenced in the past, and we are going to do both Azure and Office, and those are the two main products of Microsoft.


You started here 20 years ago, you said, and your career has been anything but linear. You've been back and forth already. The rotation program is not new to you. So tell us about your path, your journey, your story.


All right. So after my PhD, I started at Bell Labs. You know, pretty famous group. This is the group of Yann LeCun, Vladimir Vapnik, Yoshua Bengio, Léon Bottou, and Yoshua and Yann just got the Turing Award. So I stayed there for eight years, and then I came to Microsoft Research, and when I moved to Microsoft, I did something very strange. I looked at the address book and I looked at everyone that had the title architect, scheduled one-on-ones, and tried to find out what problems needed to be solved in the product groups. It was a very bold kind of move, and I started creating relationships.

And after a while I had two groups that were providing me with engineers to help them, because they wanted to have more of my bandwidth. And I told them, well, the best way to have more of my bandwidth is to provide me with engineers, and then I will help with your product. So that's how I started. And then I thought that, you know, Microsoft is the document company, because at the time, 95% of all documents were created on Microsoft software, and we didn't have a document group. So I said we should have a document group, and then the answer came back: yeah, you should create it. And I said, well, I'm a machine learning person. I'm not a document person.

But after six months, I thought it made sense. So I created that group, and then more groups were put under me, and eventually I was asked to start Live Labs Research. And this is when I left Microsoft Research and created Live Labs Research under Gary Flake, who was basically creating Live Labs. So I created this team, and then moved to adCenter as a group program manager. And I can tell you, I was not qualified for that position, and it was really crazy. And after six months, I sort of fired myself, as I cannot do this, and I became a chief scientist again. But after three more years of that, I decided to come back to Microsoft Research to do machine teaching. And now I'm about to go again out of Microsoft Research to try to have an impact.


All right, machine learning is full of promise, and machine teaching seems to be a promising direction therein. So we might call everything we've talked about up to now "what could possibly go right." But we always have to ask the "what could possibly go wrong" question. At least I do, not because I'm a pessimist, but because I think we need to think about unintended consequences and the power of what we're doing. So given the work you do, Patrice, and its potential to put machine learning tools in the hands of non-experts, do you want to go there? Is there anything that concerns you, anything that keeps you up at night?


Um, oh, yeah. I like to think of myself as someone that thinks strategically, and I feel like it's kind of my job to imagine everything that can go wrong.


That's good.


Yes. So, um, so many things can go wrong. The first thing is, 30 years ago we had expert systems and, you know, the first definition of AI, and what happened is that we had these giant systems with lots of rules, and we didn't have a good way to decompose the problem into simple problems, and it didn't work out. We also have now deep learning, and again, there's no decomposition. And the complexity is such that we don't understand what's going on inside, and I think it's far more successful than where we were thirty years ago. And this is why we have something different today. And I'm trying to say we should be in between. Before, it was all features, no labels, and now with deep learning, it's sort of, kind of, all labels, no features.

And I'm advocating that we should be in between. And this is where machine teaching is. We should express things not just with labels, but with features. And we should do it in a way that's disciplined and deliberate, like we do for programming. Okay, what if I'm wrong? I don't believe this is the case, but of course I'm worried that I might be pulling a whole bunch of people in a direction that is not the right direction. At the same time, to be honest, I really, truly believe that this is the way to go. So I have the fortitude to overcome those doubts. But it's something that always keeps me up at night. The other question that you're asking is more philosophical.

A lot of people have thought that the key to AI is the learning algorithm, and I actually don't believe it's the learning algorithm. I think teaching is what makes the difference. So from the philosophical standpoint, I believe that the machine learning algorithm is almost the easy part. It's the part that you can locally optimize. Teaching is the part that you have to optimize at the global level, at a societal level, and I think that may actually be the key to AI, the same way it was the key to human development.


At the end of every podcast, I give my guests the proverbial last word. So here's your chance to give some advice or inspiration to emerging researchers who might be interested in teaching machines. What would you say to them?


Okay, when people ask me for career advice, I always tell them to optimize for growth. So challenge yourself. Don't be afraid of failure. Failure is growth. So that's the general advice for researchers and for people building their careers. For machine teaching, I believe machine teaching is an incredible field right now because, first, it's at the intersection of three fields that are very different. And when you are at the intersection of multiple fields, it's usually a very fertile ground for doing all sorts of new things. I also believe that we are at a phase transition, where the field of machine learning, which is super popular, right, is about to transition to something different. And when you're at the time of a transition, it's the most exciting thing possible.

So I think now is a fantastic opportunity to create a new field. I don't know where it's gonna go, but it's very, very exciting.


Patrice Simard, thank you so much for coming on the podcast today. To learn more about Dr. Patrice Simard and the science of teaching machines, visit Microsoft.com/research.
