September 2014

Volume 29 Number 9

Microsoft Azure: Introduction to Machine Learning Studio

James McCaffrey

There’s no unanimous agreement on exactly what the term “machine learning” (ML) means. In my mind, ML is any system that uses data to help make predictions. For example, you might want to predict who will win the Super Bowl, or to which group (cluster) of people a new customer will be most similar.

Writing ML systems from scratch using C# or any other programming language is fascinating, but it’s time-consuming, requires specialized knowledge and is often difficult. The new Microsoft Azure ML Studio (released in July 2014) makes creating ML systems much easier, faster and more efficient. In this article, I’ll walk you through a complete example that will get you up and running with ML Studio.

The best way to see where this article is headed is to examine the screenshot in Figure 1. The image shows a completed ML Studio experiment. The goal of the experiment is to predict the political party affiliation (Democrat or Republican) of a member of the U.S. House of Representatives, based on previous voting behavior.

Figure 1 A Complete Azure ML Studio Experiment

At the top of the image, notice that ML Studio is running in Internet Explorer; it’s a Web-based application. More specifically, ML Studio is the front end for the Microsoft Azure Machine Learning service. From here on, for simplicity, I’ll use the term “ML Studio” to refer to both the client front end and the Azure back end. In the address bar, you can see that I’m using an internal URL, “passau.cloudapp.net.” During development, the ML Studio project was code-named “Passau” and you might come across that term in the documentation. By the time you read this article, the public URL for ML Studio will be available at azure.microsoft.com.

Using ML Studio to create ML systems is roughly analogous to using Visual Studio to create executable programs, though you shouldn’t get too carried away with this notion. To use Visual Studio, you can either buy the tool or use a free trial version. With ML Studio, you’re charged for using the service, but there will be ways to try the system out for free. The exact details are certain to change frequently—constant change is one of the major downsides, in my opinion, of working with cloud-based systems. I like to install a product on my desktop and have any changes be totally my decision. In the brave new world of cloud computing, you have to be prepared for a working environment where change is no longer completely under your control.

ML Studio has three primary working areas. On the left you can see items with names like Saved Datasets, Data Input and Output, and Machine Learning. These are categories and if you expand them, you see specific items that can be dragged onto the center design surface. This is somewhat similar to the Visual Studio Toolbox, where you can drag UI controls onto a design surface. However, ML Studio modules typically represent what you can think of as methods—that is, prewritten code that performs some sort of ML task.

The center area of ML Studio is called the experiment. This is analogous to the Visual Studio editor—the place where you do most of your work. In Figure 1, the experiment is titled Voting Experiment. An experiment title is roughly analogous to a Visual Studio Solution name. The rectangular boxes are modules that were dragged onto the design surface. For example, the module labeled “Voting data” is the raw data source, and the module labeled “Logistic Regression Binary Classification Model” (the label is partially cut off) is the core ML algorithm used.

The curved lines establish input-output flows between modules. To be honest, my first impression as a developer was not altogether positive: “Oh great. Curvy lines. I don’t like curvy lines. That’s not real programming.” But it didn’t take me long to adapt to the ML Studio visual style of creating systems, and now I am a Believer.

The right-hand side of ML Studio shows details about whatever is currently selected in the main work area. In Figure 1, because the Logistic Regression Binary Classification Model module is selected (its border is bolded), the information in the right-hand area, such as “Optimization tolerance,” with value 1.0E-07, refers specifically to that module. You can think of the information in the right-hand area as the parameter values (or equivalently, argument values, depending on your point of view) of the selected module/method.

You can run an experiment by clicking on the Run icon located at the bottom of the tool. This is somewhat equivalent to hitting the F5 key in Visual Studio to execute a program in the debugger. As each module finishes, ML Studio displays a green checkmark inside the module. You can also see a Save icon but, by default, ML Studio automatically saves your experiment every few seconds—working in the cloud can be hazardous due to issues like dropping a network connection.

In the sections that follow, I’ll walk you through the creation of the experiment in Figure 1 so you’ll be able to replicate it. Doing so will give you a solid basis for investigating ML Studio on your own, or for exploring the early-release documentation. This article assumes you have at least beginning-level programming skills (in order to understand ML Studio and Visual Studio analogies and terminology), but does not assume you know anything about ML Studio or machine learning.

Getting the Results?

If you’re new to ML Studio, you’re probably wondering where to find the output of the experiment. As it turns out, a typical ML Studio experiment often has multiple outputs. The bottom-line output, so to speak, is shown in Figure 2. I chopped out the center section to make the image a bit easier to view.

Figure 2 Azure ML Studio Experiment Results

In order to see these results, I right-clicked on the right-most Score Model experiment module and selected the Visualize option from the context menu. This opened a separate window with the results as shown. For now, look at the bottom part of the image, which resembles:

unknown-party  y  n  y  n . . y  n  democrat    0.0013
unknown-party  y  y  y  y . . n  n  republican  0.7028

This output indicates that after the prediction model was created, it was presented with two new data items. The first, with an unknown party, is data for a hypothetical Representative who voted “yes” on a legislative bill related to handicapped infants (the columns have headers if you look closely), “no” on a bill related to a water project, and so on, through a “no” vote on a bill related to South Africa. The model created by ML Studio predicts the hypothetical Representative is a Democrat. The second data item is for a hypothetical Representative who voted “yes” on the first eight bills and “no” on the second eight bills; the model predicts the person is a Republican.

Setting up the Data

Now that you understand the goal of the demo experiment, you’re in a better position to understand how to create the experiment. It’s fairly safe to say that all ML Studio experiments start with some data, and one or more questions to be answered. Here, the demo data is a well-known (to the ML community, at least) benchmark data set often called the Congressional Voting Records Data Set (or the UCI Voting Data Set, because the primary location of the file is on a server maintained by the University of California, Irvine). The raw data, a simple text file named house-votes-84.data, can be found by doing an Internet search.

The first four lines of the raw data are:

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
...

There are a total of 435 comma-delimited lines of data, one for each of the 435 members of the U.S. House of Representatives in 1984. The first column/field is the party and is either democrat or republican (there were no independents or other parties at the time). The next 16 items on each line represent a yes vote (y), a no vote (n) or a missing vote (?).
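To make the layout concrete, here is a minimal C# sketch, written as top-level statements (C# 9 or later), that splits one raw line into its fields and counts the votes. ParseLine is a hypothetical helper name chosen for illustration; it isn't part of ML Studio or of the article's code download.

```csharp
using System;
using System.Linq;

// Hypothetical helper: split one raw line into its 17 comma-delimited fields
// (1 party label followed by 16 votes) and reject lines with any other shape.
string[] ParseLine(string line)
{
    string[] fields = line.Split(',');
    if (fields.Length != 17)
        throw new FormatException("Expected 1 party field plus 16 vote fields.");
    return fields;
}

// First line of the raw data shown above.
string[] first = ParseLine("republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y");
string party = first[0];
int yes = first.Skip(1).Count(v => v == "y");
int no = first.Skip(1).Count(v => v == "n");
int missing = first.Skip(1).Count(v => v == "?");
Console.WriteLine($"{party}: {yes} yes, {no} no, {missing} missing");
// prints "republican: 9 yes, 6 no, 1 missing"
```

Checking the field count up front is a design choice for the sketch: a short or long line fails loudly instead of silently shifting votes into the wrong columns.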

ML Studio can read data directly off the Web, or from Azure storage, but I prefer to create my own data store. To do so, I copied the text file into Notepad on my local machine, and then added column headers based on the file description on the UCI Web site, like so:

political-party,handicapped-infants, . . ,south-africa
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
...

When writing ML code from scratch, working with column headers can be annoying, so headers are often left off data files. But with ML Studio, using column headers is actually easier than omitting them, and it makes the data easier to understand. I renamed the local file to VotingRawWithHeader.txt and saved it on my machine. If you want to use the same headers as I did, you can get the data file I used in the code download for this article at msdn.microsoft.com/magazine/msdnmag0914.

After navigating to the ML Studio homepage, I clicked on the Datasets category in the left-hand pane. In the main working area, ML Studio displays a list of built-in data sets, for example Iris Two Class Data and Telescope Data. Most of these data sets you initially see are more or less well-known benchmark sets (many from the UCI repository) that can be used for exploring ML Studio. In the lower-left corner of ML Studio, I located the New icon and clicked on it.

From there I could choose either a new Dataset or a new Experiment, so I clicked on Dataset and then on the From Local File icon. This brought up the dialog box shown in Figure 3. I used the Browse button to target the local file, named the data set “Voting data,” selected type “Generic CSV file with a header (.csv)” and typed in a brief description of the data set.

Figure 3 Creating a New Dataset

I clicked on the OK checkmark and ML Studio uploaded the local file into Azure storage and saved it. Back in the Datasets view in ML Studio, I did a page refresh and the voting data was now visible along with the demo data sets. Note that in the pre-release version of ML Studio I used, it wasn’t possible to delete a Dataset. So, when you’re investigating, I strongly suggest that you create a single data set with a generic name like Dummy Data. Then, when you need a different data set, use the “This is a new version of an existing dataset” option so your ML Studio workspace doesn’t become overrun with orphaned, dummy data sets that can’t be deleted.

Creating the Experiment

To create the experiment, I clicked on the New icon in the lower-left corner of ML Studio, and then on the Experiment option. Next, in the left-hand pane, I clicked on the Saved Datasets category, and then scrolled to the Voting Data item I just created and dragged it onto the design pane. At the top of the design surface, I entered Voting Experiment as the title. At this point, you could right-click on the bottom output node of the Voting Data module and select the Visualize option to verify your data set is correct.

Many developers, including me, when first working with ML, seriously underestimate how much effort is involved in manipulating the source data before applying ML algorithms. Typical tasks include rearranging data columns, deleting unwanted columns, dealing with missing values, encoding non-numeric data, and splitting data into training and test sets. From a developer’s point of view for the voting-data experiment, these tasks might take the form of code like this:

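// LoadData, ProcessMissing, SwapColumns, Encode and MakeTrainTest are hypothetical helper methods.
string[][] rawData = LoadData("VotingRawWithHeader.txt");
rawData = ProcessMissing(rawData, '?', 'n');  // treat a missing vote as "no"
rawData = SwapColumns(rawData, 0, 16);  // swap the party label into the last column
double[][] data = Encode(rawData);  // convert string values to numbers
double[][] trainData;
double[][] testData;
MakeTrainTest(data, 0.80, out trainData, out testData);  // 80 percent train, 20 percent test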

Figure 4 shows a close-up of the first four ML Studio modules that perform these tasks. In many ML scenarios, the most common approach to deal with missing values is to simply delete all data item rows that contain one or more missing values, and ML Studio gives you that option. However, with voting data, my hypothesis was that a missing vote was really an implied “no” vote. So, for the Missing Values Scrubber module, in the right-hand pane, I specified that all missing values (“?”) should be replaced by “n” values.

Figure 4 Processing the Data
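For comparison, the missing-value replacement is easy to sketch in C#. The following is a rough from-scratch stand-in for what I asked the Missing Values Scrubber to do, not a description of how the module works internally. ReplaceMissing is a hypothetical name, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the scrubbing step: replace every missing-value
// marker with a fixed replacement, leaving all other fields untouched.
// Returns new arrays, so the input rows are not modified.
string[][] ReplaceMissing(string[][] rows, string marker, string replacement)
{
    return rows
        .Select(row => row.Select(f => f == marker ? replacement : f).ToArray())
        .ToArray();
}

// Third and fourth lines of the raw data shown earlier.
string[][] raw =
{
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n".Split(','),
    "democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y".Split(',')
};
string[][] cleaned = ReplaceMissing(raw, "?", "n");
Console.WriteLine(string.Join(",", cleaned[0]));
// prints "democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n"
```

The alternative mentioned above, deleting any row that contains a missing value, would be a one-line filter instead, but it throws away the other votes in that row.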

The Project Columns module allows you to specify any columns you want to omit. In this case, I selected the “Select all columns” option. ML Studio examines your data and makes intelligent guesses as to whether column values are string categorical data or numeric data. The Metadata Editor module allows you to override the ML Studio assumptions and also allows you to specify the Label column, that is, the variable to predict. I selected the “political-party” column (here’s where having column headers is a big help) and specified it was the Label column. I left the other 16 columns as Feature (predictor) columns.
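In a from-scratch system, the Label and Feature distinction usually shows up when the string values are encoded as numbers. Here is a minimal sketch using top-level statements (C# 9 or later). EncodeRow is a hypothetical helper, and the numeric codes (y = 1.0, n = 0.0, democrat = 0, republican = 1) are assumptions I picked for illustration, not something ML Studio exposes.

```csharp
using System;

// Hypothetical encoder for one cleaned row (party followed by the votes).
// Assumed codes for this sketch: y = 1.0, n = 0.0, democrat = 0, republican = 1.
void EncodeRow(string[] fields, out double[] features, out int label)
{
    label = fields[0] == "republican" ? 1 : 0;       // Label column
    features = new double[fields.Length - 1];        // Feature columns
    for (int i = 1; i < fields.Length; ++i)
        features[i - 1] = fields[i] == "y" ? 1.0 : 0.0;
}

// A row whose missing votes have already been replaced by "n".
string[] row = "democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n".Split(',');
EncodeRow(row, out double[] x, out int y);
string joined = string.Join(" ", x);
Console.WriteLine($"label = {y}, features = {joined}");
```

Note that the sketch assumes missing values were handled first; any field other than "y" is encoded as 0.0.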

The Split module does just that, dividing data into a training set, used to create an ML model, and a test set, used to estimate the accuracy of the model. Here, I specified 0.8 in the module’s parameter pane so the training data would be 80 percent of the 435 items (348 items) and the test set would be the remaining 20 percent (87 items). The Split module also has a Boolean parameter named “Stratified split.” When working with ML Studio, you’ll certainly come across parameters whose meaning you don’t understand. The question-mark icon in the lower-right gives you access to the ML Studio Help.
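The 348 and 87 figures can be checked with a short sketch of a random 80-20 split. This is only an illustration of the idea, written with top-level statements (C# 9 or later); SplitIndices is a hypothetical helper, and I'm not claiming the Split module shuffles or rounds the same way.

```csharp
using System;
using System.Linq;

// Hypothetical sketch of a random split: shuffle the row indices with a fixed
// seed, then send the first trainFraction of them to the training set.
void SplitIndices(int count, double trainFraction, int seed,
                  out int[] trainIdx, out int[] testIdx)
{
    int[] order = Enumerable.Range(0, count).ToArray();
    var rnd = new Random(seed);
    // Fisher-Yates shuffle
    for (int i = order.Length - 1; i > 0; --i)
    {
        int j = rnd.Next(i + 1);
        int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    // Round rather than truncate so floating-point error can't drop an item.
    int numTrain = (int)Math.Round(count * trainFraction);
    trainIdx = order.Take(numTrain).ToArray();
    testIdx = order.Skip(numTrain).ToArray();
}

SplitIndices(435, 0.80, 0, out int[] train, out int[] test);
Console.WriteLine($"train = {train.Length}, test = {test.Length}");
// prints "train = 348, test = 87"
```

Splitting indices rather than rows keeps the sketch independent of how the data itself is stored.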

Training the Model

You can think of an ML model as a collection of information, typically numeric values called weights, that is used to generate outputs and predictions. Training a model is the process of finding a set of weight values so that when presented with input data from the training set (in this case, 16 yes and no votes), computed outputs (either democrat or republican) closely match the known outputs in the training data. Once these weights have been determined, the resulting model can be presented with the test data. The accuracy of the model on the test data (the percentage of correct predictions) gives you a rough estimate of how well the model will do when presented with new data, where the true output isn’t known.
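To make the idea of weights less abstract, here is a small from-scratch sketch of logistic regression, using top-level statements (C# 9 or later). It is not how ML Studio trains its model: the sketch uses plain stochastic gradient updates because they fit in a few lines, and the function names, the learning rate and the toy data are all assumptions made for illustration.

```csharp
using System;

double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

// Computed output for one item: sigmoid of the bias plus the weighted sum of inputs.
// w[0] is the bias; w[1..n] are the feature weights.
double ComputeOutput(double[] w, double[] x)
{
    double z = w[0];
    for (int i = 0; i < x.Length; ++i)
        z += w[i + 1] * x[i];
    return Sigmoid(z);
}

// Training: repeatedly nudge each weight to shrink the gap between the
// computed output and the known output (0 or 1) of each training item.
double[] Train(double[][] data, int[] labels, int maxEpochs, double learnRate)
{
    double[] w = new double[data[0].Length + 1];  // all weights start at 0.0
    for (int epoch = 0; epoch < maxEpochs; ++epoch)
    {
        for (int r = 0; r < data.Length; ++r)
        {
            double err = labels[r] - ComputeOutput(w, data[r]);
            w[0] += learnRate * err;
            for (int i = 0; i < data[r].Length; ++i)
                w[i + 1] += learnRate * err * data[r][i];
        }
    }
    return w;
}

// Toy data (not the voting data): the label simply copies the first feature.
double[][] toyData =
{
    new[] { 1.0, 0.0 }, new[] { 1.0, 1.0 }, new[] { 0.0, 1.0 }, new[] { 0.0, 0.0 }
};
int[] toyLabels = { 1, 1, 0, 0 };

double[] weights = Train(toyData, toyLabels, 1000, 0.5);
for (int k = 0; k < toyData.Length; ++k)
{
    double computed = ComputeOutput(weights, toyData[k]);
    Console.WriteLine($"known = {toyLabels[k]}, computed = {computed:F4}");
}
```

After training, the computed outputs for the two items labeled 1 land above 0.5 and the other two land below it, which is what "closely match the known outputs" means in practice.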

For the voting demo, a code-based approach to training might resemble:

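int numFeatures = 16;  // one input per vote column
LogisticModel lm = new LogisticModel(numFeatures);  // LogisticModel is a hypothetical class
int maxEpochs = 10000;  // upper limit on passes through the training data
lm.Train(trainData, maxEpochs);  // find weights that fit the training data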

Figure 5 shows a close-up of the equivalent ML Studio training-related modules. In the demo, the Train Model module accepts as input the Logistic Regression Binary Classification Model module. Unlike the other connections, this isn’t really a data flow; it actually specifies exactly what kind of ML model is to be used. Alternatives to Logistic Regression Binary Classification include modules Averaged Perceptron Binary, Boosted Decision Tree Binary and Neural Network Binary Classifiers.

Figure 5 Training the Model

So, how do you know which model to use? Because the output to predict has two possible values, democrat or republican, you want a binary model. But there are dozens of ML approaches and probably the hardest part of using ML Studio is having to research the pros and cons of different ML classifiers. A developer analogy with Visual Studio is that the Microsoft .NET Framework has dozens of data structures, such as generic Dictionary, HashSet, and generic Queue, and it’s up to you to know exactly what each data structure does. In the same way, it’s up to you to learn about ML classifiers.

The Logistic Regression module has some parameters you likely won’t understand, including Optimization tolerance, L1 regularization weight, and Memory size for L-BFGS. Again, it’s up to you to learn what these parameters do. Fortunately, ML Studio has well-chosen default values for most module parameters. I accepted all default parameter values except I used zero for the “Random number seed.”

In the Train Model module, you have to tell ML Studio which column of the training data is the Label column; that is, which column is the variable to predict. With the Train Model module selected, I clicked on the Launch column selector button in the module’s parameters pane and chose the pick-by-name option in the dropdown control, and then typed political-party. I could’ve used the column index, 1, because political-party is in the first column (ML Studio indices are 1-based rather than 0-based as developers are accustomed to). Note that specifying the Label column for the Train Model module is required even if you do so in the Metadata Editor.

Evaluating the Model

After the demo model has been trained, the next steps are to feed the training data and the test data to the model, calculate the computed outputs, and calculate the accuracy of the computed outputs (against the known outputs). In code, this might look like:

lm.ComputeOutputs(trainData); // Score
double trainAccuracy = lm.Accuracy(trainData); // Evaluate
lm.ComputeOutputs(testData); // Score
double testAccuracy = lm.Accuracy(testData); // Evaluate

Figure 6 shows a close-up of the relevant scoring and evaluating modules. The two Score Model modules accept two input flows. The first input is the trained model (the information needed to compute outputs), and the second input is either a training set or test set (the data inputs needed). The results of these two scoring modules are sent to the Evaluate Model module, which computes accuracy.

Figure 6 Scoring and Evaluating the Model

Figure 7 shows the results in the Evaluate Model module. Notice in the upper right, the second of two Dataset items has been selected (it’s highlighted), which means the results are for the test data. The most important part of the result is the Accuracy value of 0.989. Recall the test set was 20 percent of the 435 original data items, or 87 items. The Logistic Regression model correctly predicted the political party of 86 out of the 87 test items. There’s a lot of other information in Evaluate Model module results. For example, the graph is called a Receiver Operating Characteristic (ROC) graph. It plots the true positive rate (the fraction of actual positives that the model correctly flags as positive) on the y-axis vs. the false positive rate (the fraction of actual negatives that the model incorrectly flags as positive) on the x-axis.

Figure 7 Model Accuracy on Test Data
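The accuracy arithmetic, and the two rates plotted on an ROC graph, can be sketched in a few lines of C# (top-level statements, C# 9 or later). Accuracy and RocPoint are hypothetical helper names, and the labels and scores below are made up to mirror the 86-of-87 result; they are not the actual test items.

```csharp
using System;

// Fraction of items whose predicted label equals the known label.
double Accuracy(int[] predicted, int[] actual)
{
    int correct = 0;
    for (int i = 0; i < actual.Length; ++i)
        if (predicted[i] == actual[i]) ++correct;
    return (double)correct / actual.Length;
}

// One point on an ROC curve for a given score threshold, treating label 1 as "positive".
// tpr = positives scored at or above the threshold / all positives
// fpr = negatives scored at or above the threshold / all negatives
void RocPoint(double[] scores, int[] actual, double threshold,
              out double tpr, out double fpr)
{
    int tp = 0, fp = 0, pos = 0, neg = 0;
    for (int i = 0; i < actual.Length; ++i)
    {
        bool flagged = scores[i] >= threshold;
        if (actual[i] == 1) { ++pos; if (flagged) ++tp; }
        else { ++neg; if (flagged) ++fp; }
    }
    tpr = pos == 0 ? 0.0 : (double)tp / pos;
    fpr = neg == 0 ? 0.0 : (double)fp / neg;
}

// 87 made-up labels with exactly one wrong prediction: 86 of 87 correct.
int[] known = new int[87];
for (int k = 0; k < known.Length; ++k) known[k] = k % 2;
int[] guessed = (int[])known.Clone();
guessed[0] = 1 - guessed[0];
Console.WriteLine(Accuracy(guessed, known).ToString("F3"));  // 86 / 87

// Four made-up scores to show a single ROC point at threshold 0.5.
RocPoint(new[] { 0.9, 0.6, 0.7, 0.2 }, new[] { 1, 0, 1, 0 }, 0.5,
         out double tprAtHalf, out double fprAtHalf);
Console.WriteLine($"TPR = {tprAtHalf}, FPR = {fprAtHalf}");
```

Sweeping the threshold from 1 down to 0 and plotting each (FPR, TPR) pair is what traces out the full ROC curve.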

More Than Just a Tool

The Azure ML Studio application, together with its back-end engine, the Microsoft Azure Machine Learning (ML) service, is much more than the client tool described in this short article. From a developer’s point of view, ML Studio dramatically simplifies the creation of prediction systems. But there are additional value propositions that are not so obvious.

One important topic not covered in this article is the ability of Azure ML to create and publish a Web service using just drag and drop and a few clicks. Azure ML automatically handles deployment, capacity provisioning, load balancing, auto-scaling and health monitoring. One pre-release customer estimated that with Azure ML, they were able to create a business solution (a fraud detection system) at a tiny fraction of the cost of using commercial analytics software.

Azure ML supports R, a popular data science programming language. Hundreds of existing open source R modules can be directly copied into an Azure ML system.

ML Studio allows easy collaboration. An experiment can be easily shared among several people. I’ve used this feature myself and it was much more efficient than my normal, back-and-forth e-mail conversation approach.

Azure ML provides more than just advantages to developers and data scientists. “Deploying advanced analytics is hard,” said Joseph Sirosh, Microsoft corporate vice president, Information Management and Machine Learning. “Enterprises are tired of paying high prices, recruiting expensive talent and waiting months to get results. Having the ability to quickly develop analytic models and deploy them without these bottlenecks is game-changing. Azure ML allows businesses to unlock value in their data and build systems to reduce expenses, grow revenue and serve their end customers better.”

Making Predictions

Once an ML Studio model has been created and evaluated, it can be used to make predictions on data with unknown outputs. I wanted to predict the political party of a hypothetical Representative who voted “yes,” “no,” “yes,” “no,” and so on, on the 16 legislative bills, and a second Representative who voted “yes” on the first eight bills and “no” on the remaining eight bills.

One approach would be to create and upload a new ML Studio Dataset and then score it in the same way as the training and test data sets. But for a limited amount of data, a more interactive approach is to use the Enter Data module as shown in Figure 8, which allows you to enter data manually.

Figure 8 Entering New Data to Predict

The format of the data in the module must exactly match the format of the data used to train the model, so the column headers are required. The output from the Enter Data module is combined with the output from the Train Model module in a Score Model module. After running the experiment, you can see the results by clicking on the Visualize option of the Score Model module, as shown earlier in Figure 2.

If you were making predictions using a procedural programming language, the code might resemble:

// The first item is a placeholder for the unknown party; the middle votes are elided.
string[] unknown = new string[] { "party", "y", "n", "y", /* . . */ "n" };
double result = lm.ComputeOutput(unknown);
if (result < 0.5)
  Console.WriteLine("Predicted party is democrat");
else
  Console.WriteLine("Predicted party is republican");

Again, some of the ML Studio information is likely to be a bit mysterious. Recall that each prediction ends with two trailing values, a predicted label and a numeric score:

unknown-party  y  n  y . . y  n  democrat    0.0013
unknown-party  y  y  y . . n  n  republican  0.7028

As it turns out, for Logistic Regression Binary Classification, an output value below 0.5 indicates the first class (democrat in this example) and an output value above 0.5 indicates the second class (republican). Keep in mind that, like Visual Studio, ML Studio has many features and generates a huge amount of information. You learn what the various pieces of information mean over time by using the system and tackling one new piece at a time, rather than doing a deep dive into the documentation and trying to learn everything at once.

No Code?!

This article just scratched the surface of ML Studio, but it provides enough information so you can replicate the voting experiment and give the tool a try. You may have noticed that this article doesn’t include any coding. No code? Bah! But one of the coolest things about ML Studio is that you can write your own custom modules using C#, and I’ll be covering this topic in future articles. My hunch is that ML Studio will generate an ecosystem in which developers write sophisticated, specialized modules and make them available both commercially and through free channels such as open source projects and blogs.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. He can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical experts for reviewing this article: Roger Barga (Machine Learning) and Michael Jones (Global Product Engineering)