Connect(); Special Issue 2018

Volume 33 Number 13

[Machine Learning]

ML.NET: The Machine Learning Framework for .NET Developers

By James McCaffrey

The ML.NET library is still in preview. All information is subject to change.

The ML.NET library is a new open source collection of machine learning (ML) code that can be used to create powerful prediction systems. Many ML libraries are written in C++ with a Python API for easier programming. Examples include scikit-learn, TensorFlow, CNTK and PyTorch. However, if you use a Python-based ML library to create a prediction model, it’s not so easy for a .NET application to use the trained model. Fortunately, the ML.NET library can be used directly in .NET applications. And because ML.NET can run on .NET Core, you can create predictive systems for macOS and Linux, too.

A good way to see where this article is headed is to take a look at the demo program in Figure 1. The demo creates an ML model that predicts the annual income for a person based on their age, sex and political leaning (conservative, moderate, liberal). Because the goal is to predict a numeric value, this is an example of a regression problem. If the goal had been to predict political leaning from age, sex and income, it would be a classification problem.

Figure 1 ML.NET Legacy Demo Program in Action

The demo uses a set of dummy training data with 30 items. After the model was trained, it was applied to the source data and achieved a root mean squared error of 1.2630. This error value is difficult to interpret by itself; regression error is most useful for comparing different models.

The demo concludes by using the trained model to predict the annual income for a 40-year-old male with a conservative political leaning. The predicted income is $72,401.38. The demo in Figure 1 was written using the ML.NET legacy approach, which is a good way for beginners to get a feel for ML.NET. In the second half of this introductory article, I’ll discuss a newer approach that’s somewhat more difficult to grasp but is the better approach for new development.

This article assumes you have intermediate or better programming skill with C#, but doesn’t assume you know anything about the ML.NET library. The complete code and data for the demo program are presented in this article and are also available in the accompanying file download. As I’m writing this article, the ML.NET library is still in preview mode and is being developed very quickly, so some of the information presented here may have changed a bit by the time you’re reading this.

The Demo Program

To create the demo program, I launched Visual Studio 2017. The ML.NET library will work with either the free Community Edition or one of the commercial editions of Visual Studio 2017. The ML.NET documentation states that Visual Studio 2017 is required, and in fact I couldn’t get the demo program to work with Visual Studio 2015. I created a new C# console application project and named it IncomePredict. The ML.NET library will work with either a classic .NET Framework or a .NET Core application type.

After the template code loaded, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to IncomeProgram.cs, and I allowed Visual Studio to automatically rename class Program for me. Next, in the Solution Explorer window, I right-clicked on the IncomePredict project and selected the Manage NuGet Packages option. In the NuGet window, I selected the Browse tab and then entered “ML.NET” in the Search field. The ML.NET library is housed in the Microsoft.ML package. I selected the latest version (0.7.0) and clicked the Install button. After a few seconds Visual Studio responded with a “successfully installed Microsoft.ML 0.7.0 to IncomePredict” message.

At this point I did a Build | Rebuild Solution and got a “supports only x64 architectures” error message. In the Solution Explorer window, I right-clicked on the IncomePredict project, and selected the Properties entry. In the Properties window, I selected the Build tab on the left, then changed the Platform Target entry from “Any CPU” to “x64.” I also made sure that I was targeting the 4.7 version of the .NET Framework. With earlier versions of the Framework I got an error related to one of the math library dependencies. I again did a Build | Rebuild Solution and was successful. When working with preview-mode libraries such as ML.NET, you should expect quite a few hiccups like this.

The Demo Data

After creating the skeleton of the demo program, the next step is to create the training data file. The data is presented in Figure 2. If you’re following along, in the Solution Explorer window, right-click on the IncomePredict project and select Add | New Folder and name the folder “Data.” Placing your data in a folder named Data isn’t required but it’s a standard practice. Right-click on the Data folder and select Add | New Item. From the new item dialog window, select the Text File type and name it PeopleData.txt.

Figure 2 People Data

48, +1, 4.40, liberal
60, -1, 7.89, conservative
25, -1, 5.48, moderate
66, -1, 3.41, liberal
40, +1, 8.05, conservative
44, +1, 4.56, liberal
80, -1, 5.91, liberal
52, -1, 6.69, conservative
56, -1, 4.01, moderate
55, -1, 4.48, liberal
72, +1, 5.97, conservative
57, -1, 6.71, conservative
50, -1, 6.40, liberal
80, -1, 6.67, moderate
69, +1, 5.79, liberal
39, -1, 9.42, conservative
68, -1, 7.61, moderate
47, +1, 3.24, conservative
18, +1, 4.29, liberal
79, +1, 7.44, conservative
44, -1, 2.55, liberal
52, +1, 4.71, moderate
55, +1, 5.56, liberal
76, -1, 7.80, conservative
32, -1, 5.94, liberal
46, +1, 5.52, moderate
48, -1, 7.25, conservative
58, +1, 5.71, conservative
44, +1, 2.52, liberal
68, -1, 8.38, conservative

Copy the data from Figure 2 and paste it into the editor window, being careful not to have any extra trailing blank lines.

The 30-item dataset is artificial. The first column is a person’s age. The second column indicates sex and is pre-encoded as male = -1 and female = +1. The ML.NET library has methods to encode text data, so the data could’ve used “male” and “female.” The third column is the annual income to predict, with values divided by 10,000 (so 4.40 represents $44,000). The last column specifies the political leaning (conservative, moderate, liberal).
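
As a concrete illustration of this encoding, here’s a minimal plain-C# sketch that decodes one raw data line. It’s not part of the demo, and the LineDecoder name is just for illustration; in the demo program, ML.NET’s TextLoader handles this parsing for you:

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class LineDecoder
{
  // Decodes one comma-delimited line of the people data into typed fields.
  public static (float Age, float Sex, float Income, string Politic)
    Decode(string line)
  {
    string[] tok = line.Split(',').Select(t => t.Trim()).ToArray();
    float age = float.Parse(tok[0], CultureInfo.InvariantCulture);
    float sex = float.Parse(tok[1], CultureInfo.InvariantCulture);    // -1 = male, +1 = female
    float income = float.Parse(tok[2], CultureInfo.InvariantCulture); // units of $10,000
    return (age, sex, income, tok[3]);
  }
}
```

For example, decoding "40, +1, 8.05, conservative" gives an Income field of 8.05, which corresponds to an annual income of $80,500.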

Because the data has three predictor variables (age, sex, politic), it’s not possible to display it in a two-dimensional graph. But you can get a good idea of the structure of the data by examining the graph of just age and annual income in Figure 3. The graph shows that age by itself can’t be used to get an accurate prediction of income.

Figure 3 Income Data

After you create the training data in a Data folder, you should create a folder named Models to hold the saved model because the demo code assumes a Models folder exists.

The Program Code

The complete demo code, with a few minor edits to save space, is presented in Figure 4. After the template code loaded into Visual Studio, at the top of the Editor window I removed all namespace references and replaced them with the ones shown in the code listing. The various Microsoft.ML namespaces house all ML.NET functionality. The System.Threading.Tasks namespace is needed to save or load a trained ML.NET legacy model to file.

Figure 4 ML.NET Legacy Example Program

using System;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Legacy;
using Microsoft.ML.Legacy.Data;
using Microsoft.ML.Legacy.Transforms;
using Microsoft.ML.Legacy.Trainers;
using Microsoft.ML.Legacy.Models;
using System.Threading.Tasks;
// Microsoft.ML 0.7.0  Framework 4.7 Build x64
namespace IncomePredict
{
  class IncomeProgram
  {
    public class IncomeData {
      [Column("0")] public float Age;
      [Column("1")] public float Sex;
      [Column("2")] public float Income;
      [Column("3")] public string Politic;
    }
    public class IncomePrediction {
      [ColumnName("Score")]
      public float Income;
    }
    static void Main(string[] args)
    {
      Console.WriteLine("Begin ML.NET demo run");
      Console.WriteLine("Income from age, sex, politics");
      var pipeline = new LearningPipeline();
      string dataPath = "..\\..\\Data\\PeopleData.txt";
      pipeline.Add(new TextLoader(dataPath).
        CreateFrom<IncomeData>(separator: ','));
      pipeline.Add(new ColumnCopier(("Income", "Label")));
      pipeline.Add(new CategoricalOneHotVectorizer("Politic"));
      pipeline.Add(new ColumnConcatenator("Features", "Age",
        "Sex", "Politic"));
      var sdcar = new StochasticDualCoordinateAscentRegressor();
      sdcar.MaxIterations = 1000;
      sdcar.NormalizeFeatures = NormalizeOption.Auto;
      pipeline.Add(sdcar);
      Console.WriteLine("\nStarting training \n");
      var model = pipeline.Train<IncomeData, IncomePrediction>();
      Console.WriteLine("\nTraining complete \n");
      string modelPath = "..\\..\\Models\\IncomeModel.zip";
      Task.Run(async () =>
      {
        await model.WriteAsync(modelPath);
      }).GetAwaiter().GetResult();
      var testData = new TextLoader(dataPath).
        CreateFrom<IncomeData>(separator: ',');
      var evaluator = new RegressionEvaluator();
      var metrics = evaluator.Evaluate(model, testData);
      double rms = metrics.Rms;
      Console.WriteLine("Root mean squared error = " +
        rms.ToString("F4"));
      Console.WriteLine("Income for age 40 conservative male: ");
      IncomeData newPerson = new IncomeData() { Age = 40.0f,
        Sex = -1f, Politic = "conservative" };
      IncomePrediction prediction = model.Predict(newPerson);
      float predIncome = prediction.Income * 10000;
      Console.WriteLine("Predicted income = $" +
        predIncome.ToString("F2"));
      Console.WriteLine("\nEnd ML.NET demo");
      Console.ReadLine();
    } // Main
  } // Program
} // ns

Notice that most of the namespaces have a “Legacy” identifier. The demo program uses what’s called the pipeline API, which is simple and effective. The ML.NET team is adding a new, more flexible API, which I’ll discuss shortly.

The program defines a nested class named IncomeData that describes the internal structure of the training data. For example, the first column is:

[Column("0")]
public float Age;

Notice that the age field is declared type float rather than type double. In most ML systems, type float is the default numeric type because the increase in precision you get from using type double is rarely worth the resulting memory and performance penalty. Predictor field names can be specified using the ColumnName attribute. They’re optional and can be whatever you like, for example [ColumnName("Age")].

The demo program defines a nested class named IncomePrediction to hold model predictions:

public class IncomePrediction {
  [ColumnName("Score")]
  public float Income;
}

The column name “Score” is required but, as shown, the associated variable identifier (Income, in this case) doesn’t have to match.

Creating and Training the Model

The demo program sets up an untrained ML model using these statements:

var pipeline = new LearningPipeline();
string dataPath = "..\\..\\Data\\PeopleData.txt";
pipeline.Add(new TextLoader(dataPath).
  CreateFrom<IncomeData>(separator: ','));

You can think of a LearningPipeline object as a meta container that holds the training data and a training algorithm. This paradigm is quite a bit different from those used by other ML libraries. Next, the pipeline performs some data manipulation:

pipeline.Add(new ColumnCopier(("Income", "Label")));
pipeline.Add(new CategoricalOneHotVectorizer("Politic"));
pipeline.Add(new ColumnConcatenator("Features", "Age",
  "Sex", "Politic"));

The legacy version of ML.NET requires that the column that holds the values to predict be identified as “Label” so the ColumnCopier method creates an in-memory duplicate of the Income column. An alternative approach is to simply name the Income column as Label in the class definition that defines the structure of the training data.

The trainer works only with numeric data, so the text values of the Politic column must be converted to a numeric form. The Categorical­OneHotVectorizer method converts “conservative,” “moderate,” and “liberal” to (1, 0, 0), (0, 1, 0) and (0, 0, 1), respectively. An alternative approach is to manually pre-encode text data.
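
For illustration only, the one-hot transformation is easy to sketch by hand in plain C#. The OneHot class and the explicit category list here are hypothetical; CategoricalOneHotVectorizer infers the categories from the training data:

```csharp
using System;

public static class OneHot
{
  // Returns a vector with 1.0 at the index of the matching category
  // and 0.0 everywhere else.
  public static float[] Encode(string value, string[] categories)
  {
    int idx = Array.IndexOf(categories, value);
    if (idx < 0)
      throw new ArgumentException("Unknown category: " + value);
    float[] vec = new float[categories.Length];
    vec[idx] = 1.0f;
    return vec;
  }
}
```

For example, Encode("moderate", new[] { "conservative", "moderate", "liberal" }) yields (0, 1, 0).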

The ColumnConcatenator method combines the three predictor columns into a single column named Features. This naming scheme is required. The training algorithm is added to the pipeline, and the model is trained like so:

var sdcar = new StochasticDualCoordinateAscentRegressor();
sdcar.MaxIterations = 1000;
sdcar.NormalizeFeatures = NormalizeOption.Auto;
pipeline.Add(sdcar);
var model = pipeline.Train<IncomeData, IncomePrediction>();

Stochastic dual coordinate ascent is a relatively simple algorithm for training a linear-form regression model. Other legacy regression trainers include FastForestRegressor, FastTreeRegressor, GeneralizedAdditiveModelRegressor, LightGbmRegressor, OnlineGradientDescentRegressor and OrdinaryLeastSquaresRegressor. Each of these has strengths and weaknesses so there’s no one best algorithm for a regression problem. Understanding the differences between each type of regressor and classifier is not trivial and requires quite a bit of poring through documentation.

Once the pipeline object has been set up, training the model is a one-statement operation. If you refer back to the output shown in Figure 1, you’ll notice that the Train method does a lot of behind-the-scenes work for you. Because the pipeline uses automatic normalization, the trainer analyzed the Age and Income columns and decided that they should be scaled using min-max normalization. This converts all age and income values to values between 0.0 and 1.0 so that relatively large values (such as an age of 52) don’t overwhelm smaller values (such as an income of 4.58). Normalization usually, but not always, improves the accuracy of the resulting model.
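
Min-max normalization itself is simple to express. Here’s a hand-rolled sketch for illustration (the Normalizer class is hypothetical; with NormalizeOption.Auto, the pipeline applies this scaling for you):

```csharp
using System;
using System.Linq;

public static class Normalizer
{
  // Rescales each value to [0.0, 1.0] using (v - min) / (max - min).
  public static float[] MinMax(float[] values)
  {
    float min = values.Min(), max = values.Max(), range = max - min;
    return values.Select(v => range == 0f ? 0f : (v - min) / range).ToArray();
  }
}
```

Applied to the Age column, the minimum age (18) maps to 0.0, the maximum age (80) maps to 1.0, and all other ages fall in between.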

The Train method also uses L1 and L2 regularization, which is another standard ML technique to improve the accuracy of a model. Briefly, regularization discourages extreme weight values in the model, which in turn discourages model overfitting. To reiterate, ML.NET does all kinds of advanced processing, without you having to explicitly configure parameter values. Nice!
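
Conceptually, L1 and L2 regularization add penalty terms to the base training loss so that large weights cost something. This minimal sketch is illustrative only; the lambda values are made up, and ML.NET manages its own regularization coefficients internally:

```csharp
using System;
using System.Linq;

public static class Regularization
{
  // Augments a base loss with an L1 penalty (sum of absolute weights)
  // and an L2 penalty (sum of squared weights).
  public static double PenalizedLoss(double baseLoss, double[] weights,
    double lambda1, double lambda2)
  {
    double l1 = weights.Sum(w => Math.Abs(w));
    double l2 = weights.Sum(w => w * w);
    return baseLoss + lambda1 * l1 + lambda2 * l2;
  }
}
```

Because the penalties grow with the magnitude of the weights, training is nudged toward smaller weight values, which discourages overfitting.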

Saving and Evaluating the Model

After the regression model has been trained, it’s saved to disk like so:

string modelPath = "..\\..\\Models\\IncomeModel.zip";
Task.Run(async () =>
{
  await model.WriteAsync(modelPath);
}).GetAwaiter().GetResult();

The code assumes the existence of a directory named Models that’s two levels above the program executable. An alternative is to hard code the path. Because the WriteAsync method is asynchronous, calling it from synchronous code takes a bit of care. There are several approaches you can use; the one I prefer is the wrapper technique shown. The lack of a non-async method to save an ML.NET model is a bit surprising, even for a library that’s in preview mode.

The model is evaluated with these statements:

var testData = new TextLoader(dataPath).
  CreateFrom<IncomeData>(separator: ',');
var evaluator = new RegressionEvaluator();
var metrics = evaluator.Evaluate(model, testData);
double rms = metrics.Rms;
Console.WriteLine("Root mean squared error = " +
  rms.ToString("F4"));

In most ML scenarios you’d have two data files—one used just for training, and a second test dataset used only for model evaluation. For simplicity, the demo program reuses the single 30-item data file for model evaluation.

The Evaluate method returns an aggregate object that holds the root mean squared value for the trained model applied to the test data. Other metrics returned by a regression evaluator include R-squared (the coefficient of determination) and L1 (sum of absolute errors).
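
To make these metrics concrete, here’s a plain-C# sketch of how each one is computed from actual and predicted values. The RegressionMetrics class here is hypothetical and separate from ML.NET’s RegressionEvaluator, which computes these for you:

```csharp
using System;
using System.Linq;

public static class RegressionMetrics
{
  // Root mean squared error: sqrt of the average squared difference.
  public static double Rmse(double[] actual, double[] predicted) =>
    Math.Sqrt(actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average());

  // L1 loss: sum of absolute errors.
  public static double L1(double[] actual, double[] predicted) =>
    actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Sum();

  // R-squared (coefficient of determination): 1 - SSres / SStot.
  public static double RSquared(double[] actual, double[] predicted)
  {
    double mean = actual.Average();
    double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
    double ssTot = actual.Sum(a => (a - mean) * (a - mean));
    return 1.0 - ssRes / ssTot;
  }
}
```

A perfect model gives an RMSE and L1 of 0.0 and an R-squared of 1.0.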

For many ML problems, the most useful metric is prediction accuracy. There’s no inherent definition of accuracy for a regression problem because you must define what it means for a prediction to be correct. The usual approach is to write a custom function where a predicted value is counted as correct if it’s within a given percentage of the true value in the training data. For example, if you set the delta percentage to 0.10 and if a true income value is 6.00, a correct prediction is one between 5.40 and 6.60.
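
A custom accuracy function along those lines might be sketched like this (illustrative only; the function name and signature are my own, and the code assumes positive target values such as the income data):

```csharp
using System;

public static class CustomMetrics
{
  // Fraction of predictions that fall within +/- delta percent
  // of the corresponding true value.
  public static double Accuracy(double[] actual, double[] predicted,
    double delta)
  {
    int correct = 0;
    for (int i = 0; i < actual.Length; ++i)
    {
      double lo = actual[i] * (1.0 - delta);  // e.g. 6.00 -> 5.40
      double hi = actual[i] * (1.0 + delta);  // e.g. 6.00 -> 6.60
      if (predicted[i] >= lo && predicted[i] <= hi)
        ++correct;
    }
    return (double)correct / actual.Length;
  }
}
```

With delta = 0.10, a prediction of 6.50 for a true income of 6.00 counts as correct, but a prediction of 7.00 does not.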

Using the Trained Model

The demo program predicts the annual income for a 40-year-old conservative male like so:

Console.WriteLine("Income for age 40 conservative male: ");
IncomeData newPerson = new IncomeData() { Age = 40.0f,
  Sex = -1f, Politic = "conservative" };
IncomePrediction prediction = model.Predict(newPerson);
float predIncome = prediction.Income * 10000;
Console.WriteLine("Predicted income = $" +
  predIncome.ToString("F2"));

Notice that the numeric literals for age and sex use the “f” modifier because the model is expecting type float values. In this example, the trained model was available because the program just finished training. If you wanted to make a prediction from a different program, you’d load the trained model using the ReadAsync method along the lines of:

PredictionModel<IncomeData, IncomePrediction> model = null;
Task.Run(async () =>
{
  model = await PredictionModel.ReadAsync<IncomeData,
    IncomePrediction>(modelPath);
}).GetAwaiter().GetResult();

After loading the model into memory, you’d use it by calling the Predict method as shown earlier.

The New ML.NET API Approach

The legacy pipeline approach is simple and effective, and it provides a consistent interface for using the ML.NET classifiers and regressors. But the legacy approach has some architectural characteristics that limit the library’s extensibility, so the ML.NET team has created a new, more flexible approach, which is best explained by example.

Suppose you have the exact same dataset—age, sex, income, politics. And suppose you want to predict political leaning from the other three variables. A demo program to create a classifier using the new ML.NET API approach is presented in Figure 5. This program uses a hybrid combination of old style and new style and is intended to create a bridge between the two. Let me emphasize that the latest code examples in the ML.NET documentation will provide you with more sophisticated, and in some cases better, techniques.

Figure 5 Classification Example Code Listing

using System;
using Microsoft.ML;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Transforms.Conversions;
// Microsoft.ML 0.7.0  Framework 4.7 Build x64
namespace PoliticPredict
{
  class PoliticProgram
  {
    public class PoliticData {
      [Column("0")] public float Age;
      [Column("1")] public float Sex;
      [Column("2")] public float Income;
      [Column("3")]
      [ColumnName("Label")]
      public string Politic;
    }
    public class PoliticPrediction  {
      [ColumnName("PredictedLabel")]
      public string PredictedPolitic;
    }
    static void Main(string[] args)
    {
      var ctx = new MLContext(seed: 1);
      string dataPath = "..\\..\\Data\\PeopleData.txt";
      TextLoader textLoader =
        ctx.Data.TextReader(new TextLoader.Arguments()
      {
        Separator = ",", HasHeader = false,
        Column = new[] {
          new TextLoader.Column("Age", DataKind.R4, 0),
          new TextLoader.Column("Sex", DataKind.R4, 1),
          new TextLoader.Column("Income", DataKind.R4, 2),
          new TextLoader.Column("Label", DataKind.Text, 3)
        }
      });
      var data = textLoader.Read(dataPath);
      var est = ctx.Transforms.Categorical.MapValueToKey("Label")
       .Append(ctx.Transforms.Concatenate("Features", "Age",
         "Sex", "Income"))
       .Append(ctx.MulticlassClassification.Trainers
         .StochasticDualCoordinateAscent("Label", "Features",
         maxIterations: 1000))
       .Append(new KeyToValueEstimator(ctx, "PredictedLabel"));
      var model = est.Fit(data);
      var prediction = model.MakePredictionFunction<PoliticData,
        PoliticPrediction>(ctx).Predict(
          new PoliticData() {
            Age = 40.0f, Sex = -1.0f, Income = 8.55f
          });
      Console.WriteLine("Predicted party is: " +
        prediction.PredictedPolitic);
      Console.ReadLine();
    } // Main
  } // Program
} // ns

At a very high level, many ML tasks have five phases: load and transform training data into memory, create a model, train the model, evaluate and save the model, use the model. Both the legacy and new ML.NET APIs can perform these operations, but the new approach is clearly superior (in my opinion, anyway) for realistic ML scenarios in a production system.

A key feature of the new ML.NET API is the MLContext class. Notice that the ctx object is used when reading the training data, when creating the prediction model and when making a prediction.

Although it’s not apparent from the code, another advantage of the new API over the legacy approach is that you can read training data from multiple files. I don’t encounter this scenario often, but when I do, the ability to read multiple files is a huge time-saver.

Another feature of the new API is the ability to create prediction models in two different ways, called static and dynamic. The static approach gives you full Visual Studio IntelliSense capabilities during development. The dynamic approach can be used when the structure of your data must be determined at run time.

Wrapping Up

If you’ve gone through this article and run and understood the relatively simple demo code, your next step should be to take a full plunge into the new ML.NET API. Unlike many open source projects that have weak or skimpy documentation, the ML.NET documentation is excellent. I can recommend the examples at bit.ly/2AVM1oL as a great place to begin.

Even though the ML.NET library is new, its origins go back many years. Shortly after the introduction of the Microsoft .NET Framework in 2002, Microsoft Research began a project called TMSN (“text mining search and navigation”) to enable software developers to include ML code in Microsoft products and technologies. The project was very successful, and over the years grew in size and usage internally at Microsoft. Somewhere around 2011 the library was renamed to TLC (“the learning code”). TLC is widely used within Microsoft and is currently in version 3.10. The ML.NET library is a descendant of TLC, with Microsoft-specific features removed. I’ve used both libraries and, in many ways, the ML.NET child has surpassed its parent.

This article has just scratched the surface of the ML.NET library. An interesting and powerful new capability of ML.NET is the ability to consume and use deep neural network models created by other systems, such as PyTorch and CNTK. The key to this interoperability is the Open Neural Network Exchange (ONNX) standard. But that’s a topic for a future article.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Ankit Asthana, Chris Lauren, Cesar De la Torre Llorente, Beth Massi, Shahab Moradi, Gal Oshri, Shauheen Zahirazami

