Load data from files and other sources

  • 2024-09-13

Learn how to load data into ML.NET for processing and training, using the API. The data is originally stored in files or other data sources such as databases, JSON, XML, or in-memory collections.

If you're using Model Builder, see Load training data into Model Builder.

Create the data model

ML.NET enables you to define data models via classes. For example, given the following input data:

Size (Sq. ft.), HistoricalPrice1 ($), HistoricalPrice2 ($), HistoricalPrice3 ($), Current Price ($)
700, 100000, 3000000, 250000, 500000
1000, 600000, 400000, 650000, 700000

Create a data model that represents this data:

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

Annotate the data model with column attributes

Attributes give ML.NET more information about the data model and the data source.

The LoadColumn attribute specifies the column indices of your properties.

Important

LoadColumn is only required when loading data from a file.

Load columns as:

  • Individual columns, like Size and CurrentPrice in the HousingData class.
  • Multiple columns at a time in the form of a vector, like HistoricalPrices in the HousingData class.

If you have a vector property, apply the VectorType attribute to the property in your data model. All of the elements in the vector must be the same type. Keeping the columns separate makes feature engineering easier and more flexible, but with a very large number of columns, operating on individual columns slows down training.

ML.NET operates through column names. If you want to change the name of a column to something other than the property name, use the ColumnName attribute. When you create in-memory objects, you still use the property name. For data processing and building machine learning models, however, ML.NET refers to the property by the value provided in the ColumnName attribute.
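The difference between the property name and the column name can be seen by inspecting the schema of a loaded IDataView. The following is a minimal sketch, assuming the Microsoft.ML NuGet package is referenced; the ColumnNameDemo class and its GetColumnNames helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    public float Size { get; set; }

    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    // Exposed to ML.NET as a column named "Label".
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, not part of ML.NET.
public static class ColumnNameDemo
{
    // Returns the column names that ML.NET sees for HousingData.
    public static string[] GetColumnNames()
    {
        var mlContext = new MLContext();

        // In C# code, the object is still built with the property name.
        var rows = new[]
        {
            new HousingData
            {
                Size = 700f,
                HistoricalPrices = new[] { 100000f, 3000000f, 250000f },
                CurrentPrice = 500000f
            }
        };

        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // The schema uses the ColumnName value instead of the property name.
        return data.Schema.Select(column => column.Name).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", GetColumnNames()));
    }
}
```

The returned names should include Label and should not include CurrentPrice, which is why later pipeline steps refer to the column as "Label".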

Load data from a single file

To load data from a file, use the LoadFromTextFile method along with the data model for the data to be loaded. Because the separatorChar parameter defaults to the tab character, change it as needed for your data file. If your file has a header, set the hasHeader parameter to true to skip the first line and begin loading data from the second line.

//Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<HousingData>("my-data-file.csv", separatorChar: ',', hasHeader: true);
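To check that separatorChar and hasHeader are set correctly, it can help to read a few rows back as objects. The sketch below is one way to do that, assuming the Microsoft.ML NuGet package is referenced. It writes a small sample file to the temp folder (the file name and the SingleFileDemo class are hypothetical), loads it, and converts the IDataView back into HousingData objects with CreateEnumerable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, not part of ML.NET.
public static class SingleFileDemo
{
    public static List<HousingData> LoadRows()
    {
        // Write a comma-separated sample file with one header line.
        string path = Path.Combine(Path.GetTempPath(), "housing-sample.csv");
        File.WriteAllLines(path, new[]
        {
            "Size,HistoricalPrice1,HistoricalPrice2,HistoricalPrice3,CurrentPrice",
            "700,100000,3000000,250000,500000",
            "1000,600000,400000,650000,700000"
        });

        var mlContext = new MLContext();
        IDataView data = mlContext.Data.LoadFromTextFile<HousingData>(
            path, separatorChar: ',', hasHeader: true);

        // Materialize the rows before deleting the file, because the
        // IDataView reads the file only when rows are requested.
        List<HousingData> rows = mlContext.Data
            .CreateEnumerable<HousingData>(data, reuseRowObject: false)
            .ToList();

        File.Delete(path);
        return rows;
    }

    public static void Main()
    {
        foreach (HousingData row in LoadRows())
        {
            Console.WriteLine(
                $"{row.Size} | {string.Join(", ", row.HistoricalPrices)} | {row.CurrentPrice}");
        }
    }
}
```

If the separator is wrong, the values typically don't land in the expected properties, so printing the rows is a quick sanity check.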

Load data from multiple files

If your data is stored in multiple files that share the same schema, ML.NET can load them together, whether the files are in the same directory or in multiple directories.

Load from files in a single directory

When all of your data files are in the same directory, use wildcards in the LoadFromTextFile method.

//Create MLContext
MLContext mlContext = new MLContext();

//Load Data File
IDataView data = mlContext.Data.LoadFromTextFile<HousingData>("Data/*", separatorChar: ',', hasHeader: true);

Load from files in multiple directories

To load data from multiple directories, use the CreateTextLoader method to create a TextLoader. Then, use the TextLoader.Load method and specify the individual file paths (wildcards can't be used).

//Create MLContext
MLContext mlContext = new MLContext();

// Create TextLoader
TextLoader textLoader = mlContext.Data.CreateTextLoader<HousingData>(separatorChar: ',', hasHeader: true);

// Load Data
IDataView data = textLoader.Load("DataFolder/SubFolder1/1.txt", "DataFolder/SubFolder2/1.txt");
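Because the Load method takes explicit paths rather than wildcards, one option is to collect the paths yourself with Directory.GetFiles and pass the resulting array. The following is a sketch under that assumption, not the only way to do it. The MultiDirectoryDemo class, its helper methods, and the temporary folder layout are hypothetical, and the Microsoft.ML NuGet package is assumed to be referenced.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, not part of ML.NET.
public static class MultiDirectoryDemo
{
    // Collects every .txt file under the root folder, including subfolders.
    public static string[] FindDataFiles(string rootFolder) =>
        Directory.GetFiles(rootFolder, "*.txt", SearchOption.AllDirectories);

    // Loads all discovered files through one TextLoader and counts the rows.
    public static int CountRows(string rootFolder)
    {
        var mlContext = new MLContext();
        TextLoader textLoader = mlContext.Data.CreateTextLoader<HousingData>(
            separatorChar: ',', hasHeader: true);

        IDataView data = textLoader.Load(FindDataFiles(rootFolder));

        return mlContext.Data
            .CreateEnumerable<HousingData>(data, reuseRowObject: false)
            .Count();
    }

    // Builds a temporary folder with two subfolders, one data file in each.
    public static string CreateSampleFolder()
    {
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        string header = "Size,HistoricalPrice1,HistoricalPrice2,HistoricalPrice3,CurrentPrice";

        string first = Directory.CreateDirectory(Path.Combine(root, "SubFolder1")).FullName;
        File.WriteAllLines(Path.Combine(first, "1.txt"),
            new[] { header, "700,100000,3000000,250000,500000" });

        string second = Directory.CreateDirectory(Path.Combine(root, "SubFolder2")).FullName;
        File.WriteAllLines(Path.Combine(second, "1.txt"),
            new[] { header, "1000,600000,400000,650000,700000" });

        return root;
    }

    public static void Main()
    {
        string root = CreateSampleFolder();
        Console.WriteLine($"Files found: {FindDataFiles(root).Length}");
        Console.WriteLine($"Rows loaded: {CountRows(root)}");
        Directory.Delete(root, recursive: true);
    }
}
```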

Load data from a relational database

ML.NET supports loading data from a variety of relational databases supported by System.Data, which include SQL Server, Azure SQL Database, Oracle, SQLite, PostgreSQL, Progress, and IBM DB2.

Note

To use DatabaseLoader, reference the System.Data.SqlClient NuGet package.

Given a database with a table named House and the following schema:

CREATE TABLE [House] (
    [HouseId] INT NOT NULL IDENTITY,
    [Size] INT NOT NULL,
    [NumBed] INT NOT NULL,
    [Price] REAL NOT NULL,
    CONSTRAINT [PK_House] PRIMARY KEY ([HouseId])
);

The data can be modeled by a class like HouseData:

public class HouseData
{
    public float Size { get; set; }
    public float NumBed { get; set; }
    public float Price { get; set; }
}

Then, inside of your application, create a DatabaseLoader.

MLContext mlContext = new MLContext();

DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader<HouseData>();

Define your connection string as well as the SQL command to be executed on the database, and create a DatabaseSource instance. This sample uses a LocalDB SQL Server database with a file path. However, DatabaseLoader supports any other valid connection string for databases on-premises and in the cloud.

Important

Microsoft recommends that you use the most secure authentication flow available. If you're connecting to Azure SQL, Managed Identities for Azure resources is the recommended authentication method.

string connectionString = @"Data Source=(LocalDB)\MSSQLLocalDB;AttachDbFilename=<YOUR-DB-FILEPATH>;Database=<YOUR-DB-NAME>;Integrated Security=True;Connect Timeout=30";

string sqlCommand = "SELECT CAST(Size as REAL) as Size, CAST(NumBed as REAL) as NumBed, Price FROM House";

DatabaseSource dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlCommand);

Numerical data that isn't of type Real must be converted to Real. The Real type is represented as a single-precision floating-point value, or Single, which is the input type expected by ML.NET algorithms. In this sample, the Size and NumBed columns are integers in the database, so the query converts them to Real with the built-in CAST function. Because the Price column is already of type Real, it's loaded as-is.

Use the Load method to load the data into an IDataView.

IDataView data = loader.Load(dbSource);

Load images

To load image data from a directory, first create a model that includes the image path and a label. ImagePath is the absolute path of the image in the data source directory. Label is the class or category of the actual image file.

public class ImageData
{
    [LoadColumn(0)]
    public string ImagePath;

    [LoadColumn(1)]
    public string Label;
}

public static IEnumerable<ImageData> LoadImagesFromDirectory(string folder,
            bool useFolderNameAsLabel = true)
{
    string[] files = Directory.GetFiles(folder, "*", searchOption: SearchOption.AllDirectories);

    foreach (string file in files)
    {
        if (Path.GetExtension(file) != ".jpg")
            continue;

        string label = Path.GetFileName(file);

        if (useFolderNameAsLabel)
            label = Directory.GetParent(file).Name;
        else
        {
            for (int index = 0; index < label.Length; index++)
            {
                if (!char.IsLetter(label[index]))
                {
                    label = label.Substring(0, index);
                    break;
                }
            }
        }

        yield return new ImageData()
        {
            ImagePath = file,
            Label = label
        };
    }
}

Then load the images:

IEnumerable<ImageData> images = LoadImagesFromDirectory(
                folder: "your-image-directory-path",
                useFolderNameAsLabel: true
                );
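The result at this point is an IEnumerable of ImageData rather than an IDataView. One plausible next step is to pass the collection to LoadFromEnumerable, as in the sketch below, which assumes the Microsoft.ML NuGet package is referenced. The ImageDataViewDemo class, the ToDataView helper, and the sample paths are hypothetical, and a hand-built list stands in for the output of LoadImagesFromDirectory.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

public class ImageData
{
    [LoadColumn(0)]
    public string ImagePath;

    [LoadColumn(1)]
    public string Label;
}

// Hypothetical helper class, not part of ML.NET.
public static class ImageDataViewDemo
{
    // Wraps a collection of image records in an IDataView.
    public static IDataView ToDataView(MLContext mlContext, IEnumerable<ImageData> images) =>
        mlContext.Data.LoadFromEnumerable(images);

    public static void Main()
    {
        // Stand-in for the output of LoadImagesFromDirectory (paths are made up).
        var images = new List<ImageData>
        {
            new ImageData { ImagePath = "images/cat/cat1.jpg", Label = "cat" },
            new ImageData { ImagePath = "images/dog/dog1.jpg", Label = "dog" }
        };

        var mlContext = new MLContext();
        IDataView data = ToDataView(mlContext, images);

        // Print the resulting schema: one column per public field.
        foreach (DataViewSchema.Column column in data.Schema)
        {
            Console.WriteLine($"{column.Name}: {column.Type}");
        }
    }
}
```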

To load raw images into memory from a directory, create a model to hold the raw image byte array and label:

public class InMemoryImageData
{
    [LoadColumn(0)]
    public byte[] Image;

    [LoadColumn(1)]
    public string Label;
}

static IEnumerable<InMemoryImageData> LoadInMemoryImagesFromDirectory(
    string folder,
    bool useFolderNameAsLabel = true
    )
{
    string[] files = Directory.GetFiles(folder, "*",
        searchOption: SearchOption.AllDirectories);

    foreach (string file in files)
    {
        if (Path.GetExtension(file) != ".jpg")
            continue;

        string label = Path.GetFileName(file);

        if (useFolderNameAsLabel)
            label = Directory.GetParent(file).Name;
        else
        {
            for (int index = 0; index < label.Length; index++)
            {
                if (!char.IsLetter(label[index]))
                {
                    label = label.Substring(0, index);
                    break;
                }
            }
        }

        yield return new InMemoryImageData()
        {
            Image = File.ReadAllBytes(file),
            Label = label
        };
    }
}

Load data from other sources

In addition to loading data stored in files, ML.NET supports loading data from sources that include:

  • In-memory collections
  • JSON/XML

When working with streaming sources, ML.NET expects input to be in the form of an in-memory collection. Therefore, when working with sources like JSON/XML, make sure to format the data into an in-memory collection.

Given the following in-memory collection:

HousingData[] inMemoryCollection = new HousingData[]
{
    new HousingData
    {
        Size = 700f,
        HistoricalPrices = new float[]
        {
            100000f, 3000000f, 250000f
        },
        CurrentPrice = 500000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[]
        {
            600000f, 400000f, 650000f
        },
        CurrentPrice = 700000f
    }
};

Load the in-memory collection into an IDataView with the LoadFromEnumerable method:

Important

LoadFromEnumerable assumes that the IEnumerable it loads from is thread-safe.

// Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromEnumerable<HousingData>(inMemoryCollection);
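For JSON, the same approach applies once the text has been deserialized into a collection. The sketch below uses System.Text.Json, which is an assumption on top of this article (it ships with .NET Core 3.0 and later and is a separate package elsewhere), along with the Microsoft.ML NuGet package. The JsonLoadDemo class, its helpers, and the sample JSON are hypothetical, and the JSON property names are chosen to match the HousingData property names exactly.

```csharp
using System;
using System.Text.Json;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, not part of ML.NET.
public static class JsonLoadDemo
{
    // Sample input; property names match HousingData (matching is case-sensitive by default).
    public const string SampleJson = @"[
  { ""Size"": 700, ""HistoricalPrices"": [100000, 3000000, 250000], ""CurrentPrice"": 500000 },
  { ""Size"": 1000, ""HistoricalPrices"": [600000, 400000, 650000], ""CurrentPrice"": 700000 }
]";

    // Step 1: turn the JSON text into an in-memory collection.
    public static HousingData[] Parse(string json) =>
        JsonSerializer.Deserialize<HousingData[]>(json) ?? Array.Empty<HousingData>();

    // Step 2: hand the collection to ML.NET.
    public static IDataView Load(MLContext mlContext, string json) =>
        mlContext.Data.LoadFromEnumerable(Parse(json));

    public static void Main()
    {
        var mlContext = new MLContext();
        IDataView data = Load(mlContext, SampleJson);

        foreach (DataViewSchema.Column column in data.Schema)
        {
            Console.WriteLine($"{column.Name}: {column.Type}");
        }
    }
}
```

An XML source could follow the same two steps with an XML deserializer in place of JsonSerializer.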
