Dimensionality reduction overview
Dimensionality reduction is the common term for a set of mathematical techniques used to capture the shape and relationships of data in a high-dimensional space and translate this information into a low-dimensional space.
Reducing dimensionality is important when you are working with large datasets that can contain thousands of features. In such a large data space, the wider range of distances between data points can make model output harder to interpret. For example, it makes it difficult to understand which data points are more closely situated and therefore represent more similar data. Dimensionality reduction helps you reduce the number of features while retaining the most important characteristics of the dataset. Reducing the number of features also helps reduce the training time of any models that use the data as input.
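To make the idea of reducing features while retaining the most important characteristics concrete, here is a minimal sketch of principal component analysis in plain Python. It is an illustration only, not how BigQuery ML implements its models: the sample rows, the helper names, and the use of power iteration to find the first principal component are all assumptions chosen to keep the example short.

```python
import math

def mean_center(rows):
    """Subtract each feature's mean so the data is centered at the origin."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]

def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov with power iteration."""
    dims = len(cov)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Reduce each row to a single value: its coordinate along the component."""
    return [sum(a * b for a, b in zip(r, component)) for r in centered]

# Hypothetical dataset: three features that are almost perfectly correlated.
rows = [
    [1.0, 2.0, 3.1],
    [2.0, 4.1, 6.0],
    [3.0, 5.9, 9.2],
    [4.0, 8.0, 11.9],
    [5.0, 10.1, 15.0],
]

centered = mean_center(rows)
cov = covariance_matrix(centered)
pc1 = first_principal_component(cov)
scores = project(centered, pc1)  # one feature per row instead of three

# Share of the total variance that survives the reduction to one feature.
total_variance = sum(cov[i][i] for i in range(len(cov)))
pc1_variance = sum(s * s for s in scores) / (len(scores) - 1)
explained_ratio = pc1_variance / total_variance

print("first principal component:", pc1)
print("explained variance ratio:", explained_ratio)
```

Because the three sample features move together, a single derived feature keeps almost all of the variance. Real datasets usually need more than one component.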
BigQuery ML offers the following models for dimensionality reduction:
- Principal component analysis (PCA)
- Autoencoder
You can use PCA and autoencoder models with the ML.PREDICT or AI.GENERATE_EMBEDDING functions to embed data into a lower-dimensional space, and with the ML.DETECT_ANOMALIES function to perform anomaly detection.
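One common way to use a dimensionality reduction model for anomaly detection is to measure how badly each point is reconstructed from its low-dimensional embedding. The following plain-Python sketch illustrates that idea for two features. It is a conceptual example under stated assumptions, not a description of what ML.DETECT_ANOMALIES does internally: the training points, the function names, and the fixed threshold are all hypothetical.

```python
import math

def fit_first_component_2d(points):
    """Return the mean and the unit direction of greatest variance for 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form angle of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def reconstruction_error(point, mean, direction):
    """Mean squared error between a point and its 1-D reconstruction."""
    cx, cy = point[0] - mean[0], point[1] - mean[1]
    score = cx * direction[0] + cy * direction[1]      # 1-D embedding
    rx, ry = score * direction[0], score * direction[1]  # back to 2-D
    return ((cx - rx) ** 2 + (cy - ry) ** 2) / 2

# Hypothetical training data lying close to the line y = x.
training = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1), (6.0, 5.8)]
mean, direction = fit_first_component_2d(training)

# Hypothetical cutoff; in practice it would be tuned to the data.
THRESHOLD = 0.5

candidates = [(3.0, 3.1), (3.0, 7.0)]
errors = [reconstruction_error(p, mean, direction) for p in candidates]
flags = [e > THRESHOLD for e in errors]

for point, error, flag in zip(candidates, errors, flags):
    print(point, "error:", error, "anomaly:", flag)
```

Points that follow the pattern the model learned are reconstructed almost exactly, while points that break the pattern lose information in the reduction and show a large error.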
You can use the output from dimensionality reduction models for tasks such as the following:
- Similarity search: Find data points that are similar to each other based on their embeddings. This is great for finding related products, recommending similar content, or identifying duplicate or anomalous items.
- Clustering: Use embeddings as input features for k-means models in order to group data points together based on their similarities. This can help you discover hidden patterns and insights in your data.
- Machine learning: Use embeddings as input features for classification or regression models.
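As an illustration of the similarity search task listed above, the following plain-Python sketch ranks items by the cosine similarity of their embeddings. The item names and two-dimensional embedding values are made up for the example, and cosine similarity is only one of several distance measures you could use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id, embeddings, top_k=2):
    """Return the top_k other items ranked by similarity to the query item."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings, as a dimensionality reduction model might output.
embeddings = {
    "red_shirt": [0.9, 0.1],
    "blue_shirt": [0.8, 0.2],
    "coffee_mug": [0.1, 0.9],
    "tea_cup": [0.2, 0.8],
}

neighbors = most_similar("red_shirt", embeddings, top_k=1)
print(neighbors)
```

The same embeddings could instead be passed as input features to a clustering, classification, or regression model.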
Recommended knowledge
By using the default settings in the CREATE MODEL statements and the inference functions, you can create and use a dimensionality reduction model even without much ML knowledge. However, having basic knowledge about ML development helps you optimize both your data and your model to deliver better results. We recommend using the following resources to develop familiarity with ML techniques and processes:
Last updated 2025-12-15 UTC.