RELATED APPLICATION(S)

This patent arises from a non-provisional patent application that claims the benefit of U.S. Provisional Patent Application No. 63/024,260, which was filed on May 13, 2020. U.S. Provisional Patent Application No. 63/024,260 is hereby incorporated herein by reference in its entirety. Priority to U.S. Provisional Patent Application No. 63/024,260 is hereby claimed.
FIELD OF THE DISCLOSURE

This disclosure relates generally to monitoring audiences, and, more particularly, to methods and apparatus to generate audience metrics using third-party privacy-protected cloud environments.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example system to enable the generation of audience measurement metrics based on the merging of data collected by a database proprietor and an audience measurement entity (AME).
FIG. 2 is a flowchart representative of machine readable instructions which may be executed to implement the example data modifier of FIG. 1 to reduce the dimensionality of a matrix associated with entities and embeddings.
FIG. 3 is a block diagram of an example processing platform structured to execute the instructions of FIG. 2.
The figures are not to scale. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.
Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc. are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for distinctly identifying elements that might, for example, otherwise share a same name. As used herein, “approximately” and “about” refer to dimensions that may not be exact due to manufacturing tolerances and/or other real world imperfections. As used herein, “substantially real time” refers to occurrence in a near instantaneous manner, recognizing there may be real world delays for computing time, transmission, etc. Thus, unless otherwise specified, “substantially real time” refers to real time +/− 1 second.
DETAILED DESCRIPTION

Audience measurement entities (AMEs) usually collect large amounts of audience measurement information from their panelists including the number of unique audience members for particular media and the number of impressions corresponding to each of the audience members. Unique audience size, as used herein, refers to the total number of unique people (e.g., non-duplicate people) who had an impression of (e.g., were exposed to) a particular media item, without counting duplicate audience members. As used herein, an impression is defined to be an event in which a home or individual accesses and/or is exposed to media (e.g., an advertisement, content, a group of advertisements and/or a collection of content). Impression count, as used herein, refers to the number of times audience members are exposed to a particular media item. The unique audience size associated with a particular media item will always be equal to or less than the number of impressions associated with the media item because, while all audience members by definition have at least one impression of the media, an individual audience member may have more than one impression. That is, the unique audience size is equal to the impression count only when every audience member was exposed to the media only a single time (i.e., the number of audience members equals the number of impressions). Where at least one audience member is exposed to the media multiple times, the unique audience size will be less than the total impression count because multiple impressions will be associated with individual audience members. Thus, unique audience size refers to the number of unique people in an audience (without double counting any person) exposed to media for which audience metrics are being generated. Unique audience size may also be referred to as unique audience, deduplicated audience size, deduplicated audience, or audience.
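The relationship between impression count and unique audience size described above can be illustrated with a short sketch (the person identifiers and data here are hypothetical, used only for illustration):

```python
from collections import Counter

def audience_metrics(impression_log):
    """Compute the total impression count and the deduplicated (unique)
    audience size from a log of person identifiers, one entry per impression."""
    counts = Counter(impression_log)
    impression_count = sum(counts.values())  # every exposure is counted
    unique_audience = len(counts)            # each person is counted once
    return impression_count, unique_audience

# Person "a" saw the media twice, so impressions exceed the unique audience.
impressions, audience = audience_metrics(["a", "b", "a", "c"])
print(impressions, audience)  # 4 3
```

As the sketch shows, the unique audience can never exceed the impression count, and the two are equal only when no person has more than one impression.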
Techniques for monitoring user access to Internet-accessible media, such as digital television (DTV) media and digital content ratings (DCR) media, have evolved significantly over the years. Internet-accessible media is also known as digital media. In the past, such monitoring was done primarily through server logs. In particular, media providers serving media on the Internet would log the number of requests received for their media at their servers. Basing Internet usage research on server logs is problematic for several reasons. For example, server logs can be tampered with either directly or via zombie programs, which repeatedly request media from the server to increase the server log counts. Also, media is sometimes retrieved once, cached locally and then repeatedly accessed from the local cache without involving the server. Server logs cannot track such repeat views of cached media. Thus, server logs are susceptible to both over-counting and under-counting errors.
As Internet technology advanced, the limitations of server logs were overcome through methodologies in which the Internet media to be tracked was tagged with monitoring instructions. In particular, monitoring instructions (also known as a media impression request or a beacon request) are associated with the hypertext markup language (HTML) of the media to be tracked. When a client requests the media, both the media and the impression request are downloaded to the client. The impression requests are, thus, executed whenever the media is accessed, be it from a server or from a cache.
The beacon instructions cause monitoring data reflecting information about the access to the media (e.g., the occurrence of a media impression) to be sent from the client that downloaded the media to a monitoring server. Typically, the monitoring server is owned and/or operated by an AME (e.g., any party interested in measuring or tracking audience exposures to advertisements, media, and/or any other media) that did not provide the media to the client and who is a trusted third party for providing accurate usage statistics (e.g., The Nielsen Company, LLC). Advantageously, because the beaconing instructions are associated with the media and executed by the client browser whenever the media is accessed, the monitoring information is provided to the AME irrespective of whether the client is associated with a panelist of the AME. In this manner, the AME is able to track every time a person is exposed to the media on a census-wide or population-wide level. As a result, the AME can reliably determine the total impression count for the media without having to extrapolate from panel data collected from a relatively limited pool of panelists within the population. Frequently, such beacon requests are implemented in connection with third-party cookies. Since the AME is a third party relative to the first party serving the media to the client device, the cookie sent to the AME in the impression request to report the occurrence of the media impression of the client device is a third-party cookie. Third-party cookie tracking is used by audience measurement servers to track access to media by client devices from first-party media servers.
Tracking impressions by tagging media with beacon instructions using third-party cookies is insufficient, by itself, to enable an AME to reliably determine the unique audience size associated with the media if the AME cannot identify the individual user associated with the third-party cookie. That is, the unique audience size cannot be determined because the collected monitoring information does not uniquely identify the person(s) exposed to the media. Under such circumstances, the AME cannot determine whether two reported impressions are associated with the same person or two separate people. The AME may set a third-party cookie on a client device reporting the monitoring information to identify when multiple impressions occur using the same device. However, cookie information does not indicate whether the same person used the client device in connection with each media impression. Furthermore, the same person may access media using multiple different devices that have different cookies so that the AME cannot directly determine when two separate impressions are associated with the same person or two different people.
Furthermore, the monitoring information reported by a client device executing the beacon instructions does not provide an indication of the demographics or other user information associated with the person(s) exposed to the associated media. To at least partially address this issue, the AME establishes a panel of users who have agreed to provide their demographic information and to have their Internet browsing activities monitored. When an individual joins the panel, that person provides corresponding detailed information concerning the person's identity and demographics (e.g., gender, race, income, home location, occupation, etc.) to the AME. The AME sets a cookie on the panelist computer that enables the AME to identify the panelist whenever the panelist accesses tagged media and, thus, sends monitoring information to the AME. Additionally or alternatively, the AME may identify the panelists using other techniques (independent of cookies) by, for example, prompting the user to login or identify themselves. While AMEs are able to obtain user-level information for impressions from panelists (e.g., identify unique individuals associated with particular media impressions), most of the client devices providing monitoring information from the tagged pages are not panelists. Thus, the identity of most people accessing media remains unknown to the AME such that it is necessary for the AME to use statistical methods to impute demographic information based on the data collected for panelists to the larger population of users providing data for the tagged media. However, panel sizes of AMEs remain small compared to the general population of users.
There are many database proprietors operating on the Internet. These database proprietors provide services to large numbers of subscribers. In exchange for the provision of services, the subscribers register with the database proprietors. Examples of such database proprietors include social network sites (e.g., Facebook, Twitter, MySpace, etc.), multi-service sites (e.g., Yahoo!, Google, Axiom, Catalina, etc.), online retailer sites (e.g., Amazon.com, Buy.com, etc.), credit reporting sites (e.g., Experian), streaming media sites (e.g., YouTube, Hulu, etc.), etc. These database proprietors set cookies and/or other device/user identifiers on the client devices of their subscribers to enable the database proprietors to recognize their subscribers when their subscribers visit website(s) on the Internet domains of the database proprietors.
The protocols of the Internet make cookies inaccessible outside of the domain (e.g., Internet domain, domain name, etc.) on which they were set. Thus, a cookie set in, for example, the YouTube.com domain (e.g., a first party) is accessible to servers in the YouTube.com domain, but not to servers outside that domain. Therefore, although an AME (e.g., a third party) might find it advantageous to access the cookies set by the database proprietors, it is unable to do so. However, techniques have been developed that enable an AME to leverage media impression information collected in association with demographic information in subscriber databases of database proprietors to collect more extensive Internet usage data (e.g., beyond the limited pool of individuals participating in an AME panel) by extending the impression request process to encompass partnered database proprietors and by using such partners as interim data collectors. In particular, this task is accomplished by structuring the AME to respond to impression requests from clients (who may not be a member of an audience measurement panel and, thus, may be unknown to the AME) by redirecting the clients from the AME to a database proprietor, such as a social network site partnered with the AME, using an impression response. Such a redirection initiates a communication session between the client accessing the tagged media and the database proprietor. For example, the impression response received from the AME may cause the client to send a second impression request to the database proprietor along with a cookie set by that database proprietor. In response to receiving this impression request, the database proprietor (e.g., Facebook) can access the cookie it has set on the client to thereby identify the client based on the internal records of the database proprietor.
In the event the client corresponds to a subscriber of the database proprietor (as determined from the cookie associated with the client), the database proprietor logs/records a database proprietor demographic impression in association with the client/user. As used herein, a demographic impression is an impression that can be matched to particular demographic information of a particular subscriber or registered user of the services of a database proprietor. The database proprietor has the demographic information for the particular subscriber because the subscriber would have provided such information when setting up an account to subscribe to the services of the database proprietor.
Sharing of demographic information associated with subscribers of database proprietors enables AMEs to extend or supplement their panel data with substantially reliable demographics information from external sources (e.g., database proprietors), thus extending the coverage, accuracy, and/or completeness of their demographics-based audience measurements. Such access also enables the AME to monitor persons who would not otherwise have joined an AME panel. Any web service provider having a database identifying demographics of a set of individuals may cooperate with the AME. Such web service providers may be referred to as “database proprietors” and include, for example, wireless service carriers, mobile software/service providers, social media sites (e.g., Facebook, Twitter, MySpace, etc.), online retailer sites (e.g., Amazon.com, Buy.com, etc.), multi-service sites (e.g., Yahoo!, Google, Experian, etc.), and/or any other Internet sites that collect demographic data of users and/or otherwise maintain user registration records. The use of demographic information from disparate data sources (e.g., high-quality demographic information from the panels of an audience measurement entity and/or registered user data of database proprietors) results in improved reporting effectiveness of metrics for both online and offline advertising campaigns.
The above approach to generating audience metrics by an AME depends upon the beacon requests (or tags) associated with the media to be monitored to enable an AME to obtain census-wide impression counts (e.g., impressions that include the entire population exposed to the media regardless of whether the audience members are panelists of the AME). Further, the above approach also depends on third-party cookies to enable the enrichment of the census impressions with demographic information from database proprietors. However, in more recent years, there has been a movement away from the use of third-party cookies by third parties. Thus, while media providers (e.g., database proprietors) may still use first-party cookies to collect first-party data, the elimination of third-party cookies prevents the tracking of Internet media by AMEs (outside of client devices associated with panelists for which the AME has provided a meter to track Internet usage behavior). Furthermore, independent of the use of cookies, some database proprietors are moving towards the elimination of third-party impression requests or tags (e.g., redirect instructions) embedded in media (e.g., beginning in 2020, third-party tags will no longer be allowed on Youtube.com and other Google Video Partner (GVP) sites). As technology moves in this direction, AMEs (e.g., third parties) will no longer be able to track census-wide impressions of media in the manner they have in the past. Furthermore, AMEs will no longer be able to send a redirect request to a client accessing media to cause a second impression request to a database proprietor to associate the impression with demographic information. Thus, the only Internet media monitoring that AMEs will be able to directly perform in such a system will be with panelists that have agreed to be monitored using different techniques that do not depend on third-party cookies and/or tags.
Examples disclosed herein overcome at least some of the limitations that arise out of the elimination of third-party cookies and/or third-party tags by enabling the merging of high-quality demographic information from the panels of an AME with media impression data that continues to be collected by database proprietors. As mentioned above, while third-party cookies and/or third-party tags may be eliminated, database proprietors that provide and/or manage the delivery of media accessed online are still able to track impressions of the media (e.g., via first-party cookies and/or first-party tags). Furthermore, database proprietors are still able to associate demographic information with the impressions whenever the impressions can be matched to a particular subscriber of the database proprietor for which demographic information has been collected (e.g., when the user is registered with the database proprietor). In some examples, the AME panel data and the database proprietor impressions data are merged in a privacy-protected cloud environment maintained by the database proprietor. The merged data may include entities for each user. These entities may be top search result click entities and/or video watch entities during a period of time. In examples disclosed herein, a search result click entity is an integer identifier that represents a search term entered by a user. In examples disclosed herein, a video watch entity is an integer identifier that represents a video viewed by a user. In examples disclosed herein, integer identifiers map to a knowledge graph of all entities for the search result clicks and/or videos watched. Additionally, embeddings may be provided for each such entity. In examples disclosed herein, an embedding is a classification of an entity. In examples disclosed herein, classifications are numerical representations (e.g., a vector array of values) of some class of similar objects, images, words, and the like.
Example classifications include classifications of Internet searches requested by a user (e.g., corresponding to a top search result click entity) and classifications of media accessed by a user (e.g., corresponding to a video watch entity). In one example, the merged data for a user is provided in an entity-embeddings matrix.
Examples disclosed herein may be used to reduce the dimensionality of entity-embeddings matrices corresponding to users. In examples disclosed herein, an entity-embeddings matrix is used to represent relationships between embeddings corresponding to entities. For example, an entity-embeddings matrix is reduced to a more manageable size to be used as an input feature to generate demographic correction models as disclosed herein. In some examples, a reduction technique is to select top m entities and top n embeddings. Additionally or alternatively, in other examples, a reduction technique is to calculate a weighted average of the embeddings across the entities. Additionally or alternatively, in other examples, a reduction technique is to reduce the dimension of embeddings by using a single value to represent the different embedding dimensions. In this manner, the reduced entity-embeddings matrices generated using techniques disclosed herein may be used to improve computers, computer performance, and/or computer-generated data by providing data of a manageable size to be used as an input feature for demographic correction model generation.
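For illustration only, the three reduction techniques described above might be sketched as follows; the matrix shape, the ranking of "top" entities and embeddings by magnitude, the uniform weights, and the use of a mean as the single representative value are all assumptions, not the disclosed implementation:

```python
import numpy as np

def select_top(matrix, m, n):
    """Keep the top-m entity rows and top-n embedding columns,
    ranking rows and columns by the magnitude of their values."""
    row_order = np.argsort(-np.abs(matrix).sum(axis=1))[:m]
    col_order = np.argsort(-np.abs(matrix).sum(axis=0))[:n]
    return matrix[np.ix_(row_order, col_order)]

def weighted_average(matrix, weights):
    """Collapse the entity dimension to a single embedding vector
    via a weighted average of the embeddings across the entities."""
    weights = np.asarray(weights, dtype=float)
    return weights @ matrix / weights.sum()

def collapse_embeddings(matrix):
    """Represent each entity's embedding dimensions with a single
    value (here, the mean), yielding one column per entity."""
    return matrix.mean(axis=1, keepdims=True)

entities = np.arange(12.0).reshape(4, 3)         # 4 entities x 3 embedding dims
print(select_top(entities, m=2, n=2).shape)      # (2, 2)
print(weighted_average(entities, [1, 1, 1, 1]))  # [4.5 5.5 6.5]
print(collapse_embeddings(entities).shape)       # (4, 1)
```

Each technique trades some information for a smaller, fixed-size representation suitable as an input feature for model generation.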
More particularly, FIG. 1 is a block diagram illustrating an example system 100 to enable the generation of audience measurement metrics based on the merging of data collected by a database proprietor 102 and an AME 104. In particular, in some examples, the data includes AME panel data (that includes media impressions for panelists that are associated with high-quality demographic information collected by the AME 104) and database proprietor impressions data (which may be enriched with demographic and/or other information available to the database proprietor 102). In the illustrated example, these disparate sources of data are combined within a privacy-protected cloud environment 106 managed and/or maintained by the database proprietor 102. The privacy-protected cloud environment 106 is a cloud-based environment that enables media providers (e.g., advertisers and/or content providers) and third parties (e.g., the AME 104) to input and combine their data with data from the database proprietor 102 inside a data warehouse or data store that enables efficient big data analysis. The combining of data from different parties (e.g., different Internet domains) presents risks to the privacy of the data associated with individuals represented by the data from the different parties. Accordingly, the privacy-protected cloud environment 106 is established with privacy constraints that prevent any associated party (including the database proprietor 102) from accessing private information associated with particular individuals. Rather, any data extracted from the privacy-protected cloud environment 106 following a big data analysis and/or query is limited to aggregated information. A specific example of the privacy-protected cloud environment 106 is the Ads Data Hub (ADH) developed by Google.
As used herein, a media impression is defined as an occurrence of access and/or exposure to media 108 (e.g., an advertisement, a movie, a movie trailer, a song, a web page banner, etc.). Examples disclosed herein may be used to monitor for media impressions of any one or more media types (e.g., video, audio, a web page, an image, text, etc.). In examples disclosed herein, the media 108 may be primary content and/or advertisements. Examples disclosed herein are not restricted for use with any particular type of media. On the contrary, examples disclosed herein may be implemented in connection with tracking impressions for media of any type or form in a network.
In the illustrated example of FIG. 1, content providers and/or advertisers distribute the media 108 via the Internet to users that access websites and/or online television services (e.g., web-based TV, Internet protocol TV (IPTV), etc.). For purposes of explanation, examples disclosed herein are described assuming the media 108 is an advertisement that may be provided in connection with particular content of primary interest to a user. In some examples, the media 108 is served by media servers managed by and/or associated with the database proprietor 102 that manages and/or maintains the privacy-protected cloud environment 106. For example, the database proprietor 102 may be Google, and the media 108 corresponds to ads served with videos accessed via Youtube.com and/or via other Google video partners (GVPs). More generally, in some examples, the database proprietor 102 includes corresponding database proprietor servers that can serve media 108 to individuals via client devices 110. In the illustrated example of FIG. 1, the client devices 110 may be stationary or portable computers, handheld computing devices, smart phones, Internet appliances, smart televisions, and/or any other type of device that may be connected to the Internet and capable of presenting media. For purposes of explanation, the client devices 110 of FIG. 1 include panelist client devices 112 and non-panelist client devices 114 to indicate that at least some individuals that access and/or are exposed to the media 108 correspond to panelists who have provided detailed demographic information to the AME 104 and have agreed to enable the AME 104 to track their exposure to the media 108. In many situations, other individuals who are not panelists will also be exposed to the media 108 (e.g., via the non-panelist client devices 114). Typically, the number of non-panelist audience members for a particular media item will be significantly greater than the number of panelist audience members.
In some examples, the panelist client devices 112 may include and/or implement an audience measurement meter 115 that captures the impressions of media 108 accessed by the panelist client devices 112 (along with associated information) and reports the same to the AME 104. In some examples, the audience measurement meter 115 may be a separate device from the panelist client device 112 used to access the media 108.
In some examples, the media 108 is associated with a unique impression identifier (e.g., a consumer playback nonce (CPN)) generated by the database proprietor 102. In some examples, the impression identifier serves to uniquely identify a particular impression of the media 108. Thus, even though the same media 108 may be served multiple times, each time the media 108 is served the database proprietor 102 will generate a new and different impression identifier so that each impression of the media 108 can be distinguished from every other impression of the media. In some examples, the impression identifier is encoded into a uniform resource locator (URL) used to access the primary content (e.g., a particular YouTube video) along with which the media 108 (as an advertisement) is served. In some examples, with the impression identifier (e.g., CPN) encoded into the URL associated with the media 108, the audience measurement meter 115 extracts the identifier at the time that a media impression occurs so that the AME 104 is able to associate a captured impression with the impression identifier.
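For illustration, extracting an impression identifier encoded into a URL, as the meter 115 is described as doing, might look like the following sketch; the query-parameter name cpn and the example URL are assumptions made here for the example, not details from the disclosure:

```python
from urllib.parse import urlparse, parse_qs

def extract_impression_id(url, param="cpn"):
    """Pull the impression identifier (e.g., a CPN) out of the query
    string of the content URL, returning None when it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get(param)
    return values[0] if values else None

url = "https://video.example.com/watch?v=abc123&cpn=XyZ789nonce"
print(extract_impression_id(url))  # XyZ789nonce
```

A meter following this pattern would record the extracted identifier alongside the captured impression so the two can later be joined.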
In some examples, the meter 115 may not be able to obtain the impression identifier (e.g., CPN) to associate with a particular media impression. For instance, in some examples where the panelist client device 112 is a mobile device, the meter 115 collects a mobile advertising identifier (MAID) and/or an identifier for advertisers (IDFA) that may be used to uniquely identify client devices 110 (e.g., the panelist client devices 112 being monitored by the AME 104). In some examples, the meter 115 reports the MAID and/or IDFA for the particular device associated with the meter 115 to the AME 104. The AME 104, in turn, provides the MAID and/or IDFA to the database proprietor 102 in a double blind exchange through which the database proprietor 102 provides the AME 104 with the impression identifiers (e.g., CPNs) associated with the client device 110 identified by the MAID and/or IDFA. Once the AME 104 receives the impression identifiers for the client device 110 (e.g., a particular panelist client device 112), the impression identifiers are associated with the impressions previously collected in connection with the device.
In the illustrated example, the database proprietor 102 logs each media impression occurring on any of the client devices 110 within the privacy-protected cloud environment 106. In some examples, logging an impression includes logging the time the impression occurred and the type of client device 110 (e.g., whether a desktop device, a mobile device, a tablet device, etc.) on which the impression occurred. Further, in some examples, impressions are logged along with the impression's unique impression identifier. In this example, the impressions and associated identifiers are logged in a campaign impressions database 116. The campaign impressions database 116 stores all impressions of the media 108 regardless of whether any particular impression was detected from a panelist client device 112 or a non-panelist client device 114. Furthermore, the campaign impressions database 116 stores all impressions of the media 108 regardless of whether the database proprietor 102 is able to match any particular impression to a particular subscriber of the database proprietor 102. As mentioned above, in some examples, the database proprietor 102 identifies a particular user (e.g., subscriber) associated with a particular media impression based on a cookie stored on the client device 110. In some examples, the database proprietor 102 associates a particular media impression with a user that was signed into the online services of the database proprietor 102 at the time the media impression occurred. In some examples, in addition to logging such impressions and associated identifiers in the campaign impressions database 116, the database proprietor 102 separately logs such impressions in a matchable impressions database 118.
As used herein, a matchable impression is an impression that the database proprietor 102 is able to match to at least one of a particular subscriber (e.g., because the impression occurred on a client device 110 on which a user was signed into the database proprietor 102) or a particular client device 110 (e.g., based on a first-party cookie of the database proprietor 102 detected on the client device 110). In some examples, if the database proprietor 102 cannot match a particular media impression (e.g., because no user was signed in at the time the media impression occurred and there is no recognizable cookie on the associated client device 110), the impression is omitted from the matchable impressions database 118 but is still logged in the campaign impressions database 116.
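The logging behavior described above can be sketched as follows: every impression is logged in the campaign impressions database 116, and only impressions matchable to a signed-in user or to a device via a first-party cookie are additionally logged in the matchable impressions database 118 (the field names and list-based "databases" are hypothetical simplifications):

```python
def log_impression(event, campaign_db, matchable_db):
    """Log every impression in the campaign database; additionally log it
    in the matchable database only when it can be tied to a signed-in
    user or to a device via a first-party cookie."""
    campaign_db.append(event)
    if event.get("user_id") is not None or event.get("cookie_id") is not None:
        matchable_db.append(event)

campaign, matchable = [], []
log_impression({"cpn": "n1", "user_id": "u7"}, campaign, matchable)
log_impression({"cpn": "n2", "cookie_id": "c3"}, campaign, matchable)
log_impression({"cpn": "n3"}, campaign, matchable)  # unmatchable impression
print(len(campaign), len(matchable))  # 3 2
```

The unmatchable third impression still contributes to census-wide counts even though it carries no user association.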
As indicated above, the matchable impressions database 118 includes media impressions (and associated unique impression identifiers) that the database proprietor 102 is able to match to a particular user that has registered with the database proprietor 102. In some examples, the matchable impressions database 118 also includes user-based covariates that correspond to the particular user to which each impression in the database was matched. As used herein, a user-based covariate refers to any item(s) of information collected and/or generated by the database proprietor 102 that can be used to identify, characterize, quantify, and/or distinguish particular users and/or their associated behavior. For example, user-based covariates may include the name, age, and/or gender of the user (and/or any other demographic information about the user) collected at the time the user registered with the database proprietor 102, the relative frequency with which the user uses the different types of client devices 110, the number of media items the user has accessed during a most recent period of time (e.g., the last 30 days), the search terms entered by the user during a most recent period of time (e.g., the last 30 days), feature embeddings (numerical representations) of classifications of videos viewed and/or searches entered by the user, etc. As mentioned above, the matchable impressions database 118 also includes impressions matched to particular client devices 110 (based on first-party cookies), even when the impressions cannot be matched to particular users (based on the users being signed in at the time). In some such examples, the impressions matched to particular client devices 110 are treated as distinct users within the matchable impressions database 118. However, as no particular user can be identified, such impressions in the matchable impressions database 118 will not be associated with any user-based covariates.
Although only one campaign impressions database 116 is shown in the illustrated example, the privacy-protected cloud environment 106 may include any number of campaign impressions databases 116, with each database storing impressions corresponding to different media campaigns associated with one or more different advertisers (e.g., product manufacturers, service providers, retailers, merchants, advertisement servers, etc.). In other examples, a single campaign impressions database 116 may store the impressions associated with multiple different campaigns. In some such examples, the campaign impressions database 116 may store a campaign identifier in connection with each impression to identify the particular campaign to which the impression is associated. Similarly, in some examples, the privacy-protected cloud environment 106 may include one or more matchable impressions databases 118 as appropriate. Further, in some examples, the campaign impressions database 116 and the matchable impressions database 118 may be combined and/or represented in a single database.
In the illustrated example of FIG. 1, impressions occurring on the client devices 110 are shown as being reported (e.g., via network communications) directly to both the campaign impressions database 116 and the matchable impressions database 118. However, this should not be interpreted as necessarily requiring multiple separate network communications from the client devices 110 to the database proprietor 102. Rather, in some examples, notifications of impressions are collected from a single network communication from the client device 110, and the database proprietor 102 then populates both the campaign impressions database 116 and the matchable impressions database 118. In some examples, the matchable impressions database 118 is generated based on an analysis of the data in the campaign impressions database 116. Regardless of the particular process by which the two databases 116, 118 are populated with logged impressions, in some examples, the user-based covariates included in the matchable impressions database 118 may be combined with the logged impressions in the campaign impressions database 116 and stored in an enriched impressions database 120. Thus, the enriched impressions database 120 includes all (e.g., census wide) logged impressions of the media 108 for the relevant advertising campaign and also includes all available user-based covariates associated with each of the logged impressions that the database proprietor 102 was able to match to a particular user.
As shown in the illustrated example, whereas the database proprietor 102 is able to collect impressions from both panelist client devices 112 and non-panelist client devices 114, the AME 104 is limited to collecting impressions from panelist client devices 112. In some examples, the AME 104 also collects the impression identifier associated with each collected media impression so that the collected impressions may be matched with the impressions collected by the database proprietor 102 as described further below. In the illustrated example, the impressions (and associated impression identifiers) of the panelists are stored in an AME panel data database 122 that is within an AME first party data store 124 in an AME proprietary cloud environment 126. In some examples, the AME proprietary cloud environment 126 is a cloud-based storage system (e.g., a Google Cloud Project) provided by the database proprietor 102 that includes functionality to enable interfacing with the privacy-protected cloud environment 106 also maintained by the database proprietor 102. As mentioned above, the privacy-protected cloud environment 106 is governed by privacy constraints that prevent any party (with some limited exceptions for the database proprietor 102) from accessing private information associated with particular individuals. By contrast, the AME proprietary cloud environment 126 is indicated as proprietary because it is exclusively controlled by the AME such that the AME has full control over and access to the data without limitation. While some examples involve the AME proprietary cloud environment 126 being a cloud-based system that is provided by the database proprietor 102, in other examples, the AME proprietary cloud environment 126 may be provided by a third party distinct from the database proprietor 102.
While the AME 104 is limited to collecting impressions (and associated identifiers) from only panelists (e.g., via the panelist client devices 112), the AME 104 is able to collect panel data that is much more robust than merely media impressions. As mentioned above, the panelist client devices 112 are associated with users that have agreed to participate in a panel of the AME 104. Participation in a panel includes the provision of detailed demographic information about the panelist and/or all members in the panelist's household. Such demographic information may include age, gender, race, ethnicity, education, employment status, income level, geographic location of residence, etc. In addition to such demographic information, which may be collected at the time a user enrolls as a panelist, the panelist may also agree to enable the AME 104 to track and/or monitor various aspects of the user's behavior. For example, the AME 104 may monitor panelists' Internet usage behavior including the frequency of Internet usage, the times of day of such usage, the websites visited, and the media to which the panelists are exposed (from which the media impressions are collected).
AME panel data (including media impressions and associated identifiers, demographic information, and Internet usage data) is shown in FIG. 1 as being provided directly to the AME panel data database 122 from the panelist client devices 112. However, in some examples, there may be one or more intervening operations and/or components that collect and/or process the collected data before it is stored in the AME panel data database 122. For instance, in some examples, impressions are initially collected and reported to a separate server and/or database that is distinct from the AME proprietary cloud environment 126. In some such examples, this separate server and/or database may not be a cloud-based system. Further, in some examples, such a non-cloud-based system may interface directly with the privacy-protected cloud environment 106 such that the AME proprietary cloud environment 126 may be omitted entirely.
In some examples, there may be multiple different techniques and/or methodologies used to collect the AME panel data that depend on the particular circumstances involved. For example, different monitoring techniques and/or different types of audience measurement meters 115 may be employed for media accessed via a desktop computer relative to media accessed via a mobile computing device. In some examples, the audience measurement meter 115 may be implemented as a software application that panelists agree to install on their devices to monitor all Internet usage activity on the respective devices. In some examples, the meter 115 may prompt a user of a particular device to identify themselves so that the AME 104 can confirm the identity of the user (e.g., whether it was the mother or daughter in a panelist household). In some examples, prompting a user to self-identify may be considered overly intrusive. Accordingly, in some such examples, the circumstances surrounding the behavior of the user of a panelist client device 112 (e.g., time of day, type of content being accessed, etc.) may be analyzed to infer the identity of the user to some confidence level (e.g., the accessing of children's content in the early afternoon would indicate a relatively high probability that a child is using the device at that point in time). In some examples, the audience measurement meter 115 may be a separate hardware device that is in communication with a particular panelist client device 112 and enabled to monitor the Internet usage of the panelist client device 112.
In some examples, the processes and/or techniques used by the AME 104 to capture panel data (including media impressions and who in particular was exposed to the media) can differ depending on the nature of the panelist client device 112 through which the media was accessed. For instance, in some examples, the identity of the individual using the client device 112 may be based on the individual responding to a prompt to self-identify. In some examples, such prompts are limited to desktop client devices because such a prompt is viewed as overly intrusive on a mobile device. However, without specifically prompting a user of a mobile device to self-identify, there often is no direct way to determine whether the user is the primary user of the device (e.g., the owner of the device) or someone else (e.g., a child of the primary user). Thus, there is the possibility of misattribution of media impressions within the panel data collected using mobile devices. In some examples, to overcome the issue of misattribution in the panel data, the AME 104 may develop a machine learning model that can predict the true user of a mobile device (or any device for that matter) based on information that the AME 104 does know for certain and/or has access to. For example, inputs to the machine learning model may include the composition of the panelist household, the type (e.g., genre and/or category) of the content, the daypart or time of day when the content was accessed, etc. In some examples, the truth data used to generate and validate such a model may be collected through field surveys in which the above input features are tracked and/or monitored for a subset of panelists that have agreed to be monitored in this manner (which is more intrusive than the typical passive monitoring of content accessed via mobile devices).
As mentioned above, in some examples, the AME panel data (stored in the AME panel data database 122) is merged with the database proprietor impressions data (stored in the matchable impressions database 118) within the privacy-protected cloud environment 106 to take advantage of the combination of the disparate sets of data to generate more robust and/or reliable audience measurement metrics. In particular, the database proprietor impressions data provides the advantage of volume. That is, the database proprietor impressions data corresponds to a much larger number of impressions than the AME panel data because the database proprietor impressions data includes census wide impression information that includes all impressions collected from both the panelist client devices 112 (associated with a relatively small pool of audience members) and the non-panelist client devices 114. The AME panel data provides the advantage of high-quality demographic data for a statistically significant pool of audience members (e.g., panelists) that may be used to correct for errors and/or biases in the database proprietor impressions data.
One source of error in the database proprietor impressions data is that the demographic information for matchable users collected by the database proprietor 102 during user registration may not be truthful. In particular, in some examples, many database proprietors impose age restrictions on their user accounts (e.g., a user must be at least 13 years of age, at least 18 years of age, etc.). However, when a person registers with the database proprietor 102, the user typically self-declares their age and may, therefore, lie about their age (e.g., an 11 year old may say they are 18 to bypass the age restrictions for a user account). Independent of age restrictions, a particular user may choose to enter an incorrect age for any other reason or no reason at all (e.g., a 44 year old may choose to assert they are only 25). Where a database proprietor 102 does not verify the self-declared age of users, there is a relatively high likelihood that the ages of at least some registered users of the database proprietor stored in the matchable impressions database 118 (as a particular user-based covariate) are inaccurate. Further, it is possible that other self-declared demographic information (e.g., gender, race, ethnicity, income level, etc.) may also be falsified by users during registration. As described further below, the AME panel data (which contains reliable demographic information about the panelists) can be used to correct for inaccurate demographic information in the database proprietor impressions data.
Another source of error in the database proprietor impressions data is based on the concept of misattribution, which arises in situations where multiple different people use the same client device 110 to access media. In some examples, the database proprietor 102 associates a particular impression to a particular user based on the user being signed into a platform provided by the database proprietor. For example, if a particular person signs into their Google account and begins watching a YouTube video on a particular client device 110, that person will be attributed with an impression for an ad served during the video because the person was signed in at the time. However, there may be instances where the person finishes using the client device 110 but does not sign out of his or her Google account. Thereafter, a second different person (e.g., a different member in the family of the first person) begins using the client device 110 to view another YouTube video. Although the second person is now accessing media via the client device 110, ad impressions during this time will still be attributed to the first person because the first person is the one who is still indicated as being signed in. Thus, there are likely to be circumstances where an impression corresponding to the actual person exposed to the media 108 is misattributed to a different registered user of the database proprietor 102. The AME panel data (which includes an indication of the actual person using the panelist client devices 112 at any given moment) can be used to correct for misattribution in the demographic information in the database proprietor impressions data. As mentioned above, in some situations, the AME panel data may itself include misattribution errors. Accordingly, in some examples, the AME panel data may first be corrected for misattribution before the AME panel data is used to correct misattribution in the database proprietor impressions data.
An example methodology to correct for misattribution in the database proprietor impressions data is described in Singh et al., U.S. Pat. No. 10,469,903, which is hereby incorporated herein by reference in its entirety.
Another problem with the database proprietor impressions data is that of non-coverage. Non-coverage refers to impressions recorded by the database proprietor 102 that cannot be matched to a particular registered user of the database proprietor 102. The inability of the database proprietor 102 to match a particular impression to a particular user can occur for several reasons, including that the user is not signed in at the time of the media impression, that the user has not established an account with the database proprietor 102, that the user has enabled Limited Ad Tracking (LAT) to prevent the user account from being associated with ad impressions, or that the content associated with the media being monitored corresponds to children's content (for which user-based tracking is not performed). While the inability of the database proprietor 102 to match and assign a particular impression to a particular user is not necessarily an error in the database proprietor impressions data, it does undermine the ability to reliably estimate the total unique audience size for (e.g., the number of unique individuals that were exposed to) a particular media item. For example, assume that the database proprietor 102 records a total of 11,000 impressions for media 108 in a particular advertising campaign. Further assume that of those 11,000 impressions, the database proprietor 102 is able to match 10,000 impressions to a total of 5,000 different users (e.g., each user was exposed to the media on average 2 times) but is unable to match the remaining 1,000 impressions to particular users.
Relying solely on the database proprietor impressions data, in this example, there is no way to determine whether the remaining 1,000 impressions should also be attributed to the 5,000 users already exposed at least once to the media 108 (for a total audience size of 5,000 people) or whether one or more of the remaining 1,000 impressions should be attributed to other users not among the 5,000 already identified (for a total audience size of up to 6,000 people, if every one of the 1,000 impressions was associated with a different person not included in the matched 5,000 users). In some examples disclosed herein, the AME panel data can be used to estimate the distribution of impressions across different users associated with the non-coverage portion of impressions in the database proprietor impressions data to thereby estimate a total audience size for the relevant media 108.
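The ambiguity in the example above can be expressed as a pair of bounds on the unique audience size. The following minimal sketch computes those bounds from the figures in the example (the variable names are illustrative, not part of the disclosed system):

```python
# Figures from the non-coverage example above.
total_impressions = 11_000
matched_impressions = 10_000
matched_users = 5_000

# Impressions the database proprietor could not match to any user.
unmatched = total_impressions - matched_impressions  # 1,000

# Lower bound: every unmatched impression belongs to an already-matched user.
audience_lower_bound = matched_users
# Upper bound: every unmatched impression belongs to a distinct new person.
audience_upper_bound = matched_users + unmatched

print(audience_lower_bound, audience_upper_bound)  # 5000 6000
```

The true audience size lies somewhere in this range; the disclosure uses AME panel data to estimate where within the range it falls.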
Another confounding factor in the estimation of the total unique audience size for media based on the database proprietor impressions data is the existence of multiple user accounts for a single user. More particularly, in some situations a particular individual may establish multiple accounts with the database proprietor 102 for different purposes (e.g., a personal account, a work account, a joint account shared with other individuals, etc.). Such a situation can result in a larger number of different users being identified as audience members of the media 108 than the actual number of individuals exposed to the media 108. For example, assume that a particular person registers three user accounts with the database proprietor 102 and is exposed to the media 108 once while signed into each of the three different accounts, for a total of three impressions. In this scenario, the database proprietor 102 would match each impression to a different user based on the different user accounts, making it appear that three different people were exposed to the media 108 when, in fact, only one person was exposed to the media three different times. Examples disclosed herein use the AME panel data in conjunction with the database proprietor impressions data to estimate an actual unique audience size from the potentially inflated number of apparently unique users exposed to the media 108.
In the illustrated example of FIG. 1, the AME panel data is merged with the database proprietor impressions data by an example data matching analyzer 128. In some examples, the data matching analyzer 128 implements an application programming interface (API) that takes the disparate datasets and matches users in the database proprietor impressions data with panelists in the AME panel data. In some examples, users are matched with panelists based on the unique impression identifiers (e.g., CPNs) collected in connection with the media impressions logged by both the database proprietor 102 and the AME 104. The combined data is stored in an intermediary merged data database 130 within an AME privacy-protected data store 132. The data in the intermediary merged data database 130 is referred to as "intermediary" because it is at an intermediate stage of processing: it includes AME panel data that has been enhanced and/or combined with the database proprietor impressions data but has not yet been corrected or adjusted to account for the sources of error and/or bias in the database proprietor impressions data as outlined above.
In some examples, the AME intermediary merged data is analyzed by an adjustment factor analyzer 134 to calculate adjustment or calibration factors that may be stored in an adjustment factors database 136 within an AME output data store 138 of the AME proprietary cloud environment 126. In some examples, the adjustment factor analyzer 134 calculates different types of adjustment factors to account for different types of errors and/or biases in the database proprietor impressions data. For instance, a multi-account adjustment factor corrects for the situation of a single user accessing media using multiple different user accounts associated with the database proprietor 102. A signed-out adjustment factor corrects for non-coverage associated with users that access media while signed out of their account associated with the database proprietor 102 (so that the database proprietor 102 is unable to associate the impression with the users). In some examples, the adjustment factor analyzer 134 is able to directly calculate the multi-account adjustment factor and the signed-out adjustment factor in a deterministic manner.
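The disclosure does not give a formula for the multi-account adjustment factor, so the following is only a hypothetical sketch. It assumes the factor is the ratio of panel-verified actual persons to apparently distinct user accounts, using the three-account scenario described earlier; the names `multi_account_factor` and the impression records are illustrative:

```python
# Hypothetical: three logged impressions matched to three different user
# accounts that the AME panel data reveals belong to a single person.
impressions = [
    {"impression_id": "cpn-1", "user_account": "acct_personal"},
    {"impression_id": "cpn-2", "user_account": "acct_work"},
    {"impression_id": "cpn-3", "user_account": "acct_joint"},
]

# Apparent audience: the count of distinct matched accounts.
apparent_audience = len({i["user_account"] for i in impressions})

# Actual audience: known for panelists from the AME panel data.
actual_audience = 1

# Assumed form of the adjustment: actual persons per apparent account.
multi_account_factor = actual_audience / apparent_audience
print(multi_account_factor)
```

Applying such a factor to an apparent audience count would deflate it toward the true number of unique persons; the actual calculation used by the adjustment factor analyzer 134 may differ.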
While the multi-account adjustment factors and the signed-out adjustment factors may be deterministically calculated, correcting for falsified or otherwise incorrect demographic information (e.g., incorrectly self-declared ages) of registered users of the database proprietor 102 cannot be solved in such a direct and deterministic manner. Rather, in some examples, a machine learning model is developed to analyze and predict the correct ages of registered users of the database proprietor 102. Specifically, as shown in FIG. 1, the privacy-protected cloud environment 106 implements a model generator 140 to generate a demographic correction model using the AME intermediary merged data (stored in the AME intermediary merged data database 130) as inputs. More particularly, in some examples, self-declared demographics (e.g., the self-declared age) of users of the database proprietor 102, along with other covariates associated with the users, are used as the input variables or features to train a model to predict the correct demographics (e.g., correct age) of the users as validated by the AME panel data, which serves as the truth data or training labels for the model generation. In some examples, different demographic correction model(s) may be developed to correct for different types of demographic information that need correcting. For instance, in some examples, a first model can be used to correct the self-declared age of users of the database proprietor 102 and a second model can be used to correct the self-declared gender of the users. Once the model(s) have been trained and validated based on the AME panel data, the model(s) are stored in a demographic correction models database 142.
As mentioned above, there are many different types of covariates collected and/or generated by the database proprietor 102. In some examples, the covariates provided by the database proprietor 102 may include a certain number (e.g., 100) of the top search result click entities and/or video watch entities for every user during a most recent period of time (e.g., for the last month). As an example, the covariates may include the top 100 search result click entities and video watch entities (e.g., YouTube video watch entities) for each user for the last month. In examples disclosed herein, entities are represented as integer identifiers (IDs) that map to a knowledge graph of all entities for the search result clicks and/or videos watched. For example, the IDs may map to Freebase knowledge graph entity IDs. That is, as used in this context, an entity corresponds to a particular node in a knowledge graph maintained by the database proprietor 102. In some examples, the total number of unique IDs in the knowledge graph may number in the tens of millions. More particularly, for example, YouTube videos are classified across roughly 20 million unique video entity IDs and Google search results are classified across roughly 25 million unique search result entity IDs. In addition to the top search result click entities and/or video watch entities, the database proprietor 102 may also provide embeddings for these entities. An embedding is a numerical representation (e.g., a vector array of values) of some class of similar objects, images, words, and the like. For example, a particular user that frequently searches for and/or views cat videos may be associated with a feature embedding representative of the class corresponding to cats. Thus, feature embeddings translate relatively high dimensional vectors of information (e.g., text strings, images, videos, etc.) into a lower dimensional space to enable the classification of different but similar objects.
In some examples, multiple embeddings may be associated with each search result click entity and/or video watch entity. Accordingly, assuming the top 100 search result entities and video watch entities are provided among the covariates and that 16-dimension embeddings are provided for each such entity, this results in a 100×16 matrix of values for each user, which may be too much data to process during generation of the demographic correction models as described above. Accordingly, in some examples, the privacy-protected cloud environment 106 implements a data modifier 135 to reduce the dimensionality of the matrix. The reduced matrix may be a more manageable size to be used as an input feature for the model generator 140 to generate the demographic correction model. The demographic correction model may be used to predict demographics (e.g., age, gender) of actual users associated with media impressions.
In some examples, reduction in the entity-embeddings matrix is accomplished by the data modifier 135 selecting the top m entities (e.g., knowledge graph entities) and the top n embeddings and converting the 2-dimensional matrix to a 1-dimensional array. In such examples, the top m entities and the top n embeddings are both hyperparameters: m represents the number of top entities to select from the matrix, and n represents the number of top embeddings to select from the matrix. For example, Table 1 below shows the matrix including entries for 4 different embeddings across 3 different knowledge graph entities (e.g., corresponding to a 3×4 matrix comparable to the 100×16 matrix discussed above) for a user (e.g., user_id=1).
| TABLE 1 |
|
| user_id | ent | emb_1 | emb_2 | emb_3 | emb_4 |
|
| 1 | 1 | 1 | 2 | 3 | 4 |
| 1 | 2 | 5 | 6 | 7 | 8 |
| 1 | 3 | 9 | 10 | 11 | 12 |
|
Assuming that m=2 (e.g., the top 2 entities are to be selected) and n=2 (e.g., the top 2 embeddings are to be selected), the 3×4 matrix represented in Table 1 may be converted to 4 columns (in general, the conversion yields m×n columns; the 100×16 matrix discussed above would likewise be reduced to an m×n-element array). The 4 columns form a 4-element array (m×n=4 for m=2 and n=2), as represented in Table 2 below.
| TABLE 2 |
|
| user_id | ent_1_emb_1 | ent_1_emb_2 | ent_2_emb_1 | ent_2_emb_2 |
|
| 1 | 1 | 2 | 5 | 6 |
|
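The top-m/top-n selection and flattening described above can be sketched as follows. This assumes the rows (entities) and columns (embeddings) of the matrix are already ordered from most to least important, so the "top" selections are simply the leading rows and columns; `flatten_top_entities` is an illustrative name, not from the disclosure:

```python
def flatten_top_entities(matrix, m, n):
    """Reduce an (entities x embeddings) matrix to a 1-D array by keeping
    the top m entities (rows) and top n embedding dimensions (columns),
    then flattening row-major. m and n are hyperparameters."""
    return [matrix[i][j] for i in range(m) for j in range(n)]

# The 3x4 matrix from Table 1 (rows = entities, columns = embeddings):
matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]

# m = 2, n = 2 yields the 4-element array shown in Table 2:
print(flatten_top_entities(matrix, m=2, n=2))  # [1, 2, 5, 6]
```

The same call with m=100 and n=16 would flatten the full 100×16 per-user matrix into a 1,600-element array, so in practice m and n are chosen much smaller.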
Additionally or alternatively, in some examples, the dimension of entities is reduced by the data modifier 135 calculating a weighted average of the embeddings across the entities. In such examples, the weights and a scale for softmax weights (e.g., a softmax weights scale) are both hyperparameters used to calculate the weighted average of the embeddings across the entities. For example, Table 3 below represents the original weight definitions for every entity:
| TABLE 3 |
|
| ent_id | weights | equal weights | ranking weights | softmax weights scale = i |
|
| 1 | w1 | =1/3 | =3/sum(3, 2, 1) | =exp(3*i)/sum(exp(3*i), exp(2*i), exp(1*i)) |
| 2 | w2 | =1/3 | =2/sum(3, 2, 1) | =exp(2*i)/sum(exp(3*i), exp(2*i), exp(1*i)) |
| 3 | w3 | =1/3 | =1/sum(3, 2, 1) | =exp(1*i)/sum(exp(3*i), exp(2*i), exp(1*i)) |
|
A simplified weight definition may be defined by multiplying the original weight by the weight denominator (e.g., simplified weight definition=(original weight)×(weight denominator)), as shown in Table 4 below. For example, the weight denominator in the "equal weights" column of Table 3 above is three; therefore, every entry in that column is multiplied by three for the simplified weight definition (e.g., simplified weight definition=(equal weight)×(3)), with the results shown in the "equal weights" column of Table 4 below. Likewise, the weight denominator in the "ranking weights" column of Table 3 above is sum(3, 2, 1); therefore, every entry in that column is multiplied by sum(3, 2, 1) (e.g., simplified weight definition=(ranking weight)×sum(3, 2, 1)), with the results shown in the "ranking weights" column of Table 4 below. Finally, the weight denominator in the "softmax weights scale=i" column of Table 3 above is sum(exp(3*i), exp(2*i), exp(1*i)); therefore, every entry in that column is multiplied by sum(exp(3*i), exp(2*i), exp(1*i)) (e.g., simplified weight definition=(softmax weight)×sum(exp(3*i), exp(2*i), exp(1*i))), with the results shown in the "softmax weights scale=i" column of Table 4 below.
| TABLE 4 |
|
| ent_id | weights | equal weights | ranking weights | softmax weights scale = i |
|
| 1 | w1 | =1 | =3 | =exp(3*i) |
| 2 | w2 | =1 | =2 | =exp(2*i) |
| 3 | w3 | =1 | =1 | =exp(1*i) |
|
Based on the above weights calculated by the data modifier 135 as shown in Tables 3 and 4, the example 3×4 matrix may be converted to 4 columns (e.g., the 100×16 matrix discussed above converted to 16 columns). The 4 columns (a single row for each user), including a single value for each embedding, are represented in the entries shown in Table 5 below.
| TABLE 5 |
|
| user_id | weighted emb 1 | weighted emb 2 | weighted emb 3 | weighted emb 4 |
|
| 1 | =(w1*1 + w2*5 + w3*9)/(w1 + w2 + w3) | =(w1*2 + w2*6 + w3*10)/(w1 + w2 + w3) | =(w1*3 + w2*7 + w3*11)/(w1 + w2 + w3) | =(w1*4 + w2*8 + w3*12)/(w1 + w2 + w3) |
|
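The weighted-average reduction of Tables 3 through 5 can be sketched as follows. The helper name `weighted_embedding_average` is assumed, and the equal, ranking, and softmax weights mirror the columns of Table 4 (using the simplified weights, since the normalization by the weight sum happens inside the function):

```python
import math

def weighted_embedding_average(matrix, weights):
    """Collapse an (entities x embeddings) matrix to a single row by taking
    a weighted average of each embedding dimension across entities, as in
    Table 5: weighted emb j = sum_i(w_i * emb_ij) / sum_i(w_i)."""
    total = sum(weights)
    n_emb = len(matrix[0])
    return [sum(w * row[j] for w, row in zip(weights, matrix)) / total
            for j in range(n_emb)]

matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]

# Equal weights (w1 = w2 = w3 = 1), per the "equal weights" column:
print(weighted_embedding_average(matrix, [1, 1, 1]))  # [5.0, 6.0, 7.0, 8.0]

# Ranking weights 3, 2, 1, per the "ranking weights" column:
print(weighted_embedding_average(matrix, [3, 2, 1]))

# Softmax weights with scale i = 1, per the "softmax weights" column:
i = 1
softmax_weights = [math.exp(rank * i) for rank in (3, 2, 1)]
print(weighted_embedding_average(matrix, softmax_weights))
```

Each choice of weighting scheme (and the softmax scale i) is a hyperparameter; larger scales concentrate the weight on the top-ranked entity.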
Additionally or alternatively, in some examples, the data modifier 135 reduces the dimension of embeddings by using a single value to represent the different embedding dimensions. Reducing the multiple dimensions to a single value may be accomplished with one or more hyperparameters. The multiple dimensions can be reduced to a single value by, for example, calculating the average of the embeddings (e.g., Average=average(1, 2, 3, 4)), selecting the maximum embedding value (e.g., Maximum=max(1, 2, 3, 4)), selecting the minimum value (e.g., Minimum=min(1, 2, 3, 4)), calculating the Manhattan distance (e.g., Manhattan Distance=sum(abs(1), abs(2), abs(3), abs(4))), calculating the Chebyshev distance (e.g., Chebyshev Distance=max(abs(1), abs(2), abs(3), abs(4))), calculating the Euclidean distance (e.g., Euclidean Distance=sum[abs(1)^2, abs(2)^2, abs(3)^2, abs(4)^2]^(1/2)), and/or calculating the Minkowski distance (e.g., Minkowski Distance=sum[abs(1)^3, abs(2)^3, abs(3)^3, abs(4)^3]^(1/3)). Example equations to solve for one entity (e.g., ent_id=1) for different hyperparameters are shown in Table 6 below.
| TABLE 6 |
|
| Hyperparameter | Equation for ent_id = 1 |
|
| Average | =average(1, 2, 3, 4) |
| Maximum | =max(1, 2, 3, 4) |
| Minimum | =min(1, 2, 3, 4) |
| Manhattan Distance | =sum(abs(1), abs(2), abs(3), abs(4)) |
| Chebyshev Distance | =max(abs(1), abs(2), abs(3), abs(4)) |
| Euclidean Distance | =sum[abs(1)^2, abs(2)^2, abs(3)^2, abs(4)^2]^(1/2) |
| Minkowski Distance | =sum[abs(1)^3, abs(2)^3, abs(3)^3, abs(4)^3]^(1/3) |
|
As a specific example, assuming the average of the embeddings is used, the 3×4 matrix represented in Table 1 may be converted to 3 columns (e.g., the 100×16 matrix discussed above converted to 100 columns). The 3 columns (a single row of three elements for each user) include a single value for each entity as represented in entries shown in Table 7 below.
| TABLE 7 |
|
| user_id | reduced emb 1 | reduced emb 2 | reduced emb 3 |
|
| 1 | =average(1, 2, 3, 4) | =average(5, 6, 7, 8) | =average(9, 10, 11, 12) |
|
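The single-value reductions enumerated in Table 6 can be sketched as a reducer selected by a hyperparameter (the function and method names are illustrative; the "minkowski3" entry fixes the Minkowski order at 3, matching the example in Table 6):

```python
def reduce_embeddings(row, method="average"):
    """Reduce one entity's embedding vector to a single value. The choice
    of reduction is treated as a hyperparameter, mirroring Table 6."""
    reducers = {
        "average":   lambda v: sum(v) / len(v),
        "maximum":   lambda v: max(v),
        "minimum":   lambda v: min(v),
        "manhattan": lambda v: sum(abs(x) for x in v),
        "chebyshev": lambda v: max(abs(x) for x in v),
        "euclidean": lambda v: sum(abs(x) ** 2 for x in v) ** (1 / 2),
        "minkowski3": lambda v: sum(abs(x) ** 3 for x in v) ** (1 / 3),
    }
    return reducers[method](row)

matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]

# Averaging reproduces Table 7: one value per entity.
print([reduce_embeddings(row, "average") for row in matrix])
# [2.5, 6.5, 10.5]
```

Applied to the 100×16 per-user matrix, this reduction yields 100 columns, one per entity, regardless of which reducer is chosen.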
In some examples, the number of unique entities (e.g., search result clicks and/or videos watched) represented in the covariates for a particular user may be less than the total number of different entities the database proprietor 102 provides. For example, the database proprietor 102 may provide the top 100 entities for each panelist. However, a particular user may only be associated with 87 different entities, thereby resulting in 13 null entities. When the number of null entities is non-zero, it is possible to infer the number of different topics the particular user is interested in (e.g., corresponding to the number of different non-null entities associated with the user). In some examples, assuming the database proprietor 102 provides the top 100 entities for the particular user, the number of different topics the particular user is interested in may be in the range [0, 100]. Similarly, not all entities have embeddings (e.g., if a particular search result click entity and/or video watch entity has not been clicked/viewed by a sufficient number of users) and may instead be represented by a null struct or null embeddings. The number of null embeddings can be used to infer how many rare topics the particular user is interested in. In examples disclosed herein, a rare topic is an entity (e.g., topic) that includes no embeddings. In some examples, assuming the database proprietor 102 provides the top 100 entities for the particular user, the number of rare topics the particular user is interested in may be in the range [0, 100] or NULL (e.g., when the particular user does not have any entities). In some examples, the number of null entities (and/or the number of non-null entities) and/or the number of null embeddings for a particular user may serve as additional input features for the demographic correction model generation process.
For example, as represented in Table 8 below, user 1 has a normal embedding array for entity 1, an array of nulls associated with entity 2 (e.g., corresponding to 1 null embedding), and no entity identified for entity 3 (e.g., corresponding to 1 null entity).
| TABLE 8 |
|
| user_id | ent | ent_id | emb_0 | emb_1 | emb_2 | emb_3 |
|
| 1 | 1 | A | 1 | 2 | 3 | 4 |
| 1 | 2 | B | null | null | null | null |
| 1 | 3 | | | | | |
|
In this example, two additional columns may be added to reflect the null counts as represented in Table 9 below.
| TABLE 9 |
|
| user_id | count_null_entities | count_null_embeddings |
|
| 1 | 1 | 1 |
|
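The null-count features of Tables 8 and 9 can be derived as sketched below. This is a hedged illustration, assuming a simple row-per-entity representation; the function name `count_nulls` and the dictionary layout are hypothetical, not part of the disclosure.

```python
def count_nulls(user_rows, expected_entities):
    """Count null entities (entity slots with no entity identified) and
    null embeddings (entities whose embedding values are all null),
    as illustrated in Tables 8 and 9."""
    null_entities = expected_entities - len(user_rows)
    null_embeddings = sum(
        1 for row in user_rows
        if row["embeddings"] and all(v is None for v in row["embeddings"])
    )
    return {"count_null_entities": null_entities,
            "count_null_embeddings": null_embeddings}

# Example mirroring Table 8: entity A has a normal embedding array,
# entity B has all-null embeddings, and the third entity slot is empty.
rows = [
    {"ent_id": "A", "embeddings": [1, 2, 3, 4]},
    {"ent_id": "B", "embeddings": [None, None, None, None]},
]
print(count_nulls(rows, expected_entities=3))
# {'count_null_entities': 1, 'count_null_embeddings': 1}
```

The two resulting counts correspond to the extra columns of Table 9 and could serve as additional model input features as described above.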
In some examples, a process is implemented to track different demographic correction model experiments over time to achieve high quality (e.g., accurate) models and also for auditing purposes. Accomplishing this objective within the context of the privacy-protected cloud environment 106 presents several unique challenges because the model features (e.g., inputs and hyperparameters) and model performance (e.g., accuracy) are stored separately to satisfy the privacy constraints of the environment.
In some examples, a model analyzer 144 may implement and/or use one or more demographic correction models to generate predictions and/or inferences as to the actual demographics (e.g., actual ages) of users associated with media impressions logged by the database proprietor 102. That is, in some examples, as shown in FIG. 1, the model analyzer 144 uses one or more of the demographic correction models in the demographic correction models database 142 to analyze the impressions in the enriched impressions database 120 that were matched to a particular user of the database proprietor 102. The inferred demographic (e.g., age) for each user may be stored in a model inferences database 146 for subsequent use, retrieval, and/or analysis. Additionally or alternatively, in some examples, the model analyzer 144 uses one or more of the demographic correction models in the demographic correction models database 142 to analyze the entire user base of the database proprietor regardless of whether the users are matched to any particular media impressions. After inferring the correct demographic (e.g., age) for each user, the inferences are stored in the model inferences database 146. In some such examples, when the users matched to particular impressions are to be analyzed (e.g., the users matched to impressions in the enriched impressions database 120), the model analyzer 144 merely extracts the inferred demographic assignment for each relevant user in the enriched impressions database 120 that matches with one or more media impressions.
As described above, in some examples, the database proprietor 102 may identify a particular user as corresponding to a particular impression based on the user being signed into the database proprietor 102. However, there are circumstances where the individual corresponding to the user account is not the actual person that was exposed to the relevant media. Accordingly, the corrected demographic (e.g., age) inferred for the user associated with the signed-in user account may not be the correct demographic of the actual person to which a particular media impression should be attributed. In other words, whereas the AME panelist data and the database proprietor impressions data are matched at the impression level, demographic correction is implemented at the user level. Therefore, before generating the demographic correction model, a method to reduce logged impressions to individual users is first implemented so that the demographic correction model can be reliably applied.
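The impression-to-user reduction described above can be sketched as follows. This is a minimal sketch under the assumption that each logged impression carries a `user_id`; the function name `impressions_to_users` and the record layout are hypothetical, not part of the disclosure.

```python
def impressions_to_users(impressions):
    """Collapse impression-level records to user-level records so that
    demographic correction can be applied once per user rather than
    once per logged impression."""
    users = {}
    for imp in impressions:
        record = users.setdefault(
            imp["user_id"],
            {"user_id": imp["user_id"], "impression_count": 0},
        )
        record["impression_count"] += 1
    return list(users.values())

# Three impressions from two users reduce to two user-level records.
imps = [{"user_id": 1}, {"user_id": 1}, {"user_id": 2}]
print(impressions_to_users(imps))
# [{'user_id': 1, 'impression_count': 2}, {'user_id': 2, 'impression_count': 1}]
```

After this reduction, a demographic correction model operating at the user level can assign one corrected demographic per user, which is then joined back to that user's impressions.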
With inferences made to correct inaccurate demographic information of database proprietor users (e.g., falsified self-declared ages) and stored in the model inferences database 146, the AME 104 may be interested in extracting audience measurement metrics based on the corrected data. However, as mentioned above, the data contained inside the privacy-protected cloud environment 106 is subject to privacy constraints. In some examples, the privacy constraints ensure that the data can only be extracted for review and/or analysis in aggregate so as to protect the privacy of any particular individual represented in the data (e.g., a panelist of the AME 104 and/or a registered user of the database proprietor 102). Accordingly, in some examples, a data aggregator 148 aggregates the audience measurement data associated with particular media campaigns before the data is provided to an aggregated campaign data database 150 in the AME output data store 138 of the AME proprietary cloud environment 126.
The data aggregator 148 may aggregate data in different ways for different types of audience measurement metrics. For instance, at the highest level, the aggregated data may provide the total impression count and total number of users (e.g., estimated audience size) exposed to the media 108 for a particular media campaign. As mentioned above, the total number of users reported by the data aggregator 148 is based on the total number of unique user accounts matched to impressions but does not include the individuals associated with impressions that were not matched to a particular user (e.g., non-coverage). However, the total number of unique user accounts does not account for the fact that a single individual may correspond to more than one user account (e.g., multi-account users), and does not account for situations where a person other than a signed-in user was exposed to the media 108 (e.g., misattribution). These errors in the aggregated data may be corrected based on the adjustment factors stored in the adjustment factors database 136. Further, in some examples, the aggregated data may include an indication of the demographic composition of the users represented in the aggregated data (e.g., number of males versus females, number of users in different age brackets, etc.).
Additionally or alternatively, in some examples, the data aggregator 148 may provide aggregated data that is associated with a particular aspect of a media campaign. For instance, the data may be aggregated based on particular sites (e.g., all media impressions served on YouTube.com). In other examples, the data may be aggregated based on placement information (e.g., aggregated based on particular primary content videos accessed by users when the media advertisement was served). In other examples, the data may be aggregated based on device type (e.g., impressions served via a desktop computer versus impressions served via a mobile device). In other examples, the data may be aggregated based on a combination of one or more of the above factors and/or based on any other relevant factor(s).
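The per-aspect aggregation described above can be sketched as follows. This is an illustrative sketch only, assuming impression records keyed by `user_id` plus an aspect field such as `device`; the function name `aggregate_impressions` is hypothetical, not part of the disclosure.

```python
from collections import defaultdict

def aggregate_impressions(impressions, key):
    """Aggregate matched impressions into total impression counts and
    unique-user (audience) counts per value of the chosen aspect key
    (e.g., site, placement, or device type)."""
    totals = defaultdict(lambda: {"impressions": 0, "users": set()})
    for imp in impressions:
        bucket = totals[imp[key]]
        bucket["impressions"] += 1
        bucket["users"].add(imp["user_id"])
    # Replace the user sets with audience counts for reporting.
    return {k: {"impressions": v["impressions"], "audience": len(v["users"])}
            for k, v in totals.items()}

impressions = [
    {"user_id": 1, "device": "desktop"},
    {"user_id": 1, "device": "mobile"},
    {"user_id": 2, "device": "mobile"},
]
print(aggregate_impressions(impressions, "device"))
# {'desktop': {'impressions': 1, 'audience': 1}, 'mobile': {'impressions': 2, 'audience': 2}}
```

Note that the audience count here is unique user accounts, which, as described above, would still need adjustment factors applied to correct for multi-account users and misattribution.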
In some examples, the privacy constraints imposed on the data within the privacy-protected cloud environment 106 include a limitation that data cannot be extracted (even when aggregated) for less than a threshold number of individuals (e.g., 50 individuals). Accordingly, if the particular metric being sought includes less than the threshold number of individuals, the data aggregator 148 will not provide such data. For instance, if the threshold number of individuals is 50 but there are only 46 females in the age range of 18-25 that were exposed to particular media 108, the data aggregator 148 would not provide the aggregate data for females in the 18-25 age bracket. Such privacy constraints can leave gaps in the audience measurement metrics, particularly in locations where the number of panelists is relatively small. Accordingly, in some examples, when audience measurement data is not available for a particular demographic segment of interest in a particular region (e.g., a particular country), the audience measurement metrics in one or more comparable region(s) may be used to impute the metrics for the missing data in the first region of interest. In some examples, the particular metrics imputed from comparable regions are based on a comparison of audience metrics for which data is available in both regions. For instance, while data for females in the 18-25 age bracket may be unavailable, assume that data for females in the 26-35 age bracket is available. The metrics associated with the 26-35 age bracket in the region of interest may be compared with metrics for the 26-35 age bracket in other regions, and the regions with the closest metrics to the region of interest may be selected for use in calculating imputation factor(s).
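The minimum-audience suppression rule described above can be sketched as follows. This is a hedged sketch; the threshold of 50 matches the example in the text, while the function name `suppress_small_cells` and the segment-keyed dictionary layout are assumptions for illustration.

```python
AUDIENCE_THRESHOLD = 50  # example threshold from the text (e.g., 50 individuals)

def suppress_small_cells(aggregates, threshold=AUDIENCE_THRESHOLD):
    """Withhold any aggregate whose audience falls below the privacy
    threshold, so only sufficiently large segments are extracted."""
    return {segment: data for segment, data in aggregates.items()
            if data["audience"] >= threshold}

# Mirrors the example in the text: 46 females aged 18-25 falls below the
# threshold of 50 and is suppressed; the 26-35 bracket is released.
aggregates = {
    ("F", "18-25"): {"audience": 46, "impressions": 310},
    ("F", "26-35"): {"audience": 120, "impressions": 890},
}
print(suppress_small_cells(aggregates))
# {('F', '26-35'): {'audience': 120, 'impressions': 890}}
```

Segments suppressed in this way are the gaps that may later be filled by imputation from comparable regions, as described above. The impression counts shown are hypothetical.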
As shown in the illustrated example, both the adjustment factors database 136 and the aggregated campaign data database 150 are included within the AME output data store 138 of the AME proprietary cloud environment 126. As mentioned above, in some examples, the AME proprietary cloud environment 126 is provided by the database proprietor 102 and enables data to be provided to and retrieved from the privacy-protected cloud environment. In some examples, the aggregated campaign data and the adjustment factors are subsequently transferred to a separate computing apparatus 152 of the AME 104 for analysis by an audience metrics analyzer 154. In some examples, the separate computing apparatus may be omitted, with its functionality provided by the AME proprietary cloud environment 126. In other examples, the AME proprietary cloud environment 126 may be omitted, with the adjustment factors and the aggregated data provided directly to the computing apparatus 152. Further, in this example, the AME panel data database 122 is within the AME first party data store 124, which is shown as being separate from the AME output data store 138. However, in other examples, the AME first party data store 124 and the AME output data store 138 may be combined.
In the illustrated example of FIG. 1, the audience metrics analyzer 154 applies the adjustment factors to the aggregated data to correct for errors in the data including misattribution, non-coverage, and multi-account users. The output of the audience metrics analyzer 154 corresponds to the final calibrated data of the AME 104 and is stored in a final calibrated data database 156. In this example, the computing apparatus 152 also includes a report generator 158 to generate reports based on the final calibrated data.
While an example manner of implementing the privacy-protected cloud environment 106 of FIG. 1 is illustrated in FIG. 1, one or more of the elements, processes and/or devices illustrated in FIG. 1 may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example campaign impressions database 116, the example matchable impressions database 118, the example enriched campaign impressions database 120, the example data matching analyzer 128, the example AME intermediary merged data database 130, the example AME privacy-protected data store 132, the example adjustment factor analyzer 134, the example data modifier 135, the example model generator 140, the example demographic correction models database 142, the example model analyzer 144, the example model inferences database 146, the example data aggregator 148 and/or, more generally, the example privacy-protected cloud environment 106 of FIG. 1 may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware.
Thus, for example, any of the example campaign impressions database 116, the example matchable impressions database 118, the example enriched campaign impressions database 120, the example data matching analyzer 128, the example AME intermediary merged data database 130, the example AME privacy-protected data store 132, the example adjustment factor analyzer 134, the example data modifier 135, the example model generator 140, the example demographic correction models database 142, the example model analyzer 144, the example model inferences database 146, the example data aggregator 148 and/or, more generally, the example privacy-protected cloud environment 106 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), programmable controller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)) and/or field programmable logic device(s) (FPLD(s)). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example campaign impressions database 116, the example matchable impressions database 118, the example enriched campaign impressions database 120, the example data matching analyzer 128, the example AME intermediary merged data database 130, the example AME privacy-protected data store 132, the example adjustment factor analyzer 134, the example data modifier 135, the example model generator 140, the example demographic correction models database 142, the example model analyzer 144, the example model inferences database 146, and/or the example data aggregator 148 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc. including the software and/or firmware. Further still, the example privacy-protected cloud environment 106 of FIG. 1 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIG. 1, and/or may include more than one of any or all of the illustrated elements, processes and devices. As used herein, the phrase "in communication," including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.
A flowchart representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing aspects of the privacy-protected cloud environment 106 of FIG. 1 is shown in FIG. 2. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor 312 shown in the example processor platform 300 discussed below in connection with FIG. 3. The program may be embodied in software stored on a non-transitory computer readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor 312, but the entire program and/or parts thereof could alternatively be executed by a device other than the processor 312 and/or embodied in firmware or dedicated hardware. Further, although the example program is described with reference to the flowchart illustrated in FIG. 2, many other methods of implementing the example privacy-protected cloud environment 106 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware. The processor circuitry may be distributed in different network locations and/or local to one or more devices (e.g., a multi-core processor in a single machine, multiple processors distributed across a server rack, etc.).
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement one or more functions that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example processes of FIG. 2 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. 
Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” item, as used herein, refers to one or more of that item. The terms “a” (or “an”), “one or more”, and “at least one” can be used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., a single unit or processor. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
FIG. 2 is a flowchart representative of example machine readable instructions 200 that may be executed by a processor (e.g., the processor 312 of FIG. 3) to implement the example data modifier 135, the example model generator 140, and/or the example model analyzer 144 of FIG. 1 to reduce the dimensionality of a matrix associated with entities and embeddings. The example instructions 200 of FIG. 2 begin at block 202, at which the data modifier 135 obtains a first matrix including data associated with entities and embeddings. The first matrix may correspond to a user and may be obtained from the AME intermediary merged data database 130 (FIG. 1). For example, the first matrix may be a 100×16 matrix including 100 entities (e.g., search result entities and/or video watch entities) and 16 dimension embeddings (e.g., classifications of searches entered and/or videos viewed by a user) for each entity.
At block 204, the example data modifier 135 generates a second matrix by reducing the data in the first matrix to second data. For example, the data in the first matrix may be more data than the model generator 140 (FIG. 1) can efficiently process while generating a demographic correction model. Therefore, the second matrix is generated at a more manageable size to be used as an input feature for the model generator 140 to generate the demographic correction model. The generation of the second matrix may be based on performing a reduction technique. In some examples, the data modifier 135 implements the reduction technique of block 204 by selecting the top m entities (e.g., knowledge graph entities) and the top n embeddings and converting the 2-dimensional matrix to a 1-dimensional array (e.g., as described in connection with Tables 1 and 2 above). In other examples, the data modifier 135 implements the reduction technique of block 204 by calculating weighted averages of the embeddings associated with the entities (e.g., as described in connection with Tables 3-5 above). In yet other examples, the data modifier 135 implements the reduction technique of block 204 by calculating values of an average, a Manhattan distance, a Chebyshev distance, a Euclidean distance, or a Minkowski distance of the embeddings associated with the entities (e.g., as described in connection with Tables 6 and 7 above).
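The first reduction technique named above (selecting the top m entities and top n embeddings, then flattening to a 1-dimensional array) can be sketched as follows. This is a minimal sketch under the assumption that the matrix rows are already ordered so that the first m rows are the "top" entities; the function name `flatten_top` is hypothetical.

```python
def flatten_top(matrix, m, n):
    """Select the top m entity rows and first n embedding columns of a
    2-D entity-by-embedding matrix and flatten the result into a
    1-D feature array (one reduction technique for block 204)."""
    return [value for row in matrix[:m] for value in row[:n]]

# A hypothetical 3x4 entity-by-embedding matrix reduced to a flattened
# 2x2 selection (top 2 entities, first 2 embedding dimensions).
matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]
print(flatten_top(matrix, m=2, n=2))  # [1, 2, 5, 6]
```

The same pattern scales to the 100×16 matrix described above, e.g., reducing it to a flattened array of the top m entities' first n embedding dimensions before it is stored as a model input feature.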
At block 206, the example data modifier 135 stores the second matrix in memory as an input feature. For example, the example data modifier 135 stores the second matrix in memory when the second matrix satisfies the size to be used as an input feature by the model generator 140 to generate one or more demographic correction models. At block 208, the example model generator 140 generates a demographic correction model based on the second matrix as the input feature. Alternatively, the demographic correction model may be generated based on a plurality of input features. In such examples, the second matrix may be one of several input features. For example, the model generator 140 may generate the demographic correction model by utilizing the one or more input features to train and validate the demographic correction model. The demographic correction model may be utilized to correct demographic information from the database proprietor 102 associated with impressions.
At block 210, the example model generator 140 stores the demographic correction model in the example demographic correction models database 142. For example, the example model generator 140 may store a plurality of demographic correction models in the demographic correction models database 142 for different demographics. In some examples, a first demographic correction model corrects age information from the database proprietor 102 associated with impressions. In other examples, a second demographic correction model corrects gender information from the database proprietor 102 associated with impressions. The example model analyzer 144 (FIG. 1) can access different ones of the demographic correction models from the demographic correction models database 142 based on the type of demographic information that is to be corrected.
At block 212, the example model analyzer 144 applies the demographic correction model from the demographic correction models database 142 to predict demographics associated with impressions. In some examples, a media impression is already associated with demographics based on a user signed into the database proprietor 102. However, there are cases where the demographics reported by the user are incorrect. For example, a user may self-report as a male between the ages of 30 and 34, whereas the actual demographics of the user are a male between the ages of 40 and 44. In such cases, the demographics (e.g., of a signed-in subscriber of the database proprietor 102) do not correspond to the actual user exposed to the media. The actual demographics may be predicted by applying one or more demographic correction models. For example, the model analyzer 144 applies the one or more demographic correction models to predict demographics of the signed-in subscriber of the database proprietor 102 to correct errors such as inaccurate self-reported demographics corresponding to a subscriber of a user account associated with one or more impressions in the example enriched media campaign impressions database 120. Additionally, in some examples, the demographic correction models may predict the actual demographics for circumstances where the individual corresponding to the user account is not the actual user exposed to the relevant media associated with the media impression (e.g., if another user such as a family member or friend is borrowing/using the device).
At block 214, the example model analyzer 144 stores the resulting data generated by the demographic correction model in the example model inferences database 146 (FIG. 1). For example, the model analyzer 144 stores the predicted demographics of a signed-in subscriber of the database proprietor 102 associated with impressions in the model inferences database 146 so that the correct demographics (e.g., the predicted demographics) corresponding to the impressions can be correctly aggregated.
At block 216, the example data modifier 135 determines whether there is another matrix to reduce. In one example, the data modifier 135 analyzes the AME intermediary merged data database 130 to determine whether there is another matrix corresponding to a different user. If the data modifier 135 determines there is another matrix to reduce (e.g., block 216 returns a result of "YES"), the data modifier 135 returns to block 202. If the data modifier 135 determines there is not another matrix to reduce (e.g., block 216 returns a result of "NO"), the example instructions 200 of FIG. 2 terminate.
FIG. 3 is a block diagram of an example processor platform 300 structured to execute the instructions of FIG. 2 to implement the privacy-protected cloud environment 106 of FIG. 1. The processor platform 300 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), or any other type of computing device.
The processor platform 300 of the illustrated example includes a processor 312. The processor 312 of the illustrated example is hardware. For example, the processor 312 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In this example, the processor implements the example data matching analyzer 128, the example AME privacy-protected data store 132, the example adjustment factor analyzer 134, the example data modifier 135, the example model generator 140, the example model analyzer 144, and/or the example data aggregator 148.
The processor 312 of the illustrated example includes a local memory 313 (e.g., a cache). The processor 312 of the illustrated example is in communication with a main memory including a volatile memory 314 and a non-volatile memory 316 via a bus 318. The volatile memory 314 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®) and/or any other type of random access memory device. The non-volatile memory 316 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 314, 316 is controlled by a memory controller.
The processor platform 300 of the illustrated example also includes an interface circuit 320. The interface circuit 320 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.
In the illustrated example, one or more input devices 322 are connected to the interface circuit 320. The input device(s) 322 permit(s) a user to enter data and/or commands into the processor 312. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.
One or more output devices 324 are also connected to the interface circuit 320 of the illustrated example. The output devices 324 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube display (CRT), an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer and/or speaker. The interface circuit 320 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.
The interface circuit 320 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 326. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.
The processor platform 300 of the illustrated example also includes one or more mass storage devices 328 for storing software and/or data. Examples of such mass storage devices 328 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.
The machine executable instructions 332 of FIG. 2 may be stored in the mass storage device 328, in the volatile memory 314, in the non-volatile memory 316, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.
From the foregoing, it will be appreciated that example methods, apparatus and articles of manufacture have been disclosed that enable the generation of accurate and reliable audience measurement metrics for Internet-based media without the use of the third-party cookies and/or tags that have been the standard approach to monitoring Internet media for many years. This is accomplished by merging AME panel data with database proprietor impressions data within a privacy-protected cloud-based environment. The nature of the cloud environment and the privacy constraints imposed thereon, as well as the manner in which the database proprietor collects the database proprietor impressions data, present technological challenges contributing to limitations in the reliability and/or completeness of the data. However, examples disclosed herein overcome these difficulties by generating adjustment factors and/or machine learning models based on the AME panel data.
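As one illustrative sketch of the adjustment-factor approach described above, a panel-derived correction ratio per demographic bucket might be applied to the database proprietor's logged counts. The bucket names, the counts, and the simple ratio form below are invented for illustration and are not the disclosed implementation.

```python
# Hypothetical example counts per demographic bucket (all values invented).
panel_assigned = {"F18-34": 120, "M18-34": 100}    # impressions the proprietor credited to panelists
panel_truth = {"F18-34": 150, "M18-34": 90}        # the AME panel's own measurement of those events
proprietor_total = {"F18-34": 300, "M18-34": 400}  # full proprietor-logged impression counts

# Adjustment factor: panel truth relative to what the proprietor credited.
adjustment = {k: panel_truth[k] / panel_assigned[k] for k in panel_assigned}

# Corrected totals apply the panel-based factor to the proprietor's counts.
corrected = {k: proprietor_total[k] * adjustment[k] for k in proprietor_total}
print(corrected)  # {'F18-34': 375.0, 'M18-34': 360.0}
```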
Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.
Example methods and apparatus to generate audience metrics using third-party privacy-protected cloud environments are disclosed herein. Further examples and combinations thereof include the following:
Example 1 includes a non-transitory computer readable medium including instructions that when executed cause at least one processor to obtain a first matrix, the first matrix including first data indicative of entities and embeddings, the entities representative of at least one of search result clicks or videos watched, the embeddings representative of at least one of first classifications of the search result clicks or second classifications of the videos watched, generate a second matrix by reducing the first data in the first matrix to second data that satisfies a size corresponding to an input feature, store the second matrix in first memory as the input feature, and generate a demographic correction model based on the second matrix as the input feature, the demographic correction model to correct demographics corresponding to impressions logged in second memory.
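The operations recited in Example 1 can be sketched end to end: obtain an entity-by-embedding matrix, reduce it to a size matching the model's input feature, and fit a model on the result. The column-truncation reduction and the least-squares stand-in for the demographic correction model are assumptions for illustration only, not the claimed implementation.

```python
import numpy as np

def reduce_matrix(first_matrix: np.ndarray, feature_size: int) -> np.ndarray:
    # Hypothetical reduction: keep the first `feature_size` embedding
    # columns so the second matrix satisfies the input-feature size.
    return first_matrix[:, :feature_size]

rng = np.random.default_rng(0)
# First matrix: rows are entities (e.g., search result clicks or videos
# watched), columns are embedding dimensions (classifications).
first_matrix = rng.normal(size=(5, 8))

second_matrix = reduce_matrix(first_matrix, feature_size=3)

# Stand-in "demographic correction model": a least-squares fit from the
# reduced features to a demographic label (purely illustrative).
labels = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
weights, *_ = np.linalg.lstsq(second_matrix, labels, rcond=None)
print(second_matrix.shape, weights.shape)  # (5, 3) (3,)
```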
Example 2 includes the non-transitory computer readable medium of example 1, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by selecting a first entity from the entities and first and second embeddings from the embeddings, and generating the second matrix to include a first entry and a second entry, the first entry storing a first value of the first embedding associated with the first entity, the second entry storing a second value of the second embedding associated with the first entity.
Example 3 includes the non-transitory computer readable medium of example 1, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by calculating weighted averages of the embeddings associated with the entities, the weighted averages including a first weighted average based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing a value of the first weighted average associated with the first entity.
Example 4 includes the non-transitory computer readable medium of example 1, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by calculating values of an average, a Manhattan distance, a Chebyshev distance, a Euclidean distance, or a Minkowski distance of the embeddings associated with the entities, the values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
Example 5 includes the non-transitory computer readable medium of example 1, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by selecting at least one of maximum values or minimum values of the embeddings associated with the entities, the at least one of the maximum values or the minimum values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
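Examples 2 through 5 describe alternative reduction techniques: selecting particular embeddings, computing weighted averages, computing distance-style values (Manhattan, Chebyshev, Euclidean, or Minkowski), and selecting maxima/minima. A minimal sketch of each, using invented values and, for the distances, the origin as an assumed reference point:

```python
import numpy as np

embeddings = np.array([[0.2, 1.5, -0.3, 0.8],
                       [1.1, -0.4, 0.6, 0.0]])  # 2 entities x 4 embedding dims

# Example 2: select particular embeddings for a first entity.
selected = embeddings[0, [1, 3]]  # first and second chosen embeddings

# Example 3: weighted average of a first entity's embeddings.
w = np.array([0.1, 0.4, 0.2, 0.3])
weighted_avg = float(embeddings[0] @ w)

# Example 4: distance-style values of a first entity's embeddings.
manhattan = float(np.abs(embeddings[0]).sum())
chebyshev = float(np.abs(embeddings[0]).max())
euclidean = float(np.linalg.norm(embeddings[0]))
p = 3
minkowski = float((np.abs(embeddings[0]) ** p).sum() ** (1 / p))

# Example 5: maximum/minimum embedding values per entity.
maxima = embeddings.max(axis=1)
minima = embeddings.min(axis=1)
```

Each technique collapses the variable-width embedding data into fixed-size entries, which is what lets the second matrix satisfy the input-feature size required by the model.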
Example 6 includes the non-transitory computer readable medium of example 1, wherein the first data corresponds to a user access to media, the media associated with the entities and the embeddings.
Example 7 includes the non-transitory computer readable medium of example 1, wherein the entities include at least one of top search result click entities or video watch entities.
Example 8 includes the non-transitory computer readable medium of example 1, wherein the embeddings include classifications of at least one of Internet searches requested by a user or media accessed by the user.
Example 9 includes the non-transitory computer readable medium of example 1, wherein the entities are represented using integer identifiers that map to a knowledge graph.
Example 10 includes the non-transitory computer readable medium of example 1, wherein the embeddings are represented as a numerical representation of a class of at least one of objects, images, or words.
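Examples 9 and 10 note that entities may be integer identifiers mapping to a knowledge graph, and that embeddings are numerical representations of classes of objects, images, or words. A toy sketch of that mapping (the node labels, identifiers, and vectors are all invented):

```python
# Hypothetical knowledge graph: integer entity identifiers keyed to nodes,
# each node carrying an embedding (a numerical representation of a class
# of objects, images, or words).
knowledge_graph = {
    101: {"label": "sports video", "embedding": [0.9, 0.1, 0.0]},
    205: {"label": "news search",  "embedding": [0.1, 0.8, 0.2]},
}

# Logged activity (e.g., video watches and search result clicks) arrives
# as entity identifiers; resolving them yields rows of the first matrix.
entity_ids = [101, 205, 101]
first_matrix = [knowledge_graph[e]["embedding"] for e in entity_ids]
print(first_matrix[0])  # [0.9, 0.1, 0.0]
```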
Example 11 includes an apparatus including a data modifier to obtain a first matrix, the first matrix including first data indicative of entities and embeddings, the entities representative of at least one of search result clicks or videos watched, the embeddings representative of at least one of first classifications of the search result clicks or second classifications of the videos watched, generate a second matrix by reducing the first data in the first matrix to second data that satisfies a size corresponding to an input feature, and store the second matrix in first memory as the input feature, and a model generator to generate a demographic correction model based on the second matrix as the input feature, the demographic correction model to correct demographics corresponding to impressions logged in second memory.
Example 12 includes the apparatus of example 11, wherein the data modifier is to generate the second matrix based on performing a reduction technique, the data modifier to perform the reduction technique by selecting a first entity from the entities and first and second embeddings from the embeddings, and generating the second matrix to include a first entry and a second entry, the first entry storing a first value of the first embedding associated with the first entity, the second entry storing a second value of the second embedding associated with the first entity.
Example 13 includes the apparatus of example 11, wherein the data modifier is to generate the second matrix based on performing a reduction technique, the data modifier to perform the reduction technique by calculating weighted averages of the embeddings associated with the entities, the weighted averages including a first weighted average based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing a value of the first weighted average associated with the first entity.
Example 14 includes the apparatus of example 11, wherein the data modifier is to generate the second matrix based on performing a reduction technique, the data modifier to perform the reduction technique by calculating values of an average, a Manhattan distance, a Chebyshev distance, a Euclidean distance, or a Minkowski distance of the embeddings associated with the entities, the values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
Example 15 includes the apparatus of example 11, wherein the data modifier is to generate the second matrix based on performing a reduction technique, the data modifier to perform the reduction technique by selecting at least one of maximum values or minimum values of the embeddings associated with the entities, the at least one of the maximum values or the minimum values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
Example 16 includes the apparatus of example 11, wherein the first data corresponds to a user access to media, the media associated with the entities and the embeddings.
Example 17 includes the apparatus of example 11, wherein the entities include at least one of top search result click entities or video watch entities.
Example 18 includes the apparatus of example 11, wherein the embeddings include classifications of at least one of Internet searches requested by a user or media accessed by the user.
Example 19 includes the apparatus of example 11, wherein the entities are represented using integer identifiers that map to a knowledge graph.
Example 20 includes the apparatus of example 11, wherein the embeddings are represented as a numerical representation of a class of at least one of objects, images, or words.
Example 21 includes an apparatus including at least one memory, instructions, and at least one processor to execute the instructions to at least obtain a first matrix, the first matrix including first data indicative of entities and embeddings, the entities representative of at least one of search result clicks or videos watched, the embeddings representative of at least one of first classifications of the search result clicks or second classifications of the videos watched, generate a second matrix by reducing the first data in the first matrix to second data that satisfies a size corresponding to an input feature, store the second matrix in first memory as the input feature, and generate a demographic correction model based on the second matrix as the input feature, the demographic correction model to correct demographics corresponding to impressions logged in second memory.
Example 22 includes the apparatus of example 21, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by selecting a first entity from the entities and first and second embeddings from the embeddings, and generating the second matrix to include a first entry and a second entry, the first entry storing a first value of the first embedding associated with the first entity, the second entry storing a second value of the second embedding associated with the first entity.
Example 23 includes the apparatus of example 21, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by calculating weighted averages of the embeddings associated with the entities, the weighted averages including a first weighted average based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing a value of the first weighted average associated with the first entity.
Example 24 includes the apparatus of example 21, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by calculating values of an average, a Manhattan distance, a Chebyshev distance, a Euclidean distance, or a Minkowski distance of the embeddings associated with the entities, the values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
Example 25 includes the apparatus of example 21, wherein the at least one processor is to generate the second matrix based on performing a reduction technique, the at least one processor to perform the reduction technique by selecting at least one of maximum values or minimum values of the embeddings associated with the entities, the at least one of the maximum values or the minimum values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
Example 26 includes the apparatus of example 21, wherein the first data corresponds to a user access to media, the media associated with the entities and the embeddings.
Example 27 includes the apparatus of example 21, wherein the entities include at least one of top search result click entities or video watch entities.
Example 28 includes the apparatus of example 21, wherein the embeddings include classifications of at least one of Internet searches requested by a user or media accessed by the user.
Example 29 includes the apparatus of example 21, wherein the entities are represented using integer identifiers that map to a knowledge graph.
Example 30 includes the apparatus of example 21, wherein the embeddings are represented as a numerical representation of a class of at least one of objects, images, or words.
Example 31 includes a method including obtaining, by executing an instruction with a processor, a first matrix, the first matrix including first data indicative of entities and embeddings, the entities representative of at least one of search result clicks or videos watched, the embeddings representative of at least one of first classifications of the search result clicks or second classifications of the videos watched, generating, by executing an instruction with the processor, a second matrix by reducing the first data in the first matrix to second data that satisfies a size corresponding to an input feature, storing, by executing an instruction with the processor, the second matrix in first memory as the input feature, and generating, by executing an instruction with the processor, a demographic correction model based on the second matrix as the input feature, the demographic correction model to correct demographics corresponding to impressions logged in second memory.
Example 32 includes the method of example 31, wherein the generating of the second matrix is based on performing a reduction technique, the performing of the reduction technique including selecting a first entity from the entities and first and second embeddings from the embeddings, and generating the second matrix to include a first entry and a second entry, the first entry storing a first value of the first embedding associated with the first entity, the second entry storing a second value of the second embedding associated with the first entity.
Example 33 includes the method of example 31, wherein the generating of the second matrix is based on performing a reduction technique, the performing of the reduction technique including calculating weighted averages of the embeddings associated with the entities, the weighted averages including a first weighted average based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing a value of the first weighted average associated with the first entity.
Example 34 includes the method of example 31, wherein the generating of the second matrix is based on performing a reduction technique, the performing of the reduction technique including calculating values of an average, a Manhattan distance, a Chebyshev distance, a Euclidean distance, or a Minkowski distance of the embeddings associated with the entities, the values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
Example 35 includes the method of example 31, wherein the generating of the second matrix is based on performing a reduction technique, the performing of the reduction technique including selecting at least one of maximum values or minimum values of the embeddings associated with the entities, the at least one of the maximum values or the minimum values including a first value based on a first entity from the entities and ones of the embeddings, and generating the second matrix to include an entry storing the first value associated with the first entity.
Example 36 includes the method of example 31, wherein the first data corresponds to a user access to media, the media associated with the entities and the embeddings.
Example 37 includes the method of example 31, wherein the entities include at least one of top search result click entities or video watch entities.
Example 38 includes the method of example 31, wherein the embeddings include classifications of at least one of Internet searches requested by a user or media accessed by the user.
Example 39 includes the method of example 31, wherein the entities are represented using integer identifiers that map to a knowledge graph.
Example 40 includes the method of example 31, wherein the embeddings are represented as a numerical representation of a class of at least one of objects, images, or words.
The following claims are hereby incorporated into this Detailed Description by this reference, with each claim standing on its own as a separate embodiment of the present disclosure.