JP7662267B2


Info

Publication number: JP7662267B2
Application number: JP2023218442A
Authority: JP
Inventors: ヒューレット，ウィリアム，レディントン; デン，スイシャン; ヤン，シェン; ラム，ホ，ユ
Original assignee: Palo Alto Networks Inc
Current assignee: Palo Alto Networks Inc
Priority date: 2019-07-19
Filing date: 2023-12-25
Publication date: 2025-04-15
Anticipated expiration: 2040-07-06
Also published as: KR20220053549A; KR102676386B1; JP7411775B2; JP2022541250A; EP3999985A1; JP2024023875A; EP3999985A4; WO2021015941A1; CN114072798A

Description

Translated from Japanese

Malware is a general term for malicious software (e.g., a variety of hostile, intrusive, and/or otherwise unwanted software). Malware can take the form of code, scripts, active content, and/or other software. Example uses of malware include disrupting computer and/or network operations, stealing proprietary information (e.g., confidential information such as identity, financial, and/or intellectual property related information), and/or gaining access to private/proprietary computer systems and/or computer networks. Unfortunately, as techniques are developed to help detect and mitigate malware, malware authors find ways to circumvent those efforts. Accordingly, there is an ongoing need for improved techniques for identifying and mitigating malware.

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
FIG. 1 illustrates an example of an environment in which malicious applications are detected and prevented from causing harm. FIG. 2A illustrates an embodiment of a data appliance. FIG. 2B is a functional diagram of logical components of an embodiment of a data appliance. FIG. 3 illustrates an example of logical components that can be included in a system for analyzing samples. FIG. 4 illustrates portions of an example embodiment of a threat engine. FIG. 5 illustrates an example of a portion of a tree. FIG. 6 illustrates an example of a process for performing inline malware detection on a data appliance. FIG. 7A illustrates an example hash table for a file. FIG. 7B illustrates an example threat signature for a sample. FIG. 8A illustrates an example of a process for performing feature extraction. FIG. 8B illustrates an example of a process for generating a model.

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer-readable storage medium; and/or a processor, such as a processor configured to execute instructions stored in and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, are referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task. As used herein, the term "processor" refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is well known in the technical fields related to the invention is not described in detail, so that the invention is not unnecessarily obscured.

I. Overview

A firewall generally protects networks from unauthorized access while permitting authorized communications to pass through the firewall. A firewall is typically a device, a set of devices, or software executed on a device that provides a firewall function for network access. For example, a firewall can be integrated into the operating system of a device (e.g., a computer, a smartphone, or another type of network-capable device). A firewall can also be integrated into, or executed as a software application on, various types of devices or security devices, such as computer servers, gateways, network/routing devices (e.g., network routers), or data appliances (e.g., security appliances or other types of special-purpose devices), and in some implementations, certain operations can be implemented in special-purpose hardware, such as an ASIC or FPGA.

Firewalls typically deny or permit network transmissions based on a set of rules. These sets of rules are often referred to as policies (e.g., network policies or network security policies). For example, a firewall can filter inbound traffic by applying a set of rules or policies to prevent unwanted outside traffic from reaching protected devices. A firewall can also filter outbound traffic by applying a set of rules or policies (e.g., allow, block, monitor, notify, or log, and/or other actions that can be specified in firewall rules or firewall policies and that can be triggered based on various criteria, as described herein). A firewall can likewise filter local network (e.g., intranet) traffic by applying a set of rules or policies.

Security devices (e.g., security appliances, security gateways, security services, and/or other security devices) can perform various security operations (e.g., firewall, anti-malware, intrusion prevention/detection, proxy, and/or other security functions), networking functions (e.g., routing, Quality of Service (QoS), workload balancing of network-related resources, and/or other networking functions), and/or other security and/or networking-related functions. For example, routing can be performed based on source information (e.g., IP address and port), destination information (e.g., IP address and port), and protocol information.

A basic packet filtering firewall filters network communication traffic by inspecting individual packets transmitted over a network (e.g., packet filtering firewalls or first-generation firewalls, which are stateless packet filtering firewalls). Stateless packet filtering firewalls typically inspect the individual packets themselves and apply rules based on the inspected packets (e.g., using a combination of a packet's source and destination address information, protocol information, and port number).
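The stateless filtering described above can be sketched in a few lines of Python. This is only an illustration, not the implementation disclosed in the patent: the `Packet` and `Rule` types, the first-match-wins ordering, the default-deny fallback, and the example addresses are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical view of the header fields a stateless filter inspects."""
    src_ip: str
    dst_ip: str
    protocol: str   # e.g. "tcp" or "udp"
    dst_port: int


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule; a field left as None/any-network matches everything."""
    action: str                      # "allow" or "deny"
    src_net: str = "0.0.0.0/0"       # CIDR block; default matches any IPv4 source
    dst_net: str = "0.0.0.0/0"
    protocol: Optional[str] = None
    dst_port: Optional[int] = None

    def matches(self, pkt: Packet) -> bool:
        return (
            ip_address(pkt.src_ip) in ip_network(self.src_net)
            and ip_address(pkt.dst_ip) in ip_network(self.dst_net)
            and self.protocol in (None, pkt.protocol)
            and self.dst_port in (None, pkt.dst_port)
        )


def filter_packet(rules, pkt, default="deny"):
    """Return the action of the first matching rule (assumed first-match-wins)."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return default


# Example policy (illustrative addresses from documentation ranges).
RULES = [
    Rule("deny", src_net="203.0.113.0/24"),
    Rule("allow", dst_net="10.0.0.0/8", protocol="tcp", dst_port=443),
]
```

Each packet is judged purely on its own header fields; nothing is remembered between packets, which is what distinguishes this approach from the stateful inspection discussed later.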

Application firewalls can also perform application layer filtering (e.g., using application layer filtering firewalls or second-generation firewalls, which work on the application level of the TCP/IP stack). Application layer filtering firewalls or application firewalls can generally identify certain applications and protocols (e.g., web browsing using the HyperText Transfer Protocol (HTTP), Domain Name System (DNS) requests, file transfers using the File Transfer Protocol (FTP), and various other types of applications and protocols, such as Telnet, DHCP, TCP, UDP, and TFTP (GSS)). For example, application firewalls can block unauthorized protocols that attempt to communicate over a standard port (e.g., an unauthorized/out-of-policy protocol attempting to sneak through by using a non-standard port for that protocol can generally be identified using application firewalls).

Stateful firewalls can also perform stateful packet inspection, in which each packet is examined within the context of the series of packets associated with that network transmission's packet flow. This firewall technique is generally referred to as stateful packet inspection because it maintains records of all connections passing through the firewall and can determine whether a packet is the start of a new connection, a part of an existing connection, or an invalid packet. For example, the state of a connection can itself be one of the criteria that triggers a rule within a policy.
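A minimal Python sketch of the three-way decision described above (new connection, existing connection, or invalid packet) follows. It is deliberately simplified and is not the patent's implementation: the `ConnectionTable` class, the direction-independent key, and the rule that only a bare TCP SYN may open a connection are assumptions for illustration, and real connection tracking also handles timeouts, sequence numbers, and the full TCP state machine.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical TCP packet summary."""
    src: str
    sport: int
    dst: str
    dport: int
    flags: frozenset   # TCP flags, e.g. frozenset({"SYN"})


class ConnectionTable:
    """Keeps a record of connections seen so far (simplified)."""

    def __init__(self):
        self._connections = set()

    @staticmethod
    def _key(pkt):
        # Direction-independent key so replies map to the same connection.
        a = (pkt.src, pkt.sport)
        b = (pkt.dst, pkt.dport)
        return (min(a, b), max(a, b))

    def classify(self, pkt):
        key = self._key(pkt)
        if key in self._connections:
            if pkt.flags & {"FIN", "RST"}:
                # Simplified teardown: forget the connection immediately.
                self._connections.discard(key)
            return "existing"
        if pkt.flags == {"SYN"}:
            # Assumption: only a bare SYN may start a new connection.
            self._connections.add(key)
            return "new"
        return "invalid"
```

The returned classification is the kind of connection state that, as noted above, could itself serve as a criterion in a policy rule.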

Advanced or next-generation firewalls can perform stateless and stateful packet filtering and application layer filtering, as discussed above. Next-generation firewalls can also perform additional firewall techniques. For example, certain newer firewalls, sometimes referred to as advanced or next-generation firewalls, can also identify users and content. In particular, certain next-generation firewalls are expanding the list of applications that they can automatically identify to thousands of applications. Examples of such next-generation firewalls are commercially available from Palo Alto Networks, Inc. (e.g., Palo Alto Networks' PA Series firewalls). For example, Palo Alto Networks' next-generation firewalls enable enterprises and service providers to identify and control applications, users, and content, not just ports, IP addresses, and packets, using various identification technologies: App-ID for accurate application identification, User-ID for user identification, and Content-ID for real-time content scanning (e.g., to control web surfing and limit data and file transfers). These identification technologies allow enterprises to securely enable application usage using business-relevant concepts, instead of following the traditional approach offered by traditional port-blocking firewalls. Also, special-purpose hardware for next-generation firewalls (implemented, for example, as dedicated appliances) generally provides higher performance levels for application inspection than software executed on general-purpose hardware (e.g., security appliances provided by Palo Alto Networks, Inc., which use dedicated, function-specific processing that is tightly integrated with a single-pass software engine to maximize network throughput while minimizing latency for Palo Alto Networks' PA Series next-generation firewalls).

Advanced or next-generation firewalls can also be implemented using virtualized firewalls. Examples of such next-generation firewalls are commercially available from Palo Alto Networks, Inc. (Palo Alto Networks' firewalls support various commercial virtualized environments, including VMware® ESXi™ and NSX™, Citrix® NetScaler SDX™, KVM/OpenStack (CentOS/RHEL, Ubuntu®), and Amazon Web Services (AWS)). For example, virtualized firewalls can support similar or identical next-generation firewall and advanced threat prevention features available in physical form factor appliances, allowing enterprises to safely enable applications flowing into and across their private, public, and hybrid cloud computing environments. Automation features such as VM monitoring, dynamic address groups, and a REST-based API allow enterprises to proactively monitor VM changes and dynamically feed that context into security policies, thereby eliminating the policy lag that may occur when VMs change.

II. Example Environment

FIG. 1 illustrates an example of an environment in which malicious applications ("malware") are detected and prevented from causing harm. As will be described in more detail below, malware classifications (e.g., as made by security platform 122) can be variously shared and/or refined among the various entities included in the environment shown in FIG. 1, and, using the techniques described herein, devices such as endpoint client devices 104-110 can be protected from such malware.

The term "application" is used throughout this specification to collectively refer to programs, bundles of programs, manifests, packages, etc., irrespective of form/platform. An "application" (also referred to herein as a "sample") can be a standalone file (e.g., a calculator application having the filename "calculator.apk" or "calculator.exe") and can also be an independent component of another application (e.g., a mobile advertisement SDK or a library embedded within the calculator application).

"Malware" as used herein refers to an application that engages in behaviors, whether clandestinely or not (and whether illegal or not), of which a user does not approve or would not approve if fully informed. Examples of malware include Trojans, viruses, rootkits, spyware, hacking tools, keyloggers, etc. One example of malware is a desktop application that collects and reports the end user's location to a remote server (but does not provide the user with location-based services, such as a mapping service). Another example of malware is a malicious Android application package (.apk, or APK) file that appears to the end user to be a free game, but stealthily sends SMS premium messages (e.g., costing $10 each), running up the end user's phone bill. Another example of malware is an Apple iOS flashlight application that stealthily collects the user's contacts and sends those contacts to a spammer. Other forms of malware (e.g., ransomware) can also be detected/thwarted using the techniques described herein. Further, while n-grams/feature vectors/output accumulation variables are described herein as being generated for malicious applications, the techniques described herein can also be used in various embodiments to generate profiles for other kinds of applications (e.g., adware profiles, goodware profiles, etc.).

The techniques described herein can be used in conjunction with a variety of platforms (e.g., desktops, mobile devices, gaming platforms, embedded systems, etc.) and/or a variety of types of applications (e.g., Android .apk files, iOS applications, Windows PE files, Adobe Acrobat PDF files, etc.). In the example environment shown in FIG. 1, client devices 104-108 are a laptop computer, a desktop computer, and a tablet (respectively) present in enterprise network 140. Client device 110 is a laptop computer present outside of enterprise network 140.

Data appliance 102 is configured to enforce policies regarding communications between client devices, such as client devices 104 and 106, and nodes outside of enterprise network 140 (e.g., reachable via external network 118). Examples of such policies include ones governing traffic shaping, quality of service, and routing of traffic. Other examples of policies include security policies, such as ones requiring scanning for threats in incoming (and/or outgoing) email attachments, website content, files exchanged through instant messaging programs, and/or other file transfers. In some embodiments, data appliance 102 is also configured to enforce policies with respect to traffic that stays within enterprise network 140.

An embodiment of a data appliance is shown in FIG. 2A. The example shown is a representation of physical components that are included in data appliance 102 in various embodiments. Specifically, data appliance 102 includes a high-performance multi-core central processing unit (CPU) 202 and random access memory (RAM) 204. Data appliance 102 also includes storage 210 (such as one or more hard disks or solid-state storage units). In various embodiments, data appliance 102 stores (whether in RAM 204, storage 210, and/or other appropriate locations) information used in monitoring enterprise network 140 and implementing the disclosed techniques. Examples of such information include application identifiers, content identifiers, user identifiers, requested URLs, IP address mappings, policy and other configuration information, signatures, hostname/URL categorization information, malware profiles, and machine learning models. Data appliance 102 can also include one or more optional hardware accelerators. For example, data appliance 102 can include a cryptographic engine 206 configured to perform encryption and decryption operations, and one or more field-programmable gate arrays (FPGAs) 208 configured to perform matching, act as network processors, and/or perform other tasks.

Functionality described herein as being performed by data appliance 102 can be provided/implemented in a variety of ways. For example, data appliance 102 can be a dedicated device or set of devices. The functionality provided by data appliance 102 can also be integrated into or executed as software on a general-purpose computer, a computer server, a gateway, and/or a network/routing device. In some embodiments, at least some services described as being provided by data appliance 102 are instead (or in addition) provided to a client device (e.g., client device 104 or client device 110) by software executing on the client device.

Whenever data appliance 102 is described as performing a task, a single component, a subset of components, or all components of data appliance 102 may cooperate to perform the task. Similarly, whenever a component of data appliance 102 is described as performing a task, a subcomponent may perform the task and/or the component may perform the task in conjunction with other components. In various embodiments, portions of data appliance 102 are provided by one or more third parties. Depending on factors such as the amount of computing resources available to data appliance 102, various logical components and/or features of data appliance 102 may be omitted, and the techniques described herein adapted accordingly. Similarly, additional logical components/features can be included in embodiments of data appliance 102 as applicable. One example of a component included in data appliance 102 in various embodiments is an application identification engine, which is configured to identify an application (e.g., using various application signatures to identify applications based on packet flow analysis). For example, the application identification engine can determine what type of traffic a session involves, such as Web Browsing-Social Networking, Web Browsing-News, SSH, and so on.

FIG. 2B is a functional diagram of logical components of an embodiment of a data appliance. The example shown is a representation of logical components that can be included in data appliance 102 in various embodiments. Unless otherwise specified, the various logical components of data appliance 102 are generally implementable in a variety of ways, including as a set of one or more scripts (e.g., written in Java, Python, etc., as applicable).

As shown, data appliance 102 comprises a firewall and includes a management plane 232 and a data plane 234. The management plane is responsible for managing user interactions, such as by providing a user interface for configuring policies and viewing log data. The data plane is responsible for managing data, such as by performing packet processing and session handling.

Network processor 236 is configured to receive packets from client devices, such as client device 108, and provide them to data plane 234 for processing. Whenever flow module 238 identifies a packet as being part of a new session, it creates a new session flow. Subsequent packets are identified as belonging to the session based on a flow lookup. If applicable, SSL decryption is applied by SSL decryption engine 240. Otherwise, processing by SSL decryption engine 240 is omitted. Decryption engine 240 can help data appliance 102 inspect and control SSL/TLS and SSH encrypted traffic, and thus help to stop threats that might otherwise remain hidden in encrypted traffic. Decryption engine 240 can also help prevent sensitive content from leaving enterprise network 140. Decryption can be controlled (e.g., enabled or disabled) selectively based on parameters such as URL category, traffic source, traffic destination, user, user group, and port. In addition to decryption policies (e.g., that specify which sessions to decrypt), decryption profiles can be assigned to control various options for sessions controlled by the policy. For example, the use of specific cipher suites and encryption protocol versions can be required.
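The selective decryption control described above can be illustrated with a small policy lookup. The sketch below is an assumption-laden illustration, not the appliance's actual policy engine: the `Session` and `DecryptionRule` types, the subset of parameters shown (URL category, user group, destination port), the first-match-wins ordering, and the example category names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    """Hypothetical session attributes consulted by the decryption policy."""
    url_category: str
    user_group: str
    dst_port: int


@dataclass(frozen=True)
class DecryptionRule:
    """Hypothetical rule; a field left as None matches any value."""
    decrypt: bool
    url_category: Optional[str] = None
    user_group: Optional[str] = None
    dst_port: Optional[int] = None

    def matches(self, s: Session) -> bool:
        return (
            self.url_category in (None, s.url_category)
            and self.user_group in (None, s.user_group)
            and self.dst_port in (None, s.dst_port)
        )


def should_decrypt(rules, session, default=False):
    """Return the decision of the first matching rule (assumed ordering)."""
    for rule in rules:
        if rule.matches(session):
            return rule.decrypt
    return default


# Illustrative policy: leave one category encrypted, decrypt other port-443 traffic.
EXAMPLE_RULES = [
    DecryptionRule(decrypt=False, url_category="financial-services"),
    DecryptionRule(decrypt=True, dst_port=443),
]
```

A decryption profile, as mentioned above, would then attach further options (for example, required cipher suites or protocol versions) to the sessions selected this way; that part is not modeled here.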

Application identification (APP-ID) engine 242 is configured to determine what type of traffic a session involves. As one example, application identification engine 242 can recognize a GET request in received data and conclude that the session requires an HTTP decoder. In some cases, such as a web browsing session, the identified application can change, and such changes are noted by data appliance 102. For example, a user may initially browse to a corporate wiki (classified, based on the URL visited, as "Web Browsing-Productivity") and then subsequently browse to a social networking site (classified, based on the URL visited, as "Web Browsing-Social Networking"). Different types of protocols have corresponding decoders.
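As a rough illustration of recognizing a GET request and concluding that an HTTP decoder is needed, the following Python sketch inspects the first bytes of a session's payload. It is a toy heuristic and not how APP-ID engine 242 is implemented: the function name `select_decoder`, the method list, and the decoder labels are assumptions; the SSH check relies on the fact that SSH peers begin with an identification string starting with `SSH-`.

```python
# Request-line prefixes for common HTTP methods (illustrative subset).
HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ", b"OPTIONS ")


def select_decoder(payload: bytes) -> str:
    """Pick a decoder name from the start of the payload (hypothetical labels)."""
    if payload.startswith(HTTP_METHODS):
        return "http"
    if payload.startswith(b"SSH-"):
        return "ssh"
    return "unknown"
```

For example, `select_decoder(b"GET /index.html HTTP/1.1\r\n")` selects the `"http"` decoder. A production engine would rely on application signatures and packet flow analysis rather than a prefix check alone.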

Based on the determination made by the application identification engine 242, the threat engine 244 sends the packets to an appropriate decoder, which is configured to assemble the packets into the correct order, perform tokenization, and extract information. The threat engine 244 also performs signature matching to determine what should happen to the packets. As needed, the SSL encryption engine 246 can re-encrypt decrypted data. Packets are then forwarded (e.g., to a destination) using the forwarding module 248.

As also shown in FIG. 2B, policies 252 are received and stored in the management plane 232. Policies can include one or more rules, which can be specified using domain names and/or host/server names, and the rules can apply one or more signatures or other matching criteria or heuristics, such as for security policy enforcement on subscriber/IP flows, based on various parameters/information extracted from monitored session traffic flows. An interface (I/F) communicator 250 is provided for management communications (e.g., via (REST) APIs, messages, network protocol communications, or other communication mechanisms).

III. Security Platform

Returning to FIG. 1, suppose a malicious individual (using system 120) has created malware 130. The malicious individual hopes that a client device, such as client device 104, will execute a copy of malware 130, compromising the client device and, for example, causing it to become a bot in a botnet. The compromised client device can then be instructed to perform tasks (e.g., mining cryptocurrency or participating in denial-of-service attacks), to report information to an external entity such as a command and control (C&C) server 150, and to receive instructions from the C&C server 150, as applicable.

Suppose the data appliance 102 has intercepted an email sent to a user, "Alice," who operates client device 104. A copy of malware 130 has been attached to the message by system 120. As an alternative but similar scenario, the data appliance 102 could intercept an attempted download of malware 130 by client device 104 (e.g., from a website). In either scenario, the data appliance 102 determines whether a signature for the file (e.g., the email attachment or the website download of malware 130) is present on the data appliance 102. A signature, if present, can indicate that the file is known to be safe (e.g., is whitelisted), or that the file is known to be malicious (e.g., is blacklisted).

In various embodiments, the data appliance 102 is configured to work in cooperation with the security platform 122. As one example, the security platform 122 can provide the data appliance 102 with a set of signatures of known-malicious files (e.g., as part of a subscription). If a signature for malware 130 (e.g., an MD5 hash of malware 130) is included in the set, the data appliance 102 can prevent the transmission of malware 130 to client device 104 accordingly (e.g., by detecting that the MD5 hash of the email attachment sent to client device 104 matches the MD5 hash of malware 130). The security platform 122 can also provide the data appliance 102 with a list of known malicious domains and/or IP addresses, allowing the data appliance 102 to block traffic between the enterprise network 140 and the C&C server 150 (e.g., where the C&C server 150 is known to be malicious). The list of malicious domains (and/or IP addresses) can also help the data appliance 102 determine when one of its nodes has been compromised. For example, if client device 104 attempts to contact the C&C server 150, the attempt is a strong indicator that client device 104 has been compromised by malware (and that remedial action should be taken accordingly, such as quarantining client device 104 from communicating with other nodes within the enterprise network 140). As described in more detail below, the security platform 122 can also provide other types of information to the data appliance 102 (e.g., as part of a subscription), such as a set of machine learning models usable by the data appliance 102 to perform inline analysis of files.
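
The MD5-based signature check mentioned above can be sketched with Python's standard `hashlib` module. The helper names (`md5_hex`, `is_known_malicious`) and the in-memory set are assumptions made for illustration; the source does not describe how the signature set is stored or queried on the appliance.

```python
import hashlib
from typing import Set


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def is_known_malicious(attachment: bytes, malicious_hashes: Set[str]) -> bool:
    """True if the attachment's MD5 hash is in the known-malicious set."""
    return md5_hex(attachment) in malicious_hashes


# Stand-in bytes; a real signature set would come from the subscription feed.
sample = b"stand-in bytes for a malicious attachment"
signature_set = {md5_hex(sample)}
blocked = is_known_malicious(sample, signature_set)
```

An exact-hash match only catches byte-identical copies, which is one reason the text goes on to discuss what happens when no signature is found.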

In various embodiments, if no signature for an attachment is found, the data appliance 102 can take a variety of actions. As a first example, the data appliance 102 can fail safe by blocking transmission of any attachment not whitelisted as benign (e.g., not matching a signature of a known-good file). A drawback of this approach is that many legitimate attachments may be unnecessarily blocked as potential malware when they are in fact benign. As a second example, the data appliance 102 can fail danger by allowing transmission of any attachment not blacklisted as malicious (e.g., not matching a signature of a known-malicious file). A drawback of this approach is that newly created malware (previously unseen by platform 122) will not be prevented from causing harm.
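
The two fallback behaviors can be contrasted in a few lines. This is a sketch under assumptions: the `Mode` enum, the `decide` function, and the "allow"/"block" strings are hypothetical, and checking the blacklist before the whitelist is a choice made here rather than something the text specifies.

```python
from enum import Enum
from typing import Set


class Mode(Enum):
    FAIL_SAFE = "fail-safe"      # block anything not whitelisted
    FAIL_DANGER = "fail-danger"  # allow anything not blacklisted


def decide(file_hash: str, whitelist: Set[str], blacklist: Set[str], mode: Mode) -> str:
    """Return "allow" or "block" for a file identified by its hash."""
    if file_hash in blacklist:
        return "block"
    if file_hash in whitelist:
        return "allow"
    # No signature found: the configured mode decides.
    return "block" if mode is Mode.FAIL_SAFE else "allow"


whitelist = {"good-hash"}
blacklist = {"bad-hash"}
unknown_under_fail_safe = decide("new-hash", whitelist, blacklist, Mode.FAIL_SAFE)
unknown_under_fail_danger = decide("new-hash", whitelist, blacklist, Mode.FAIL_DANGER)
```

The two modes differ only in the last line of `decide`, which is exactly where each approach's drawback comes from: unknown benign files are blocked in one case, and unknown malware is allowed in the other.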

As a third example, the data appliance 102 can be configured to provide the file (e.g., malware 130) to the security platform 122 for static/dynamic analysis, to determine whether it is malicious and/or to otherwise classify it. While the security platform 122 analyzes the attachment (for which a signature does not yet exist), the data appliance 102 can take a variety of actions. As a first example, the data appliance 102 can prevent the email (and attachment) from being delivered to Alice until a response is received from the security platform 122. Assuming platform 122 takes approximately 15 minutes to thoroughly analyze a sample, the incoming message to Alice will be delayed by 15 minutes. Since, in this example, the attachment is malicious, such a delay will not impact Alice negatively. In an alternate example, suppose someone has sent Alice a time-sensitive message with a benign attachment for which a signature also does not exist. Delaying delivery of the message to Alice by 15 minutes will likely be viewed (e.g., by Alice) as unacceptable. As described in more detail below, an alternative approach is to perform at least some real-time analysis of the attachment on the data appliance 102 (e.g., while awaiting a verdict from platform 122). If the data appliance 102 can independently determine whether the attachment is malicious or benign, it can take an initial action (e.g., block or allow delivery to Alice) and can adjust or take additional actions, as applicable, once a verdict is received from the security platform 122.
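
The idea of acting on a local determination first and adjusting once the platform's verdict arrives can be sketched as a small reconciliation function. The action names ("allow", "block", "quarantine", "release") and the function `reconcile` are hypothetical; the text says only that additional actions can be taken "as applicable."

```python
from typing import Optional


def reconcile(initial_action: str, platform_verdict: str) -> Optional[str]:
    """Return a follow-up action, or None if the initial action stands.

    initial_action is what the appliance did on its own ("allow" or "block");
    platform_verdict is what the security platform later reported.
    """
    if platform_verdict == "malicious" and initial_action == "allow":
        # The file got through: take remedial action after the fact.
        return "quarantine"
    if platform_verdict == "benign" and initial_action == "block":
        # The file was held back unnecessarily: deliver it now.
        return "release"
    return None


follow_up = reconcile("allow", "malicious")
```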

The security platform 122 stores copies of received samples in storage 142, and analysis is commenced (or scheduled, as applicable). One example of storage 142 is an Apache Hadoop cluster. The results of analysis (and additional information pertaining to the application) are stored in database 146. If an application is determined to be malicious, data appliances can be configured to automatically block the file download based on the analysis result. Further, a signature can be generated for the malware and distributed (e.g., to data appliances such as data appliances 102, 136, and 148) to automatically block future file transfer requests to download the file determined to be malicious.

In various embodiments, the security platform 122 comprises one or more dedicated, commercially available hardware servers (e.g., having multi-core processors, 32G+ of RAM, gigabit network interface adapters, and hard drives) running a typical server-class operating system (e.g., Linux). The security platform 122 can be implemented across a scalable infrastructure comprising multiple such servers, solid-state drives, and/or other applicable high-performance hardware. The security platform 122 can comprise several distributed components, including components provided by one or more third parties. For example, portions or all of the security platform 122 can be implemented using Amazon Elastic Compute Cloud (EC2) and/or Amazon Simple Storage Service (S3). Further, as with the data appliance 102, whenever the security platform 122 is referred to as performing a task, such as storing data or processing data, it is to be understood that a subcomponent or multiple subcomponents of the security platform 122 (whether individually or in cooperation with third-party components) may cooperate to perform that task. As one example, the security platform 122 can optionally perform static/dynamic analysis in cooperation with one or more virtual machine (VM) servers, such as VM server 124.

An example of a virtual machine server is a physical machine comprising commercially available server-class hardware (e.g., a multi-core processor, 32G+ of RAM, and one or more gigabit network interface adapters) that runs commercially available virtualization software, such as VMware ESXi, Citrix XenServer, or Microsoft Hyper-V. In some embodiments, the virtual machine server is omitted. Further, a virtual machine server may be under the control of the same entity that administers the security platform 122, but may also be provided by a third party. As one example, the virtual machine server can rely on EC2, with the remainder of the security platform 122 provided by dedicated hardware owned by and under the control of the operator of the security platform 122. The VM server 124 is configured to provide one or more virtual machines 126-128 for emulating client devices. The virtual machines can execute a variety of operating systems and/or versions thereof. Observed behaviors resulting from executing an application in a virtual machine are logged and analyzed (e.g., for indications that the application is malicious). In some embodiments, log analysis is performed by the VM server (e.g., VM server 124). In other embodiments, analysis is performed at least in part by other components of the security platform 122, such as a coordinator 144.

In various embodiments, the security platform 122 makes the results of its analysis of samples available to the data appliance 102 via a list of signatures (and/or other identifiers) as part of a subscription. For example, the security platform 122 can periodically send a content package that identifies malware applications (e.g., daily, hourly, or at some other interval, and/or based on an event configured by one or more policies). An example content package includes a listing of identified malware applications, with information such as a package name, a hash value for uniquely identifying the application, and a malware name (and/or malware family name) for each identified malware application. The subscription can cover the analysis of only those files intercepted by the data appliance 102 and sent to the security platform 122 by the data appliance 102, or it can cover signatures of all malware known to the security platform 122 (or subsets thereof, such as mobile malware only and not other forms of malware (e.g., PDF malware)). As described in more detail below, platform 122 can also make other types of information available, such as machine learning models that can help the data appliance 102 detect malware.

In various embodiments, the security platform 122 is configured to provide security services to a variety of entities in addition to (or, as applicable, instead of) an operator of the data appliance 102. For example, other enterprises, having their own respective enterprise networks 114 and 116 and their own respective data appliances 136 and 148, can contract with the operator of the security platform 122. Other types of entities can also make use of the services of the security platform 122. For example, an Internet service provider providing Internet service to client device 110 can contract with the security platform 122 to analyze applications that client device 110 attempts to download. As another example, the owner of client device 110 can install software on client device 110 that communicates with the security platform 122 (e.g., to receive content packages from the security platform 122, use the received content packages to check attachments in accordance with techniques described herein, and transmit applications to the security platform 122 for analysis).

IV. Sample Analysis Using Static/Dynamic Analysis

FIG. 3 illustrates an example of logical components that can be included in a system for analyzing samples. Analysis system 300 can be implemented using a single device. For example, the functionality of analysis system 300 can be implemented in a malware analysis module 112 incorporated into the data appliance 102. Analysis system 300 can also be implemented, collectively, across multiple distinct devices. For example, the functionality of analysis system 300 can be provided by the security platform 122.

In various embodiments, analysis system 300 makes use of lists, databases, or other collections of known safe content and/or known bad content (collectively shown in FIG. 3 as collection 314). Collection 314 can be obtained in a variety of ways, including via a subscription service (e.g., provided by a third party) and/or as a result of other processing (e.g., performed by the data appliance 102 and/or the security platform 122). Examples of information included in collection 314 are: URLs, domain names, and/or IP addresses of known malicious servers; URLs, domain names, and/or IP addresses of known safe servers; URLs, domain names, and/or IP addresses of known command and control (C&C) domains; signatures, hashes, and/or other identifiers of known malicious applications; signatures, hashes, and/or other identifiers of known safe applications; signatures, hashes, and/or other identifiers of known malicious files (e.g., Android exploit files); signatures, hashes, and/or other identifiers of known safe libraries; and signatures, hashes, and/or other identifiers of known malicious libraries.

A. Ingestion

In various embodiments, when a new sample is received for analysis (e.g., no existing features associated with the sample are present in analysis system 300), it is added to queue 302. As shown in FIG. 3, application 130 is received by system 300 and added to queue 302.

B. Static Analysis

Coordinator 304 monitors queue 302, and as resources (e.g., a static analysis worker) become available, coordinator 304 fetches a sample from queue 302 for processing (e.g., fetches a copy of malware 130). In particular, coordinator 304 first provides the sample to static analysis engine 306 for static analysis. In some embodiments, one or more static analysis engines are included within analysis system 300, where analysis system 300 is a single device. In other embodiments, static analysis is performed by a separate static analysis server that includes a plurality of workers (i.e., a plurality of instances of static analysis engine 306).

The static analysis engine obtains general information about the sample and includes it (along with heuristic and other information, as applicable) in a static analysis report 308. The report can be created by the static analysis engine, or by coordinator 304 (or by another appropriate component), which can be configured to receive the information from static analysis engine 306. In some embodiments, the collected information is stored in a database record for the sample (e.g., in database 316), instead of or in addition to a separate static analysis report 308 being created (i.e., portions of the database record form the report 308). In some embodiments, the static analysis engine also forms a verdict with respect to the application (e.g., "safe," "suspicious," or "malicious"). As one example, the verdict can be "malicious" if even one "malicious" static feature is present in the application (e.g., the application includes a hard link to a known malicious domain). As another example, points can be assigned to each of the features (e.g., based on severity if found, based on how reliable the feature is for predicting malice, etc.), and a verdict can be assigned by static analysis engine 306 (or coordinator 304, if applicable) based on the number of points associated with the static analysis results.
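
The point-based approach can be illustrated with a small scoring function. Everything numeric here is an assumption: the source gives no feature names, weights, or thresholds, so `FEATURE_POINTS`, `SUSPICIOUS_THRESHOLD`, and `MALICIOUS_THRESHOLD` are placeholders chosen only to make the example concrete.

```python
from typing import Dict, Iterable

# Hypothetical feature weights; a higher value means more severe or more reliable.
FEATURE_POINTS: Dict[str, int] = {
    "hard_link_to_known_malicious_domain": 100,
    "requests_unusual_permissions": 30,
    "contains_obfuscated_strings": 20,
}
SUSPICIOUS_THRESHOLD = 30   # assumed cut-off
MALICIOUS_THRESHOLD = 100   # assumed cut-off


def static_verdict(features: Iterable[str]) -> str:
    """Sum the points of the observed features and map the total to a verdict."""
    score = sum(FEATURE_POINTS.get(name, 0) for name in features)
    if score >= MALICIOUS_THRESHOLD:
        return "malicious"
    if score >= SUSPICIOUS_THRESHOLD:
        return "suspicious"
    return "safe"


verdict = static_verdict(["requests_unusual_permissions", "contains_obfuscated_strings"])
```

Giving a single feature a weight at or above the malicious threshold reproduces the first approach in the text, where one "malicious" feature is enough on its own.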

C. Dynamic Analysis

Once static analysis is completed, coordinator 304 locates an available dynamic analysis engine 310 to perform dynamic analysis on the application. As with static analysis engine 306, analysis system 300 can include one or more dynamic analysis engines directly. In other embodiments, dynamic analysis is performed by a separate dynamic analysis server that includes a plurality of workers (i.e., a plurality of instances of dynamic analysis engine 310).

Each dynamic analysis worker manages a virtual machine instance. In some embodiments, results of static analysis (e.g., performed by static analysis engine 306), whether in report form (308) and/or as stored in database 316 or otherwise stored, are provided as input to dynamic analysis engine 310. For example, the static report information can be used to help select/customize the virtual machine instance used by dynamic analysis engine 310 (e.g., Microsoft Windows 7 SP2 vs. Microsoft Windows 10 Enterprise, or iOS 11.0 vs. iOS 12.0). Where multiple virtual machine instances are executed at the same time, a single dynamic analysis engine can manage all of the instances, or multiple dynamic analysis engines can be used (e.g., with each managing its own virtual machine instance), as applicable. As explained in more detail below, during the dynamic portion of the analysis, actions taken by the application (including network activity) are analyzed.

In various embodiments, static analysis of a sample is omitted or is performed by a separate entity, as applicable. As one example, traditional static and/or dynamic analysis may be performed on files by a first entity. Once it is determined (e.g., by the first entity) that a given file is malicious, the file can be provided to a second entity (e.g., the operator of the security platform 122) specifically for additional analysis with respect to the malware's use of network activity (e.g., by dynamic analysis engine 310).

The environment used by analysis system 300 is instrumented/hooked such that behaviors observed while the application is executing are logged as they occur (e.g., using a customized kernel that supports hooking and logcat). Network traffic associated with the emulator is also captured (e.g., using pcap). The log/network data can be stored as a temporary file on analysis system 300, and can also be stored more permanently (e.g., using HDFS or another appropriate storage technology or combination of technologies, such as MongoDB). The dynamic analysis engine (or another appropriate component) can compare the connections made by the sample to lists of domains, IP addresses, etc. (314) and determine whether the sample has communicated (or attempted to communicate) with malicious entities.
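
Comparing a sample's observed connections against a list of known-bad destinations amounts to a set intersection. The sketch below assumes the destinations have already been extracted from the captured traffic as plain strings; `normalize` and `malicious_contacts` are hypothetical helpers, and the hostnames and addresses are reserved documentation values, not real indicators.

```python
from typing import Iterable, List


def normalize(host: str) -> str:
    """Lowercase a host and drop surrounding whitespace and a trailing dot."""
    return host.strip().lower().rstrip(".")


def malicious_contacts(connections: Iterable[str], known_bad: Iterable[str]) -> List[str]:
    """Return the observed destinations that appear in the known-bad list."""
    bad = {normalize(h) for h in known_bad}
    seen = {normalize(c) for c in connections}
    return sorted(seen & bad)


observed = ["Updates.Example.com", "c2.bad.example.", "198.51.100.23"]
known_bad = {"c2.bad.example", "198.51.100.23"}
hits = malicious_contacts(observed, known_bad)
```

Normalizing both sides before comparing avoids missing a match because of letter case or a fully qualified trailing dot.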

As with the static analysis engine, the dynamic analysis engine stores the results of its analysis in database 316, in the record associated with the application being tested (and/or includes the results in report 312, as applicable). In some embodiments, the dynamic analysis engine also forms a verdict with respect to the application (e.g., "safe," "suspicious," or "malicious"). As one example, the verdict can be "malicious" if even one "malicious" action is taken by the application (e.g., an attempt to contact a known malicious domain, or an attempt to exfiltrate sensitive information, is observed). As another example, points can be assigned to actions taken (e.g., based on severity if found, based on how reliable the action is for predicting malice, etc.), and a verdict can be assigned by dynamic analysis engine 310 (or coordinator 304, if applicable) based on the number of points associated with the dynamic analysis results. In some embodiments, a final verdict associated with the sample is made (e.g., by coordinator 304) based on a combination of report 308 and report 312.

V. Inline Malware Detection

Returning to the environment of FIG. 1, millions of new malware samples may be generated each month (e.g., by nefarious individuals such as the operator of system 120, whether by making subtle changes to existing malware or by authoring new malware). Accordingly, there will exist many malware samples for which the security platform 122 (at least initially) has no signature. Further, even where the security platform 122 has generated signatures for newly created malware, resource constraints prevent data appliances, such as the data appliance 102, from having/using a list of all known signatures (e.g., as stored on platform 122) at any given time.

Sometimes malware, such as malware 130, will successfully penetrate network 140. One reason for this is where the data appliance 102 operates on a "first-time allow" principle. Suppose that when the data appliance 102 does not have a signature for a sample (e.g., sample 130) and submits it to the security platform 122 for analysis, it takes the security platform 122 approximately five minutes to return a verdict (e.g., "benign," "malicious," "unknown," etc.). Instead of blocking communications between system 120 and client device 104 during that five-minute period, under a first-time allow principle, the communication is allowed. When a verdict is returned (e.g., five minutes later), the data appliance 102 can use the verdict to block subsequent transmissions of malware 130 into network 140, block communications between system 120 and network 140, etc. In various embodiments, if a second copy of sample 130 arrives at the data appliance 102 while the data appliance 102 is awaiting a verdict from the security platform 122, the second copy (and any subsequent copies) of sample 130 is held by system 120 pending a response from the security platform 122.
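
The first-time allow behavior can be modeled as a small state machine keyed by file hash. This is a sketch under assumptions: the class `FirstTimeAllow`, its method names, and the "allow"/"hold"/"block" return values are hypothetical. It models the hold decision only as a return value and does not say which component performs the hold, and treating every non-malicious verdict (including "unknown") as "allow" is a simplification made here.

```python
from typing import Dict, Set


class FirstTimeAllow:
    """Tracks samples awaiting a verdict and decides how to treat each copy."""

    def __init__(self) -> None:
        self.pending: Set[str] = set()      # submitted, verdict not yet returned
        self.verdicts: Dict[str, str] = {}  # file hash -> returned verdict

    def on_sample(self, file_hash: str) -> str:
        if file_hash in self.verdicts:
            return "block" if self.verdicts[file_hash] == "malicious" else "allow"
        if file_hash in self.pending:
            return "hold"   # second and later copies wait for the verdict
        self.pending.add(file_hash)
        return "allow"      # first copy passes while analysis is under way

    def on_verdict(self, file_hash: str, verdict: str) -> None:
        self.pending.discard(file_hash)
        self.verdicts[file_hash] = verdict


policy = FirstTimeAllow()
first_copy = policy.on_sample("sample-130-hash")    # allowed on first sight
second_copy = policy.on_sample("sample-130-hash")   # held while verdict is pending
policy.on_verdict("sample-130-hash", "malicious")
later_copy = policy.on_sample("sample-130-hash")    # blocked after the verdict
```

The sequence shows the exposure the text describes: the first copy is already through by the time the verdict arrives, and only later copies benefit from it.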

Unfortunately, during the five minutes that data appliance 102 awaits a verdict from security platform 122, a user of client device 104 could have executed malware 130, potentially compromising client device 104 or other nodes in network 140. As mentioned above, in various embodiments, data appliance 102 includes a malware analysis module 112. One task that malware analysis module 112 can perform is inline malware detection. In particular, and as described in more detail below, as a file (such as sample 130) passes through data appliance 102, machine learning techniques can be applied to perform efficient analysis of the file on data appliance 102 (e.g., in parallel with other processing performed on the file by data appliance 102), and an initial maliciousness verdict can be determined by data appliance 102 (e.g., while awaiting a verdict from security platform 122).

Various difficulties can arise in implementing such analysis on a resource constrained appliance such as data appliance 102. One critical resource on appliance 102 is session memory. A session is a network transfer of information, including the files that appliance 102 is to analyze in accordance with the techniques described herein. A single appliance might have millions of concurrent sessions, and the memory available to persist during a given session is extremely limited. A first difficulty in performing inline analysis on a data appliance such as data appliance 102 is that, due to such memory constraints, data appliance 102 will typically not be able to process an entire file at once, but instead receives a sequence of packets that it must process packet by packet. A machine learning approach used by data appliance 102 accordingly needs to accommodate packet streams in various embodiments. A second difficulty is that in some cases, data appliance 102 will be unable to determine where the end of a given file being processed occurs (e.g., the end of sample 130 in a stream). A machine learning approach used by data appliance 102 accordingly needs, in various embodiments, to be able to reach a verdict for a given file potentially midstream (e.g., partway through receipt/processing of sample 130, or otherwise before the actual end of the file).

A. Machine learning models

As described in more detail below, in various embodiments, security platform 122 provides a set of machine learning models to data appliance 102 for data appliance 102 to use in conjunction with inline malware detection. The models incorporate features (e.g., n-grams or other features) determined by security platform 122 to correspond to malicious files. Two example types of such models are linear classification models and non-linear classification models. Examples of linear classification models that can be used by data appliance 102 include logistic regression and linear support vector machines. An example of a non-linear classification model that can be used by data appliance 102 is a gradient boosting tree (e.g., eXtreme Gradient Boosting (XGBoost)). The non-linear models are more accurate (and are better able to detect obfuscated/disguised malware), but the linear models use considerably fewer resources on appliance 102 (and are more suitable for efficiently analyzing JavaScript or similar files).

As described in more detail below, which type of classification model is used for a given file being analyzed can be based on the file type associated with that file (which can be determined, for example, from a magic number).

1. Additional details on the threat engine

In various embodiments, data appliance 102 includes a threat engine 244. The threat engine incorporates both protocol decoding and threat signature matching during a respective decoder stage and pattern match stage. Results of the two stages are merged by a detector stage.

When data appliance 102 receives a packet, data appliance 102 performs a session match to determine to which session the packet belongs (allowing data appliance 102 to support concurrent sessions). Each session has a session state which implicates a particular protocol decoder (e.g., a web browsing decoder, an FTP decoder, or an SMTP decoder). When a file is transmitted as part of a session, the applicable protocol decoder can make use of an appropriate file-specific decoder (e.g., a PE file decoder, a JavaScript decoder, or a PDF decoder).

Portions of an example embodiment of threat engine 244 are shown in FIG. 4. For a given session, decoder 402 walks the traffic byte stream, following the corresponding protocol and marking contexts. One example of a context is an end-of-file context (e.g., encountering </script> while processing a JavaScript file). Decoder 402 can mark the end-of-file context in the packet, which can then be used to trigger execution of the appropriate model using the file's observed features. In some cases (e.g., FTP traffic), there may be no explicit protocol-level tag with which decoder 402 can identify/mark a context. As described in more detail below, in various embodiments, decoder 402 can use other information (e.g., the file size as reported in a header) to determine when feature extraction of a file should end (e.g., where the overlay section begins) and when execution using the appropriate model should begin.

Decoder 402 comprises two parts. The first part of decoder 402 is a virtual machine portion (404), which can be implemented as a state machine using a state machine language. The second part of decoder 402 is a set of tokens 406 for triggering state machine transitions and actions when matched in traffic. Threat engine 244 also includes a threat pattern matcher 408 (e.g., using regular expressions) that performs pattern matching (e.g., against threat patterns). As one example, threat pattern matcher 408 can be provided (e.g., by security platform 122) with a table of strings to match against (either exact strings or wildcard strings), along with corresponding actions to take if a string match is found. Detector 410 processes the outputs provided by decoder 402 and threat pattern matcher 408 to take various actions.

2. N-grams

The data in a session can be broken into a sequence of n-grams, that is, a series of byte strings. As an example, suppose a portion of the hexadecimal data in a session is "1023ae42f6f28762aab". The 2-grams in the sequence are all of the pairs of adjacent bytes (each byte written as two hexadecimal digits), such as "1023", "23ae", "ae42", "42f6", etc. In various embodiments, threat engine 244 is configured to analyze files using 8-grams. Other n-grams can also be used, such as 7-grams or 4-grams. In the example string above, "1023ae42f6f28762" is an 8-gram, "23ae42f6f28762aa" is an 8-gram, etc. The total number of different 8-grams possible in a byte sequence is 2^64 (18,446,744,073,709,551,616). Searching a byte sequence for all possible 8-grams would easily exceed the resources of data appliance 102. Instead, and as described in more detail below, a significantly reduced set of 8-grams is provided by security platform 122 to data appliance 102 for use by threat engine 244.
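The sliding-window decomposition described above can be sketched as follows. This is a minimal illustration, not the appliance's implementation; the function name `ngrams` and the variable names are assumptions, and the example uses only the first nine whole bytes of the example string (its odd trailing hexadecimal digit is dropped).

```python
from typing import Iterator


def ngrams(data: bytes, n: int) -> Iterator[bytes]:
    """Yield every run of n adjacent bytes in data, in order of offset."""
    for offset in range(len(data) - n + 1):
        yield data[offset:offset + n]


# First nine whole bytes of the example string from the text.
session_bytes = bytes.fromhex("1023ae42f6f28762aa")

two_grams = [g.hex() for g in ngrams(session_bytes, 2)]
eight_grams = [g.hex() for g in ngrams(session_bytes, 8)]

# Each byte takes one of 256 values, so there are 256**8 == 2**64 possible 8-grams.
possible_eight_grams = 256 ** 8
```

Here `two_grams` begins with "1023", "23ae", "ae42", "42f6", and `eight_grams` contains the two 8-grams named in the text.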

When session packets corresponding to a file are received by threat engine 244, threat pattern matcher 408 parses the packets for matches against the strings in the table (e.g., by performing regular expression and/or exact string matches). A list of matches (e.g., with each instance of a match identified by a corresponding pattern ID), along with the offset at which each match occurred, is generated. Actions on those matches are taken in offset order (e.g., from lowest offset to highest). For a given match (i.e., corresponding to a particular pattern ID), a set of one or more actions to take is specified (e.g., via an action table that maps actions to pattern IDs).
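As a rough sketch of the match list and action table just described: the pattern IDs, patterns, and action names below are hypothetical, and the naive per-pattern scan merely stands in for whatever single-pass matcher the engine actually uses.

```python
from typing import Dict, List, Tuple

# Hypothetical table of exact byte strings to match, keyed by pattern ID.
PATTERN_TABLE: Dict[int, bytes] = {
    101: bytes.fromhex("1023ae42f6f28762"),
    102: bytes.fromhex("f6f2"),
}

# Hypothetical action table mapping each pattern ID to one or more actions.
ACTION_TABLE: Dict[int, List[str]] = {
    101: ["record_ngram"],
    102: ["record_ngram", "flag_heuristic"],
}


def find_matches(packet: bytes, table: Dict[int, bytes]) -> List[Tuple[int, int]]:
    """Return (offset, pattern_id) for every match, sorted by offset."""
    matches = []
    for pattern_id, pattern in table.items():
        start = packet.find(pattern)
        while start != -1:
            matches.append((start, pattern_id))
            start = packet.find(pattern, start + 1)
    return sorted(matches)


def actions_in_offset_order(packet: bytes) -> List[Tuple[int, str]]:
    """Expand each match into its actions, lowest offset first."""
    return [
        (offset, action)
        for offset, pattern_id in find_matches(packet, PATTERN_TABLE)
        for action in ACTION_TABLE[pattern_id]
    ]


packet = bytes.fromhex("1023ae42f6f28762aaf6f2")
match_list = find_matches(packet, PATTERN_TABLE)
```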

The set of 8-grams provided by security platform 122 can be added (e.g., as exact string matches) to the table of matches that threat pattern matcher 408 already performs (e.g., heuristic matches that look for specific indicators of malware, such as where a JavaScript file accesses password storage or where a PE file calls the Local Security Authority Subsystem Service (LSASS) API). One advantage of this approach is that the 8-grams can be searched for in parallel with the other searches performed by threat pattern matcher 408, instead of making multiple passes through a packet (e.g., first evaluating heuristic matches and then evaluating 8-gram matches).

As described in more detail below, 8-gram matches are used by both linear and non-linear classification models in various embodiments. Examples of actions that can be specified for n-gram matches include incrementing a weighted counter (e.g., for a linear classifier) and preserving the match in a feature vector (e.g., for a non-linear classifier). Which action is taken can be specified based on the file type associated with the packet (which determines which type of model is used).
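The two kinds of action can be illustrated with a small dispatch sketch. All names here (`SessionState`, `on_ngram_match`, `MODEL_KIND_BY_FILE_TYPE`) are hypothetical, and the pairing of PE files with a non-linear model is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical mapping from file type to the kind of model used for it.
MODEL_KIND_BY_FILE_TYPE: Dict[str, str] = {"javascript": "linear", "pe": "nonlinear"}


@dataclass
class SessionState:
    file_type: str
    score: float = 0.0  # weighted counter, used by linear models
    feature_vector: List[int] = field(default_factory=list)  # counts, used by non-linear models


def on_ngram_match(session: SessionState, feature_index: int, weights: Dict[int, float]) -> None:
    """Apply the action that the session's file type calls for."""
    if MODEL_KIND_BY_FILE_TYPE[session.file_type] == "linear":
        session.score += weights[feature_index]      # increment the weighted counter
    else:
        session.feature_vector[feature_index] += 1   # preserve the match in the vector


js_session = SessionState(file_type="javascript")
for index in (0, 0, 1):
    on_ngram_match(js_session, index, {0: 0.5, 1: -0.25})

pe_session = SessionState(file_type="pe", feature_vector=[0, 0, 0])
for index in (2, 2):
    on_ngram_match(pe_session, index, {})
```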

3. Model selection

In some cases, a particular file type is specified in the file's header (e.g., as a magic number appearing in the first seven bytes of the file itself). In such a scenario, threat engine 244 can select the appropriate model corresponding to the specified file type (e.g., based on a table, provided by security platform 122, that enumerates file types and corresponding models). In other cases, such as JavaScript, a magic number or other file type identifier (if present in a header) does not establish which classification model should be used. As one example, JavaScript would have a file type of "textfile". To identify file types such as JavaScript, decoder 402 can be used to perform deterministic finite state automaton (DFA) pattern matching and apply heuristics (e.g., identifying <script> and other indicators that the file is JavaScript). The determined file type and/or selected classification model is saved in the session state. The file type associated with a session can be updated as the session progresses. For example, in a stream of text, when a <script> tag is encountered, the JavaScript file type can be assigned to the session. When the corresponding </script> is encountered, the file type can be changed (e.g., back to plain text).
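A simplified sketch of the two selection paths follows. The "MZ" and "%PDF" magic numbers are the well-known PE and PDF signatures; the model names, the `Session` class, and the use of plain substring search in place of DFA pattern matching are all assumptions for illustration. Tags split across chunks are not handled.

```python
from typing import Dict, Optional

MAGIC_TO_FILE_TYPE: Dict[bytes, str] = {b"MZ": "pe", b"%PDF": "pdf"}

# Hypothetical table of file types and corresponding models.
MODEL_BY_FILE_TYPE: Dict[str, str] = {
    "pe": "nonlinear_pe",
    "pdf": "nonlinear_pdf",
    "javascript": "linear_js",
}


def file_type_from_magic(header: bytes) -> Optional[str]:
    """Return the file type implied by a leading magic number, if any."""
    for magic, file_type in MAGIC_TO_FILE_TYPE.items():
        if header.startswith(magic):
            return file_type
    return None


class Session:
    """Holds the file type currently associated with a text session."""

    def __init__(self) -> None:
        self.file_type = "text"

    def observe_text(self, chunk: str) -> None:
        """Update the file type from the last script tag seen in the chunk."""
        lowered = chunk.lower()
        opened = lowered.rfind("<script")
        closed = lowered.rfind("</script>")
        if opened > closed:
            self.file_type = "javascript"
        elif closed > opened:
            self.file_type = "text"

    def model(self) -> Optional[str]:
        return MODEL_BY_FILE_TYPE.get(self.file_type)
```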

4. Linear classification model

One way to represent a linear model is with the following linear equation:

$$\sum_{i=1}^{P} \beta_i x_i < C$$

where P is the total number of features, x_i is the i-th feature, β_i is the coefficient (weight) of feature x_i, and C is a threshold constant. In this example, C is the threshold for a maliciousness verdict, meaning that if the sum for a given file is less than C, the file is assigned a benign verdict, and if the sum is greater than or equal to C, the file is assigned a malicious verdict.

One approach for data appliance 102 to use a linear classification model is as follows. A single float (d) is used to track the score for an incoming file, and a hash table is used to store the observed n-grams and corresponding coefficients (i.e., x_i and β_i). For each incoming packet, each of the n-gram features (e.g., as provided by security platform 122) is checked. Whenever a match is found for a feature (x_i) in the hash table, the coefficient (β_i) stored in the hash table for that feature is added to the single float (d). When the end of the file is reached, the single float (d) is compared against the threshold (C) to determine a verdict for the file.
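The single-float approach can be sketched as below. The coefficient values, the threshold, and the function names are hypothetical; a Python dict stands in for the hash table, and n-grams that straddle a packet boundary are ignored in this sketch.

```python
from typing import Dict, Iterable


def score_packet(d: float, packet: bytes, coefficients: Dict[bytes, float], n: int = 8) -> float:
    """Add the coefficient of every n-gram feature found in the packet to d."""
    for offset in range(len(packet) - n + 1):
        beta = coefficients.get(packet[offset:offset + n])
        if beta is not None:
            d += beta
    return d


def linear_verdict(packets: Iterable[bytes], coefficients: Dict[bytes, float], threshold: float) -> str:
    """Score a file packet by packet, then compare the total against C."""
    d = 0.0  # the only per-session state
    for packet in packets:
        d = score_packet(d, packet, coefficients)
    return "malicious" if d >= threshold else "benign"


# Hypothetical 8-gram features and weights.
coefficients = {b"AAAAAAAA": 0.5, b"BBBBBBBB": 0.25}
packets = [b"xxAAAAAAAAyy", b"BBBBBBBBzzAAAAAAAA"]
```

With these inputs the accumulated score is 0.5 + 0.25 + 0.5 = 1.25, so a threshold of 1.0 yields a malicious verdict and a threshold of 2.0 yields a benign one.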

For n-gram counts, the feature x_i is equal to the number of times the i-th n-gram is observed. Suppose that for a particular file the i-th n-gram is observed four times. 4*β_i can be rewritten as β_i + β_i + β_i + β_i. Instead of counting how many times the i-th n-gram is observed (i.e., four times) and then multiplying by β_i, another approach is to add β_i each time the i-th n-gram is observed. Suppose further that the j-th n-gram is observed three times for the file. 3*β_j can similarly be written as β_j + β_j + β_j, so that instead of counting how many times the j-th n-gram is observed and multiplying at the end, β_j is added each time it is observed.

To find Σ(β_i x_i), each of β_i x_i, β_j x_j, ... is added together (where ... corresponds to all of the other features/weights). This can be rewritten as β_i + β_i + β_i + β_i + β_j + β_j + β_j + .... Because addition is commutative, the values can be added in any order (e.g., β_i + β_j + β_i + β_j + β_i + β_i + β_j, etc.) and accumulated into a single float. Suppose the float (d) starts at 0.0. Each time feature x_i is observed, β_i can be added to the float d, and each time x_j is observed, β_j can be added to the float d. This approach allows a single 4-byte float to serve as the entire per-session memory, in contrast with an approach in which per-session memory is proportional to the number of features because the entire feature vector is held in memory so that it can be multiplied by the weight vector. Using an example of 1,000 features at 4 bytes each, 4 KB of storage would be needed (as compared to a single 4-byte float), which is 1,000 times more expensive.
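The equivalence between counting-then-multiplying and adding on each observation, and the memory comparison, can be checked with a short sketch. The weight values are hypothetical; the observation order is the interleaved order given in the text.

```python
import struct
from collections import Counter

weights = {"i": 0.3, "j": -0.1}  # hypothetical beta values
observations = ["i", "j", "i", "j", "i", "i", "j"]  # i seen 4 times, j seen 3 times

# Approach 1: keep a count per feature, then multiply by the weights at the end.
counts = Counter(observations)
batch_score = sum(weights[name] * count for name, count in counts.items())

# Approach 2: keep one float and add the weight on every observation.
d = 0.0
for name in observations:
    d += weights[name]

# Per-session memory: one 4-byte float versus a 1,000-entry vector of 4-byte floats.
float_size = struct.calcsize("<f")
vector_bytes = 1000 * float_size
```

The two scores agree up to floating-point rounding, and `vector_bytes` is 1,000 times `float_size`.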

5. Non-linear classification models

A variety of non-linear classification approaches can be used in conjunction with the techniques described herein. One example of a non-linear classification model is a gradient boosting tree. In this example, the feature vector is initialized to an all-zero vector. Unfortunately, with a non-linear model (unlike a linear model), the entire set of features whose presence is being detected (e.g., 1,000 features) is persisted for the full duration of the session. While this is less efficient than the linear approach, some efficiency can still be gained by downsampling the features to 1-byte values (0-255) rather than the full 4-byte floats that could be used on a device whose memory is not constrained.

As data appliance 102 scans through the file, each time a feature is observed, the value of that feature is incremented by one in the feature vector. Once the end of the file is reached (or feature observation otherwise ends), the constructed feature vector is fed into the gradient boosting tree model (e.g., as received from security platform 122). As described in more detail below, non-linear classification models can be built using both n-gram (e.g., 8-gram) and non-n-gram features. One example of a non-n-gram feature is the purported size of the file (which can be read as a value from the packet containing the file's header). Any file data that appears after the purported end of the file (e.g., based on the file size specified in the header) is referred to as overlay. In addition to serving as a feature, the purported file length can be used as a proxy for how long the file is expected to be. The non-linear classifier can be run against the packet stream for the file until the purported file length is reached, and a verdict can then be formed for the file, regardless of whether the end of the file has actually been reached. That a given file contains overlay is also an example of a feature that can be used as part of a non-linear classification model. In various embodiments, the overlay portion of the file is not analyzed, which again allows analysis to be performed before the actual end of the file. In other embodiments, feature extraction continues, and a maliciousness verdict is not formed, until the actual end of the file is reached.
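A per-session feature vector with one byte per feature, together with a cutoff at the purported file length, might look like the sketch below. The class and function names are hypothetical, and saturating each count at 255 is an assumption about how the 1-byte downsampling could behave; the text does not specify it.

```python
class FeatureVector:
    """All-zero vector of 1-byte counts, one per model feature."""

    def __init__(self, num_features: int) -> None:
        self.counts = bytearray(num_features)

    def observe(self, index: int) -> None:
        # Assumption: counts saturate at 255 rather than wrapping around.
        if self.counts[index] < 255:
            self.counts[index] += 1


def bytes_before_purported_end(packet_len: int, consumed: int, purported_size: int) -> int:
    """How many bytes of this packet fall before the purported end of file."""
    return max(0, min(packet_len, purported_size - consumed))


def has_overlay(total_received: int, purported_size: int) -> bool:
    """True if data arrived beyond the size stated in the header."""
    return total_received > purported_size


vector = FeatureVector(1000)
for _ in range(300):
    vector.observe(7)
```

With 1,000 features the vector occupies 1,000 bytes per session rather than 4,000.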

In one example embodiment, the tree model comprises 5,000 binary trees. Every node on each tree contains a feature and a corresponding threshold. An example of a portion of a tree is shown in FIG. 5. In the example shown in FIG. 5, if the value of a feature (e.g., feature F4) is less than its threshold (e.g., 30), the left branch is taken (502). If the value of the feature is greater than or equal to the threshold, the right branch is taken (504). The tree is traversed until a leaf node (e.g., node 506), which has an associated value (e.g., 0.7), is reached. The values of the leaves reached (one per tree) are summed (rather than multiplied) to obtain a final score from which the verdict is computed. If the score is below a threshold, the file is considered benign, and if it is at or above the threshold, the file is considered malicious. The absence of multiplication in obtaining the final score helps the model be used more efficiently in the resource constrained environment of data appliance 102.
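The traversal and summation can be sketched with two toy trees. The first tree loosely mirrors the F4 < 30 split and the 0.7 leaf mentioned for FIG. 5; every other node, value, and name is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    value: float


@dataclass
class Node:
    feature: int       # index into the feature vector
    threshold: float
    left: Union["Node", Leaf]   # taken when the feature value is below the threshold
    right: Union["Node", Leaf]  # taken when it is at or above the threshold


def eval_tree(node: Union[Node, Leaf], features: Sequence[int]) -> float:
    """Walk one tree from the root down to a leaf and return the leaf value."""
    while isinstance(node, Node):
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.value


def ensemble_score(trees: Sequence[Union[Node, Leaf]], features: Sequence[int]) -> float:
    """Sum (not multiply) the leaf values reached across all trees."""
    return sum(eval_tree(tree, features) for tree in trees)


trees = [
    Node(feature=4, threshold=30, left=Leaf(0.7), right=Leaf(-0.2)),
    Node(feature=0, threshold=1, left=Leaf(-0.5),
         right=Node(feature=4, threshold=10, left=Leaf(0.1), right=Leaf(0.4))),
]
features = [2, 0, 0, 0, 12]
score = ensemble_score(trees, features)
verdict = "malicious" if score >= 1.0 else "benign"
```

Because the trees are read-only during evaluation, one copy can be shared across sessions while each session keeps only its own `features` vector.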

In various embodiments, the trees themselves are fixed on data appliance 102 (until an updated model is received) and can be stored in shared memory that can be accessed by multiple sessions at the same time. The per-session cost is the cost of storing the session's feature vector, which can be zeroed out once analysis of the session is complete.

6. Example process

FIG. 6 illustrates an example of a process for performing inline malware detection on a data appliance. In various embodiments, process 600 is performed by data appliance 102, and in particular by threat engine 244. Threat engine 244 can be implemented using a script (or set of scripts) authored in an appropriate scripting language (e.g., Python). Process 600 can also be performed on an endpoint, such as client device 110 (e.g., by an endpoint protection application executing on client device 110).

Process 600 begins at 602 when an indication is received by appliance 102 that a file is being transmitted as part of a session. As one example of the processing performed at 602, for a given session, an associated protocol decoder can call or otherwise make use of an appropriate file-specific decoder when the start of a file is detected by the protocol decoder. As explained above, the file type is determined (e.g., by decoder 402) and associated with the session (e.g., so that subsequent file type analysis need not be performed until the file type changes or file packets are no longer being transmitted).

At 604, n-gram analysis is performed on a sequence of received packets. As explained above, the n-gram analysis can be performed inline with other analysis being performed on the session by appliance 102. For example, while appliance 102 is performing analysis on a particular packet (e.g., to check for the presence of particular heuristics), it can also determine whether any 8-grams in the packet match 8-grams provided by security platform 122. During the processing performed at 604, when an n-gram match is found, the corresponding pattern ID is used to map the match to an action based on the file type. The action either increments a weighted counter (e.g., where the file type is associated with a linear classifier) or updates a feature vector to account for the match (e.g., where the file type is associated with a non-linear classifier).

The n-gram analysis continues, packet by packet, until either an end-of-file condition or a checkpoint is reached. At that point (606), the appropriate model is used to determine a verdict for the file (i.e., the final value obtained using the model is compared against a maliciousness threshold). As mentioned above, the model incorporates n-gram features and can also incorporate other features (e.g., in the case of a non-linear classifier).

Finally, at 608, an action is taken in response to the determination made at 606. One example of a responsive action is terminating the session. Another example of a responsive action is allowing the session to continue but preventing the file from being transmitted (and instead placing it in a quarantine area). In various embodiments, appliance 102 is configured to share its verdicts (whether benign verdicts, malicious verdicts, or both) with security platform 122. When security platform 122 completes its own independent analysis of the file, it can use the verdict reported by appliance 102 for a variety of purposes, including assessing the performance of the model that formed the verdict.
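The overall flow of steps 604 through 608 can be sketched as a single loop. The function `process_file`, its callback parameters, the byte-count checkpoint, and the action names are all hypothetical; the per-packet analysis and the model are passed in as callables so the sketch stays independent of any particular classifier.

```python
from typing import Callable, Iterable, Optional, Tuple


def process_file(
    packets: Iterable[bytes],
    on_packet: Callable[[bytes], None],
    final_score: Callable[[], float],
    threshold: float,
    checkpoint: Optional[int] = None,
) -> Tuple[str, str]:
    """Analyze packets until end of file or a checkpoint, then decide and act."""
    seen = 0
    for packet in packets:
        on_packet(packet)  # 604: n-gram analysis on this packet
        seen += len(packet)
        if checkpoint is not None and seen >= checkpoint:
            break  # checkpoint reached before the end of the file
    # 606: compare the model's final value against the maliciousness threshold.
    verdict = "malicious" if final_score() >= threshold else "benign"
    # 608: pick a responsive action (hypothetical action names).
    action = "terminate_session" if verdict == "malicious" else "allow"
    return verdict, action
```

A caller would supply, for example, a closure that updates a running score in `on_packet` and returns it from `final_score`.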

An example of a threat signature for a sample is shown in FIG. 7B. In particular, for a sample having a SHA-256 hash of "4d73f42438fb5a8579219cdfa9cbbb4ce3f771ffed93af81b052831e4813f8", the first value in each pair corresponds to a feature, and the second value corresponds to a count. In the example shown in FIG. 7B, features consisting of a number (e.g., feature "3905") correspond to n-gram features, and features consisting of a "J" and a number (e.g., feature "J18") correspond to non-n-gram features.

In one example embodiment, security platform 122 is configured to target a particular false positive rate (e.g., 0.001) when generating models for use by appliances such as data appliance 102. Accordingly, in some cases (e.g., one out of every 1,000 files), data appliance 102 may incorrectly determine that a benign file is malicious when performing inline analysis using a model in accordance with the techniques described herein. In such a scenario, if security platform 122 subsequently determines that the file is in fact benign, it can add the file to a whitelist so that it is not subsequently flagged as malicious (e.g., by another appliance).

One approach to whitelisting is for security platform 122 to direct that the file be added to a whitelist stored on appliance 102. Another approach is for security platform 122 to inform whitelist system 154 of false positives, and for whitelist system 154, in turn, to keep appliances such as appliance 102 up to date with false positive information. As mentioned above, one issue with appliances such as appliance 102 is that they are resource constrained. One approach to minimizing the resources used in maintaining a whitelist on an appliance is to maintain the whitelist using a Least Recently Used (LRU) cache. The whitelist can comprise file hashes, and can also be based on other elements, such as feature vectors or hashes of feature vectors.
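A bounded whitelist backed by an LRU cache can be sketched with the standard library's `OrderedDict`. The class name, the capacity, and the choice of SHA-256 file hashes as keys are assumptions for illustration.

```python
import hashlib
from collections import OrderedDict


class LRUWhitelist:
    """Fixed-capacity whitelist that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, bool]" = OrderedDict()

    def add(self, file_hash: str) -> None:
        self._entries[file_hash] = True
        self._entries.move_to_end(file_hash)   # mark as most recently used
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used

    def __contains__(self, file_hash: str) -> bool:
        if file_hash in self._entries:
            self._entries.move_to_end(file_hash)  # a lookup counts as a use
            return True
        return False


def file_key(data: bytes) -> str:
    """Hash file contents to a whitelist key (SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


whitelist = LRUWhitelist(capacity=2)
whitelist.add(file_key(b"benign file A"))
whitelist.add(file_key(b"benign file B"))
a_is_listed = file_key(b"benign file A") in whitelist  # refreshes A
whitelist.add(file_key(b"benign file C"))              # evicts B, the least recently used
```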

VI. Building the Model

Returning to the environment shown in FIG. 1, as previously described, security platform 122 is configured to perform static and dynamic analysis on the samples it receives. Security platform 122 can receive samples for analysis from a variety of sources. As mentioned above, one exemplary type of sample source is a data appliance (e.g., data appliances 102, 136, and 148). Other sources (e.g., one or more third-party providers of samples, such as other security appliance vendors, security researchers, etc.) can also be used as applicable. As described in more detail below, security platform 122 can use the corpus of samples it receives to build models (e.g., models that can then be used by data appliance 102 in accordance with embodiments of the techniques described herein).

In various embodiments, static analysis engine 306 is configured to perform feature extraction on the samples it receives (e.g., while also performing other static analysis functions as described above). One exemplary process for performing feature extraction (e.g., by security platform 122) is shown in FIG. 8A. Process 800 begins at 802 when static analysis of a sample is initiated. During feature extraction (804), all 8-grams (or other applicable n-grams in embodiments where 8-grams are not used) are extracted from the sample being processed (e.g., sample 130 of FIG. 3). In particular, a histogram of the 8-grams in the sample being analyzed is extracted (e.g., into a hash table), indicating the number of times a given 8-gram was observed in the sample being processed. One advantage of extracting 8-grams during feature analysis by static analysis engine 306 is that it can mitigate potential privacy and contractual concerns in the use of samples obtained from third parties (e.g., in building models), because the original file cannot be reconstructed from the resulting histogram. The extracted histogram is stored at 806.
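The histogram extraction at 804 can be sketched as follows, assuming that an 8-gram is a window of eight consecutive bytes and that windows overlap (slide by one byte). Both points, the function name `ngram_histogram`, and the use of hex strings as keys are assumptions made for illustration.

```python
from collections import Counter


def ngram_histogram(data: bytes, n: int = 8) -> Counter:
    """Count every n-byte window in `data`, keyed by its hex encoding.

    Assumption: windows overlap, advancing one byte at a time.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # range() is empty when the input is shorter than n bytes.
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))
```

Because only counts are kept and the order of the windows is discarded, the histogram records which 8-grams occur and how often, not where they occur in the file.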

In various embodiments, static analysis engine 306 stores the histogram extracted for a given sample (e.g., represented using a hash table) in storage 142 (e.g., a Hadoop cluster), along with histograms extracted from other samples. Data in Hadoop is compressed, and when operations are performed on the Hadoop data, the necessary data is decompressed on the fly. An example hash table (represented in JSON) for a file is shown in FIG. 7A. Line 702 shows the SHA-256 hash of the file. Line 704 shows the UNIX time at which sample 130 arrived at security platform 122. Line 706 shows the counts of n-grams in the overlay section (e.g., 'd00fbf4e08bc366':1 indicates that one instance of 'd00fbf4e08bc366' was found in the overlay section). Line 708 shows the count of each 8-gram present in the file. Line 710 indicates that the file has an overlay. Line 712 indicates that the file type of the file is ".exe". Line 714 shows the UNIX time at which security platform 122 finished processing sample 130. Line 716 shows the count of each non-8-gram feature that the file hit. Finally, line 718 indicates that the file was determined to be malicious (e.g., by security platform 122).

In one exemplary embodiment, the set of 8-gram histograms stored in the Hadoop cluster grows by approximately 3 terabytes of 8-gram histogram data per day. The histograms correspond to both malicious and benign samples (e.g., labeled as such based on the results of other static and dynamic analysis performed by security platform 122, as described above).

The histogram of 8-grams extracted from a sample being analyzed is approximately 10% larger than the file itself, and a typical sample has a histogram containing approximately one million distinct 8-grams. The total number of distinct possible 8-grams is 2 to the power of 64 (2⁶⁴). In contrast, as noted above, the classification model transmitted by security platform 122 (e.g., as part of a subscription) to devices such as data appliance 102 contains, in various embodiments, only a few thousand features (e.g., 1000 features). One exemplary method for reducing a set of potentially up to 2⁶⁴ features to the 1000 most important features for use in the model is to use a mutual information technique. Other approaches (e.g., chi-squared score) are also applicable. The four required parameters are the number of malicious samples having a given feature, the number of benign samples having a given feature, the total number of malicious samples, and the total number of benign samples.

One advantage of mutual information is that it can be used efficiently on very large data sets. In Hadoop, the mutual information approach can be performed in a single pass (i.e., through all of the 8-gram histograms stored in the Hadoop cluster data set for a given file type) by distributing the task across multiple mappers, each of which is responsible for processing particular features. The features with the highest mutual information can be selected as the set of features most indicative of maliciousness and/or most indicative of benignness, as applicable. The resulting 1000 features can then be used to build models (e.g., linear and non-linear classification models), as applicable. For example, to build a linear classification model, model builder 152 (implemented using a set of open source tools and/or scripts written in a suitable language such as Python) saves the top 1000 features, and applicable weights, as the set of n-gram features for appliance 102 to check (e.g., as described in Section V.A.4 above).
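The following sketch shows how the four parameters listed above are sufficient to compute mutual information between the presence of a feature and the malicious/benign label, and how the highest-scoring features can then be kept. It is a single-machine illustration in Python rather than the distributed Hadoop computation described above; the function names and the input layout (a dict mapping each feature to its two per-class counts) are assumptions.

```python
import heapq
import math


def mutual_information(mal_with, ben_with, mal_total, ben_total):
    """Mutual information (in bits) between feature presence and label."""
    n = mal_total + ben_total
    with_feature = mal_with + ben_with
    without_feature = n - with_feature
    # Each cell: (joint count, feature marginal count, label marginal count).
    cells = [
        (mal_with, with_feature, mal_total),
        (ben_with, with_feature, ben_total),
        (mal_total - mal_with, without_feature, mal_total),
        (ben_total - ben_with, without_feature, ben_total),
    ]
    mi = 0.0
    for joint, feature_margin, label_margin in cells:
        if joint > 0:  # an empty cell contributes nothing
            mi += (joint / n) * math.log2(
                joint * n / (feature_margin * label_margin))
    return mi


def top_k_features(counts, mal_total, ben_total, k=1000):
    """Return the k features with the highest mutual information.

    `counts` maps feature -> (malicious samples with it, benign samples with it).
    """
    return heapq.nlargest(
        k, counts,
        key=lambda f: mutual_information(*counts[f], mal_total, ben_total))
```

A feature that appears in every malicious sample and no benign sample (with equal class sizes) scores 1 bit; a feature that appears equally often in both classes scores 0.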

In some embodiments, a non-linear classification model is also built by model builder 152 using the top 1000 features (or another desired number). In other embodiments, the non-linear classification model is built primarily using the top n-gram features (e.g., 950), but also incorporates other, non-n-gram features (e.g., 50 such features) that can be detected during per-packet feature extraction and analysis. Some examples of non-n-gram features that can be incorporated into the non-linear classification model include (1) the size of the header, (2) the presence or absence of a checksum in the file, (3) the number of sections in the file, (4) the intended length of the file (as indicated in the header of a PE file), (5) whether the file contains an overlay portion, and (6) whether the file requires the Windows EFI subsystem in order to execute the PE.

In some embodiments, rather than using mutual information to select the top 1000 features directly, a larger set of features (an over-generated feature set) is determined. As one example, the top 5000 features can be selected initially using mutual information. The set of 5000 can then be used as input to a traditional feature selection technique (e.g., bagging). Such techniques do not scale well to very large data sets (e.g., the entire Hadoop data set) but are more effective on a reduced set (e.g., 5000 features). The traditional feature selection technique can then be used to select the final 1000 features from the set of 5000 features identified using mutual information.

Once the final 1000 features are selected, one exemplary way of building a non-linear model is to use open source tools such as scikit-learn or XGBoost. Where applicable, parameter tuning can be performed, such as by using cross-validation.

One exemplary process for generating a model is shown in FIG. 8B. In various embodiments, process 850 is performed by security platform 122. Process 850 begins at 852 when a set of extracted features (e.g., including n-gram features) is received. One exemplary way the set of features can be received is by reading the features stored as a result of process 800. At 854, a reduced set of features is determined from the features received at 852. As described above, one exemplary way of determining the reduced set of features is by using mutual information. Other approaches (e.g., chi-squared score) can also be used. Further, as also described above, a combination of techniques can be used at 852/854, such as selecting an initial set of features using mutual information and refining that initial set using bagging or another suitable technique. Finally, as described above, once the features are selected (e.g., at 854), a suitable model is built at 856 (e.g., using open source or other tools and, where applicable, performing parameter tuning). The model (e.g., as generated by model builder 152 using process 850) can be transmitted (e.g., as part of a subscription service) to data appliance 102 and other applicable recipients (e.g., data appliances 136 and 148).

In various embodiments, model builder 152 generates models (e.g., linear and non-linear classification models) on a daily (or other applicable) basis. By performing process 850, or by otherwise generating models periodically, security platform 122 can help ensure that the models used by appliances such as appliance 102 detect the latest types of malware threats (e.g., threats most recently deployed by malicious individuals).

Whenever a newly generated model is determined to be better than an existing model (e.g., as determined based on a set of quality assessment metrics exceeding a threshold), an updated model can be sent to data appliances such as data appliance 102. In some cases, such an update adjusts the weights assigned to features. Such updates are easily deployed to an appliance and adopted by it (e.g., as real-time updates). In other cases, such an update adjusts the features themselves. Such updates can be more complicated to deploy because they may require patches to components of the appliance, such as the decoder. One advantage of using overtraining during model generation is that the model can take into account whether or not the decoder is able to detect particular features.
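The update decision can be sketched as below. The description does not name the quality assessment metrics or their thresholds, so the two metrics used here (true positive rate and false positive ratio), the default limits, and the representation of a linear model as a dict mapping each feature to its weight are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMetrics:
    """Hypothetical quality assessment metrics for a model."""
    true_positive_rate: float
    false_positive_ratio: float


def is_better(candidate, current, max_fpr=0.001, min_tpr_gain=0.0):
    """Return True if the candidate stays within the false positive budget
    and improves detection by more than min_tpr_gain."""
    if candidate.false_positive_ratio > max_fpr:
        return False
    gain = candidate.true_positive_rate - current.true_positive_rate
    return gain > min_tpr_gain


def update_kind(current_weights, candidate_weights):
    """Classify an update by whether the feature set itself changes.

    Both arguments map feature -> weight.
    """
    if current_weights.keys() == candidate_weights.keys():
        # Only the weights differ: can be deployed as a simple update.
        return "weights_only"
    # The features differ: appliance components may need a patch.
    return "features_changed"
```

Separating the two kinds of update mirrors the distinction drawn above: a weights-only change can be pushed directly, while a change to the features may depend on what the decoder can detect.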

In various embodiments, appliances are required (e.g., by security platform 122) to deploy updates to models as they are received. In other embodiments, appliances are able to deploy updates selectively (at least for a period of time). As one example, when a new model is received by appliance 102, the existing model and the new model can both be run in parallel on appliance 102 for a period of time (e.g., with the existing model used in production and the new model reporting on the actions it would have taken, without actually taking them). An administrator of the appliance can indicate whether the existing model or the new model should be used to process traffic on the appliance (e.g., based on which model performs better). In various embodiments, appliance 102 returns telemetry to security platform 122 indicating information such as which model is running on appliance 102 and how effective that model is (e.g., false positive statistics).
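Running the existing and new models side by side can be sketched as follows. The class name `ShadowRunner`, the convention that a model is a callable returning a truthy value for a malicious sample, and the telemetry field names are hypothetical; the actual appliance interfaces are not specified in the description.

```python
from collections import Counter


class ShadowRunner:
    """Enforce the production model's verdict while recording what a
    candidate model would have decided for the same samples."""

    def __init__(self, production_model, candidate_model):
        self._production = production_model
        self._candidate = candidate_model
        self._stats = Counter()

    def process(self, sample):
        enforced = bool(self._production(sample))
        shadow = bool(self._candidate(sample))  # recorded, never enforced
        self._stats["samples"] += 1
        self._stats["production_flagged"] += enforced
        self._stats["candidate_flagged"] += shadow
        self._stats["disagreements"] += enforced != shadow
        return enforced

    def telemetry(self):
        """Summary counts that could be reported back for comparison."""
        return dict(self._stats)
```

An administrator could compare the recorded counts over the trial period before choosing which model should process traffic.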

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims


1. A system, comprising:
a processor configured to:
receive a set of features extracted from a set of files, the set of features including a plurality of n-grams;
determine a reduced set of features that includes at least a portion of the plurality of n-grams, wherein the reduced set of features is determined using mutual information, and wherein a set of features most indicative of maliciousness and/or most indicative of benignness is selected as the reduced set of features; and
use the reduced set of features to generate a model usable by a data appliance to perform inline malware analysis; and
a memory coupled to the processor and configured to provide the processor with instructions.

2. The system of claim 1, wherein the set of features includes features extracted from a set of known malicious files.

3. The system of claim 1, wherein the set of features includes features extracted from a set of known benign files.

4. The system of claim 1, wherein the reduced set of features is determined using a chi-squared score.

5. The system of claim 1, wherein the generated model includes n-gram features.

6. The system of claim 1, wherein the model is a linear model.

7. The system of claim 1, wherein the model is a non-linear model.

8. The system of claim 1, wherein the plurality of n-grams is extracted during static analysis of the set of files.

9. The system of claim 1, wherein the model is transmitted to a first data appliance.

10. The system of claim 9, wherein, in response to a false positive result reported by a second data appliance, the processor is configured to generate an updated model and to transmit the updated model to the first data appliance.

11. A method, comprising:
receiving, by a central processing unit (CPU) of a data appliance, a set of features extracted from a set of files, the set of features including a plurality of n-grams;
determining, by the CPU, a reduced set of features that includes at least a portion of the plurality of n-grams, wherein the reduced set of features is determined using mutual information, and wherein a set of features most indicative of maliciousness and/or most indicative of benignness is selected as the reduced set of features; and
using, by the CPU, the reduced set of features to generate a model usable by the data appliance to perform inline malware analysis.

12. A computer program comprising a plurality of computer instructions stored on a tangible computer-readable storage medium, wherein the computer instructions, when executed, cause a computer to perform:
receiving a set of features extracted from a set of files, the set of features including a plurality of n-grams;
determining a reduced set of features that includes at least a portion of the plurality of n-grams, wherein the reduced set of features is determined using mutual information, and wherein a set of features most indicative of maliciousness and/or most indicative of benignness is selected as the reduced set of features; and
using the reduced set of features to generate a model usable by a data appliance to perform inline malware analysis.