
Multi-level page table traversal acceleration method, system, device and medium

Info

Publication number
CN120407451A
Authority
CN
China
Prior art keywords
cache
prefetch
level
entry
page table
Prior art date
Legal status
Pending
Application number
CN202510896630.XA
Other languages
Chinese (zh)
Inventor
赵国峰
薛海军
王明圣
孙宗齐
王景
Current Assignee
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date: 2025-07-01
Filing date: 2025-07-01
Publication date: 2025-08-01
Application filed by Shandong Inspur Science Research Institute Co Ltd
Priority to CN202510896630.XA
Publication of CN120407451A
Status: Pending

Abstract


The present invention provides a multi-level page table traversal acceleration method, system, device, and medium, belonging to the field of computer technology. The method includes: inputting a virtual address into a path predictor and a dynamic weight hierarchical cache, and collecting the instruction access ratio, TLB hit rate, and prefetch success rate in real time; performing a hash calculation on preset bits of the virtual address in the path predictor and querying a history table to obtain a predicted path and confidence level, which determine the query range; on a cache hit, computing the score of the hit entry, using the score to prioritize cache entries and make eviction decisions, and managing the cache entries with a hybrid replacement algorithm; under preset conditions, initiating a prefetch operation through an adaptive prefetch engine, computing and dynamically expanding the prefetch stride, and preventing memory bandwidth contention through two-level monitoring; and activating an instruction priority mode when the instruction access ratio, TLB hit rate, and prefetch success rate meet the conditions, updating the score of the hit entry by adjusting the relevant calculation parameters.

Description

Multi-level page table traversal acceleration method, system, device and medium
Technical Field
The invention belongs to the field of computer technology, and in particular relates to a multi-level page table traversal acceleration method, system, device and medium.
Background
In modern processor architectures, the multi-level page table mechanism is the core technology enabling virtual-to-physical address translation. As processor performance improves and memory spaces expand, the efficiency of page table traversal directly affects overall system performance. However, conventional multi-level page table walk schemes suffer from the following significant technical drawbacks:
First, cache inefficiency constrains address translation performance. Existing TLBs (Translation Lookaside Buffers) generally adopt a fixed replacement policy (such as the LRU algorithm) and do not consider the actual benefit differences between page table entries of different levels. For example, in a four-level page table structure, an L3-level page table entry hit directly skips the L2-level and L1-level queries, so its actual benefit is three times that of an L1-level hit; but the traditional LRU algorithm uses only access time as the replacement criterion, cannot quantify the benefit differences between levels, and frequently evicts high-value page table entries. This lack of hierarchy-benefit awareness makes it difficult for TLB hit rates to break through their bottleneck, and the performance loss is especially significant in deep page table scenarios.
Second, the locality failure of the prefetch mechanism causes cross-level access efficiency to drop. Because the page table structure is tree-shaped, page table entries corresponding to consecutive virtual addresses are sparsely distributed in physical memory rather than stored contiguously. Existing linear prefetch mechanisms (such as fixed-stride prefetching) cannot adapt to the branching characteristics of the tree structure when accesses cross levels, so the prefetch hit rate drops by more than 40%. For example, when a page table walk path jumps from a PUD (second-level page table) to a PTE (fourth-level page table), the linearly prefetched page table entries diverge from the actual access path, wasting memory bandwidth and adding delay.
Furthermore, resource contention between instruction pages and data pages forms a performance degradation loop. When instruction page table entries share the STLB (secondary TLB) with data page table entries, high-frequency instruction accesses continually squeeze out the cache space of data page table entries. Traditional schemes lack differentiated management by page type, so data page table entries are frequently invalidated, forcing renewed page table walks on data accesses; these data-side misses further aggravate TLB conflicts, forming a vicious circle of cache competition and worsening invalidation. Especially under instruction-intensive workloads, this problem can increase data access delay by more than 30%.
The essence of the problem is that existing page table traversal schemes optimize individual stages such as cache management, prefetch policy and page type control in isolation, and lack a cross-level collaborative management mechanism. Each module operates independently and cannot be dynamically linked based on page table level characteristics, access patterns and resource states, so system-level performance optimization is limited.
Disclosure of Invention
In view of these problems, the invention aims to provide a multi-level page table traversal acceleration method, system, device and medium, which accelerate page table traversal, improve the cache hit rate and optimize memory bandwidth through path prediction, dynamic cache grading, adaptive prefetching and instruction-priority parameter adjustment.
The above object of the invention is achieved by the following technical solution:
in a first aspect, an embodiment of the present application provides a multi-level page table walk acceleration method, including:
inputting the virtual address into a path predictor and a dynamic weight hierarchical cache respectively, and collecting the instruction access ratio, TLB hit rate and prefetch success rate in real time;
performing a hash calculation according to the preset bits of the virtual address through the path predictor, and querying the history table to obtain a predicted path and confidence level;
when the cache hits, calculating the score of the hit entry using a comprehensive scoring formula, prioritizing and evicting cache entries based on the score, and managing the cache entries with a hybrid replacement algorithm;
when the preset trigger condition is met, starting the prefetch operation through the adaptive prefetch engine, computing and dynamically expanding the prefetch stride using Q4.12 fixed-point numbers, and preventing memory bandwidth contention through two-level monitoring;
when the instruction access ratio, TLB hit rate and prefetch success rate meet preset conditions, activating the instruction priority mode and updating the score of the hit entry by adjusting the relevant parameters of the comprehensive scoring formula.
In an optional implementation, inputting the virtual address into the path predictor and the dynamic weight hierarchical cache respectively, and collecting the instruction access ratio, TLB hit rate and prefetch success rate in real time, includes:
Inputting virtual addresses VA [47:0] to the path predictor and the dynamic weight hierarchical cache respectively;
when the CPU executes an instruction, the counter value counter is incremented by 1, and when the CPU accesses data it is decremented by 1; the instruction access ratio InstrRatio is calculated in real time via the formula InstrRatio = (counter + 32768) / 65536.0;
the TLB hit rate is calculated by reading the times of TLB hit and miss recorded in the hardware performance counter;
Reading the number of prefetch hits and misses recorded in a hardware performance counter, and calculating the prefetch success rate;
broadcasting the collected instruction access ratio, TLB hit rate and prefetch success rate over a 32-bit status bus every cycle, where bit field [7:0] encodes the operating mode, [15:8] the instruction ratio, [23:16] the prefetch success rate, and [31:24] the TLB hit rate.
In an optional embodiment, performing the hash calculation according to the preset bits of the virtual address through the path predictor, querying the history table to obtain the predicted path and confidence level, and determining the query range from the confidence level through the dynamic weight hierarchical cache includes:
performing a hash calculation on virtual address bits VA[47:39] in the path predictor, and querying the history table to obtain a predicted path and path prediction confidence Z;
determining the confidence level from the path prediction confidence Z: high when Z ≥ 6, medium when 5 < Z < 6, and low when Z ≤ 5;
determining which page table levels to query according to the confidence level, the page table adopting a four-level PGD, PUD, PMD, PTE layout;
querying only the predicted level ±1 when the confidence level is high, and performing a full-level parallel query when the confidence level is low.
In an optional embodiment, calculating the score of the hit entry using the comprehensive scoring formula when the cache hits, prioritizing and evicting cache entries based on the score, and managing the cache entries with a hybrid replacement algorithm includes:
when a cache hit occurs, calculating the Score of the hit entry using the comprehensive scoring formula Score = α·LevelWeight + β·TypeBoost − γ·Δt, where α is a scaling factor of the path prediction confidence, LevelWeight is the weight of the page table level holding the cache entry, β is the type gain coefficient of the instruction access ratio, TypeBoost is the type gain of the cache entry, γ is a time decay factor related to the prefetch hit rate, and Δt is the access time interval of the cache entry; LevelWeight = 4 if the cache entry is at the PGD level, 3 at the PUD level, 2 at the PMD level, and 1 at the PTE level;
determining the cache priority of the entry according to its Score and caching it;
when the number of empty entries in the current cache is greater than or equal to 10% of the total number of entries, classifying the cache entries into four hierarchical categories (PGD, PUD, PMD, PTE) and randomly selecting 2 candidate entries from each category for replacement;
evicting entries with Score < 2 when the number of empty entries in the current cache is less than 10% of the total number of entries;
locking entries that hit 3 times consecutively in the cache for 200 cycles.
In an alternative embodiment, starting the prefetch operation through the adaptive prefetch engine when the preset trigger condition is met, computing and dynamically expanding the prefetch stride using Q4.12 fixed-point numbers, and preventing memory bandwidth contention through two-level monitoring includes:
if the entry loading of the current PTE level is complete, the remaining dynamic cache space exceeds 20% of the total cache space, and Z ≥ 5, starting the prefetch operation through the adaptive prefetch engine;
updating the stride value every 100 cycles, with an initial stride of ±2 pages, and computing the updated stride value from the real-time entry hit rate using the stride calculation formula, so as to dynamically expand the stride;
sampling the L1D-Cache miss rate Q every 50 cycles, and disabling the prefetch operation for 200 cycles when Q > 15% occurs 3 times.
In an alternative embodiment, the stride calculation formula is stride = 2^(clog2(HitRate × 100) mod 4), where clog2 is the ceiling of log2 and HitRate is the real-time entry hit rate; it is evaluated in Q4.12 fixed point with the stride capped at 8 pages, consistent with the hardware implementation given below.
In an optional implementation, activating the instruction priority mode when the instruction access ratio, TLB hit rate and prefetch success rate meet the preset conditions, and updating the score of the hit entry by adjusting the relevant parameters of the comprehensive scoring formula, includes:
detecting the instruction access ratio InstrRatio in real time;
when InstrRatio > 60%, TLB hit rate < 85% and prefetch success rate > 40% are detected, activating the instruction priority mode, raising TypeBoost to 1.2 times its current value, forcing the β value to 0.1, and updating the Score of the entry;
dynamically adjusting the type gain coefficient β of the instruction access ratio in real time according to the formula β = max(0, min(0.5, 0.2 + 0.5·(InstrRatio − 0.6))), and updating the Score of the entry.
In a second aspect, an embodiment of the present application further provides a multi-level page table walk acceleration system, including:
the address input and state monitoring module is used for inputting the virtual address into the path predictor and the dynamic weight hierarchical cache respectively, and collecting the instruction access ratio, the TLB hit rate and the prefetch success rate in real time;
The path prediction and query planning module is used for carrying out hash calculation according to the preset bit of the virtual address through the path predictor, and obtaining a predicted path and a confidence level by querying the history record table;
the weight evaluation and cache management module is used for calculating the score of the hit entry using the comprehensive scoring formula when the cache hits, prioritizing and evicting cache entries based on the score, and managing the cache entries with a hybrid replacement algorithm;
The prefetch strategy adjustment module is used for starting prefetch operation through the self-adaptive prefetch engine when a preset trigger condition is met, calculating and dynamically expanding prefetch step length by adopting Q4.12 fixed point numbers, and simultaneously preventing memory bandwidth contention through two-stage monitoring;
The mode switching and parameter adjusting module is used for activating the instruction priority mode when the instruction access ratio, the TLB hit rate and the prefetch success rate meet the preset conditions, and updating the score of the hit entry by adjusting the related parameters of the comprehensive scoring formula.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the steps of the multi-level page table walk acceleration method described in any of the foregoing.
In a fourth aspect, embodiments of the present application also provide a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the multi-level page table walk acceleration method as described in any of the preceding claims.
From the above technical scheme, the invention has the following advantages:
In the multi-level page table traversal acceleration method provided by the application, efficient page table traversal is achieved through the cooperation of the path predictor and the dynamic weight hierarchical cache, combined with the dynamic indicators collected in real time: the instruction access ratio, TLB hit rate and prefetch success rate. The path predictor computes a predicted path from a hash of virtual address bits and queries in a graded manner according to confidence: when confidence is high, the query range is narrowed to reduce time; when confidence is low, a full-level parallel query guarantees accuracy. The dynamic weight hierarchical cache preferentially retains high-level page tables and frequently accessed entries through the comprehensive scoring formula (fusing level weight, type gain and time decay) and a hybrid replacement algorithm, improving cache utilization. The adaptive prefetch engine starts prefetching based on the PTE loading state, remaining cache space and prediction confidence, dynamically adjusts the stride with Q4.12 fixed-point numbers to match the hit rate, and prevents memory bandwidth contention by monitoring the L1D-Cache miss rate, improving prefetch efficiency. When the instruction access ratio is high and the TLB hit rate is low, the instruction priority mode is activated and the scoring parameters are adjusted to raise the priority of instruction-related page table entries, reducing instruction access delay. The overall scheme effectively improves the TLB hit rate, reduces page table traversal delay, optimizes memory bandwidth allocation, and significantly improves system performance in instruction-intensive scenarios.
The application performs a hash calculation on the virtual address through the path predictor, accurately predicts the access path, and dynamically adjusts the query range according to the confidence level: focusing on the predicted level ±1 when confidence is high, and performing a full-level parallel query when confidence is low, which greatly reduces invalid traversal and improves query efficiency.
The application intelligently orders and evicts cache entries based on the comprehensive scoring formula. Combining multi-dimensional factors such as page table level weight, instruction access ratio and time decay, high-value entries are preferentially retained, improving the cache hit rate and reducing memory access delay.
The prefetch engine dynamically triggers prefetching according to the PTE loading state, cache space and prediction confidence, dynamically expands the prefetch stride with a Q4.12 fixed-point algorithm, and intelligently toggles prefetching by monitoring the L1D-Cache miss rate, improving the prefetch hit rate while avoiding memory bandwidth contention.
The application monitors the instruction access ratio, TLB hit rate and prefetch success rate in real time, and activates the instruction priority mode when specific conditions are satisfied. By raising the TypeBoost values of instruction-related entries and adjusting the β parameter, priority caching of key instruction page table entries is ensured, instruction execution delay is markedly reduced, and CPU utilization is improved.
The application broadcasts performance indicators at cycle granularity over the 32-bit status bus and collects key data such as TLB hits and prefetch success rate in real time via hardware performance counters, forming deep cooperation between the software algorithm and the hardware architecture, fully exploiting hardware resources, and achieving system-level performance optimization.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings needed in the description are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be derived from them by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a multi-level page table walk acceleration method provided by the application.
Fig. 2 is a flow chart of another multi-level page table walk acceleration method provided by the present application.
Fig. 3 is a flow chart of a multi-level path prediction generation method provided by the application.
Fig. 4 is a flow chart of a dynamic weight hierarchical cache management method provided by the application.
Fig. 5 is a flow chart of a cross-level adaptive prefetch control method according to the present application.
Fig. 6 is a flow chart of a system state monitoring and mode switching method provided by the application.
Fig. 7 is a schematic structural diagram of a multi-level page table walk acceleration system provided by the present application.
Fig. 8 is a schematic structural diagram of an electronic device provided by the present application.
Detailed Description
Various embodiments of the present disclosure are described more fully below through the specific steps of the multi-level page table walk acceleration method. The present disclosure is capable of various embodiments, modifications and variations. It should be understood, however, that there is no intent to limit the various embodiments of the present disclosure to the particular forms disclosed herein; rather, the disclosure covers all modifications, equivalents, and/or alternatives falling within its spirit and scope.
Hereinafter, the terms "comprises" or "comprising" as used in various embodiments of the present disclosure indicate the presence of the disclosed functions, operations or elements, and do not exclude the addition of one or more further functions, operations or elements. Such terms refer to a particular feature, number, step, operation, element, component, or combination thereof, and should not be interpreted as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flowchart of a method for accelerating multi-level page table walk in an embodiment is shown, the method comprising:
s1, respectively inputting the virtual address into a path predictor and a dynamic weight hierarchical cache, and collecting the instruction access ratio, the TLB hit rate and the prefetch success rate in real time.
In a specific embodiment, virtual address VA [47:0] is first input to the path predictor and dynamic weight hierarchical cache, respectively.
Meanwhile, a 32-bit sliding window counter tracks the instruction access ratio in real time: when the CPU executes an instruction the counter value is incremented by 1, and when the CPU accesses data it is decremented by 1. The instruction access ratio InstrRatio is calculated in real time via the formula InstrRatio = (counter + 32768) / 65536.0, giving accurate monitoring of the instruction-to-data access ratio.
The TLB hit rate is calculated by reading the times of TLB hit and miss recorded in the hardware performance counter;
Reading the number of prefetch hits and misses recorded in a hardware performance counter, and calculating the prefetch success rate;
The collected instruction access ratio, TLB hit rate and prefetch success rate are broadcast over a 32-bit status bus every cycle, where [7:0] encodes the operating mode, [15:8] the instruction ratio, [23:16] the prefetch success rate, and [31:24] the TLB hit rate, enabling state sharing and cooperative operation among the modules.
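As a concrete illustration of this monitoring step, the sliding-window counter and the status-word packing can be modeled in C as follows. This is a minimal software sketch: the 0-255 scaling of the ratio fields and all identifiers are assumptions for illustration, not the patented hardware itself.

#include <stdint.h>
#include <stdio.h>

/* Sliding-window counter: +1 per instruction access, -1 per data access.
   Centering at 32768 makes InstrRatio = (counter + 32768) / 65536.0 map
   the counter range [-32768, 32767] onto [0.0, 1.0). */
typedef struct { int32_t counter; } AccessMonitor;

static void on_instruction_access(AccessMonitor *m) { if (m->counter < 32767) m->counter++; }
static void on_data_access(AccessMonitor *m) { if (m->counter > -32768) m->counter--; }

static double instr_ratio(const AccessMonitor *m) {
    return (m->counter + 32768) / 65536.0;
}

/* Pack the per-cycle 32-bit status word: [7:0] operating mode, [15:8]
   instruction ratio, [23:16] prefetch success rate, [31:24] TLB hit rate.
   Scaling each ratio to 0..255 is an assumed encoding; the text fixes the
   bit fields but not the scale. */
static uint32_t pack_status(uint8_t mode, double instr, double prefetch, double tlb) {
    return (uint32_t)mode
         | (uint32_t)(instr * 255.0) << 8
         | (uint32_t)(prefetch * 255.0) << 16
         | (uint32_t)(tlb * 255.0) << 24;
}

int main(void) {
    AccessMonitor m = { 0 };
    for (int i = 0; i < 700; i++) on_instruction_access(&m);
    for (int i = 0; i < 300; i++) on_data_access(&m);
    printf("InstrRatio = %.3f, status word = 0x%08x\n",
           instr_ratio(&m), (unsigned)pack_status(1, instr_ratio(&m), 0.42, 0.87));
    return 0;
}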
S2, carrying out hash calculation according to preset bits of the virtual address through a path predictor, inquiring a history record table to obtain a predicted path and confidence level, and determining an inquiring range according to the confidence level through a dynamic weight hierarchical cache.
In a specific embodiment, the path predictor first performs a hash calculation on virtual address bits VA[47:39] and queries the history table to obtain a predicted path and path prediction confidence Z. For example, the path predictor uses a history table with a 4-way set-associative structure and updates the confidence value of the corresponding path after each page table walk: +3 on a hit, −5 on a miss. A CRC8 hash of VA[47:39] yields the index, and a confidence level is output.
Then the confidence level is determined from the path prediction confidence Z: high when Z ≥ 6, medium when 5 < Z < 6, and low when Z ≤ 5.
The page table levels to query are determined by the confidence level; the page table adopts the four-level PGD, PUD, PMD, PTE layout, i.e., the four-level page table (PGD→PUD→PMD→PTE) of the x86 architecture.
When the confidence level is high, only the predicted level ±1 is queried; when it is low, a full-level parallel query is performed.
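The confidence classification and query-range selection of this step can be sketched in C as shown below. Two caveats: with an integer Z the medium band 5 < Z < 6 is empty, so a fractional confidence value is assumed, and the text only specifies the high and low cases, so treating medium like low is an assumption.

#include <stdbool.h>

/* The four levels of the x86-style page table used in the text. */
typedef enum { PGD = 0, PUD = 1, PMD = 2, PTE = 3, NUM_LEVELS = 4 } Level;
typedef enum { CONF_LOW, CONF_MEDIUM, CONF_HIGH } Confidence;

/* High when Z >= 6, medium when 5 < Z < 6, low when Z <= 5. */
static Confidence classify(double z) {
    if (z >= 6.0) return CONF_HIGH;
    if (z > 5.0) return CONF_MEDIUM;
    return CONF_LOW;
}

/* High confidence restricts the search to the predicted level +/-1; low
   (and, by assumption, medium) falls back to a full-level parallel query. */
static void query_range(Confidence c, Level predicted, bool query[NUM_LEVELS]) {
    for (int l = 0; l < NUM_LEVELS; l++) {
        if (c == CONF_HIGH)
            query[l] = (l >= (int)predicted - 1) && (l <= (int)predicted + 1);
        else
            query[l] = true;
    }
}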
And S3, when the cache hits, calculating the score of the hit item by using a comprehensive scoring formula, and carrying out priority ordering and elimination decision of the cache item based on the score, and simultaneously, managing the cache item by adopting a mixed replacement algorithm.
In a specific embodiment, when a cache hit occurs, the Score of the hit entry is calculated using the comprehensive scoring formula Score = α·LevelWeight + β·TypeBoost − γ·Δt, where α is a scaling factor of the path prediction confidence, LevelWeight is the weight of the page table level holding the cache entry, β is the type gain coefficient of the instruction access ratio, TypeBoost is the type gain of the cache entry, γ is a time decay factor related to the prefetch hit rate, and Δt is the access time interval of the cache entry; LevelWeight = 4 if the cache entry is at the PGD level, 3 at the PUD level, 2 at the PMD level, and 1 at the PTE level.
It should be specifically noted that the initial level weights of the dynamic weight hierarchical cache can be flexibly configured according to the actual page table hierarchy and the application scenario; likewise, the cumulative mode of actual benefit calculation, in which per-level weights accumulate, is applicable to page table systems with different hierarchical structures.
The dynamic weight hierarchical cache adopts a multi-dimensional scoring model with three dimensions: level weight, type gain and time decay. The level weights are initially set to L4 = 4, L3 = 3, L2 = 2, L1 = 1, and the actual benefit is computed cumulatively; for example, on an L3 hit the total weight = L3 + L2 + L1 = 6. The type gain enters through the type gain coefficient β of the instruction access ratio, and the time decay uses a 48-bit high-precision timer to measure the access time interval Δt of a cache entry. In the comprehensive scoring formula, α is scaled dynamically by the path prediction confidence, β is computed in real time by the monitoring module, and γ varies dynamically with the prefetch hit rate.
After the calculation is completed, the caching priority of the entry is determined according to its Score and the entry is cached;
when the number of empty entries in the current cache is greater than or equal to 10% of the total number of entries, the cache entries are classified into four hierarchical categories (PGD, PUD, PMD, PTE) and 2 candidate entries are randomly selected from each category for replacement;
entries with Score < 2 are evicted when the number of empty entries in the current cache is less than 10% of the total number of entries;
entries that hit 3 times consecutively are locked in the cache for 200 cycles.
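A C sketch of the scoring and the low-space eviction path follows; the names and the cycle-count timestamp are illustrative, and the regime with at least 10% empty entries (two random candidates per level bucket) is summarized in a comment rather than implemented.

#include <stdint.h>

typedef struct {
    int level_weight;     /* 4 / 3 / 2 / 1 for PGD / PUD / PMD / PTE */
    double type_boost;    /* type gain of the cache entry */
    uint64_t last_access; /* 48-bit high-precision timestamp, modeled as uint64_t */
    uint64_t lock_until;  /* entries hit 3 times in a row are locked for 200 cycles */
    int valid;
} CacheEntry;

/* Composite score from the text: Score = a*LevelWeight + b*TypeBoost - g*dt. */
static double entry_score(const CacheEntry *e, double alpha, double beta,
                          double gamma, uint64_t now) {
    return alpha * e->level_weight + beta * e->type_boost
         - gamma * (double)(now - e->last_access);
}

/* Low-space regime: with fewer than 10% empty entries, evict every entry
   whose Score falls below 2, skipping entries inside their lock window.
   (With >= 10% empty entries the text instead buckets entries by level and
   randomly samples 2 replacement candidates per bucket.) */
static void evict_low_score(CacheEntry *entries, int n, double alpha,
                            double beta, double gamma, uint64_t now) {
    for (int i = 0; i < n; i++) {
        if (!entries[i].valid || now < entries[i].lock_until) continue;
        if (entry_score(&entries[i], alpha, beta, gamma, now) < 2.0)
            entries[i].valid = 0;
    }
}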
And S4, when the preset triggering condition is met, starting a prefetching operation through the self-adaptive prefetching engine, calculating and dynamically expanding the prefetching step length by adopting the Q4.12 fixed point number, and simultaneously preventing memory bandwidth contention through two-stage monitoring.
In a specific embodiment, the preset trigger condition is that entry loading of the current PTE level is complete, the remaining dynamic cache space exceeds 20% of the total cache space, and Z ≥ 5. If these conditions are met, a prefetch operation is initiated by the adaptive prefetch engine.
After starting, the stride value is updated every 100 cycles, with an initial stride of ±2 pages; the updated stride is computed from the real-time entry hit rate using the stride calculation formula given above. At the same time, the L1D-Cache miss rate Q is sampled every 50 cycles, and when Q > 15% occurs 3 times, the prefetch operation is disabled for 200 cycles.
It should be specifically noted that the stride calculation formula used in dynamic stride adjustment may be adapted to different hardware architectures and performance requirements, but the dynamic expansion of the stride should be determined from a logarithmic analysis of the recent hit rate, to ensure the effectiveness and adaptability of the prefetch operation.
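A C model of the Q4.12 stride update and the bandwidth monitor is sketched below. It mirrors the RTL fragment given later in the text (hit_log = $clog2(hit_rate*100); stride = 1 << (hit_log & 3)); treating the three high-miss samples as consecutive follows the conflict-monitoring stage described further on and is otherwise an assumption.

#include <stdint.h>

/* Q4.12 fixed point: 4 integer bits, 12 fractional bits. */
typedef uint16_t q4_12;
#define Q4_12_ONE 4096

/* Integer ceiling log2, mirroring Verilog's $clog2 (clog2(1) == 0). */
static int clog2_u32(uint32_t x) {
    int b = 0;
    while ((1u << b) < x) b++;
    return b;
}

/* Stride update: scale the Q4.12 hit rate to an integer percentage, take
   its ceiling log2, and mask the exponent to 2 bits so the stride never
   exceeds 8 pages. */
static int update_stride(q4_12 hit_rate) {
    uint32_t pct = ((uint32_t)hit_rate * 100) / Q4_12_ONE; /* 0..100 */
    int hit_log = (pct > 0) ? clog2_u32(pct) : 0;
    return 1 << (hit_log & 3);
}

/* Two-level monitor: sample the L1D miss rate Q every 50 cycles and suspend
   prefetching for 200 cycles after 3 consecutive samples with Q > 15%. */
typedef struct { int high_miss_count; uint64_t off_until; } PrefetchMonitor;

static void sample_miss_rate(PrefetchMonitor *pm, double q, uint64_t now) {
    pm->high_miss_count = (q > 0.15) ? pm->high_miss_count + 1 : 0;
    if (pm->high_miss_count >= 3) {
        pm->off_until = now + 200; /* prefetch stays disabled while now < off_until */
        pm->high_miss_count = 0;
    }
}

For example, a hit rate of 0.8 (3277 in Q4.12) scales to 80%, its ceiling log2 is 7, and the masked exponent of 3 yields the maximum stride of 8 pages.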
S5, activating the instruction priority mode when the instruction access ratio, TLB hit rate and prefetch success rate meet preset conditions, and updating the score of the hit entry by adjusting the relevant parameters of the comprehensive scoring formula.
In a specific embodiment, the instruction access ratio InstrRatio is detected in real time.
When InstrRatio > 60%, TLB hit rate < 85% and prefetch success rate > 40% are detected, the instruction priority mode is activated: TypeBoost is raised to 1.2 times its current value, the β value is forced to 0.1, and the Score of the entry is updated. Meanwhile, the type gain coefficient β of the instruction access ratio is dynamically adjusted in real time according to the formula β = max(0, min(0.5, 0.2 + 0.5·(InstrRatio − 0.6))), and the Score of the entry is updated.
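The mode switch of this step can be sketched in C as follows. The text specifies both a one-off forcing of β = 0.1 on activation and a continuous formula for β; how the two interleave is not fixed there, so this sketch separates them into two routines and leaves the calling policy to the integrator.

#include <stdbool.h>

typedef struct {
    double type_boost;   /* type gain applied to instruction-page entries */
    double beta;         /* type gain coefficient in the scoring formula */
    bool instr_priority; /* instruction priority mode flag */
} ScoringParams;

/* One-shot activation: all three conditions from the text must hold. */
static void check_activation(ScoringParams *p, double instr_ratio,
                             double tlb_hit_rate, double prefetch_success) {
    if (!p->instr_priority &&
        instr_ratio > 0.60 && tlb_hit_rate < 0.85 && prefetch_success > 0.40) {
        p->instr_priority = true;
        p->type_boost *= 1.2; /* raise TypeBoost to 1.2x its current value */
        p->beta = 0.1;        /* force beta to 0.1 on entry to the mode */
    }
}

/* Continuous adjustment: beta = max(0, min(0.5, 0.2 + 0.5*(InstrRatio - 0.6))). */
static void adjust_beta(ScoringParams *p, double instr_ratio) {
    double b = 0.2 + 0.5 * (instr_ratio - 0.6);
    if (b < 0.0) b = 0.0;
    if (b > 0.5) b = 0.5;
    p->beta = b;
}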
In this embodiment, the path predictor hashes the virtual address and predicts the access path from the history record; the dynamic weight hierarchical cache adjusts the query range according to the confidence level; the comprehensive scoring formula fuses level weight, instruction type gain and time decay to optimize cache priority; the adaptive prefetch engine uses the Q4.12 fixed-point algorithm to dynamically expand the prefetch stride and monitors the cache miss rate; and the instruction priority mode adjusts the scoring parameters according to real-time performance indicators. Together these reduce page table traversal delay, raise the TLB hit rate, use memory bandwidth efficiently, and markedly optimize system performance in instruction-intensive scenarios.
Further, as a refinement and extension of the foregoing embodiment, and to fully describe the implementation process, another multi-level page table traversal acceleration method is provided below, taking the four-level page table (PGD→PUD→PMD→PTE) of the x86 architecture as an example; it is equally applicable to page table systems with other hierarchical structures.
As shown in fig. 2, the method works cooperatively as follows:
Address input stage: the virtual address VA[47:0] is input to the path predictor and the dynamic weight hierarchical cache simultaneously, and the monitoring module collects the instruction/data access ratio in real time.
Parallel prediction stage: the path predictor queries the history table with the VA[47:39] hash value and outputs a predicted path (such as PUD→PTE direct) and a confidence level (high/medium/low). The dynamic weight hierarchical cache selects the query range by confidence level: with high confidence (Z ≥ 6), only the predicted level ±1 is queried (if a PUD is predicted, PUD + PMD are queried); with low confidence (Z ≤ 5), a full-level parallel query is performed.
Weight calculation stage: when the cache hits, the scoring module calculates the hit entry's Score according to the formula:
Score = α·LevelWeight + β·TypeBoost − γ·Δt
where LevelWeight uses register-preset values (PGD = 4, PUD = 3, PMD = 2, PTE = 1 in the four-level page table scenario).
Prefetch adjustment stage: the prefetch engine operates dynamically under the following conditions.
Trigger condition: PTE loading is complete, the remaining cache space exceeds 20%, and the prediction confidence is ≥ 5.
Stride calculation: stride = 2^(clog2(HitRate × 100) mod 4), implemented with Q4.12 fixed-point numbers (see the shifter-based code below).
Conflict monitoring stage: when the L1D-Cache miss rate exceeds 15% for 3 consecutive samples, the prefetch function is turned off for 200 cycles.
In one embodiment of the present invention, based on step S2, a possible embodiment thereof will be given below as a non-limiting illustration.
Referring to FIG. 3, the embodiment discloses a multi-level path prediction generation method, which comprises history table updating and prediction generation.
History table update: a 4-way set-associative structure is adopted, and the confidence value of the corresponding path is updated after each page table walk (+3 on a hit, -5 on a miss). Each entry stores the following:
struct {
    uint8_t  va_tag[5];    // VA[47:43] hash value
    uint16_t path_bits;    // success-path bitmap (bit0: PGD, bit1: PUD, ...)
    uint8_t  confidence;   // confidence counter (0-15)
};
Prediction generation: a CRC8 hash calculation on VA[47:39] yields the index, and the confidence level is output. The relevant code is as follows:
if (confidence >= 12) return HIGH;
else if (confidence >= 6) return MEDIUM;
else return LOW;
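To complement the snippet above, a C sketch of the update rule (+3 on a hit, -5 on a miss) might look as follows; saturating at the counter's 0-15 range is an assumption consistent with the comment in the struct.

#include <stdint.h>

typedef struct {
    uint8_t va_tag[5];  /* VA tag hash */
    uint16_t path_bits; /* success-path bitmap (bit0: PGD, bit1: PUD, ...) */
    uint8_t confidence; /* confidence counter (0-15) */
} PathHistoryEntry;

/* +3 on a page-table-walk hit, -5 on a miss, clamped to 0..15. */
static void update_confidence(PathHistoryEntry *e, int hit) {
    int c = (int)e->confidence + (hit ? 3 : -5);
    if (c < 0) c = 0;
    if (c > 15) c = 15;
    e->confidence = (uint8_t)c;
}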
In one embodiment of the present invention, based on step S3, a possible embodiment thereof will be given below as a non-limiting illustration.
Referring to fig. 4, the embodiment discloses a dynamic weight hierarchical cache management method, which comprises an entry replacement decision, a high-frequency protection mechanism and an instruction priority mode.
Entry replacement decision: when free entries are ≥ 10%, entries are bucketed by level (four buckets: PGD/PUD/PMD/PTE) and 2 candidates are randomly sampled in each bucket; when free entries are < 10%, Score-threshold filtering is activated and entries with Score < 2.0 are preferentially evicted.
High-frequency protection mechanism: a 200-cycle lock flag (lock_bit = 1) is set for entries hit 3 times consecutively, and the Score value is weighted by 1.5 during the lock period.
Instruction priority mode: when the monitoring module detects InstrRatio > 60%, the TypeBoost coefficient of instruction pages is raised to 1.2 times, and the β value of data pages is forced to 0.1.
In one embodiment of the present invention, based on step S4, a possible embodiment thereof will be given below as a non-limiting illustration.
Referring to fig. 5, the present embodiment discloses a cross-level adaptive prefetch control method, which includes dynamic step size adjustment and collision detection.
Dynamic stride adjustment: the stride value is updated every 100 cycles, and the hardware implementation uses a shifter to compute 2^n. The code is as follows:
always_comb begin
    hit_log = (hit_rate > 0) ? $clog2(hit_rate * 100) : 0;
    stride  = 1 << (hit_log[3:0] & 4'b0011);  // limit the maximum step size to 8
end
Conflict detection includes short-term detection and long-term regulation. Short-term detection samples the miss event counter of the L1D-Cache every 50 cycles; long-term regulation switches the prefetch engine to a low-power mode (maintaining only ±1 page prefetch) after 3 accumulated high-miss samples.
In one embodiment of the present invention, based on step S5, a possible embodiment thereof will be given below as a non-limiting illustration.
Referring to fig. 6, the present embodiment discloses a system status monitoring and mode switching method, which includes instruction duty ratio detection, mode switching decision and status bus transmission.
Instruction ratio detection uses a 32-bit sliding window counter: +1 per instruction access, −1 per data access; InstrRatio = (counter + 32768) / 65536.0 is calculated in real time.
The mode switching decision activates the instruction priority mode when several conditions hold simultaneously: InstrRatio > 60%, TLB hit rate < 85%, and prefetch success rate > 40%.
Status bus transfer: the status word is broadcast over a 32-bit status bus every cycle, where [7:0] encodes the run mode, [15:8] the instruction ratio, [23:16] the prefetch success rate, and [31:24] the TLB hit rate.
As shown in fig. 7, the following is an embodiment of a multi-level page table walk acceleration system provided by the present disclosure. The system shares the same inventive concept as the multi-level page table walk acceleration method of the foregoing embodiments; for details not described in this system embodiment, reference may be made to the method embodiments above.
A multi-level page table walk acceleration system, comprising:
the address input and state monitoring module is used for inputting the virtual address into the path predictor and the dynamic weight hierarchical cache respectively, and collecting the instruction access ratio, the TLB hit rate and the prefetch success rate in real time;
The path prediction and query planning module is used for carrying out hash calculation according to the preset bit of the virtual address through the path predictor, and obtaining a predicted path and a confidence level by querying the history record table;
the weight evaluation and cache management module is used for calculating the score of the hit entry using the comprehensive scoring formula when the cache hits, prioritizing and evicting cache entries based on the score, and managing the cache entries with a hybrid replacement algorithm;
The prefetch strategy adjustment module is used for starting prefetch operation through the self-adaptive prefetch engine when a preset trigger condition is met, calculating and dynamically expanding prefetch step length by adopting Q4.12 fixed point numbers, and simultaneously preventing memory bandwidth contention through two-stage monitoring;
The mode switching and parameter adjusting module is used for activating the instruction priority mode when the instruction access ratio, the TLB hit rate and the prefetch success rate meet the preset conditions, and updating the score of the hit entry by adjusting the related parameters of the comprehensive scoring formula.
In the multi-level page table traversal acceleration system provided by this embodiment, the path predictor hashes preset bits of the virtual address and predicts the access path from the history record; the dynamic weight hierarchical cache determines the query range according to the confidence level; the comprehensive scoring formula fuses page table level weight, instruction type gain and time decay to optimize cache entry priority; Q4.12 fixed-point numbers are used with the adaptive prefetch engine to dynamically expand the prefetch stride; memory bandwidth contention is prevented by monitoring the cache miss rate; and the priority mode activated in instruction-intensive scenarios adjusts the scoring parameters. The system thereby markedly reduces page table traversal delay, improves TLB and cache hit rates, allocates memory bandwidth efficiently, and optimizes overall system performance.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present invention.
The multi-level page table traversal acceleration method provided by the embodiments of the application can be applied to electronic devices. Those skilled in the art will appreciate that the structure described here does not limit the electronic device, which may include more or fewer components than illustrated, combine certain components, or arrange components differently. In embodiments of the present application, electronic devices include, but are not limited to, laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also be various forms of mobile devices, such as personal digital processing devices, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are exemplary only and do not limit the implementations of the application described and/or claimed herein.
The electronic device may include a processor, an external memory interface, an internal memory, a universal serial bus (USB) interface, a charge management module, a power management module, a battery, a wireless communication module, an audio module, a speaker, a microphone, a sensor module, keys, a camera, a display screen, a SIM card interface, and the like.
The processor may include one or more processing units; for example, the processor may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The processor may serve as the neural hub and command center of the electronic device. The controller can generate operation control signals according to instruction operation codes and timing signals to complete the control of instruction fetch and instruction execution.
A memory may also be provided in the processor for storing instructions and data. In some embodiments, the memory in the processor is a cache memory, holding instructions or data that the processor has just used or may reuse. If the processor needs that instruction or data again, it can be fetched directly from this memory, avoiding repeated accesses and reducing processor wait time, thereby improving system efficiency.
The external memory interface may be used to connect an external memory card, such as a MicroSD card, to enable expansion of the memory capabilities of the electronic device. The external memory card communicates with the processor through an external memory interface to realize the data storage function. Such as storing files of music, video, etc. in an external memory card.
The internal memory may be used to store computer-executable program code, which includes instructions. The processor executes the various functional applications and data processing of the electronic device by running instructions stored in the internal memory. The internal memory may include a program storage area and a data storage area, and may include high-speed random access memory as well as non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, universal flash storage (UFS), and the like.
The wireless communication function of the electronic device may be implemented by an antenna, a wireless communication module, a modem processor, a baseband processor, and the like.
The wireless communication module may provide solutions for wireless communication applied to the electronic device, including wireless local area network (WLAN) (e.g., wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
The electronic device may implement audio functions, etc. through an audio module, speaker, receiver, microphone, headphone interface, application processor, etc.
The electronic device may implement a photographing function through an ISP, a camera, a video codec, a GPU, a display screen, an application processor, and the like.
The electronic device may implement display functions through a GPU, a display screen, an application processor, and the like.
The GPU is a microprocessor for image processing and is connected with the display screen and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor may include one or more GPUs that execute program instructions to generate or change display information.
The display screen is used for displaying images, videos, and the like. The display screen includes a display panel.
The electronic device hashes the virtual address through the path predictor and predicts the path from the history record; the dynamic weight hierarchical cache adjusts the query range according to the confidence level; the comprehensive scoring formula fuses level weight, instruction type gain and time decay to optimize cache priority; the adaptive prefetch engine uses the Q4.12 fixed-point algorithm to dynamically expand the prefetch stride and monitors the cache miss rate; and the instruction priority mode adjusts the scoring parameters according to real-time indicators. The beneficial effects are reduced page table traversal delay, improved TLB hit rate, optimized memory bandwidth utilization, and enhanced performance in instruction-intensive scenarios.
In the storage medium provided by the application, a program product capable of implementing the multi-level page table walk acceleration method is stored.
The multi-level page table traversal acceleration method comprises the following steps:
inputting the virtual address into a path predictor and a dynamic weight hierarchical cache respectively, and collecting the instruction access ratio, TLB hit rate and prefetch success rate in real time;
performing a hash calculation according to the preset bits of the virtual address through the path predictor, and querying the history table to obtain a predicted path and confidence level;
when the cache hits, calculating the score of the hit entry using a comprehensive scoring formula, prioritizing and evicting cache entries based on the score, and managing the cache entries with a hybrid replacement algorithm;
when the preset trigger condition is met, starting the prefetch operation through the adaptive prefetch engine, computing and dynamically expanding the prefetch stride using Q4.12 fixed-point numbers, and preventing memory bandwidth contention through two-level monitoring;
when the instruction access ratio, TLB hit rate and prefetch success rate meet preset conditions, activating the instruction priority mode and updating the score of the hit entry by adjusting the relevant parameters of the comprehensive scoring formula.
In some possible embodiments, the multi-level page table walk acceleration method of the present disclosure may be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the present disclosure as described in the above section of the "exemplary method" when the program product is run on the terminal device.
The storage media of the present disclosure may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

CN202510896630.XA (priority date 2025-07-01, filing date 2025-07-01) | Multi-level page table traversal acceleration method, system, device and medium | Pending | CN120407451A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202510896630.XA | 2025-07-01 | 2025-07-01 | CN120407451A (en): Multi-level page table traversal acceleration method, system, device and medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202510896630.XA | 2025-07-01 | 2025-07-01 | CN120407451A (en): Multi-level page table traversal acceleration method, system, device and medium

Publications (1)

Publication Number | Publication Date
CN120407451A (en) | 2025-08-01

Family

Family ID: 96528719

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202510896630.XA | Multi-level page table traversal acceleration method, system, device and medium | 2025-07-01 | 2025-07-01

Country Status (1)

Country | Link
CN | CN120407451A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20110010521A1 (en)* | 2009-07-13 | 2011-01-13 | James Wang | TLB Prefetching
US20200073819A1 (en)* | 2018-09-04 | 2020-03-05 | Arm Limited | Parallel page table entry access when performing address translations
CN114238176A (en)* | 2021-12-14 | 2022-03-25 | 海光信息技术股份有限公司 | Processor, address translation method for processor, electronic device
CN114238167A (en)* | 2021-12-14 | 2022-03-25 | 海光信息技术股份有限公司 | Information prefetching method, processor and electronic equipment
CN114546897A (en)* | 2020-11-26 | 2022-05-27 | 龙芯中科技术股份有限公司 | Memory access method and device, electronic equipment and storage medium
CN115391239A (en)* | 2022-09-05 | 2022-11-25 | 山东大学 | A cache dynamic data prefetch method, system, device and storage medium based on local algorithm
WO2022247070A1 (en)* | 2021-05-24 | 2022-12-01 | 北京工业大学 | High-performance-oriented intelligent cache replacement policy adaptive to prefetching
CN116383102A (en)* | 2023-05-30 | 2023-07-04 | 北京微核芯科技有限公司 | Translation look-aside buffer access method, device, equipment and storage medium
CN117609111A (en)* | 2023-10-27 | 2024-02-27 | 杭州鸿钧微电子科技有限公司 | Data prefetching method, device, equipment and medium for system cache
CN119621608A (en)* | 2024-12-13 | 2025-03-14 | 上海交通大学 | Hardware-assisted page table management system, method, medium, program product and terminal
CN119690869A (en)* | 2024-11-28 | 2025-03-25 | 飞腾信息技术有限公司 | Page table prefetching method, device, equipment and storage medium
CN119739650A (en)* | 2024-12-26 | 2025-04-01 | 上海交通大学 | Virtual memory architecture design method, system, medium and device for MCM GPU



Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
