Disclosure of Invention
In order to overcome the poor real-time performance, insufficient adaptability and limited multi-objective optimization capability of conventional drilling parameter regulation methods, the invention provides an automatic regulation method for key drilling parameters based on the Deep Deterministic Policy Gradient (DDPG) algorithm, aiming to dynamically optimize key parameters such as weight on bit (WOB) and rotational speed (RPM) through an intelligent algorithm, thereby improving the rate of penetration (ROP) while ensuring wellbore quality and operational safety. The method constructs a reinforcement learning DDPG algorithm model that dynamically adjusts the drilling parameters, comprehensively considers multiple objectives such as drilling efficiency, bit life and wellbore quality, and significantly improves the intelligence and economic benefit of the drilling process.
In order to solve the technical problems, the invention adopts the following technical scheme:
An automatic regulation method for key drilling parameters based on an improved DDPG algorithm comprises the following steps:
1) Acquiring and preprocessing drilling data, and constructing a training data set for the ROP time-series prediction model;
2) Constructing and training an ROP time-series prediction model that provides an accurate dynamic response and serves as the environment feedback basis for the reinforcement learning DDPG algorithm model;
3) Constructing a simulation environment based on the ROP prediction model of step 2), defining the state space, action space and reward function of the reinforcement learning DDPG algorithm model, and designing an improved DDPG algorithm model;
4) Training the DDPG algorithm model in the simulation environment, with the intelligent agent learning the optimal parameter regulation strategy.
The drilling parameters in step 1) include weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL), and the target parameter is the rate of penetration (ROP). The data preprocessing comprises the following steps:
1.1 The collected data is subjected to preliminary cleaning: data items whose missing-value proportion exceeds 10% are removed, outliers deviating from the mean by more than three standard deviations are removed, and the missing values are filled by linear interpolation;
1.2 Min-max normalization is applied to the data according to the formula:
X_normalized = (X − min(X)) / (max(X) − min(X))
where X_normalized is the normalized data, X is the original data, min(X) is the minimum of the original data, and max(X) is the maximum of the original data.
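As an illustration of substeps 1.1 and 1.2, the following Python sketch (using pandas; column names such as "WOB" and "ROP" are assumed placeholders for the actual log channels) drops sparse columns, removes three-sigma outliers, interpolates the gaps and applies min-max normalization. It is a minimal sketch under these assumptions, not the exact preprocessing pipeline of the invention.

```python
import pandas as pd

FEATURES = ["WOB", "RPM", "SPP", "TQ", "MFI", "HKL", "ROP"]  # assumed column names

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, interpolate and min-max normalize raw drilling logs."""
    df = df[FEATURES].copy()
    # 1.1 Drop columns whose missing-value ratio exceeds 10%
    df = df.loc[:, df.isna().mean() <= 0.10]
    # 1.1 Mask values lying more than three standard deviations from the mean
    mean, std = df.mean(), df.std()
    df = df.where((df - mean).abs() <= 3 * std)
    # 1.1 Fill the resulting gaps by linear interpolation over the time index
    df = df.interpolate(method="linear", limit_direction="both")
    # 1.2 Min-max normalization: X_norm = (X - min) / (max - min)
    return (df - df.min()) / (df.max() - df.min())
```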
In step 2), a deep time-series neural network model is constructed using the training set from step 1) to predict the ROP value, which specifically comprises the following substeps:
2.1 A deep neural network model is constructed with a self-attention architecture as its core, exploiting the strengths of self-attention in processing time-series data to capture how drilling parameters and formation characteristics evolve over time. The model comprises an encoder and a decoder based on the self-attention mechanism for extracting multi-level time-series features, with a fully connected layer at the output to generate the predicted ROP value.
2.2 The input features and output target of the model are set: the input features comprise weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL), and the output target is the rate of penetration (ROP) at the corresponding time. The input features are normalized to accelerate training and improve the generalization performance of the model.
2.3 Loss function and optimizer of the time-series prediction model: training uses the mean squared error (MSE) as the loss function,
MSE = (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)²
where y_i is the true ROP value, ŷ_i is the model-predicted ROP value, and N is the number of samples. The Adam optimizer is used for parameter updates during training.
2.4 After the evaluation result on the test set meets expectations, the optimal model state is saved.
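For concreteness, a minimal PyTorch sketch of the self-attention encoder-decoder predictor described in substeps 2.1-2.3 is given below; the model width, the number of attention heads and the use of the last time step as the decoder query are illustrative assumptions, not the claimed configuration.

```python
import torch
import torch.nn as nn

class ROPPredictor(nn.Module):
    """Self-attention encoder-decoder that maps a window of drilling
    features (WOB, RPM, SPP, TQ, MFI, HKL) to the ROP at the latest step."""
    def __init__(self, n_features: int = 6, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=128, batch_first=True)
        self.head = nn.Linear(d_model, 1)            # fully connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features), already min-max normalized
        src = self.embed(x)
        tgt = src[:, -1:, :]                          # query the most recent step
        out = self.transformer(src, tgt)              # (batch, 1, d_model)
        return self.head(out).squeeze(-1).squeeze(-1) # predicted ROP, shape (batch,)

model = ROPPredictor()
criterion = nn.MSELoss()                              # MSE loss of substep 2.3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```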
In step 3), a simulation environment is constructed based on the ROP time-series prediction model trained in step 2), and the key elements of the reinforcement learning DDPG algorithm model are defined, which specifically comprises the following contents:
3.1 A simulation environment is built with the ROP prediction model as its core to simulate the influence of weight on bit (WOB) and rotational speed (RPM) on the rate of penetration (ROP). Inputs to the simulation environment include weight on bit (WOB) and rotational speed (RPM) together with the underlying characteristic parameters standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL). Since the underlying characteristic parameters cannot be directly manipulated by reinforcement learning, they serve as state inputs to the environment and affect the rate-of-penetration prediction and the subsequent optimization process. The simulation environment computes the rate of penetration (ROP) through the ROP prediction model and, under a reward-and-penalty mechanism, provides rate-of-penetration feedback for optimizing the DDPG algorithm model.
Constraints in the simulation environment include the physical operating limits of weight on bit and rotational speed (e.g., the maximum values allowed by the equipment), limits on the bit wear rate, and wellbore quality requirements, which ensure that the regulation schemes generated by the model are feasible and can effectively increase the rate of penetration in actual operation.
3.2 The key elements of the DDPG algorithm model are defined, including a preset action space and a reward function. The action space comprises adjustments of the weight on bit and the rotational speed, the two parameters that the DDPG algorithm model can regulate. The model optimizes the rate of penetration (ROP) by increasing, decreasing or holding these parameters constant.
The reward function is designed for multi-objective optimization, taking into account the ROP improvement, the adjustment cost and the satisfaction of the constraints. Its specific form is:
Reward = α·ROP_gain − β·Adjustment_cost − γ·Constraint_penalty
where ROP_gain is the rate-of-penetration improvement of the current time step t over the previous time step t−1, multiplied by the weight coefficient α to encourage increases in the rate of penetration; Adjustment_cost is the magnitude of the weight-on-bit and rotational-speed adjustments in the current time step, multiplied by the weight coefficient β to penalize excessive parameter adjustment and reduce operating cost and equipment wear; and Constraint_penalty measures whether engineering constraints (such as the physical limits of weight on bit and rotational speed, the bit wear rate and the wellbore quality) are violated, multiplied by a larger weight coefficient γ to enforce strict compliance with safety and performance standards. Through this reward mechanism, the intelligent agent is not only encouraged to increase the rate of penetration but also constrained to operate within a reasonable adjustment range, and any behavior violating engineering limits is strictly penalized, so that efficiency, safety and economic benefit are jointly optimized during drilling.
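The reward of this substep could be computed as in the sketch below; the default weight values and the binary constraint check are assumptions chosen only for illustration.

```python
def compute_reward(rop_t: float, rop_prev: float,
                   d_wob: float, d_rpm: float,
                   violates_constraints: bool,
                   alpha: float = 1.0, beta: float = 0.1,
                   gamma_c: float = 10.0) -> float:
    """Reward = alpha*ROP_gain - beta*Adjustment_cost - gamma*Constraint_penalty.
    gamma_c is the constraint-penalty weight (distinct from the RL discount factor)."""
    rop_gain = rop_t - rop_prev                   # ROP improvement vs. the previous step
    adjustment_cost = abs(d_wob) + abs(d_rpm)     # magnitude of the WOB/RPM adjustment
    constraint_penalty = 1.0 if violates_constraints else 0.0
    return alpha * rop_gain - beta * adjustment_cost - gamma_c * constraint_penalty
```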
3.3 The reinforcement learning model uses the Deep Deterministic Policy Gradient (DDPG) algorithm and adopts a policy network (Actor) and value network (Critic) structure to optimize the regulation strategy for weight on bit (WOB) and rotational speed (RPM). Compared with the conventional DDPG algorithm model, the invention is optimized and improved in the following respects:
3.3.1 An additional fully connected layer is added to the conventional policy network, bringing the network depth to four layers. The extra layer enables the network to better understand and process the multidimensional data of the drilling process, improving the accuracy and stability of the policy;
3.3.2 In the value network design, the invention introduces residual connections (Residual Connections). Specifically, two linear layers (residual1 and residual2) are added so that the input features are added directly to the output of the hidden layer. The residual connections effectively alleviate the vanishing-gradient problem in deep networks, promote a more stable and efficient training process, and improve the accuracy of the value estimates;
3.3.3 Each layer of the policy network and the value network is initialized with weights drawn from a normal distribution (mean 0, standard deviation 0.1). This initialization helps avoid vanishing or exploding gradients, so that the network performs well in the early stage of training;
3.3.4 To prevent overfitting and improve generalization, Dropout layers (p = 0.1) are added after the first hidden layer of the policy network and after the key hidden layer of the value network. By randomly discarding the output of some neurons, the Dropout layers reduce the network's dependence on specific neurons and enhance the adaptability and robustness of the model in different drilling environments;
3.3.5 The output of the policy network is limited to the range [−1, 1] by the tanh activation function and then scaled by a linear transformation to the actual action ranges, i.e., weight on bit [wob_low, wob_high] and rotational speed [rpm_low, rpm_high]:
action_scaled = action_low + (action + 1)/2 · (action_high − action_low)
where action is the current output value of the policy network and [action_low, action_high] is the corresponding operating range. This design ensures that the output weight on bit and rotational speed lie within a reasonable, operable range and avoids invalid or extreme action values, improving the enforceability and safety of the regulation strategy.
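A possible PyTorch realization of the improved Actor and Critic described in substeps 3.3.1-3.3.5 is sketched below. The hidden-layer sizes are assumptions; the four-layer Actor, the residual linear layers residual1/residual2, the N(0, 0.1) weight initialization, the Dropout layers (p = 0.1) and the tanh output scaling follow the description above.

```python
import torch
import torch.nn as nn

def init_normal(m: nn.Module) -> None:
    # 3.3.3: initialize weights from a normal distribution (mean 0, std 0.1)
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)
        nn.init.zeros_(m.bias)

class Actor(nn.Module):
    """Policy network: four fully connected layers (3.3.1), Dropout after the
    first hidden layer (3.3.4), tanh output scaled to the WOB/RPM ranges (3.3.5)."""
    def __init__(self, state_dim: int, bounds: dict):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh())          # raw actions in [-1, 1]
        self.register_buffer("low", torch.tensor([bounds["wob_low"], bounds["rpm_low"]]))
        self.register_buffer("high", torch.tensor([bounds["wob_high"], bounds["rpm_high"]]))
        self.apply(init_normal)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        a = self.net(state)
        # linear rescaling from [-1, 1] to [low, high]
        return self.low + (a + 1.0) / 2.0 * (self.high - self.low)

class Critic(nn.Module):
    """Value network with residual connections (3.3.2) and Dropout (3.3.4)."""
    def __init__(self, state_dim: int, action_dim: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 128)
        self.residual1 = nn.Linear(state_dim + action_dim, 128)  # skip path 1
        self.fc2 = nn.Linear(128, 128)
        self.residual2 = nn.Linear(128, 128)                     # skip path 2
        self.drop = nn.Dropout(p=0.1)
        self.out = nn.Linear(128, 1)
        self.apply(init_normal)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)
        h1 = self.drop(torch.relu(self.fc1(x) + self.residual1(x)))  # input added to hidden output
        h2 = torch.relu(self.fc2(h1) + self.residual2(h1))
        return self.out(h2)                                          # Q(s, a)
```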
In step 4), the simulation environment and the reinforcement learning DDPG algorithm model constructed in step 3) are used to train the model to learn the optimal parameter regulation strategy, which specifically comprises the following contents:
4.1 A training process is established and the model is trained with the deep deterministic policy gradient (DDPG) algorithm. During training, the intelligent agent interacts with the simulation environment: at each time step t it selects an action a_t according to the current state s_t and obtains the next state s_{t+1} and the immediate reward r_t. Through continuous interaction with the environment, experience data is accumulated.
4.2 Experience pool creation and updating: an experience replay buffer (Experience Replay Buffer) is introduced to store the intelligent agent's interaction data as tuples of state, action, reward and next state (s_t, a_t, r_t, s_{t+1}). At each policy update, a mini-batch of data is randomly sampled from the experience pool for training, which breaks data correlation and improves training stability and efficiency.
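A minimal replay-buffer sketch for this substep is shown below; the default capacity of 10,000 records matches the embodiment described later, and the plain-Python implementation is an assumption for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) tuples and samples mini-batches."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)    # oldest records are overwritten

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)   # random sampling breaks correlation
        states, actions, rewards, next_states = map(list, zip(*batch))
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```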
4.3 The model parameters are updated through the policy network and the value network. The goal of the value network is to estimate the expected cumulative reward for a given state and action, and its loss function is defined as
L_Q = (1/N) · Σ_{i=1}^{N} (y_i − Q(s_i, a_i | θ^Q))²
where
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
γ is the discount factor, and Q′ and μ′ denote the target value network and the target policy network with parameters θ^{Q′} and θ^{μ′}, respectively. The value network parameters θ^Q are updated by gradient descent to minimize the loss L_Q.
The purpose of the policy network is to output the optimal action (i.e., the adjustments of weight on bit and rotational speed) based on the current state, with the objective of maximizing the cumulative reward of the intelligent agent in the environment. By learning the mapping from states to actions, the policy network directly decides the regulation strategy that the agent should take in different states.
The goal of updating the policy network is to increase the value obtainable from the actions selected in the various states. The policy network parameters are updated by the policy gradient method, with the gradient computed as
∇_{θ^μ} J ≈ (1/N) · Σ_{i=1}^{N} ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where θ^μ are the parameters of the policy network, μ(s_i | θ^μ) is the action output by the policy network in state s_i, and Q(s, a | θ^Q) is the value network, which evaluates the value of performing the action in state s. Updating θ^μ along this policy gradient allows the policy network to output actions that maximize the expected cumulative reward in a given state.
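The two updates of substep 4.3 can be written as the PyTorch sketch below. Here `actor`, `critic` and their target copies are instances of the networks sketched in substep 3.3; the discount factor value and the soft target-network update rate tau are assumptions, the soft update itself being standard DDPG practice rather than something specified above.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    states, actions, rewards, next_states = batch     # mini-batch tensors

    # Critic: minimize L_Q = mean((y_i - Q(s_i, a_i))^2)
    with torch.no_grad():
        target_q = critic_target(next_states, actor_target(next_states))
        y = rewards + gamma * target_q                # y_i = r_i + gamma*Q'(s_{i+1}, mu'(s_{i+1}))
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the policy gradient, i.e. maximize Q(s, mu(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (assumed standard DDPG ingredient)
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```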
4.4 Model convergence and policy optimization: convergence is judged by monitoring the cumulative reward and the stability of the policy during training. The model is considered converged when the cumulative reward reaches the expected level on the validation set and the policy output stabilizes. The optimal model parameters are then saved for subsequent practical application.
The trained DDPG algorithm model can output an optimal weight-on-bit (WOB) and rotational-speed (RPM) adjustment strategy according to the current drilling state, thereby optimizing the rate of penetration (ROP), improving drilling efficiency and ensuring drilling safety.
Compared with the prior art, the invention has the main advantages that:
(1) Realizing the intelligent regulation and control of key parameters. According to the invention, automatic regulation and control of Weight On Bit (WOB) and rotating speed (RPM) are realized through the improved DDPG algorithm model, key parameters can be dynamically regulated according to the real-time drilling state, manual intervention is not needed, and the automation degree of drilling operation is greatly improved.
(2) Improving the accuracy and efficiency of rate of penetration (ROP) optimization. By combining the ROP time-series prediction model with the DDPG algorithm, the invention can predict the rate of penetration in real time under different working conditions and maximize its improvement through intelligent regulation, thereby remarkably increasing drilling efficiency.
(3) Multi-objective optimization capability. The invention designs a multi-objective rewarding function, comprehensively considers the drilling speed improvement, the parameter adjustment cost and the safety constraint, can ensure the service life of the drill bit and the quality of the well bore while improving the drilling speed, and meets various engineering requirements.
(4) And the adaptability of complex working conditions is enhanced. By constructing the simulation environment and improving the dynamic optimization of DDPG algorithm models, the invention can stably run under different stratum conditions, equipment configuration and operation constraints and is suitable for complex drilling working conditions.
(5) Reducing manual intervention and operational risk. In traditional drilling operations, adjustment of the weight on bit and rotational speed relies on manual experience and is prone to misoperation and inefficiency. Through the intelligent decision-making capability of the improved reinforcement learning DDPG algorithm model, the invention significantly reduces the need for manual intervention and lowers operational risk.
In conclusion, the invention provides an efficient, stable and safe automatic regulation and control method for drilling key parameters through an intelligent algorithm and a reinforcement learning DDPG algorithm model, which can remarkably improve drilling efficiency, reduce operation risk, adapt to complex working condition requirements and have wide application prospects.
Detailed Description
Example 1
Referring to fig. 1, the method for automatically regulating and controlling drilling key parameters based on the improved DDPG algorithm comprises the following steps:
1) Acquiring and preprocessing drilling data, and constructing a training data set for the ROP time-series prediction model;
2) Constructing and training an ROP time-series prediction model; the data from step 1) are input into the model to provide an accurate dynamic response, which serves as the environment feedback basis of the reinforcement learning DDPG algorithm model;
3) Constructing a simulation environment based on the ROP prediction model of step 2), defining the state space, action space and reward function of the reinforcement learning DDPG algorithm model, and designing an improved DDPG algorithm model;
4) Training the reinforcement learning DDPG algorithm model in the ROP simulation environment of step 3), with the intelligent agent learning the optimal parameter regulation strategy, and finally saving and applying the model.
In this embodiment, in step 1), key parameter data related to the drilling process is collected in real time by sensors and monitoring systems at the drilling site. The drilling operation parameters comprise weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), drilling time (BDT), mud flow in (MFI) and hook load (HKL), and the target parameter is the rate of penetration (ROP). The collected data includes historical data and real-time drilling data, which together form the complete training data set. The acquired data is then preprocessed as follows:
1.1 The collected data is subjected to preliminary cleaning: data items whose missing-value proportion exceeds 10% are removed and outliers deviating from the mean by more than three standard deviations are removed, ensuring the quality and consistency of the remaining data. Table 1 shows a sample of the processed data:
Table 1. Example of processed ROP logging sample data
1.2 Data interpolation is applied to the cleaned data: missing values are filled by linear interpolation, using the data at adjacent time points to interpolate values missing within a given time interval.
1.3 Min-max normalization (Min-Max Normalization) is applied to the data according to the formula:
X_normalized = (X − min(X)) / (max(X) − min(X))
where X_normalized is the normalized data, X is the original data, min(X) is the minimum of the original data, and max(X) is the maximum of the original data.
In this embodiment, in step 2), an ROP time-series prediction model based on self-attention is constructed from the training set data to capture how the drilling parameters and formation characteristics change over time. As shown in fig. 2, the main flow of the model construction is as follows:
2.1 The model takes a self-attention encoder and decoder as its core architecture, and a fully connected output layer maps the extracted features to the prediction target, namely the rate of penetration (ROP).
2.2 Loss function and model optimization: model training uses the mean squared error (MSE) as the loss function,
MSE = (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)²
where y_i is the true ROP value, ŷ_i is the model-predicted ROP value, and N is the number of samples. The Adam optimizer is used for parameter updates during training, with an initial learning rate of 0.001 decayed by a factor of 0.9 every 10 training epochs.
2.3 Training process design: the model is trained by mini-batch gradient descent (batch size = 32) for 100 epochs, and the ROP values after automatic parameter adjustment are checked on a validation set to confirm the corresponding improvement.
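Under the settings of substeps 2.2 and 2.3, the training schedule might be written as in the sketch below; `train_dataset` and the `ROPPredictor` class from the earlier sketch are assumed to be available, and the StepLR scheduler is one way to realize the 0.9 decay every 10 epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # train_dataset assumed

model = ROPPredictor()                       # self-attention predictor sketched earlier
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

for epoch in range(100):                     # 100 training epochs
    model.train()
    for features, rop_true in train_loader:  # mini-batches of size 32
        optimizer.zero_grad()
        loss = criterion(model(features), rop_true)
        loss.backward()
        optimizer.step()
    scheduler.step()                         # decay the learning rate by 0.9 every 10 epochs
```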
In this embodiment, in step 3), a simulation environment for training and optimizing the DDPG model is constructed based on the ROP time-series prediction model trained in step 2). Referring to fig. 3, the specifics are as follows:
3.1 A simulation environment is constructed. The environment takes the trained ROP prediction model as its core and simulates the dynamic influence of key parameters such as weight on bit (WOB) and rotational speed (RPM) on the rate of penetration (ROP). Its input features include the current weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL). The underlying characteristic parameters serve as state inputs to the environment, directly influence the rate-of-penetration prediction in the simulation, and provide the basis for strategy optimization of the DDPG algorithm model. Because these underlying characteristic parameters cannot be regulated directly by the DDPG model, the reinforcement learning DDPG algorithm mainly learns and optimizes the adjustment strategy for the weight on bit and the rotational speed.
3.2 The basic elements of the reinforcement learning process, including the state space, action space and reward function, are defined on top of the simulation environment. The state space (State, s) includes the two adjustable parameters, current weight on bit (WOB) and rotational speed (RPM), the other non-adjustable parameters, and historical rate-of-penetration (ROP) data. This definition of the state space lets the reinforcement learning DDPG model fully perceive the current drilling conditions and provides sufficient information for subsequent decisions.
The action space (Action, a) is defined as the adjustment values of the weight on bit and the rotational speed, specifically increasing, decreasing or holding them constant, subject to their physical operating limits. Through learning, the reinforcement learning DDPG model outputs the optimal action in different conditions, thereby optimizing the rate of penetration (ROP) while ensuring wellbore quality and bit life. The reward function (Reward, r) is designed as a multi-objective optimization function that jointly considers the rate-of-penetration gain (ROP gain), the parameter adjustment cost (Adjustment cost) and the constraint satisfaction (Constraint penalty). Its specific form is:
Reward = α·ROP_gain − β·Adjustment_cost − γ·Constraint_penalty
where ROP_gain is the rate-of-penetration gain, Adjustment_cost is the adjustment cost, Constraint_penalty is the penalty term for violating the constraints, and α, β, γ are the weight coefficients used to balance the objectives.
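A gym-style sketch of this simulation environment is shown below. The WOB/RPM operating ranges, the replay of the non-adjustable channels from historical logs, and the treatment of the scaled policy output as WOB/RPM setpoints are assumptions made for illustration; `compute_reward` refers to the reward sketch given earlier, and `rop_model` is assumed to be a callable wrapper around the trained predictor.

```python
import numpy as np

class DrillingEnv:
    """Wraps the trained ROP predictor as an RL environment.
    State: [WOB, RPM, SPP, TQ, MFI, HKL]; action: [WOB, RPM] setpoints."""
    def __init__(self, rop_model, log_data, wob_range=(5.0, 35.0), rpm_range=(40.0, 220.0)):
        self.rop_model = rop_model        # callable returning ROP for a feature vector
        self.log = log_data               # historical rows for the non-adjustable channels
        self.wob_range, self.rpm_range = wob_range, rpm_range

    def reset(self):
        self.t = 0
        self.state = np.asarray(self.log[self.t], dtype=np.float32).copy()
        self.prev_rop = float(self.rop_model(self.state))
        return self.state.copy()

    def step(self, action):
        new_wob, new_rpm = action
        d_wob = new_wob - self.state[0]   # adjustment magnitudes for the reward term
        d_rpm = new_rpm - self.state[1]
        # apply the setpoints within the physical operating limits
        self.state[0] = np.clip(new_wob, *self.wob_range)
        self.state[1] = np.clip(new_rpm, *self.rpm_range)
        # non-adjustable channels (SPP, TQ, MFI, HKL) follow the historical log
        self.t += 1
        self.state[2:] = np.asarray(self.log[self.t], dtype=np.float32)[2:]
        rop = float(self.rop_model(self.state))
        reward = compute_reward(rop, self.prev_rop, d_wob, d_rpm,
                                violates_constraints=False)  # constraint check omitted here
        self.prev_rop = rop
        done = self.t >= len(self.log) - 1
        return self.state.copy(), reward, done
```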
3.3 The reinforcement learning model uses the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient), whose architecture includes a policy network (Actor) and a value network (Critic). The policy network outputs the optimal adjustment action (i.e., the adjustment values of weight on bit and rotational speed) based on the current state, while the value network evaluates the cumulative reward of each state-action pair, helping the policy network optimize the regulation strategy.
Both the policy network and the value network adopt fully connected layers with the ReLU (Rectified Linear Unit) activation function, computed as:
ReLU(x)=max(0,x)
In the model architecture, the policy network receives the current state information of the simulation environment (such as weight on bit, rotational speed, mud flow, borehole depth and historical ROP values) and outputs two continuous values corresponding to the adjustment of the weight on bit and the rotational speed. The value network receives the current state together with the action generated by the policy network and computes a value estimate (i.e., the cumulative reward) for that state-action pair. Through the continuous cooperation and optimization of the policy network and the value network, the model gradually learns the optimal parameter regulation strategy.
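A brief usage illustration of this Actor-Critic cooperation, reusing the network sketches from substep 3.3 of the disclosure (the state dimension and the WOB/RPM ranges are assumed values):

```python
import torch

state_dim = 8                                  # assumed dimension: WOB, RPM, mud flow, depth, historical ROP, etc.
bounds = {"wob_low": 5.0, "wob_high": 35.0, "rpm_low": 40.0, "rpm_high": 220.0}

actor = Actor(state_dim, bounds)               # policy network
critic = Critic(state_dim)                     # value network

state = torch.randn(1, state_dim)              # one (normalized) state vector
action = actor(state)                          # continuous WOB/RPM values scaled to the operable ranges
q_value = critic(state, action)                # cumulative-reward estimate for (state, action)
print(action.detach().numpy(), q_value.item())
```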
In this embodiment, in step 4), training is performed using the simulation environment and the reinforcement learning DDPG algorithm model constructed in step 3). The model is trained to learn the optimal weight-on-bit (WOB) and rotational-speed (RPM) regulation strategy, thereby realizing dynamic optimization of the rate of penetration (ROP). The specific implementation is as follows:
4.1 The training process uses the deep deterministic policy gradient algorithm (DDPG); through continuous interaction between the reinforcement learning agent (Agent) and the simulation environment, the model gradually learns the optimal parameter regulation strategy. At each time step t, the agent generates an action a_t through the policy network based on the current state s_t; the action corresponds to the adjustment values of the weight on bit and the rotational speed. The simulation environment updates the state to s_{t+1} based on the input action a_t and the current state s_t, and generates an immediate reward r_t. The reward value accounts for the rate-of-penetration improvement, the parameter adjustment cost and the satisfaction of the constraints, so as to guide the behavior optimization of the agent. Through this continuous interaction, the agent accumulates experience data that provides training samples for subsequent model optimization.
4.2 Experience pool construction and updating: an experience replay buffer (Experience Replay Buffer) is introduced to store the interaction data between the agent and the simulation environment. The data is recorded as tuples of state, action, immediate reward and next state (s_t, a_t, r_t, s_{t+1}). Each time the model parameters are updated, a mini-batch is randomly sampled from the experience pool to train the policy network and the value network. Random sampling breaks data correlation, so the experience pool improves the stability and efficiency of training. The capacity of the experience pool is typically set to a fixed size, e.g., 10,000 interaction records, with the newest data overwriting the oldest, ensuring that training is always based on recent interaction information.
4.3 Settings for updating the model parameters. During training, the parameters of the policy network and the value network are updated with different optimization objectives. The value network aims to estimate the cumulative reward under the current state and action, and its loss function is defined as
L_Q = (1/N) · Σ_{i=1}^{N} (y_i − Q(s_i, a_i | θ^Q))²
where
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
s_i is the state at time step i and a_i is the action taken in that state; y_i denotes the target cumulative reward, γ is the discount factor, and Q′ and μ′ denote the target value network and the target policy network with parameters θ^{Q′} and θ^{μ′}, respectively. The value network parameters θ^Q are updated by gradient descent to minimize the loss L_Q.
The policy network is responsible for generating the optimal action a based on the current state s, with the optimization objective of maximizing the agent's cumulative reward. Its parameters are updated by the policy gradient method, with the gradient computed as
∇_{θ^μ} J ≈ (1/N) · Σ_{i=1}^{N} ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where θ^μ are the parameters of the policy network, μ(s_i | θ^μ) is the action output by the policy network in state s_i, and Q(s, a | θ^Q) is the value network, which evaluates the value of performing the action in state s. Updating θ^μ along this policy gradient allows the policy network to output actions that maximize the expected cumulative reward in a given state.
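Substeps 4.1-4.4 can be combined into a training loop such as the sketch below, reusing `DrillingEnv`, `ReplayBuffer`, `Actor`, `Critic` and `ddpg_update` from the earlier sketches. The episode count, exploration-noise scale, batch size, learning rates and the simple convergence monitoring are illustrative assumptions, and `rop_model` and `log_data` are assumed to be available from steps 2) and 1).

```python
import copy
import numpy as np
import torch

env = DrillingEnv(rop_model, log_data)                    # simulation environment from step 3)
buffer = ReplayBuffer(capacity=10_000)                    # experience pool of 10,000 records
state_dim = env.reset().shape[0]
bounds = {"wob_low": 5.0, "wob_high": 35.0, "rpm_low": 40.0, "rpm_high": 220.0}
actor, critic = Actor(state_dim, bounds), Critic(state_dim)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(500):                                # assumed number of episodes
    state, episode_reward, done = env.reset(), 0.0, False
    while not done:
        with torch.no_grad():
            a = actor(torch.as_tensor(state).unsqueeze(0)).squeeze(0).numpy()
        a = a + np.random.normal(0.0, 0.1, size=2)        # exploration noise (assumed scale)
        next_state, reward, done = env.step(a)
        buffer.push(state, a, reward, next_state)         # (s_t, a_t, r_t, s_{t+1})
        state, episode_reward = next_state, episode_reward + reward
        if len(buffer) >= 64:
            s, act, r, s2 = buffer.sample(64)             # random mini-batch
            batch = (torch.as_tensor(np.array(s), dtype=torch.float32),
                     torch.as_tensor(np.array(act), dtype=torch.float32),
                     torch.as_tensor(np.array(r), dtype=torch.float32).unsqueeze(1),
                     torch.as_tensor(np.array(s2), dtype=torch.float32))
            ddpg_update(batch, actor, critic, actor_target, critic_target,
                        actor_opt, critic_opt)
    if episode % 10 == 0:                                 # 4.4: monitor the cumulative reward
        print(f"episode {episode}: cumulative reward {episode_reward:.2f}")

# save the converged policy and value networks for deployment
torch.save({"actor": actor.state_dict(), "critic": critic.state_dict()}, "ddpg_drilling.pt")
```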
4.4 During training, the convergence of the model is judged by monitoring the cumulative reward and the stability of the policy. When the agent's performance in the simulation environment stabilizes, the cumulative reward curve reaches the expected level, and the actions output by the policy network change little over multiple rounds of interaction, the model is considered to have converged. At that point, the optimal parameters of the trained policy network and value network are saved for subsequent practical application.
4.5 In the present example, the model generates optimal Weight On Bit (WOB) and rotational speed (RPM) adjustment strategies based on real-time drilling conditions (e.g., current weight on bit, rotational speed, mud flow, etc.) to optimize rate of penetration (ROP). Referring to fig. 4, the improved DDPG algorithm model of the present invention can significantly improve drilling efficiency by continuously adjusting and controlling under different conditions.