CN119937305A - An automated control method for key drilling parameters based on improved DDPG algorithm - Google Patents

An automated control method for key drilling parameters based on improved DDPG algorithm

Info

Publication number
CN119937305A
Authority
CN
China
Prior art keywords
drilling
network
model
rop
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510012266.6A
Other languages
Chinese (zh)
Inventor
刘烨
朱文瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Shiyou University
Original Assignee
Xian Shiyou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Shiyou University
Priority to CN202510012266.6A
Publication of CN119937305A
Legal status: Pending (Current)

Abstract

The invention discloses an automated control method for key drilling parameters based on an improved DDPG algorithm. The method comprises: collecting and preprocessing drilling data and constructing a complete data set for training; constructing and training a rate of penetration (ROP) time-series prediction model based on a self-attention mechanism; building a simulation environment from the trained ROP prediction model; defining the state space, action space and multi-objective reward function of the reinforcement learning DDPG algorithm model; and training the reinforcement learning model with the Deep Deterministic Policy Gradient (DDPG) algorithm, in which an intelligent agent with an Actor-Critic network structure learns the optimal weight on bit (WOB) and rotational speed (RPM) regulation strategy in the simulation environment to dynamically optimize the ROP. Through automated, intelligent parameter regulation, the method improves the efficiency and stability of the drilling process, reduces bit wear and operating cost, offers good adaptability and multi-objective optimization capability, and is suitable for drilling operations under complex formation conditions.

Description

Automated control method for key drilling parameters based on an improved DDPG algorithm
Technical Field
The invention relates to the technical field of petroleum engineering, and in particular to a method for regulating and controlling key parameters during the drilling process.
Background
In drilling engineering, weight on bit (WOB) and rotational speed (RPM) are key parameters affecting the rate of penetration (ROP). Engineers typically adjust these parameters manually based on real-time drilling feedback to optimize drilling efficiency. However, conventional manual regulation has two problems: first, it depends on operator experience, so a stable optimization effect is difficult to maintain under complex formation conditions; second, the adjustment process lacks real-time responsiveness and accuracy, which may reduce drilling efficiency or lead to complex downhole conditions.
With the development of intelligent drilling technology, automated control systems have gradually been introduced at drilling sites. Existing automated control methods are mainly based on rule models or optimization algorithms; for example, Self, R. et al., in "Reducing drilling cost by finding optimal operational parameters using particle swarm algorithm", propose using the PSO algorithm to find the best parameter combination. Such methods achieve a certain optimization effect in specific scenarios, but when facing complex formation environments and multi-objective constraints they struggle to account simultaneously for multiple indicators such as rate of penetration, bit life and wellbore quality. In addition, traditional intelligent optimization algorithms must recompute every time the parameters are adjusted and cannot effectively reuse prior learning experience for a quick response, resulting in poor real-time performance and insufficient adaptability in a rapidly changing drilling environment.
In recent years, the rapid development of artificial intelligence technology has provided new solutions for the automated regulation of drilling parameters. Reinforcement learning (RL) is a data-driven optimization technique that learns optimal strategies in real time through continuous interaction between an intelligent agent and its environment. Its core advantage is that real-time drilling feedback can be integrated directly into the decision process to dynamically optimize parameter settings. However, existing reinforcement learning methods applied to drilling parameter regulation do not fully account for the combination of engineering constraints and formation complexity, so their optimization results lack operability in actual engineering, and no mature research in this field has yet been established.
Based on this, there is a need for an intelligent method that integrates multiple objective constraints and dynamically adjusts weight on bit and rotational speed, so as to improve drilling efficiency, optimize the rate of penetration, and ensure the safety and economy of downhole operations.
Disclosure of Invention
In order to overcome the defects of poor real-time performance, insufficient adaptability and limited multi-objective optimization capability of conventional drilling parameter regulation methods, the invention provides an automated control method for key drilling parameters based on the Deep Deterministic Policy Gradient (DDPG) algorithm, aiming to realize dynamic optimization of key parameters such as weight on bit (WOB) and rotational speed (RPM) through an intelligent algorithm, so as to improve the rate of penetration (ROP) while ensuring wellbore quality and operational safety. By constructing a reinforcement learning DDPG algorithm model and dynamically adjusting the drilling parameters, the method comprehensively considers multi-objective requirements such as drilling efficiency, bit life and wellbore quality, and significantly improves the intelligence level and economic benefit of the drilling process.
In order to solve the technical problems, the invention adopts the following technical scheme:
an automatic regulation and control method for drilling key parameters based on an improved DDPG algorithm comprises the following steps:
1) Acquiring and preprocessing drilling data, and constructing a training data set for training of the ROP time sequence prediction model;
2) Constructing and training an ROP time sequence prediction model, providing accurate dynamic response, and using the ROP time sequence prediction model as an environment feedback basis of a reinforcement learning DDPG algorithm model;
3) Constructing a simulation environment based on the ROP prediction model in the step 2), defining a state space, an action space and a reward function of the reinforcement learning DDPG algorithm model, and designing an improved DDPG algorithm model;
4) Training the DDPG algorithm model in the simulation environment, with the intelligent agent learning the optimal parameter regulation strategy.
The drilling parameters in step 1) include weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud inflow (MFI) and hook load (HKL), and the target parameter is the rate of penetration (ROP). The data preprocessing process comprises the following steps:
1.1 The collected data is subjected to preliminary cleaning, data items with the missing value proportion exceeding 10% are removed, abnormal data exceeding three times of standard deviation are removed, and linear interpolation processing is carried out on the missing data;
1.2 Min-max normalization of the data, with the formula:
X_normalized = (X − min(X)) / (max(X) − min(X))
where X_normalized is the normalized data, X is the original data, min(X) is the minimum of the original data, and max(X) is the maximum of the original data.
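By way of a non-limiting illustration, the cleaning, interpolation and min-max normalization of steps 1.1 and 1.2 could be sketched in Python as follows; the pandas DataFrame layout (one column per logged parameter), the row-wise reading of "data items" and the function name are assumptions, while the 10% and three-standard-deviation thresholds follow the text above:

```python
import pandas as pd

def preprocess(df: pd.DataFrame, max_missing_ratio: float = 0.10) -> pd.DataFrame:
    """Clean, interpolate and min-max normalize a drilling data set (sketch)."""
    # 1.1 Drop records in which more than 10% of the fields are missing
    df = df[df.isna().mean(axis=1) <= max_missing_ratio]
    # 1.1 Remove outliers lying more than three standard deviations from the column mean
    z = (df - df.mean()) / df.std()
    df = df[~(z.abs() > 3).any(axis=1)]
    # 1.1 Fill the remaining gaps by linear interpolation over adjacent samples
    df = df.interpolate(method="linear")
    # 1.2 Min-max normalization: X_norm = (X - min(X)) / (max(X) - min(X))
    return (df - df.min()) / (df.max() - df.min())
```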
In step 2), a deep time-series neural network model is constructed using the training set data from step 1) to predict the ROP value, which specifically comprises the following sub-steps:
2.1 A deep neural network model is constructed, with self-attention as the core architecture, exploiting its strengths in processing time-series data to capture the dynamic law by which drilling parameters and formation characteristics change over time. The model comprises an encoder and a decoder based on the self-attention mechanism for extracting multi-level time-series features, with a fully connected layer at the output end to generate the predicted ROP value.
2.2 The input features and output target of the model are set: the input features comprise weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud inflow (MFI) and hook load (HKL), and the output target is the rate of penetration (ROP) at the corresponding moment. The input features are normalized to accelerate training and improve the generalization performance of the model.
2.3 Loss function and optimizer of the time-series prediction model: the training process adopts the mean squared error (MSE) as the loss function, with the formula:
MSE = (1/N) · Σ_{i=1..N} (y_i − ŷ_i)²
where y_i is the true ROP value, ŷ_i is the model-predicted ROP value, and N is the number of samples. During training optimization, the Adam optimizer is used for parameter updating.
2.4 After the evaluation result on the test set reaches expectations, the optimal state of the model is saved.
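A minimal sketch of such an encoder-decoder ROP predictor is given below (PyTorch). The layer sizes, number of attention heads and the omission of positional encoding are illustrative assumptions; only the self-attention encoder/decoder core, the fully connected output head, the MSE loss and the Adam optimizer are taken from the description above:

```python
import torch
import torch.nn as nn

class ROPPredictor(nn.Module):
    """Self-attention encoder-decoder that maps a window of drilling parameters
    (WOB, RPM, SPP, TQ, MFI, HKL) to a predicted ROP value."""

    def __init__(self, n_features: int = 6, d_model: int = 64, n_heads: int = 4,
                 n_layers: int = 2):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=128,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.head = nn.Linear(d_model, 1)      # fully connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window_length, n_features), already min-max normalized
        h = self.input_proj(x)
        memory = self.encoder(h)
        # The last time step is used as the decoder query (one design choice among several)
        out = self.decoder(h[:, -1:, :], memory)
        return self.head(out[:, -1, :]).squeeze(-1)   # predicted ROP, shape (batch,)

# Training objective as in step 2.3: MSE loss with the Adam optimizer
model = ROPPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```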
In the step 3), a simulation environment is constructed based on the ROP time sequence prediction model trained in the step 2), and key elements of the reinforcement learning DDPG algorithm model are defined, which specifically comprises the following contents:
3.1 A simulation environment is built with the ROP prediction model as its core, used to simulate the influence of weight on bit (WOB) and rotational speed (RPM) on the rate of penetration (ROP). Inputs to the simulation environment include weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud inflow (MFI) and hook load (HKL). Since the underlying characteristic parameters (those other than WOB and RPM) cannot be directly manipulated by reinforcement learning, they are used as state inputs of the environment, affecting the penetration-rate prediction and the subsequent optimization process in the simulation environment. The simulation environment computes the rate of penetration (ROP) through the ROP prediction model and provides penetration-rate feedback under a reward-and-penalty mechanism for optimizing the DDPG algorithm model.
Constraints in the simulation environment include the physical operating limits of weight on bit and rotational speed (e.g., the maximum allowable WOB and RPM of the equipment), limits on the bit wear rate, and wellbore quality requirements; these ensure that the regulation schemes generated by the model are feasible and that the rate of penetration can be effectively increased in actual operation.
3.2 Key elements of the DDPG algorithm model are defined, including the preset action space and the reward function. The action space comprises adjustments to the weight on bit and the rotational speed, the two parameters that the DDPG algorithm model can regulate. The model optimizes the rate of penetration (ROP) by increasing, decreasing or keeping these parameters unchanged.
The reward function is designed for multi-objective optimization, taking into account the ROP improvement, the adjustment cost and the satisfaction of constraints. The specific form of the reward function is:
Reward = α·ROP_gain − β·Adjustment_cost − γ·Constraint_penalty
ROP_gain denotes the improvement in the rate of penetration at the current time step t compared with the previous time step t−1, multiplied by a weight coefficient α to encourage increases in the rate of penetration; Adjustment_cost denotes the magnitude of the WOB and RPM adjustments at the current time step, multiplied by a weight coefficient β to penalize excessive parameter adjustment and reduce operating cost and equipment wear; Constraint_penalty measures whether engineering constraints are violated (e.g., the physical limits of WOB and RPM, the bit wear rate, wellbore quality) and is multiplied by a higher weight coefficient γ to ensure strict compliance with safety and performance standards. Through this reward mechanism, the intelligent agent is not only encouraged to increase the rate of penetration but also constrained to operate within a reasonable adjustment range, and any behavior violating engineering limits is strictly penalized, so that efficiency, safety and economic benefit are jointly optimized during drilling.
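The reward computation can be illustrated as follows; the numeric values of α, β, γ and the way the adjustment cost and constraint penalty are quantified are placeholders, since the text above only fixes the general form of the formula:

```python
def compute_reward(rop_t: float, rop_prev: float, d_wob: float, d_rpm: float,
                   n_violations: int,
                   alpha: float = 1.0, beta: float = 0.1, gamma: float = 10.0) -> float:
    """Reward = alpha*ROP_gain - beta*Adjustment_cost - gamma*Constraint_penalty (sketch)."""
    rop_gain = rop_t - rop_prev                     # ROP improvement over the previous step
    adjustment_cost = abs(d_wob) + abs(d_rpm)       # magnitude of the WOB/RPM adjustment
    constraint_penalty = float(n_violations)        # e.g. WOB/RPM limits, bit wear, hole quality
    return alpha * rop_gain - beta * adjustment_cost - gamma * constraint_penalty
```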
3.3 Architecture of the reinforcement learning model: the model uses the Deep Deterministic Policy Gradient (DDPG) algorithm and adopts a policy network (Actor) and value network (Critic) structure to optimize the regulation strategy for weight on bit (WOB) and rotational speed (RPM). Compared with the conventional DDPG algorithm model, the invention is optimized and improved in the following respects:
3.3.1 The model adds an extra fully connected layer to the conventional policy network, bringing the network depth to four layers. The introduction of this extra layer enables the network to better understand and process the multi-dimensional data of the drilling process, thereby improving the accuracy and stability of the strategy;
3.3.2 In value network design, the present invention introduces a residual connection (Residual Connections). Specifically, by adding two linear layers (residual 1 and residual 2), the input features are added directly to the output of the hidden layer. The residual connection effectively relieves the gradient vanishing problem in the deep network, promotes a more stable and efficient training process, and improves the accuracy of value evaluation;
3.3.3 The model uses normal distribution (mean 0, standard deviation 0.1) for weight initialization in each layer of the policy network and the value network. The initialization method is helpful to avoid gradient disappearance or explosion phenomenon, so that the network can keep better performance in the initial stage of training;
3.3.4 To prevent model overfitting and promote generalization capability, the invention adds Dropout layers (p=0.1) after the first hidden layer of the policy network and the key hidden layer of the value network, respectively. The Dropout layer reduces the dependence of the network on specific neurons by randomly discarding the output of part of neurons, and enhances the adaptability and robustness of the model under different drilling environments;
3.3.5 The output of the policy network is limited to the range [−1, 1] by the tanh activation function, and the action values are then scaled by a linear transformation to the actual action space ranges, namely the weight on bit [WOB_low, WOB_high] and the rotational speed [RPM_low, RPM_high]. The specific formulas are:
WOB = WOB_low + ((action_WOB + 1) / 2) · (WOB_high − WOB_low)
RPM = RPM_low + ((action_RPM + 1) / 2) · (RPM_high − RPM_low)
where action denotes the current action value output in [−1, 1]. This design ensures that the output WOB and RPM lie within a reasonable and operable range and avoids generating invalid or extreme action values, thereby improving the enforceability and safety of the regulation strategy.
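The improvements 3.3.1 to 3.3.5 can be sketched as follows (PyTorch). The hidden-layer width, the exact wiring of the residual path and the numeric WOB/RPM ranges are assumptions; the four-layer Actor, the residual1/residual2 layers, the normal(0, 0.1) weight initialization, the Dropout(p=0.1) placement and the tanh-plus-rescaling output follow the description above:

```python
import torch
import torch.nn as nn

def init_normal(module: nn.Module) -> None:
    # Weight initialization: normal distribution with mean 0 and std 0.1 (section 3.3.3)
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)
        nn.init.zeros_(module.bias)

class Actor(nn.Module):
    """Policy network: four fully connected layers, Dropout after the first hidden
    layer, tanh output rescaled to the WOB/RPM operating ranges (3.3.1, 3.3.4, 3.3.5)."""

    def __init__(self, state_dim: int, wob_range=(5.0, 35.0), rpm_range=(40.0, 180.0),
                 hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),          # two actions (WOB, RPM) in [-1, 1]
        )
        self.register_buffer("low", torch.tensor([wob_range[0], rpm_range[0]]))
        self.register_buffer("high", torch.tensor([wob_range[1], rpm_range[1]]))
        self.apply(init_normal)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        a = self.net(state)                            # tanh-bounded actions
        # Linear rescaling from [-1, 1] to [low, high]
        return self.low + (a + 1.0) / 2.0 * (self.high - self.low)

class Critic(nn.Module):
    """Value network Q(s, a) with a residual path: the (state, action) input is projected
    by two extra linear layers and added to the hidden representation (3.3.2)."""

    def __init__(self, state_dim: int, action_dim: int = 2, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.residual1 = nn.Linear(state_dim + action_dim, hidden)
        self.residual2 = nn.Linear(hidden, hidden)
        self.dropout = nn.Dropout(p=0.1)
        self.out = nn.Linear(hidden, 1)
        self.apply(init_normal)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)
        h = self.dropout(torch.relu(self.fc1(x)))
        h = torch.relu(self.fc2(h) + self.residual2(self.residual1(x)))  # residual addition
        return self.out(h)                             # estimated cumulative reward Q(s, a)
```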
In step 4), the simulation environment and the reinforcement learning DDPG algorithm model constructed in step 3) are used, and the model is trained to learn the optimal parameter regulation strategy, which specifically comprises the following contents:
4.1 A training process is established and the model is trained with the Deep Deterministic Policy Gradient (DDPG) algorithm. During training, the intelligent agent interacts with the simulation environment: at each time step t, it selects an action a_t according to the current state s_t and obtains the next state s_{t+1} and an immediate reward r_t. Through continuous interaction with the environment, experience data are accumulated.
4.2 Experience pool creation and updating: an experience replay buffer is introduced to store the interaction data of the intelligent agent as tuples of state, action, reward and next state (s_t, a_t, r_t, s_{t+1}). Each time the policy is updated, a mini-batch of data is randomly sampled from the experience pool for training, which breaks the correlation between data and improves the stability and efficiency of training.
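A minimal experience replay pool consistent with this description might look as follows; the mini-batch size of 64 is a placeholder, since the text fixes only the use of random mini-batch sampling:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay pool storing (s_t, a_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)    # newest transitions overwrite the oldest

    def push(self, state, action, reward, next_state) -> None:
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 64):
        # Random mini-batch sampling breaks the temporal correlation of the data
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self) -> int:
        return len(self.buffer)
```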
4.3 Model parameters are updated using the policy network and the value network. The goal of the value network is to estimate the expected cumulative reward for a given state and action, and its loss function is defined as:
L_Q = (1/N) · Σ_i (y_i − Q(s_i, a_i | θ^Q))²
where
y_i = r_i + γ · Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
γ is the discount factor, and θ^{Q′} and θ^{μ′} are the parameters of the target value network Q′ and the target policy network μ′, respectively. The value network parameters θ^Q are updated by gradient descent to minimize the loss function L_Q.
The purpose of the policy network is to output the optimal action (i.e., the adjustments to weight on bit and rotational speed) based on the current state, with the objective of maximizing the cumulative reward of the intelligent agent in the environment. By learning the mapping from states to actions, the policy network directly decides the regulation strategy that the agent should adopt in different states.
The goal of updating the policy network is to increase the value obtained by the actions selected in each state. The parameters of the policy network are updated by the policy gradient method, with the gradient computed as:
∇_{θ^μ} J ≈ (1/N) · Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where θ^μ are the parameters of the policy network, μ(s_i | θ^μ) denotes the action output by the policy network in state s_i, and Q(s, a | θ^Q) is the value network, which evaluates the value of taking the action in state s. Using the policy gradient method, the policy network parameters θ^μ are updated so that the policy network outputs actions that maximize the expected cumulative reward in a given state.
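One DDPG update step built on the loss L_Q and the policy gradient above could be sketched as follows; the discount factor, the soft-update coefficient τ and the soft target update itself are standard DDPG choices that the description does not spell out:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch,
                gamma: float = 0.99, tau: float = 0.005) -> None:
    """One DDPG parameter update from a sampled mini-batch (sketch)."""
    states, actions, rewards, next_states = batch        # float tensors, first dim = batch

    # Value network: minimize L_Q = mean_i (y_i - Q(s_i, a_i))^2 with
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        target_q = rewards.view(-1, 1) + gamma * target_critic(
            next_states, target_actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy network: ascend the policy gradient, i.e. maximize Q(s, mu(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks Q' and mu' (standard DDPG practice;
    # the description references the target networks without giving the update rule)
    for target, online in ((target_critic, critic), (target_actor, actor)):
        for t_param, param in zip(target.parameters(), online.parameters()):
            t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```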
4.4 Model convergence and policy optimization: the convergence of the model is judged by monitoring the cumulative reward and the stability of the policy during training. The model is considered to have converged when the cumulative reward reaches the expected level on the validation set and the policy output stabilizes. At this point, the optimal model parameters are saved for subsequent practical application.
The trained DDPG algorithm model can output the optimal weight on bit (WOB) and rotational speed (RPM) adjustment strategy according to the current drilling state, thereby optimizing the rate of penetration (ROP), improving drilling efficiency and ensuring drilling safety.
Compared with the prior art, the invention has the main advantages that:
(1) Realizing the intelligent regulation and control of key parameters. According to the invention, automatic regulation and control of Weight On Bit (WOB) and rotating speed (RPM) are realized through the improved DDPG algorithm model, key parameters can be dynamically regulated according to the real-time drilling state, manual intervention is not needed, and the automation degree of drilling operation is greatly improved.
(2) Improving the accuracy and efficiency of rate of penetration (ROP) optimization. By combining the ROP time sequence prediction model and DDPG algorithm, the invention can predict the drilling speed in real time under different working conditions, and can maximize the lifting effect of the drilling speed through intelligent regulation and control, thereby remarkably improving the drilling efficiency.
(3) Multi-objective optimization capability. The invention designs a multi-objective rewarding function, comprehensively considers the drilling speed improvement, the parameter adjustment cost and the safety constraint, can ensure the service life of the drill bit and the quality of the well bore while improving the drilling speed, and meets various engineering requirements.
(4) And the adaptability of complex working conditions is enhanced. By constructing the simulation environment and improving the dynamic optimization of DDPG algorithm models, the invention can stably run under different stratum conditions, equipment configuration and operation constraints and is suitable for complex drilling working conditions.
(5) Reducing the risk of manual intervention and operation. In the traditional drilling operation, the adjustment of the drilling pressure and the rotating speed is required to depend on manual experience, and the problems of misoperation and low efficiency are easy to occur. According to the invention, through the improved intelligent decision making capability of the reinforcement learning DDPG algorithm model, the requirement of manual intervention is remarkably reduced, and the operation risk is reduced.
In conclusion, the invention provides an efficient, stable and safe automatic regulation and control method for drilling key parameters through an intelligent algorithm and a reinforcement learning DDPG algorithm model, which can remarkably improve drilling efficiency, reduce operation risk, adapt to complex working condition requirements and have wide application prospects.
Drawings
FIG. 1 is a flow chart of a method for automatically regulating and controlling drilling parameters based on a modified DDPG algorithm.
FIG. 2 is a schematic diagram of the architecture of a self-attention based neural network.
Fig. 3 is a schematic diagram of a neural network structure of the reinforcement learning DDPG algorithm.
Fig. 4 is a graph comparing ROP results before and after optimization by the improved DDPG algorithm.
Detailed Description
Example 1
Referring to fig. 1, the method for automatically regulating and controlling drilling key parameters based on the improved DDPG algorithm comprises the following steps:
1) Acquiring and preprocessing drilling data, and constructing a training data set for training of the ROP time sequence prediction model;
2) Constructing and training an ROP time sequence prediction model, and inputting the data in the step 1) into the model to provide accurate dynamic response, wherein the accurate dynamic response is used as an environment feedback basis of a reinforcement learning DDPG algorithm model;
3) Constructing a simulation environment based on the ROP prediction model in the step 2), defining a state space, an action space and a reward function of the reinforcement learning DDPG algorithm model, and designing an improved DDPG algorithm model;
4) And 3) training the reinforcement learning DDPG algorithm model by utilizing the ROP simulation environment in the step 3), learning the optimal parameter regulation strategy by an intelligent agent, and finally saving and applying the model.
In this embodiment, in step 1), key parameter data related to the drilling process are collected in real time by sensors and monitoring systems at the drilling site. The drilling operation parameters comprise weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), drilling time (BDT), mud inflow (MFI) and hook load (HKL), and the target parameter is the rate of penetration (ROP). The collected data include historical data and real-time drilling data, which together form a complete training data set. The acquired data are then preprocessed as follows:
1.1 The collected data are subjected to preliminary cleaning: data entries with a missing-value proportion exceeding 10% are removed, abnormal data beyond three times the standard deviation are removed, and the quality and consistency of the remaining data are ensured. Table 1 shows a partial sample of the processed data:
Table 1: ROP logging sample data example
1.2 Data interpolation is performed on the cleaned data: missing values are filled by linear interpolation, using data at adjacent moments to interpolate data missing within a given time interval.
1.3 Min-max normalization (Min-Max Normalization) of the data, with the formula:
X_normalized = (X − min(X)) / (max(X) − min(X))
where X_normalized is the normalized data, X is the original data, min(X) is the minimum of the original data, and max(X) is the maximum of the original data.
In this embodiment, in step 2), an ROP time-series prediction model based on self-attention is constructed using the training set data to capture the dynamic law by which drilling parameters and formation characteristics change over time. As shown in Fig. 2, the main flow of model construction is as follows:
2.1 Model structure: encoder and decoder layers based on the self-attention mechanism serve as the core architecture, and a fully connected output layer maps the extracted features to the prediction target, namely the rate of penetration (ROP).
2.2 Loss function and model optimization: model training adopts the mean squared error (MSE) as the loss function, with the formula:
MSE = (1/N) · Σ_{i=1..N} (y_i − ŷ_i)²
where y_i is the true ROP value, ŷ_i is the model-predicted ROP value, and N is the number of samples. During training optimization, the Adam optimizer is used for parameter updating, with an initial learning rate of 0.001 adjusted by a decay factor of 0.9 after every 10 training epochs.
2.3 Training process design: the model is trained by mini-batch gradient descent (batch size = 32) for 100 epochs, and the ROP obtained after automatic parameter adjustment is checked on a validation set to confirm the corresponding improvement.
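The training configuration of steps 2.2 and 2.3 (Adam, initial learning rate 0.001, decay factor 0.9 every 10 epochs, batch size 32, 100 epochs) could be set up as follows; the placeholder tensors X and y stand in for the preprocessed feature windows and ROP targets, and ROPPredictor refers to the model sketch given earlier:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the real preprocessed windows and ROP targets
X = torch.randn(256, 30, 6)        # (samples, window length, 6 input features)
y = torch.randn(256)               # corresponding ROP values

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)   # mini-batches of 32

model = ROPPredictor()             # encoder-decoder sketch from step 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Learning rate decays by a factor of 0.9 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
loss_fn = torch.nn.MSELoss()

for epoch in range(100):           # 100 training epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()
```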
In this embodiment, in step 3), a simulation environment is constructed based on the ROP time-series prediction model trained in step 2) and is used for training and optimizing the DDPG model; referring to Fig. 3, this specifically includes the following:
3.1 A simulation environment is constructed. The simulation environment takes the trained ROP prediction model as its core and simulates the dynamic influence of key parameters such as weight on bit (WOB) and rotational speed (RPM) on the rate of penetration (ROP). Input features of the simulation environment include the current weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud inflow (MFI) and hook load (HKL). The underlying characteristic parameters serve as the state input of the environment, directly influence the penetration-rate prediction in the simulation environment, and provide the basis for strategy optimization of the DDPG algorithm model. Because these underlying characteristic parameters cannot be directly regulated through the DDPG model, the reinforcement learning DDPG algorithm mainly learns and optimizes the adjustment strategy for weight on bit and rotational speed.
3.2 The basic elements of the reinforcement learning process, including the state space, action space and reward function, are defined based on the simulation environment. The state space (State, s) includes the two adjustable characteristic parameters, current weight on bit (WOB) and rotational speed (RPM), the other non-adjustable parameters, and historical rate of penetration (ROP) data. Through this definition of the state space, the reinforcement learning DDPG algorithm model can comprehensively perceive the current drilling conditions and has sufficient information for subsequent decisions.
The action space (Action, a) is defined as the adjustment values of weight on bit and rotational speed, specifically increasing, decreasing or keeping them unchanged, subject to the physical operating limits on WOB and RPM. Through learning, the reinforcement learning DDPG algorithm model outputs the optimal action under different conditions, thereby optimizing the rate of penetration (ROP) while ensuring wellbore quality and bit life. The reward function (Reward, r) is designed as a multi-objective optimization function, taking into account the rate of penetration improvement (ROP_gain), the parameter adjustment cost (Adjustment_cost) and constraint satisfaction (Constraint_penalty). The specific form of the reward function is:
Reward = α·ROP_gain − β·Adjustment_cost − γ·Constraint_penalty
where ROP_gain is the penetration-rate improvement, Adjustment_cost is the adjustment cost, Constraint_penalty is the penalty term for violating constraints, and α, β, γ are weight coefficients used to balance the objectives.
3.3 The reinforcement learning model uses the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient), whose architecture includes a policy network (Actor) and a value network (Critic). The policy network outputs the optimal adjustment actions (i.e., adjustment values for weight on bit and rotational speed) based on the current state, while the value network evaluates the cumulative reward of each state-action pair, helping the policy network optimize the regulation strategy.
Both the policy network and the value network adopt fully connected layer structures with the ReLU (Rectified Linear Unit) activation function, computed as:
ReLU(x)=max(0,x)
In the model architecture, the policy network receives the current state information of the simulation environment (such as weight on bit, rotational speed, mud flow, borehole depth and historical ROP values) and outputs two continuous values corresponding to the adjustment values of weight on bit and rotational speed, respectively. The value network receives the current state and the action generated by the policy network and computes a value estimate (i.e., the cumulative reward) for that state-action pair. Through the continuous cooperation and optimization of the policy network and the value network, the model gradually learns the optimal parameter regulation strategy.
In this embodiment, in step 4), training is performed using the simulation environment and the reinforcement learning DDPG algorithm model. Using the simulation environment constructed in step 3) and the reinforcement learning DDPG algorithm model, the model is trained to learn the optimal weight on bit (WOB) and rotational speed (RPM) regulation strategy, thereby dynamically optimizing the rate of penetration (ROP). The specific implementation process is as follows:
4.1 The training process adopts the Deep Deterministic Policy Gradient (DDPG) algorithm, and the model gradually learns the optimal parameter regulation strategy through continuous interaction between the reinforcement learning agent and the simulation environment. At each time step t, the agent generates an action a_t via the policy network based on the current state s_t; this action corresponds to the adjustment values of weight on bit and rotational speed. The simulation environment updates the state to s_{t+1} based on the input action a_t and current state s_t, while generating an immediate reward r_t. The reward value accounts for the penetration-rate improvement, the parameter adjustment cost and the satisfaction of constraints, thereby guiding the behavior optimization of the agent. Through this continuous interaction, the agent accumulates experience data, providing training samples for subsequent model optimization.
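An illustrative episode of this agent-environment interaction is sketched below; the environment interface (reset/step), the Gaussian exploration noise and the default step and batch counts are assumptions rather than elements fixed by the description, and the buffer and update function follow the sketches given earlier:

```python
import numpy as np
import torch

def run_episode(env, actor, replay_buffer, ddpg_update_fn,
                max_steps: int = 200, batch_size: int = 64,
                exploration_std: float = 0.1) -> float:
    """One episode of agent-environment interaction (sketch).
    env is assumed to wrap the trained ROP predictor and to expose reset() and
    step(action) -> (next_state, reward, done); these names are assumptions."""
    state = env.reset()
    episode_reward = 0.0
    for _ in range(max_steps):
        with torch.no_grad():
            action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
        # Exploration noise on the WOB/RPM adjustments (a common DDPG choice)
        action = action + np.random.normal(0.0, exploration_std, size=action.shape)
        next_state, reward, done = env.step(action)
        replay_buffer.push(state, action, reward, next_state)   # store (s_t, a_t, r_t, s_{t+1})
        if len(replay_buffer) >= batch_size:
            ddpg_update_fn(replay_buffer.sample(batch_size))     # mini-batch update
        state = next_state
        episode_reward += reward
        if done:
            break
    return episode_reward
```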
4.2 During the experience pool creation and update process, an experience replay buffer is introduced to store the interaction data between the agent and the simulation environment. The data stored in the experience pool are recorded as tuples comprising state, action, immediate reward and next state (s_t, a_t, r_t, s_{t+1}). Each time the model parameters are updated, a mini-batch of data is randomly sampled from the experience pool for training the policy network and the value network. By breaking data correlation through random sampling, the experience pool improves the stability and efficiency of training. The capacity of the experience pool is typically set to a fixed size, e.g., 10,000 interaction records, with the most recent data overwriting the oldest, ensuring that training is always based on the most recent interaction information.
4.3 Settings for updating the model parameters. During training, the parameters of the policy network and the value network are updated with different optimization objectives. The value network aims to estimate the cumulative reward under the current state and action, and its loss function is defined as:
L_Q = (1/N) · Σ_i (y_i − Q(s_i, a_i | θ^Q))²
where
y_i = r_i + γ · Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
s is the state at a given moment and a is the action in that state; y_i denotes the target cumulative reward, γ is the discount factor, and θ^{Q′} and θ^{μ′} are the parameters of the target value network Q′ and the target policy network μ′, respectively. The value network parameters θ^Q are updated by gradient descent to minimize the loss function L_Q.
The policy network is responsible for generating the optimal action a based on the current state s, with the optimization objective of maximizing the cumulative reward of the agent. The policy network parameters are updated by the policy gradient method, with the gradient given by:
∇_{θ^μ} J ≈ (1/N) · Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where θ^μ are the parameters of the policy network, μ(s_i | θ^μ) denotes the action output by the policy network in state s_i, and Q(s, a | θ^Q) is the value network, which evaluates the value of taking the action in state s. Using the policy gradient method, the policy network parameters θ^μ are updated so that the policy network outputs actions that maximize the expected cumulative reward in a given state.
4.4 During training, the convergence of the model is judged by monitoring the cumulative reward and the stability of the policy. When the agent's performance in the simulation environment stabilizes, the cumulative reward curve reaches the expected level, and the action outputs of the policy network change little across multiple rounds of interaction, the model is considered to have converged. At this point, the optimal parameters of the trained policy network and value network are saved for subsequent practical application.
4.5 In this example, the model generates the optimal weight on bit (WOB) and rotational speed (RPM) adjustment strategy based on the real-time drilling state (e.g., current weight on bit, rotational speed, mud flow) to optimize the rate of penetration (ROP). Referring to Fig. 4, the improved DDPG algorithm model of the invention can significantly improve drilling efficiency through continuous regulation under different conditions.

Claims (6)

Translated from Chinese
1. A method for the automated control of key drilling parameters based on an improved DDPG algorithm, characterized in that it comprises the following steps:
1) collecting and preprocessing drilling data to build a complete data set for training;
2) building and training a self-attention-based neural network time-series prediction model for the rate of penetration (ROP);
3) building a simulation environment based on the ROP time-series prediction model of step 2), defining the state space, action space and reward function of the reinforcement learning DDPG algorithm model, and designing an improved DDPG network structure model;
4) using the simulation environment of step 3) to interactively train the reinforcement learning agent decision model, the intelligent agent learning the optimal weight on bit (WOB) and rotational speed (RPM) control strategy to achieve dynamic optimization of the rate of penetration (ROP).
2. The method according to claim 1, characterized in that the preprocessing in step 1) comprises:
1.1) performing preliminary cleaning of the collected data: removing data entries whose missing-value ratio exceeds 10%, removing abnormal data beyond three times the standard deviation, and performing linear interpolation on the missing data;
1.2) applying min-max normalization to the data, with the formula:
X_normalized = (X − min(X)) / (max(X) − min(X))
where X_normalized is the normalized data, X is the original data, min(X) is the minimum of the original data, and max(X) is the maximum of the original data.
3. The method according to claim 1, characterized in that step 2) further comprises:
2.1) designing the prediction model structure, including encoder and decoder layers based on the self-attention mechanism for extracting multi-level time-series features, and a fully connected output layer for generating the predicted rate of penetration (ROP) value;
2.2) setting the loss function to the mean squared error (MSE) and using the Adam optimizer to optimize the model parameters, with an initial learning rate of 0.001 adjusted by a decay factor of 0.9 after every 10 training epochs;
2.3) training the model for 100 epochs using mini-batch gradient descent (batch size = 32), evaluating model performance on the validation set, and saving the optimal model state during training.
4. The method according to claim 1, characterized in that step 3) further comprises:
3.1) constructing a simulation environment centered on the trained ROP prediction model, whose input features include weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), drilling time (BDT), mud inflow (MFI) and hook load (HKL);
3.2) defining the key elements of the reinforcement learning model, including the state space, action space and reward function, wherein the state space includes weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), drilling time (BDT), mud inflow (MFI), hook load (HKL) and historical rate of penetration (ROP) data; the action space is defined as increasing, decreasing or keeping the weight on bit and rotational speed unchanged, subject to constraint limits; and the reward function is a multi-objective optimization function of the form:
Reward = α·ROP_gain − β·Adjustment_cost − γ·Constraint_penalty
where ROP_gain denotes the penetration-rate improvement of the current time step t compared with the previous time step t−1, multiplied by a weight coefficient α to encourage increases in the rate of penetration; Adjustment_cost denotes the magnitude of the WOB and RPM adjustments at the current time step, multiplied by a weight coefficient β to penalize excessive parameter adjustment and reduce operating cost and equipment wear; and Constraint_penalty measures whether engineering constraints are violated (such as the physical limits of WOB and RPM, the bit wear rate and wellbore quality), multiplied by a higher weight coefficient γ to ensure strict compliance with safety and performance standards;
3.3) creating a reinforcement learning network structure model which uses the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient) and includes a policy network (Actor) and a value network (Critic), wherein the networks use fully connected layers with the ReLU activation function; the policy network outputs continuous WOB and RPM adjustment values according to the current state, and the value network evaluates the cumulative reward of each state-action pair to assist the policy network in optimizing the control strategy; the network structure is further characterized in that:
3.3.1) the model adds an extra fully connected layer to the conventional policy network, bringing the network depth to four layers; the introduction of this extra layer enables the network to better understand and process the multi-dimensional data of the drilling process, improving the accuracy and stability of the strategy;
3.3.2) in the value network design, residual connections are introduced: specifically, by adding two linear layers (residual1 and residual2), the input features are added directly to the output of the hidden layer; this residual connection effectively alleviates the vanishing-gradient problem in deep networks, promotes a more stable and efficient training process, and improves the accuracy of value assessment;
3.3.3) the model uses a normal distribution (mean 0, standard deviation 0.1) for weight initialization in each layer of the policy network and the value network, which helps to avoid gradient vanishing or explosion;
3.3.4) to prevent overfitting and improve generalization, a Dropout layer (p = 0.1) is added after the first hidden layer of the policy network and after the key hidden layer of the value network; by randomly discarding the outputs of some neurons, the Dropout layer reduces the network's dependence on specific neurons and enhances the adaptability and robustness of the model in different drilling environments;
3.3.5) the output of the policy network is limited to the range [−1, 1] by the tanh activation function, and the action values are then scaled by a linear transformation to the actual action space ranges, namely the weight on bit [WOB_low, WOB_high] and the rotational speed [RPM_low, RPM_high], where action denotes the current action value; this design ensures that the output WOB and RPM lie within a reasonable and operable range and avoids generating invalid or extreme action values, thereby improving the feasibility and safety of the control strategy.
5. The method according to claim 4, characterized in that the constraints in the simulation environment include the real physical operating limits of weight on bit and rotational speed during drilling, ensuring the feasibility of the generated control scheme in actual operation.
6. The method according to claim 1, characterized in that step 4) further comprises:
4.1) the intelligent agent continuously interacting with the simulation environment: at each time step t, the policy network generates an action a_t according to the current state s_t, and the simulation environment updates the state to the next state s_{t+1} according to the action a_t and state s_t and generates an immediate reward;
4.2) establishing and updating an experience replay buffer to store the interaction data (s_t, a_t, r_t, s_{t+1}); each time the model parameters are updated, a mini-batch of data is randomly sampled from the experience pool for training;
4.3) updating the policy network parameters by minimizing the loss function of the value network and applying the policy gradient method, ensuring that the policy network can output actions that maximize the expected cumulative reward in different states, wherein the loss function of the value network is defined as:
L_Q = (1/N) · Σ_i (y_i − Q(s_i, a_i | θ^Q))²
where
y_i = r_i + γ · Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
s is the state at a given moment and a is the action in that state; y_i denotes the target cumulative reward, γ is the discount factor, and θ^{Q′} and θ^{μ′} are the parameters of the target value network Q′ and the target policy network μ′, respectively; by minimizing the loss function L_Q, the value network parameters θ^Q are updated using gradient descent;
4.4) monitoring the cumulative reward and policy stability during training to determine whether the model has converged, and saving the optimal policy network and value network parameters after convergence;
4.5) applying the trained reinforcement learning model to actual drilling operations, generating the optimal weight on bit (WOB) and rotational speed (RPM) adjustment strategy according to the real-time drilling state, so as to achieve maximum regulation of the rate of penetration (ROP).
CN202510012266.6A | 2025-01-03 | 2025-01-03 | An automated control method for key drilling parameters based on improved DDPG algorithm | Pending | CN119937305A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202510012266.6A | 2025-01-03 | 2025-01-03 | An automated control method for key drilling parameters based on improved DDPG algorithm (CN119937305A, en)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202510012266.6A | 2025-01-03 | 2025-01-03 | An automated control method for key drilling parameters based on improved DDPG algorithm (CN119937305A, en)

Publications (1)

Publication Number | Publication Date
CN119937305A | 2025-05-06

Family

ID=95549680

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202510012266.6A | An automated control method for key drilling parameters based on improved DDPG algorithm (Pending, CN119937305A, en) | 2025-01-03 | 2025-01-03

Country Status (1)

Country | Link
CN | CN119937305A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120215281A (en)* | 2025-05-26 | 2025-06-27 | 昆明理工大学 | A dynamic optimization control method for tin smelting process
CN120337840A (en)* | 2025-06-18 | 2025-07-18 | 山东云海国创云计算装备产业创新中心有限公司 | A method, device, medium and product for determining parameters of an analog circuit
CN120487037A (en)* | 2025-07-21 | 2025-08-15 | 中国石油大学(北京) | Downhole tool face dynamic control method and system based on reinforcement learning
CN120542277A (en)* | 2025-07-25 | 2025-08-26 | 龙门实验室 | A friction coefficient calibration method for large cylinder forging simulation based on reinforcement learning
CN120597735A (en)* | 2025-08-08 | 2025-09-05 | 吉林大学 | A deep reinforcement learning-driven intelligent real-time optimization method for drilling parameters

Similar Documents

Publication | Publication Date | Title
CN119937305A (en) An automated control method for key drilling parameters based on improved DDPG algorithm
Liang et al.An improved genetic algorithm optimization fuzzy controller applied to the wellhead back pressure control system
CN111474965B (en)Fuzzy neural network-based method for predicting and controlling water level of series water delivery channel
CN110807557B (en)BP neural network-based drilling rate prediction method and BP neural network-based particle swarm optimization method
CN111963115B (en) An intelligent optimization system and method for coalbed methane well drainage and production parameters based on reinforcement learning
CN114777192B (en)Secondary network heat supply autonomous optimization regulation and control method based on data association and deep learning
CN112861423A (en)Data-driven water-flooding reservoir optimization method and system
CN114519291A (en)Method for establishing working condition monitoring and control model and application method and device thereof
CN116644844A (en)Stratum pressure prediction method based on neural network time sequence
Zhou et al.Modeling and coordinated optimization method featuring coupling relationship among subsystems for improving safety and efficiency of drilling process
Zhou et al.A novel optimization method for geological drilling vertical well
Liu et al.Autonomous intelligent control of earth pressure balance shield machine based on deep reinforcement learning
CN120255354A (en) An intelligent analysis and remote control algorithm based on digital twins
CN116265708B (en) A method and system for monitoring obstruction and stuck conditions of logging instruments based on reinforcement learning
CN114326395B (en)Intelligent generator set control model online updating method based on working condition discrimination
CN119558468A (en) A horizontal subdivision capacity optimization method based on adaptive entropy deep reinforcement learning
Cao et al.Feature investigation on the ROP machine learning model using realtime drilling data
CN117171906A (en) Mechanical drilling speed prediction method based on adaptive dynamic constraint boundary coupling optimization
CN116446487B (en)Cutter suction dredger intelligent control system based on reinforcement learning
CN120124501B (en) A reinforcement learning method for horizontal well control assisted by evolutionary algorithm
CN112882381B (en)Self-optimizing decision control system of electric submersible pump
Liang et al.Fuzzy immune algorithm based remote wireless transmission for Throttled PID control strategy
CN120105874A (en) A method for increasing water injection rate in reinjection wells based on fractal network optimization
CN120597735A (en) A deep reinforcement learning-driven intelligent real-time optimization method for drilling parameters
Yang et al.Improving Rate-of-Penetration and Reducing State Fluctuation of Drill-String System Based on Multiobjective Optimization

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
