Disclosure of Invention
The invention designs Draft2Code, a front-end code automatic generation algorithm based on hand-drawn webpage images and a deep learning algorithm. The present invention relates to the following two points:
(1) the present invention refers to the model training algorithm of the architecture design of fig. 1. Feature extraction is performed on the input hand-drawn webpage image by a convolutional neural network, the input source code is encoded by a gated recurrent unit (GRU), and finally the output of the convolutional neural network and the output of the gated recurrent unit are combined and trained to obtain a model;
(2) the present invention designs a code generation algorithm with reference to the architecture of fig. 2. The method comprises the steps of performing feature extraction on the input hand-drawn webpage image with a convolutional neural network, generating a DSL from the trained model, and generating front-end engineering code according to a mapping relation, thereby realizing the code generation algorithm.
(1) Designing the front-end code automatic generation algorithm Draft2Code based on hand-drawn webpage images
The algorithm architecture mainly comprises 3 parts:
1) a CNN-based computer vision model, used to extract features from the input page design image;
2) a GRU-based language model, whose function is to encode the source-code feature sequence;
3) a GRU-based decoder, which operates on the combination of the image features obtained in 1) and the encoding obtained in 2) and predicts the next feature in the sequence.
According to the overall architecture of the system, at each time t, the image I is first input into the CNN-based visual model, which outputs an encoded vector p; simultaneously the feature vector x_t is input into the first, GRU-based language model, which outputs an encoded vector q_t. The visual encoding vector p and the language encoding vector q_t are concatenated into a vector r_t and input into a GRU-based decoder, which decodes the representations previously learned by the visual model and the language model into the feature vector y_t and assigns it to x_{t+1} for use at the next time step. The system is summarized by the following formulas:
p = CNN(I)
q_t = GRU(x_t)
r_t = [p, q_t]
y_t = softmax(GRU'(r_t))
x_{t+1} = y_t
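One decoding step of the five formulas above can be sketched with stand-in encoders (a minimal illustration, not the trained model; the 128-dimensional codes and the vocabulary size of 17, counting 7 layout types, 8 element types, &lt;START&gt; and &lt;END&gt;, are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_stub(image):
    # stand-in for the CNN visual encoder: maps the image to a fixed-length vector p
    return rng.standard_normal(128)

def gru_stub(x, dim):
    # stand-in for a GRU layer's output at one time step
    return rng.standard_normal(dim)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

VOCAB = 17  # assumed: 7 layout types + 8 element types + <START> + <END>

image = rng.standard_normal((256, 256, 3))  # input image I
x_t = np.zeros(VOCAB); x_t[0] = 1.0         # current feature vector (one-hot token)

p = cnn_stub(image)                  # p = CNN(I)
q_t = gru_stub(x_t, 128)             # q_t = GRU(x_t)
r_t = np.concatenate([p, q_t])       # r_t = [p, q_t]
y_t = softmax(gru_stub(r_t, VOCAB))  # y_t = softmax(GRU'(r_t))
x_next = y_t                         # x_{t+1} = y_t
```

The concatenation makes r_t carry both what the page looks like and what has been emitted so far, which is what the decoder conditions on.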
(2) Establishing the visual model
The CNN is widely applied in the vision field. As a type of multilayer perceptron, it has strong generalization ability thanks to its local connectivity and weight sharing, and is well suited to recognizing and detecting objects and graphics. In the design of the visual model, the invention adopts a CNN, in an unsupervised manner, to convert the input image into a learned fixed-length vector as output.
The input image is resized to a 256 × 256 color picture; the activation functions are all ReLU (Rectified Linear Unit), and only valid convolutions are performed, without padding at the boundary. The number of convolution kernels is set to 16 in the first layer, 32 in the second, 64 in the third, and 128 in the last. After the four convolutional layers, the vector p is output for subsequent processing.
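A shape walk-through of the four-layer visual model follows; only the filter counts (16/32/64/128), the 256 × 256 input, and the use of valid convolutions are stated above, so the 3 × 3 kernel size used here is an assumption for illustration:

```python
# Valid (unpadded) convolutions shrink each spatial dimension by kernel - 1.
# Kernel size 3 is assumed; the text only fixes input size and filter counts.
def valid_conv_out(size, kernel):
    return size - kernel + 1

size, channels = 256, 3          # 256x256 color input
for filters in (16, 32, 64, 128):
    size = valid_conv_out(size, 3)
    channels = filters
# after four layers: 248x248 spatial extent with 128 channels (under these assumptions)
```

The final activation volume is then flattened (or pooled) into the fixed-length vector p used by the decoder.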
(3) Establishing language model
From the viewpoint of the target output content, whether generating a .vue file based on the Vue framework or a .jsx file based on the React framework, an HTML-based syntax is required to describe the page layout. The invention adopts a lighter-weight language, a DSL (Domain-Specific Language), to participate in training. Unlike HTML, which must accommodate a wide variety of layout requirements, with tags such as <div> for block-level elements and <span> for inline elements, the DSL used herein targets fixed layout forms only, so only the following 7 layout element types are defined to represent block-level elements in different application scenarios:
Row - layout elements placed in the horizontal direction
Stack - layout elements placed in the vertical direction
Header - a layout element placed at the top of the page, occupying the full page width, typically containing content such as the page title or navigation links
Footer - a layout element placed at the bottom of the page, occupying the full page width, typically containing navigation links or contact information
Single - a card layout element nearly filling an entire row of the page
Double - a card layout element of which one row can hold two
Quad - a card layout element of which one row can hold four
Similarly, for elements such as block-level text elements and buttons, 8 types are set as a complement:
Btn-active - button in the activated state
Btn-inactive - button in the inactive state
Btn-success - confirm-operation button
Btn-warning - warning-operation button
Btn-danger - dangerous-operation button
Big-title - main title
Small-title - subtitle
Text - body text
This greatly simplifies the language, reducing both the search space and the vocabulary.
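As a hypothetical illustration (the concrete DSL syntax is not spelled out in this section), a page combining these element types might be described as:

```
header {
  btn-active, btn-inactive
}
row {
  single {
    small-title, text, btn-success
  }
}
footer {
  btn-danger
}
```

A description like this uses only the 15 element names plus nesting braces, in contrast to the open-ended tag and attribute vocabulary of HTML.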
In most markup languages, each element is represented by an opening tag, and when child elements are nested inside it, a closing tag of the parent element must be set so that a parser can understand the hierarchical relationship. A parent element often contains multiple children, which raises the question of where the closing tag is placed; this translates into handling long-range dependencies. Conventional RNN architectures suffer from gradient explosion and gradient vanishing and cannot process long-sequence data, so the GRU, a gated variant related to the LSTM, consisting of 2 layers of GRU neural networks each containing 128 cells, is introduced herein to model such long-sequence data.
New memory in the GRU
The GRU learns memory information through recurrent connections. The vector x_t input at time t and the output vector q_{t-1} produced at the previous step are multiplied by weights and activated by the sigmoid function, i.e., according to the formulas

z_t = σ(W_z · [q_{t-1}, x_t])
r_t = σ(W_r · [q_{t-1}, x_t])

two gate values are obtained: the update gate weight z_t and the reset gate weight r_t, where σ is the sigmoid activation function. After q_{t-1} is multiplied by the weights and by the reset gate r_t, the new (candidate) memory is obtained according to the formula

q̃_t = tanh(W · [r_t * q_{t-1}, x_t])

where W_z is the weight matrix from the hidden layer to the update gate, W_r is the weight matrix from the hidden layer to the reset gate, W is the weight matrix from the hidden layer to the candidate state, and * denotes element-wise multiplication. Finally, the output vector q_t of the current step is obtained according to the formula

q_t = (1 - z_t) * q_{t-1} + z_t * q̃_t

For each time t, the vector q_t is output for subsequent processing.
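The gate and memory updates above can be sketched as a single GRU step in numpy (a minimal illustration with toy dimensions and random weights; the model itself uses 128 cells per layer, and bias terms are omitted as in the formulas):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(q_prev, x_t, Wz, Wr, W):
    """One GRU step following the update-gate / reset-gate formulas above.
    Each weight matrix maps the concatenation [q_prev, x_t] to the hidden size."""
    h = np.concatenate([q_prev, x_t])
    z_t = sigmoid(Wz @ h)                                        # update gate
    r_t = sigmoid(Wr @ h)                                        # reset gate
    q_tilde = np.tanh(W @ np.concatenate([r_t * q_prev, x_t]))   # candidate memory
    q_t = (1 - z_t) * q_prev + z_t * q_tilde                     # step output
    return q_t

# toy dimensions for the sketch (hidden size 4, input size 3)
hid, inp = 4, 3
rng = np.random.default_rng(1)
Wz = rng.standard_normal((hid, hid + inp))
Wr = rng.standard_normal((hid, hid + inp))
W = rng.standard_normal((hid, hid + inp))
q = gru_step(np.zeros(hid), rng.standard_normal(inp), Wz, Wr, W)
```

Because the candidate memory passes through tanh, each component of the step output stays within [-1, 1] when the previous state is zero.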
(4) Building a decoder
The visual encoding vector p and the language encoding vector q_t at time t are concatenated into the vector r_t and input into a second, GRU-based decoder model; this model consists of 2 layers of GRU neural networks each containing 512 cells and is used to decode the representations learned by the visual model and the language model.
Training stage:
The model is trained using a supervised learning approach. To better balance long-term dependencies against computational cost, a sliding window of length 48 is used to segment each DSL input file used for training, obtaining feature sequences. At each time step, a hand-drawn image I and the corresponding feature sequence x_t are input, and the predicted next feature y_t is output. The model uses the cross-entropy cost as its loss function, which compares the model's predicted next feature y_t with the actual next feature x_{t+1}.
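The sliding-window segmentation can be sketched as follows (token values are placeholder integers, and the null-token padding scheme for short contexts is an assumption of the sketch):

```python
# Segment a token sequence into (context, next-token) training pairs using a
# sliding window of length 48, as described above. Contexts shorter than the
# window are left-padded with a null token (0), mirroring empty-vector slots.
def make_windows(tokens, T=48):
    pairs = []
    for i in range(len(tokens) - 1):
        ctx = tokens[max(0, i - T + 1): i + 1]
        ctx = [0] * (T - len(ctx)) + ctx
        pairs.append((ctx, tokens[i + 1]))
    return pairs

pairs = make_windows(list(range(1, 61)))  # a 60-token toy sequence
```

Each pair supplies one supervised example: the model sees the 48-token context (plus the image) and is scored on predicting the next token.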
Since the context for training is updated through the sliding window at each time step, the same input image I is reused for the samples associated with the same page style. Finally, two markers are defined, <START> and <END>, used as placeholder marks for the DSL file's prefix and suffix respectively, to be replaced by the concrete prefix and suffix content in the subsequent compilation process.
Training is performed by computing, with backpropagation, the partial derivatives of the loss with respect to the network weights so as to minimize the multi-class log loss; the loss is computed as follows:

L(I, X) = - Σ_t x_{t+1} · log(y_t)
In the above formula, x_{t+1} is the input vector at the next time step and y_t is the output vector at the current time step. Training uses the RMSProp (Root Mean Square Propagation) algorithm with the learning rate set to 1 × 10^-4, and the output gradient is clipped to the range [-1.0, 1.0] to counter numerical instability. To prevent overfitting of the model, Dropout regularization is introduced: a dropout rate of 0.3 is set after the fully connected layer of the visual model, i.e., 30% of that layer's neurons are randomly dropped in each training pass, so that the model depends less on particular local features and generalizes better. Training uses mini-batches of 64 image-sequence groups per batch. After training, a relation model between the image data and the associated feature sequences expressed in DSL code is established.
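The multi-class log loss described above, comparing each predicted distribution y_t with the one-hot actual next feature x_{t+1}, can be sketched numerically (toy vocabulary of 3 tokens over two time steps):

```python
import numpy as np

# Cross-entropy summed over time: -sum_t x_{t+1} . log(y_t)
def draft2code_loss(y_pred, x_next):
    """y_pred: (T, V) rows of softmax outputs; x_next: (T, V) one-hot targets."""
    return -np.sum(x_next * np.log(y_pred))

y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
x_next = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
loss = draft2code_loss(y_pred, x_next)  # -(log 0.7 + log 0.8)
```

The loss is minimized when each y_t puts probability mass on the token that actually follows, which is exactly the training signal described above.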
Testing stage:
To generate the DSL code, a hand-drawn webpage image I and a context sequence X of 48 features are input into the Draft2Code model described above. x_1 ... x_{T-1} are initialized to empty (null) vectors and the last element x_T is set to <START>. The predicted feature vector y_t is then used to update the next context feature sequence: x_t ... x_{T-1} are set to x_{t+1} ... x_T respectively, and x_T is set to y_t. This process repeats until the model generates the marker <END>. Finally, the generated DSL feature sequence is compiled into the required target language using conventional compilation methods. The whole process is shown in fig. 2.
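The testing-stage sampling loop can be sketched as follows (the stand-in model, which emits a fixed token script, is hypothetical; in the real system each token is the argmax of the predicted y_t):

```python
# Greedy generation loop: keep a context window of T = 48 features,
# slide it left each step, and stop when <END> is produced.
def generate(model, image, T=48, end_token="<END>", max_len=200):
    context = [None] * (T - 1) + ["<START>"]   # x_1..x_{T-1} null, x_T = <START>
    out = []
    while len(out) < max_len:
        tok = model(image, context)            # predicted next feature/token
        if tok == end_token:
            break
        out.append(tok)
        context = context[1:] + [tok]          # shift window, append prediction
    return out

# hypothetical stand-in model that emits a short DSL-like sequence then <END>
script = iter(["header", "{", "btn-active", "}", "<END>"])
tokens = generate(lambda img, ctx: next(script), image=None)
```

The returned token sequence is what the conventional compilation step then turns into target-language code.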
Aiming at the code specification requirements of different frameworks, the invention writes several mapping relations between the DSL and front-end code and stores them in json-format files, whose content replaces the generated DSL so as to meet development requirements. For all replacement content, three replacement markers are proposed. Braces ({}) replace child-element content; for example, if a <div> element contains a <button> element, the button element replaces the braces inside the div element. Square brackets ([]) replace randomly generated text; the characters in the hand-drawn draft are not analyzed, so some text can be randomly generated and placed into elements, such as titles, whose main content is text. Parentheses (()) replace attributes in the element tag, mainly event bindings; for example, for click-event binding in Vue, the @click attribute content in the tag is replaced: the invention counts the number of buttons, generates empty method functions in sequence as placeholders, and binds them in turn to the attribute of each button.
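A minimal sketch of this mapping-based replacement follows; the mapping entries and the generated handler name are hypothetical (the real mappings are stored per framework in json files and are not reproduced in this section):

```python
# Hypothetical DSL -> front-end-code mapping. Per the three markers above:
# "{}" is the child-content slot, "[]" a random-text slot, "()" an attribute slot.
MAPPING = {
    "header":     '<div class="header">{}</div>',
    "big-title":  '<h1>[]</h1>',
    "btn-active": '<button ()>[]</button>',
}

def render(name, children=""):
    tpl = MAPPING[name]
    tpl = tpl.replace("{}", children)
    tpl = tpl.replace("[]", "lorem")         # random-text stub (fixed for the demo)
    tpl = tpl.replace("()", '@click="fn0"')  # sequentially generated empty handler
    return tpl

html = render("header", render("btn-active"))
```

Nesting is handled by rendering children first and substituting the result into the parent's {} slot, mirroring the div/button example above.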
By converting a hand-drawn webpage design image into engineered front-end code, the requirement of automatic webpage code generation can be met in front-end projects where no webpage screenshot or professional design drawing is available for reference; at the same time, modular, standard code conforming to the Vue and React frameworks can be output, which facilitates secondary development by engineers and markedly improves working efficiency. The BLEU (bilingual evaluation understudy) score of the Draft2Code model system reaches 7.7 points, and the webpage corresponding to the hand-drawn draft can be generated relatively accurately.
The core technology of this patent includes:
(1) A DSL with simplified HTML-like grammar is introduced to optimize the training process, reducing to a certain extent the amount of data required for training;
(2) A front-end code automatic generation algorithm (Draft2Code) for hand-drawn webpage images is constructed, which can accurately output code conforming to front-end engineering practice and facilitates further development.