CN117608656A


Publication number: CN117608656A
Application number: CN202311503813.8A
Authority: CN
Inventors: 李莹; 陈子豪; 周郅俊; 刘佳豪; 陈龙; 斯炘; 赵新奎; 尹建伟
Original assignee: Zhejiang University ZJU; Hundsun Technologies Inc
Current assignee: Zhejiang University ZJU; Hundsun Technologies Inc
Priority date: 2023-11-13
Filing date: 2023-11-13
Publication date: 2024-02-27

Abstract


The invention discloses a hybrid front-end framework migration method based on AST and LLM, comprising the following steps: (1) load the original front-end framework project, identify the project files, and divide the functional project code into logic processing code and user interface definition code; (2) perform lexical analysis, syntax analysis, and semantic extraction; (3) process the code with a code rewriter to generate intermediate code; (4) generate a customized large language model (LLM) through incremental training and optimization, use it to obtain a functional description of the code in the original project files, and then generate target framework code; (5) based on the intermediate code, use the customized LLM to generate a syntax error report and handling suggestions, and use the target framework code to optimize the intermediate code into the final migrated code. The invention improves migration accuracy and also provides syntax error reports and handling suggestions.

Description


A hybrid front-end framework migration method based on AST and LLM

Technical field

The invention belongs to the field of code migration, and in particular relates to a hybrid front-end framework migration method based on AST and LLM.

Background

Among client system products at financial institutions, most use UI frameworks built on Microsoft WPF. WPF is tightly coupled to the Windows operating system and cannot run on Linux-based domestic operating systems. There is currently no solution to WPF's incompatibility with other operating systems.

Migrating desktop financial software to a domestic operating system amounts to a complex and comprehensive software conversion project. Its complexity stems from the fact that it is almost equivalent to rebuilding the entire software system, which inevitably demands a large investment of staff and time.

More and more companies are attempting automated code migration. For example, the Chinese patent document with publication number CN116204208A proposes a code migration method and device for front-end framework upgrades, and the patent document with publication number CN115951890A proposes a method, system, and device for converting code between different front-end frameworks. Automating the migration process not only reduces the staff required but is also expected to significantly shorten the time the whole migration takes.

Large language models (LLMs), such as GPT-3.5 and similar models, have made significant progress in natural language processing and can generate text, answer questions, and hold conversations. LLMs also have great potential for code generation and can be used for tasks such as automatic code generation and programming assistance. It is therefore feasible to introduce an LLM into the migration of desktop software.

However, code generation by large language models has several shortcomings:

1) Mature software projects contain a huge amount of code, while the LLM interface limits the number of tokens (the smallest text units the model processes) per request, so in many cases the complete contents of the project files cannot be uploaded to the LLM directly.

2) Front-end framework versions are updated continuously, and the control libraries and APIs change with them. An LLM is pre-trained on historical data, so the information it produces is time-bound and it cannot generate highly accurate content that reflects the latest data. The generated code snippets or suggestions may therefore be inaccurate or inapplicable to current front-end development needs.

3) Unlike generating general-purpose text, processing only the uploaded code fragments is not enough to generate migrated code accurately. The structure and logic of code demand deeper contextual understanding to ensure that the generated migrated code stays semantically and functionally consistent with the original.

4) For domain-specific information, the training data of an LLM may not cover a company's internal development processes, frameworks, and content. To obtain accurate responses, this domain-specific content and the related context must be supplied.

Summary of the invention

The present invention provides a hybrid front-end framework migration method based on AST and LLM, which improves migration accuracy and also provides syntax error reports and handling suggestions.

A hybrid front-end framework migration method based on AST and LLM comprises the following steps:

(1) Load the original front-end framework project, identify the project files, and divide the functional project code into logic processing code and user interface definition code;

(2) Perform lexical analysis, syntax analysis, and semantic extraction, with the following specific steps:

(2-1) Lexical analysis: decompose the logic processing code and the user interface definition code into a series of lexical units (tokens);

(2-2) Syntax analysis: building on the lexical analysis, organize the lexical units into an abstract syntax tree (AST) that represents the hierarchical structure of the code and the nesting relationships between code elements;

(2-3) Semantic analysis: building on the AST, perform semantic analysis including type checking, scope analysis, symbol resolution, and inheritance and implementation analysis;

(3) Process the code with a code rewriter, with the following specific steps:

(3-1) Based on the control libraries of the original front-end framework and the migration target framework, write implementation code for each method and property of the control libraries;

(3-2) From the correspondence between the properties and method calls of each control before and after migration, derive migration rules grouped by complexity; the migration rules are divided into single-point migration, single-statement migration, and complex-structure migration;

(3-3) Implement the code rewriter from the semantically analyzed AST model and the migration rules, and generate intermediate code;

(4) Generate a customized LLM through incremental training and optimization, use the customized LLM to obtain a functional description of the code in the original project files, and then generate target framework code;

(5) Based on the intermediate code, generate a syntax error report and handling suggestions with the customized LLM; use the target framework code to optimize the intermediate code and generate the final migrated code.

The code migration method of the present invention processes code separately according to its function, performs semantic analysis, builds a semantic model, and fuses the intermediate code generated by the code rewriter with the target framework code generated by the LLM to produce the final migrated code. This addresses the tedious, labor-intensive nature of manual migration, automates the migration, improves migration efficiency, and reduces workload and cost.

The specific process of step (1) is:

Use a code compilation and analysis tool such as Roslyn to load and precompile the entire project and identify the contents of the project files. Classify the files as logic processing files, user interface definition files, resource files, configuration files, or other files. Extract the logic processing code files and the user interface definition code files, and generate the mapping between them.
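The classification and mapping in step (1) can be sketched as follows. This is a minimal illustration, not the Roslyn-based loader the text describes: the extension table and the `X.xaml` to `X.xaml.cs` code-behind naming rule are assumptions made for a WPF-style project, and the names `classify` and `map_ui_to_logic` are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed extension-to-category table for a WPF-style project.
CATEGORY_BY_SUFFIX = {
    ".cs": "logic",
    ".xaml": "ui",
    ".resx": "resource",
    ".config": "config",
}

def classify(paths):
    """Group project file paths into the five categories named in step (1)."""
    groups = {"logic": [], "ui": [], "resource": [], "config": [], "other": []}
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        groups[CATEGORY_BY_SUFFIX.get(suffix, "other")].append(path)
    return groups

def map_ui_to_logic(groups):
    """Pair each UI file with its code-behind file (assumed rule: X.xaml -> X.xaml.cs)."""
    logic = set(groups["logic"])
    return {ui: ui + ".cs" for ui in groups["ui"] if ui + ".cs" in logic}

files = ["App.xaml", "App.xaml.cs", "Views/Main.xaml", "Views/Main.xaml.cs",
         "Models/Order.cs", "Strings.resx", "App.config", "README.md"]
groups = classify(files)
mapping = map_ui_to_logic(groups)
```

A real loader would classify files from the project and compilation model rather than from extensions alone; the table lookup only shows the shape of the output.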

In step (2-1), the logic processing code is decomposed into keywords, identifiers, operators, constants, and so on, and the user interface definition code is decomposed into tag names, attribute names, and attribute values.

In step (2-3), the type checking phase determines the data types of variables, expressions, and values in the code; the scope analysis phase identifies the scopes of variables, functions, and classes and checks that variables are declared and used within scope; the symbol resolution phase builds a symbol table that stores information about variables, functions, and classes, and is used to identify symbols and look up their declarations and references; the inheritance and implementation analysis phase analyzes the inheritance relationships between classes and the interfaces they implement.

In step (3-2), single-point migration is performed by traversing the syntax node tree and updating nodes; single-statement migration is performed by matching target statements with regular expressions and extracting the call information; complex-structure migration is performed rule by rule, by modifying the syntax tree.
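Single-point migration by syntax tree traversal can be illustrated with a small sketch. It uses Python's built-in `ast` module as a stand-in for the C# tooling the text refers to, and the rename table (`Content` to `Text`, `ShowDialog` to `Show`) is a made-up example rather than a rule taken from the patent.

```python
import ast

# Hypothetical single-point rules: old member name -> new member name.
MEMBER_RENAMES = {"Content": "Text", "ShowDialog": "Show"}

class SinglePointRewriter(ast.NodeTransformer):
    """Rename individual attribute nodes while walking the syntax tree."""

    def visit_Attribute(self, node):
        self.generic_visit(node)  # rewrite nested attribute chains first
        if node.attr in MEMBER_RENAMES:
            node.attr = MEMBER_RENAMES[node.attr]
        return node

def migrate(source):
    """Parse, rewrite matching nodes, and print the tree back as source."""
    tree = SinglePointRewriter().visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+

migrated = migrate('button.Content = "OK"\nwindow.ShowDialog()')
```

Each rule touches exactly one node, so the surrounding statement structure is left as it was.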

In step (3-3), the code rewriter modifies, moves, and reorganizes code by operating on AST nodes. According to the migration rules, it is divided into a regular-rule rewriter, a special-rule rewriter, and a follow-up optimization rewriter.

This process consists of extracting the control library contents of the source and target frameworks, generating migration rules, refactoring code, and association processing.

In the control library content extraction stage, a crawler framework crawls the official documentation of the source framework library and the migration target library, along with open-source projects and community Q&A, and the crawled information is then processed. In addition, to ensure that migration rules correspond one-to-one at the declaration space, control, class, method, and statement body levels, the methods and properties of each front-end control in the original framework are implemented against the corresponding methods and properties of the target framework.

In the migration rule generation stage, the crawled content and the control library code of the original and target frameworks are compared. Coarse-grained matching is performed level by level (project file, control, declaration space, class, method body, and so on), and fine-grained mapping is performed at the statement body and identifier levels, producing statement-level migration rules.

Specifically, we have identified three types of migration rules: (1) Single-point migration. These rules migrate a single syntax node in the AST, for example renaming a variable, property, or function. For this type we use a code rewriter that traverses the syntax tree and rewrites nodes. (2) Single-statement migration. These rules migrate a single statement; we use a code rewriter that walks the code statement by statement and applies regular expressions for matching and migration. (3) Complex-structure migration. These migrations tend to involve large, intricate changes, so each one is handled as a special case by modifying the AST directly.
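The single-statement category can be sketched as a statement-by-statement pass with regular expressions. The rule below is purely illustrative: `Dialogs.ShowMessageAsync` is a hypothetical target API, and real rules would be derived from the control library comparison described above.

```python
import re

# Hypothetical statement-level rule: (pattern, replacement template).
RULES = [
    (re.compile(r'^(?P<indent>\s*)MessageBox\.Show\((?P<args>.*)\);\s*$'),
     r'\g<indent>await Dialogs.ShowMessageAsync(\g<args>);'),
]

def migrate_statements(source):
    """Apply the first matching rule to each statement; leave other lines untouched."""
    migrated = []
    for line in source.splitlines():
        for pattern, template in RULES:
            if pattern.match(line):
                line = pattern.sub(template, line)
                break
        migrated.append(line)
    return "\n".join(migrated)

original = '    MessageBox.Show("Saved");\n    int x = 1;'
result = migrate_statements(original)
```

Treating each line as one statement is a simplification; statements that span several lines would need the syntax tree to find their boundaries.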

In the code refactoring stage, the semantic model generated when the original front-end framework project was loaded is used to locate the statements to be migrated by their AST nodes. Statements are located by regular expression matching, node-by-node matching with code analysis tools, and similar means, and each located statement is converted into the corresponding target framework statement.

In the association processing stage, the symbol table is searched recursively to locate the declarations and calls of related variables, functions, and classes in the surrounding context. For example, starting from a method declared in a class, the inheritance chain of that class is followed to find where the method was first declared, and the related content is modified there.
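The lookup of a method's first declaration can be sketched with a toy symbol table. The table layout (a dictionary from class name to base class and declared methods) and the class names are assumptions for illustration; a real implementation would query the compiler's semantic model. The walk is written as a loop rather than recursion, which gives the same result for single inheritance.

```python
# Hypothetical symbol table: class name -> base class and declared methods.
SYMBOLS = {
    "Control": {"base": None, "methods": {"Render"}},
    "ButtonBase": {"base": "Control", "methods": {"OnClick"}},
    "Button": {"base": "ButtonBase", "methods": {"OnClick", "Render"}},
}

def first_declaring_class(symbols, class_name, method):
    """Walk up the inheritance chain; return the topmost class that declares `method`."""
    found = None
    current = class_name
    while current is not None:
        entry = symbols[current]
        if method in entry["methods"]:
            found = current
        current = entry["base"]
    return found

def inherits_from(symbols, class_name, root):
    """True if `class_name` is `root` or one of its descendants."""
    while class_name is not None:
        if class_name == root:
            return True
        class_name = symbols[class_name]["base"]
    return False

def classes_to_update(symbols, root, method):
    """The first declaring class plus every descendant that redeclares the method."""
    return sorted(name for name, entry in symbols.items()
                  if method in entry["methods"] and inherits_from(symbols, name, root))

origin = first_declaring_class(SYMBOLS, "Button", "OnClick")
targets = classes_to_update(SYMBOLS, origin, "OnClick")
```

Renaming at `origin` and then at every entry in `targets` is one way to keep the declaration and its overrides consistent.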

The specific process of step (4) is:

(4-1) Load external materials, related documents, and summarized migration rules as data sources in a form the large model can read; split the data into chunks of a specified size, store them as embeddings in a vector database, and pass them to the large model to obtain the customized LLM;

(4-2) Count the tokens in all code to be migrated, determine whether the token limit is exceeded, and split the original project files based on the semantic analysis information;

(4-3) Based on the correspondence between logic processing code and user interface definition code, and on the contextual call relationships, merge the related code and submit it to the customized LLM to obtain a functional description of the code in the original project files;

(4-4) Based on the functional description, call the customized LLM to generate the target framework code.
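The token check and splitting in step (4-2) can be sketched as packing semantic units (for example, whole methods) into chunks that fit a budget. Whitespace counting is a crude stand-in for a real tokenizer, and the function names and the budget value are hypothetical.

```python
def estimate_tokens(text):
    """Crude stand-in for a real tokenizer: count whitespace-separated pieces."""
    return len(text.split())

def split_by_budget(units, budget):
    """Pack semantic units into chunks without cutting through any unit."""
    chunks, current, used = [], [], 0
    for unit in units:
        cost = estimate_tokens(unit)
        if current and used + cost > budget:
            chunks.append(current)  # close the chunk before it overflows
            current, used = [], 0
        current.append(unit)
        used += cost
    if current:
        chunks.append(current)
    return chunks

units = [
    "void A() { x = 1; }",
    "void B() { y = 2; z = 3; }",
    "void C() { }",
]
chunks = split_by_budget(units, budget=15)
```

A unit that is larger than the budget on its own still ends up as a single chunk here; splitting it further would need finer-grained boundaries from the semantic analysis.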

In step (4-1), the data sources are split into document chunks of a specified size and stored as embeddings in a vector database. When a prompt is entered, the chunks most similar to the question in the prompt are retrieved from the vector database by comparing cosine similarity, the retrieved chunks are passed to the LLM, and the answer is generated from a prompt that contains both the question and the chunks.
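The retrieval flow of step (4-1) can be sketched without any external service. In this sketch a bag-of-words count vector stands in for a real embedding model and a Python list stands in for the vector database; both are assumptions made to keep the example self-contained, and all names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_into_chunks(text, size):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (chunk text, embedding)

    def add(self, chunks):
        self.entries.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def build_prompt(question, store, k=1):
    """Combine the retrieved chunks and the question into one prompt."""
    context = "\n".join(store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"

DOCS = ("Button Content property maps to Text in the target framework. "
        "Window ShowDialog maps to Show with an owner argument. "
        "Grid layout is unchanged.")
store = VectorStore()
store.add(split_into_chunks(DOCS, 10))
prompt = build_prompt("how does ShowDialog map", store)
```

The resulting `prompt` is what would be sent to the model; the model call itself is left out.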

In step (5), the intermediate code is derived from the content of the original framework code, while the target framework code is derived from the function of the original framework code. Using the target framework code to optimize the intermediate code preserves the structure of the original code while repairing the syntactically invalid parts that the code rewriter could not handle; an error log and optimization suggestions are generated for developers to follow up on.

Compared with the prior art, the present invention has the following beneficial effects:

Based on the project architecture, the syntax model, and the call relationships observed at run time, the method of the present invention analyzes the symbols and call information involved in the migration at the granularity of abstract syntax tree nodes, statements, code blocks, namespaces, and project files, and migrates automatically to obtain intermediate code. It then uses the large model to identify the statements in the source code files and analyze their functions and relationships, and uses the result as a reference to optimize the intermediate code into the final migrated code. This improves migration accuracy, and the method also provides test cases, marks locations that need special modification, and gives modification suggestions.

Description of the drawings

Figure 1 is a flow chart of the hybrid front-end framework migration method based on AST and LLM according to the present invention;

Figure 2 is a diagram of the classification of migration rules and how each class is processed in the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. The embodiments described below are intended to aid understanding of the present invention and do not limit it in any way.

As shown in Figure 1, the hybrid front-end framework migration method based on AST and LLM involves two kinds of processing: processing of the original front-end project files, and processing of external materials such as official documentation and open-source projects.

Processing of the original front-end project files has three steps: S1 project file loading, S2 semantic analysis and model construction, and S3 intermediate code generation. Processing of external materials such as official documentation and open-source projects consists of S4 model customization, S5 code file processing, S6 chained calls, and S7 final code generation.

S1. Project file loading. The specific steps are as follows:

S11. Identify and read the project, analyze the project structure, and generate a project structure tree.

S12. Divide the project files into logic processing code files, user interface definition code files, and other files.

S2. Semantic analysis and model construction. The specific steps are as follows:

S21. Perform lexical analysis and syntax analysis on the logic processing code files. Lexical analysis decomposes the source text into a series of lexical units such as keywords, identifiers, and operators. Syntax analysis organizes the lexical units into an abstract syntax tree that captures the hierarchical structure of the code and the nesting relationships between code elements. Semantic analysis is performed on the abstract syntax tree and includes type checking, scope analysis, member access, and inheritance relationships.

S22. Perform lexical analysis and syntax analysis on the user interface definition code. Lexical analysis decomposes the source file into a series of lexical units such as tag names, attribute names, and attribute values. Syntax analysis organizes element nodes, document nodes, attribute nodes, text nodes, and so on into a document object model tree. The structure of the user interface definition is obtained from this tree so that its contents can be modified dynamically.

S23. Perform semantic analysis on the abstract syntax tree, carrying out a series of checks and transformations to ensure the semantic correctness of the source code. The analysis applied to the source code includes type checking, scope analysis, constant folding, function and procedure call checking, type inference, reference resolution, exception handling, resource management, type system extension, and compile-time optimization.

In step S2, the processing of logic processing files consists of lexical analysis, syntax analysis, and semantic analysis. The lexical analysis stage splits the source text into a series of lexical units such as keywords, identifiers, operators, and constants. These lexical units are the basic building blocks of the code.

The syntax analysis stage organizes the lexical units into an abstract syntax tree (AST). The AST reflects the hierarchical structure and syntactic relationships of the code. Each node represents a code element such as an expression, statement, function, or class.

The semantic analysis stage is concerned not only with the structure of the code but also with its meaning, and performs type checking, scope analysis, symbol resolution, and inheritance and implementation analysis. Type checking determines the data types of variables, expressions, and values in the code. Scope analysis identifies the scopes of variables, functions, classes, and so on, and checks that variables are declared and used within scope. Symbol resolution builds a symbol table that stores information about variables, functions, classes, and so on, and is used to identify symbols and look up their declarations and references. Inheritance and implementation analysis examines the inheritance relationships between classes and the interfaces they implement.
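A symbol table with scope-qualified declarations, references, and base classes can be sketched with Python's built-in `ast` module. Python source is used here only as a convenient stand-in for the logic processing code, and the `SymbolCollector` class is a hypothetical name; type checking is not covered by this sketch.

```python
import ast

SOURCE = """\
class Base:
    def render(self):
        return 1

class Button(Base):
    def render(self):
        total = 2
        return total
"""

class SymbolCollector(ast.NodeVisitor):
    """Collect declarations (qualified by scope), references, and base classes."""

    def __init__(self):
        self.scope = []          # stack of enclosing class/function names
        self.declarations = {}   # qualified name -> (kind, line)
        self.references = {}     # plain name -> list of lines where it is read
        self.bases = {}          # qualified class name -> list of base names

    def _qualified(self, name):
        return ".".join(self.scope + [name])

    def _enter(self, node, kind):
        self.declarations[self._qualified(node.name)] = (kind, node.lineno)
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    def visit_ClassDef(self, node):
        self.bases[self._qualified(node.name)] = [
            base.id for base in node.bases if isinstance(base, ast.Name)]
        self._enter(node, "class")

    def visit_FunctionDef(self, node):
        self._enter(node, "function")

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            # First assignment in a scope counts as the declaration.
            self.declarations.setdefault(self._qualified(node.id),
                                         ("variable", node.lineno))
        else:
            self.references.setdefault(node.id, []).append(node.lineno)

collector = SymbolCollector()
collector.visit(ast.parse(SOURCE))
```

After the visit, `collector.declarations` distinguishes `Base.render` from `Button.render`, which is the kind of scope information the rewriter needs before it renames anything.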

User interface definition files are usually written in a declarative markup language and contain the hierarchy, styles, data bindings, and interactions of the user interface elements. Their processing consists of lexical analysis, syntax analysis, element parsing, attribute parsing, event association, and style and template parsing.

The lexical analysis stage decomposes the source text into a series of lexical units such as element names, attribute names, and attribute values. The syntax analysis stage organizes the lexical units into a document object model (DOM), which represents every element, attribute, text node, and other piece of content in the document as a tree.

The element parsing stage identifies the name, type, and nesting relationships of each DOM node. The attribute parsing stage identifies the attribute names and values of each element. The data binding parsing stage identifies the source, target, and mode of each binding. The event association stage identifies event handlers and associates each event with its handler method in the logic processing code. The style and template parsing stage parses the defined styles and templates into the corresponding style and template objects. The resource parsing stage resolves elements that reference external resources to obtain the referenced source.
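Element, binding, and event extraction can be sketched with the standard library XML parser. The markup below is a simplified, namespace-free snippet written for this example, and the set of event attribute names is an assumption; real XAML would also need namespace handling and a fuller binding grammar.

```python
import xml.etree.ElementTree as ET

# Assumed set of attribute names that denote events.
EVENT_ATTRIBUTES = {"Click", "Loaded", "TextChanged"}

MARKUP = """
<Window Title="Orders" Loaded="OnLoaded">
  <StackPanel>
    <TextBox Text="{Binding CustomerName}" />
    <Button Content="Save" Click="OnSaveClick" />
  </StackPanel>
</Window>
"""

def analyse(markup):
    """Return element nesting, event handlers, and data bindings found in the markup."""
    root = ET.fromstring(markup.strip())
    nesting, events, bindings = [], [], []

    def walk(element, depth):
        nesting.append((depth, element.tag))
        for name, value in element.attrib.items():
            if name in EVENT_ATTRIBUTES:
                events.append((element.tag, name, value))      # event -> handler
            elif value.startswith("{Binding"):
                path = value[len("{Binding"):].strip(" }")
                bindings.append((element.tag, name, path))     # target <- source
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return nesting, events, bindings

nesting, events, bindings = analyse(MARKUP)
```

The handler names in `events` are the link back to methods in the logic processing code, which is what the event association stage needs.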

S3. Intermediate code generation. The specific steps are as follows:

Build one-to-one mappings between the classes, properties, and methods of the source and target frameworks as described in the official documentation. Based on how the properties and methods of the basic controls in the control libraries correspond, generate the migration rules and the processing logic code for each rule.

Using the generated abstract syntax tree and the semantic analysis data, the code rewriter traces type definitions and resolves references so that type names, property names, method names, and so on are modified in the base class, and it then visits every place where the class is used and modifies each one in turn. The code rewriter determines the locations to modify from the abstract syntax tree, matches at tree node, statement, and code block granularity, applies the modifications together with any cascading modifications, and produces the intermediate code.

Step S3 consists of extracting the control library contents of the source and target frameworks, generating migration rules, refactoring code, and association processing.

As shown in Figure 2, three types of migration rules are identified: (1) Single-point migration. These rules migrate a single syntax node in the AST, for example renaming a variable, property, or function. This type uses a code rewriter that traverses the syntax tree and rewrites nodes. (2) Single-statement migration. These rules migrate a single statement, using a code rewriter that walks the code statement by statement and applies regular expressions for matching and migration. (3) Complex-structure migration. These migrations tend to involve large, intricate changes, so each one is handled as a special case by modifying the AST directly.

S4. Model customization. The specific steps are as follows:

Because the LLM lacks domain-specific training data, its pre-training data is time-bound, and material in this field is updated continuously, the content it generates for the vertical domain of front-end framework code migration is often inaccurate. The process of calling the LLM therefore needs to be wrapped, and it is divided into: connecting to an external knowledge base, model fine-tuning, incremental training to optimize the model, input preprocessing, and chained calls.

To connect to the external knowledge base, the llama-index library is used: the SimpleDirectoryReader data loader loads the continuously updated official documentation files for the whole framework, and all documents are passed to GPTSimpleVectorIndex to build an index. The documents are segmented, converted to vectors, and stored as an index; querying the index brings the external material in so that corresponding answers can be generated.

Because of the token limit, the complete external material cannot be imported directly. The material is split with SpacyTextSplitter and a summary is generated for each segment; the summaries are then summarized in turn, building a tree-shaped index. Each node in the tree is a summary of the content of its subtree, and the root node is the summary of the entire document.
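The tree-shaped summary index can be sketched as a bottom-up construction. The `summarize` function below is a placeholder that simply truncates text; in the described method this would be an LLM call. The fan-out of two and all names are assumptions made for the example.

```python
def summarize(texts):
    """Placeholder for an LLM summary call: join the inputs, keep the first eight words."""
    return " ".join(" ".join(texts).split()[:8])

def build_summary_tree(chunks, fanout=2):
    """Summarize chunks, then summarize groups of summaries until one root remains."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [{"summary": summarize([chunk]), "children": []} for chunk in chunks]
    while len(level) > 1:
        level = [
            {"summary": summarize([node["summary"] for node in level[i:i + fanout]]),
             "children": level[i:i + fanout]}
            for i in range(0, len(level), fanout)
        ]
    return level[0]

def count_leaves(node):
    """Number of original chunks covered by this node."""
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])

root = build_summary_tree([
    "Button maps to Button.",
    "Window maps to Window.",
    "Grid is unchanged.",
    "Events use handlers.",
])
```

Every inner node summarizes only its own subtree, so a query can start at the root and descend into the branch whose summary looks relevant.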

Official documentation often contains tables, images, and similar content, so a multimodal model is introduced to recognize images. The ImageParser class generates text from images using an OCR model; by specifying a FileExtractor, the corresponding images are parsed into text by ImageParser and finally converted into vectors for retrieval.

In model fine-tuning, the original model is trained further on the supplied external materials, related documents, and summarized in-house manual migration rules, changing the original parameters to obtain a new fine-tuned model.

S5. Code file processing, the specific steps are as follows:

The content of each project file is split by statement body, and the parameters and functional information of each code segment are generated to produce a summary of that segment. The segment summaries are then summarized in turn to build a tree index in which each node is a summary of the code segments in its subtree; the root node is the summary of the functions and parameters of the project file.

S6. Chained calls, the specific steps are as follows:

In chained calls, the LLM provides two core interfaces, Completion and Embedding. The prompt includes the history of previous requests so that the LLM can generate content correctly from context. The embeddings are saved to an index in advance, and over multiple rounds of dialogue the answer returned in the previous round is used as new input, gradually determining the function of each project file.
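The multi-round loop described above can be sketched as follows. This is an illustration under assumptions: `chained_describe` is a hypothetical helper, the prompt layout is invented for the example, and `complete` is any callable that stands in for the LLM Completion interface.

```python
from typing import Callable, List, Tuple


def chained_describe(file_summary: str,
                     questions: List[str],
                     complete: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Ask the questions in order, feeding each answer into the next round."""
    history: List[Tuple[str, str]] = []
    latest = file_summary  # the first round starts from the file summary
    for question in questions:
        # The prompt carries the earlier question/answer pairs as context.
        lines = [f"Q: {q}\nA: {a}" for q, a in history]
        # The previous round's answer becomes the new input.
        lines.append(f"Context: {latest}")
        lines.append(f"Q: {question}\nA:")
        latest = complete("\n".join(lines))
        history.append((question, latest))
    return history
```

Because the completion function is injected, the same loop can be exercised with a stub during testing and with a real model client in production.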

S7. Final code generation, the specific steps are as follows:

The intermediate code is based on the content of the original framework code, whereas the target framework code is based on the functionality of the original framework code. Using the target framework code to optimize the intermediate code preserves the structure of the original code while fixing the syntactically invalid parts that the code rewriter could not handle; error logs and optimization suggestions are also generated for developers to act on later.

The embodiments described above explain the technical solutions and beneficial effects of the present invention in detail. It should be understood that they are only specific embodiments of the present invention and are not intended to limit it; any modification, addition, or equivalent substitution made within the scope of the principles of the present invention shall be included in its scope of protection.

Claims

1. A hybrid front-end framework migration method based on AST and LLM, characterized by comprising the following steps:

(2-3) Semantic analysis: performing semantic analysis on the basis of the AST, including type checking, scope analysis, symbol resolution, and inheritance and implementation analysis;

2. The hybrid front-end framework migration method based on AST and LLM according to claim 1, characterized in that the specific process of step (1) is:

A code compilation and analysis tool is used to load and precompile the entire project and to identify the content of the project files. The files are classified as logic processing files, user interface definition files, resource files, configuration files, or other files; the logic processing code files and user interface definition code files are extracted, and a mapping between the logic processing code files and the user interface definition code files is generated.
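The classification and mapping step can be illustrated with a small sketch. The suffix table and the rule of pairing files that share a path stem are assumptions made for this example only; the claim does not specify how files are recognized or paired, and `classify` and `map_logic_to_ui` are hypothetical names.

```python
from pathlib import PurePosixPath

# Hypothetical suffix table; a real project would derive this from the source framework.
CATEGORY_BY_SUFFIX = {
    ".js": "logic", ".ts": "logic",
    ".html": "ui", ".xml": "ui",
    ".png": "resource", ".svg": "resource",
    ".json": "config", ".yaml": "config",
}


def classify(path):
    """Return the category of a project file, or 'other' if unrecognized."""
    return CATEGORY_BY_SUFFIX.get(PurePosixPath(path).suffix.lower(), "other")


def map_logic_to_ui(paths):
    """Pair each logic file with the UI file that shares its path stem."""
    ui_by_stem = {
        str(PurePosixPath(p).with_suffix("")): p
        for p in paths if classify(p) == "ui"
    }
    mapping = {}
    for p in paths:
        if classify(p) == "logic":
            stem = str(PurePosixPath(p).with_suffix(""))
            if stem in ui_by_stem:
                mapping[p] = ui_by_stem[stem]
    return mapping
```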

3. The hybrid front-end framework migration method based on AST and LLM according to claim 1, characterized in that in step (2-1), the logic processing code is decomposed into keywords, identifiers, and operators, and the user interface definition code is decomposed into tag names, attribute names, and attribute values.
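The two decompositions named in this claim can be sketched with regular expressions. This is a deliberately small illustration: the keyword set, the token patterns, and the function names are assumptions, and a production lexer for a front-end language would handle strings, comments, and many more operators.

```python
import re

KEYWORDS = {"const", "let", "var", "function", "return", "if", "else"}

TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)
  | (?P<name>[A-Za-z_$][\w$]*)
  | (?P<operator>===|!==|==|!=|<=|>=|=>|&&|\|\||[-+*/%=<>!])
  | (?P<punct>[(){}\[\];,.])
  | (?P<skip>\s+)
""", re.VERBOSE)

TAG_RE = re.compile(r'<([A-Za-z][\w-]*)((?:\s+[\w:@-]+="[^"]*")*)\s*/?>')
ATTR_RE = re.compile(r'([\w:@-]+)="([^"]*)"')


def tokenize(code):
    """Split logic code into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(code):
        match = TOKEN_RE.match(code, pos)
        if match is None:
            raise ValueError(f"unexpected character {code[pos]!r} at {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "skip":
            continue  # whitespace carries no token
        if kind == "name":
            kind = "keyword" if text in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens


def tokenize_tag(tag):
    """Split an opening tag into its tag name and (attribute, value) pairs."""
    match = TAG_RE.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a simple opening tag: {tag!r}")
    return match.group(1), ATTR_RE.findall(match.group(2))
```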

4. The hybrid front-end framework migration method based on AST and LLM according to claim 1, characterized in that in step (2-3), the type checking stage determines the data types of the variables, expressions, and values in the code; the scope analysis stage identifies the scopes of variables, functions, and classes and checks how variables are declared and used within those scopes; the symbol resolution stage builds a symbol table that stores information about variables, functions, and classes, identifies symbols, and looks up their declarations and references; and the inheritance and implementation analysis stage analyzes the inheritance relationships and interfaces between classes.

5. The hybrid front-end framework migration method based on AST and LLM according to claim 1, characterized in that in step (3-2), single-point migration is implemented by traversing the syntax node tree and updating nodes; single-statement migration is implemented by matching the target statement with a regular expression and extracting the call information; and complex-structure migration is implemented, rule by rule, by modifying the syntax tree.
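Two of the three migration styles in this claim can be sketched briefly. Everything specific here is hypothetical: `oldRouter.navigateTo` and `newRouter.push` are invented API names used only to show a rule, and the syntax tree is a plain nested dict whose `Identifier` nodes follow the ESTree node shape.

```python
import re

# Hypothetical single-statement rule: oldRouter.navigateTo(x) -> newRouter.push(x)
NAVIGATE_CALL = re.compile(r"\boldRouter\.navigateTo\(\s*(?P<args>[^()]*)\)")


def migrate_statement(line):
    """Single-statement migration: match the call and reuse its arguments."""
    return NAVIGATE_CALL.sub(
        lambda m: f"newRouter.push({m.group('args').strip()})", line
    )


def rename_identifiers(node, renames):
    """Single-point migration: walk the tree in place and rename Identifier nodes."""
    if isinstance(node, dict):
        if node.get("type") == "Identifier" and node.get("name") in renames:
            node["name"] = renames[node["name"]]
        for child in node.values():
            rename_identifiers(child, renames)
    elif isinstance(node, list):
        for child in node:
            rename_identifiers(child, renames)
    return node
```

The regular-expression rule is adequate only for flat call arguments; nested calls or multi-line statements are the cases the claim assigns to syntax-tree modification.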

6. The hybrid front-end framework migration method based on AST and LLM according to claim 1, characterized in that in step (3-3), the code rewriter is generated on the basis of AST tree-node operations that modify, move, and reorganize code; and, according to the migration rules, the rewriters are divided into regular-rule rewriters, special-rule rewriters, and subsequent-optimization rewriters.

7. The hybrid front-end framework migration method based on AST and LLM according to claim 1, characterized in that the specific process of step (4) is:

8. The hybrid front-end framework migration method based on AST and LLM according to claim 7, characterized in that in step (4-1), the data source is split into document chunks of a specified size and stored in a vector database as embeddings; when a prompt is input, the split documents are retrieved from the vector database, the document chunks most similar to the question in the prompt are found by comparing cosine similarity, those chunks are passed to the LLM, and the answer is generated from a prompt containing both the question and the document chunks.

9. The hybrid front-end framework migration method based on AST and LLM according to claim 1, characterized in that in step (5), the intermediate code is based on the content of the original framework code and the target framework code is based on the functionality of the original framework code; the target framework code is used to optimize the intermediate code, fixing the syntactically invalid parts that the code rewriter could not handle while preserving the structure of the original code, and error logs and optimization suggestions are generated for developers to act on later.