Disclosure of Invention
To solve the above problems, the present invention provides a topic conversion system based on a big data cloud platform, which comprises:
a database module, comprising a plurality of databases and a plurality of exclusive databases, wherein each database corresponds to a different font type and stores the sample text outlines of that font type, and each exclusive database stores the sample text outlines exclusive to one user side;
a data interaction module, configured to acquire the question text picture uploaded to the cloud platform by the user side;
a data processing module, comprising an image analysis unit, a first comparison unit, a second comparison unit and a database construction unit that are connected with each other, wherein
the image analysis unit is connected with the data interaction module and is configured to extract the text outlines in the question text picture, randomly screen the text outlines, determine the font type to which each screened text outline belongs, determine the proportion of each font type, and determine an optimal database based on the font type with the highest proportion;
the first comparison unit is connected with the database module and is configured to acquire the text outlines extracted by the image analysis unit, compare each text outline with each sample text outline in the exclusive database of the user side and in the optimal database, and identify the text content represented by each text outline according to the comparison results;
the second comparison unit is connected with the database module and is configured to acquire the text outlines whose text content cannot be identified by the first comparison unit, sort the databases in descending order of their font data similarity to the optimal database, select the databases one by one in the sorted order, compare each unidentified text outline with the sample text outlines in the selected database, and identify the text content represented by each text outline according to the comparison results;
the database construction unit is connected with the database module and the data interaction module and is configured to acquire the text outlines that cannot be identified by the second comparison unit, send them to the user side through the data interaction module so that the text content they represent can be determined, and, after the user side confirms, store the text outlines as sample text outlines in the exclusive database of the user side.
Further, a proportion interval [20%, 40%] is set in the image analysis unit, and the proportion of the text outlines selected during random screening to the total number of text outlines shall fall within this interval.
Further, the image analysis unit calculates the similarity between the screened text outline and the sample text outline in each database, determines the sample text outline with the highest similarity, determines the database to which the sample text outline belongs, and determines the font type corresponding to the database as the font type to which the screened text outline belongs.
Further, after determining the font type to which each screened text outline belongs, the image analysis unit calculates the proportion P of each font type according to formula (1),
$P = \frac{n_i}{N_0}$ (1)
in formula (1), $n_i$ represents the number of screened text outlines belonging to the i-th font type, $N_0$ represents the total number of screened text outlines, and i is an integer greater than 0.
Further, the image analysis unit determines the font type with the highest proportion and takes the database corresponding to that font type as the optimal database.
Further, the first comparison unit compares each text outline with each sample text outline in the exclusive database of the user side and in the optimal database, and identifies the text content represented by each text outline according to the comparison results, wherein
the first comparison unit compares the text outline with each sample text outline to calculate the coincidence degree between the text outline and that sample text outline, and screens out the sample text outline with the highest coincidence degree; if this highest coincidence degree is higher than a preset first coincidence degree comparison threshold, the first comparison unit determines that the text content represented by the text outline is the same as the text content represented by that sample text outline.
Further, the second comparison unit stores in advance the font data similarity E0 of every pair of databases, where E0 is calculated according to formula (2),
in formula (2), N represents the number of sample text outlines in the databases, and ei represents the similarity between the i-th sample text outline in the first of the two databases and the i-th sample text outline in the second.
Further, once the second comparison unit has identified the text content represented by a text outline, it does not compare that text outline with the sample text outlines in the remaining databases.
Further, the second comparison unit compares each unidentified text outline with the sample text outlines in the selected database, and identifies the text content represented by each text outline according to the comparison results, wherein
the second comparison unit compares the text outline with each sample text outline to calculate the coincidence degree between the text outline and that sample text outline, and screens out the sample text outline with the highest coincidence degree; if this highest coincidence degree is higher than a preset second coincidence degree comparison threshold, the second comparison unit determines that the text content represented by the text outline is the same as the text content represented by that sample text outline.
Further, the second coincidence degree comparison threshold is smaller than the first coincidence degree comparison threshold.
Compared with the prior art, the present invention provides a database module, a data interaction module and a data processing module. The database module comprises a plurality of databases that store sample text outlines of different font types, and the data interaction module receives the question text picture uploaded by the user side. Within the data processing module, the image analysis unit determines an optimal database based on the font types of a portion of the text outlines in the question text picture; the first comparison unit identifies the text content represented by the text outlines based on the optimal database and the exclusive database of the user side; the second comparison unit switches databases to identify the text outlines that the first comparison unit cannot identify; and the database construction unit, once the user side has confirmed the text content represented by the text outlines that neither comparison unit can identify, stores those text outlines in the exclusive database of the user side. Through this process, the optimal database is determined from the fonts of the text outlines in the question text picture, which improves the efficiency of text outline identification; when a text outline cannot be identified with the optimal database, the databases with the highest font data similarity to the optimal database are tried first; and text outlines that still cannot be identified are stored in the exclusive database after confirmation by the user side, which further improves the identification rate of text outlines.
In particular, the image analysis unit randomly screens the text outlines and determines the optimal database based on the font types to which the screened text outlines belong. Limiting the proportion of screened text outlines to the total number of text outlines allows the randomly screened data to characterize the whole data while avoiding the computational load that an excessive amount of screened data would cause.
In particular, the image analysis unit determines the optimal database before the text outlines are identified. In practice, the fonts of the text outlines in the question text pictures uploaded by different user sides differ, so the present invention takes the database corresponding to the font type with the highest proportion among a portion of the text outlines as the optimal database. This reduces the influence of font differences on text outline identification, and the first comparison unit then compares the text outlines with the data in the optimal database, which improves the efficiency and accuracy of text outline identification.
In particular, the second comparison unit of the present invention switches databases to identify the text outlines whose text content cannot be identified by the first comparison unit. The databases are switched according to their font data similarity to the current optimal database, and databases with a high similarity to the optimal database are selected first as the basis for comparison, which improves the efficiency and accuracy of text outline identification.
In particular, the first comparison unit and the second comparison unit use different coincidence degree comparison thresholds. Because the second comparison unit handles the text outlines that the first comparison unit cannot identify, it uses the lower threshold, which raises the probability of identifying a text outline while preserving reliability.
In particular, the present invention also provides a database construction unit, which sends the text outlines that neither the first nor the second comparison unit can identify to the user side for confirmation and, after confirmation, stores them in the exclusive database of the user side for use in subsequent text outline identification. In practice, differences in handwriting produce some text outlines that are difficult to identify, and this process effectively raises the probability of identifying them.
Detailed Description
In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be communication between the interiors of two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic diagram of the topic conversion system based on a big data cloud platform according to the present embodiment, and FIG. 2 is a schematic diagram of the data processing module according to the present embodiment. The topic conversion system based on a big data cloud platform of the present invention comprises:
a database module, comprising a plurality of databases and a plurality of exclusive databases, wherein each database corresponds to a different font type and stores the sample text outlines of that font type, and each exclusive database stores the sample text outlines exclusive to one user side;
a data interaction module, configured to acquire the question text picture uploaded to the cloud platform by the user side;
a data processing module, comprising an image analysis unit, a first comparison unit, a second comparison unit and a database construction unit that are connected with each other, wherein
the image analysis unit is connected with the data interaction module and is configured to extract the text outlines in the question text picture, randomly screen the text outlines, determine the font type to which each screened text outline belongs, determine the proportion of each font type, and determine an optimal database based on the font type with the highest proportion;
the first comparison unit is connected with the database module and is configured to acquire the text outlines extracted by the image analysis unit, compare each text outline with each sample text outline in the exclusive database of the user side and in the optimal database, and identify the text content represented by each text outline according to the comparison results;
the second comparison unit is connected with the database module and is configured to acquire the text outlines whose text content cannot be identified by the first comparison unit, sort the databases in descending order of their font data similarity to the optimal database, select the databases one by one in the sorted order, compare each unidentified text outline with the sample text outlines in the selected database, and identify the text content represented by each text outline according to the comparison results;
the database construction unit is connected with the database module and the data interaction module and is configured to acquire the text outlines that cannot be identified by the second comparison unit, send them to the user side through the data interaction module so that the text content they represent can be determined, and, after the user side confirms, store the text outlines as sample text outlines in the exclusive database of the user side.
Specifically, this embodiment does not limit the method by which the image analysis unit acquires the text outlines. An existing OCR engine may be used, and the text outlines may be obtained by segmenting the image in the same way as in the prior art, which will not be described here again.
Specifically, the specific structure of the data processing module is not limited in this embodiment, and each unit may be configured using a logic unit, and the logic unit may be a field programmable logic unit, a microprocessor, a processor used in a computer, or the like.
Specifically, the specific structure of the database module is not limited in this embodiment, and the database is in a common data storage form, which is a mature prior art and will not be described herein.
Specifically, the specific structure of the data interaction module is not limited in this embodiment, and only needs to be connected with the cloud platform, which is a mature prior art and will not be described herein.
Specifically, this embodiment does not limit the specific similarity algorithm. Algorithms commonly used in the field of text character recognition include the cosine similarity algorithm and the Euclidean distance similarity algorithm, and those skilled in the art can select a suitable algorithm according to specific needs to calculate the similarity between a text outline and a sample text outline. This is prior art and is not described in detail herein.
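As an illustration of the two similarity measures named above, the following is a minimal sketch. It assumes that each text outline has already been reduced to a fixed-length numeric feature vector, which the text does not specify; the function names and the mapping from Euclidean distance to a similarity value, 1 / (1 + d), are likewise illustrative choices rather than part of the described system.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

def euclidean_similarity(a, b):
    """Map the Euclidean distance d to a similarity 1 / (1 + d) in (0, 1].

    The 1 / (1 + d) mapping is an assumption made for this sketch.
    """
    return 1.0 / (1.0 + math.dist(a, b))
```

Either function can serve as the similarity between a text outline and a sample text outline; identical vectors score 1.0 under both.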
Specifically, this embodiment does not limit the specific form of the cloud platform, which only needs to be able to receive the data uploaded by each user side. This is prior art and is not described herein.
Specifically, a proportion interval [20%, 40%] is set in the image analysis unit, and the proportion of the text outlines selected during random screening to the total number of text outlines shall fall within this interval. This ensures that the screened data characterizes the whole data while preventing an excessive amount of screened data from slowing down data processing.
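The random screening step can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name, the default ratio of 30%, and the rounding rule are assumptions.

```python
import random

RATIO_INTERVAL = (0.20, 0.40)  # proportion interval stated in the text

def screen_outlines(outlines, ratio=0.30, rng=None):
    """Randomly select a share of the extracted text outlines.

    `ratio` must lie inside RATIO_INTERVAL; 0.30 is an arbitrary default.
    For very small inputs, rounding can push the realized share slightly
    outside the interval.
    """
    low, high = RATIO_INTERVAL
    if not low <= ratio <= high:
        raise ValueError("ratio must lie within [0.20, 0.40]")
    if not outlines:
        return []
    rng = rng or random.Random()
    count = max(1, round(len(outlines) * ratio))
    return rng.sample(outlines, count)  # sampling without replacement
```

Passing a seeded `random.Random` instance makes the selection reproducible, which is convenient when testing.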
Specifically, the sample text outline data stored in each database of the database module may be obtained from an open-source dictionary database, or sample text outline data for the different font types may be constructed in advance by a person skilled in the art. Such sample text outlines may be collected by a big data crawler program or obtained by any other practicable method.
Specifically, the image analysis unit calculates the similarity between the screened text outline and the sample text outline in each database, determines the sample text outline with the highest similarity, determines the database to which the sample text outline belongs, and determines the font type corresponding to the database as the font type to which the screened text outline belongs.
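The font type determination described above amounts to a nearest-sample search across all databases. The sketch below assumes a hypothetical data model in which `databases` maps a font type name to a dictionary of text content to sample outline features, and in which the similarity function is supplied by the caller; none of these names come from the text.

```python
import math

def font_type_of(outline, databases, similarity):
    """Return the font type whose database holds the most similar sample.

    databases: {font_type: {text_content: sample_outline}} (assumed layout)
    similarity: callable(outline, sample) -> float, higher means closer
    """
    best_font, best_score = None, float("-inf")
    for font_type, samples in databases.items():
        for sample in samples.values():
            score = similarity(outline, sample)
            if score > best_score:
                best_font, best_score = font_type, score
    return best_font

def negative_distance(a, b):
    """Example similarity: the closer two vectors, the larger the value."""
    return -math.dist(a, b)
```

For instance, with two single-sample databases `{"font_a": {"x": (1.0, 0.0)}, "font_b": {"x": (0.0, 1.0)}}`, the outline `(0.9, 0.1)` is assigned to `font_a` under `negative_distance`.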
Specifically, after determining the font type to which each screened text outline belongs, the image analysis unit calculates the proportion P of each font type according to formula (1),
$P = \frac{n_i}{N_0}$ (1)
in formula (1), $n_i$ represents the number of screened text outlines belonging to the i-th font type, $N_0$ represents the total number of screened text outlines, and i is an integer greater than 0;
the image analysis unit determines the font type with the highest proportion and takes the database corresponding to that font type as the optimal database.
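The proportion calculation and the choice of the optimal database can be sketched as below, assuming that P is the ratio of the count for a font type to the total number of screened text outlines, as the symbol definitions indicate. The function names and the font labels in the usage note are hypothetical.

```python
from collections import Counter

def font_proportions(font_types):
    """Proportion of each font type among the screened text outlines.

    font_types: one font type label per screened outline.
    """
    total = len(font_types)       # total number of screened outlines
    counts = Counter(font_types)  # count per font type
    return {font: n / total for font, n in counts.items()}

def optimal_font_type(font_types):
    """Font type with the highest proportion; its database is the optimal one."""
    if not font_types:
        raise ValueError("no screened text outlines")
    proportions = font_proportions(font_types)
    return max(proportions, key=proportions.get)
```

With the labels `["song", "song", "song", "kai"]`, the proportions are 0.75 and 0.25, so the database for `"song"` would be taken as the optimal database.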
Specifically, the image analysis unit determines the optimal database before the text outlines are identified. In practice, the fonts of the text outlines in the question text pictures uploaded by different user sides differ, so the present invention takes the database corresponding to the font type with the highest proportion among a portion of the text outlines as the optimal database. This reduces the influence of font differences on text outline identification, and the first comparison unit then compares the text outlines with the data in the optimal database, which improves the efficiency and accuracy of text outline identification.
Specifically, the first comparison unit compares each text outline with each sample text outline in the exclusive database of the user side and in the optimal database, and identifies the text content represented by each text outline according to the comparison results, wherein
the first comparison unit compares the text outline with each sample text outline to calculate the coincidence degree between the text outline and that sample text outline, and screens out the sample text outline with the highest coincidence degree; if this highest coincidence degree is higher than a preset first coincidence degree comparison threshold, the first comparison unit determines that the text content represented by the text outline is the same as the text content represented by that sample text outline.
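The thresholded best-match rule of the first comparison unit can be sketched as follows. The text does not define how the coincidence degree is computed; this sketch assumes each outline is a set of pixel coordinates and uses intersection over union as a stand-in. The function names and the data layout are also assumptions.

```python
def coincidence(outline, sample):
    """Assumed coincidence degree: intersection over union of pixel sets."""
    union = outline | sample
    return len(outline & sample) / len(union) if union else 0.0

def recognize(outline, databases, threshold):
    """Return the content of the best-matching sample, or None.

    databases: iterable of {text_content: sample_outline} dictionaries,
    e.g. the user's exclusive database followed by the optimal database.
    A result is accepted only if the highest coincidence degree is
    strictly higher than the threshold.
    """
    best_content, best_score = None, 0.0
    for database in databases:
        for content, sample in database.items():
            score = coincidence(outline, sample)
            if score > best_score:
                best_content, best_score = content, score
    return best_content if best_score > threshold else None
```

A return value of `None` marks a text outline that would be handed on to the second comparison unit.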
Specifically, the second comparison unit stores in advance the font data similarity E0 of every pair of databases, where E0 is calculated according to formula (2),
in formula (2), N represents the number of sample text outlines in the databases, and ei represents the similarity between the i-th sample text outline in the first of the two databases and the i-th sample text outline in the second.
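Formula (2) is not reproduced in this text. One form consistent with the symbol definitions above, offered here as an assumption rather than as the formula of the source, is the arithmetic mean of the pairwise sample similarities:

```latex
E_0 = \frac{1}{N} \sum_{i=1}^{N} e_i
```

Under this reading, two databases whose corresponding sample text outlines are all highly similar have a font data similarity close to the maximum value of the chosen similarity measure.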
Specifically, the second comparison unit switches databases to identify the text outlines whose text content cannot be identified by the first comparison unit. The databases are switched according to their font data similarity to the current optimal database, and databases with a high similarity to the optimal database are selected first as the basis for comparison, which improves the efficiency and accuracy of text outline identification.
Specifically, once the second comparison unit has identified the text content represented by a text outline, it does not compare that text outline with the sample text outlines in the remaining databases.
Specifically, the second comparison unit compares the text outline with each sample text outline in the selected database to calculate the coincidence degree between the text outline and that sample text outline, and screens out the sample text outline with the highest coincidence degree; if this highest coincidence degree is higher than a preset second coincidence degree comparison threshold, the second comparison unit determines that the text content represented by the text outline is the same as the text content represented by that sample text outline.
Specifically, the second coincidence degree comparison threshold is smaller than the first coincidence degree comparison threshold.
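The fallback search of the second comparison unit can be sketched as follows: the remaining databases are ordered by descending font data similarity to the optimal database and tried one by one, stopping at the first database that yields a match above the second threshold. As before, the pixel-set coincidence measure, the data layout, and all names are assumptions made for this sketch.

```python
def coincidence(outline, sample):
    """Assumed coincidence degree: intersection over union of pixel sets."""
    union = outline | sample
    return len(outline & sample) / len(union) if union else 0.0

def second_comparison(outline, optimal_font, databases, font_similarity, threshold):
    """Try the non-optimal databases in descending font data similarity.

    databases: {font_type: {text_content: sample_outline}}
    font_similarity: {frozenset({font_a, font_b}): precomputed similarity}
    Returns (text_content, font_type), or (None, None) if nothing matches.
    """
    others = [font for font in databases if font != optimal_font]
    others.sort(key=lambda font: font_similarity[frozenset((optimal_font, font))],
                reverse=True)
    for font in others:
        best_content, best_score = None, 0.0
        for content, sample in databases[font].items():
            score = coincidence(outline, sample)
            if score > best_score:
                best_content, best_score = content, score
        if best_score > threshold:
            return best_content, font  # stop: remaining databases are skipped
    return None, None
```

Keying `font_similarity` by an unordered pair reflects that the similarity between two databases is stored once per pair, in advance.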
Specifically, the first and second coincidence degree comparison thresholds are determined as follows. A plurality of question text pictures with a text outline count of 10000 are selected and processed by the image analysis unit to obtain their text outlines. Each text outline is compared with the sample text outlines in the optimal database to calculate the coincidence degree between the text outline and each sample text outline, the sample text outline with the highest coincidence degree is screened out, and the text outline is identified accordingly. It is then determined whether each identification result is accurate, the accurately identified text outlines are screened out, and the highest coincidence degree of each such text outline with the sample text outlines in the optimal database is recorded. Taking this highest coincidence degree as a random variable, its probability density function is calculated and a normal distribution curve is constructed, the 95% confidence interval is calculated from the probability density function, and the first coincidence degree comparison threshold and the second coincidence degree comparison threshold are determined on the basis of this confidence interval.
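The statistical step can be sketched with the Python standard library. This sketch fits a normal distribution to the recorded highest coincidence degrees and returns the central 95% interval of the fitted distribution; reading the "95% confidence interval" in this way is an assumption, and how the two thresholds are then derived from the interval is deliberately left out because the text does not state it unambiguously.

```python
from statistics import NormalDist

def coincidence_interval(scores, confidence=0.95):
    """Central interval of a normal distribution fitted to the scores.

    scores: highest coincidence degrees of the correctly identified
    text outlines (at least two values are required).
    """
    dist = NormalDist.from_samples(scores)  # sample mean and sample stdev
    tail = (1.0 - confidence) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)
```

Because the fitted curve is unbounded, the upper end of the interval can exceed the largest achievable coincidence degree when the scores cluster near that maximum, so a practical implementation would probably clip it.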
Specifically, the first comparison unit and the second comparison unit use different coincidence degree comparison thresholds. Because the second comparison unit handles the text outlines that the first comparison unit cannot identify, it uses the lower threshold, which raises the probability of identifying a text outline while preserving reliability.
Specifically, the present invention also provides a database construction unit, which sends the text outlines that neither the first nor the second comparison unit can identify to the user side for confirmation and, after confirmation, stores them in the exclusive database of the user side for use in subsequent text outline identification. In practice, differences in handwriting produce some text outlines that are difficult to identify, and this process effectively raises the probability of identifying them.
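The confirm-then-store behaviour of the database construction unit can be sketched as below. The callback `ask_user`, which stands in for the round trip through the data interaction module, and the dictionary layout of the exclusive database are hypothetical.

```python
def store_confirmed(exclusive_db, outline, ask_user):
    """Store an unidentified outline once the user side confirms its content.

    exclusive_db: {text_content: sample_outline} for one user side
    ask_user: callable(outline) -> confirmed text content, or None if
              the user side does not confirm (hypothetical interface)
    Returns True if the outline was stored as a sample text outline.
    """
    content = ask_user(outline)
    if content is None:
        return False  # nothing is stored without confirmation
    exclusive_db[content] = outline
    return True
```

On the next upload from the same user side, the stored outline is available to the first comparison unit through the exclusive database.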
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.