Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
This repository was archived by the owner on Jul 22, 2024. It is now read-only.
/spiralPublic archive

A Python 3 module that provides functions for splitting identifiers found in source code files.

License

NotificationsYou must be signed in to change notification settings

casics/spiral

Repository files navigation

Spiral is a Python module that provides several different functions for splitting identifiers found in source code files. The nameSpiral is a loose acronym based on "SPlitters for IdentifieRs: A Library".

License: GPL v3PythonLatest releaseDOIDOI

Authors:Michael Hucka
Repository:https://github.com/casics/spiral
License: Unless otherwise noted, this content is licensed under theGPLv3 license.

🏁 Recent news and activities

April 2018: Version 1.1.0 fixesa bug that prevented importing Spiral, and another bug that causesetup.py to fail to install dependencies automatically. Additional enhancements include improved command-line help and internal code refactoring.

Table of Contents

☀ Introduction

Spiral is a Python 3 package that implements numerous identifier splitting algorithms.Identifier splitting (also known asidentifier name tokenization) is the task of breaking apart program identifier strings such asgetInt orreadUTF8stream into component tokens: [get,int] and [read,utf8,stream]. The need for splitting identifiers arises in a variety of contexts, including natural language processing (NLP) methods applied to source code analysis and program comprehension. Spiral provides some basic naive splitting algorithms, such as a straightforward camel-case splitter, as well as more elaborate heuristic splitters, such as a new algorithm calledRonin.

♥️ Please cite the Spiral paper and the version you use

Article citations arecritical for academic developers. If you use Spiral and you publish papers about your software, we ask that youplease cite the Spiral paper:

Hucka, M. (2018). Spiral: splitters for identifiers in source code files.Journal of Open Source Software, 3(24), 653,https://doi.org/10.21105/joss.00653

Please also indicate the specific version of Spiral you use, to improve other people's ability to reproduce your results. You can use the Zenodo DOIs we provide for this purpose:

✺ Installation instructions

For basic usage, the following is probably the simplest and most direct way to install Spiral on your computer:

sudo pip3 install git+https://github.com/casics/spiral.git

Alternatively, you can clone this GitHub repository and then runsetup.py:

git clone https://github.com/casics/spiral.gitcd spiralsudo python3 -m pip install.

The above should be all you need to run Spiral out of the box. If you plan on experimenting with alternative parameter values or alternative dictionaries, you will additionally need to install the following:

  • Platypus-Opt, an optimization library (if you want to optimize parameters).Important:This Platypus is not the same as theother package called "Platypus" in PyPI. Make sure to getPlatypus-Opt or install from thePlatypus repo.

  • Data modules fromNLTK, particularlynltk_words andntlk_wordnet from thenltk.corpus module and thenltk.stem module. These cannot be installed automatically bysetup.py—please refer to theNLTK installation instructions for instructions on how to do it.

▶︎ Basic operation

Spiral is extremely easy to use. To use a Spiral splitter in a Python program, simply import a splitter and then call it.

fromspiral.simple_splittersimportpure_camelcase_splitprint(pure_camelcase_split('TestString'))

Some splitters take optional parameters, and the more complex splitters have aninit() function that you can optionally call to set additional parameters or load data. Currently, only Ronin and Samurai have initialization functions, and callinginit() is optional—if you do not call it, Ronin and Samurai will call theirinit() functions automatically.

Here are some examples of using the Ronin splitter algorithm. The following input,

fromspiralimportroninforsin ['mStartCData','nonnegativedecimaltype','getUtf8Octets','GPSmodule','savefileas','nbrOfbugs']:print(ronin.split(s))

produces the following output:

['m','Start','C','Data']['nonnegative','decimal','type']['get','Utf8','Octets']['GPS','module']['save','file','as']['nbr','Of','bugs']

Spiral also includes a command-line program namedspiral in thebin subdirectory; it will split strings provided on the command line or in a file. (Note: Ronin and Samurai load internal data files when they start up. In normal use, called from an application program, the resulting startup delay will only happen once. In the command-line program, the data is reloaded at every invocation, causing a startup delay every time. The delay is not typical for normal Spiral usage.)

By the way, if your goal is to split identifiers obtained during source code mining and you need to filter out gibberish strings of characters before attempting to split them, you may findNostril (the Nonsense String Evaluator) useful.

🎯 Performance of Ronin

Splitting identifiers is deceptively difficult. It is a research-grade problem for which no perfect solution exists. Even in cases where the input consists of identifiers that strictly follow conventions such as camel case, ambiguities can arise. For example, correctly splittingJ2SEProjectTypeProfiler intoJ2SE,Project,Type,Profiler requires the reader to recognizeJ2SE as a unit. The task of splitting identifiers is made more difficult when there are no case transitions or other obvious boundaries in an identifier. And then there is the small problem that humans are often simply inconsistent!

Ronin is an advanced splitter that uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin includes a default table of term frequencies derived from an analysis of over 46,000 randomly selected software projects in GitHub that contained at least one Python source code file. The tokens were extracted using software from Spiral's parent project,CASICS (specifically, theextractor package), and the frequency table constructed using a procedure encoded in the small programcreate_frequency_fileincluded with Spiral. Ronin also has a number of parameters that need to be tuned; this can be done using the optimization programoptimize-roninin the dev/optimization subdirectory. The default parameter values were derived by optimizing performance against two data sets available from other research groups:

  1. TheLoyola University of Delaware Identifier Splitting Oracle (Ludiso)
  2. The INTT data set, extracted from thezip archive of INTT

Spiral includes copies of these data sets in thetests/data subdirectory. The parameters derived primarily by running against the INTT database of 18,772 identifiers and their splits. The following table summarizes the results:

Data setNumber of splits matched by RoninTotal in data setAccuracy
INTT17,28718,77292.09%
Ludiso2,2482,66384.42%

Many of the "failures" against these sets of identifiers are actually not failures, but cases where Ronin gets a more correct answer or where there is a legitimate difference in interpretation. Here are some examples:

IdentifierLudiso resultRonin split
a.ecartaecartaecart
ConvertToAUTF8StringConvertToAUTF8StringConvertToAUTF8String
MAPI_BCCMAPIBCCMAPIBCC
GetWSIsEnabledGetWSIsEnabledGetWSIsEnabled
freadfreadfread
FF_LOSS_COLORSPACEFFLOSSCOLORSPACEFFLOSSCOLORSPACE
m_bFilenameModembFilenameModembFilenameMode

Names likefread could be considered appropriate to leave alone, but thef infread actually stands for "file", and thus IMHO it makes sense to split it—splitting identifiers into meaningful tokens is the purpose of Ronin, after all. Conversely, tokens such asUtf8 should be left as units because they are meaningful. Differences involving some other terms such ascolor space are due to Ronin splitting terms that are typically considered separate words but are treated as one word in Ludiso, or vice versa. (The main entry in Wikipedia for "color space" is two words, while for "filename" it is one word.) This sometimes also arises because Ronin recognizes prefixes and sometimes does not split words because they would not be in normal written English, such asnonblocking instead ofnonblocking. Still other differences between Ronin's output and the expected splits given in Ludiso and INTT are due to inconsistencies in the Ludiso and INTT data sets. For example, sometimes a token within a larger identifier is split in one case but not in another. Finally, some other differences are cases where the Ludiso split seems to favor program-specific names. For example,QWheelEvent is split as [ QWheel,Event ] whereas Ronin will split it as [ Q,Wheel,Event ].

So, Ronin's performance may be better than the pure numbers imply. However, without checking every Ludiso case and manually reinterpreting the splits, we can't say for sure. All that aside, Ronin definitely gets many cases wrong.

⚙️ Other splitters in Spiral

Here is a list of all the splitters implemented in Spiral at this time:

Splitter nameOperation
delimiter_splitsplit only at characters$~_.:/@
digit_splitsplit only at digits
pure_camelcase_splitsplit at forward camel case transitions (lower to upper case)
safe_simple_splitsplit at hard delimiter characters and forward camel case only; won't split strings that don't follow strict camel case
simple_splitsplit at hard delimiter characters and forward camel case, even if a string doesn't follow strict camel case conventions
elementary_splitsplit by hard delimiters, forward camel case, and digits
heuristic_splitsplit by hard delimiters, forward camel case, and digits, but recognize special cases such asutf8,sha256, etc.
Samuraifrequency-based algorithm published in the literature
Roninfrequency-based algorithm based on Samurai but greatly extended (seeprevious section)

The following table illustrates the differences between the simpler splitters.

Input stringpure camelsafe simplesimpleelementaryheuristic
allloweralllowerallloweralllowerallloweralllower
ALLUPPERALLUPPERALLUPPERALLUPPERALLUPPERALLUPPER
a_delimitera_delimiteradelimiteradelimiteradelimiteradelimiter
a.delimitera.delimiteradelimiteradelimiteradelimiteradelimiter
a$delimitera$delimiteradelimiteradelimiteradelimiteradelimiter
a:delimitera:delimiteradelimiteradelimiteradelimiteradelimiter
a_fooBara_fooBarafooBarafooBarafooBarafooBar
fooBarfooBarfooBarfooBarfooBarfooBar
FooBarFooBarFooBarFooBarFooBarFooBar
FoobarFoobarFoobarFoobarFoobarFoobar
fooBARfooBARfooBARfooBARfooBARfooBAR
fooBARbiffooBARbiffooBARbiffooBARbiffooBARbiffooBARbif
fooBARzBiffooBARzBiffooBARzBiffooBARzBiffooBARzBiffooBARzBif
ABCfooABCfooABCfooABCfooABCfooABCfoo
ABCFooABCFooABCFooABCFooABCFooABCFoo
ABCFooBarABCFooBarABCFooBarABCFooBarABCFooBarABCFooBar
ABCfooBarABCfooBarABCfooBarABCfooBarABCfooBarABCfooBar
fooBar2dayfooBar2dayfooBar2dayfooBar2dayfooBar2dayfooBar2day
fooBar2DayfooBar2DayfooBar2DayfooBar2DayfooBar2DayfooBar2Day
fooBAR2dayfooBAR2dayfooBAR2dayfooBAR2dayfooBAR2dayfooBAR2day
fooBAR2DayfooBAR2DayfooBAR2DayfooBAR2DayfooBAR2DayfooBAR2Day
foo3000foo3000foo3000foo3000foo3000foo3000
99foo300099foo300099foo300099foo300099foo300099foo3000
foo2Barfoo2Barfoo2Barfoo2Barfoo2Barfoo2Bar
foo2bar2foo2bar2foo2bar2foo2bar2foo2bar2foo2bar2
Foo2Bar2Foo2Bar2Foo2Bar2Foo2Bar2Foo2Bar2Foo2Bar2
MyInt32MyInt32MyInt32MyInt32MyInt32MyInt32
MyInt42MyInt42MyInt42MyInt42MyInt42MyInt42
MyInt32VarMyInt32VarMyInt32VarMyInt32VarMyInt32VarMyInt32Var
2ndvar2ndvar2ndvar2ndvar2ndvar2ndvar
the2ndvarthe2ndvarthe2ndvarthe2ndvarthe2ndvarthe2ndvar
the2ndVarthe2ndVarthe2ndVarthe2ndVarthe2ndVarthe2ndVar
row10row10row10row10row10row10
utf8utf8utf8utf8utf8utf8
aUTF8varaUTF8varaUTF8varaUTF8varaUTF8varaUTF8var
J2SE4meJ2SE4meJ2SE4meJ2SE4meJ2SE4meJ2SE4me
IPv4addrIPv4addrIPv4addrIPv4addrIPv4addrIPv4addr

Ronin and Samurai are more advanced than any of the simple splitters above. Ronin in particular does everythingheuristic_splitter does, but handles far more difficult cases.

⚠️ Limitations

Ronin is the most advanced splitter in the Spiral package, but it is not perfect by any means. Certain limitations are known:

  • Ronin is tuned for splitting program identifiers, which isnot the same as splitting strings of concatenated words. With its default parameter values, it is not good at splitting strings likedriveourtrucks which are not composed of tokens commonly encountered in source code contexts. (By the way, reducing the value of parameterlow_freq_cutoff inronin.init() to something like10 makes Ronin more likely to split concatenated word strings likedriveourtrucks orsocietynamebank. However, this comes at the expense of worsening scores on the INTT and Ludiso data sets.)

  • Ronin was trained on source code repositories containing Python code. The default frequency table may not be optimal for splitting things that come from non-Python source files, although the test results on Ludiso and INTT show that it is pretty good on non-Python content.

  • Spiral includes a set of predefined special terms in the fileconstants.py so thatheuristic_split and the Ronin splitter can recognize certain character sequences that should be treated as units. Examples includeutf8,ipv4,checkbox,J2SE and others. The use of specialized dictionaries like this is an approach taken in other splitters (including Simon Butler'sINTT) and is the only known way to prevent incorrect splits of special terms, but it is also an obvious limitation: the list is surely incomplete (and can neverbe complete because new terms are invented all the time), and may even cause incorrect splits in some contexts.

  • The current algorithm implemented in Ronin arose organically after a lot of trial-and-error experimentation, and as a consequence, is rather gnarly and difficult to explain. The implementation could be made more efficient and the algorithm clarified by a nice about of refactoring.

📚 More information

The name "Ronin" is a play on the use of the name "Samurai" by Enslen et al. (2009) for their identifier splitting algorithm. The core loop of Ronin is based on Samurai, but it would be inappropriate to call this implementation Samurai too. In an effort to imply the lineage of this modified algorithm, I chose "Ronin" (a name referring to a drifter samurai without a master, during the Japanese feudal period).

I implemented Samurai based on the description of the algorithm published in the following paper, then modified it repeatedly in an attempt to improve performance. Ronin is the result. A goal for Ronin was to produce a splitter that had good performance even without using a local frequency table.

Enslen, E., Hill, E., Pollock, L., & Vijay-Shanker, K. (2009). Mining source code to automatically split identifiers for software analysis. In Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories (MSR'09)
Abstract: Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques.

The Ludiso data set is described in the following paper:

Hill, E., Binkley, D., Lawrie, D., Pollock, L., & Vijay-Shanker, K. (2014). An empirical study of identifier splitting techniques. Empirical Software Engineering, 19(6), 1754–1780.
Abstract: Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches.

The INTT splitter and data set are described in the following paper:

Butler, S., Wermelinger, M., Yu, Y., & Sharp, H. (2011). Improving the Tokenisation of Identifier Names. In ECOOP 2011 – Object-Oriented Programming (pp. 130–154). Springer, Berlin, Heidelberg.
Abstract: Identifier names are the main vehicle for semantic information during program comprehension. Identifier names are tokenised into their semantic constituents by tools supporting program comprehension tasks, including concept location and requirements traceability. We present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves tokenisation accuracy for identifier names of a single case and those containing digits. Second, performance gains over existing techniques are achieved using smaller oracles. Accuracy was evaluated by comparing the output of our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 open source Java projects totalling 16.5 MSLOC. We also undertook a study of the typographical features of identifier names (single case, use of digits, etc.) per object-oriented construct (class names, method names, etc.), thus providing an insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available.

⁇ Getting help and support

If you find an issue, please submit it inthe GitHub issue tracker for this repository.

♬ Contributing — info for developers

A lot remains to be done on CASICS in many areas. We would be happy to receive your help and participation if you are interested. Please feel free to contact the developers either via GitHub or the mailing listcasics-team@googlegroups.com.

Everyone is asked to read and respect thecode of conduct when participating in this project.

❤️ Acknowledgments

This material is based upon work supported by theNational Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

The installation of Spiral includes a dictionary containing words from theNTLK project, specifically thewords andwordnet dictionaries. Theword dictionary is public domain, but thewordnet dictionary is copyright 2006 by Princeton University and made available underlicense terms that permit free redistribution. For more information, please see Princeton University"About WordNet" (Princeton University, 2010).

The vector artwork of the segmented spiral illusion at the top of this page was obtained fromwww.123freevectors.com.

             

[8]ページ先頭

©2009-2025 Movatter.jp