Source Code

This notebook covers how to load source code files using language parsing: each top-level function and class in the code is loaded into a separate document. Any remaining top-level code outside the already loaded functions and classes is loaded into its own document.

This approach can potentially improve the accuracy of QA models over source code.

The supported languages for code parsing are:

C (*)
C++ (*)
C# (*)
COBOL
Elixir
Go (*)
Java (*)
JavaScript (requires package esprima)
Kotlin (*)
Lua (*)
Perl (*)
Python
Ruby (*)
Rust (*)
Scala (*)
TypeScript (*)

Items marked with (*) require the packages tree_sitter and tree_sitter_languages. It is straightforward to add support for additional languages using tree_sitter, although this currently requires modifying LangChain.

The language used for parsing can be configured, along with the minimum number of lines required to activate the splitting based on syntax.

If a language is not explicitly specified, LanguageParser will infer one from filename extensions, if present.
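
As a rough illustration of extension-based inference, the sketch below maps a file suffix to a language name. It is a minimal sketch, not LangChain's code: EXTENSION_TO_LANGUAGE and infer_language are hypothetical names, and the table holds only a few sample entries.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-language table for illustration only;
# the mapping LangChain actually uses may contain different entries.
EXTENSION_TO_LANGUAGE = {
    "py": "python",
    "js": "js",
    "go": "go",
    "rs": "rust",
}


def infer_language(source: str) -> Optional[str]:
    """Return a language name for a file path, or None if none can be inferred."""
    suffix = Path(source).suffix.lstrip(".").lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


print(infer_language("example_data/source_code/example.py"))  # python
print(infer_language("example_data/source_code/example.js"))  # js
print(infer_language("README"))  # None
```

A file without an extension yields None here, which is why passing the language explicitly can be useful for such files.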

%pip install -qU esprima tree_sitter tree_sitter_languages

import warnings

warnings.filterwarnings("ignore")
from pprint import pprint

from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import Language

API Reference: GenericLoader | LanguageParser | Language

loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".py", ".js"],
    parser=LanguageParser(),
)
docs = loader.load()

len(docs)

for document in docs:
    pprint(document.metadata)

{'content_type': 'functions_classes',
 'language': <Language.PYTHON: 'python'>,
 'source': 'example_data/source_code/example.py'}
{'content_type': 'functions_classes',
 'language': <Language.PYTHON: 'python'>,
 'source': 'example_data/source_code/example.py'}
{'content_type': 'simplified_code',
 'language': <Language.PYTHON: 'python'>,
 'source': 'example_data/source_code/example.py'}
{'content_type': 'functions_classes',
 'language': <Language.JS: 'js'>,
 'source': 'example_data/source_code/example.js'}
{'content_type': 'functions_classes',
 'language': <Language.JS: 'js'>,
 'source': 'example_data/source_code/example.js'}
{'content_type': 'simplified_code',
 'language': <Language.JS: 'js'>,
 'source': 'example_data/source_code/example.js'}

print("\n\n--8<--\n\n".join([document.page_content for document in docs]))

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

--8<--

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

--8<--

# Code for: class MyClass:


# Code for: def main():


if __name__ == "__main__":
    main()

--8<--

class MyClass {
  constructor(name) {
    this.name = name;
  }

  greet() {
    console.log(`Hello, ${this.name}!`);
  }
}

--8<--

function main() {
  const name = prompt("Enter your name:");
  const obj = new MyClass(name);
  obj.greet();
}

--8<--

// Code for: class MyClass {

// Code for: function main() {

main();
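
For Python sources, the split shown in the output above can be approximated with the standard ast module. The sketch below is an illustration only and should not be read as LangChain's implementation: split_top_level and SOURCE are hypothetical names, and decorators or comments placed above a definition are not handled.

```python
import ast
from typing import List, Optional, Tuple

SOURCE = '''class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")


def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()


if __name__ == "__main__":
    main()
'''


def split_top_level(source: str) -> Tuple[List[str], str]:
    """Return (top-level definitions, remaining code with placeholders)."""
    lines = source.splitlines()
    simplified: List[Optional[str]] = list(lines)
    definitions: List[str] = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno - 1, node.end_lineno
            definitions.append("\n".join(lines[start:end]))
            # Replace the definition with a one-line placeholder comment
            # and drop the rest of its body from the simplified code.
            simplified[start] = f"# Code for: {lines[start]}"
            for i in range(start + 1, end):
                simplified[i] = None
    simplified_code = "\n".join(line for line in simplified if line is not None)
    return definitions, simplified_code


definitions, simplified_code = split_top_level(SOURCE)
print(len(definitions))
print(simplified_code)
```

Each entry in definitions corresponds to one functions_classes document, and simplified_code corresponds to the simplified_code document.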

The parser can be disabled for small files.

The parameter parser_threshold indicates the minimum number of lines that the source code file must have to be segmented using the parser.

loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".py"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=1000),
)
docs = loader.load()

len(docs)

print(docs[0].page_content)

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")


def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()


if __name__ == "__main__":
    main()
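
The effect of parser_threshold can be pictured as a line-count check that runs before segmentation. This is a minimal sketch under assumptions: load_with_threshold and split_on_blank_lines are hypothetical helpers, and the boundary behavior (whether a file with exactly parser_threshold lines is segmented) is inferred from the description above rather than taken from LangChain's code.

```python
from typing import Callable, List


def load_with_threshold(
    source: str,
    segment: Callable[[str], List[str]],
    parser_threshold: int = 0,
) -> List[str]:
    """Segment source only if it has at least parser_threshold lines."""
    line_count = len(source.splitlines())
    if line_count < parser_threshold:
        # Too short: keep the whole file as a single document.
        return [source]
    return segment(source)


def split_on_blank_lines(source: str) -> List[str]:
    """Hypothetical stand-in segmenter: one chunk per blank-line-separated block."""
    return [block for block in source.split("\n\n") if block.strip()]


sample = "def a():\n    return 1\n\ndef b():\n    return 2\n"

# With a high threshold the short file stays whole; with 0 it is segmented.
print(len(load_with_threshold(sample, split_on_blank_lines, parser_threshold=1000)))
print(len(load_with_threshold(sample, split_on_blank_lines, parser_threshold=0)))
```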

Splitting

Functions, classes, or scripts that are still too large after parsing may need additional splitting.

loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".js"],
    parser=LanguageParser(language=Language.JS),
)
docs = loader.load()

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

API Reference: Language | RecursiveCharacterTextSplitter

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)

result = js_splitter.split_documents(docs)

len(result)

print("\n\n--8<--\n\n".join([document.page_content for document in result]))

class MyClass {
  constructor(name) {
    this.name = name;

--8<--

}

--8<--

greet() {
    console.log(`Hello, ${this.name}!`);
  }
}

--8<--

function main() {
  const name = prompt("Enter your name:");

--8<--

const obj = new MyClass(name);
  obj.greet();
}

--8<--

// Code for: class MyClass {

// Code for: function main() {

--8<--

main();
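
One way to picture what chunk_size controls is to think of the splitter as packing pieces of text into chunks that stay within a character budget. The sketch below is a deliberately simplified stand-in that packs whole lines greedily. It is not the algorithm used by RecursiveCharacterTextSplitter, which works through a list of language-specific separators, and pack_lines and js_source are hypothetical names.

```python
from typing import List


def pack_lines(text: str, chunk_size: int) -> List[str]:
    """Greedily pack whole lines into chunks of at most chunk_size characters.

    A single line longer than chunk_size is kept as its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in text.splitlines():
        # The extra 1 accounts for the newline joining this line to the previous one.
        added = len(line) + (1 if current else 0)
        if current and current_len + added > chunk_size:
            chunks.append("\n".join(current).strip())
            current, current_len = [], 0
            added = len(line)
        current.append(line)
        current_len += added
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only whitespace.
    return [chunk for chunk in chunks if chunk]


js_source = (
    "function main() {\n"
    '  const name = prompt("Enter your name:");\n'
    "  const obj = new MyClass(name);\n"
    "  obj.greet();\n"
    "}"
)

for chunk in pack_lines(js_source, chunk_size=60):
    print(chunk)
    print("--8<--")
```

Raising chunk_size lets more lines fit into each chunk, so fewer and larger documents are produced.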

Adding Languages using Tree-sitter Template

Expanding language support using the Tree-Sitter template involves a few essential steps:

Creating a New Language File:
- Begin by creating a new file in the designated directory (langchain/libs/community/langchain_community/document_loaders/parsers/language).
- Model this file on the structure and parsing logic of existing language files such as cpp.py.
- You will also need to create a file in the langchain directory (langchain/libs/langchain/langchain/document_loaders/parsers/language).
Parsing Language Specifics:
- Mimic the structure used in the cpp.py file, adapting it to suit the language you are incorporating.
- The primary alteration involves adjusting the chunk query array to suit the syntax and structure of the language you are parsing.
Testing the Language Parser:
- For thorough validation, generate a test file specific to the new language. Create test_language.py in the designated directory (langchain/libs/community/tests/unit_tests/document_loaders/parsers/language).
- Follow the example set by test_cpp.py to establish fundamental tests for the parsed elements in the new language.
Integration into the Parser and Text Splitter:
- Incorporate your new language within the language_parser.py file. Be sure to update LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS, along with the docstring for LanguageParser, so that the added language is recognized and handled.
- Also, confirm that your language is included in text_splitter.py in class Language for proper parsing.
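
The steps above can be sketched in miniature as follows. Everything here is hypothetical and self-contained: SegmenterTemplate is a stand-in base class rather than LangChain's actual Tree-sitter base class, PHP is used only as a sample language, the node names in the chunk query are assumptions that should be checked against the grammar you use, and the two dictionaries merely mirror the mapping names mentioned in the steps.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class SegmenterTemplate(ABC):
    """Stand-in base class for this sketch; LangChain's real base class differs."""

    @abstractmethod
    def get_chunk_query(self) -> str:
        """Return the Tree-sitter query that selects top-level chunks."""

    @abstractmethod
    def make_line_comment(self, text: str) -> str:
        """Format text as a single-line comment in the target language."""


# Assumed node names for a PHP grammar; verify them before relying on this query.
PHP_CHUNK_QUERY = """
    [
        (function_definition) @function
        (class_declaration) @class
    ]
""".strip()


class PHPSegmenter(SegmenterTemplate):
    """Hypothetical segmenter for PHP, modeled on the steps described above."""

    def get_chunk_query(self) -> str:
        return PHP_CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"


# Hypothetical registration tables: extension -> language, language -> segmenter.
LANGUAGE_EXTENSIONS: Dict[str, str] = {"php": "php"}
LANGUAGE_SEGMENTERS: Dict[str, Type[SegmenterTemplate]] = {"php": PHPSegmenter}

segmenter = LANGUAGE_SEGMENTERS[LANGUAGE_EXTENSIONS["php"]]()
print(segmenter.make_line_comment("Code for: function main() {"))
```

The language-specific parts are small: a chunk query naming the syntax nodes to extract, and the comment format used for the placeholders in the simplified code.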

By following these steps and ensuring comprehensive testing and integration, you'll successfully extend language support using the Tree-Sitter template.

Best of luck!

Related

- Document loader conceptual guide
- Document loader how-to guides
