huggingface/doc-builder

License

Apache-2.0 license

doc-builder

This is the package we use to build the documentation of our Hugging Face repos.

Installation

You can install from PyPI with

pip install hf-doc-builder

To install from source, clone this repository then

cd doc-builder
pip install -e .

Previewing

To preview the docs, use the following command:

doc-builder preview {package_name} {path_to_docs}

For example:

doc-builder preview datasets ~/Desktop/datasets/docs/source/

Note: the preview command only works with existing doc files. When you add a completely new file, you need to update _toctree.yml & restart the preview command (Ctrl-C to stop it & call doc-builder preview ... again).

Note: the preview command does not work on Windows.

Doc building

To build the documentation of a given package, use the following command:

# Add --not_python_module if not building doc for a python lib
doc-builder build {package_name} {path_to_docs} --build_dir {build_dir}

For instance, here is how you can build the Datasets documentation (requires pip install datasets[dev]) if you have cloned the repo in ~/git/datasets:

doc-builder build datasets ~/git/datasets/docs/source --build_dir ~/tmp/test-build

This will generate MDX files that you can preview like any Markdown file in your favorite editor. To have a look at the documentation in HTML, you need to install node version 14 or higher. Then you can run (still with the example on Datasets)

doc-builder build datasets ~/git/datasets/docs/source --build_dir ~/tmp/test-build --html

which will build HTML files in ~/tmp/test-build. You can then inspect those files in your browser.

doc-builder can also automatically convert some of the documentation guides or tutorials into notebooks. This requires two steps:

add [[open-in-colab]] in the tutorial for which you want to build a notebook
add --notebook_dir {path_to_notebook_folder} to the build command.

Writing in notebooks

You can write your docs in Jupyter notebooks & use doc-builder to turn them into mdx files.

In some situations, such as courses & tutorials, it makes more sense to write in Jupyter notebooks (& use the doc-builder converter) rather than writing in mdx files directly.

The process is:

In your build_main_documentation.yml & build_pr_documentation.yml enable the flag convert_notebooks: true.
After this flag is enabled, doc-builder will convert all .ipynb files in path_to_docs to mdx files.

Moreover, you can locally convert .ipynb files into mdx files.

doc-builder notebook-to-mdx {path to notebook file or folder containing notebook files}

Templates for GitHub Actions

doc-builder provides templates for GitHub Actions, so you can build your documentation with every pull request, push to some branch etc. To use them in your project, simply create the following files in the .github/workflows/ directory:

build_main_documentation.yml: responsible for building the docs for the main branch, releases etc.
build_pr_documentation.yml: responsible for building the docs on each PR.
upload_pr_documentation.yml: responsible for uploading the PR artifacts to the Hugging Face Hub.
delete_doc_comment_trigger.yml: responsible for removing the comments from the HuggingFaceDocBuilder bot that provides a URL to the PR docs.

Within each workflow, the main thing to include is a pointer from the uses field to the corresponding workflow in doc-builder. For example, this is what the PR workflow looks like in the datasets library:

name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main # Runs this doc-builder workflow
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: datasets # Replace this with your package name

Note the use of special arguments like pr_number and package under the with field. You can find the various options by inspecting each of the doc-builder workflow files.

Enabling multilingual documentation

doc-builder can also convert documentation that's been translated from the English source into one or more languages. To enable the conversion, the documentation directories should be structured as follows:

doc_folder
├── en
│   ├── _toctree.yml
│   ├── _redirects.yml
│   ...
└── es
    ├── _toctree.yml
    ├── _redirects.yml
    ...

Note that each language directory has its own table of contents file _toctree.yml and that all languages are arranged under a single doc_folder directory - see the course repo for an example. You can then build the individual language subsets as follows:

doc-builder build {package_name} {path_to_docs} --build_dir {build_dir} --language {lang_id}

To automatically build the documentation for all languages via the GitHub Actions templates, simply provide the languages argument to your workflow, with a space-separated list of the languages you wish to build, e.g. languages: en es.
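
When building several languages locally, the per-language command can be generated in a loop. The sketch below only assembles the build command shown above for each language; the helper name build_commands and the example arguments are hypothetical and not part of doc-builder.

```python
import shlex


def build_commands(package_name, path_to_docs, build_dir, languages):
    """Return one `doc-builder build` argument list per language."""
    commands = []
    for lang_id in languages:
        commands.append([
            "doc-builder", "build", package_name, path_to_docs,
            "--build_dir", build_dir,
            "--language", lang_id,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical package name and paths, used for illustration only.
    for cmd in build_commands("my_package", "doc_folder", "build", ["en", "es"]):
        # Print each command; pass the list to subprocess.run(cmd, check=True) to run it.
        print(shlex.join(cmd))
```

Returning argument lists rather than strings keeps the commands safe to hand to subprocess without shell quoting.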

Redirects

You can optionally provide _redirects.yml for "old links". The yml file should look like:

how_to: getting_started
package_reference/classes: package_reference/main_classes
# old_local: new_local
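
To make the old_local: new_local convention concrete, here is a minimal sketch of how such a mapping can be read. It is not how doc-builder applies redirects internally; the resolve helper is hypothetical and the dictionary simply mirrors the example file.

```python
# Mirrors the _redirects.yml example: old_local -> new_local.
REDIRECTS = {
    "how_to": "getting_started",
    "package_reference/classes": "package_reference/main_classes",
}


def resolve(local, redirects=REDIRECTS):
    """Return the new page for an old link, or the input unchanged if no redirect exists."""
    return redirects.get(local, local)


if __name__ == "__main__":
    print(resolve("how_to"))
    print(resolve("index"))
```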

Fixing and testing doc-builder

If you are working on a fix or an update of the doc-builder tool itself, you will eventually want to test it in the CI of another repository (transformers, diffusers, courses, etc.). To do so you should set the doc_builder_revision argument in your workflow file to point to your branch. Here is an example of what it would look like in the transformers.js project:

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@my-test-branch
    with:
      repo_owner: xenova
      commit_sha: ${{ github.sha }}
      pr_number: ${{ github.event.number }}
      package: transformers.js
      path_to_docs: transformers.js/docs/source
      pre_command: cd transformers.js && npm install && npm run docs-api
      additional_args: --not_python_module
      doc_builder_revision: my-test-branch # <- add this line

Once the docs build is complete in your project, you can drop that change.

Writing documentation for Hugging Face libraries

doc-builder expects Markdown so you should write any new documentation in ".mdx" files for tutorials, guides, API documentation. For docstrings, we follow the Google format with the main difference that you should use Markdown instead of restructured text (hopefully, that will be easier!)

Values that should be put in code should be surrounded by backticks: `like so`. Note that argument names and objects like True, None or any strings should usually be put in code.

Multi-line code blocks can be useful for displaying examples. They are done between two lines of three backticks as usual in Markdown:

```
# first line of code
# second line
# etc
```

We follow the doctest syntax for the examples to automatically test that the results stay consistent with the library.
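
As a minimal illustration of the doctest syntax, the docstring below contains one example that Python's standard doctest module can check. The function count_tokens is hypothetical and only serves the example.

```python
import doctest


def count_tokens(text):
    """
    Count the whitespace-separated tokens in `text`.

    Example:

    >>> count_tokens("Hello world")
    2
    """
    return len(text.split())


if __name__ == "__main__":
    # Runs every `>>>` example found in this module's docstrings.
    print(doctest.testmod())
```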

Internal link to object

Syntax:

[`XXXClass`] or [`~XXXClass`] // for class
[`XXXClass.method`] or [`~XXXClass.method`] // for method

Example: here & here (as used inside docstring).

When mentioning a class, function or method, it is recommended to use the following syntax for internal links so that our tool automatically adds a link to its documentation: [`XXXClass`] or [`function`]. This requires the class or function to be in the main package.

If you want to create a link to some internal class or function, you need to provide its path. For instance, in the Transformers documentation [`file_utils.ModelOutput`] will create a link to the documentation of ModelOutput. This link will have file_utils.ModelOutput in the description. To get rid of the path and only keep the name of the object you are linking to in the description, add a ~: [`~file_utils.ModelOutput`] will generate a link with ModelOutput in the description.

The same works for methods, so you can either use [`XXXClass.method`] or [`~XXXClass.method`].
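
The effect of the ~ prefix on the link description can be sketched as follows. This is an illustration of the rule described above, not doc-builder's own parser; the name link_text and the regular expression are assumptions.

```python
import re

# Matches [`Name`] and [`~Name`]; the optional tilde is captured separately.
LINK_RE = re.compile(r"\[`(~?)([\w.]+)`\]")


def link_text(reference):
    """Return the description shown for an object link written as [`...`]."""
    match = LINK_RE.fullmatch(reference)
    if match is None:
        raise ValueError(f"not an object link: {reference!r}")
    tilde, path = match.groups()
    # With a tilde, only the last component of the dotted path is kept.
    return path.rsplit(".", 1)[-1] if tilde else path


if __name__ == "__main__":
    print(link_text("[`file_utils.ModelOutput`]"))
    print(link_text("[`~file_utils.ModelOutput`]"))
```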

External link to object

Syntax:

[`XXXLibrary.XXXClass`] or [`~XXXLibrary.XXXClass`] // for class
[`XXXLibrary.XXXClass.method`] or [`~XXXLibrary.XXXClass.method`] // for method

Example: here linking object from accelerate inside transformers.

Tip

To write a block that you'd like to see highlighted as a note or warning, place your content between the following markers.

Syntax:

> [!TIP]
> Here is a tip. Go to this url [website](www.tip.com)
>
> Second line

<Tip>Write your note here</Tip>

Example: here

For warnings, change the introduction to:

Syntax:

> [!WARNING]

`<Tip warning={true}>`

Example: here

Framework Content

If your documentation has a block that is framework-dependent (PyTorch vs TensorFlow vs Flax), you can use the following syntax:

Syntax:

<frameworkcontent>
<pt>
PyTorch content goes here
</pt>
<tf>
TensorFlow content goes here
</tf>
<flax>
Flax content goes here
</flax>
</frameworkcontent>

Example: here

Note: all frameworks are optional (you can write a PyTorch-only block for instance) and the order does not matter.

Options

Show alternatives (for example, code blocks for different versions of a library) so that a user can select an option and see its content:

Syntax:

<hfoptions id="some id">
<hfoption id="id for option 1">
{YOUR MARKDOWN}
</hfoption>
<hfoption id="id for option 2">
{YOUR MARKDOWN}
</hfoption>
... however many <hfoption> tags
</hfoptions>

Example: here

Note: for multiple <hfoptions> on the same page, you may consider using the same id so that when a user selects one option it affects all other hfoptions blocks. If you don't want this behaviour, use different ids.

Anchor link

Anchor links for markdown headings are generated automatically (with the following rule: 1. lowercase, 2. replace space with dash -, 3. strip [^a-z0-9-]):

Syntax:

## My awesome section
// the anchor link is: `my-awesome-section`

Example: here

Moreover, there is a way to customize the anchor link.

Syntax:

## My awesome section[[some-section]]
// the anchor link is: `some-section`

Example: here
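
The three-step rule and the [[...]] override can be sketched in a few lines. This is an illustration of the rule as stated above, not doc-builder's actual implementation; the function name anchor_for is hypothetical.

```python
import re


def anchor_for(heading):
    """Return the anchor for a Markdown heading line such as '## My section'."""
    title = heading.lstrip("#").strip()
    # A trailing [[custom-id]] overrides the generated anchor.
    custom = re.search(r"\[\[([^\]]+)\]\]\s*$", title)
    if custom:
        return custom.group(1)
    slug = title.lower()                    # 1. lowercase
    slug = slug.replace(" ", "-")           # 2. replace space with dash
    return re.sub(r"[^a-z0-9-]", "", slug)  # 3. strip everything else


if __name__ == "__main__":
    print(anchor_for("## My awesome section"))
    print(anchor_for("## My awesome section[[some-section]]"))
```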

LaTeX

LaTeX display mode: $$...$$

Syntax:

$$Y = X * \textbf{dequantize}(W); \text{quantize}(W)$$

Example: here

LaTeX inline mode: \\( ... )\\

Syntax:

\\( Y = X * \textbf{dequantize}(W); \text{quantize}(W) )\\

Example: here

Code Blocks

Code blocks are written using a regular markdown syntax ```. However, there is a special flag you can put in your mdx files to change the wrapping style of the resulting html from overflow/scrollbar to wrap.

Syntax:

<!-- WRAP CODE BLOCKS -->

Example: here

Inference Snippet

The InferenceSnippet component is used to render an interactive interface for AI model inference. It uses huggingface/huggingface.js under the hood to get the snippets.

Props

Below is a description of the props that can be passed to this component:

pipeline (string, required):
Specifies the type of pipeline to be used for inference. Common values include "text-generation", "text-classification", etc.
providersMapping (mapping of {modelId: string, providerModelId: string}, required):
A mapping whose keys are provider names and whose values are objects with modelId and providerModelId. Example: {"fireworks-ai": {modelId: "deepseek-ai/DeepSeek-R1", providerModelId: "accounts/fireworks/models/deepseek-r1"}, novita: {modelId: "deepseek-ai/DeepSeek-V3-0324", providerModelId: "deepseek/deepseek-v3-0324"}}
conversational (boolean, optional):
If set to true, the component will enable conversational mode, allowing for multi-turn interactions for text-generation models.

Example Usage

<InferenceSnippet
  pipeline="text-generation"
  conversational
  providersMapping={{
    "fireworks-ai": {modelId: "deepseek-ai/DeepSeek-R1", providerModelId: "accounts/fireworks/models/deepseek-r1"},
    novita: {modelId: "deepseek-ai/DeepSeek-V3-0324", providerModelId: "deepseek/deepseek-v3-0324"}
  }}
/>

<InferenceSnippet
  pipeline="text-generation"
  conversational
  providersMapping={{
    "fireworks-ai": {modelId: "deepseek-ai/DeepSeek-R1", providerModelId: "accounts/fireworks/models/deepseek-r1"}
  }}
/>

<InferenceSnippet
  pipeline="text-to-image"
  providersMapping={{
    "black-forest-labs": {modelId: "black-forest-labs/FLUX.1-dev", providerModelId: "flux-dev"},
    "replicate": {modelId: "black-forest-labs/FLUX.1-dev", providerModelId: "black-forest-labs/flux-dev"},
    "fal-ai": {modelId: "black-forest-labs/FLUX.1-dev", providerModelId: "fal-ai/flux/dev"},
  }}
/>

Adding new inference provider

Step 1: get the latest huggingface/huggingface.js by running the commands below:

cd kit
npm run update-inference-providers

Step 2: add an icon for the new provider in kit/src/lib/InferenceSnippet/InferenceSnippet.svelte.

Writing API documentation (Python)

Autodoc

To show the full documentation of any object of the python library you are documenting, use the [[autodoc]] marker.

Syntax:

[[autodoc]] SomeObject

Example: here

If the object is a class, this will include every public method of it that is documented. If for some reason you wish for a method not to be displayed in the documentation, you can do so by specifying which methods should be in the docs. Here is an example:

Syntax:

[[autodoc]] XXXTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

Example: here

If you just want to add a method that is not documented (for instance, magic methods like __call__ are not documented by default) you can put the list of methods to add in a list that contains all:

Syntax:

## XXXTokenizer

[[autodoc]] XXXTokenizer
    - all
    - __call__

Example: here

Code Blocks from file references

You can create a code-block by referencing a file excerpt with <literalinclude> (sphinx-inspired) syntax. There should be json between <literalinclude> open & close tags.

Syntax:

<literalinclude>
{"path": "./data/convert_literalinclude_dummy.txt", # relative path
"language": "python", # defaults to "" (empty str)
"start-after": "START python_import",  # defaults to start of file
"end-before": "END python_import",  # defaults to end of file
"dedent": 7 # defaults to 0
}
</literalinclude>
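
To give an idea of what the start-after, end-before and dedent options select, here is a rough sketch. It assumes semantics similar to Sphinx's literalinclude (the marker lines themselves are excluded and dedent drops that many leading characters per line); doc-builder's real behaviour may differ, and the function name extract_excerpt is hypothetical.

```python
def extract_excerpt(text, start_after=None, end_before=None, dedent=0):
    """Return the lines between two marker strings, with leading characters removed."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    if start_after is not None:
        for i, line in enumerate(lines):
            if start_after in line:
                start = i + 1  # begin on the line after the marker
                break
    if end_before is not None:
        for i in range(start, len(lines)):
            if end_before in lines[i]:
                end = i  # stop just before the marker
                break
    # Assumed dedent behaviour: drop the first `dedent` characters of each line.
    return "\n".join(line[dedent:] for line in lines[start:end])


if __name__ == "__main__":
    sample = (
        "x = 0\n"
        "# START python_import\n"
        "    import os\n"
        "    import sys\n"
        "# END python_import\n"
        "y = 1"
    )
    print(extract_excerpt(sample, "START python_import", "END python_import", dedent=4))
```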

Writing source documentation

Description

For a class or function description string, use markdown with all the custom syntax of doc-builder.

Example: here

Arguments

Arguments of a function/class/method should be defined with the Args: (or Arguments: or Parameters:) prefix, followed by a line return and an indentation. The argument should be followed by its type, with its shape if it is a tensor, a colon, and its description:

Syntax:

    Args:
        n_layers (`int`): The number of layers of the model.

Example: here

If the description is too long to fit in one line, another indentation is necessary before writing the description after the argument.

Syntax:

    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
            Indices can be obtained using [`AlbertTokenizer`]. See [`~PreTrainedTokenizer.encode`] and
            [`~PreTrainedTokenizer.__call__`] for details.
            [What are input IDs?](../glossary#input-ids)

Example: here

You can check the full example it comes from here

Attributes

If a class is similar to a dataclass but its parameters do not align with the available attributes of the class, as in the example below, the Attributes heading should be rewritten as **Attributes** so that the documentation renders these properly. Otherwise it will assume that Attributes is synonymous with Parameters.

Syntax:

  class SomeClass:
      """
      Docstring
-     Attributes:
+     **Attributes**:
          - **attr_a** (`type_a`) -- Doc a
          - **attr_b** (`type_b`) -- Doc b
      """
      def __init__(self, param_a, param_b):
          ...

Parameter typing and default value

For optional arguments or arguments with defaults we follow the following syntax. Imagine we have a function with the following signature:

def my_function(x: str = None, a: float = 1):

then its documentation should look like this:

Syntax:

    Args:
        x (`str`, *optional*):
            This argument controls ...
        a (`float`, *optional*, defaults to 1):
            This argument is used to ...

Example: here

Note that we always omit the "defaults to `None`" when None is the default for any argument. Also note that even if the first line describing your argument type and its default gets long, you can't break it on several lines. You can however write as many lines as you want in the indented description (see the example above with input_ids).

If the type of your argument is a class defined in the package, you can use the syntax we saw earlier to link to its documentation:

    Args:
        config ([`BertConfig`]):
            Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
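
Put together, a complete function documented with these conventions might look like the sketch below. It reuses the my_function signature from above; the behaviour of the function body is invented purely for illustration.

```python
def my_function(x: str = None, a: float = 1):
    """
    Repeat a string a given number of times.

    Args:
        x (`str`, *optional*):
            The string to repeat. An empty string is used when it is not provided.
        a (`float`, *optional*, defaults to 1):
            The number of repetitions. The value is truncated to an integer.
    """
    # `None` is the default for `x`, so the docstring omits "defaults to `None`".
    return (x or "") * int(a)


if __name__ == "__main__":
    print(my_function("ab", 2))
```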

Returns

The return block should be introduced with the Returns: prefix, followed by a line return and an indentation. The first line should be the type of the return, followed by a line return. No need to indent further for the elements building the return.

Here's an example for a single value return:

Syntax:

    Returns:
        `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.

Example: here

Here's an example for tuple return, comprising several objects:

Syntax:

    Returns:
        `tuple(torch.FloatTensor)` comprising various elements depending on the configuration ([`BertConfig`]) and inputs:
        - **loss** (*optional*, returned when `masked_lm_labels` is provided) `torch.FloatTensor` of shape `(1,)` --
          Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
        - **prediction_scores** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) --
          Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

Example: here

Yields

Similarly, Yields is also supported.

Syntax:

Yields:
    `tuple[str, io.BufferedReader]`:
        2-tuple (path_within_archive, file_object).
        File object is opened in binary mode.

Example: here
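
For context, a Yields block sits in the docstring of a generator function. The sketch below shows one possible shape; iter_files is a hypothetical function that uses in-memory files so that it runs without any archive on disk.

```python
import io


def iter_files(files):
    """
    Iterate over in-memory files.

    Args:
        files (`dict[str, bytes]`):
            Mapping from a file path to the content of that file.

    Yields:
        `tuple[str, io.BytesIO]`:
            2-tuple (path, file_object).
            File object is opened in binary mode.
    """
    for path, content in files.items():
        yield path, io.BytesIO(content)


if __name__ == "__main__":
    for path, file_object in iter_files({"a.txt": b"hello"}):
        print(path, file_object.read())
```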

Raises

You can also document Raises.

Syntax:

    Args:
        config ([`BertConfig`]):
            Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
    Raises:
        `pa.ArrowInvalidError`: if the arrow data casting fails
        TypeError: if the target type is not supported according, e.g.
            - point1
            - point2
        [`HTTPError`](https://2.python-requests.org/en/master/api/#requests.HTTPError) if credentials are invalid
        [`HTTPError`](https://2.python-requests.org/en/master/api/#requests.HTTPError) if connection got lost
    Returns:
        `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.

Example: here

Directives for Added, Changed, Deprecated

There are directives for Added, Changed, & Deprecated.

Syntax:

    Args:
        cache_dir (`str`, *optional*): Directory to cache data.
        config_name (`str`, *optional*): Name of the dataset configuration.
            It affects the data generated on disk: different configurations will have their own subdirectories and
            versions.
            If not provided, the default configuration is used (if it exists).
            <Added version="2.3.0">
            `name` was renamed to `config_name`.
            </Added>
        name (`str`): Configuration name for the dataset.
            <Deprecated version="2.3.0">
            Use `config_name` instead.
            </Deprecated>

Example: here

Developing svelte locally

We use svelte components for doc UI (Tip component, Docstring component, etc.).

Follow these steps to develop svelte locally:

Create this file if it doesn't already exist: doc-builder/kit/src/routes/_toctree.yml. Contents should be:

- sections:
  - local: index
    title: Index page
  title: Index page

Create this file if it doesn't already exist: doc-builder/kit/src/routes/index.mdx. Contents should be whatever you'd like to test. For example:

<script lang="ts">
import Tip from "$lib/Tip.svelte";
import Youtube from "$lib/Youtube.svelte";
import Docstring from "$lib/Docstring.svelte";
import CodeBlock from "$lib/CodeBlock.svelte";
import CodeBlockFw from "$lib/CodeBlockFw.svelte";
</script>

<Tip>
  [Here](https://myurl.com)
</Tip>

## Some heading

And some text [Here](https://myurl.com)

Physics is the natural science that studies matter,[a] its fundamental constituents, its motion and behavior through space and time, and the related entities of energy and force.[2] Physics is one of the most fundamental scientific disciplines, with its main goal being to understand how the universe behaves.[b][3][4][5] A scientist who specializes in the field of physics is called a physicist.

Install dependencies & run dev mode

cd doc-builder/kit
npm ci
npm run dev -- --open

Start developing. See svelte files in doc-builder/kit/src/lib for reference. The flow should be:
1. Create a svelte component in doc-builder/kit/src/lib
2. Import it & test it in doc-builder/kit/src/routes/index.mdx