|
| 1 | +Module pgml.collection |
| 2 | +====================== |
| 3 | + |
| 4 | +Variables |
| 5 | +--------- |
| 6 | + |
| 7 | + |
| 8 | +`log` |
| 9 | +: Collection class to store tables for documents, chunks, models, splitters, and embeddings |
| 10 | + |
| 11 | +Classes |
| 12 | +------- |
| 13 | + |
| 14 | +`Collection(pool: psycopg_pool.pool.ConnectionPool, name: str)` |
| 15 | +: Initializes a collection with a connection pool and a name, then creates several tables |
| 16 | + and registers a text splitter and a model. |
| 17 | + |
| 18 | +:param pool: An instance of the `ConnectionPool` class, which manages a pool of database |
| 19 | + connections |
| 20 | +:type pool: ConnectionPool |
| 21 | +:param name: A string naming the object being initialized. It is used as an identifier for |
| 22 | + the object within the code |
| 23 | +:type name: str |
| 24 | + |
| 25 | +### Methods |
| 26 | + |
| 27 | +`generate_chunks(self, splitter_id: int = 1) ‑> None` |
| 28 | +: Generates chunks of text from unchunked documents using the specified text splitter. |
| 29 | + |
| 30 | + :param splitter_id: The ID of the splitter to use for generating chunks, defaults to 1 |
| 31 | + :type splitter_id: int (optional) |
| 32 | + |
| 33 | +`generate_embeddings(self, model_id: Optional[int] = 1, splitter_id: Optional[int] = 1) ‑> None` |
| 34 | +: Generates embeddings for chunks of text using the specified model and inserts them into |
| 35 | + a database table. |
| 36 | + |
| 37 | + :param model_id: The ID of the model to use for generating embeddings, defaults to 1 |
| 38 | + :type model_id: Optional[int] (optional) |
| 39 | + :param splitter_id: The ID of the splitter to use for generating embeddings, |
| 40 | + defaults to 1 |
| 41 | + :type splitter_id: Optional[int] (optional) |
| 42 | + |
| 43 | +`get_models(self) ‑> List[Dict[str, Any]]` |
| 44 | +: Retrieves information about models from a database |
| 45 | + table. |
| 46 | + :return: A list of dictionaries, where each dictionary represents a model and contains |
| 47 | + the keys "id", "task", "name", and "parameters". The values associated with these |
| 48 | + keys correspond to the respective fields in the database table specified by |
| 49 | + `self.models_table`. |
| 50 | + |
| 51 | +`get_text_splitters(self) ‑> List[Dict[str, Any]]` |
| 51 | +: Retrieves information about text splitters from a |
| 52 | + database. |
| 53 | + :return: A list of dictionaries, where each dictionary contains the `id`, `name`, and |
| 54 | + `parameters` of a text splitter. |
| 56 | + |
| 57 | +`register_model(self, task: Optional[str] = 'embedding', model_name: Optional[str] = 'intfloat/e5-small', model_params: Optional[Dict[str, Any]] = {}) ‑> None` |
| 58 | +: Registers a model in a database if it does not already exist. |
| 59 | + |
| 60 | + :param task: The type of task the model is being registered for, |
| 61 | + defaults to embedding |
| 62 | + :type task: Optional[str] (optional) |
| 63 | + :param model_name: The name of the model being registered, defaults to intfloat/e5-small |
| 64 | + :type model_name: Optional[str] (optional) |
| 65 | + :param model_params: A dictionary of parameters for the model being registered. These |
| 66 | + parameters can be used to configure the model for a specific task. The dictionary can be |
| 67 | + empty if no parameters are needed |
| 68 | + :type model_params: Optional[Dict[str, Any]] |
| 69 | + :return: the id of the registered model. |
| 70 | + |
| 71 | +`register_text_splitter(self, splitter_name: Optional[str] = 'RecursiveCharacterTextSplitter', splitter_params: Optional[Dict[str, Any]] = {}) ‑> None` |
| 72 | +: Registers a text splitter with the given name and parameters in a database table if it |
| 73 | + does not already exist. |
| 74 | + |
| 75 | + :param splitter_name: The name of the text splitter being registered. It is an optional parameter, |
| 76 | + defaults to |
| 77 | + RecursiveCharacterTextSplitter |
| 78 | + :type splitter_name: Optional[str] (optional) |
| 79 | + :param splitter_params: A dictionary of parameters for the text splitter. These |
| 80 | + parameters can be used to customize the behavior of the text splitter. This argument |
| 81 | + is optional; if it is not provided, an empty dictionary is used |
| 82 | + as the default value |
| 83 | + :type splitter_params: Optional[Dict[str, Any]] |
| 84 | + :return: the id of the splitter that was either found in the database or inserted into the database. |
| 85 | + |
| 86 | +`upsert_documents(self, documents: List[Dict[str, Any]], text_key: Optional[str] = 'text', id_key: Optional[str] = 'id') ‑> None` |
| 87 | +: Inserts or updates documents in a database table based on their ID, |
| 88 | + text, and metadata. |
| 89 | + |
| 90 | + :param documents: A list of dictionaries, where each dictionary represents a document to be upserted |
| 91 | + into a database table. Each dictionary should contain metadata about the document, as well as the |
| 92 | + actual text of the document |
| 93 | + :type documents: List[Dict[str, Any]] |
| 94 | + :param text_key: The key in the dictionary that corresponds to the text of the document, defaults to |
| 95 | + text |
| 96 | + :type text_key: Optional[str] (optional) |
| 97 | + :param id_key: The key in the dictionary of each document that contains the unique |
| 98 | + identifier for that document. If this key is present in the dictionary, its value is |
| 99 | + used as the document ID. If it is not present, a hash of the document is used |
| 100 | + instead, defaults to id |
| 101 | + :type id_key: Optional[str] (optional) |
| 102 | + :param verbose: A boolean parameter that determines whether or not to print verbose output during |
| 103 | + the upsert process. If set to True, additional information will be printed to the console during the |
| 104 | + upsert process. If set to False, only essential information will be printed, defaults to False |
| 105 | + |
| 106 | +`vector_search(self, query: str, query_parameters: Optional[Dict[str, Any]] = {}, top_k: int = 5, model_id: int = 1, splitter_id: int = 1) ‑> List[Dict[str, Any]]` |
| 107 | +: Performs a vector search on a database using a query and returns the top matching |
| 108 | + results. |
| 109 | + |
| 110 | + :param query: The search query string |
| 111 | + :type query: str |
| 112 | + :param query_parameters: Optional dictionary of additional parameters to be used in generating |
| 113 | + the query embeddings. These parameters are specific to the model being used and can be used to |
| 114 | + fine-tune the search results. If no parameters are provided, default values will be used |
| 115 | + :type query_parameters: Optional[Dict[str, Any]] |
| 116 | + :param top_k: The number of search results to return, sorted by relevance score, defaults to 5 |
| 117 | + :type top_k: int (optional) |
| 118 | + :param model_id: The ID of the model to use for generating embeddings, defaults to 1 |
| 119 | + :type model_id: int (optional) |
| 120 | + :param splitter_id: The ID of the splitter used to split the documents into chunks. |
| 121 | + It is used to retrieve the embeddings table associated with the specified splitter, |
| 122 | + defaults to 1 |
| 123 | + :type splitter_id: int (optional) |
| 124 | + :return: a list of dictionaries containing search results for a given query. Each dictionary |
| 125 | + contains the following keys: "score", "text", and "metadata". The "score" key contains a float |
| 126 | + value representing the similarity score between the query and the search result. The "text" key |
| 127 | + contains the text of the search result, and the "metadata" key contains any metadata associated |
| 128 | + with the search result |
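
The method descriptions above suggest a natural call order: upsert documents, chunk them, embed the chunks, then search. The following is a minimal sketch of that order, not a confirmed usage pattern from the library. `index_and_search` and `RecordingCollection` are hypothetical names introduced only for illustration; the stand-in class mimics the documented method signatures and result keys so that the sketch runs without a database, and it performs no real chunking, embedding, or scoring.

```python
from typing import Any, Dict, List, Optional


def index_and_search(
    collection: Any,
    documents: List[Dict[str, Any]],
    query: str,
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: call the documented methods in pipeline order."""
    collection.upsert_documents(documents, text_key="text", id_key="id")
    collection.generate_chunks(splitter_id=1)
    collection.generate_embeddings(model_id=1, splitter_id=1)
    return collection.vector_search(query, top_k=top_k, model_id=1, splitter_id=1)


class RecordingCollection:
    """Hypothetical stand-in, NOT part of pgml.

    It records which methods were called and returns placeholder results
    shaped like the documented return value of `vector_search`.
    """

    def __init__(self) -> None:
        self.calls: List[str] = []
        self.documents: List[Dict[str, Any]] = []

    def upsert_documents(self, documents, text_key="text", id_key="id") -> None:
        self.calls.append("upsert_documents")
        self.documents = list(documents)

    def generate_chunks(self, splitter_id: int = 1) -> None:
        self.calls.append("generate_chunks")

    def generate_embeddings(self, model_id: int = 1, splitter_id: int = 1) -> None:
        self.calls.append("generate_embeddings")

    def vector_search(
        self,
        query: str,
        query_parameters: Optional[Dict[str, Any]] = None,
        top_k: int = 5,
        model_id: int = 1,
        splitter_id: int = 1,
    ) -> List[Dict[str, Any]]:
        self.calls.append("vector_search")
        # Placeholder: no similarity is computed; the first top_k documents
        # are returned in insertion order with a dummy score.
        return [
            {
                "score": 0.0,
                "text": doc["text"],
                "metadata": {k: v for k, v in doc.items() if k != "text"},
            }
            for doc in self.documents[:top_k]
        ]


stub = RecordingCollection()
results = index_and_search(
    stub,
    [
        {"id": 1, "text": "Embeddings are stored in a database table."},
        {"id": 2, "text": "Chunks are generated by a text splitter."},
    ],
    query="where are embeddings stored?",
    top_k=1,
)
```

With a real `Collection(pool, name)` in place of the stand-in, the same helper would presumably apply unchanged, assuming the defaults of `model_id=1` and `splitter_id=1` refer to the model and splitter registered at construction time.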