APIs
Functions for managing downloads.
def read_sql(query: str, billing_project_id: Optional[str] = None, from_file: bool = False, reauth: bool = False, use_bqstorage_api: bool = False) -> pd.DataFrame

Load data from BigQuery using a query. Just a wrapper around pandas.read_gbq.

Args

- query: Valid SQL Standard Query to basedosdados.
- billing_project_id: Project that will be billed. Find your Project ID here.
- from_file: Uses the credentials from file, located in ~/.basedosdados/credentials/.
- reauth: Re-authorize Google Cloud Project in case you need to change user or reset configurations.
- use_bqstorage_api: Use the BigQuery Storage API to download query results quickly, but at an increased cost. More info. To use this API, first enable it in the Cloud Console: Enable API. You must also have the bigquery.readsessions.create permission on the project you are billing queries to.

Returns

The query result as a pandas DataFrame.
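A minimal usage sketch, assuming the package is imported as basedosdados; the billing project is a placeholder and the query reuses the example table shown for download below:

```python
import basedosdados as bd

# Billing project is a placeholder; replace it with your own GCP project id.
df = bd.read_sql(
    query="SELECT * FROM `basedosdados.br_suporte.diretorio_municipios` LIMIT 10",
    billing_project_id="<YOUR_PROJECT_ID>",
)
print(df.head())
```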
def read_table(dataset_id: str, table_id: str, billing_project_id: Optional[str] = None, query_project_id: str = "basedosdados", limit: Optional[int] = None, from_file: bool = False, reauth: bool = False, use_bqstorage_api: bool = False) -> pd.DataFrame

Load data from BigQuery using dataset_id and table_id.

Args

- dataset_id: Dataset id available in basedosdados. It should always come with table_id.
- table_id: Table id available in basedosdados.dataset_id. It should always come with dataset_id.
- billing_project_id: Project that will be billed. Find your Project ID here.
- query_project_id: Which project the table lives in. You can change this if you want to query different projects.
- limit: Number of rows to read from the table.
- from_file: Uses the credentials from file, located in ~/.basedosdados/credentials/.
- reauth: Re-authorize Google Cloud Project in case you need to change user or reset configurations.
- use_bqstorage_api: Use the BigQuery Storage API to download query results quickly, but at an increased cost. More info. To use this API, first enable it in the Cloud Console: Enable API. You must also have the bigquery.readsessions.create permission on the project you are billing queries to.

Returns

The requested table as a pandas DataFrame.
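A minimal usage sketch of read_table; the dataset and table ids reuse the download example below, and the billing project is a placeholder:

```python
import basedosdados as bd

# Reads at most 10 rows from the public table; billing project is a placeholder.
df = bd.read_table(
    dataset_id="br_suporte",
    table_id="diretorio_municipios",
    billing_project_id="<YOUR_PROJECT_ID>",
    limit=10,
)
```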
def download(savepath: Union[str, Path], query: Optional[str] = None, dataset_id: Optional[str] = None, table_id: Optional[str] = None, billing_project_id: Optional[str] = None, query_project_id: str = "basedosdados", limit: Optional[int] = None, from_file: bool = False, reauth: bool = False, compression: str = "GZIP") -> None

Download table or query result from basedosdados BigQuery (or other).

Using a query:

download('select * from basedosdados.br_suporte.diretorio_municipios limit 10')

Using dataset_id & table_id:

download(dataset_id='br_suporte', table_id='diretorio_municipios')

You can also add arguments to modify save parameters:

download(dataset_id='br_suporte', table_id='diretorio_municipios', index=False, sep='|')
Args

- savepath: File path to save the result. Only supports .csv.
- query: Valid SQL Standard Query to basedosdados. If query is available, dataset_id and table_id are not required.
- dataset_id: Dataset id available in basedosdados. It should always come with table_id.
- table_id: Table id available in basedosdados.dataset_id. It should always come with dataset_id.
- billing_project_id: Project that will be billed. Find your Project ID here.
- query_project_id: Which project the table lives in. You can change this if you want to query different projects.
- limit: Number of rows to read.
- from_file: Uses the credentials from file, located in ~/.basedosdados/credentials/.
- reauth: Re-authorize Google Cloud Project in case you need to change user or reset configurations.
- compression: Compression type. Only GZIP is available for now.

Raises

- Exception: If table_id, dataset_id, or query is empty.
Functions to get metadata from BD's API.

def get_datasets(dataset_id: Optional[str] = None, dataset_name: Optional[str] = None, page: int = 1, page_size: int = 10, backend: Optional[Backend] = None) -> list[dict]

Get a list of available datasets, either by dataset_id or dataset_name.

Args

- dataset_id: Dataset slug in Google BigQuery (GBQ).
- dataset_name: Dataset name in Base dos Dados metadata.
- page: Page for pagination.
- page_size: Page size for pagination.
- backend: Backend instance, injected automatically.

Returns

List of dicts with the matching datasets' metadata.
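A usage sketch, assuming get_datasets is exposed at the package top level; the name filter is an illustrative term:

```python
import basedosdados as bd

# Paginate through datasets whose name matches an illustrative term.
datasets = bd.get_datasets(dataset_name="municipios", page=1, page_size=5)
for dataset in datasets:
    print(dataset)  # each entry is a dict of dataset metadata
```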
def get_tables(dataset_id: Optional[str] = None, table_id: Optional[str] = None, table_name: Optional[str] = None, page: int = 1, page_size: int = 10, backend: Optional[Backend] = None) -> list[dict]

Get a list of available tables, either by dataset_id, table_id or table_name.

Args

- dataset_id: Dataset slug in Google BigQuery (GBQ).
- table_id: Table slug in Google BigQuery (GBQ).
- table_name: Table name in Base dos Dados metadata.
- page: Page for pagination.
- page_size: Page size for pagination.
- backend: Backend instance, injected automatically.

Returns

List of dicts with the matching tables' metadata.
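A usage sketch under the same assumption; the dataset slug is illustrative:

```python
import basedosdados as bd

# List tables registered under an illustrative dataset slug.
tables = bd.get_tables(dataset_id="br_suporte", page_size=10)
for table in tables:
    print(table)  # each entry is a dict of table metadata
```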
def get_columns(table_id: Optional[str] = None, column_id: Optional[str] = None, columns_name: Optional[str] = None, page: int = 1, page_size: int = 10, backend: Optional[Backend] = None) -> list[dict]

Get a list of available columns, either by table_id, column_id or column_name.

Args

- table_id: Table slug in Google BigQuery (GBQ).
- column_id: Column slug in Google BigQuery (GBQ).
- column_name: Column name in Base dos Dados metadata.
- page: Page for pagination.
- page_size: Page size for pagination.
- backend: Backend instance, injected automatically.

Returns

List of dicts with the matching columns' metadata.
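A usage sketch, assuming the same top-level export; the table slug is illustrative:

```python
import basedosdados as bd

# Inspect column metadata for an illustrative table slug.
columns = bd.get_columns(table_id="diretorio_municipios", page_size=20)
for column in columns:
    print(column)  # each entry is a dict of column metadata
```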
def search(q: Optional[str] = None, page: int = 1, page_size: int = 10, backend: Optional[Backend] = None) -> list[dict]

Search for datasets, querying all available metadata for the term q.

Args

- q: Search term.
- page: Page for pagination.
- page_size: Page size for pagination.
- backend: Backend instance, injected automatically.

Returns

List of dicts with the search results.
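A usage sketch of the free-text search; the term is illustrative:

```python
import basedosdados as bd

# Search all available metadata for an illustrative term.
results = bd.search(q="municipios", page_size=5)
for result in results:
    print(result)  # each entry is a dict describing a matching dataset
```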
Module for managing datasets on the server.

class Dataset(Base)

Manage datasets in BigQuery.

def __init__(dataset_id: str, **kwargs)

Initializes a new instance of the class with the specified dataset ID.

Args

- dataset_id: The identifier of the dataset. Hyphens in the ID will be replaced with underscores.
- **kwargs: Additional keyword arguments to be passed to the superclass initializer.

@property
@lru_cache
def dataset_config() -> dict[str, Any]

Dataset config file.
def publicize(mode: str = "all", dataset_is_public: bool = True) -> None

Changes the IAM configuration to make the BigQuery dataset public.

Args

- mode: Which dataset to create [prod|staging|all].
- dataset_is_public: Control if the prod dataset is public or not. By default, staging datasets like dataset_id_staging are not public.

def exists(mode: str = "staging") -> bool

Check if dataset exists.
def create(mode: str = "all", if_exists: str = "raise", dataset_is_public: bool = True, location: Optional[str] = None) -> None

Creates BigQuery datasets given dataset_id.

It can create two datasets:

- <dataset_id> (mode = prod)
- <dataset_id>_staging (mode = staging)

If mode is all, it creates both.

Args

- mode: Which dataset to create [prod|staging|all].
- if_exists: What to do if the dataset exists:
  - raise: Raises Conflict exception
  - replace: Drop all tables and replace dataset
  - update: Update dataset description
  - pass: Do nothing
- dataset_is_public: Control if the prod dataset is public or not. By default, staging datasets like dataset_id_staging are not public.
- location: Location of dataset data. List of possible region names: BigQuery locations

Raises

- Warning: Dataset already exists and if_exists is set to raise.

def delete(mode: str = "all") -> None

Deletes the dataset in BigQuery. Toggle mode to choose which dataset to delete.
Args

- mode: Which dataset to delete [prod|staging|all].

def update(mode: str = "all", location: Optional[str] = None) -> None

Updates the dataset description. Toggle mode to choose which dataset to update.

Args

- mode: Which dataset to update [prod|staging|all].
- location: Location of dataset data. List of possible region names: BigQuery locations
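A sketch of a typical Dataset workflow, assuming the class is importable from the package top level and credentials are already configured; the dataset_id is an illustrative placeholder:

```python
from basedosdados import Dataset

# Illustrative dataset_id; hyphens would be replaced with underscores.
dataset = Dataset(dataset_id="br_exemplo_dados")

# Create both prod and staging datasets, doing nothing if they already exist.
dataset.create(mode="all", if_exists="pass")

# Refresh the description, or drop everything when no longer needed.
dataset.update(mode="all")
# dataset.delete(mode="all")
```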
Class for managing tables in Storage and BigQuery.

class Table(Base)

Manage tables in Google Cloud Storage and BigQuery.

def __init__(dataset_id: str, table_id: str, **kwargs)

Initializes a new instance of the class with the specified dataset and table identifiers.

Args

- dataset_id: The identifier of the dataset. Hyphens will be replaced with underscores.
- table_id: The identifier of the table. Hyphens will be replaced with underscores.
- **kwargs: Additional keyword arguments to be passed to the superclass initializer.

Attributes

- table_id: The sanitized table identifier (hyphens replaced with underscores).
- dataset_id: The sanitized dataset identifier (hyphens replaced with underscores).
- table_full_name: Dictionary containing fully qualified table names for different environments (e.g. prod and staging).

@property
@lru_cache(256)
def table_config() -> dict[str, Any]

Load table config.
def table_exists(mode: str) -> bool

Check if table exists in BigQuery.

Args

- mode: Which dataset to check [prod|staging].

def create(path: Optional[Union[str, Path]] = None, source_format: str = "csv", csv_delimiter: str = ",", csv_skip_leading_rows: int = 1, csv_allow_jagged_rows: bool = False, if_table_exists: str = "raise", if_storage_data_exists: str = "raise", if_dataset_exists: str = "pass", dataset_is_public: bool = True, location: Optional[str] = None, chunk_size: Optional[int] = None, biglake_table: bool = False, set_biglake_connection_permissions: bool = True) -> None

Creates a BigQuery table in the staging dataset.
If a path is provided, data is automatically saved in storage, and a datasets folder and BigQuery location are created, in addition to creating the table and its configuration files.

The new table is located at <dataset_id>_staging.<table_id> in BigQuery.

Data can be found in Storage at <bucket_name>/staging/<dataset_id>/<table_id>/* and is used to build the table.

The following data types are supported: csv, avro and parquet (see source_format below).

Data can also be partitioned following the Hive partitioning scheme <key1>=<value1>/<key2>=<value2>. For example, year=2012/country=BR. The partition is automatically detected by searching for partitions in the table_config.yaml file.
Args

- path: The path to the file to be uploaded to create the table.
- source_format: The format of the data source. Only 'csv', 'avro', and 'parquet' are supported. Defaults to 'csv'.
- csv_delimiter: The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. Defaults to ','.
- csv_skip_leading_rows: The number of rows at the top of a CSV file that BigQuery will skip when loading the data. Defaults to 1.
- csv_allow_jagged_rows: Indicates if BigQuery should allow extra values that are not represented in the table schema. Defaults to False.
- if_table_exists: Determines what to do if the table already exists:
  - raise: Raises a Conflict exception
  - replace: Replaces the table
  - pass: Does nothing
- if_storage_data_exists: Determines what to do if the data already exists on your bucket:
  - raise: Raises a Conflict exception
  - replace: Replaces the table
  - pass: Does nothing
- if_dataset_exists: Determines what to do if the dataset already exists:
  - raise: Raises a Conflict exception
  - replace: Replaces the dataset
  - pass: Does nothing
- dataset_is_public: Controls if the prod dataset is public or not. By default, staging datasets like dataset_id_staging are not public.
- location: The location of the dataset data. List of possible region names: BigQuery locations
- chunk_size: The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification. If not specified, the chunk_size of the blob itself is used. If that is not specified, a default value of 40 MB is used.
- biglake_table: Sets this as a BigLake table. BigLake tables allow end users to query from external data (such as GCS) even if they don't have access to the source data. IAM is managed like any other BigQuery native table. See BigLake intro for more on BigLake.
- set_biglake_connection_permissions: If set to True, attempts to grant the BigLake connection service account access to the table's data in GCS.
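A sketch of a staging-table creation flow, assuming Table is importable from the package top level; the ids and the local CSV path are illustrative placeholders:

```python
from basedosdados import Table

# Illustrative ids and path; the CSV is expected to match the intended schema.
table = Table(dataset_id="br_exemplo_dados", table_id="municipios")

table.create(
    path="data/municipios.csv",        # a single file or a Hive-partitioned folder
    source_format="csv",
    if_table_exists="replace",
    if_storage_data_exists="replace",
    if_dataset_exists="pass",
)
```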
def update(mode: str = "prod", custom_schema: Optional[list[dict[str, str]]] = None) -> None

Updates the BigQuery schema and description.

Args

- mode: Which table to update [prod].
- not_found_ok: What to do if the table is not found.

def publish(if_exists: str = "raise", custom_publish_sql: Optional[str] = None, custom_schema: Optional[list[dict[str, str]]] = None) -> None

Creates the BigQuery table in the production dataset.
The table should be located at <dataset_id>.<table_id>.

It creates a view that uses the query from <metadata_path>/<dataset_id>/<table_id>/publish.sql.

Make sure that all columns from the query also exist in <metadata_path>/<dataset_id>/<table_id>/table_config.yaml, including the partitions.
Args

- if_exists: What to do if the table exists:
  - raise: Raises Conflict exception
  - replace: Replace table
  - pass: Do nothing

def delete(mode: str = "all") -> None

Deletes the table in BigQuery.
Args

- mode: Which table to delete [prod|staging].

def append(filepath: Union[str, Path], partitions: Optional[Union[str, dict[str, str]]] = None, if_exists: str = "replace", chunk_size: Optional[int] = None, **upload_args) -> None

Appends new data to an existing BigQuery table.
As long as the data has the same schema, it appends the data in the filepath to the existing table.

Args

- filepath: Where to find the file that you want to upload to create a table with.
- partitions: Hive structured partition as a string or dict.
  - <key>=<value>/<key2>=<value2>
  - dict(key=value, key2=value2)
- if_exists: What to do if data with the same name exists in storage:
  - raise: Raises Conflict exception
  - replace: Replace table
  - pass: Do nothing
- chunk_size: The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification. If not specified, the chunk_size of the blob itself is used. If that is not specified, a default value of 40 MB is used.
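A sketch of appending one new partition's file to an existing table; the ids, path and partition value are illustrative placeholders:

```python
from basedosdados import Table

# The appended file must share the existing table schema.
table = Table(dataset_id="br_exemplo_dados", table_id="municipios")

table.append(
    filepath="data/municipios_2023.csv",
    partitions="year=2023",            # or dict(year=2023)
    if_exists="replace",
)
```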
Class for managing the files in Google Cloud Storage.

class Storage(Base)

Manage files on Google Cloud Storage.

def __init__(dataset_id: str, table_id: str, **kwargs)

Initializes the storage upload class with the specified dataset and table identifiers.

Args

- dataset_id: The identifier of the dataset. Hyphens will be replaced with underscores.
- table_id: The identifier of the table. Hyphens will be replaced with underscores.
- **kwargs: Additional keyword arguments to pass to the superclass initializer.

Attributes

- bucket: The storage bucket object used for staging.
- dataset_id: The normalized dataset identifier.
- table_id: The normalized table identifier.

def init(replace: bool = False, very_sure: bool = False) -> None

Initialize bucket and folders.
The folder structure is:

- raw: contains raw data
- staging: preprocessed data ready to upload to BigQuery

Args

- replace: Whether to replace if the bucket already exists.
- very_sure: Are you aware that everything will be erased if you replace the bucket?

Raises

- Warning: very_sure argument is still False.

def upload(path: Union[str, Path], mode: str = "all", partitions: Optional[Union[str, dict[str, str]]] = None, if_exists: str = "raise", chunk_size: Optional[int] = None, **upload_args) -> None

Upload to storage at <bucket_name>/<mode>/<dataset_id>/<table_id>.
You can use:

- path = <file_path> to upload a single file.
- path = <folder_path>. The folder should only contain files, not folders.
- path = <folder_path>. This folder must follow the Hive partitioning scheme, e.g. <table_id>/<key>=<value>/<key2>=<value2>/<partition>.csv

Remember that all files must follow a single schema. Otherwise, things might fail in the future.

Modes:

- raw: raw files from the data source
- staging: pre-treated files ready to upload to BigQuery
- header: header of the tables
- auxiliary_files: auxiliary files from each table
- architecture: architecture sheet of the tables
- all: if no treatment is needed, use all.

Args

- path: Where to find the file or folder to upload to storage.
- mode: Folder of which dataset to update [raw|staging|header|auxiliary_files|architecture|all]
- partitions: If adding a single file, use this to add it to a specific partition. Can be a string or dict.
- if_exists: What to do if data exists:
  - raise: Raises Conflict exception
  - replace: Replace table
  - pass: Do nothing
- chunk_size: The size of a chunk of data when iterating (in bytes).
- upload_args: Extra arguments accepted by google.cloud.storage.blob.Blob.upload_from_file
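A sketch of uploading a local file to the staging folder, assuming Storage is importable from the package top level; ids and path are illustrative placeholders:

```python
from basedosdados import Storage

# Uploads to <bucket_name>/staging/<dataset_id>/<table_id>/.
storage = Storage(dataset_id="br_exemplo_dados", table_id="municipios")

storage.upload(
    path="data/municipios.csv",
    mode="staging",
    if_exists="replace",
)
```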
def download(filename: str = "*", savepath: Union[Path, str] = Path("."), partitions: Optional[Union[str, dict[str, str]]] = None, mode: str = "staging", if_not_exists: str = "raise") -> None

Download files from Google Storage from the path mode/dataset_id/table_id/partitions/filename and replicate the folder hierarchy on save.

Modes:

- raw: raw files from the data source
- staging: pre-treated files ready to upload to BigQuery
- header: header of the tables
- auxiliary_files: auxiliary files from each table
- architecture: architecture sheet of the tables

You can use the partitions argument to choose files from a partition.

Args

- filename: Specify which file to download. If "*", downloads all files within the bucket folder. Defaults to "*".
- savepath: Where to save the data on your computer. Must be a path to a directory.
- partitions: If downloading a single file, use this to specify the partition path from which to download. Can be a string <key>=<value>/<key2>=<value2> or dict dict(key=value, key2=value2).
- mode: Folder of which dataset to update. [raw|staging|header|auxiliary_files|architecture]
- if_not_exists: What to do if data is not found:
  - raise: Raises FileNotFoundError.
  - pass: Do nothing and exit the function.

Raises

- FileNotFoundError: If the given path <mode>/<dataset_id>/<table_id>/<partitions>/<filename> could not be found or there are no files to download.
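A sketch of downloading every staging file for a table into a local directory; ids and savepath are illustrative placeholders:

```python
from basedosdados import Storage

# Replicates the bucket folder hierarchy under ./dump.
storage = Storage(dataset_id="br_exemplo_dados", table_id="municipios")

storage.download(
    filename="*",
    savepath="./dump",
    mode="staging",
    if_not_exists="pass",
)
```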
def delete_file(filename: str, mode: str, partitions: Optional[Union[str, dict[str, str]]] = None, not_found_ok: bool = False) -> None

Delete a file from the path <bucket_name>/<mode>/<dataset_id>/<table_id>/<partitions>/<filename>.

Args

- filename: Name of the file to be deleted.
- mode: Folder of which dataset to update [raw|staging|header|auxiliary_files|architecture|all]
- partitions: Hive structured partition as a string <key>=<value>/<key2>=<value2> or dict dict(key=value, key2=value2).
- not_found_ok: What to do if the file is not found.

def delete_table(mode: str = "staging", bucket_name: Optional[str] = None, not_found_ok: bool = False) -> None

Deletes a table from storage; sends the request in batches.
Args

- mode: Folder of which dataset to update [raw|staging|header|auxiliary_files|architecture]
- bucket_name: The bucket name from which to delete the table. If None, defaults to the bucket initialized when instantiating the Storage object.
- not_found_ok: What to do if the table is not found.

Raises

- FileNotFoundError: If the requested table could not be found.

def copy_table(source_bucket_name: str = "basedosdados", destination_bucket_name: Optional[str] = None, mode: str = "staging", new_table_id: Optional[str] = None) -> None

Copies a table from a source bucket to your bucket; sends the request in batches.
Args

- source_bucket_name: The bucket name from which to copy data. You can change it to copy from another external bucket.
- destination_bucket_name: The bucket name where data will be copied to. If None, defaults to the bucket initialized when instantiating the Storage object. You can check it with the Storage().bucket property.
- mode: Folder of which dataset to update [raw|staging|header|auxiliary_files|architecture]
- new_table_id: New table id to be copied to. If None, defaults to the table id initialized when instantiating the Storage object.