pandas.read_html

pandas.read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, extract_links=None, dtype_backend=<no_default>, storage_options=None)

Read HTML tables into a list of DataFrame objects.

Parameters:

io : str, path object, or file-like object

String, path object (implementing os.PathLike[str]), or file-like object implementing a string read() function. The string can represent a URL or the HTML itself. Note that lxml only accepts the http, ftp and file URL protocols. If you have a URL that starts with 'https' you might try removing the 's'.

Deprecated since version 2.1.0: Passing HTML literal strings is deprecated. Wrap literal string/bytes input in io.StringIO/io.BytesIO instead.
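As a minimal sketch of the recommended wrapping, the example below feeds a literal HTML string through io.StringIO. The HTML snippet and its column names are made up for illustration, and an HTML parser (lxml, or bs4 with html5lib) is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML snippet, used only for illustration.
html = """
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""

# Wrapping the literal in StringIO makes it a file-like object,
# so it is not interpreted as a path or URL.
tables = pd.read_html(StringIO(html))

# read_html always returns a list, even when only one table is found.
df = tables[0]
print(df)
```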

match : str or compiled regular expression, optional

The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple, you will probably need to pass a non-empty string here. Defaults to ‘.+’ (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.
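The filtering described for match can be sketched as follows. The two tables and the pattern "revenue" are hypothetical; only the table whose text matches the pattern should come back.

```python
from io import StringIO

import pandas as pd

# Two hypothetical tables; only the second contains the word "revenue".
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
</table>
<table>
  <tr><th>product</th><th>revenue</th></tr>
  <tr><td>widget</td><td>1200</td></tr>
</table>
"""

# The default pattern '.+' keeps every table that has any text.
all_tables = pd.read_html(StringIO(html))

# A narrower pattern keeps only tables containing matching text.
revenue_tables = pd.read_html(StringIO(html), match="revenue")
print(len(all_tables), len(revenue_tables))
```

If no table matches the pattern, read_html raises a ValueError rather than returning an empty list.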

flavor : {“lxml”, “html5lib”, “bs4”} or list-like, optional

The parsing engine (or list of parsing engines) to use. ‘bs4’ and ‘html5lib’ are synonymous with each other; both are there for backwards compatibility. The default of None tries to use lxml to parse and, if that fails, falls back on bs4 + html5lib.

header : int or list-like, optional

The row (or list of rows for a MultiIndex) to use to make the column headers.

index_col : int or list-like, optional

The column (or list of columns) to use to create the index.

skiprows : int, list-like or slice, optional

Number of rows to skip after parsing the column integer. 0-based. If a sequence of integers or a slice is given, will skip the rows indexed by that sequence. Note that a single element sequence means ‘skip the nth row’ whereas an integer means ‘skip n rows’.
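The difference between an integer and a single-element sequence can be illustrated with a small sketch. The table below is hypothetical and has no header row, so the columns are labelled 0 and 1.

```python
from io import StringIO

import pandas as pd

# Hypothetical header-less table with four data rows.
html = """
<table>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
  <tr><td>c</td><td>3</td></tr>
  <tr><td>d</td><td>4</td></tr>
</table>
"""

# An integer skips that many rows from the top (rows 0 and 1 here).
skip_two = pd.read_html(StringIO(html), skiprows=2)[0]

# A one-element list skips only the row at that 0-based position (row 2).
skip_third = pd.read_html(StringIO(html), skiprows=[2])[0]

print(skip_two[0].tolist())
print(skip_third[0].tolist())
```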

attrs : dict, optional

This is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly. For example,

attrs={'id':'table'}

is a valid attribute dictionary because the ‘id’ HTML tag attribute is a valid HTML attribute for any HTML tag as per this document.

attrs={'asdf':'table'}

is not a valid attribute dictionary because ‘asdf’ is not a valid HTML attribute even if it is a valid XML attribute. Valid HTML 4.01 table attributes can be found here. A working draft of the HTML 5 spec can be found here. It contains the latest information on table attributes for the modern web.
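A short sketch of selecting a table by attribute follows. The id value "prices" and the second, unrelated table are assumptions made for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the first has id="prices".
html = """
<table id="prices">
  <tr><th>item</th><th>price</th></tr>
  <tr><td>tea</td><td>4</td></tr>
</table>
<table class="footer-nav">
  <tr><td>about</td><td>contact</td></tr>
</table>
"""

# Only tables whose attributes match the dictionary are returned.
tables = pd.read_html(StringIO(html), attrs={"id": "prices"})
prices = tables[0]
print(prices)
```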

parse_dates : bool, optional

See read_csv() for more details.

thousands : str, optional

Separator to use to parse thousands. Defaults to ','.

encoding : str, optional

The encoding used to decode the web page. Defaults to None. None preserves the previous encoding behavior, which depends on the underlying parser library (e.g., the parser library will try to use the encoding provided by the document).

decimal : str, default ‘.’

Character to recognize as decimal point (e.g. use ‘,’ for European data).
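For European-style numbers, thousands and decimal are typically set together. The sketch below assumes a hypothetical table that uses '.' as the thousands separator and ',' as the decimal point.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with European number formatting.
html = """
<table>
  <tr><th>region</th><th>amount</th></tr>
  <tr><td>Nord</td><td>1.234,56</td></tr>
  <tr><td>Sud</td><td>987,10</td></tr>
</table>
"""

# Swap the separators so "1.234,56" is read as the float 1234.56.
df = pd.read_html(StringIO(html), thousands=".", decimal=",")[0]
print(df["amount"].tolist())
```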

converters : dict, default None

Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the cell (not column) content, and return the transformed content.
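One common use of converters is keeping identifier-like cells as text. In this sketch the column name "code" and its values are hypothetical; passing str as the converter keeps the leading zeros that numeric inference would otherwise drop.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose "code" column has leading zeros.
html = """
<table>
  <tr><th>code</th><th>qty</th></tr>
  <tr><td>007</td><td>1</td></tr>
  <tr><td>042</td><td>2</td></tr>
</table>
"""

# Without a converter the codes are inferred as integers.
inferred = pd.read_html(StringIO(html))[0]

# The converter receives each cell's content and returns the value to store.
kept = pd.read_html(StringIO(html), converters={"code": str})[0]

print(inferred["code"].tolist())
print(kept["code"].tolist())
```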

na_values : iterable, default None

Custom NA values.

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
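The interaction between na_values and keep_default_na can be sketched with a table that contains both a default marker ("NA") and a custom one. The custom marker "missing" is an assumption chosen for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: "NA" is a default NaN marker, "missing" is custom.
html = """
<table>
  <tr><th>id</th><th>status</th></tr>
  <tr><td>a</td><td>ok</td></tr>
  <tr><td>b</td><td>NA</td></tr>
  <tr><td>c</td><td>missing</td></tr>
</table>
"""

# Custom values are appended to the defaults: both markers become NaN.
appended = pd.read_html(StringIO(html), na_values=["missing"])[0]

# With keep_default_na=False only the custom value is treated as NaN,
# so the literal string "NA" survives.
replaced = pd.read_html(
    StringIO(html), na_values=["missing"], keep_default_na=False
)[0]

print(appended["status"].isna().sum(), replaced["status"].isna().sum())
```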

displayed_only : bool, default True

Whether elements with “display: none” should be parsed.

extract_links : {None, “all”, “header”, “body”, “footer”}

Table elements in the specified section(s) with <a> tags will have their href extracted.

Added in version 1.5.0.
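A brief sketch of extract_links="body" follows, assuming pandas 1.5.0 or later. The table contents are hypothetical. Every cell in the selected section becomes a (text, href) tuple, with None as the href when the cell has no link.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with a link in the first body cell only.
html = """
<table>
  <tr><th>site</th><th>rank</th></tr>
  <tr>
    <td><a href="https://pandas.pydata.org">pandas</a></td>
    <td>1</td>
  </tr>
</table>
"""

df = pd.read_html(StringIO(html), extract_links="body")[0]

# Linked cell: (text, href). Unlinked cell: (text, None).
linked_cell = df.loc[0, "site"]
plain_cell = df.loc[0, "rank"]
print(linked_cell, plain_cell)
```

Because the cells are tuples, numeric columns in the selected section are not converted to numbers and need to be unpacked afterwards.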

dtype_backend : {‘numpy_nullable’, ‘pyarrow’}, default ‘numpy_nullable’

Back-end data type applied to the resultant DataFrame (still experimental). Behaviour is as follows:

"numpy_nullable": returns nullable-dtype-backed DataFrame (default).
"pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.

Added in version 2.0.
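The effect of requesting nullable dtypes can be sketched as below, assuming pandas 2.0 or later. The table is hypothetical, and the empty cell is there to show that an integer column can hold a missing value without being converted to float.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with one empty quantity cell.
html = """
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td></td></tr>
</table>
"""

# Explicitly request nullable NumPy-backed dtypes.
df = pd.read_html(StringIO(html), dtype_backend="numpy_nullable")[0]

# The integer column keeps an integer dtype and marks the gap as <NA>.
print(df.dtypes)
```

The "pyarrow" option is used the same way but additionally requires the pyarrow package to be installed.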

storage_options : dict, optional

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

Added in version 2.1.0.

Returns:

dfs: A list of DataFrames.
