- Notifications
You must be signed in to change notification settings - Fork63
feat: Refactor DataFrame.to_pandas_batches and update TableWidget#2250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered byReviewNB |
441bb94 to33fd6e3Comparebigframes/display/html.py Outdated
| def_calculate_rowspans(dataframe:pd.DataFrame)->list[list[int]]: | ||
| """Calculates the rowspan for each cell in a MultiIndex DataFrame. | ||
| Args: | ||
| dataframe (pd.DataFrame): | ||
| The DataFrame for which to calculate index rowspans. | ||
| Returns: | ||
| list[list[int]]: | ||
| A list of lists, where each inner list corresponds to an index level | ||
| and contains the rowspan for each row at that level. A value of 0 | ||
| indicates that the cell should not be rendered (it's covered by a | ||
| previous rowspan). | ||
| """ | ||
| ifnotisinstance(dataframe.index,pd.MultiIndex): | ||
| # If not a MultiIndex, no rowspans are needed for the index itself. | ||
| # Return a structure that indicates each index cell should be rendered once. | ||
| return [[1]*len(dataframe.index)]ifdataframe.index.nlevels>0else [] | ||
| rowspans:list[list[int]]= [] | ||
| forlevel_idxinrange(dataframe.index.nlevels): | ||
| current_level_spans:list[int]= [] | ||
| current_value=None | ||
| current_span=0 | ||
| foriinrange(len(dataframe.index)): | ||
| value=dataframe.index.get_level_values(level_idx)[i] | ||
| ifvalue==current_value: | ||
| current_span+=1 | ||
| current_level_spans.append(0)# Mark as covered by previous rowspan | ||
| else: | ||
| # If new value, finalize previous span and start a new one | ||
| ifcurrent_span>0: | ||
| # Update the rowspan for the start of the previous span | ||
| current_level_spans[i-current_span]=current_span | ||
| current_value=value | ||
| current_span=1 | ||
| current_level_spans.append(0)# Placeholder, will be updated later | ||
| # Finalize the last span | ||
| ifcurrent_span>0: | ||
| current_level_spans[len(dataframe.index)-current_span]=current_span | ||
| rowspans.append(current_level_spans) | ||
| returnrowspans | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
While this makes for a somewhat cleaner view, I think we'd be better off not implementing rowpans. For example sorting by the various index columns is less intuitive if the index isn't rendered the same as other columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
A new implementation is used without rowspans.
bigframes/display/html.py Outdated
| table_html.append(' <tr style="text-align: left;">') | ||
| # Add index headers | ||
| fornameindataframe.index.names: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Let's try to actually refer to the original BigFrames DataFrame when deciding which index columns to render or not. In particular, we should not render the index if the BigFrame DataFrame has a NULL index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
My new version correctly refers to the original BigFrames DataFrame's _block.has_index property. If has_index is False (indicating a "NULL index"), the _prepare_dataframe_for_display method ensures that a default "Row" column is inserted for display instead of attempting to render a non-existent index. My recent refactoring preserved this established logic.
bigframes/display/anywidget.py Outdated
| orderable_columns=traitlets.List(traitlets.Unicode(), []).tag(sync=True) | ||
| _initial_load_complete=traitlets.Bool(False).tag(sync=True) | ||
| _batches:Optional[blocks.PandasBatches]=None | ||
| _batches:blocks.PandasBatches|None=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This syntax was added in Python 3.10.https://peps.python.org/pep-0604/
Currently, we are supporting Python 3.9
python-bigquery-dataframes/setup.py
Line 155 in3c54c68
| python_requires=">=3.9", |
Please wait to use this syntax until we drop Python 3.9. I have filed b/470438395 to track this.
bigframes/display/anywidget.py Outdated
| ifhasattr(self,"_setting_html"): | ||
| return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'd prefer we initialize this to false ininit instead of relying on hasattr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Also, consider multi threading. We probably need a lock to fully prevent two table htmls being set at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Here is what I change to: (1) self._setting_html = False and self._setting_html_lock = threading.Lock() were added to the TableWidget.init method
(2) The _set_table_html method was refactored to use with self._setting_html_lock: and the self._setting_html flag to prevent race conditions during HTML rendering. The flag is reset in the finally block to ensure it's always released.
| # TODO(b/463715504): Support sorting by index columns. | ||
| df_to_display=df_to_display.sort_values( | ||
| by=self.sort_column,ascending=self.sort_ascending | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would prefer a context manager instead of a try/except/finally block. That way we can be more explicit about what the intention is. Are you sure you want to competely skip setting the html if another thread tries to set it? If not, just do a with block with a lock here. If so, please explain why it's safe to skip.
Aside: I would actually expect the opposite here, where later set calls would interrupt any set call that is currently running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
A threading.Lock was used as a context manager (with self._setting_html_lock:) within _set_table_html. The logic now explicitly uses the _setting_html flag within this locked context to control whether HTML rendering proceeds or is skipped, ensuring thread safety and clear intent.
| def_get_page_data(self)->pd.DataFrame: | ||
| """Get the data for the current page, handling unknown row count.""" | ||
| # This loop is to handle auto-correction of page number when row count is unknown | ||
| whileTrue: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This looks like it's intended for retries, but this is a very unsafe way to do that, if is the intention.
Also, I don't understand how this change relates to adding columns for mutiindex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
No, it is not related to multiindex display. I mix so many features into the same pr. I have refactor them out. And move this one to a separate new branch.
a7783d7 to15993a3Compare
Uh oh!
There was an error while loading.Please reload this page.
This branch introduces significant enhancements to the interactive table widget, particularly for displaying complex data structures like nested STRUCTs and ARRAYs, which implicitly improves the visual handling of multi-index DataFrames.
html.py)anywidget.py,table_widget.js,table_widget.css)TODO(b/469861913): Nested columns from structs (e.g., 'struct_col.name') are not currently sortable
See an example here: screen/48TZkxRbmuWNYBB
Fixes #<459515995> 🦕