Add support for handling timedelta #19236
Conversation
Thanks @theOehrly, this looks like a great amount of work. I've taken a quick look, and this looks like a good proof-of-concept. Overall I am concerned about feature drift if we have two ways of dealing with dates and times, datetime and timedelta. There is a lot of duplicated code, and that usually means a re-architecture could be considered for better maintainability. Finding ticks for datetime and timedelta is the same, at least for the first month, so did you consider simply making timedeltas be datetime (i.e. dt = epoch + timedelta) under the hood (i.e. date2num would just take timedeltas as well as datetimes), and then changing the formatters, and perhaps the locator defaults? For instance, it looks to me like most of the new locators and formatters could be just specialized versions of the old ones.
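The epoch-offset idea above can be sketched roughly like this (a minimal illustration only, not Matplotlib's actual `date2num` code; the helper name `timedelta2num` and the constants are assumptions):

```python
import datetime
import numpy as np

# Sketch of the suggested approach: treat a timedelta as a datetime by
# adding it to the epoch, so the usual "float days since epoch" conversion
# works unchanged.
EPOCH = np.datetime64('1970-01-01T00:00:00', 'us')
MICROSECONDS_PER_DAY = 24 * 60 * 60 * 1_000_000


def timedelta2num(td):
    """Convert a datetime.timedelta or np.timedelta64 to float days."""
    td = np.asarray(td, dtype='timedelta64[us]')
    dt = EPOCH + td  # pretend the timedelta is a datetime
    return (dt - EPOCH) / np.timedelta64(MICROSECONDS_PER_DAY, 'us')


print(timedelta2num(datetime.timedelta(days=1, hours=12)))  # → 1.5
```

With this trick the existing datetime locators could in principle be reused, and only the formatters (and locator defaults) would need to know that the values represent durations rather than calendar dates.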
I had basically tried to keep date and timedelta separate. But yes, it turned out to have a lot of duplicate code, which I hadn't quite expected to this extent. In hindsight it's not that much of a surprise really.
Sure - well, keep the old version as it looks like it mostly works! Is there anything API-like we should discuss? I'll try and bring this up at a developer call, probably next week... As a selling point, I'd suggest turning a few of your tests into graphical examples and at least adding them to the start of the PR. It helps folks know exactly what you have done and see why this is a useful addition.
I don't think I have any API-related questions currently but will ask if anything comes up. Keep the old version? How do you mean that? Should I, for now, just add additional classes based on the date locators/formatters, so that there are two implementations for now? I will create some graphical examples. That's certainly necessary.
... I just meant save a branch or make sure it's really clear in the commit tree.
Ok, sure.
Let's get the date2num settled first. I'm not 100% sure I understand the other points. It could indeed be the case that we need special locators and formatters, but hopefully not ;-)
Thanks...
lib/matplotlib/dates.py Outdated
```python
if np.issubdtype(d.dtype, np.timedelta64):
    # numpy converts this to datetime but not the way we want
    d = d.astype('timedelta64[us]')
elif not np.issubdtype(d.dtype, np.datetime64):
```
I don't think you need this - you end up doing it below:
```diff
-if np.issubdtype(d.dtype, np.timedelta64):
-    # numpy converts this to datetime but not the way we want
-    d = d.astype('timedelta64[us]')
-elif not np.issubdtype(d.dtype, np.datetime64):
+if not np.issubdtype(d.dtype, np.datetime64):
```
The problem is the following:
```pycon
>>> d = np.timedelta64(1, 's')
>>> d.astype('datetime64[us]')
numpy.datetime64('1970-01-01T00:00:00.000001')
```
This conversion should not accidentally happen. So np.timedelta64 objects should be kept out of that block that deals with converting the non-native numpy objects. This is probably better though:
```diff
-if np.issubdtype(d.dtype, np.timedelta64):
-    # numpy converts this to datetime but not the way we want
-    d = d.astype('timedelta64[us]')
-elif not np.issubdtype(d.dtype, np.datetime64):
+if not (np.issubdtype(d.dtype, np.datetime64) or
+        np.issubdtype(d.dtype, np.timedelta64)):
```
This then requires a minor change when converting timedelta to datetime below.
lib/matplotlib/dates.py Outdated
```python
    # numpy converts this to datetime but not the way we want
    d = d.astype('timedelta64[us]')
elif not np.issubdtype(d.dtype, np.datetime64):
    # handle everything that is not np.timedelta64
```
```diff
-# handle everything that is not np.timedelta64
+# handle everything that is not np.datetime64
```
```diff
-# handle everything that is not np.timedelta64
+# handle everything that is not np.datetime64/np.timedelta64
```
lib/matplotlib/dates.py Outdated
```python
# try conversion to datetime or timedelta
for _dtype in ('datetime64[us]', 'timedelta64[us]'):
    try:
        d = d.astype(_dtype)
```
I don't really understand this. It doesn't fail for native numpy. Are you trying to say it fails for pandas? Is that just a pandas bug? Can you provide an example of a pandas NaT series that fails? Because this works fine with pure numpy:
```python
d = np.arange(0, 24, dtype='timedelta64[s]')
d[20] = np.timedelta64('NaT')
d.astype('timedelta64[us]')
```
If this is really just a pandas bug, I'd kind of prefer they fix it rather than us making more special cases...
All of the following work:
```python
np.array((datetime.datetime(2021, 1, 10),)).astype('datetime64[us]')
np.array((datetime.timedelta(days=1),)).astype('timedelta64[us]')
np.array((pandas.Timestamp('2017-01-01T12:13'),)).astype('datetime64[us]')
np.array((pandas.Timedelta(days=1),)).astype('timedelta64[us]')
```
And we could do `isinstance(d[0], datetime.datetime)`. But we can't do this for the two pandas datatypes, as we can't import pandas (#4000 (comment)). Therefore the try...except block: we can't check whether the data is Timestamp or Timedelta, so we try it.
This will work for all cases except if the data contains `pandas.NaT` or is completely unsupported. Only then is the else block used. There's no way to check whether an object is `pandas.NaT` without importing pandas, apparently (pandas-dev/pandas#9253 (comment)). Your example with `np.timedelta64` should never reach this part of the code.
Sorry, I just meant: can you provide me with an example that fails? I'm not familiar enough with pandas to know exactly what you are trying to do.
Overall I spent quite a bit of effort simplifying this code. If at all possible it would be nice to keep it simple.
Yes, sorry. I'm not sure I understand what you mean by an example that "fails". Do you mean data that cannot be converted with .astype? Then an example would be:
```python
import pandas as pd

# doesn't work
td = (pd.Timedelta(days=1), pd.Timedelta(days=2), pd.NaT)
date2num(td)

# doesn't work either
date = (pd.Timestamp(year=2020, month=1, day=10),
        pd.Timestamp(year=2020, month=1, day=11),
        pd.NaT)
date2num(date)
```
PS: sorry for the delay, I responded (or thought I did) and managed not to respond at the same time...
Thanks!
> PS: sorry for the delay, I responded (or thought I did) and managed not to respond at the same time...
Happens to me all the time - so many excellent email responses in the Drafts folder for months.
theOehrly commented Jan 14, 2021 (edited)
I'm pretty sure that two NaT's won't happen for pandas. I skimmed over the roadmap and a linked proposal for a new type system in pandas. It looks like the plan is to actually only have one "not-a-value" type for everything.
The value of pandas.NaT is actually defined as the minimum value of numpy's int64. This can be used to identify and convert it. I have tried it and it is arguably much cleaner and certainly more concise than the current converter function I proposed.
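The sentinel-based idea can be illustrated like this (a sketch only; `mask_nat_as_nan` is a hypothetical helper, not the converter from this PR):

```python
import numpy as np

# NaT is stored internally as the int64 minimum, so it can be identified
# without importing pandas by viewing the data as raw integers.
_NAT_SENTINEL = np.iinfo(np.int64).min


def mask_nat_as_nan(values):
    """Convert timedelta-like values to float microseconds, NaT -> NaN."""
    d = np.asarray(values, dtype='timedelta64[us]')
    raw = d.view('int64')
    out = raw.astype('float64')
    out[raw == _NAT_SENTINEL] = np.nan  # replace the sentinel with NaN
    return out


res = mask_nat_as_nan([np.timedelta64(1, 's'), np.timedelta64('NaT')])
# res[0] is 1_000_000.0 microseconds; res[1] is NaN
```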
What does numpy do? If it's the same, then that is totally fine. If it's different, I guess there is a tiny risk someone will someday have a very large negative timedelta64, but I agree that risk seems small.
As far as I can tell, numpy cannot (or does not) convert pandas' NaT.
Sure, but numpy has a timedelta64('NaT').
At least internally they're doing the same:

```c
/* The special not-a-time (NaT) value */
#define NPY_DATETIME_NAT NPY_MIN_INT64
```
I've tried to clean this up a bit. Now there's no trial and error involved anymore to figure out dtypes. The special conversion for pandas.NaT is more concise now. I think that np.iinfo('int64').min should always be the same? So comparing against that is a reliable way to do this, in my opinion.
I have two questions. First, is there a reason why conversion is always done to

Second, about the locators. If a minimum and a maximum number of ticks is specified and multiple intervals can give a number of ticks within that range, which interval should be chosen? The one that gives more ticks or the one that gives fewer ticks?
All the fussing in
I guess more? I think it's always better to err on the side of giving the user too many ticks that they can then weed out by adjustment, rather than too few, where they don't know what ticks are there. But I'm not 100% sure. Some of the number-of-ticks logic is a bit arbitrary.
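The "err on the side of more ticks" rule can be sketched as a toy selection function (assumed logic for illustration, not the PR's locator code):

```python
def pick_interval(span, candidates, min_ticks=3, max_ticks=9):
    """Pick a tick interval for a data span (same units as the candidates).

    Among candidates whose resulting tick count falls within
    [min_ticks, max_ticks], prefer the smallest interval, i.e. the
    one that yields the most ticks.
    """
    valid = [c for c in candidates
             if min_ticks <= span // c + 1 <= max_ticks]
    if valid:
        return min(valid)   # smallest interval -> most ticks
    return max(candidates)  # fallback: coarsest candidate


# e.g. a 24 hour span with hourly candidate intervals:
print(pick_interval(24, [1, 2, 3, 6, 12]))  # → 3 (gives 9 ticks)
```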
I do have another question. This time about the RRuleLocator, specifically about this part here
This needs to be done a bit differently for timedeltas to allow ticks on interval multiples. But I want to be sure to fully understand why this is done this way specifically for dates. Why is the delta added/subtracted? Why does the range need to be extended at all? Is it just convenient to use the delta, or is there some very specific reason for that?
I don't particularly know. It looks like it is just to set the clipping? Maybe the rule needs to be set wider than the

You might try simplifying this in a different PR (i.e. as a smaller change) and see if it breaks any of the tests.
I hadn't intended to change this for dates, but to have a subclass for timedeltas that overrides this (partly at least). But I'm not sure about that yet. Maybe a cleaner way can be found. I guess I will have to play around with this some more.
I came across two things that are potentially a bit of an issue. Both are caused by converting timedelta to datetime by adding the epoch.

First, this "conversion" means that with the default epoch the largest valid negative timedelta is much smaller than the largest valid positive timedelta. It's somewhat obvious, but I only thought of this now. Basically, the range of timedelta is constrained to the maximum datetime value, which is 9999 years. We set the zero point at 1970 with the default epoch. Therefore the range of valid timedeltas is -1970 years to 8029 years. This is not very intuitive for a user, I think.

The second problem concerns the locators and conversion to floating point values. The existing locators for microseconds, seconds, minutes and hours work fine with timedeltas, but only as long as the epoch is set to a full day. I thought about having the epoch that is used as an offset not be the epoch that is set by the user. Instead, maybe use a fixed offset of "5000-01-01" to fix both these problems. I haven't spent much time trying this yet though. So far, it didn't work well. Any thoughts on these two things?
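The asymmetry described above can be shown directly with the standard library (an illustration of the constraint, not code from the PR):

```python
import datetime

# With the default 1970 epoch, a timedelta must map into datetime's
# supported range (year 1 through year 9999) after being added to the epoch.
epoch = datetime.datetime(1970, 1, 1)
max_positive = datetime.datetime.max - epoch  # roughly +8030 years
max_negative = datetime.datetime.min - epoch  # roughly -1969 years

print(max_positive.days, max_negative.days)
```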
I have tried to work around datetime's issues a couple of times, but I don't think it's worth the effort. As @efiring will point out, python dates are not particularly reliable, missing leap seconds and other esoterica over long timescales. I don't think timedeltas on the scale of hundreds of years are a particularly useful thing to support, except as best we can via

The situation before changing the epoch to 1970, which is quite recent, was that we only allowed positive years, and folks have managed. If this were to be truly improved, I think someone would need to write a numpy version of rrule that allowed a much bigger date range.
I certainly prefer it if I don't need to work around limitations set by python's datetime. I also don't think that supporting ranges of millions of days is very important, but I wanted to know your opinion on this. As for not setting the epoch with minutes or seconds: do you think this can be fixed by simply adding a comment in the documentation for set_epoch? Or should the timedelta locators just always work, even if somebody were to set the internal epoch to something like "2020-05-18T21:15:05.125"? I see no reason why that is necessary and why anyone would do that.
Sorry, I'm not 100% following here - are strange epochs hard to handle somehow?
It's not about weird epochs but about having no zero point (literally). I'll try to explain it with an example.
The data value is what the user wants to have plotted. The values for hours, minutes, ... differ between data value and internal value if the epoch is 'weird'. You can still get ticks every, say, 15 minutes just fine. Locating on days works fine because days need special treatment and cannot use rrule byranges anyway. Part of the problem is that reusing the date functionality means that a function is not aware whether it is working with date or timedelta values. The best idea I had so far is to only allow modification of year, month and day of the epoch. Some other ideas I had (none of which I consider good):
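The mismatch can be made concrete with a small example (hypothetical values; this just illustrates the problem, it is not code from the PR):

```python
import datetime

# A "weird" epoch with non-zero time-of-day fields:
weird_epoch = datetime.datetime(2020, 5, 18, 21, 15, 5, 125000)

data_value = datetime.timedelta(hours=2)  # what the user wants: 2:00:00
internal = weird_epoch + data_value       # what rrule-based locating sees

# The internal datetime's minute/second fields no longer match the data
# value, so locating e.g. "on the full hour" would place ticks at odd
# offsets relative to the user's timedelta values.
print(internal.minute, internal.second)   # 15 5, not 0 0
```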
@theOehrly sorry to let this languish. I was just thinking about timedelta today and realized this was still here. It seems you got hung up on strange epochs. I hope no one is planning to set the epoch to anything other than Jan 1 XXXX T00:00:00.0000. If they do, then they are welcome to whatever breakage they get. If your point is that maybe we shouldn't allow arbitrary epoch values, that is probably true, but it is a little late for that.
@jklymak Yes, this was just a major unknown here for me. If strange epochs need to be supported, then the whole approach as it currently stands is probably not feasible. Everything is still a bit convoluted, though. It's not immediately obvious that timedelta is just treated like a date internally.
Well, certainly you could go back to the approach of having a whole new
@jklymak The problem is that I also agree with your original concern that the completely separate approach introduces too much duplicate code. I'm not fully convinced by either version, to be honest. But currently I don't really have enough time to do some useful work on this. I don't really understand what you mean by defining what should happen with longer timedeltas. How they are represented? As days, weeks, years? I think days is the largest possible unit that makes sense. Weeks could maybe be used too, but in my opinion that's less intuitive. Everything longer than that doesn't have a fixed length anymore and would therefore be ambiguous.
Hi @theOehrly, just pinging again about this. If you feel swamped, perhaps I could take it over (keeping you as a co-author)? I think we should add something like this, but I'd have to play with it some, as you have. However, it's not urgent, so if you plan to take another pass at it, that would be fine as well.
@jklymak Hi, I actually haven't forgotten about this. I'm currently finishing my bachelor's thesis and should have substantially more time again from early February on. I was thinking about taking another look at this, among other things. We should probably define a bit more clearly what we expect this to achieve, what should be supported and what not. And then I can have a fresh look at it. You'd probably prefer this to be split into smaller chunks, too, instead of one large PR. That might be somewhat tricky. But I guess things like support for pandas NaT without importing pandas could be separated out, at least.
Great! Just wanted to make sure it was on your radar. I agree that making datetime.timedelta and numpy.timedelta64 work should be the priority. However, I don't think the length here is too onerous; it's just the very similar code that seems to say we need a refactor somewhere.
Hi @jklymak, I've just started to look into this again. You said this some time ago:
Thinking a bit about it, I suppose that the largest "unit" that we can support for

If someone wants

If we can only support

Did you have any more ideas or think of anything that we should consider here?
@jklymak, just pinging you again as a quick reminder that this is still here. I think we "just" need to determine the proper way to implement this. Most importantly for that, we need to agree on what this should be able to achieve. When we have that figured out, I could also try to do a bit of a concept specification where the constraints and expectations are made clearer, if you think that'd be helpful (I don't know how I'd do that, but we could figure that out too).
Sorry this has languished. I recently wanted this facility, and wished that we had it. @theOehrly are you happy to review this version, or should we start over? If I were starting over, I may just suggest we convert to float, where the float is in the units of the timedelta64, and then just plot as normal floats. But I can also see the appeal of having days/hours/minutes...
theOehrly commented Sep 27, 2023 (edited)
@jklymak I can review, update or rewrite this PR if you'd like to have this feature implemented in this way. I should be able to get this done sometime in October, if you want. I guess the question is whether you want days/hours/minutes/... or just the float version that you mentioned. I suppose there are a few things to consider here:
One note about this, from my (limited) experience with people who only use Python + Matplotlib as a tool in the scientific community and industry but are not very software-savvy: they are often very much unaware that there is a difference between

I already created a module that patches full timedelta support into Matplotlib two years ago (see here: Timple). In case Matplotlib decides to implement the float-only version, I'll still maintain that module for the foreseeable future. It doesn't have many users on its own (I think), but my main project FastF1 depends on it, and a large part of those users indirectly use Timple (with autolocators and autoformatters) because a large part of the data there is Timedelta values.
I agree with your summary above. I added this to the weekly Matplotlib meeting. @theOehrly you are definitely invited to attend if you like: https://hackmd.io/l9vkn_T4RSmk147H_ZPPBA

My thinking for floats versus full datetime-like support is that it is quite easy to make a timedelta64 a datetime64 by adding a starting datetime64. Then the users get all the flexibility of our datetime64 support with no added code.
Matplotlib can't do that automatically, because we don't know when the user started their experiment. The difficulty with this in the past has also been that some timedeltas are indeed nanosecond scale, and because we drop from datetime64 to datetime, we lose the resolution. The advantage of coercing to float by default is that the user gets something rather than an error. If their data is indeed at the nanoscale, they need never mess with our datetime code. Changing units is also straightforward, if a bit annoying, and is the way users could coerce to float anyhow:
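A sketch of what such a user-side unit change might look like (an assumed example; dividing by a one-unit timedelta64 is standard numpy, but this is not the snippet from the original comment, which was lost in the page scrape):

```python
import numpy as np

td = np.arange(0, 48, 6, dtype='timedelta64[h]')  # 0 h, 6 h, ..., 42 h

# Dividing a timedelta64 array by a one-unit timedelta64 yields plain
# floats in that unit, which Matplotlib plots like any other numbers:
hours = td / np.timedelta64(1, 'h')  # 0.0, 6.0, ..., 42.0
days = td / np.timedelta64(1, 'D')   # 0.0, 0.25, ..., 1.75

print(hours[-1], days[-1])  # 42.0 1.75
```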
@jklymak I should be able to participate in the meeting. And I'll try to give this some more thought until then.
Force-pushed from 6dd0aa2 to 7be8e76

theOehrly commented Oct 23, 2023 (edited)
For now, I have only rebased onto the current main. Nothing else has been changed so far. Also, here's the answer to one of the questions that came up during the meeting, about how this is handled by Pandas at the moment. It was noted that their implementation for timedelta plotting is very concise. Here's the reason:
Force-pushed from 5b1274e to 767c55d
@jklymak, an update on the progress of this PR:

The implementation of all features is done. As far as features are comparable, there is basically feature parity with the datetime support.

Tests are all done, from my point of view. Coverage is near 100%. This part is ready for review. If you have time, you or someone else can already have a look at the code and tests.

Documentation is still a bit of a work in progress. I've started to restructure the whole documentation for

I haven't added any type hints so far. Type hints through

String formats for the

The partially custom formatting codes for timedelta are done. They are a subset, with extensions, of the standard datetime formatting codes. The supported codes are fully documented, with notes where they differ from the datetime standard. Some feedback here is appreciated as well.

We discussed that locating ticks on weeks is not desired in most cases, and that we will stick with days as the largest unit. Accordingly, I've removed weekly locating from the auto-locators. It is still available to be manually configured through the manual

It was suggested that the timedelta support is marked as a preliminary/experimental part of the API for now. How would you like this to be represented in the documentation? Should I add a
PR Summary
This pull request adds full support for plotting timedelta values similar to the support that is available for plotting datetime values.
It includes converters, conversion functions, formatters and locators.
This allows for plotting `datetime.timedelta`, `numpy.timedelta64` and `pandas.Timedelta`. Locators and formatters provide nice ticks as multiples of default units of time (days, hours, ...).
PR Checklist

- [ ] Has pytest style unit tests (and pytest passes).
- [ ] Is Flake 8 compliant (run flake8 on changed files to check).
- [ ] Docstrings are compliant (install flake8-docstrings and run flake8 --docstring-convention=all).
- [ ] New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
- [ ] API changes are documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

Examples
Example 1
Random timeseries data over a timedelta range of 100 days to 102 days
Example 2
The same as example 1, but using `ConciseTimedeltaFormatter` instead of `AutoTimedeltaFormatter`.
More detail
Conversion of values
Locators
Fixed/AutoTimedeltaLocator similar to locators for datetime.
Formatting
Function for timedelta string formatting with a template string. This is not natively available (contrary to formatting datetime), therefore I created my own implementation. I tried to stay as close as possible to the C standard datetime formatting with timedelta.
Example: "%d days, %H:%M" --> '12 days, 14:34'
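A minimal sketch of such template formatting (an assumed illustration only; the PR's actual formatter supports more codes and edge cases):

```python
import datetime

# Hypothetical strftime-like formatter for timedeltas supporting
# %d (whole days), %H, %M and %S codes, as in the example above.
def strftimedelta(td, fmt):
    total = int(td.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return (fmt.replace('%d', str(days))
               .replace('%H', f'{hours:02d}')
               .replace('%M', f'{minutes:02d}')
               .replace('%S', f'{seconds:02d}'))


print(strftimedelta(datetime.timedelta(days=12, hours=14, minutes=34),
                    '%d days, %H:%M'))  # → 12 days, 14:34
```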
Fixed/Auto/ConciseFormatter similar to datetime formatters
General
General documentation does exist but certainly needs improvements, more examples and some things are still missing.
I have tried to create basic tests for all of the new functionality but I assume some additional tests may be necessary.
Reasons why I think this is an improvement