
This issue trackerhas been migrated toGitHub, and is currentlyread-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.
Created on2011-04-08 03:37 bylopgok, last changed2022-04-11 14:57 byadmin. This issue is nowclosed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| filecmp-lru-cache-3.3.diff | nadeem.vawda,2011-04-08 16:54 | Limit the size of filecmp._cache using LRU replacement | review | |
| filecmp-lru-cache-2.7.diff | nadeem.vawda,2011-04-08 17:09 | Corrected version of 2.7 patch | ||
| Messages (21) | |||
|---|---|---|---|
| msg133290 -(view) | Author: jeff deifik (lopgok) | Date: 2011-04-08 03:37 | |
I have a program which calls filecmp.cmp a lot.It runs out of memory.I read the source to filecmp, and then I periodically setfilecmp._cache = {}Without doing this, filecmp's cache uses up all the memory in the computer.There needs to be a documented interface to clear the cache.I suggest a functiondef clear_cache: _cache = {}Without a documented interface, there is no standard way to clear thecache. It is possible different versions of python will requiredifferent methods to clear the cache, which will reduce python codeportability and is a bad idea.Alternatively, one might disable the caching code.One shouldn't have to look at the source code of a library functionto see why it is consuming memory. | |||
| msg133300 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-04-08 09:16 | |
I've looked at the code for Python 3, and there isn't anything there thatprevents this from happening there, either. So the fix should be appliedto 3.2 and 3.3 as well.An alternative approach would be to limit the size of the cache, so thatthe caller doesn't need to explicitly clear the cache. Something alongthe lines of functools.lru_cache() should do the trick. I don't thinkit'll be possible to use lru_cache() itself, though - it doesn't providea mechanism to invalidate cache entries when they become stale (and inany case, it doesn't exist in 2.7). | |||
| msg133304 -(view) | Author: R. David Murray (r.david.murray)*![]() | Date: 2011-04-08 11:09 | |
Putting in a size limit is reasonable. We did this for fnmatch not that long ago (issue 7846). That was in fact the inspiration for lru_cache. | |||
| msg133326 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-04-08 16:54 | |
Patch for 3.3 and 3.2 | |||
| msg133327 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-04-08 16:55 | |
Patch for 2.7. | |||
| msg133328 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-04-08 17:09 | |
Oops, there was a typo in the 2.7 patch ("import _thread" instead of"import thread"). Corrected patch attached. | |||
| msg133386 -(view) | Author: Éric Araujo (eric.araujo)*![]() | Date: 2011-04-09 13:49 | |
Why use an ordered dict instead of functools.lru_cache? | |||
| msg133394 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-04-09 14:40 | |
Because the lru_cache decorator doesn't provide any way to invalidatestale cache entries.Perhaps I should factor out the duplicated code into a separate classthat can then also be exposed to users of the stdlib. But that would onlyapply to 3.3, so the uglier fix is still necessary for older versions. | |||
| msg133398 -(view) | Author: Raymond Hettinger (rhettinger)*![]() | Date: 2011-04-09 15:57 | |
I question whether this should be backported. Please discuss with the RM. | |||
| msg133401 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-04-09 16:46 | |
> I question whether this should be backported. Please discuss with the RM.Will do. Are you referring specifically to 2.7, or to 3.1 and 3.2 as well? | |||
| msg133470 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-04-10 15:33 | |
Georg? Benjamin? Do you think this fix should be backported? | |||
| msg136009 -(view) | Author: Georg Brandl (georg.brandl)*![]() | Date: 2011-05-15 06:37 | |
-1 on backporting. | |||
| msg136029 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-05-15 12:31 | |
OK. I'll try to put together something cleaner just for 3.3, then. | |||
| msg137773 -(view) | Author: Raymond Hettinger (rhettinger)*![]() | Date: 2011-06-06 19:59 | |
Nadeem, I want to review this but won't have a chance to do it right away. Offhand, it seems like we could use the existing functools.lru_cache() for this if the stats were included as part of the key: cache[f1, f2, s1, s2]=outcome.Also, I want to take a fresh look at the cache strategy (saving diffs of two files vs saving file contents individually) and think about whether than makes any sense at all for real world use cases (is there a common need to compare the same file pairs over and over again or is the typical use the comparison of many different file pairs). There may even be a better way to approach the underlying problem using hashes of entire files (md5, sha1, etc). | |||
| msg137774 -(view) | Author: jeff deifik (lopgok) | Date: 2011-06-06 20:17 | |
There are many possible solutions to this problem.Personally, I think mine is the simplest, though it changes the API.However, there have been several suggestions on simple fixes that don't change the API, all of which fix the resource leak.Doing nothing will not fix the resource leak.How about a simple fix right now, using a lru cache, fixing all versions of Python, and perhaps come up with a superior solution at a later date? | |||
| msg137775 -(view) | Author: Raymond Hettinger (rhettinger)*![]() | Date: 2011-06-06 20:28 | |
We will do something. The next release isn't for a while, so there is time to put thought into it rather than making an immediate check-in. | |||
| msg137782 -(view) | Author: Nadeem Vawda (nadeem.vawda)*![]() | Date: 2011-06-07 00:30 | |
> Also, I want to take a fresh look at the cache strategy (saving diffs> of two files vs saving file contents individually) and think about> whether than makes any sense at all for real world use cases> (is there a common need to compare the same file pairs over and over> again or is the typical use the comparison of many different file> pairs). There may even be a better way to approach the underlying> problem using hashes of entire files (md5, sha1, etc).I like that idea. A hash-based approach could speed up the detection ofnon-equal files quite a bit. | |||
| msg139083 -(view) | Author: Roundup Robot (python-dev)![]() | Date: 2011-06-25 15:14 | |
New changeset11568c59d9d4 by Raymond Hettinger in branch '2.7':Issue 11802: filecmp cache was growing without bound.http://hg.python.org/cpython/rev/11568c59d9d4 | |||
| msg139085 -(view) | Author: Roundup Robot (python-dev)![]() | Date: 2011-06-25 15:21 | |
New changeset2bacaf6a80c4 by Raymond Hettinger in branch '3.2':Issue 11802: filecmp cache was growing without bound.http://hg.python.org/cpython/rev/2bacaf6a80c4New changeset8f4466619e1c by Raymond Hettinger in branch 'default':Issue 11802: filecmp cache was growing without bound.http://hg.python.org/cpython/rev/8f4466619e1c | |||
| msg139086 -(view) | Author: Raymond Hettinger (rhettinger)*![]() | Date: 2011-06-25 15:24 | |
Made a simple fix to keep the cache from growing without bound.Leaving this open for 3.3 as a feature request to implement a more sophisticated strategy using file hashes or somesuch. | |||
| msg139155 -(view) | Author: Raymond Hettinger (rhettinger)*![]() | Date: 2011-06-26 09:07 | |
After more thought, will just close this report. If a new project emerges to improve the design of filecmp, it can be done in a separate tracker entry. | |||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:15 | admin | set | github: 56011 |
| 2011-06-26 09:07:25 | rhettinger | set | status: open -> closed resolution: later -> fixed messages: +msg139155 |
| 2011-06-25 15:24:50 | rhettinger | set | messages: +msg139086 |
| 2011-06-25 15:21:29 | python-dev | set | messages: +msg139085 |
| 2011-06-25 15:15:00 | python-dev | set | nosy: +python-dev messages: +msg139083 |
| 2011-06-07 00:30:18 | nadeem.vawda | set | messages: +msg137782 |
| 2011-06-06 20:28:43 | rhettinger | set | messages: +msg137775 |
| 2011-06-06 20:17:36 | lopgok | set | messages: +msg137774 |
| 2011-06-06 19:59:26 | rhettinger | set | assignee:nadeem.vawda ->rhettinger resolution: later messages: +msg137773 |
| 2011-05-15 12:31:07 | nadeem.vawda | set | assignee:nadeem.vawda stage: patch review -> needs patch messages: +msg136029 versions: - Python 3.1, Python 2.7, Python 3.2 |
| 2011-05-15 06:37:36 | georg.brandl | set | messages: +msg136009 |
| 2011-04-10 15:33:48 | nadeem.vawda | set | nosy: +georg.brandl,benjamin.peterson messages: +msg133470 |
| 2011-04-09 16:46:20 | nadeem.vawda | set | messages: +msg133401 |
| 2011-04-09 15:57:16 | rhettinger | set | messages: +msg133398 |
| 2011-04-09 14:40:49 | nadeem.vawda | set | messages: +msg133394 |
| 2011-04-09 13:49:57 | eric.araujo | set | messages: +msg133386 versions: + Python 3.1 |
| 2011-04-09 13:49:09 | eric.araujo | set | files: -filecmp-lru-cache-2.7.diff |
| 2011-04-08 17:09:59 | nadeem.vawda | set | files: +filecmp-lru-cache-2.7.diff messages: +msg133328 |
| 2011-04-08 16:56:51 | nadeem.vawda | set | stage: needs patch -> patch review |
| 2011-04-08 16:55:59 | nadeem.vawda | set | files: +filecmp-lru-cache-2.7.diff messages: +msg133327 |
| 2011-04-08 16:54:38 | nadeem.vawda | set | files: +filecmp-lru-cache-3.3.diff keywords: +patch messages: +msg133326 |
| 2011-04-08 15:55:46 | eric.araujo | set | nosy: +rhettinger,eric.araujo |
| 2011-04-08 11:09:17 | r.david.murray | set | nosy: +r.david.murray messages: +msg133304 |
| 2011-04-08 09:16:42 | nadeem.vawda | set | versions: + Python 3.2, Python 3.3 nosy: +nadeem.vawda messages: +msg133300 type: resource usage stage: needs patch |
| 2011-04-08 03:37:23 | lopgok | create | |