Description
Description of the problem, including code/CLI snippet
Several REST resources are returning duplicate objects. I have noticed this with both projects and users.
This may be the expected behavior of GitLab itself, but perhaps this Python package, which already handles pagination, could also handle deduplication based on `id`.
Expected Behavior
I would expect no duplicate objects when using `.list(get_all=True, iterator=True)`, even if objects of that type are created while iterating through the pages.
Actual Behavior
If `gl.projects.list(get_all=True, iterator=True)` is called and a project is created while the pages are being fetched, one object is returned twice. The same happens with users and likely with all other object types as well.
End-user mitigation and thoughts
It would be nice if end users didn't have to dedupe themselves.
The code below is overkill, but it includes the information I tracked while trying to understand the problem.
What I have found is that I do get the warning log about an exact match being returned, but I have never seen the `AssertionError` raised. I also tracked the indices for information: in every instance the duplicate was at indices `x99` and `x00` (right on a page boundary).
This makes sense: when a new project or user is created mid-iteration, we have already passed its position and miss it, and everything else shifts by one index.
```
WARNING Duplicate project id 31393 at index 1099 and 1100
WARNING Duplicate project id 30028 at index 2099 and 2100
WARNING Duplicate project id 22457 at index 7899 and 7900
WARNING Duplicate user id 222 at index 10299 and 10300
```
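The shift can be reproduced without a GitLab server. The following is a minimal sketch, not GitLab's actual implementation: it assumes offset-based pages and a newest-first ordering (both assumptions, not confirmed in this issue), and `fetch_page` is a hypothetical helper standing in for one `page`/`per_page` request.

```python
def fetch_page(items, page, per_page):
    """Return one offset-based page (1-indexed) of a list; hypothetical helper."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


# Assumed ordering: newest first, so a new object lands at the front.
projects = list(range(10, 0, -1))  # ids 10, 9, ..., 1
per_page = 5

seen = []
seen.extend(fetch_page(projects, 1, per_page))
projects.insert(0, 11)  # a project is created between the two requests
seen.extend(fetch_page(projects, 2, per_page))

# Record (index, id) for every id that already appeared earlier in the stream.
duplicates = [(i, pid) for i, pid in enumerate(seen) if pid in seen[:i]]
print(seen)
print(duplicates)
```

With a page size of 5, the repeated id sits at indices 4 and 5, the small-scale analogue of `x99` and `x00` with 100 objects per page. The newly created id 11 never shows up in `seen`.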
If deduplication were implemented within python-gitlab itself, it would not need to keep track of all object ids, only the previous page's object ids, since duplicates only occur on page boundaries.
```python
import json
import logging
import tempfile
from pathlib import Path

import click
from gitlab.mixins import CRUDMixin

logger = logging.getLogger(__name__)


def get_stuff(manager: CRUDMixin, **kwargs):
    things = []
    things_by_id = {}
    # e.g. "ProjectManager" -> "project"
    obj_type = manager.__class__.__name__.removesuffix("Manager").lower()
    for i, thing in enumerate(manager.list(iterator=True, **kwargs)):
        if thing.id in things_by_id:
            existing_idx, existing_thing = things_by_id.get(thing.id)
            if existing_thing == thing:
                # Exact duplicate: log it and skip.
                logger.warning(
                    "Duplicate %s id %s at index %d and %d",
                    obj_type, thing.id, existing_idx, i,
                )
                continue
            else:
                # Same id but different content: dump both objects for comparison.
                p1 = Path(tempfile.gettempdir()) / f"{obj_type}_{existing_thing.id}_idx_{existing_idx}"
                p2 = Path(tempfile.gettempdir()) / f"{obj_type}_{thing.id}_idx_{i}"
                with p1.open("wt") as f:
                    print(json.dumps(existing_thing.attributes, indent=2, sort_keys=True), file=f)
                with p2.open("wt") as f:
                    print(json.dumps(thing.attributes, indent=2, sort_keys=True), file=f)
                raise AssertionError(
                    f"Duplicate {obj_type} id {thing.id} at index {existing_idx} and {i}; "
                    f"look at {p1} and {p2}"
                )
        # Remember the position in the deduplicated list alongside the object.
        things_by_id[thing.id] = (len(things), thing)
        things.append(thing)
        # TODO: this would be better done w/ rich or something
        if len(things) % 100 == 0:
            if len(things) % 200 == 0:
                click.secho("...", nl=False, fg="yellow", bold=True)
            else:
                click.secho("...", nl=False, fg="green", bold=True)
        if len(things) % 1000 == 0:
            click.secho(f"\n{len(things)} {obj_type}s", fg="blue")
    click.secho(f"\n{len(things)} {obj_type}s total", fg="blue", bold=True)
    return things
```
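A lighter-weight mitigation, following the idea of remembering only the previous page's ids, might look like the sketch below. It is not part of python-gitlab: `dedupe_recent` is a hypothetical helper, and it assumes that a duplicate always reappears within one page length of its first occurrence, so a sliding window of the most recent `window` ids is enough. Stand-in objects are used here so the example runs without a server.

```python
from collections import deque
from types import SimpleNamespace


def dedupe_recent(objects, window=100):
    """Yield objects, skipping any whose id is among the last `window` ids yielded.

    Hypothetical helper: `window` should be at least the page size in use.
    """
    recent = deque(maxlen=window)  # the oldest ids fall off automatically
    for obj in objects:
        if obj.id in recent:
            continue  # already yielded within the last `window` objects
        recent.append(obj.id)
        yield obj


# Stand-in objects with an `id` attribute; id 6 repeats across a page boundary.
stream = [SimpleNamespace(id=i) for i in (10, 9, 8, 7, 6, 6, 5, 4, 3, 2)]
unique_ids = [obj.id for obj in dedupe_recent(stream, window=5)]
print(unique_ids)
```

In real use the iterator returned by `gl.projects.list(get_all=True, iterator=True)` would be passed in place of `stream`, with `window` set to the page size. Memory stays bounded by `window` regardless of how many objects are listed.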
Specifications
- python-gitlab version: `python-gitlab==4.10.0`
- API version you are using (v3/v4): v4
- Gitlab server version (or gitlab.com): 16.11.6-ee