Commit a45c78e

Rearrange pg_dump's handling of large objects for better efficiency.
Commit c0d5be5 caused pg_dump to create a separate BLOB metadata TOC entry for each large object (blob), but it did not touch the ancient decision to put all the blobs' data into a single "BLOBS" TOC entry. This is bad for a few reasons: for databases with millions of blobs, the TOC becomes unreasonably large, causing performance issues; selective restore of just some blobs is quite impossible; and we cannot parallelize either dump or restore of the blob data, since our architecture for that relies on farming out whole TOC entries to worker processes.

To improve matters, let's group multiple blobs into each blob metadata TOC entry, and then make corresponding per-group blob data TOC entries. Selective restore using pg_restore's -l/-L switches is then possible, though only at the group level. (Perhaps we should provide a switch to allow forcing one-blob-per-group for users who need precise selective restore and don't have huge numbers of blobs. This patch doesn't do that, instead just hard-wiring the maximum number of blobs per entry at 1000.)

The blobs in a group must all have the same owner, since the TOC entry format only allows one owner to be named. In this implementation we also require them to all share the same ACL (grants); the archive format wouldn't require that, but pg_dump's representation of DumpableObjects does. It seems unlikely that either restriction will be problematic for databases with huge numbers of blobs.

The metadata TOC entries now have a "desc" string of "BLOB METADATA", and their "defn" string is just a newline-separated list of blob OIDs. The restore code has to generate creation commands, ALTER OWNER commands, and drop commands (for --clean mode) from that. We would need special-case code for ALTER OWNER and drop in any case, so the alternative of keeping the "defn" as directly executable SQL code for creation wouldn't buy much, and it seems like it'd bloat the archive to little purpose.

Since we require the blobs of a metadata group to share the same ACL, we can furthermore store only one copy of that ACL, and then make pg_restore regenerate the appropriate commands for each blob. This saves space in the dump file not only by removing duplicative SQL command strings, but by not needing a separate TOC entry for each blob's ACL. In turn, that reduces client-side memory requirements for handling many blobs.

ACL TOC entries that need this special processing are labeled as "ACL"/"LARGE OBJECTS nnn..nnn". If we have a blob with a unique ACL, continue to label it as "ACL"/"LARGE OBJECT nnn". We don't actually have to make such a distinction, but it saves a few cycles during restore for the easy case, and it seems like a good idea to not change the TOC contents unnecessarily.

The data TOC entries ("BLOBS") are exactly the same as before, except that now there can be more than one, so we'd better give them identifying tag strings.

Also, commit c0d5be5 put the new BLOB metadata TOC entries into SECTION_PRE_DATA, which perhaps is defensible in some ways, but it's a rather odd choice considering that we go out of our way to treat blobs as data. Moreover, because parallel restore handles the PRE_DATA section serially, this means we'd only get part of the parallelism speedup we could hope for. Move these entries into SECTION_DATA, letting us parallelize the lo_create calls not just the data loading when there are many blobs. Add dependencies to ensure that we won't try to load data for a blob we've not yet created.

As this stands, we still generate a separate TOC entry for any comment or security label attached to a blob. I feel comfortable in believing that comments and security labels on blobs are rare, so this patch should be enough to get most of the useful TOC compression for blobs.

We have to bump the archive file format version number, since existing versions of pg_restore wouldn't know they need to do something special for BLOB METADATA, plus they aren't going to work correctly with multiple BLOBS entries or multiple-large-object ACL entries.

The directory and tar-file format handlers need some work for multiple BLOBS entries: they used to hard-wire the file name as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc". The 002_pg_dump.pl test script also knows about that and requires minor updates. (I had to drop the test for manually-compressed blobs.toc files with LZ4, because lz4's obtuse command line design requires explicit specification of the output file name, which seems impractical here. I don't think we're losing any useful test coverage thereby; that test stanza seems completely duplicative with the gzip and zstd cases anyway.)

In passing, centralize management of the lo_buf used to hold data while restoring blobs. The code previously had each format handler create lo_buf, which seems rather pointless given that the format handlers all make it the same way. Moreover, the format handlers never use lo_buf directly, making this setup a failure from a separation-of-concerns standpoint. Let's move the responsibility into pg_backup_archiver.c, which is the only module concerned with lo_buf. The reason to do this in this patch is that it allows a centralized fix for the now-false assumption that we never restore blobs in parallel. Also, get rid of dead code in DropLOIfExists: it's been a long time since we had any need to be able to restore to a pre-9.0 server.

Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com
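
Editor's note: to make the "defn is just a newline-separated list of blob OIDs" point concrete, here is a minimal standalone C sketch of the kind of per-blob command expansion the restore side must perform for a BLOB METADATA entry. It is an illustration only: the function name command_per_blob, the sample OIDs, the role name blob_owner, and the plain printf output are assumptions made for the example; the helper this commit actually adds (IssueCommandPerBlob, declared in pg_backup_archiver.h below) operates on the archive handle and TOC entry rather than printing to stdout.

#include <stdio.h>
#include <string.h>

/*
 * Illustration: expand a newline-separated OID list into one SQL command
 * per large object by wrapping each OID in a prefix/suffix pair.
 */
static void
command_per_blob(const char *defn, const char *cmdBegin, const char *cmdEnd)
{
	const char *p = defn;

	while (*p != '\0')
	{
		const char *eol = strchr(p, '\n');
		size_t		len = eol ? (size_t) (eol - p) : strlen(p);

		if (len > 0)
			printf("%s%.*s%s;\n", cmdBegin, (int) len, p, cmdEnd);
		p += len;
		if (*p == '\n')
			p++;
	}
}

int
main(void)
{
	/* hypothetical defn string for a three-blob metadata group */
	const char *defn = "16401\n16402\n16403\n";

	/* creation commands, as needed when restoring a BLOB METADATA entry */
	command_per_blob(defn, "SELECT pg_catalog.lo_create('", "')");

	/* ownership commands, one per blob in the group */
	command_per_blob(defn, "ALTER LARGE OBJECT ", " OWNER TO blob_owner");
	return 0;
}

The same expansion idea covers drop commands for --clean mode (wrapping each OID in a lo_unlink call), as the RestoreArchive changes in the pg_backup_archiver.c diff below show.
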
1 parent 5eac8ce, commit a45c78e

File tree: 11 files changed (+530, -261 lines)

‎src/bin/pg_dump/common.c

Lines changed: 26 additions & 0 deletions
@@ -47,6 +47,8 @@ static DumpId lastDumpId = 0;	/* Note: 0 is InvalidDumpId */
  * expects that it can move them around when resizing the table.  So we
  * cannot make the DumpableObjects be elements of the hash table directly;
  * instead, the hash table elements contain pointers to DumpableObjects.
+ * This does have the advantage of letting us map multiple CatalogIds
+ * to one DumpableObject, which is useful for blobs.
  *
  * It turns out to be convenient to also use this data structure to map
  * CatalogIds to owning extensions, if any.  Since extension membership
@@ -700,6 +702,30 @@ AssignDumpId(DumpableObject *dobj)
 	}
 }
 
+/*
+ * recordAdditionalCatalogID
+ *	  Record an additional catalog ID for the given DumpableObject
+ */
+void
+recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj)
+{
+	CatalogIdMapEntry *entry;
+	bool		found;
+
+	/* CatalogId hash table must exist, if we have a DumpableObject */
+	Assert(catalogIdHash != NULL);
+
+	/* Add reference to CatalogId hash */
+	entry = catalogid_insert(catalogIdHash, catId, &found);
+	if (!found)
+	{
+		entry->dobj = NULL;
+		entry->ext = NULL;
+	}
+	Assert(entry->dobj == NULL);
+	entry->dobj = dobj;
+}
+
 /*
  * Assign a DumpId that's not tied to a DumpableObject.
  *

‎src/bin/pg_dump/pg_backup_archiver.c

Lines changed: 80 additions & 26 deletions
@@ -512,7 +512,20 @@ RestoreArchive(Archive *AHX)
 			 * don't necessarily emit it verbatim; at this point we add an
 			 * appropriate IF EXISTS clause, if the user requested it.
 			 */
-			if (*te->dropStmt != '\0')
+			if (strcmp(te->desc, "BLOB METADATA") == 0)
+			{
+				/* We must generate the per-blob commands */
+				if (ropt->if_exists)
+					IssueCommandPerBlob(AH, te,
+										"SELECT pg_catalog.lo_unlink(oid) "
+										"FROM pg_catalog.pg_largeobject_metadata "
+										"WHERE oid = '", "'");
+				else
+					IssueCommandPerBlob(AH, te,
+										"SELECT pg_catalog.lo_unlink('",
+										"')");
+			}
+			else if (*te->dropStmt != '\0')
 			{
 				if (!ropt->if_exists ||
 					strncmp(te->dropStmt, "--", 2) == 0)
@@ -528,12 +541,12 @@ RestoreArchive(Archive *AHX)
 				{
 					/*
 					 * Inject an appropriate spelling of "if exists".  For
-					 * large objects, we have a separate routine that
+					 * old-style large objects, we have a routine that
 					 * knows how to do it, without depending on
 					 * te->dropStmt; use that.  For other objects we need
 					 * to parse the command.
 					 */
-					if (strncmp(te->desc, "BLOB", 4) == 0)
+					if (strcmp(te->desc, "BLOB") == 0)
 					{
 						DropLOIfExists(AH, te->catalogId.oid);
 					}
@@ -1290,7 +1303,7 @@ EndLO(Archive *AHX, Oid oid)
 **********/
 
 /*
- * Called by a format handler before any LOs are restored
+ * Called by a format handler before a group of LOs is restored
  */
 void
 StartRestoreLOs(ArchiveHandle *AH)
@@ -1309,7 +1322,7 @@ StartRestoreLOs(ArchiveHandle *AH)
 }
 
 /*
- * Called by a format handler after all LOs are restored
+ * Called by a format handler after a group of LOs is restored
  */
 void
 EndRestoreLOs(ArchiveHandle *AH)
@@ -1343,6 +1356,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop)
 	AH->loCount++;
 
 	/* Initialize the LO Buffer */
+	if (AH->lo_buf == NULL)
+	{
+		/* First time through (in this process) so allocate the buffer */
+		AH->lo_buf_size = LOBBUFSIZE;
+		AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
+	}
 	AH->lo_buf_used = 0;
 
 	pg_log_info("restoring large object with OID %u", oid);
@@ -2988,19 +3007,20 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	{
 		/*
 		 * Special Case: If 'SEQUENCE SET' or anything to do with LOs, then it
-		 * is considered a data entry.  We don't need to check for the BLOBS
-		 * entry or old-style BLOB COMMENTS, because they will have had Dumper
-		 * = true ... but we do need to check new-style BLOB ACLs, comments,
+		 * is considered a data entry.  We don't need to check for BLOBS or
+		 * old-style BLOB COMMENTS entries, because they will have had Dumper =
+		 * true ... but we do need to check new-style BLOB ACLs, comments,
 		 * etc.
 		 */
 		if (strcmp(te->desc, "SEQUENCE SET") == 0 ||
 			strcmp(te->desc, "BLOB") == 0 ||
+			strcmp(te->desc, "BLOB METADATA") == 0 ||
 			(strcmp(te->desc, "ACL") == 0 &&
-			 strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+			 strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			(strcmp(te->desc, "COMMENT") == 0 &&
-			 strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+			 strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			(strcmp(te->desc, "SECURITY LABEL") == 0 &&
-			 strncmp(te->tag, "LARGE OBJECT ", 13) == 0))
+			 strncmp(te->tag, "LARGE OBJECT", 12) == 0))
 			res = res & REQ_DATA;
 		else
 			res = res & ~REQ_DATA;
@@ -3035,12 +3055,13 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 		if (!(ropt->sequence_data && strcmp(te->desc, "SEQUENCE SET") == 0) &&
 			!(ropt->binary_upgrade &&
 			  (strcmp(te->desc, "BLOB") == 0 ||
+			   strcmp(te->desc, "BLOB METADATA") == 0 ||
 			   (strcmp(te->desc, "ACL") == 0 &&
-				strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+				strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			   (strcmp(te->desc, "COMMENT") == 0 &&
-				strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+				strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			   (strcmp(te->desc, "SECURITY LABEL") == 0 &&
-				strncmp(te->tag, "LARGE OBJECT ", 13) == 0))))
+				strncmp(te->tag, "LARGE OBJECT", 12) == 0))))
 			res = res & REQ_SCHEMA;
 	}
 
@@ -3607,18 +3628,35 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 	}
 
 	/*
-	 * Actually print the definition.
+	 * Actually print the definition.  Normally we can just print the defn
+	 * string if any, but we have three special cases:
 	 *
-	 * Really crude hack for suppressing AUTHORIZATION clause that old pg_dump
+	 * 1. A crude hack for suppressing AUTHORIZATION clause that old pg_dump
 	 * versions put into CREATE SCHEMA.  Don't mutate the variant for schema
 	 * "public" that is a comment.  We have to do this when --no-owner mode is
 	 * selected.  This is ugly, but I see no other good way ...
+	 *
+	 * 2. BLOB METADATA entries need special processing since their defn
+	 * strings are just lists of OIDs, not complete SQL commands.
+	 *
+	 * 3. ACL LARGE OBJECTS entries need special processing because they
+	 * contain only one copy of the ACL GRANT/REVOKE commands, which we must
+	 * apply to each large object listed in the associated BLOB METADATA.
	 */
	if (ropt->noOwner &&
		strcmp(te->desc, "SCHEMA") == 0 && strncmp(te->defn, "--", 2) != 0)
	{
		ahprintf(AH, "CREATE SCHEMA %s;\n\n\n", fmtId(te->tag));
	}
+	else if (strcmp(te->desc, "BLOB METADATA") == 0)
+	{
+		IssueCommandPerBlob(AH, te, "SELECT pg_catalog.lo_create('", "')");
+	}
+	else if (strcmp(te->desc, "ACL") == 0 &&
+			 strncmp(te->tag, "LARGE OBJECTS", 13) == 0)
+	{
+		IssueACLPerBlob(AH, te);
+	}
	else
	{
		if (te->defn && strlen(te->defn) > 0)
@@ -3639,18 +3677,31 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
		te->owner && strlen(te->owner) > 0 &&
		te->dropStmt && strlen(te->dropStmt) > 0)
	{
-		PQExpBufferData temp;
+		if (strcmp(te->desc, "BLOB METADATA") == 0)
+		{
+			/* BLOB METADATA needs special code to handle multiple LOs */
+			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
 
-		initPQExpBuffer(&temp);
-		_getObjectDescription(&temp, te);
+			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);
+			pg_free(cmdEnd);
+		}
+		else
+		{
+			/* For all other cases, we can use _getObjectDescription */
+			PQExpBufferData temp;
 
-		/*
-		 * If _getObjectDescription() didn't fill the buffer, then there is no
-		 * owner.
-		 */
-		if (temp.data[0])
-			ahprintf(AH, "ALTER %s OWNER TO %s;\n\n", temp.data, fmtId(te->owner));
-		termPQExpBuffer(&temp);
+			initPQExpBuffer(&temp);
+			_getObjectDescription(&temp, te);
+
+			/*
+			 * If _getObjectDescription() didn't fill the buffer, then there
+			 * is no owner.
+			 */
+			if (temp.data[0])
+				ahprintf(AH, "ALTER %s OWNER TO %s;\n\n",
+						 temp.data, fmtId(te->owner));
+			termPQExpBuffer(&temp);
+		}
	}
 
	/*
@@ -4749,6 +4800,9 @@ CloneArchive(ArchiveHandle *AH)
	/* clone has its own error count, too */
	clone->public.n_errors = 0;
 
+	/* clones should not share lo_buf */
+	clone->lo_buf = NULL;
+
	/*
	 * Connect our new clone object to the database, using the same connection
	 * parameters used for the original connection.

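Editor's note: the IssueACLPerBlob branch added to _printTocEntry above follows the same fan-out pattern for ACLs, re-applying one stored copy of the GRANT/REVOKE commands to every blob in the group. A purely illustrative standalone C sketch of that fan-out follows; the role name reader_role, the GRANT text, and the OIDs are assumptions made for the example, and the real IssueACLPerBlob implementation (which works from the archive's stored ACL commands) is not part of this diff.

#include <stdio.h>

int
main(void)
{
	/* hypothetical group of large objects that share one ACL */
	const unsigned int oids[] = {16401, 16402, 16403};

	/* re-issue the single stored GRANT once per blob in the group */
	for (size_t i = 0; i < sizeof(oids) / sizeof(oids[0]); i++)
		printf("GRANT SELECT ON LARGE OBJECT %u TO reader_role;\n", oids[i]);

	return 0;
}

Storing the ACL once per group rather than once per blob is what lets the commit drop the per-blob ACL TOC entries described in the commit message.
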
‎src/bin/pg_dump/pg_backup_archiver.h

Lines changed: 6 additions & 1 deletion
@@ -68,10 +68,12 @@
 #define K_VERS_1_15 MAKE_ARCHIVE_VERSION(1, 15, 0)	/* add
													 * compression_algorithm
													 * in header */
+#define K_VERS_1_16 MAKE_ARCHIVE_VERSION(1, 16, 0)	/* BLOB METADATA entries
+													 * and multiple BLOBS */
 
 /* Current archive version number (the format we can output) */
 #define K_VERS_MAJOR 1
-#define K_VERS_MINOR 15
+#define K_VERS_MINOR 16
 #define K_VERS_REV 0
 #define K_VERS_SELF MAKE_ARCHIVE_VERSION(K_VERS_MAJOR, K_VERS_MINOR, K_VERS_REV)
 
@@ -448,6 +450,9 @@ extern void InitArchiveFmt_Tar(ArchiveHandle *AH);
 extern bool isValidTarHeader(char *header);
 
 extern void ReconnectToServer(ArchiveHandle *AH, const char *dbname);
+extern void IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te,
+								const char *cmdBegin, const char *cmdEnd);
+extern void IssueACLPerBlob(ArchiveHandle *AH, TocEntry *te);
 extern void DropLOIfExists(ArchiveHandle *AH, Oid oid);
 
 void		ahwrite(const void *ptr, size_t size, size_t nmemb, ArchiveHandle *AH);

‎src/bin/pg_dump/pg_backup_custom.c

Lines changed: 2 additions & 9 deletions
@@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	ctx = (lclContext *) pg_malloc0(sizeof(lclContext));
 	AH->formatData = (void *) ctx;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the file
 	 */
@@ -342,7 +338,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te)
 }
 
 /*
- * Called by the archiver when starting to save all BLOB DATA (not schema).
+ * Called by the archiver when starting to save BLOB DATA (not schema).
  * This routine should save whatever format-specific information is needed
  * to read the LOs back into memory.
  *
@@ -402,7 +398,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid)
 }
 
 /*
- * Called by the archiver when finishing saving all BLOB DATA.
+ * Called by the archiver when finishing saving BLOB DATA.
  *
  * Optional.
  */
@@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH)
	 * share knowledge about where the data blocks are across threads.
	 * _PrintTocData has to be careful about the order of operations on that
	 * state, though.
-	 *
-	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.
	 */
 }
