Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Don't distinct where it's not needed.#4924

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
mr-russ wants to merge2 commits intoduplicati:master
base:master
Choose a base branch
Loading
frommr-russ:reducedistinct

Conversation

@mr-russ
Copy link
Contributor

DISTINCT and UNION will remove duplicates, requiring a forced sort, unique and temporary table step. It can't be flattened or optimized in other ways. So they are removed to allow SQLite to do optimization of searches.

DISTINCT and UNION will remove duplicates, requiringa forced sort, unique and temporary table step.  It can'tbe flattened or optimized in other ways.  So they are removedto allow SQLite to do optimization of searches.
@ts678
Copy link
Collaborator

I'm wondering how these specific ones were chosen and by what process they're guaranteed to not change behavior badly.
Long ago I tested two (possibly just the DISTINCT part rather than final result) on an example database and result changed.

@Jojo-1000
Copy link
Contributor

I started to run through all of these queries on a copy of my main backup database (with 100kB blocksize, so lots of blocks) to see what improvement is there. It seems like in some cases it can be 30% faster while in others it is just the same.

what process they're guaranteed to not change behavior badly

What I see in a lot of places is aNOT IN (SELECT DISTINCT ...). I am pretty sure at least those are save to change (IN should not care about duplicates). The others need some careful consideration, which is why I am trying to see if it is worth it in terms of performance.

@duplicatibot
Copy link

This pull request has been mentioned onDuplicati. There might be relevant details there:

https://forum.duplicati.com/t/reducing-time-spent-deleting-blocks/16403/9

SELECT DISTINCT ""Path"" AS ""RealPath"",
? || length(""RealPath"") || ? || row_number() OVER () ||
CASE WHEN substr(""RealPath"", length(""RealPath"")) = ? THEN ? ELSE ? END) AS ""Obfuscated"" FROM ""File""",
Platform.IsClientPosix?"/":"X:\\",Util.DirectorySeparatorString,Util.DirectorySeparatorString,".bin");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This query has a missing( before line 3 and lacks a 3rd dir separator in parameters.
Combined execution time on my main backup database of old 3 queries was 3938ms, new is 6030ms.

deletecmd.ExecuteNonQuery(string.Format(@"CREATE TEMP TABLE ""{0}"" (""ID"" INTEGER PRIMARY KEY)",blocksetidstable));
deletecmd.ExecuteNonQuery(string.Format(@"INSERT OR IGNORE INTO ""{0}"" (""ID"") {1}",blocksetidstable,bsIdsSubQuery));
bsIdsSubQuery=string.Format(@"SELECTDISTINCT""ID"" FROM ""{0}"" ",blocksetidstable);
bsIdsSubQuery=string.Format(@"SELECT ""ID"" FROM ""{0}"" ",blocksetidstable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No difference in performance

UNION ALL
SELECT""BlocksetID"" FROM ""BlocklistHash""
WHERE ""Hash"" IN (SELECT ""Hash"" FROM ""Block"" WHERE ""VolumeID"" IN ({volIdsSubQuery}))";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Test with 4 random volumes
Old: 98ms
New: 86ms (contains duplicates, but should not matter)

In combined temp table insert:
Old: 100ms
New: 90ms


varreal_count=cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM ""BlocklistHash""",0);
varunique_count=cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECTDISTINCT""BlocksetID"", ""Index"" FROM ""BlocklistHash"")",0);
varunique_count=cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")",0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

In my dataset both queries returned the same, but this specifically compares unique entries to all entries and (BlocksetID, Index) is not guaranteed unique in the table

foreach(varfilesetidincmd.ExecuteReaderEnumerable(@"SELECT ""ID"" FROM ""Fileset"" ").Select(x=>x.ConvertValueToInt64(0,-1)))
{
varexpandedCmd=string.Format(@"SELECT COUNT(*) FROM (SELECTDISTINCT""Path"" FROM ({0}) UNIONSELECT DISTINCT ""Path"" FROM ({1}))",LocalDatabase.LIST_FILESETS,LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);
varexpandedCmd=string.Format(@"SELECT COUNT(DISTINCT ""Path"") FROM (SELECT ""Path"" FROM ({0}) UNIONALL SELECT ""Path"" FROM ({1}))",LocalDatabase.LIST_FILESETS,LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 13398ms
New: 14160ms

" UNION "+
" UNIONALL"+
@"SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))"+
")",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No difference in performance

@" UNIONALL"+
@"SELECT ""Path"" FROM ""File"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))"+
@") ORDER BY ""Path"" ",
string.Join(",",items.Select(x=>"?"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 12914ms
New: 6123ms

@" UNION "+
@" UNIONALL"+
@" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""Metadataset"" WHERE ""ID"" IN ( SELECT ""MetadataID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))))"+
@")",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 19458ms
New: 13540ms

}

cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECTDISTINCT""Path"" FROM ""{2}"") ",tablename,combined,filenamestable),filesetId);
cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT ""Path"" FROM ""{2}"") ",tablename,combined,filenamestable),filesetId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 13285ms
New: 10527ms

using(varcmd=m_connection.CreateCommand())
{
cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECTDISTINCT""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" "+query,m_tablename),args);
cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" "+query,m_tablename),args);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Both run quickly in my test, but ID is unique so there should be no difference in results

@Jojo-1000
Copy link
Contributor

Jojo-1000 commentedJun 23, 2023
edited
Loading

I tried to run each query individually to see how much faster it is, but I gave up in the middle so I don't have times for the others. However, inmy test case that runs all database queries there was no significant performance improvement from this PR.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

1 more reviewer

@Jojo-1000Jojo-1000Jojo-1000 left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

4 participants

@mr-russ@ts678@Jojo-1000@duplicatibot

[8]ページ先頭

©2009-2025 Movatter.jp