DISTINCT and UNION will remove duplicates, requiring a forced sort, unique and temporary table step. It can't be flattened or optimized in other ways. So they are removed to allow SQLite to do optimization of searches.

Don't distinct where it's not needed.

d1f95da

DISTINCT and UNION will remove duplicates, requiringa forced sort, unique and temporary table step.  It can'tbe flattened or optimized in other ways.  So they are removedto allow SQLite to do optimization of searches.

mr-russ force-pushed thereducedistinct branch from9e1e98e tod1f95daCompare

April 20, 2023 13:29

Quote table name.

c56e0c6

Copy link

Collaborator

ts678 commentedJun 22, 2023

I'm wondering how these specific ones were chosen and by what process they're guaranteed to not change behavior badly.
Long ago I tested two (possibly just the DISTINCT part rather than final result) on an example database and result changed.

Copy link

Contributor

Jojo-1000 commentedJun 22, 2023

I started to run through all of these queries on a copy of my main backup database (with 100kB blocksize, so lots of blocks) to see what improvement is there. It seems like in some cases it can be 30% faster while in others it is just the same.

what process they're guaranteed to not change behavior badly

What I see in a lot of places is aNOT IN (SELECT DISTINCT ...). I am pretty sure at least those are save to change (IN should not care about duplicates). The others need some careful consideration, which is why I am trying to see if it is worth it in terms of performance.

Copy link

duplicatibot commentedJun 23, 2023

This pull request has been mentioned onDuplicati. There might be relevant details there:

https://forum.duplicati.com/t/reducing-time-spent-deleting-blocks/16403/9

Jojo-1000 reviewed

Jun 23, 2023

View reviewed changes

Duplicati/Library/Main/Database/LocalBugReportDatabase.cs

		SELECT DISTINCT ""Path"" AS ""RealPath"",
		? \|\| length(""RealPath"") \|\| ? \|\| row_number() OVER () \|\|
		CASE WHEN substr(""RealPath"", length(""RealPath"")) = ? THEN ? ELSE ? END) AS ""Obfuscated"" FROM ""File""",
		Platform.IsClientPosix?"/":"X:\\",Util.DirectorySeparatorString,Util.DirectorySeparatorString,".bin");

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This query has a missing( before line 3 and lacks a 3rd dir separator in parameters.
Combined execution time on my main backup database of old 3 queries was 3938ms, new is 6030ms.

Duplicati/Library/Main/Database/LocalDatabase.cs

		deletecmd.ExecuteNonQuery(string.Format(@"CREATE TEMP TABLE ""{0}"" (""ID"" INTEGER PRIMARY KEY)",blocksetidstable));
		deletecmd.ExecuteNonQuery(string.Format(@"INSERT OR IGNORE INTO ""{0}"" (""ID"") {1}",blocksetidstable,bsIdsSubQuery));
		bsIdsSubQuery=string.Format(@"SELECTDISTINCT""ID"" FROM ""{0}"" ",blocksetidstable);
		bsIdsSubQuery=string.Format(@"SELECT ""ID"" FROM ""{0}"" ",blocksetidstable);

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No difference in performance

Duplicati/Library/Main/Database/LocalDatabase.cs

		UNION ALL
		SELECT""BlocksetID"" FROM ""BlocklistHash""
		WHERE ""Hash"" IN (SELECT ""Hash"" FROM ""Block"" WHERE ""VolumeID"" IN ({volIdsSubQuery}))";

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Test with 4 random volumes
Old: 98ms
New: 86ms (contains duplicates, but should not matter)

In combined temp table insert:
Old: 100ms
New: 90ms

Duplicati/Library/Main/Database/LocalDatabase.cs


		varreal_count=cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM ""BlocklistHash""",0);
		varunique_count=cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECTDISTINCT""BlocksetID"", ""Index"" FROM ""BlocklistHash"")",0);
		varunique_count=cmd.ExecuteScalarInt64(@"SELECT Count(*) FROM (SELECT ""BlocksetID"", ""Index"" FROM ""BlocklistHash"")",0);

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

In my dataset both queries returned the same, but this specifically compares unique entries to all entries and (BlocksetID, Index) is not guaranteed unique in the table

Duplicati/Library/Main/Database/LocalDatabase.cs

		foreach(varfilesetidincmd.ExecuteReaderEnumerable(@"SELECT ""ID"" FROM ""Fileset"" ").Select(x=>x.ConvertValueToInt64(0,-1)))
		{
		varexpandedCmd=string.Format(@"SELECT COUNT(*) FROM (SELECTDISTINCT""Path"" FROM ({0}) UNIONSELECT DISTINCT ""Path"" FROM ({1}))",LocalDatabase.LIST_FILESETS,LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);
		varexpandedCmd=string.Format(@"SELECT COUNT(DISTINCT ""Path"") FROM (SELECT ""Path"" FROM ({0}) UNIONALL SELECT ""Path"" FROM ({1}))",LocalDatabase.LIST_FILESETS,LocalDatabase.LIST_FOLDERS_AND_SYMLINKS);

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 13398ms
New: 14160ms

Duplicati/Library/Main/Database/LocalListAffectedDatabase.cs

		" UNION "+
		" UNIONALL"+
		@"SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))"+
		")",

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No difference in performance

Duplicati/Library/Main/Database/LocalListAffectedDatabase.cs

		@" UNIONALL"+
		@"SELECT ""Path"" FROM ""File"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))"+
		@") ORDER BY ""Path"" ",
		string.Join(",",items.Select(x=>"?"))

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 12914ms
New: 6123ms

Duplicati/Library/Main/Database/LocalListAffectedDatabase.cs

		@" UNION "+
		@" UNIONALL"+
		@" SELECT ""Name"" FROM ""Remotevolume"" WHERE ""ID"" IN ( SELECT ""VolumeID"" FROM ""Block"" WHERE ""ID"" IN ( SELECT ""BlockID"" FROM ""BlocksetEntry"" WHERE ""BlocksetID"" IN ( SELECT ""BlocksetID"" FROM ""Metadataset"" WHERE ""ID"" IN ( SELECT ""MetadataID"" FROM ""FileLookup"" WHERE ""ID"" IN ( SELECT ""FileID"" FROM ""FilesetEntry"" WHERE ""FilesetID"" IN ( SELECT ""ID"" FROM ""Fileset"" WHERE ""VolumeID"" IN ( SELECT ""ID"" FROM ""RemoteVolume"" WHERE ""Name"" IN ({0}))))))))"+
		@")",

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 19458ms
New: 13540ms

Duplicati/Library/Main/Database/LocalListChangesDatabase.cs

		}

		cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECTDISTINCT""Path"" FROM ""{2}"") ",tablename,combined,filenamestable),filesetId);
		cmd.ExecuteNonQuery(string.Format(@"INSERT INTO ""{0}"" (""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"") SELECT ""Path"", ""FileHash"", ""MetaHash"", ""Size"", ""Type"" FROM {1} A WHERE ""A"".""FilesetID"" = ? AND ""A"".""Path"" IN (SELECT ""Path"" FROM ""{2}"") ",tablename,combined,filenamestable),filesetId);

Copy link

Contributor

Jojo-1000Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Old: 13285ms
New: 10527ms

Duplicati/Library/Main/Database/LocalListDatabase.cs

		using(varcmd=m_connection.CreateCommand())
		{
		cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECTDISTINCT""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" "+query,m_tablename),args);
		cmd.ExecuteNonQuery(string.Format(@"CREATE TEMPORARY TABLE ""{0}"" AS SELECT ""ID"" AS ""FilesetID"", ""IsFullBackup"" AS ""IsFullBackup"" , ""Timestamp"" AS ""Timestamp"" FROM ""Fileset"" "+query,m_tablename),args);

Copy link

Contributor

Jojo-1000Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Both run quickly in my test, but ID is unique so there should be no difference in results

Copy link

Contributor

Jojo-1000 commentedJun 23, 2023•
edited
Loading

I tried to run each query individually to see how much faster it is, but I gave up in the middle so I don't have times for the others. However, inmy test case that runs all database queries there was no significant performance improvement from this PR.

Labels

None yet

Movatterモバイル変換

Don't distinct where it's not needed.#4924

Are you sure you want to change the base?

Don't distinct where it's not needed.#4924

Uh oh!

Conversation

mr-russ commentedApr 20, 2023

Uh oh!

ts678 commentedJun 22, 2023

Uh oh!

Jojo-1000 commentedJun 22, 2023

Uh oh!

duplicatibot commentedJun 23, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jojo-1000 commentedJun 23, 2023• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Jojo-1000 commentedJun 23, 2023•
edited
Loading