NotificationsYou must be signed in to change notification settings
Fork6
Star31

Commitbbd5d65

committed

Update detail for new todo items.

1 parent7bbe216 commitbbd5d65Copy full SHA for bbd5d65

File tree

1 file changed

+252

-1

lines changed

doc/TODO.detail
- optimizer

1 file changed

+252

-1

lines changed

`‎doc/TODO.detail/optimizer‎`

Lines changed: 252 additions & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -1059,7 +1059,7 @@ From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000`
`1059`	`1059`	`Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])`
`1060`	`1060`	`by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672`
`1061`	`1061`	`for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:45:30 -0500 (EST)`
`1062`		`-Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.15 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)`
	`1062`	`+Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.16 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)`
`1063`	`1063`	`Received: from localhost (majordom@localhost)`
`1064`	`1064`	`by hub.org (8.9.3/8.9.3) with SMTP id TAA00957;`
`1065`	`1065`	`Thu, 20 Jan 2000 19:35:19 -0500 (EST)`
`@@ -1586,3 +1586,254 @@ support a couple gigs of RAM now.`
`1586`	`1586`
`1587`	`1587`	`************`
`1588`	`1588`
	`1589`	`+From pgsql-hackers-owner+M6019@hub.org Mon Aug 21 11:47:56 2000`
	`1590`	`+Received: from hub.org (root@hub.org [216.126.84.1])`
	`1591`	`+by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA07289`
	`1592`	`+for <pgman@candle.pha.pa.us>; Mon, 21 Aug 2000 11:47:55 -0400 (EDT)`
	`1593`	`+Received: from hub.org (majordom@localhost [127.0.0.1])`
	`1594`	`+by hub.org (8.10.1/8.10.1) with SMTP id e7LFlpT03383;`
	`1595`	`+Mon, 21 Aug 2000 11:47:51 -0400 (EDT)`
	`1596`	`+Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1])`
	`1597`	`+by hub.org (8.10.1/8.10.1) with SMTP id e7LFlaT03243`
	`1598`	`+for <pgsql-hackers@postgresql.org>; Mon, 21 Aug 2000 11:47:37 -0400 (EDT)`
	`1599`	`+Received: (qmail 7416 invoked by alias); 21 Aug 2000 15:54:33 -0000`
	`1600`	`+Received: (qmail 7410 invoked from network); 21 Aug 2000 15:54:32 -0000`
	`1601`	`+Received: from eros.si.fct.unl.pt (193.136.120.112)`
	`1602`	`+ by fct1.si.fct.unl.pt with SMTP; 21 Aug 2000 15:54:32 -0000`
	`1603`	`+Date: Mon, 21 Aug 2000 16:48:08 +0100 (WEST)`
	`1604`	`+From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>`
	`1605`	`+X-Sender: tiago@eros.si.fct.unl.pt`
	`1606`	`+To: Tom Lane <tgl@sss.pgh.pa.us>`
	`1607`	`+cc: pgsql-hackers@postgresql.org`
	`1608`	`+Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,`
	`1609`	`+constant-->index scan`
	`1610`	`+In-Reply-To: <1731.966868649@sss.pgh.pa.us>`
	`1611`	`+Message-ID: <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt>`
	`1612`	`+MIME-Version: 1.0`
	`1613`	`+Content-Type: TEXT/PLAIN; charset=US-ASCII`
	`1614`	`+X-Mailing-List: pgsql-hackers@postgresql.org`
	`1615`	`+Precedence: bulk`
	`1616`	`+Sender: pgsql-hackers-owner@hub.org`
	`1617`	`+Status: ORr`
	`1618`	`+`
	`1619`	`+On Mon, 21 Aug 2000, Tom Lane wrote:`
	`1620`	`+`
	`1621`	`+> > One thing it might be interesting (please tell me if you think`
	`1622`	`+> > otherwise) would be to improve pg with better statistical information, by`
	`1623`	`+> > using, for example, histograms.`
	`1624`	`+>`
	`1625`	`+> Yes, that's been on the todo list for a while.`
	`1626`	`+`
	`1627`	`+ If it's ok and nobody is working on that, I'll look on that subject.`
	`1628`	`+ I'll start by looking at the analize portion of vacuum. I'm thinking in`
	`1629`	`+using arrays for the histogram (I've never used the array data type of`
	`1630`	`+postgres).`
	`1631`	`+ Should I use 7.0.2 or the cvs version?`
	`1632`	`+`
	`1633`	`+`
	`1634`	`+> Interesting article. We do most of what she talks about, but we don't`
	`1635`	`+> have anything like the ClusterRatio statistic. We need it --- that was`
	`1636`	`+> just being discussed a few days ago in another thread. Do you have any`
	`1637`	`+> reference on exactly how DB2 defines that stat?`
	`1638`	`+`
	`1639`	`+`
	`1640`	`+ I don't remember seeing that information spefically. From what I've`
	`1641`	`+read I can speculate:`
	`1642`	`+`
	`1643`	`+ 1. They have clusterratios for both indexes and the relation itself.`
	`1644`	`+ 2. They might use an index even if there is no "order by" if the table`
	`1645`	`+has a low clusterratio: just to get the RIDs, then sort the RIDs and`
	`1646`	`+fetch.`
	`1647`	`+ 3. One possible way to calculate this ratio:`
	`1648`	`+ a) for tables`
	`1649`	`+ SeqScan`
	`1650`	`+ if tuple points to a next tuple on the same page then its`
	`1651`	`+"good"`
	`1652`	`+ ratio = # good tuples / # all tuples`
	`1653`	`+ b) for indexes (high speculation ratio here)`
	`1654`	`+ foreach pointed RID in index`
	`1655`	`+ if RID is in same page of next RID in index than mark as`
	`1656`	`+"good"`
	`1657`	`+`
	`1658`	`+ I suspect that if a tuple size is big (relative to page size) than the`
	`1659`	`+cluster ratio is always low.`
	`1660`	`+`
	`1661`	`+ A tuple might also be "good" if it pointed to the next page.`
	`1662`	`+`
	`1663`	`+Tiago`
	`1664`	`+`
	`1665`	`+`
	`1666`	`+From pgsql-hackers-owner+M6152@hub.org Wed Aug 23 13:00:33 2000`
	`1667`	`+Received: from hub.org (root@hub.org [216.126.84.1])`
	`1668`	`+by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA10259`
	`1669`	`+for <pgman@candle.pha.pa.us>; Wed, 23 Aug 2000 13:00:33 -0400 (EDT)`
	`1670`	`+Received: from hub.org (majordom@localhost [127.0.0.1])`
	`1671`	`+by hub.org (8.10.1/8.10.1) with SMTP id e7NGsPN83008;`
	`1672`	`+Wed, 23 Aug 2000 12:54:25 -0400 (EDT)`
	`1673`	`+Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1])`
	`1674`	`+by hub.org (8.10.1/8.10.1) with SMTP id e7NGniN81749`
	`1675`	`+for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 12:49:44 -0400 (EDT)`
	`1676`	`+Received: (qmail 9869 invoked by alias); 23 Aug 2000 15:10:04 -0000`
	`1677`	`+Received: (qmail 9860 invoked from network); 23 Aug 2000 15:10:04 -0000`
	`1678`	`+Received: from eros.si.fct.unl.pt (193.136.120.112)`
	`1679`	`+ by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 15:10:04 -0000`
	`1680`	`+Date: Wed, 23 Aug 2000 16:03:42 +0100 (WEST)`
	`1681`	`+From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>`
	`1682`	`+X-Sender: tiago@eros.si.fct.unl.pt`
	`1683`	`+To: Tom Lane <tgl@sss.pgh.pa.us>`
	`1684`	`+cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org`
	`1685`	`+Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,`
	`1686`	`+constant-->index scan`
	`1687`	`+In-Reply-To: <27971.967041030@sss.pgh.pa.us>`
	`1688`	`+Message-ID: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>`
	`1689`	`+MIME-Version: 1.0`
	`1690`	`+Content-Type: TEXT/PLAIN; charset=US-ASCII`
	`1691`	`+X-Mailing-List: pgsql-hackers@postgresql.org`
	`1692`	`+Precedence: bulk`
	`1693`	`+Sender: pgsql-hackers-owner@hub.org`
	`1694`	`+Status: ORr`
	`1695`	`+`
	`1696`	`+Hi!`
	`1697`	`+`
	`1698`	`+On Wed, 23 Aug 2000, Tom Lane wrote:`
	`1699`	`+`
	`1700`	`+> Yes, we know about that one. We have stats about the most common value`
	`1701`	`+> in a column, but no information about how the less-common values are`
	`1702`	`+> distributed. We definitely need stats about several top values not just`
	`1703`	`+> one, because this phenomenon of a badly skewed distribution is pretty`
	`1704`	`+> common.`
	`1705`	`+`
	`1706`	`+`
	`1707`	`+ An end-biased histogram has stats on top values and also on the least`
	`1708`	`+frequent values. So if a there is a selection on a value that is well`
	`1709`	`+bellow average, the selectivity estimation will be more acurate. On some`
	`1710`	`+research papers I've read, it's refered that this is a better approach`
	`1711`	`+than equi-width histograms (which are said to be the "industry" standard).`
	`1712`	`+`
	`1713`	`+ I not sure whether to use a table or a array attribute on pg_stat for`
	`1714`	`+the histogram, the problem is what could be expected from the size of the`
	`1715`	`+attribute (being a text). I'm very affraid of the cost of going through`
	`1716`	`+several tuples on a table (pg_histogram?) during the optimization phase.`
	`1717`	`+`
	`1718`	`+ One other idea would be to only have better statistics for special`
	`1719`	`+attributes requested by the user... something like "analyze special`
	`1720`	`+table(column)".`
	`1721`	`+`
	`1722`	`+Best Regards,`
	`1723`	`+Tiago`
	`1724`	`+`
	`1725`	`+`
	`1726`	`+`
	`1727`	`+From pgsql-hackers-owner+M6160@hub.org Thu Aug 24 00:21:39 2000`
	`1728`	`+Received: from hub.org (root@hub.org [216.126.84.1])`
	`1729`	`+by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA27662`
	`1730`	`+for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 00:21:38 -0400 (EDT)`
	`1731`	`+Received: from hub.org (majordom@localhost [127.0.0.1])`
	`1732`	`+by hub.org (8.10.1/8.10.1) with SMTP id e7O46w585951;`
	`1733`	`+Thu, 24 Aug 2000 00:06:58 -0400 (EDT)`
	`1734`	`+Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])`
	`1735`	`+by hub.org (8.10.1/8.10.1) with ESMTP id e7O3uv583775`
	`1736`	`+for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 23:56:57 -0400 (EDT)`
	`1737`	`+Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])`
	`1738`	`+by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA20973;`
	`1739`	`+Wed, 23 Aug 2000 23:56:35 -0400 (EDT)`
	`1740`	`+To: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>`
	`1741`	`+cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org`
	`1742`	`+Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan`
	`1743`	`+In-reply-to: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>`
	`1744`	`+References: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>`
	`1745`	`+Comments: In-reply-to =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>`
	`1746`	`+message dated "Wed, 23 Aug 2000 16:03:42 +0100"`
	`1747`	`+Date: Wed, 23 Aug 2000 23:56:35 -0400`
	`1748`	`+Message-ID: <20970.967089395@sss.pgh.pa.us>`
	`1749`	`+From: Tom Lane <tgl@sss.pgh.pa.us>`
	`1750`	`+X-Mailing-List: pgsql-hackers@postgresql.org`
	`1751`	`+Precedence: bulk`
	`1752`	`+Sender: pgsql-hackers-owner@hub.org`
	`1753`	`+Status: OR`
	`1754`	`+`
	`1755`	`+=?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt> writes:`
	`1756`	`+> One other idea would be to only have better statistics for special`
	`1757`	`+> attributes requested by the user... something like "analyze special`
	`1758`	`+> table(column)".`
	`1759`	`+`
	`1760`	`+This might actually fall out "for free" from the cheapest way of`
	`1761`	`+implementing the stats. We've talked before about scanning btree`
	`1762`	`+indexes directly to obtain data values in sorted order, which makes`
	`1763`	`+it very easy to find the most common values. If you do that, you`
	`1764`	`+get good stats for exactly those columns that the user has created`
	`1765`	`+indexes on. A tad indirect but I bet it'd be effective...`
	`1766`	`+`
	`1767`	`+regards, tom lane`
	`1768`	`+`
	`1769`	`+From pgsql-hackers-owner+M6165@hub.org Thu Aug 24 05:33:02 2000`
	`1770`	`+Received: from hub.org (root@hub.org [216.126.84.1])`
	`1771`	`+by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id FAA14309`
	`1772`	`+for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 05:33:01 -0400 (EDT)`
	`1773`	`+Received: from hub.org (majordom@localhost [127.0.0.1])`
	`1774`	`+by hub.org (8.10.1/8.10.1) with SMTP id e7O9X0584670;`
	`1775`	`+Thu, 24 Aug 2000 05:33:00 -0400 (EDT)`
	`1776`	`+Received: from athena.office.vi.net (office-gwb.fulham.vi.net [194.88.77.158])`
	`1777`	`+by hub.org (8.10.1/8.10.1) with ESMTP id e7O9Ix581216`
	`1778`	`+for <pgsql-hackers@postgresql.org>; Thu, 24 Aug 2000 05:19:03 -0400 (EDT)`
	`1779`	`+Received: from grommit.office.vi.net [192.168.1.200] (mail)`
	`1780`	`+by athena.office.vi.net with esmtp (Exim 3.12 #1 (Debian))`
	`1781`	`+id 13Rt2Y-00073I-00; Thu, 24 Aug 2000 10:11:14 +0100`
	`1782`	`+Received: from jules by grommit.office.vi.net with local (Exim 3.12 #1 (Debian))`
	`1783`	`+id 13Rt2Y-0005GV-00; Thu, 24 Aug 2000 10:11:14 +0100`
	`1784`	`+Date: Thu, 24 Aug 2000 10:11:14 +0100`
	`1785`	`+From: Jules Bean <jules@jellybean.co.uk>`
	`1786`	`+To: Tom Lane <tgl@sss.pgh.pa.us>`
	`1787`	`+Cc: Tiago Ant?o <tra@fct.unl.pt>, pgsql-hackers@postgresql.org`
	`1788`	`+Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan`
	`1789`	`+Message-ID: <20000824101113.N17510@grommit.office.vi.net>`
	`1790`	`+References: <1731.966868649@sss.pgh.pa.us> <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt> <20000823133418.F17510@grommit.office.vi.net> <27971.967041030@sss.pgh.pa.us>`
	`1791`	`+Mime-Version: 1.0`
	`1792`	`+Content-Type: text/plain; charset=us-ascii`
	`1793`	`+Content-Disposition: inline`
	`1794`	`+User-Agent: Mutt/1.2i`
	`1795`	`+In-Reply-To: <27971.967041030@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Wed, Aug 23, 2000 at 10:30:30AM -0400`
	`1796`	`+X-Mailing-List: pgsql-hackers@postgresql.org`
	`1797`	`+Precedence: bulk`
	`1798`	`+Sender: pgsql-hackers-owner@hub.org`
	`1799`	`+Status: OR`
	`1800`	`+`
	`1801`	`+On Wed, Aug 23, 2000 at 10:30:30AM -0400, Tom Lane wrote:`
	`1802`	`+> Jules Bean <jules@jellybean.co.uk> writes:`
	`1803`	`+> > I have in a table a 'category' column which takes a small number of`
	`1804`	`+> > (basically fixed) values. Here by 'small', I mean ~1000, while the`
	`1805`	`+> > table itself has ~10 000 000 rows. Some categories have many, many`
	`1806`	`+> > more rows than others. In particular, there's one category which hits`
	`1807`	`+> > over half the rows. Because of this (AIUI) postgresql assumes`
	`1808`	`+> > that the query`
	`1809`	`+> >select ... from thistable where category='something'`
	`1810`	`+> > is best served by a seqscan, even though there is an index on`
	`1811`	`+> > category.`
	`1812`	`+>`
	`1813`	`+> Yes, we know about that one. We have stats about the most common value`
	`1814`	`+> in a column, but no information about how the less-common values are`
	`1815`	`+> distributed. We definitely need stats about several top values not just`
	`1816`	`+> one, because this phenomenon of a badly skewed distribution is pretty`
	`1817`	`+> common.`
	`1818`	`+`
	`1819`	`+ISTM that that might be enough, in fact.`
	`1820`	`+`
	`1821`	`+If you have stats telling you that the most popular value is 'xyz',`
	`1822`	`+and that it constitutes 50% of the rows (i.e. 5 000 000) then you can`
	`1823`	`+conclude that, on average, other entries constitute a mere 5 000`
	`1824`	`+000/999 ~~ 5000 entries, and it would be definitely be enough.`
	`1825`	`+(That's assuming you store the number of distinct values somewhere).`
	`1826`	`+`
	`1827`	`+`
	`1828`	`+> BTW, if your highly-popular value is actually a dummy value ('UNKNOWN'`
	`1829`	`+> or something like that), a fairly effective workaround is to replace the`
	`1830`	`+> dummy entries with NULL. The system does account for NULLs separately`
	`1831`	`+> from real values, so you'd then get stats based on the most common`
	`1832`	`+> non-dummy value.`
	`1833`	`+`
	`1834`	`+I can't really do that. Even if I could, the distribution is very`
	`1835`	`+skewed -- so the next most common makes up a very high proportion of`
	`1836`	`+what's left. I forget the figures exactly.`
	`1837`	`+`
	`1838`	`+Jules`
	`1839`	`+`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commitbbd5d65

File tree

1 file changed

1 file changed

`‎doc/TODO.detail/optimizer‎`

0 commit comments