ResultSet aggregates do not respect distinct option

Bug #217644 reported by James Henstridge
20
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Low
Michael Nelson
Storm
Fix Released
High
James Henstridge

Bug Description

The ResultSet.count() method is described as "Get the number of objects represented by this ResultSet.". However, it ignores the ResultSet._distinct flag.

Using the data from the tests/store/base.py tests, consider the following query:

    result = store.find(Foo, Foo.id == Link.foo_id)

This result has 6 rows (3 with foo_id=10, 2 with foo_id=20 and 1 with foo_id=30). Issuing result.count() generates a query like:

    SELECT COUNT(*) FROM foo WHERE foo.id = link.foo_id

We then configure it to be distinct:

    result.config(distinct=True)

The result set now has 3 items, but result.count() still returns 6 because it is ignoring the setting and using the same query. Some possible correct queries are:

    SELECT COUNT(DISTINCT foo.id) FROM foo WHERE foo.id = link.foo_id;
    SELECT COUNT(*) FROM (SELECT DISTINCT foo.id FROM foo WHERE foo.id = link.foo_id);
    SELECT COUNT(*) FROM foo WHERE foo.id IN (SELECT DISTINCT foo.id FROM foo WHERE foo.id = link.foo_id);

The first works great for a standard count of a table with a single primary key column. The second should work for multi-column primary keys. The third would probably be needed for counts other than "COUNT(*)".

This problem looks like it would also affect the avg() and sum() aggregates for distinct result sets, hence the title.

Related branches

Revision history for this message
Celso Providelo (cprov) wrote :

The same problem happens with 'limit' value, it's ignored for all aggregated functions.

{{{
    result = self.Person.select(limit=1)
    # EXECUTE: 'SELECT person.age, person.id, person.name, person.ts FROM person ORDER BY Person.name DESC LIMIT 1', ()
    self.assertEquals(len(list(result)), 1)
    # EXECUTE: 'SELECT COUNT(*) FROM person', ()
    self.assertEquals(result.count(), 2)
}}}

Changed in storm:
importance: Undecided → High
Revision history for this message
Celso Providelo (cprov) wrote :

It's worth noticing that this problem also afects sliced ResultSets, as in:

{{{
    result = self.Person.select()
   self.assertEquals(result.count(), 2)
   sliced_result = result[1:]
   self.assertEquals(len(list(sliced_result)), 1)
   self.assertEquals(sliced_result.count(), 2)
}}}

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Is there a way to work around this currently? I'm trying to do a distinct count, but can't find a solution that always works:

Try 1:
result.count(distinct=True) - doesn't work because an expr is needed when using the distinct=True option

Try 2:
columns, tables = result._find_spec.get_columns_and_tables()
result.count(expr=tables[0], distinct=True) - i thought this was it, it works, but only as long as your FROM clause contains only one table, otherwise it barfs with:
psycopg2.ProgrammingError: could not identify an equality operator for type [name of table[0]]

Try 3:
from storm.expr import SQLToken
id = SQLToken(tables[0] + '.id')
result.count(expr=id, distinct=True)
Almost works. The query generated would work perfectly except that it's put quotes around the table.col name:
psycopg2.ProgrammingError: column "DistroSeriesPackageCache.id" does not exist
LINE 1: SELECT COUNT(DISTINCT "DistroSeriesPackageCache.id") FROM Di...

Remove the quotes and the query works perfectly...

Try 4:
from storm.expr import Column
id = Column('id', tables[0])
result.count(expr=id, distinct=True)
This time it doesn't include the quotes, *great*, but it also doesn't include all the required tables for the join in the FROM clause (even though they are in the find_spec of the query)... so again, the SQL fails..

Any help appreciated! Thanks for reading :)

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Just answering my own question, I just found that fourth option above will work if the underlying query uses explicit joins (store.using...) rather than implicit joins.

So, at least that lets me continue :), hope it's useful info.

Thomas Herve (therve)
Changed in storm:
assignee: nobody → therve
status: New → In Progress
Revision history for this message
Sidnei da Silva (sidnei) wrote :

The related branch looks great, +1 for merging it. This bug should be closed after merging and a new one should be opened, even if it's just for documentation, regarding SUM and AVG. Maybe Celso or someone can provide a sample test for SUM and AVG?

Revision history for this message
Celso Providelo (cprov) wrote :

Sidnei,

Check the diff attached exposing the issues I've mentioned in the sqlobject test suite. I hope it helps.

Revision history for this message
Sidnei da Silva (sidnei) wrote :

Hi Celso,

Does that still apply after Thomas latest fixes (http://bazaar.launchpad.net/~therve/storm/217644-count-distinct/revision/285)?

I believe they cover your issue. Let me know if that's not the case.

Revision history for this message
Thomas Herve (therve) wrote :

I applied the test patch to my branch, it the tests run correctly. Thanks!

Revision history for this message
Celso Providelo (cprov) wrote :

Super thanks, guys. The issue is fixed.

Revision history for this message
Thomas Herve (therve) wrote :

The branch was merged, but I leave the issue opened if someone has a solution for avg and sum.

Changed in storm:
assignee: therve → nobody
importance: High → Low
Revision history for this message
Thomas Herve (therve) wrote :

and... the branch was reverted because I discovered some problems with it.

Revision history for this message
Brad Crittenden (bac) wrote :

Since the proposed solution for count() was reverted I ask that you consider raising the importance of this bug back to High and provide an estimate for when it might be fixed.

Revision history for this message
Thomas Herve (therve) wrote :

I set it back to High, but I can't provide you an estimate because I have no idea how to fix it. The previous solution breaks on many non-trivial queries, and I don't have anything better for now.

Changed in storm:
importance: Low → High
Revision history for this message
Brad Crittenden (bac) wrote :

I attempted to use the solution Michael Nelson suggested:

from storm.expr import Column
id = Column('id', tables[0])
result.count(expr=id, distinct=True)

This worked for many cases but failed when my result was a ResultSet created by the union of two intermediate ResultSets. It failed with:

ProgrammingError: missing FROM-clause entry for table "person"
   LINE 1: SELECT COUNT(DISTINCT Person.id) FROM ((SELECT Person.accoun...

This error was raised despite the fact the two intermediate ResultSets had Person in the FROM-clause.

After re-arranging my code to only issue a single query and avoid the union the above work-around worked.

Revision history for this message
James Henstridge (jamesh) wrote :

I think this could be implemented similar to what is done in ResultSet.__contains__():

If we don't have a simple query, create a new query with replace_columns(). The replaced columns should be:
 1. primary keys for any tables in the FindSpec.
 2. Any additional expressions in the FindSpec.
 3. An Alias() of the expression passed to the aggregate.

We then build a new query with:
    columns = []
    for is_expr, info in self._find_spec._cls_spec_info:
        if is_expr:
            columns.append(Alias(info, "_key%d" % len(columns)))
        else:
            for col in info.primary_key):
                columns.append(Alias(info, "_key%d" % len(columns)))
    alias_expr = Alias(expr, "_expr")
    columns.append(alias_expr)
    subquery = replace_columns(self._get_select(), columns)
    select = Select(Aggregate(alias_expr), tables=Alias(subquery, "_tmp"))
    result = self._store._connection.execute(select)

By including the primary keys and find expressions, we should ensure that DISTINCT and the various set operations behave correctly. It should also handle things like limit/offset correctly and maybe even groub by/having.

The alias_expr alias gives us a way to reference the value from the subquery (and any sub-subqueries). The code would need to handle the case of aggregates that don't take an expression (like COUNT(*) queries), but that is pretty simple.

Revision history for this message
James Henstridge (jamesh) wrote :

Attached is a branch that takes the above approach. The changes are a bit of a mess right now: the interface between ResultSet._aggregate and count, max, min, avg and sum needs to change a bit.

A bit of testing would be appreciated.

Changed in storm:
assignee: nobody → James Henstridge (jamesh)
Changed in storm:
milestone: none → 0.15
Revision history for this message
Stuart Bishop (stub) wrote :

Launchpad is currently relying on the 0.14 behavior. Current trunk does not pass tests, nor does it using lp:~jamesh/storm/bug-217644. I don't know if this is because Launchpad is relying on broken behavior or if jamesh's branch does not fully address the problem.

Changed in launchpad:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Michael Nelson (michael.nelson) wrote :

Hi Stuart,

I added a work-around for this bug in the DecoratedResultSet class. It may be that we can simply remove that work-around (so that DecoratedResultSet.count() will just be passed through the the underlying ResultSet.count()).

I didn't intend for the DecoratedResultSet to be used solely for this work-around, but I think it has been used in a few places in LP for just this reason.

affects: launchpad → launchpad-foundations
Revision history for this message
James Henstridge (jamesh) wrote :

Fix merged to trunk as r320.

Stuart managed to run the entire Launchpad test suite with this version of the fix, so I don't think it'll get reverted.

Changed in storm:
status: In Progress → Fix Committed
Revision history for this message
Stuart Bishop (stub) wrote :

Launchpad issue is now covered by Bug #405772

Changed in launchpad-foundations:
status: Triaged → Invalid
Revision history for this message
Brad Crittenden (bac) wrote :

We have a lot of XXX work-arounds for this bug in our code. A clean-up branch to remove them should be done before this bug is closed against Launchpad.

Jamu Kakar (jkakar)
Changed in storm:
status: Fix Committed → Fix Released
Revision history for this message
Stuart Bishop (stub) wrote : Re: [Bug 217644] Re: ResultSet aggregates do not respect distinct option

On Fri, Jul 31, 2009 at 8:09 PM, Brad Crittenden<email address hidden> wrote:
> We have a lot of XXX work-arounds for this bug in our code.  A clean-up
> branch to remove them should be done before this bug is closed against
> Launchpad.

I think most of this was removed when I updated Launchpad to work with Storm 0.14.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

Curtis Hovey (sinzui)
tags: added: tech-debt
Revision history for this message
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
assignee: nobody → Michael Nelson (michael.nelson)
milestone: none → 10.09
tags: added: qa-needstesting
Changed in launchpad-foundations:
status: Invalid → Fix Committed
Graham Binns (gmb)
tags: added: qa-ok
removed: qa-needstesting
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.