Saturday, November 9, 2013

10 common stumbling blocks for SQLAlchemy newbies

I've been hanging around on #sqlalchemy for years now, helping other users where I could. People come there for help with many different kinds of problems, but after a while, you start seeing common patterns. Certain difficulties seem to persist amongst SQLAlchemy newbies. This is my attempt at documenting these issues and providing answers and solutions where applicable. These 10 issues are not in any real order of appearance frequency – I just wrote them down in the order in which they popped into my mind.

Updated 2013-09-09: Corrected facts in section 5.

1. Not finding the (right) tutorial

Occasionally, someone comes on the channel asking questions while referring to a tutorial outside of sqlalchemy.org. These tutorials are often out of date or just plain incorrect. Personally, when I look for a tutorial on something, the first place I look is the official website of the project I'm trying to learn. I would say that is common sense.

There's also another group of people who find the official documentation but miss the tutorials, despite them being the first links in both the ORM and Core sections.

While I don't think there's any important information really missing from the SQLAlchemy documentation, the top page could probably be laid out differently to emphasize the suggested starting points for new users.

2. Boilerplate constructors

The declarative metaclass provides a default constructor for model classes, making explicit constructors largely unnecessary. The issue is that many people just don't seem to know about this, and happily add explicit constructors to all their classes. I actually blame the SQLAlchemy documentation for not clearly pointing this out. I've only ever found a single mention of this feature in the documentation. In my opinion, its existence should be prominently advertised in the ORM tutorial.

So instead of creating a constructor that accepts all your fields as arguments, just skip the constructor and instantiate classes like Company(name='Foo', address='blah').
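For example, a model like this needs no constructor at all (a minimal sketch; the Company class and its columns are made up for illustration):

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Company(Base):
    __tablename__ = 'company'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    address = Column(String)
    # no __init__ here -- the declarative base supplies one that
    # accepts all mapped attributes as keyword arguments

company = Company(name='Foo', address='blah')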

3. Believing that defining or modifying the model classes creates or modifies the tables themselves

A surprising number of new users seem to believe that SQLAlchemy is capable of automatically modifying their tables to match their classes. I am not aware of any ORM that does such a thing automatically, least of all reliably.

If you change your declared classes, you need to explicitly alter the schema in the actual database too, by either:
  1. using the tool of your choice (usually psql / mysql / other command line tool or a graphical tool like phpMyAdmin etc.) to manually create/alter/drop the tables
  2. using a migration library like Alembic
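With Alembic, for instance, the upgrade step of a migration script could look something like this (a sketch; the table and column names are hypothetical):

from alembic import op
import sqlalchemy as sa

def upgrade():
    # bring the table in line with the updated model class
    op.add_column('company', sa.Column('address', sa.String()))

def downgrade():
    op.drop_column('company', 'address')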

4. Forgetting to import model modules before attempting to call metadata.create_all()

I must admit that I was bitten by this one when I was starting with SQLAlchemy. You already know that metadata.create_all() creates all your tables, right? Sure, but before that can happen, all the modules that declare tables must have been imported. I mean, how else would SQLAlchemy know what the tables are?

Personally, I hadn't even thought about how that works, but some people seem to expect SQLAlchemy to scan the project structure for model classes! Admittedly, the metaclass machinery that registers declarative classes with the metadata may at first seem so magical that you expect everything to "just work" by itself.

So remember kids, you need to import all your model modules so that the metadata can be filled in. Only then will the call to metadata.create_all() do anything useful.
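In practice that means something like this (a sketch; the package and module names are placeholders for your own):

from sqlalchemy import create_engine

from your.package import Base
import your.package.models  # imported only for the side effect of registering the model classes with Base.metadata

engine = create_engine('postgresql:///yourdb')
Base.metadata.create_all(engine)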

5. Confusion between .one(), .first() and .scalar()

Every week I see someone pasting code that uses Query.one() and catches the NoResultFound exception. Then others complain that Query.first() should return a scalar value instead of a KeyedTuple when a single column or expression has been selected. Query.scalar() is virtually unheard of.

So let me clear up the confusion once and for all:
  • Query.one() executes the query and raises MultipleResultsFound if more than one row was found, and NoResultFound if no rows were found. Otherwise it returns the sole result row (a mapped instance, or a KeyedTuple if several columns or entities were selected)
  • Query.first() executes the query with LIMIT 1 and returns the first result row in the same form, or None if no rows were found
  • Query.scalar() executes the query and raises MultipleResultsFound if more than one row was found. If no rows were found, it returns None. Otherwise it returns the first column of the sole result row.
Query.one() should be used if you want to select a single row and you assume there is exactly one row to be found. Query.first() should be used when you want just one row but you're not sure if one will be found. Query.scalar() should be used when you want, for example, just the maximum of a single column.
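To illustrate all three (a sketch; the User model, its columns and the session are assumed):

from sqlalchemy import func

# exactly one matching row expected -- raises NoResultFound or MultipleResultsFound otherwise
user = session.query(User).filter_by(username='alice').one()

# zero or one row wanted -- returns None when nothing matches
maybe_user = session.query(User).filter_by(username='bob').first()

# a single value, such as an aggregate
newest_id = session.query(func.max(User.id)).scalar()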

6. Not understanding scoped_session and sessionmaker

This is quite possibly the greatest source of confusion among SQLAlchemy users. Especially so for developers who have very little experience in threaded programming. First, a quick recap:
  • Session: a container of instances of mapped classes; among other things lets you construct queries dynamically and add/delete/update entities
  • sessionmaker: a session factory generator (in other words, a function that returns a function that returns a new Session for each call)
  • scoped_session: a threadlocal manager and proxy of sessions
If you're still confused, you're not alone! So let me try and clear this up a bit. The sqlalchemy.orm.session.Session class is where the real meat is. The other two are merely helpers. In trivial scripts, this one is all you need.
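A trivial script could thus use Session directly, something like this (a sketch; the database URL and the Company model are placeholders):

from sqlalchemy import create_engine
from sqlalchemy.orm.session import Session

engine = create_engine('postgresql:///yourdb')
session = Session(bind=engine)
companies = session.query(Company).all()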

The sessionmaker function is a convenience for creating Session factories preconfigured with your chosen parameters. Most notably, a session maker can't be used as a proxy – you have to call it to get a Session first.

The scoped_session function creates a thread local proxy/manager of Session instances. What this means is that you can access the scoped session object as if it was like a Session instance. The scoped session "knows" (by virtue of threadlocal objects) which thread is asking for a session and so it always returns the same session for the same thread. This is very convenient for multithreaded applications because you don't have to worry about sessions accidentally crossing threads.

Some want to call the scoped session to get a Session instance. While that certainly works, it's almost always unnecessary. Instead:

from sqlalchemy.orm import scoped_session, sessionmaker

DBSession = scoped_session(sessionmaker(bind=engine))
companies = DBSession.query(Company).all()

7. Query properties

Certain web frameworks (I'm looking at you, Flask-SQLAlchemy!) promote the use of query properties that supposedly make querying easier. Trouble is, it is apparently not obvious from the example code how one would query for specific columns, aggregates or other expressions using said query property.

The solution is to use the session directly:

db.session.query(Company.address)
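The same goes for aggregates and other expressions (a sketch; the id column on Company is an assumption):

from sqlalchemy import func

company_count = db.session.query(func.count(Company.id)).scalar()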

8. Attempting to access related classes through relationships in queries

Someone asks something like this every week on #sqlalchemy:
<anon> anyone see anything wrong with this query: c = Category.query.join(Category.parent).filter(Category.name=='Social Issues', Category.parent.name=='Teens').first()
What is wrong with this query is of course that Category.parent is a relationship and thus it doesn't have a "name" attribute. I don't blame anyone who falls for this though. It would make sense, from the syntactic perspective, for this to work. It might automatically add a join to Category.parent in the query. The reason why this can't be done is beyond my knowledge of SQLAlchemy. Anyway, to fix the query, you add an alias and join that:

from sqlalchemy.orm import aliased

parent = aliased(Category)
c = Category.query.join(parent, Category.parent).\
    filter(Category.name == 'Social Issues',
           parent.name == 'Teens').first()

9. Wanting to commit the session to get the ID of an object so it can be used to insert related objects

Most nontrivial SQLAlchemy applications need to work with relationships. Imagine a situation like this: You need to create a Company and add an Employee to it. You already have a relationship called employees in the Company class. So how would you accomplish this task? The typical answer goes along the lines of:
  1. Create the Company instance
  2. Add the Company instance to the session
  3. Flush the session
  4. Create the Employee instance with company_id from the previously flushed Company instance
  5. Add the Employee instance to the session
  6. Commit the session
This is just about how you'd do it without SQLAlchemy. But with SQLAlchemy it can be done much more easily:
  1. Create the Company instance
  2. Create the Employee instance and append it to the Company's employees collection
  3. Add the Company instance to the session
  4. Commit the session
When the session is flushed, the Company row is first inserted into the database. Then, by virtue of the relationship cascade, the Employee instance is discovered and also inserted. The RDBMS can't tell the difference, but for the developer, the latter approach is much nicer.
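In code, the latter approach might look like this (a sketch; the attribute names are made up):

company = Company(name='ACME')
company.employees.append(Employee(name='John Doe'))
session.add(company)
# the flush inserts the Company first, then the cascaded Employee,
# filling in the employee's company_id automatically
session.commit()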

10. Running tests using SQLite instead of the production RDBMS

I've written a separate blog post about this, but tl;dr: always test against the same RDBMS that you deploy with.

Sunday, August 25, 2013

Unit testing SQLAlchemy apps

I help out a lot on the #sqlalchemy channel on Freenode IRC. When people ask questions there, one of the issues that comes up often is unit testing applications that use SQLAlchemy. Almost all developers instinctively use SQLite for testing due to its simplicity and the possibility of using an in-memory database that leaves no garbage behind to clean up. It seems like a clean and easy solution, yes? Well, my answer is that if you don't plan on deploying with SQLite support, don't test on SQLite! Always use the production RDBMS for testing. Going to deploy with PostgreSQL? Then test with PostgreSQL! Why?

There are at least two good reasons to test against the same RDBMS you deploy with. The first and foremost is that SQLite is vastly different from other RDBMS's. For one, it does not really enforce column types, so code that erroneously inputs data of the wrong type won't cause errors when it should. There are also many semantic differences between embedded and typical client-server RDBMS's, so you may run into bugs that only occur in production while all the tests pass just fine. The second reason is that SQLite's rather modest design, which lets it fit into small memory spaces, is also a big hindrance: it can't support more advanced database features like window functions or recursive queries. This shortcoming prevents you from taking full advantage of your chosen RDBMS.
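You can see the lax typing for yourself (a sketch; the table is made up):

from sqlalchemy import create_engine, text

engine = create_engine('sqlite://')
conn = engine.connect()
conn.execute(text('CREATE TABLE t (n INTEGER)'))
conn.execute(text("INSERT INTO t VALUES ('not a number')"))  # SQLite accepts this silently
# the same INSERT on PostgreSQL fails with a data type error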

If I managed to convince you, then you'll probably be asking how testing should be done on RDBMS's other than SQLite. The answer boils down to whether your RDBMS supports two crucial features: nested transactions and transactional DDL. Nested transactions are savepoints within a transaction, to which you can roll back without losing any changes done before the savepoint. Transactional DDL means that in addition to normal data modification (INSERT, UPDATE, DELETE), schema changes are also transactional. That means they can be rolled back, which is a very nice thing to have when unit testing. According to the article linked to above, the following RDBMS's support transactional DDL: PostgreSQL, Sybase, Microsoft SQL Server, DB2, Informix and Firebird. Most notably, MySQL and Oracle do not support it.
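In SQLAlchemy terms, a nested transaction is started with Connection.begin_nested(), which issues a SAVEPOINT (a sketch; the database URL is a placeholder):

from sqlalchemy import create_engine

engine = create_engine('postgresql:///yourdb')
conn = engine.connect()
outer = conn.begin()             # BEGIN
savepoint = conn.begin_nested()  # SAVEPOINT
# ... run some statements on conn ...
savepoint.rollback()             # ROLLBACK TO SAVEPOINT: undoes work done after the savepoint
outer.rollback()                 # rolls back the whole transaction, schema changes included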

If your RDBMS does support the aforementioned two features, then you can conveniently test your SQLAlchemy apps in the following manner:
  1. Make sure you have an empty database for unit testing
  2. Create the engine, create one connection and start a transaction in it
  3. Create the tables
  4. Optionally, insert test fixtures
  5. For every test, repeat:
    1. Create a savepoint
    2. Run the test
    3. Roll back to the savepoint
  6. Roll back the transaction
This way all the tests are isolated from each other. Every test gets the same "blank slate" as far as the database state is concerned. After the tests are done, the database will look just as empty as it did before the test suite was run, even if the tests were interrupted.

So how to actually accomplish this in practice? I'll give you an example using PostgreSQL and nose. This example should be adaptable for other equally capable test runners. The following code should be placed in __init__.py of the root package of your tests.

from sqlalchemy.engine import create_engine
from sqlalchemy.orm.session import Session

from your.package import Base  # This is your declarative base class


def setup_module():
    global transaction, connection, engine

    # Connect to the database and create the schema within a transaction
    engine = create_engine('postgresql:///yourdb')
    connection = engine.connect()
    transaction = connection.begin()
    Base.metadata.create_all(connection)

    # If you want to insert fixtures to the DB, do it here


def teardown_module():
    # Roll back the top level transaction and disconnect from the database
    transaction.rollback()
    connection.close()
    engine.dispose()
 

class DatabaseTest(object):
    def setup(self):
        # Wrap each test in a savepoint and give it its own session
        self.__transaction = connection.begin_nested()
        self.session = Session(connection)

    def teardown(self):
        # Discard the session and undo everything the test did
        self.session.close()
        self.__transaction.rollback()

To take advantage of this setup, your test class should inherit from DatabaseTest. If you override either setup() or teardown(), make sure you remember to call the superclass implementations too.
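For example, a test case built on DatabaseTest could look like this (a sketch; the Company model is hypothetical):

class CompanyTest(DatabaseTest):
    def test_add_company(self):
        self.session.add(Company(name='ACME'))
        self.session.flush()
        assert self.session.query(Company).count() == 1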

If you want to use this testing scheme with your web framework or other application framework, you should be aware of the requirement that the framework's SQLAlchemy extension/middleware/whatever must have a way to receive the connectable (Engine or Connection) as a Python object and not just as a connection URL.