Overview of the Google App Engine Persistence framework
- Sun 24 Aug, 2008 at 08:38
- 2 comments
- Databases
Google's cloud computing environment, Google App Engine, currently provides a low-cost (free), highly scalable runtime environment (Python) for web applications. Along with App Engine-specific APIs and support for Google's general service APIs, you get access to a distributed, scalable persistence engine, Datastore. A quick review of App Engine Datastore confirms the impression that Google has again relied on the "we build it better" framework approach. Following neither a relational model nor a web service model, Datastore started with goals of high availability, scalability and performance. Because of quota limits on the current free accounts, it's hard to validate their success here; however, we can get a better understanding of the architecture of Datastore by looking at Google's internal "Bigtable" implementation, which is the basis for Datastore.
Bigtable was designed by Google to provide reliable persistence that scales to petabytes over thousands of machines. Bigtable is a column storage system (orthogonal to a typical row storage RDBMS; similar to, but faster than, indexing each column in a row-based database). Built in typical "do it ourselves" Google fashion, Bigtable uses internal Google software such as the GFS filesystem, a custom scheduling system, SSTable persistence, and the Chubby distributed lock service. One goal for the data model behind Bigtable was to be less limiting than a hashmap, but still simple enough to manage on flat files, so they came up with a sorted multidimensional map indexed by row key, column key and timestamp. This data model was not intended to provide a general column storage solution, but to fit their big requirements (like storing timestamped versions of web page content).
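To make the "sorted multidimensional map" idea concrete, here is a minimal in-memory sketch. It is a toy illustration only, not Bigtable's implementation: the class name TinyTable, its methods, and the sample row and column names are all invented for this example.

```python
import bisect


class TinyTable:
    """Toy in-memory map keyed by (row, column, timestamp). Hypothetical names."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._cells = {}  # same key -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        bound = float("-inf") if timestamp is None else -timestamp
        i = bisect.bisect_left(self._keys, (row, column, bound))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan(self, row_prefix):
        """Yield (row, column, timestamp, value) for matching rows, in key order."""
        i = bisect.bisect_left(self._keys, (row_prefix,))
        while i < len(self._keys) and self._keys[i][0].startswith(row_prefix):
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1


table = TinyTable()
table.put("com.example/index", "contents:html", 1, "first crawl")
table.put("com.example/index", "contents:html", 3, "third crawl")
latest = table.get("com.example/index", "contents:html")
as_of_2 = table.get("com.example/index", "contents:html", timestamp=2)
```

Because the keys are kept sorted, a lookup of the latest version and a scan over neighbouring rows are both simple walks over adjacent entries, which is the property the flat-file storage relies on.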
Row keys are lexicographically partitioned into "tablets", the unit of load balancing. Clients can take advantage of tablet partitioning to achieve some measure of data locality. Columns are grouped into "families" of the same or similar type. A column family must be created before data is inserted (a schema-like constraint); however, the actual columns themselves are dynamic and unbounded. The timestamp index for each row/column cell exposes a time dimension. Configuration dictates whether this cell "versioning" is created automatically, and how many versions deep it goes.
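A rough sketch of how range partitioning by row key gives clients data locality. The split points and function names below are hypothetical; the reversed-hostname trick is a common way to make pages from one site sort next to each other.

```python
import bisect

# Hypothetical split points: tablet i holds the row keys in
# [SPLIT_KEYS[i-1], SPLIT_KEYS[i]); the last tablet is unbounded above.
SPLIT_KEYS = ["g", "n", "t"]


def tablet_for(row_key, split_keys=SPLIT_KEYS):
    """Return the index of the tablet whose key range contains row_key."""
    return bisect.bisect_right(split_keys, row_key)


def reversed_host_key(url_host, path):
    """Build a row key with reversed domain parts so one site's pages sort together."""
    return ".".join(reversed(url_host.split("."))) + path


maps_key = reversed_host_key("maps.google.com", "/index.html")
mail_key = reversed_host_key("mail.google.com", "/inbox")
same_tablet = tablet_for(maps_key) == tablet_for(mail_key)
```

Since both keys start with "com.google.", they land in the same key range and therefore, in this toy setup, in the same tablet.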
Clients use an API, not SQL, to access data. Typical constraints, including regular expressions, can be used for atomic row operations, but no user transaction mechanism is supported. Batch operations and a custom data transformation language manage more complex units of work, including full support for MapReduce jobs (as a source or target).
Now back to Datastore:
In the Datastore framework, persisted objects are called Entities and the attributes on those entities are called Properties. What's interesting is that properties come in two flavors, fixed and dynamic, depending on which base class the entity comes from. A persistent model built from Python classes which inherit from the Model class can contain "fixed" attributes, declared as instances of Property subclasses. Entities created from the Expando class can have both fixed and dynamic properties; a dynamic property looks like a map entry, and is created when the application sets a value. Two instances of the same kind (entity class) can have different types for the same dynamic property, but a query against that dynamic property will only return values of the type used in the query constraint. Instances without this property (like a null value) are not returned.
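The type-sensitive matching on dynamic properties can be illustrated with plain dictionaries. This is only a simulation of the behaviour described above, not the Datastore API; filter_equal and the sample data are made up for the example.

```python
def filter_equal(entities, prop, value):
    """Return entities whose `prop` is present, has the same type as `value`, and equals it."""
    return [
        entity for entity in entities
        if prop in entity
        and type(entity[prop]) is type(value)
        and entity[prop] == value
    ]


# Four entities of the same kind; "rank" is a dynamic property.
players = [
    {"name": "a", "rank": 1},    # integer rank
    {"name": "b", "rank": "1"},  # string rank
    {"name": "c"},               # no rank at all
    {"name": "d", "rank": 1.0},  # float rank
]

int_matches = [p["name"] for p in filter_equal(players, "rank", 1)]
str_matches = [p["name"] for p in filter_equal(players, "rank", "1")]
```

Only the entity whose rank is the integer 1 matches the integer filter; the string and float variants, and the entity without the property, are left out.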
The Property classes cover the fixed set of data types available to Datastore entities and extend it with some of the GData types (interesting examples include BlobProperty, ListProperty, IMProperty, LinkProperty and GeoPtProperty). StringProperty values (short strings, 500 bytes or less) are indexed and can be used in queries, while TextProperty values (longer strings) are not indexed and cannot be used in queries. ReferenceProperty uses the key of the referred entity as the property value. Using this property is like using the instance: if the entity is not in memory, it is automatically loaded from the datastore (de-referenced). The referenced object can be deleted, so the referring object should test whether it still exists. All referenced entities automatically get a "back reference" property, by default named "modelname_set", representing a query for the referrer entities. This provides an implicit (crude but simple) parent-child relationship (more on these later). De-referencing, back-referencing, type checking and so on are only available with "static" reference properties. Property names starting with "_" are transient.
An example of an entity and its properties, from the Datastore docs:
    from google.appengine.api import users
    from google.appengine.ext import db

    class Pet(db.Model):
        name = db.StringProperty(required=True)
        type = db.StringProperty(required=True, choices=set(["cat", "dog", "bird"]))
        birthdate = db.DateProperty()
        weight_in_pounds = db.IntegerProperty()
        spayed_or_neutered = db.BooleanProperty()
        owner = db.UserProperty(required=True)

    pet = Pet(name="Fluffy",
              type="cat",
              owner=users.get_current_user())
    pet.weight_in_pounds = 24

    # Text properties can store large values.
    # (obj is assumed to be an instance of a model that declares a
    # db.TextProperty named text; it is not defined in this snippet.)
    obj.text = db.Text(open("a_tale_of_two_cities.txt").read(), "utf-8")
One-to-many relationships use the ReferenceProperty like a foreign key on the many side: class Parent has a one-to-many relationship to class Child when Child declares an attribute of type ReferenceProperty whose value is set to a Parent instance. The ReferenceProperty constructor accepts a collection_name string, which becomes the name of the collection property as seen from the Parent class.
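    class Parent(db.Model):
        name = db.StringProperty(required=True)

    class Child(db.Model):
        # Named parent_ref rather than parent: db.Model reserves the name
        # "parent" for its own parent() method.
        parent_ref = db.ReferenceProperty(Parent, collection_name='kids')
        name = db.StringProperty(required=True)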
The parent's implicit collection property is used to retrieve the children; the name of this property was defined by the child's ReferenceProperty constructor (in the example above, 'kids'):
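    angelina = Parent(name="Angelina")  # name is a required property
    angelina.put()                      # the entity needs a key before the back-reference can be queried
    her_kids = angelina.kids            # an attribute holding a query over Child entities, not a method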
Many-to-many relationships can be built with the traditional link table approach: a class containing a ReferenceProperty attribute for each side of the relationship. This is a bit expensive to traverse, requiring multiple queries for each access. Another "manual" approach is to use a list of keys, usually owned by the larger side of the relationship, with keys referring to members on the smaller side (this keeps the list smaller). The drawback of this method is the manual lookup of the related values from the given keys.
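The key-list approach can be sketched with a dictionary standing in for the datastore. Everything here (STORE, the key strings, groups_of, members_of) is hypothetical and only mimics the shape of the lookups; in a real application the forward direction would be a batch get by key and the reverse direction a query on the list property.

```python
# Stand-in for the datastore: key string -> stored entity (plain dicts here).
STORE = {
    "group:chess": {"name": "Chess club"},
    "group:hiking": {"name": "Hiking club"},
    "person:ann": {"name": "Ann", "group_keys": ["group:chess", "group:hiking"]},
    "person:bob": {"name": "Bob", "group_keys": ["group:hiking"]},
}


def groups_of(person_key):
    """Follow the key list owned by a person to the referenced group entities."""
    keys = STORE[person_key]["group_keys"]
    # Skip keys whose target has been deleted; the list is not cleaned up for us.
    return [STORE[key] for key in keys if key in STORE]


def members_of(group_key):
    """Reverse direction: names of entities whose key list contains the group."""
    return sorted(
        entity["name"]
        for entity in STORE.values()
        if group_key in entity.get("group_keys", [])
    )


ann_groups = [group["name"] for group in groups_of("person:ann")]
hikers = members_of("group:hiking")
```

The manual part is visible in groups_of: nothing dereferences the keys automatically, and a dangling key has to be handled by the caller.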
Entities are saved by calling the inherited Model.put() method or the module-level db.put(entity) function. The create-or-update decision can be ignored: if the instance already exists, put is an update. Query results (a list) can be modified and the entire list saved with a single db.put() call. put() returns the key of the entity for later use. Because it is common to pass this reference around, keys can be converted to strings for external use, then converted back to Key objects for dereferencing. Entities are removed via the instance delete() method or db.delete(); the framework defines no further delete semantics.
You can retrieve entities through framework-created objects which implement either the Query interface or the GqlQuery interface. For example, the Model class all() method returns an object which implements the Query interface. The main difference between the two is the approach for defining constraints: with the Query interface, the developer defines filters (constraints) and the sort order of results programmatically, while the GqlQuery interface accepts a SQL-like string. Once a query instance has been obtained, the actual execution against the datastore is lazy: it happens when the caller uses an iterator or a method like fetch(). The query can be re-executed, and parameters can be bound again. The current free App Engine accounts set a limit of 1000 result rows for any query, a significant framework constraint which must be factored into model and batching designs.
For each query used within an application there is a manually created or automatically generated "index" (a logical table containing the keys of the results in a defined order). Basic query indexes are auto-generated, while queries which define descending or multiple sort orders, or multiple filter types, require manual configuration in the index.yaml file. As with indexes on row-based databases, the cost of entity creates and updates grows with the number of indexes against them. You can lock down the set of indexes by turning off auto-generation; executing a query not covered by an index will then throw an exception. There are some significant restrictions on the query mechanism, one being a limit of one property for inequality constraints (an index can only represent one inequality and still keep the matching rows adjacent). Inequality queries also force a sort order on the property used in the inequality, and this must be the first sort order.
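The single-inequality restriction is easier to see with a toy index. In this sketch (hypothetical names and data, not the Datastore implementation) an index is a sorted list of (property value, entity key) pairs, and an inequality filter is answered by reading one contiguous slice of it.

```python
import bisect


def build_index(entities, prop):
    """Return a sorted list of (property value, entity key) pairs."""
    return sorted((entity[prop], key) for key, entity in entities.items())


def range_scan(index, low, high):
    """Return keys with low <= value < high by reading one contiguous slice."""
    start = bisect.bisect_left(index, (low,))
    end = bisect.bisect_left(index, (high,))
    return [key for _, key in index[start:end]]


pets = {
    "fluffy": {"weight": 24},
    "tweety": {"weight": 8},
    "rex": {"weight": 40},
    "tom": {"weight": 15},
}
weight_index = build_index(pets, "weight")
mid_weight = range_scan(weight_index, 10, 30)
```

The result comes back ordered by weight because that is the order of the index rows, which matches the forced sort order described above. A second inequality on a different property would select rows scattered through this list rather than one adjacent run.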
Datastore supports both programmatically explicit and implicit transactions. Explicit transactions are available through the db.run_in_transaction() call, which takes a unit-of-work function and its arguments as parameters. Implicit transactions occur with updates to a composite of entities called an Entity Group. Entity Groups allow the Datastore framework to update multiple entities in the same transaction, in the same area of the distributed environment (datastore node). The terminology used to describe this entity composite: entities can have a parent and can be a parent to another entity. Entities related this way are ancestors within a path, starting with a root entity (an entity without a parent). The path is used to qualify ancestor entities; entities can be created and accessed via this path even if parent entities have been removed or have never been created (you only need the path "key" of what would be the parent). Something to note: one-to-many relationships do not need Entity Groups (and their implicit transactions); they can be modeled with ReferenceProperty attributes.
Transaction isolation is similar to "Read Committed", with one significant drawback: queries off indexes (basically any constraint other than the primary key) behave like "Read Uncommitted". The commit() operation internally has two steps, one to update the entity and one to update any indexes. One consequence is that a result set can contain entities which do not match the query constraint. An example: an entity with an attribute called 'status', taking the values "Running" or "Paused", is currently in the "Running" status. Transaction 1 updates the status to "Paused". Until commit() is invoked, all other transactions see the prior value of "Running". When Transaction 1 calls commit(), the entity is internally updated to "Paused", and any other direct access of the entity will see the new "Paused" value. Until the internal commit() operation completes step 2 (updating the indexes), it is possible for other transactions querying with something like "where status = 'Running'" to get this entity back, with its status already set to "Paused".
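The "Running"/"Paused" scenario can be replayed with a small simulation. This is a toy model of the two-step commit as described in this post, with invented names (LaggyStore and its methods); it does not reflect how Datastore is actually implemented.

```python
class LaggyStore:
    """Toy store whose commit updates the entity first and the index second."""

    def __init__(self):
        self.entities = {}      # key -> {"status": ...}
        self.status_index = {}  # status value -> set of entity keys

    def apply_entity_write(self, key, status):
        """Step 1 of a commit: the entity itself becomes visible."""
        self.entities[key] = {"status": status}

    def apply_index_write(self, key):
        """Step 2 of a commit: the index rows catch up with the entity."""
        for keys in self.status_index.values():
            keys.discard(key)
        status = self.entities[key]["status"]
        self.status_index.setdefault(status, set()).add(key)

    def query_by_status(self, status):
        """Find keys through the index, then load the current entities."""
        keys = sorted(self.status_index.get(status, ()))
        return [(key, self.entities[key]["status"]) for key in keys]


store = LaggyStore()
store.apply_entity_write("job1", "Running")
store.apply_index_write("job1")

store.apply_entity_write("job1", "Paused")           # step 1 only
between_steps = store.query_by_status("Running")     # index still says Running
store.apply_index_write("job1")                      # step 2
after_commit = store.query_by_status("Running")
```

Between the two steps the query for "Running" still finds job1 through the stale index row, but the entity it loads already carries the "Paused" status.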
Some other notes about transactions: Datastore can retry transactions which roll back due to optimistic lock failures, which is useful with occasional concurrency; however, concurrency must be thought through when designing larger Entity Groups. Multiple updates of the same entity are not allowed within a single transaction, and queries cannot be executed within an explicit transaction.
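The retry-on-conflict behaviour follows the usual optimistic locking pattern, sketched below in plain Python. The classes and the before_commit hook are hypothetical and exist only to make the pattern visible; the Datastore performs its own retries internally.

```python
class ConflictError(Exception):
    """Raised when another writer committed first."""


class VersionedCell:
    """A value with a version counter used for optimistic locking."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def commit(self, expected_version, new_value):
        if self.version != expected_version:
            raise ConflictError("version changed since read")
        self.value = new_value
        self.version += 1


def run_with_retries(cell, update, retries=3, before_commit=None):
    """Read, compute and commit; start over if the commit hits a conflict."""
    for attempt in range(retries):
        seen_version, seen_value = cell.version, cell.value
        new_value = update(seen_value)
        if before_commit is not None:
            before_commit(attempt)  # hook used here to simulate a competing writer
        try:
            cell.commit(seen_version, new_value)
            return new_value
        except ConflictError:
            continue  # someone else won the race; re-read and try again
    raise ConflictError("gave up after %d attempts" % retries)


counter = VersionedCell(10)


def competing_writer(attempt):
    # Interfere only with the first attempt.
    if attempt == 0:
        counter.commit(counter.version, 100)


result = run_with_retries(counter, lambda v: v + 1, before_commit=competing_writer)
```

The first attempt loses to the competing write and is redone against the fresh value. With a large Entity Group under heavy write load, every attempt may lose, which is why the group size needs thought.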
Tools for managing your data include a Data Viewer, where you can inspect your model objects and their values. If you need to update your schema, you can follow traditional operational practices for Software as a Service: disable access to the data, modify the schema (here, update the model classes), then optionally update the existing data to match the new schema. This last step is a bit tricky because it is done programmatically using the Datastore API, and is thus subject to limitations in result set size (1000 entities currently) and execution time. Best practice is to "nibble" away at the updates, repeating the update operation on a limited set selected via a where clause until all rows are updated.
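The "nibble" loop might look roughly like the following. This is a generic sketch with hypothetical names: fetch_batch, upgrade and save_batch are callables you would supply (for example, a filtered query fetch, a function that fills in the new fields, and a batch put), and an in-memory list stands in for the datastore here.

```python
BATCH_LIMIT = 1000  # mirrors the per-query result cap mentioned above


def migrate_in_batches(fetch_batch, upgrade, save_batch, batch_size=100):
    """Upgrade entities one batch at a time until none match the old schema."""
    if not 0 < batch_size <= BATCH_LIMIT:
        raise ValueError("batch_size must be between 1 and %d" % BATCH_LIMIT)
    total = 0
    while True:
        batch = fetch_batch(batch_size)  # must return only not-yet-upgraded entities
        if not batch:
            return total
        for entity in batch:
            upgrade(entity)
        save_batch(batch)
        total += len(batch)


# In-memory stand-in for the datastore.
rows = [{"id": i, "schema_version": 1} for i in range(250)]


def fetch_old(limit):
    return [row for row in rows if row["schema_version"] == 1][:limit]


def to_v2(row):
    row["schema_version"] = 2
    row["nickname"] = None  # hypothetical new field


migrated = migrate_in_batches(fetch_old, to_v2, lambda batch: None)
```

The loop only terminates if upgrade changes whatever the where clause filters on, so that updated rows drop out of the next fetch. Given the execution time limit, each batch would probably need to run in its own request rather than in one long loop.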
lexiconicly sorted petrabytes are cute :-)
http://jsolutions.se/?p=237