This is weekly progress report no. 2 for Project Grumpy. As reported previously, I am building a system to index portage packages and related metadata to make package maintainership a bit easier for developers.

First, a few words about the document metadata storage. For this project, the plan is to use a document-oriented, schema-free database (MongoDB) instead of a regular relational database system (like SQLite or PostgreSQL). This also means that we can create a single document collection that contains the whole ebuild tree, with each document corresponding to one "category/package".

Each document in the collection is just a JSON-formatted dictionary with the following structure (beware, this is work in progress, so some things are still missing)::

    {
        # "category/package" (primary index, unique)
        '_id' : string,
        # Version of the schema, used internally (just in case)
        'schema_ver' : integer,
        # Package category
        'cat' : string,
        # Package name
        'pkg' : string,

        ## Data from metadata.xml
        # List of herds maintaining this package
        'herds' : [ string, ... ],
        # Long description of the package
        'ldesc' : string,
        # List of maintainers (by email addresses)
        'maintainers' : [ string, ... ],

        ## Data from the ebuilds themselves (but should be general)
        # Description
        'desc' : string,
        # Upstream url(s) (FIXME: Do we need list here?)
        'homepage' : string,
        # Array of all the package versions and their specific info
        'ebuilds' : [
            {
                # Package version (from category/package-version)
                'version' : string,
                # EAPI version
                'eapi' : integer,
                # List of USE flags supported by this ebuild
                'iuse' : [ string, ... ],
                # Package keywords ("x86", "~amd64", ...)
                'keywords' : [ string, ... ],
                # Licenses
                'licence' : [ string, ... ],
                # Package slot
                'slot' : string,
                # Need to figure out proper structure for these, so we can also
                # map out USE flags ;)
                'depend' : TODO!!!
                'rdepend' : TODO!!!
            },
            ...
        ]
    }

So how about querying the data? That's easy. (Please note that we are using the MongoDB shell.)
So, what if a developer wants to know which packages they are supposedly maintaining::

    > db.ebuilds.find({'maintainers' : '...@gentoo.org' })
    {... document data ...}   # (Too much info :) )
    > db.ebuilds.find({'maintainers' : '...@gentoo.org' }).count()
    7

And the results come fast. I mean really fast.

Ok, how about checking how many packages under 'dev-python' are using a specific EAPI version::

    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 0}).count()
    202
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 1}).count()
    3
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 2}).count()
    255
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 3}).count()
    125
    > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 4}).count()
    0
    > db.ebuilds.find({'cat' : 'dev-python' }).count()
    504
    > 202+3+255+125 - 504
    81

Ahem.. looks like we have a "design issue" with our document structure: a query on 'ebuilds.eapi' matches a package as soon as any one of its versions uses that EAPI, so packages that carry versions with different EAPIs are counted more than once. So back to the drawing board.

Last week's progress report
===========================

Last week's progress has been a bit slow. I have mostly played with the document structure and a bit with pkgcore's internals. Although I now have the portage contents inside the database, the document structure itself is far from ideal (as you can see from the example with EAPI counts given earlier).

I have committed some of the stuff I have been working on into Grumpy's repo, so in case you are interested, check it out from [1].

[1] http://git.overlays.gentoo.org/gitweb/?p=proj/grumpy.git;a=summary

First a warning: the portage->mongodb syncer is slow. I mean really slow. It takes about 3 hours (or even more) on my laptop to fully scan the contents of portage and store the data in the database.

Plans for current week
======================

1) Speed up the portage syncer
2) Improve document structure
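
As a footnote to the EAPI counts given earlier, here is a minimal sketch of why the per-EAPI numbers add up to more than the total. It is plain JavaScript run against a few made-up in-memory documents, not against MongoDB or the real tree. The package names, versions and the helper functions (`countPackagesWithEapi`, `countEbuildsByEapi`, `overcount`) are assumptions for illustration only and are not part of Grumpy.

```javascript
// Hypothetical sample documents following the layout described earlier.
// Only the fields needed here are filled in; all values are made up.
const packages = [
  { _id: "dev-python/foo", cat: "dev-python",
    ebuilds: [{ version: "1.0", eapi: 0 }, { version: "2.0", eapi: 2 }] },
  { _id: "dev-python/bar", cat: "dev-python",
    ebuilds: [{ version: "0.5", eapi: 2 }, { version: "0.6", eapi: 2 }] },
  { _id: "dev-python/baz", cat: "dev-python",
    ebuilds: [{ version: "3.1", eapi: 2 }, { version: "3.2", eapi: 3 }] },
];

// Mimics find({'cat': cat, 'ebuilds.eapi': eapi}).count():
// a package matches when at least one of its ebuilds uses the given EAPI.
function countPackagesWithEapi(docs, cat, eapi) {
  return docs.filter(
    (doc) => doc.cat === cat && doc.ebuilds.some((e) => e.eapi === eapi)
  ).length;
}

// Counts individual ebuild versions per EAPI instead of packages.
// Each version has exactly one EAPI, so these numbers never overlap.
function countEbuildsByEapi(docs, cat) {
  const counts = {};
  for (const doc of docs) {
    if (doc.cat !== cat) continue;
    for (const e of doc.ebuilds) {
      counts[e.eapi] = (counts[e.eapi] || 0) + 1;
    }
  }
  return counts;
}

// Sum of per-EAPI package counts minus the number of packages:
// every package contributes (number of distinct EAPIs it uses - 1).
function overcount(docs, cat) {
  return docs
    .filter((doc) => doc.cat === cat)
    .reduce(
      (sum, doc) => sum + new Set(doc.ebuilds.map((e) => e.eapi)).size - 1,
      0
    );
}

const perEapiCounts = [0, 1, 2, 3].map((eapi) =>
  countPackagesWithEapi(packages, "dev-python", eapi)
);
console.log("packages per EAPI 0..3:", perEapiCounts);
console.log("ebuild versions per EAPI:", countEbuildsByEapi(packages, "dev-python"));
console.log("overcount:", overcount(packages, "dev-python"));
```

With this sample data, two of the three packages carry versions with two different EAPIs, so the per-EAPI package counts sum to two more than the number of packages. Counting per ebuild version avoids the overlap, which hints that the question "how many packages use EAPI N" needs a more precise definition before the document structure is changed.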