author    Priit Laes <plaes@plaes.org>  2010-06-09 20:38:07 +0300
committer Priit Laes <plaes@plaes.org>  2010-06-09 20:38:07 +0300
commit    9e43093fd353f98d791936fa3deeeec9da22cf27
tree      0842de8210e8057f8a6f0222d8b6889aa2c19c26
parent    Added utility for initial portage->database sync
Added progress report for second week
-rw-r--r--  docs/gsoc/02-report.txt  121
-rw-r--r--  utils/db_init.py           1
2 files changed, 122 insertions, 0 deletions
diff --git a/docs/gsoc/02-report.txt b/docs/gsoc/02-report.txt
new file mode 100644
index 0000000..debf139
--- /dev/null
+++ b/docs/gsoc/02-report.txt
@@ -0,0 +1,121 @@
+This is weekly progress report no. 2 for Project Grumpy.
+
+As reported previously, I am building a system to index portage packages
+and related metadata to make package maintainership a bit easier for
+developers.
+
+First, a few words about the document metadata storage. For this project, the
+plan is to use a document-oriented and schema-free database (MongoDB) instead
+of a regular relational database system (like SQLite or PostgreSQL).
+
+This also means that we can create a single document collection that holds
+the whole ebuild tree, with each document corresponding to one
+"category/package".
+
+Each document in the collection is just a JSON-formatted dictionary with the
+following structure (beware, this is work in progress, so some things are
+still missing)::
+
+    {
+        # "category/package" (primary index, unique)
+        '_id' : string,
+
+        # Version of the schema, used internally (just in case)
+        'schema_ver' : integer,
+
+        # Package category
+        'cat' : string,
+
+        # Package name
+        'pkg' : string,
+
+        ## Data from metadata.xml
+        # List of herds maintaining this package
+        'herds' : [ string, ... ],
+        # Long description of the package
+        'ldesc' : string,
+        # List of maintainers (by email addresses)
+        'maintainers' : [ string, ... ],
+
+        ## Data from the ebuilds themselves (but should be general)
+        # Description
+        'desc' : string,
+        # Upstream url(s) (FIXME: Do we need list here?)
+        'homepage' : string,
+
+        # Array of all the package versions and their specific info
+        'ebuilds' : [ {
+            # Package version (from category/package-version)
+            'version' : string,
+
+            # EAPI version
+            'eapi' : integer,
+            # List of USE flags supported by this ebuild
+            'iuse' : [ string, ... ],
+            # Package keywords ("x86", "~amd64", ...)
+            'keywords' : [ string, ... ],
+            # Licenses
+            'license' : [ string, ... ],
+            # Package slot
+            'slot' : string,
+
+            # Need to figure out proper structure for these, so we can also
+            # map out USE flags ;)
+            'depend' : TODO!!!
+            'rdepend' : TODO!!!
+        }, ... ]
+    }
+
+So how about querying the data? That's easy. (Please note that we are using the
+MongoDB shell.) For example, what if a developer wants to know which packages
+they are supposedly maintaining::
+
+ > db.ebuilds.find({'maintainers' : '...@gentoo.org' })
+ {... document data ...} # (Too much info :) )
+ > db.ebuilds.find({'maintainers' : '...@gentoo.org' }).count()
+ 7
+
+And the results come fast. I mean really fast.
+OK, how about checking how many packages under 'dev-python' are using a
+specific EAPI version::
+
+ > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 0}).count()
+ 202
+ > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 1}).count()
+ 3
+ > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 2}).count()
+ 255
+ > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 3}).count()
+ 125
+ > db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 4}).count()
+ 0
+ > db.ebuilds.find({'cat' : 'dev-python' }).count()
+ 504
+ > 202+3+255+125 - 504
+ 81
+
+Ahem.. packages with ebuilds of several EAPIs are counted more than once, so we
+have a "design issue" with our document structure. Back to the drawing board.
+
+Last week's progress report
+===========================
+
+Last week's progress has been a bit slow: I have mostly experimented with the
+document structure and with pkgcore's internals. Although I now have the
+portage contents inside the database, the document structure itself is far from
+ideal (as you can see from the example with EAPI counts given earlier).
+
+I have committed some of the stuff I have been working on into Grumpy's repo,
+so in case you are interested, check it out from [1].
+
+[1] http://git.overlays.gentoo.org/gitweb/?p=proj/grumpy.git;a=summary
+
+First, a warning: the portage->mongodb syncer is slow. I mean really slow: it
+takes about 3 hours (or even more) on my laptop to fully scan the contents of
+portage and store the data in the database.
+
+Plans for current week
+======================
+
+1) Speed up the portage syncer
+2) Improve document structure
diff --git a/utils/db_init.py b/utils/db_init.py
index c5d6e74..c71a5ef 100644
--- a/utils/db_init.py
+++ b/utils/db_init.py
@@ -43,6 +43,7 @@ def main(path):
eapi = pkg.eapi,
keywords = list(pkg.keywords) if pkg.keywords else [],
# TODO, need to figure out a proper queryable structure for these
+# iuse ??
# license = pkg.license,
# depends = pkg.depends,
# rdepends = pkg.rdepends
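
Note on the EAPI counts in docs/gsoc/02-report.txt: in MongoDB, a dot-notation condition on an array of subdocuments (such as 'ebuilds.eapi') matches a document when any element of the array matches. A package whose ebuilds use several EAPIs is therefore counted once per EAPI, which would explain why the per-EAPI counts add up to 81 more than the 504 packages. The following is a minimal sketch in plain Python (no database required) that mimics this matching rule. The sample packages, their versions, and both helper function names are hypothetical and are not part of Grumpy.

```python
# Hypothetical sample data shaped like the schema in 02-report.txt;
# the package names and versions are made up for illustration.
packages = [
    {"_id": "dev-python/foo", "cat": "dev-python",
     "ebuilds": [{"version": "1.0", "eapi": 0}, {"version": "2.0", "eapi": 2}]},
    {"_id": "dev-python/bar", "cat": "dev-python",
     "ebuilds": [{"version": "0.5", "eapi": 2}]},
    {"_id": "dev-python/baz", "cat": "dev-python",
     "ebuilds": [{"version": "3.1", "eapi": 2}, {"version": "3.2", "eapi": 3}]},
]


def count_packages_with_eapi(docs, eapi):
    """Mimic find({'ebuilds.eapi': eapi}).count(): a document matches
    when ANY element of its 'ebuilds' array has the given EAPI."""
    return sum(1 for doc in docs
               if any(e["eapi"] == eapi for e in doc["ebuilds"]))


def count_ebuilds_with_eapi(docs, eapi):
    """Count individual ebuilds (array elements) instead of packages."""
    return sum(1 for doc in docs
               for e in doc["ebuilds"] if e["eapi"] == eapi)


# Same EAPI range (0..4) as queried in the report.
per_package = {eapi: count_packages_with_eapi(packages, eapi) for eapi in range(5)}
per_ebuild = {eapi: count_ebuilds_with_eapi(packages, eapi) for eapi in range(5)}
total_ebuilds = sum(len(doc["ebuilds"]) for doc in packages)

# Per-package counts overlap, so their sum exceeds the number of packages.
print(sum(per_package.values()), "vs", len(packages), "packages")
# Per-ebuild counts partition the ebuilds, so their sum equals the total.
print(sum(per_ebuild.values()), "vs", total_ebuilds, "ebuilds")
```

With this sample data the per-package counts add up to 5 for only 3 packages, while the per-ebuild counts add up to exactly the 5 ebuilds. Counting per ebuild is one possible way to get numbers that add up; whether that, or a different document layout, suits Grumpy is left open here.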