This week I worked on obtaining a diff of the semantic data for a wiki page: it shows you what *semantics* got added and what *semantics* got removed on a page write. This is somewhat similar to what SemanticWatchList does, but now in core and more efficient.
As a result, newer versions of SemanticWatchList will use this feature from core SMW, and thus SMW+SWL will be more efficient to use together.
However, improving SWL wasn’t the initial plan. This came out of my work on improving SMW’s special pages and maintaining more statistics: we will now keep usage counts for properties in the database, so Special:Properties and Special:UnusedProperties can simply query this counts table instead of querying all the tables as they do now (we keep realizing how naive we were before :P).
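The idea behind the counts table can be sketched in a few lines. Here is a toy illustration in Python/SQLite with made-up table and column names (SMW’s actual schema may differ):

```python
import sqlite3

# Hypothetical sketch: keep a per-property usage count in its own small
# table so the special pages can read it directly. The names prop_stats,
# p_id and usage_count are illustrative, not SMW's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prop_stats (p_id INTEGER PRIMARY KEY, usage_count INTEGER)")
conn.executemany("INSERT INTO prop_stats VALUES (?, ?)",
                 [(1, 120), (2, 0), (3, 45)])

# Special:Properties just sorts the counts table...
used = conn.execute(
    "SELECT p_id, usage_count FROM prop_stats "
    "WHERE usage_count > 0 ORDER BY usage_count DESC").fetchall()

# ...and Special:UnusedProperties becomes a trivial filter, no joins.
unused = conn.execute(
    "SELECT p_id FROM prop_stats WHERE usage_count = 0").fetchall()

print(used)    # most-used properties first
print(unused)  # properties with no uses
```

The point is that both pages now touch one tiny table instead of scanning every data table.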
I also plan to use these property usage counts (since we already have them now :)) to build a few more special pages for SMW. Right now I only have RecentSemanticChanges in mind; however, this is out of scope for the project, so it might only be suitable as post-GSoC work.
Over the last few days I have made some major improvements to SMW’s special pages.
The old way of generating Special:Properties involved unnecessary joins and a limited result set (some results were omitted). The new method is much simpler: I only query the required information, in multiple simple queries, which should be much faster than the old way.
After this I also analyzed index usage on the db tables to see how MySQL handled the new method, and got amazing results.
The following query runs with no index (as things stand now). It is very slow, scanning about 10,724 rows when only 250 are actually needed (LIMIT 200,50 reads 250 sorted rows to return the last 50). Slow indeed 😉
mysql> explain select * from store3.smw_ids where smw_namespace=102 order by smw_sortkey limit 200,50;
+----+-------------+---------+------+---------------+------+---------+------+-------+-----------------------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+---------+------+---------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | smw_ids | ALL  | NULL          | NULL | NULL    | NULL | 10724 | Using where; Using filesort |
+----+-------------+---------+------+---------------+------+---------+------+-------+-----------------------------+
1 row in set (0.00 sec)
Next is my new way, using an index on (smw_namespace, smw_sortkey): the same query runs faster, scanning only 1010 rows to return the same result.
mysql> explain select * from store3.smw_ids where smw_namespace=102 order by smw_sortkey limit 200,50;
+----+-------------+---------+------+---------------+-------+---------+-------+------+-------------+
| id | select_type | table   | type | possible_keys | key   | key_len | ref   | rows | Extra       |
+----+-------------+---------+------+---------------+-------+---------+-------+------+-------------+
|  1 | SIMPLE      | smw_ids | ref  | NS_SK         | NS_SK | 4       | const | 1010 | Using where |
+----+-------------+---------+------+---------------+-------+---------+-------+------+-------------+
1 row in set (0.00 sec)
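The same effect can be reproduced in miniature with SQLite’s EXPLAIN QUERY PLAN. This is only an analogy to the MySQL output above (different engine, toy data, made-up rows), but it shows the full scan plus filesort turning into an index lookup once the composite index exists:

```python
import sqlite3

# Toy reproduction of the index effect in SQLite (not MySQL): a composite
# index on (smw_namespace, smw_sortkey) lets the engine walk rows already
# ordered by sortkey within a namespace, so no scan and no sort pass.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE smw_ids (
    smw_id INTEGER PRIMARY KEY,
    smw_namespace INTEGER,
    smw_sortkey TEXT)""")
conn.executemany("INSERT INTO smw_ids VALUES (?, ?, ?)",
                 [(i, i % 5, "key%05d" % i) for i in range(1000)])

query = ("SELECT * FROM smw_ids WHERE smw_namespace = 2 "
         "ORDER BY smw_sortkey LIMIT 200, 50")

# Without the index: a full table scan plus a temporary sort
# (SQLite's equivalent of MySQL's filesort).
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# The analogue of the NS_SK index from the post.
conn.execute("CREATE INDEX NS_SK ON smw_ids (smw_namespace, smw_sortkey)")

# With the index: an index search, and no temp B-tree for ORDER BY.
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

plan_before = " ".join(row[-1] for row in before)
plan_after = " ".join(row[-1] for row in after)
print(plan_before)
print(plan_after)
```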
While the new method is really fast, we still need to cache results, as this is still expensive for really large wikis.
SMW now has about 28+ db tables. Yes, you heard that right, and that’s where my bragging rights for DB sharding come from 😉; read ahead to know more.
SMW now has about 28 db tables, a major uplift from what we had initially. The more tables we have, the fewer things we load into memory at a time, the lower the storage requirements (no foreign keys), and the faster the querying (again, no foreign keys to compare). Plus, there’s an awesome feature that lets you shard the db even further (the + in 28+): you can assign a separate table to each of your properties (yes, you, the admin, can control it now 😉). This is suggested for properties that are heavily used on your wiki (don’t worry yet about how to figure that out). This way you can have as many tables as you want. But beware: there might still be bugs to fix.
Ok, now back to the question: how do you know which properties will be used extensively in the future? This can only be an estimate; say, a property like “email” is likely to be used extensively, so it should be marked as a fixed property (this is yet to be documented). But if you don’t want all this gibberish, you can also mark a property as a fixed property while it is already in use; you would then have to run an SQL script afterwards so your site keeps working perfectly (btw, this script is yet to be created, but is rumored to be fairly simple to write).
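Since that migration script doesn’t exist yet, here is only a rough sketch of what it would have to do, written against SQLite with made-up table names (smw_di_shared and smw_fpt_email are illustrative, not SMW’s real schema):

```python
import sqlite3

# Rough sketch of promoting an already-used property to a fixed property:
# move that property's rows out of the shared table into a dedicated one.
# All table/column names here are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE smw_di_shared (p_id INTEGER, s_id INTEGER, value TEXT)")
conn.executemany("INSERT INTO smw_di_shared VALUES (?, ?, ?)",
                 [(7, 1, "a@b.org"), (7, 2, "c@d.org"), (9, 1, "other")])

EMAIL_PID = 7  # the property being promoted

# The dedicated table needs no p_id column: the table itself identifies
# the property, which is where the storage saving comes from.
conn.execute("CREATE TABLE smw_fpt_email (s_id INTEGER, value TEXT)")
conn.execute("INSERT INTO smw_fpt_email SELECT s_id, value "
             "FROM smw_di_shared WHERE p_id = ?", (EMAIL_PID,))
conn.execute("DELETE FROM smw_di_shared WHERE p_id = ?", (EMAIL_PID,))

moved = conn.execute("SELECT COUNT(*) FROM smw_fpt_email").fetchone()[0]
left = conn.execute("SELECT COUNT(*) FROM smw_di_shared").fetchone()[0]
print(moved, left)
```

A real script would also have to update the id bookkeeping and run inside a transaction, but the move-then-delete shape is the core of it.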
Ok, now the testing part. I have done some intensive testing using my own tests for SMW, plus a local wiki (which has a massive import from semanticweb.org), and most bugs have been identified, but a few of them may still be hiding for you. If you are reading this and want to help us test, please do! All you need to do is set up a local wiki with SMW installed (check out from master and make SQLStore3 the default store), import from a heavily used SMW wiki (which I assume you own if you have read this far :P), and then run refresh.php to refresh all your SMW content. After that, I know you are gonna comment here about the bugs 😀
Hi, the midterm evaluation has come (time flies, doesn’t it?) and I am finally able to push my code to the master branch. I had expected this to be a tiresome job with lots of git merge conflicts, but thanks to my mentor Markus, who suggested keeping the older code alongside the newer one to make upgrades easier for users, it only required copying the new files into a separate folder in the store. Here’s the link to my latest commit: https://gerrit.wikimedia.org/r/#/c/14773/
But beware when you read the line on that page that says “+6647, -7”. That isn’t really true :p, as the whole lot of code isn’t my work; it’s mostly code inherited from the older store, with my tweaks in various places. I am now looking forward to the upcoming tasks, which I shall discuss with my mentor; maybe it’s time for some caching 🙂
This update is a little late, but let me sum it up.
New Stuff – UnitTests, DIHandler classes, new table for Booleans
And now the rants start 😉
Before I applied to GSoC I had shown interest in Jeroen’s proposal to write unit tests for SMW; now that I chose GreenSMW, he is still making me write tests ;). This is actually far better for catching broken stuff than our initial approach of testing by running a wiki. So whenever I get stuck waiting for some commit to merge, I write tests. This lets me make very good use of my “paid time”. I recommend it to all GSoCers 😀
Now back to the Booleans: after a little research on the Boolean datatype I finally settled on TINYINT(1) for MySQL and BOOLEAN for PostgreSQL; I checked this and it works. Performance-wise we now use very little space for Booleans, which also makes reading and writing such values faster.
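As a toy illustration of the storage shape (using SQLite, where the closest analogue of TINYINT(1) is a plain 0/1 integer; the table name here is made up):

```python
import sqlite3

# Illustrative only: a dedicated Boolean table storing values as 0/1
# integers (SQLite's analogue of MySQL's TINYINT(1)), so each value is a
# single small integer rather than a serialized string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE smw_di_bool (s_id INTEGER, p_id INTEGER, value TINYINT)")
conn.executemany("INSERT INTO smw_di_bool VALUES (?, ?, ?)",
                 [(1, 4, 1), (2, 4, 0), (3, 4, 1)])

# Reading back is a cheap integer comparison, no string parsing needed.
true_pages = [r[0] for r in conn.execute(
    "SELECT s_id FROM smw_di_bool WHERE value = 1 ORDER BY s_id")]
print(true_pages)
```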
I was surprised to learn that this is not a popular datatype among SMW users, which I found out after discovering a bug in SMW’s handling of Booleans.
Also, we discussed releasing the storerewrite version, and though Markus suggested releasing it as SQLStore3 while also keeping SQLStore2, Jeroen and I pushed for releasing it as just SQLStore, without any redundant code from SQLStore2. There might be a migration script to help users upgrade without any side effects, but only if time permits.