This week I looked at one of the performance issues that affects each and every SMW installation on every page write. Yes, every time a user edits a page and saves it, a lot of work is done unnecessarily.
On every save, SMW goes through all of the text, deletes all the data it had for that page, and rewrites the database with the new data you entered; but what if you only fixed a small typo, or added text that has no semantic data associated with it? SMW should update your database wisely instead of blindly redoing its tasks. So we now have a hash-based approach: we check whether anything is new by comparing it with a hash of the old data, and only update the database when some semantic data has actually changed. This results in very few queries on a page edit. I am currently working on a branch of SMW, and this change is visible at https://gerrit.wikimedia.org/r/#/c/10706/
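Here is a minimal sketch of the idea; the helper names below are my own placeholders, not the actual SMW code (the real change lives in the gerrit patch linked above):

```php
<?php
// Sketch of the hash-based update check, using hypothetical helpers.

/**
 * Build a stable hash for the semantic data of a page.
 * The property => values map is sorted first so that serialize()
 * produces the same string for the same data.
 */
function smwComputeDataHash( array $propertyValues ) {
	ksort( $propertyValues );
	foreach ( $propertyValues as &$values ) {
		sort( $values );
	}
	unset( $values );
	return md5( serialize( $propertyValues ) );
}

/**
 * Only touch the database when the hash differs from the stored one.
 */
function smwUpdateIfChanged( $pageId, array $newData, $storedHash ) {
	$newHash = smwComputeDataHash( $newData );
	if ( $newHash === $storedHash ) {
		return false; // nothing changed: skip all the deletes and inserts
	}
	// ... delete the old rows, insert $newData, then store $newHash ...
	return true;
}
```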
When all this is merged into the master branch of SMW and released, you can upgrade and say Yay! 😉

This week I mostly worked on splitting the store code, which was essential for further work to go smoothly. Not much to explain here 😛

Here’s the link to the change https://gerrit.wikimedia.org/r/#/c/10400/

So, the first week of GSoC is over and we have a new feature: fixedProperties. This allows site admins to have some (heavily used) properties stored in separate tables by listing them in a simple PHP array.
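In LocalSettings.php I expect this to look roughly like the following; the setting name, property names and type IDs here are just my example, so check the documentation once the feature is released:

```php
<?php
// LocalSettings.php (sketch): declare heavily used properties as "fixed"
// so that each one gets its own dedicated table.
$smwgFixedProperties = array(
	'Population' => '_num', // numeric values
	'Located in' => '_wpg', // wiki page values
);
```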

Coming up: next I will split the core DB access code into multiple files (for better readability), followed by the inception of a new class, SMWSemanticDataCache, which will be a subject-centered cache of all SMW data for a page, so that data can be retrieved quickly instead of querying all the tables for each page.
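A rough idea of what that cache class could look like (just my sketch; the actual class will take shape as it gets implemented):

```php
<?php
// Sketch of the planned SMWSemanticDataCache: one cache entry per subject
// (page), so reading the data back does not need to hit every table.
class SMWSemanticDataCache {
	/** @var array in-process store, keyed by subject (page) id */
	private $cache = array();

	public function set( $subjectId, $semanticData ) {
		$this->cache[$subjectId] = $semanticData;
	}

	public function get( $subjectId ) {
		return isset( $this->cache[$subjectId] ) ? $this->cache[$subjectId] : null;
	}

	public function invalidate( $subjectId ) {
		unset( $this->cache[$subjectId] );
	}
}
```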

Has SMW been doing too many unnecessary deletes and inserts into your database lately, and has this been worrying you? Don't worry about it any more: I have written a little bit of code (I will link to it after the code cleanup day) that stops SMW from behaving this way.

 

So what does this new SMW do? It now checks whether something has changed for a page and only rewrites the values if something has really changed. (Yes, it is just a small feature, but look at some profiling data and you will see the difference it makes.) Here is some information I gathered for a page with about 14 property-value tuples:

  1. The newer version made about 14 fewer queries (1 update, 1 insert, 8 deletes, and the remaining selects).
  2. Memory used: 38578056 by the older version vs. 36118576 by the newer version.
  3. Time taken: 774.183 by the older version vs. 753.942 by the newer version.

 

I don't expect property values to change much in a wiki page, but there will always be new additions, so I think SMW should also check each property one by one and update only what changed rather than rewriting everything.
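In code, the per-property idea would look something like this (again a sketch with made-up helper and variable names):

```php
<?php
// Sketch: compute which properties actually changed, so only those
// rows need to be deleted/inserted instead of rewriting everything.
function smwDiffProperties( array $oldData, array $newData ) {
	$toDelete = array();
	$toInsert = array();

	foreach ( $newData as $property => $values ) {
		$oldValues = isset( $oldData[$property] ) ? $oldData[$property] : array();
		if ( $values != $oldValues ) {
			$toDelete[$property] = $oldValues; // remove outdated rows
			$toInsert[$property] = $values;    // write the new ones
		}
	}

	// properties that disappeared entirely from the page
	foreach ( array_diff_key( $oldData, $newData ) as $property => $values ) {
		$toDelete[$property] = $values;
	}

	return array( $toDelete, $toInsert );
}
```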

Will this happen? You will find out in another blog post from me; till then, have a good day 🙂

We discussed ways to improve performance at the ground level: partitioning the tables by properties and creating new (duplicate) tables partitioned by subjects. The documentation is again at http://www.semantic-mediawiki.org/wiki/SQLStore_Update. This led to the interesting idea of separating the smw_id table into two new tables, one for subjects (pages) and one for properties; this will enable us to store additional information about properties, such as their table name (as we will now have different tables for some properties) and the datatype of their values.
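To make the idea concrete, here is a rough sketch of what the two new tables might look like; the table and column names are my own guesses, not a final schema:

```php
<?php
// Sketch only: a guess at how splitting the ID table might look.
$createSubjectIds = "CREATE TABLE smw_subject_ids (
	s_id        INT UNSIGNED NOT NULL PRIMARY KEY,
	s_title     VARCHAR(255) NOT NULL,
	s_namespace INT NOT NULL
)";

$createPropertyIds = "CREATE TABLE smw_property_ids (
	p_id        INT UNSIGNED NOT NULL PRIMARY KEY,
	p_title     VARCHAR(255) NOT NULL,
	p_table     VARCHAR(64)  NOT NULL, -- which value table holds its data
	p_valuetype VARCHAR(32)  NOT NULL  -- datatype of its values
)";
```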
Some of the resources that helped us are http://backchannel.org/blog/friendfeed-schemaless-mysql and http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

Currently, I am ready to create these tables; any further additions will be done on the fly, as otherwise we would never actually get to coding. I won't rewrite setup.php now; I will do that once everything is finalized (and then we will also think about the migration script).

Markus and I set our priorities today for the various small work segments of my project, and the database update won the race. I will shortly be making updates to the SMW database layout as described at http://semantic-mediawiki.org/wiki/SQLStore_Update

My end-semester exams have almost ended, and today I will start working on the project.
My next few objectives are to find the places where we can monitor the write queries done on a page edit and check whether we are about to run the same query again; if so, we skip it and save a lot of time and resources 🙂
Here’s the full list of objectives http://www.google-melange.com/gsoc/project/google/gsoc2012/nischayn22/26001

Exams, tests and vivas are all knocking on my door together; there are lots of them.

Will continue working in small bits till 7th May.

Right now I am working on a bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=31269. It's interesting to work on bugs that have some votes, so you know you made something useful for someone.

Thanks to Yuvipanda for reminding me about cache invalidation. Possible solutions are:

  1. an expiry time (easy)
  2. a dependency check (e.g. when some new properties are added, we mark the cache of Special:Properties as invalid)

For now I have used memcached, falling back to the object cache when memcached is absent. Let's see if Markus and Jeroen have any suggestions; we can roll back easily, but this serves our purpose and is working 🙂
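Roughly what this looks like, as a sketch using MediaWiki's object cache interface (the cache key name, expiry value and computePropertyList() are just illustrative placeholders):

```php
<?php
// Sketch: cache the Special:Properties listing, with both an expiry time
// and explicit invalidation when the property data changes.
// CACHE_ANYTHING uses memcached when it is configured and falls back to
// the database-backed object cache otherwise.
$cache = wfGetCache( CACHE_ANYTHING );
$key = wfMemcKey( 'smw', 'specialproperties' ); // example key name

$list = $cache->get( $key );
if ( $list === false ) {
	$list = computePropertyList();    // placeholder for the expensive query
	$cache->set( $key, $list, 3600 ); // solution 1: expire after an hour
}

// Solution 2: when a new property is added, mark the cache as invalid.
function smwInvalidatePropertiesCache() {
	$cache = wfGetCache( CACHE_ANYTHING );
	$cache->delete( wfMemcKey( 'smw', 'specialproperties' ) );
}
```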