Thursday, April 30, 2009

Nakakoji et al. (2005) Understanding the nature of collaboration in open-source software development

So this is another paper in what I had hoped was my review of social psychology inspired analysis and design of online communities. However, this paper only cites the Ling et al. (2005) paper in passing; its main focus is a preliminary analysis of 65 months of postings to the mailing list of GIMP, the open source image processing software.

This paper reminds me of the Barcellini et al. (2005) paper on analysis of the Python mailing list postings. The authors don't really draw any conclusions, and some of their graphs are a little difficult to interpret. However, there are some interesting approaches which connect with my own analysis of evolving vocabulary use in the disCourse system. Figure 8 above shows the results of a principal components analysis on the subjects of the mail postings, and presents how the topics have changed over time. An interesting approach, but it seems odd that the most consistently active components are not labelled with text in the diagram, making it unclear what we can actually draw from this representation.

Another interesting look at the data can be seen in the next image, where the vocabulary terms used are broken up by the different types of user in the system. Again the graph is a little difficult to interpret, and no clear conclusions can be drawn. The authors suggest that this approach might be used to identify the roles of users within the community, which is an interesting idea but needs to be fleshed out a little further. The results of some effective means of classification could feed into Nabeth and Roda's autonomous agents approach, but it still seems to me that the best first step is to feed information like this back to the community ...

Overall an interesting paper in that it presents a number of different ways of analyzing mailing list postings. The authors suggest that much analysis has been focused on code, and that there is a need for more focus on socio-technical systems; although since that has been the focus of my own lit review, it seems the authors might well benefit from reviewing the last 10 years' proceedings of the HICSS persistent conversations workshop, which has lots of this kind of stuff. The fact that the authors don't reference seminal work like that of Sacks and Marc Smith from Microsoft Research indicates that there is lots more literature they could connect with.

I ran their references through my scholar system, but it only grabbed 9 of 14 possible refs, which was disappointing; partly this was due to inconsistent formatting. Some of the references also switch to initial-before-surname after the first author, and trying to fix that I ran into ruby regex stack overflow errors. That said, I have increased the precision of my title regex, and I have set it up so that the title is now linked to the paper ref from Google Scholar, which is more intuitive - clicking on the title takes you to the paper itself, e.g. in ACM Portal or IEEE Xplore, or even the PDF if available, and clicking on Cited by X takes you to the Google Scholar citation list. Of course mucking around with the regex is a terrible time sink. My latest thought is that I need to explicitly describe the different citation formats and re-build the regex with them explicitly in mind.
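As a first sketch of what that might look like (the patterns and the extract_title helper below are illustrative inventions, not what the system currently uses), each known citation format gets its own expression and they are tried in turn:

```ruby
# Illustrative sketch: one pattern per citation format, tried most specific first.
FORMATS = [
  /[“"]([^”"]+)[”"]/,                              # format 1: title set off in quotes
  /^[A-Z][a-z]+,\s*(?:[A-Z]\.,?\s*)+(.+?)[,.]\s/   # format 2: "Surname, I., Title, venue..."
]

def extract_title(ref)
  FORMATS.each do |pattern|
    m = pattern.match(ref)
    return m[1].strip if m
  end
  nil
end

extract_title('Goffman, E. Behavior in Public Places. New York: The Free Press, 1963.')
# => "Behavior in Public Places"
```

The idea being that when a new reference list fails, I can add a pattern for its format rather than further contorting one mega-regex.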
[1] Aoki, A., Hayashi, K., Kishida, K., Nakakoji, K., Nisinaka, Y., Reeves, B., Takashima, A., Yamamoto, Y., A Case Study of the Evolution of Jun: an Object-Oriented Open-Source 3D Multimedia Library (Cited by 42), Proceedings of International Conference on Software Engineering (ICSE2001), Toronto, CA., IEEE Computer Society, Los Alamitos, CA., pp.524-533, May, 2001.
[2] Butler, B.S., Membership Size, Communication Activity, and Sustainability: A Resource-Based Model of Online Social Structures (Cited by 184), Information Systems Research, v.12 n.4, p.346-362, December 2001.
[3] Butler, B., Sproull, L., Kiesler, S., Kraut, R., Community effort in online groups: Who does the work and why? In Leadership at a distance, Weisband, S., Atwater, L. (Eds.), Lawrence Erlbaum, 2005 (forthcoming).
[4] Cosley, D., Frankowski, D., Kiesler, S., Terveen, L., Riedl, J., How Oversight Improves Member-Maintained Communities (Cited by 32), Proceedings of CHI 2005, pp.11-20, ACM Press, 2005.
[5] Giaccardi, E., Fogli, D., Beyond Usability Evaluation in Meta-Design: A Socio-Technical Perspective (Cited by 4), IJHCS, (submitted).
[6] von Hippel, E., von Krogh, G., Open Source Software and the "Private-Collective" Innovation Model: Issues for Organization Science (Cited by 13), Organization Science, Vol.14, No.2, pp.209-223, March-April, 2003.
[7] Lakhani, K.R., von Hippel, E., How open source software works: free user-to-user assistance (Cited by 524), Research Policy, Special Issue on Open Source Software Development, 32, pp.923-943, 2003.
[8] Ling, K., Beenen, G., Ludford, P., Wang, X., Chang, K., Cosley, D., Frankowski, D., Terveen, L., Rashid, A. M., Resnick, P., and Kraut, R., Using social psychology to motivate contributions to online communities (Cited by 173), Journal of Computer-Mediated Communication, Vol.10, No.4, 2005.
[9] Nakakoji, K., Yamamoto, Y., Nishinaka, Y., Kishida, K., Ye, Y., Evolution Patterns of Open-Source Software Systems and Communities, Proceedings of International Workshop on Principles of Software Evolution (IWPSE 2002), pp.76-85, 2002.
[10] Nakakoji, K., Takashima, A., Yamamoto, Y., Cognitive Effects of Animated Visualization in Exploratory Visual Data Analysis (Cited by 4), Information Visualisation 2001, IEEE Computer Society, Los Alamitos, CA., pp.77-84, July, 2001.
[11] Pangaro, P., Participative systems. Manuscript, 2000, Available at:
[12] Preece, J. and Krichmar, D. M., Online communities. In Jacko, J. and Sears, A. (Eds.), The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, pp.596-620, Lawrence Erlbaum, 2003.
[13] SourceForge,
[14] Ye, Y., Kishida, K., Toward an Understanding of the Motivation of Open Source Software Developers, Proceedings of 2003 International Conference on Software Engineering (ICSE2003), Portland, Oregon, pp.419-429, May 3-10, 2003.
[15] Ye, Y., Nakakoji, K., Yamamoto, Y., Kishida, K., The Co-Evolution of Systems and Communities in Free and Open Source Software Development (Cited by 24), in Free/Open Source Software Development, S. Koch (Ed.), Chap.3, pp.59-82, Idea Group Publishing, Hershey, PA., 2004.
Cited by 4 [ATGSATOP]

Wednesday, April 29, 2009

Using Artificial Agents to Stimulate Participation in Virtual Communities (2005)

This is a position paper on how one might use artificial agents to encourage individuals to get more involved with online communities. This is another paper that I am reading in my review of social psychology inspired analysis and design of online communities.

There is a subsequent paper called "A Social Network Platform for Vocational Learning in the ITM Worldwide Network" which has yet to be cited by anyone but does present some results. There is also an earlier paper cited by 50 [ATGSATOP] on "Using conversational agents to support the adoption of knowledge sharing practices" from which the image above is taken.

So the paper that I actually have is a bit slim, and I should probably read the other papers to make valid comments; however I printed this one out a while ago, it's the one I read, and I don't want to waste the paper. In general I am skeptical of the use of "agents" in this kind of setting, but the paper does have some interesting references. The authors talk about research across many different fields into the mechanics of knowledge exchange in groups: knowledge management and organization (Cothrel and Williams, 1999); Computer Supported Cooperative Work (Majchrzak et al., 2003); complexity (Reed, 1999); social computing (Erickson et al., 2002); sociology and communication (Ridings and Gefen, 2004); and psycho-sociology (Beenen et al., 2004).

This was the point at which I really wanted to be able to dump this paper's list of references into Google Scholar, and this was the beginning of some serious procrastination as I built a system to do that. Here are the results (given two or three tweaks to the regex, and the addition of the lead author to the Google Scholar search to increase the chance of hitting the right paper - I am also thinking that the paper title should link to the paper itself where possible, with [Cited by] linking to the list of citations):
Angehrn A. A., 2004. Designing Intelligent Agents for Virtual Communities (Cited by 9). INSEAD CALT Report 11-2004
Beenen, G. et al., 2004. Using Social Psychology to Motivate Contributions to Online Communities (Cited by 166). Proceedings of ACM CSCW 2004 Conference on Computer Supported Cooperative Work, Chicago, IL. 2004
Blanchard A. and Markus L., 2002. Sense of Virtual Community-Maintaining the Experience of Belonging (Cited by 58). Proceedings of the 35th HICSS Conference -Volume 8, Hawaii
Chan, C. M., et al., 2004. Recognition and Participation in a Virtual Community: A Case Study (Cited by 4). Proceedings of the 37th HICSS Conference, Hawaii.
Cialdini, R. B., and Sagarin, B. J., 2005. Interpersonal influence (Cited by 39). T. Brock & M. Green (Eds.), Persuasion: Psychological insights and perspectives. (pp. 143-169). Newbury Park, CA: Sage Press
Cothrel J. and Williams R., 1999. On-Line Communities: Helping Them Form and Grow (Cited by 92). Journal of Knowledge Management, 3(1), 54-65, March 1999.
Erickson, T., et al., 2002. Social Translucence: Designing Social Infrastructures that Make Collective Activity Visible (Cited by 122). Communications of the ACM (Special issue on Community, ed. J. Preece), Vol. 45, No. 4, pp. 40-44, 2002.
Hall, H., 2001. Social exchange for knowledge exchange (Cited by 46). Paper presented at Managing knowledge: conversations and critiques, University of Leicester Management Centre, 10-11 April 2001.
Koh J. and Kim Y.-G., 2003. Sense of Virtual Community: A Conceptual Framework and Empirical Validation (Cited by 47). International Journal of Electronic Commerce, Volume 8, Number 2, Winter 2003-4, pp. 75.
Kinshuk and Lin T., 2004. Cognitive profiling towards formal adaptive technologies in web-based learning communities (Cited by 8). Int. J. Web Based Communities, Vol. 1, No. 1, 2004 103
Majchrzak A. et al., 2003. Computer-Mediated Inter-Organizational Knowledge-Sharing: Insights from a Virtual Team Innovating Using a Collaborative Tool (Cited by 72). Information Resources Management Journal, Vol. 13, No. 1
Reed, D., 1999. That Sneaky Exponential: Beyond Metcalfe's Law to the Power of Community Building (Cited by 43). Context Magazine, spring 1999,
Rheingold, H., 1993. The virtual community: Homesteading on the electronic frontier (Cited by 2926). Reading, MA: Addison-Wesley.
Roda, C., et al., 2003. Using Conversational Agents to Support the Adoption of Knowledge Sharing Practices (Cited by 50). Interacting with Computers, Elsevier, Vol. 15, Issue 1, pp. 57-89, January.
Rogers, E. M., 1995. Diffusion of Innovations (Cited by 20243)(Fourth Edition). New York, Free Press
Sharratt, M; Usoro, A., 2003. Understanding Knowledge-Sharing in Online Communities of Practice (Cited by 34). Electronic Journal of Knowledge Management, 1(2), December 2003.
Thaiupathump, C., Bourne, J., & Campbell, J. (1999). Intelligent agents for online learning (Cited by 37). Journal of Asynchronous Learning Networks, 3(2).
Thibaut, J. W., and Kelley, H. H., 1959. The social psychology of groups (Cited by 3391); New York: Wiley.
Tung, L., et al. 2001. An Empirical Investigation of Virtual Communities and Trust (Cited by 14). Proceedings of the 22nd International Conference on Information Systems, 2001, pp. 307-320
This allows me to parse the references in their section on social cognition literature in a completely different way than before. [Of course I immediately want to know the average number of citations to expect given the year of the paper, and to use that to adjust the coloration.] The authors describe a set of principles derived from social theories, including the establishment of the following:
  • A climate of trust (Tung et al., 2001 - [14])
  • A sense of community (Blanchard & Markus, 2002 - [58]; Koh & Kim, 2003 - [47])
  • A feeling of recognition for member actions (Chan et al., 2004 - [4])
They also refer to Social Exchange theory (Thibaut & Kelley, 1959 - [3391]), Influence theory (Cialdini & Sagarin, 2005 - [39]), Critical Mass theory (Reed, 1999 - [43]), Social Translucence (Erickson et al., 2002 - [122]) and the Theory of Innovation (Rogers, 1995 - [20243]). Based on just the text I felt flustered, feeling that I needed to read all these papers, but the citation counts at least give me a way to distinguish papers that are likely to be key ones in their respective fields.
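On the citation-count caveat: raw counts obviously favour older papers, so one crude normalisation (just a sketch of the idea, nothing I've actually built into the system) is citations per year:

```ruby
# crude age adjustment: divide the citation count by the paper's age in years
def citations_per_year(count, pub_year, current_year = 2009)
  age = [current_year - pub_year, 1].max  # avoid dividing by zero for this year's papers
  count / age.to_f
end

citations_per_year(3391, 1959)  # Thibaut & Kelley: ~68 citations/year
citations_per_year(166, 2004)   # Beenen et al.:    ~33 citations/year
```

By that measure the 1959 classic still dominates, but far less overwhelmingly than the raw counts suggest.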

The main point made by the authors is that software agents should be able to make use of social cognition theories to select from different types of interventions in order to stimulate participation in an online community. This would include understanding the phase of community membership an individual is in and the type of member they are. The authors concede that there are many challenges to such a system, including user acceptance of the agents themselves. This seems like the real challenge - the interventions are presumably limited to automatically generated emails, or posts in discussions, and the danger is that these will be considered spam by users. For all the effort it would take to build such a system, it seems to me that a better effect would almost always be achieved by investing the time/money in community facilitators who would take a genuine interest in all members of the community and contact them as necessary to try and stimulate participation. Automated interventions are likely to be regarded with contempt - I think, although that is just my intuition - it would be good if I could cite studies indicating that. Ling et al. (2005) [or Beenen et al., 2004 in this paper] used bulk emails to elicit different behaviours in a ratings community. I guess it all depends on the community.

Lead author's homepage

Cited by 8 at time of post according to Google Scholar

Further Procrastination related to Journal Citation Reports

So I finally found the right place to look up journal impact factors: the ISI Web of Knowledge Journal Citation Reports service. You need to subscribe to the service, but most universities do, so I have access through the University of Hawaii. I was particularly interested to discover that they have an API for their service, and that someone has already written a ruby client for it:

I also found some interesting discussions by people who would like to sort PubMed records by impact factor:

a non-subscription based site for access to slightly older impact factors:

and a series of articles on impact factors, cite-ranks and other fun stuff:

It seems that impact factor ratings are not without controversy. Anyhow, I haven't really digested all this, and I should probably stop procrastinating and get on with my literature review, but it makes me think that I could modify my Google Scholar system to look up citation counts both on Google Scholar and through ISI and then compare them to see how different they are, which might be realllllly interesting, and possibly generate enough data for a conference paper ...

Jiminy: A Scalable Incentive-Based Architecture for Improving Rating Quality (Kotsovinos et al., 2006)

This paper was one of a set I printed out as part of my look at social psychology inspired approaches to online communities. It turns out it is not particularly inspired by social psychology, but rather just references a few other papers that do.

The authors take a mathematical approach to try and determine the honesty of users contributing to a ratings system, and they test their approach on the GroupLens data set of a million movie ratings. The paper is also concerned with the scalability of their algorithm, but I am less interested in this side of the paper at the moment.

The authors define a "nose length" in their Pinocchio model, which is an indication of how honest a user is being in their ratings. The "nose length" is calculated from the log-likelihood of the user ratings of films based on all the other ratings for the same film by other users. Basically the more that you diverge from how others have rated things, the more likely you are to be classified as dishonest. This "nose-length" or Z value is calculated for all users in the GroupLens data set as shown in the figure above. The authors show that inserted data for made-up dishonest users clearly falls outside of the majority of nose-lengths for real users; and go on to assert that this demonstrates that their model allows them to detect dishonesty.
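My reading of the model, caricatured in a few lines of ruby (this is my own drastic simplification using mean absolute divergence, not the authors' actual log-likelihood computation):

```ruby
# toy "nose length": how far each user's ratings sit from the per-film
# consensus, standardised across users into a Z value; big |z| = suspicious
def nose_lengths(ratings)  # ratings: { user => { film => score } }
  sums = Hash.new(0.0)
  counts = Hash.new(0)
  ratings.each_value do |films|
    films.each { |film, score| sums[film] += score; counts[film] += 1 }
  end
  means = sums.keys.map { |film| [film, sums[film] / counts[film]] }.to_h

  # each user's average divergence from the consensus
  divergence = ratings.map do |user, films|
    ds = films.map { |film, score| (score - means[film]).abs }
    [user, ds.sum / ds.size]
  end.to_h

  mu = divergence.values.sum / divergence.size
  sd = Math.sqrt(divergence.values.map { |d| (d - mu)**2 }.sum / divergence.size)
  divergence.transform_values { |d| sd.zero? ? 0.0 : (d - mu) / sd }
end
```

A planted contrarian who rates against the grain on every film ends up with the largest z, which is the shape of the authors' made-up-dishonest-user experiment.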

I think this is an interesting paper, but I am not convinced that the model is necessarily detecting dishonesty. It seems clear that it can be used to detect robot-like behaviour and distinguish that from the human data set, but the authors have no real information on which humans were behaving honestly and which were not. The real test of this system would be to see how it operated with a real community of raters, but the evaluation presented in this paper is of scalability with mock data. Of course getting to play with a real community is the trick, and so this shouldn't be seen as too serious a flaw in the paper, but I think the authors should be careful about declaring that they can distinguish honesty from dishonesty.

One of the main things the paper made me think about was the way communities of raters provide a data set that is much more amenable to mathematical analysis. In an online community with discussions and opinions it would be much more challenging to try to automatically detect levels of honesty, and one might question the benefit of doing so. It makes me think that the field of rating community research (collaborative filtering?) is quite different from the field of online community research - arguably a subset, but then I think it also falls in the machine learning camp. In terms of providing evidence for the validity of one online community design pattern over another, this paper doesn't help me much. It presents a design pattern, but the empirical validation is missing in terms of what effect it would have on a real community.

Cited by 2 at time of post according to Google Scholar

Tuesday, April 28, 2009

Distributing Android Emulator with installed applications

So I had some trouble trying to ask questions on some Google groups about how to distribute an Android emulator with a pre-installed application. My questions didn't show up after posting through the web interface, and only appeared after I emailed the same questions - fortunately Google Groups collected them all together, and here are links to them for the record and my convenience :-)

My post to android developers

My post to android beginners

I hate being a duplicating cross-poster, but really, these things didn't show up on either list for several days.

Anyhow, once the posts showed up I got help from several kind folks. It seems that various problems I had been having were associated with using the older sdk, and once I upgraded to 1.5r1 everything seemed to get sorted; although it did take combining the various suggestions to finally produce a distributable package. The package ended up looking like the above image; and the bash script I used for startup contained the following:

./emulator -sysdir . -datadir . -skindir . -skin HVGA -sdcard sd256m.img
after I had created an sdcard image using the following syntax:
./mksdcard -l SD256M 256M sd256m.img
I had to install my application by downloading it through the browser, but finally everything seems to be working. I'll try and make the package and application available to everyone soon.

Citation parsing regular expression breakthrough

So after staring at another paper in a new field where I wasn't sure which of the papers it cited should be the ones to read, I said to myself: I am not going to let these regular expressions get the better of me. By adjusting the granularity of my unit tests I cranked out a new regular expression that would reliably extract the titles of the papers from my test set of seven citations. Here's that expression so far:

SURNAME = '[A-Z][a-z]{1,}'
INITIALS = '((\s[A-Z](\.|,|\.,))(\s?[A-Z](\.|,|\.,))*)'
TITLE = '(([A-Za-z:,\r\n]{2,}\s?){3,})'
# lead author, then any further "Surname, I." authors, then the first title-like run
REGEX = /([^e][^d][^s][^\.]\s|\d+\.?\s|^)(#{SURNAME},?#{INITIALS})(\s?(,|and|&|,\s?and)?\s?#{SURNAME},?#{INITIALS})*.*?#{TITLE}/

Now I am sure that this can be improved upon, but with a little web interface I have cooked up I can take the following:

1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social Processes.” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge Management and its Social Context” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V. Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E. (1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble.” Human Factors in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press, 1963.
5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value. New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center. New York: Doubleday, 1988.
and turn it into this:
1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social Processes (Cited by 78).” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge Management and its Social Context (Cited by 52)” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V. Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E. (1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble (Cited by 284).” Human Factors in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings (Cited by 822). New York: The Free Press, 1963.
5. Heath, C. and Luff, P. Technology in Action (Cited by 408). Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value (Cited by 210). New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center (Cited by 14). New York: Doubleday, 1988.

Which I think is pretty damn useful. I'm getting about a 70% hit rate on other lists of references, and I'm sure that can be improved. There are also changes that I might make to the color gradation. At the moment I'm just setting the red value from 0 to 255 based on the number of citations, and everything with more than 255 citations doesn't get any redder. I'd like the color to be normalised, so that the highest citation count in the references corresponds to red and all the gradations fall in between; and ideally I'd like to slide between red and white instead of red and black, and have the background color change rather than the text, but that's all icing on the cake really.
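The normalised red-to-white background idea would be something like this (a sketch; the helper name and hex output are my own, not what the current script does):

```ruby
# map a citation count to a background colour: the most-cited reference in the
# list gets pure red, an uncited one stays white, everything else fades between
def citation_color(count, max_count)
  return '#ffffff' if max_count.zero?
  intensity = (255.0 * count / max_count).round   # 0..255
  fade = 255 - intensity                          # green/blue fall as count rises
  format('#ff%02x%02x', fade, fade)
end

citation_color(822, 822)  # => "#ff0000"  (Goffman, the most cited in my test set)
citation_color(0, 822)    # => "#ffffff"
```

Normalising within each reference list means the gradations stay meaningful even when one paper cites a 3000-citation classic and another tops out at 50.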

What I'd most like to see is this as a web service that everyone could use, and an ongoing group effort to improve the regex further and get as many title matches as possible. If interested please add your vote to the Google Scholar feature request.

Got myself blocked from Google as a robot

So in the process of developing my scholar system, which uses Google Scholar to get citation counts and automatically insert them into (and color) lists of scientific references, I managed to trip the Google robot blocker. Of course I'd love to be using their API, but Google has not opened Google Scholar as part of their API yet. I've been blogging about this for a few days - I think it would be a hugely popular move. The reason I got myself blocked yesterday was that I finally worked out some regex to semi-reliably strip the titles from random scientific references. As I was debugging the colorization process I was hitting Google Scholar a bit too frequently, and then the queries started failing. I got the above image in my browser, which very kindly allowed me to re-access Google via the web once I'd typed in the captcha.

Of course my ruby script was still blocked, so I gave up work for the evening, but it was working again this morning - thanks Google - and so I immediately implemented a caching mechanism so that I don't hit Google Scholar each time I go through a debug cycle. Of course this means that releasing what I've created as a service would be problematic, as it wouldn't scale. This is another reason why it would be beneficial for Google to release an API for Google Scholar and allow systems access through authenticated keys to distinguish them from robots. Then the services that I and others have built could be available to lots of other academics. Stay tuned for my next blog post in which I'll post some images from my new service ... (not really such a cliff-hanger - gotta have lunch first :-)
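The caching is nothing fancy - something along these lines (the file layout and helper name are illustrative, not the exact code):

```ruby
require 'digest/md5'
require 'fileutils'

CACHE_DIR = 'scholar_cache'

# memoise lookups on disk so debug runs replay cached results
# instead of hitting Google Scholar again
def cached_lookup(title)
  FileUtils.mkdir_p(CACHE_DIR)
  key = File.join(CACHE_DIR, Digest::MD5.hexdigest(title))
  if File.exist?(key)
    Marshal.load(File.binread(key))      # cache hit: no network traffic
  else
    result = yield title                 # cache miss: do the real scrape
    File.binwrite(key, Marshal.dump(result))
    result
  end
end

# usage: cached_lookup('Social Translucence') { |t| scrape_scholar(t) }
```

Hashing the title gives a safe filename, and Marshal round-trips whatever the scraper returns.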

Monday, April 27, 2009

Example of need for Google Scholar API

Found an example of a script (ncbi-scholar) that would benefit from a Google Scholar API:

It adds the number of Google citations to references returned by PubMed - I would love to see this available for all referencing systems.

Related to my earlier posting on coloring citations.

Second Life Landlord Stories: cross floor intrusion

So one of the first issues I had to address as a Second Life landlord was cross-floor intrusion. Objects from a person on the 3rd floor were extending down to the Library Sciences space on the 2nd floor. Sending an image and a simple request was all that it took to get this sorted, but it raises interesting issues about awareness of spaces. Second Life does not seem wonderfully set up to take account of multi-story buildings - many things are controlled by land settings making it impractical to directly control things like who has access to different floors, since access is controlled by land region ...

Sunday, April 26, 2009

Blog everything

So I was just thinking that all the posts I make to all the different technical forums should be posted here in my blog first, and then linked and copy and pasted into the forum. That way I have a central record of all my questions, and people who are not looking in all the disparate forums can see what I am thinking about.

I also have this feeling that there should be some way to automate or formalize that relation, by using javascript to pull the contents of my blog post into the forums themselves, and to support things like having responses in the forums also appear as responses in my blog ... hmm ... I think I'll try it by hand first and see what develops.

Makes me think of that external conversation tool that plugs into blogs (the one Robert Brewer mentioned), and also the idea of news discussion being trapped in particular news sites. It would be nice to be able to encounter the discussion elsewhere ...

Saturday, April 25, 2009

Consciousness and Suspension of Disbelief

Was talking to Tom Gammarino, author of "Big in Japan", this evening, and we discussed the odd phenomenon whereby, when you are reading a book or watching a show, the first half chapter or first episode seems very contrived and unreal to start with, and then at some point it "clicks" and the world being presented becomes somehow believable.

I experienced this phenomenon recently in what has turned out to be the excellent "Prince of Nothing" series by R. Scott Bakker, and previously when watching the first episode of the unparalleled Firefly TV show.

I started thinking that this might be a similar sort of thing to the process of consciousness turning disparate sense data into some sort of cohesive whole. I believe it is fairly commonly accepted amongst psychologists that we perceive disparate chunks of sense data and that there is kind of a leap of faith, or jump, whereby consciousness turns those chunks into what seems like an unbroken continuous reality. I would be very interested to know if there was any research on how much sense data and what quality of it was required before the jump to believable universe was achieved ...

Thursday, April 23, 2009

Cobb (1997) Measurable Learning from Hands-on Concordancing

I'm reading this paper as part of a meta-analysis of different approaches to teaching vocabulary. It describes a study making use of software that provides the learner with many 'concordances': examples of words being used in context. You can see an example of concordance use in the image above. In this example from the PET200 system the learner is being asked to guess the appropriate word in the paragraph "had been done to make ? that the project", and they are getting help from a series of other sentences that take the same word in the blanked-out space, e.g. "Make __ that your home is really ...", etc.

This was a particularly interesting study in that the author designed the experiment so that the students using the software over 12 weeks got concordance support only on alternate weeks, and got just definition support in the remaining weeks. Comparing the results of weekly tests, the author was able to show a specific benefit from supplying concordancing information. My only real concern about this study is that the precise nature of the weekly tests is not disclosed; they are described as spelling tests and novel text tasks. The danger with experiments of this kind is that the results will depend heavily on the type of test. So, for example, if the weekly tests were concordance based (look at these concordances and then spell the appropriate word) rather than definition based (look at this definition and then spell the appropriate word), one can easily imagine that changing the test type would show an advantage of concordance-based learning over definition-based learning, and vice versa. It would be good to contact the original author, but he does have some other publications we should check first.

Original Paper
79 Citations according to Google Scholar

Monday, April 20, 2009

Citation Coloring with Google Scholar

So following on from my post on the Google Scholar API and PaperCube, I just wrote a little ruby script to screen-scrape the first Cited by link and the number of hits from Google Scholar. I'm hoping to create an example of the kinds of things that would become easier if a Google Scholar API were published. Here's the script:

#!/usr/bin/ruby -w
require 'open-uri'
require 'pp'

if ARGV[0].nil?
  pp 'usage: scholar.rb <paper-title>'
  exit
end

query = ARGV[0].gsub(/\W/,'+')

# search Google Scholar for the quoted title; the first hit is assumed correct
open("http://scholar.google.com/scholar?q=%22#{query}%22&btnG=Search") do |f|
  f.each do |line|
    link = line[/<br><a class=fl href=\"([^<]*)\">Cited by (\d*)<\/a>([^<]*)/,1]
    unless link.nil?
      pp link                                                                   # link to the citation list
      pp line[/<br><a class=fl href=\"([^<]*)\">Cited by (\d*)<\/a>([^<]*)/,2]  # citation count
    end
  end
end
This works by doing a search on a paper title, assuming the first hit will be the correct paper, and then grabbing the number of citations and the link to the page of citations using regular expression matching. Clearly this would be much cleaner if there were an API to hit.

Now I just need to create a web interface or a PDF document javascript plugin (some example scripts here), so that I can achieve my goal of being able to color-highlight all the references in an academic document so that the most highly cited ones stand out. I think this would be really useful for quickly homing in on the key papers in a field. I am not sure whether a PDF document plugin can access the web, so I went ahead and created a little Rails app that will accept a list of papers and then try to look them all up in Google Scholar using the above script.

It's not working wonderfully well yet as it turns out it is pretty difficult to write a generic regular expression that will extract all the titles from a list of references. My two approaches so far are as follows:



The former tries to use the fact that authors' surnames usually appear first, and the latter tries a bottom-up approach, extracting something title-like, i.e. space-separated words. Neither is working quite as well as I would like at the moment. I have to work on other stuff now, so I will come back to this in a few days, but any input on creating more robust regexes for title extraction would be most welcome. Here is the set of references I have been testing on:

1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social
Processes.” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge
Management and its Social Context” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V.
Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E.
(1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble.” Human Factors
in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press,
5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value. New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center. New York: Doubleday, 1988
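To make the two approaches concrete, here is a hedged Ruby sketch; these regexes are my own illustrative stand-ins (not the actual patterns I have been trying), run against the first reference above:

```ruby
# Hypothetical reconstructions of the two title-extraction approaches.
# Top down: assume the surname-and-initials author list comes first,
# and grab the quoted title that follows it.
AUTHOR_FIRST = /^\d+\.\s+(?:[A-Z][\w'-]+,\s+(?:[A-Z]\.\s*)+(?:&|and)?\s*)+[“"]([^”"]+)[”"]/

# Bottom up: just grab a quoted run of space-separated words as
# something title-like.
TITLE_LIKE = /[“"]([A-Za-z][^”"]{10,})[”"]/

def extract_title(reference)
  match = reference[AUTHOR_FIRST, 1] || reference[TITLE_LIKE, 1]
  match && match.strip
end

ref = '1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach ' \
      'to Designing Systems that Mesh with Social Processes.”'
puts extract_title(ref)  # prints the quoted title
```

Note that neither pattern would cope with references 4-7 above, whose titles are unquoted, which illustrates just how fragile generic title extraction is.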

MacBook Pro Failure

So my MacBook Pro started experiencing kernel panics last Thursday. I did the special PRAM reboot, and things seemed fine, but then the problem re-occurred. I went to the Apple store to see if I could get an appointment with a Genius, but the earliest slot that day conflicted with picking up my son from school, so I ended up buying a new MacBook Pro, because I felt I couldn't afford to be without a laptop. Certainly an impulse buy, and now that I investigate kernel panics some more (e.g. Mac OSX Kernel Panic FAQ, Apple's Kernel Panic Page, New Screen of Death for Mac OSX) it seems like I might have been able to resolve things without shelling out $3K for a new MBP. However, I feel like I really needed a reliable computer under me, and at least two other MBPs in our lab have recently gone kaput on two far more experienced OSX hacks than me.

Of course I then ran into all sorts of issues with the data transfer. My mail client Thunderbird and the software I use to run Windows (VMWare Fusion) had various issues on the new MBP. In the former case there was some kind of theme issue that eventually got resolved, and in the latter a re-install of the latest version of the software seemed to handle things, so everything is pretty much resolved now, but not before I spent two days trying different user account transfers to rule out the hypothesis that using an ethernet cable instead of a firewire cable had caused some of these problems. There was a scary moment after re-installing OSX on the new MBP when I couldn't get the transfer working from the old machine, or from my Time Machine backup; but I got through this after my good friend Ken Mayer suggested a SuperDuper! backup to an external LaCie drive, which worked. As a result I was able to transfer everything to the new machine from the external drive, and I seem to be out of the woods ... fingers crossed.

Thursday, April 16, 2009

Google Scholar API and PaperCube

A Google Scholar API doesn't exist yet, but some people are asking for it. Reading a paper last night I had an idea for what would be a cool use of a Google Scholar API, which is taking all the citations in a reference section of a paper and adding in a citation count. I wonder if that could be done even without a formal API?

What would be great would be a color coding of the references list at the end of the paper that picked out the papers with the most citations ...

Interestingly, reading the full list of the Google Scholar API request discussion I found mention of PaperCube, a SproutCore/SVG system that runs in a browser, developed by Peter Bergstrom, an Apple engineer and Master's student at Santa Clara. This is much, much cooler than my concept, and if it were running on top of a Google Scholar API it would be a total killer app for academics. Navigating the web of academic citations is something every academic must currently do in their head, knowing which are the key papers that have been cited in a field. At the moment, moving into a new field is hit and miss, as the papers you get back from a search by title are often of relatively little importance to the field. The existing citation counts in Google Scholar are helpful, but you still have to conduct multiple searches to build up a picture of the field; not to mention that the workings of the Google Scholar citation calculation are a black box. If we are lucky Google will release the API, make the citation calculation transparent, and something like PaperCube will run on top and make all academics' lives that little bit easier ....

An empirical investigation of knowledge creation in electronic networks of practice (Chou & Chang, 2008)

I'm reading this paper as part of my look at social psychology inspired analysis and design of online communities.

In this paper the authors are concerned with knowledge creation in electronic networks. They describe knowledge as "information combined with experience, context, interpretation and reflection", which goes a little against the situated cognition position of knowledge as process rather than entity; however the authors go on to discuss how knowledge creation involves social and collaborative processes.

The authors present a paradox that while electronic networks allow information to be shared quickly with many other individuals, and may foster knowledge creation; the lack of directives or organizational routines in networks that span organizational boundaries may hinder efficient knowledge creation. In order to better understand the knowledge creation process in electronic networks the authors develop a series of hypotheses based on the theories of social capital (Lin, 2001) and the theory of planned behaviour (TPB; Ajzen, 1991).

They test these hypotheses by creating a questionnaire on attitudes to knowledge creation (KC). The reliability of this instrument is checked by getting responses from a mixed group of industries in Taiwan. A refined questionnaire was then distributed to members of a legal professional association that made extensive use of electronic networks, and the results from this survey were analyzed and used to create a model of the relationships between the different constructs, as shown in the diagram above.

The hypotheses in the paper concern how different constructs, such as an individual's reputation and centrality, affect attitudes towards KC, the intention to conduct KC, and KC behavior itself. The majority of the authors' hypotheses are borne out by the results of the survey (and an analysis based on posting logs). Some hypotheses are contradicted, such as those that knowledge self-efficacy and reputation positively affect attitudes towards KC. Also, network centrality appears to impede the intention to conduct KC.

Overall this paper takes a rigorous approach to developing a survey instrument and analyzing it in terms of specific constructs. The approach is similar to the one I read about in Ma & Agarwal (2007) and I am guessing it must be a standard approach in the business/organizational sciences. This approach clearly avoids some of the standard criticisms leveled at data collected in surveys, but it does not appear to avoid the problem that the survey results rely on self-assessment. An individual might believe they have a positive attitude to knowledge creation, but the reality may be different. Some of the constructs used rely on relatively objective analysis (e.g. centrality), so it would be beneficial to have further analysis to determine the relationship between individuals' perceptions of their behaviors and their actual behaviors. Also I'd like to see the actual questionnaire, but presumably it was not in English, and so there would be no simple way for me to get a feel for the "reliable constructs" expressed within it.

This paper (cited by 2 according to Google Scholar at time of this post) appears to be following in the footsteps of the highly cited Wasko & Faraj (2005) and so it would probably be instructive to read that paper.


Ajzen, I. (1991), “The Theory of Planned Behavior,” Organizational Behavior and Human Decision Processes, 50(2), pp. 179-211.

Lin, N. (2001), Social Capital, Cambridge University Press, Cambridge, UK.

Ma, M., and Agarwal, R. (2007), “Through a Glass Darkly: Information Technology Design, Identity Verification, and Knowledge Contribution in Online Communities,” Information Systems Research, 18(1), pp. 42-67.

Wasko, M. M., and Faraj, S. (2005), “Why Should I Share? Examining Social Capital and Knowledge Contribution in Electronic Networks of Practice,” MIS Quarterly, 29(1), pp. 35-57.

Wednesday, April 15, 2009

Constructing Networks of Action-Relevant Episodes (CN-ARE): An In Situ Research Methodology (Barab et al., 2001)

Describes a way of analyzing the spread of concepts during learning where the focus is on the participation trajectories of multiple actors rather than on the minds of individual learners. Trajectories are represented as networks of activity involving material, individual and social components. To this end observations of learner interactions are broken into episodes which are coded and then represented as nodes in a network.

'Situated Cognition' theory informs this approach: the idea that knowing refers to an activity and not a thing, and that it is always contextualized, never abstract. This made me ask: isn't there some sort of knowledge that we like to have in abstract form, e.g. addition? But I guess even if a learner has abstracted the general concept of addition, it is still contextualized in terms of the extent to which the surrounding culture values mathematical skill and the situations in which it is expected to be used; i.e. any cognition is a complex social phenomenon.

The authors present a number of methodological contexts for their own methodology.
  • Interaction Analysis - developing coding schemes to describe interactions
  • Network theory - mapping things into nodes and links
  • Activity theory - seeing things in terms of nested activity systems, i.e. participants using components to act on objects
  • Actor Network theory - tracing interactions
It seems like the nodes are little pseudo-'activity theory' elements, with each node being defined in terms of an issue at hand, an initiator, a participant, a resource and a practice. It seems to me that trying to code the practice up front could be premature; I would have thought that what the practice was would emerge from subsequent analysis of the network. The diagram above shows the initiators of particular practices as circled numbers, and other participants as plain numbers. Lines show relations, and shading indicates a common theme.

Nodes, or Action-Relevant Episodes, are delimited by a change in theme, activity, or initiator. In the example in the paper of a virtual solar system class activity, a new node might be created when the issue at hand switched from eclipses to animation, the practice from modeling to Socratic questioning, or the initiator from one student to another. A large CN-ARE diagram is presented showing the tracer network for eclipse-related nodes, but it is not clear to me what insight is necessarily derived from using this particular representation. It is clearly helpful to have the detailed coding of the learners' activities if one wishes to understand the learning process rather than just the end results, but it would be nice if there were an explicit example of how being able to highlight nodes of a particular type in CN-ARE led to particular insights. The authors do, though, refer to another paper where they:
Barab, Barnett, Yamagata-Lynch, Squire, and Keating (2002) used the CN–ARE methodology to identify the frequency of occurrences related to a particular tracer, and then used Activity Theory (Engeström, 1987) to contextualize these in terms of the larger context (activity system).
So perhaps I should read that, although a quick skim reveals that, as they say in the above quote, this is about frequency of occurrences and activity systems analysis, rather than about the network diagrams. I can intuit the value of the network diagrams, but would really like to see a concrete example of an insight that they helped support, i.e. one that wouldn't have been easily discovered through another representation.
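To pin down the coding scheme described above, here is a minimal Ruby sketch of how Action-Relevant Episodes might be represented as nodes; the field names and the example data are my own inferences from the paper's description, not the authors' actual schema:

```ruby
# Each episode node is defined by an issue at hand, an initiator,
# participants, a resource and a practice (field names are my own guesses).
Episode = Struct.new(:issue, :initiator, :participants, :resource, :practice)

# A new node is delimited by a change in theme/issue, practice, or initiator.
def new_episode?(previous, current)
  previous.nil? ||
    previous.issue != current.issue ||
    previous.practice != current.practice ||
    previous.initiator != current.initiator
end

eclipse_modeling  = Episode.new(:eclipses, 'student 3', ['student 7'], 'VSS model', :modeling)
eclipse_questions = Episode.new(:eclipses, 'teacher',   ['student 3'], 'VSS model', :socratic_questioning)

puts new_episode?(eclipse_modeling, eclipse_questions)  # practice and initiator changed => true
```

A tracer analysis would then just be a matter of filtering the node list by issue (e.g. all `:eclipses` episodes) and linking consecutive nodes.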

Cited by 39 at time of post according to Google Scholar

Monday, April 13, 2009

The Human Infrastructure of Cyberinfrastructure (Lee et al., 2006)

This paper describes how the authors conducted an 18-month ethnographic study of a large-scale distributed biomedical cyberinfrastructure project. Based on this study the authors argue that 'human infrastructure' is shaped by traditional, as well as new, team and organizational structures. The authors suggest 'human infrastructure' as an alternative perspective to 'distributed teams' as a way of understanding distributed collaboration.

The authors use the focus on human infrastructure to describe the social conditions and activities that constitute the emergence of infrastructure. The idea is that there have recently been projects that try to promote the development of cyberinfrastructure, e.g. shared databases and IT to support multi-institutional collaboration; big efforts to create applications, software and tools to support big science. The human infrastructure consists of the programmers, researchers, developers and so forth who have experience of the difficulties of developing tools to operate in such complex environments.

This paper focuses on FBIRN (Functional Biomedical Informatics Research Network), a multi-institutional project with the goal of developing tools to make multi-site functional MRI (Magnetic Resonance Imaging) studies a common research practice. The authors describe different social groups in FBIRN, including the importance of traditional organizations (e.g. hospitals and universities), and the lack of clarity on the part of individuals as to their degree of membership in particular virtual groupings (e.g. task forces and working groups). Individuals have a restricted view of the whole project, which the authors suggest may be advantageous since the complexity of the entire project is too great for any one individual to follow. Individuals also make extensive use of personal networks and networks arising from other related projects. Some examples of the human infrastructure perspective appear to be:
  • New Practices, Old Conventions: Creation of new clinical assessment tool from multiple existing previous ones
  • Experimentation and Negotiation: Over time FBIRN developed a new process for developing experiments
  • Sharing Data: both legal and ontological issues and the challenges of keeping everyone up to date on the development of shared standards
They describe the amorphous, dynamic nature of the human infrastructure, and say that the formation of BIRN is recursive. However I am not sure about the use of this term. They say:
"The infrastructure is used by people to negotiate work, and in response to these interactions, the shape of the infrastructure itself is continually negotiated and changed."
and cite an example of how a statistics working group was formed out of a calibration working group, and how the individuals in the statistics working group were operating independently until a re-organization by the working group chairs. However none of this suggests a recursive structure. A dynamic and reflective one, certainly, but not recursive; though perhaps I am taking the term too literally.

The authors say that they are employing 'human infrastructure' as a lens to understand the work of FBIRN; a new way to understand organizational work, in contrast to traditional organizational structures, distributed teams or networks. However, even having read the paper twice I am a little unclear about what the 'human infrastructure' perspective is supposed to be. I guess I am unfamiliar with the existing literature on traditional organizational structures. Having read Wenger's Communities of Practice I tend to think of all human endeavours in this general fashion, i.e. as a messy kludge of multiple types of networks and contexts. The authors argue that theirs is a departure from the picture of traditional, hierarchical organizations being replaced by dynamic, networked organizational forms, in that they see the story of FBIRN making sense only by combining perspectives from traditional organizations, distributed teams and personal networks.

The authors make a few recommendations, such as encouraging the embrace of fluid organizational structures, and noting that different groups will require different sorts of organizational and instrumental support. The problem with any such recommendations is that there does not appear to be any assessment in the paper of the effectiveness of the FBIRN organization. Early in the paper we are told that the project has:
"successfully developed de novo tools for multi-site functional MRI studies, for data collection, management, sharing, and analysis"
However, we only have the word of the authors that FBIRN's achievements should be viewed in a positive light. One might ask whether the same results could not have been achieved faster and more efficiently with other types of organizational structure. Personally I would much rather work in a fluid organization, but I wonder what basis there is for arguing that it is necessarily better.

There is mention in the paper of how some individuals found the non-traditional organization challenging in so far as they couldn't hold collaborators accountable. Although there was some suggestion of alternative techniques being developed to address this issue, nothing in the paper constitutes evidence that the particular organization of FBIRN is necessarily a good one. The authors apply an ethnographic approach: attending physical and virtual meetings as observers, reading emails and conducting interviews. While this clearly allows large amounts of data to be gathered, and the study is ongoing, I am not sure what I can really say with much certainty having read the paper. My personal interest would be in what technologies were being used for collaboration, e.g. teleconferencing and virtual meeting software, wikis, email lists etc. Not that I don't want to focus on the human infrastructure, but I want to understand how the humans are using the different types of technology to serve their individual goals as they relate to their traditional organizational role, their personal networks and their other project affiliations.

Perhaps I am being too hard on an unfinished study that is as yet only a conference paper, but while it was very interesting to hear about how FBIRN operates, I have not taken away a clear understanding of what it means to focus on human infrastructure as opposed to distributed teams; which I think is the key perspective shift that the paper is advocating. I guess I would need to read literature on analysis of distributed teams to achieve a better understanding.

The success of projects of such a scale would appear to be very hard to assess. How does one know that the collaborations taking place are successful? Presumably interviewing all the participants and finding that they all agree that the collaboration is successful is one approach, but I would want some additional, perhaps more objective measure. Maybe there is no such measure. However in the absence of some measure of success or failure it seems difficult to argue that one perspective is necessarily better than another.

Thinking about my own interest in design patterns for virtual organizations, this paper makes me feel that perhaps I am focusing much too small. Is the group management interface in a wiki really going to have much impact on big distributed collaborations like FBIRN? The big socio-technical decisions would appear to be more at the level of which mailing lists to set up, which out-of-the-box content management system to use (e.g. MediaWiki or Drupal) and how frequently to have face-to-face meetings etc.

[N.B. This paper has been cited by 21 according to Google Scholar at time of this posting.]

Monday, April 6, 2009

impressed by getsatisfaction automated answer search

I find myself impressed by the automated answer search in the getsatisfaction customer feedback management system. It checks an existing database of problems so that you may be able to find the solution to your problem without having to actually post yourself. This seems like a great idea to avoid multiple posts asking the same question and repeated answers from admins or expert users saying "please refer to existing answer at blah blah blah", which I have noticed a lot on things like MySQL forums. I was considering a similar approach for the disCourse discussion tool at one point, but never got round to implementing it. It seems like a community pattern with great potential. I can't find any scholarly papers that refer to it, and otherwise my experience at getsatisfaction has not been amazing, to the extent that I haven't managed to get answers to the issues I am having related to mapufacture's geopress project. I would be very interested to hear about any similar features available in other web applications or other software; especially if there is some citable academic research on the same.
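As a sketch of the pattern (purely illustrative; I have no idea how getsatisfaction actually implements it), one could score existing questions by word overlap with a draft post and surface the best matches before the user submits:

```ruby
# Naive duplicate-question search: rank stored questions by how many
# words they share with the draft post. Real systems would use stemming,
# stop words and TF-IDF weighting, but this shows the shape of the idea.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/).uniq
end

def similar_questions(draft, existing, top_n = 3)
  draft_words = tokenize(draft)
  existing
    .map { |q| [q, (tokenize(q) & draft_words).size] }
    .select { |_, overlap| overlap > 0 }
    .sort_by { |_, overlap| -overlap }
    .first(top_n)
    .map(&:first)
end

db = ['How do I reset my password?',
      'GeoPress map not displaying',
      'How can I change my password?']
puts similar_questions('I forgot my password and need to reset it', db)
# the two password questions match; the GeoPress one does not
```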

Social and Economic Incentives in Google Answers

"Social and Economic Incentives in Google Answers" is a research paper by Sheizaf Rafaeli, Daphne R. Raban and Gilad Ravid; presented at the ACM Workshop on Sustaining Community: The role and design of incentive mechanisms in online systems, Sanibel Island. As of this post it has been cited by 10 other authors according to Google Scholar; although 6 of these are self citations.

The authors describe their own analysis of pricing and behaviour in the Google Answers system and compare it with the analysis of Edelman (2004). Edelman found that more specialized answerers earned less per hour, explaining that when a researcher stays within a particular field they lose opportunities in other fields. The workshop paper refers to some interesting economic properties of information, such as:
Information is expensive to produce and cheap to reproduce (Bates, 1989; Shapiro and Varian, 1999).
... (information's) value is revealed only after consumption (Shapiro and Varian, 1999; Van Alstyne, 1999).
They also note that:
Behavioral research revealed that the value of information is derived from perceptions of at least three central elements: cost, quality, and ownership (Toften and Olsen, 2004; Raban and Rafaeli, 2005).
The researchers' initial inspection of the data suggested a correlation between economic incentive and the number of questions answered, while tips were only very weakly correlated, and the socially constructed ratings were not correlated at all.

Of course I should have been reading and citing the authors' subsequent journal paper: Rafaeli, S., Raban, D., Ravid, G. How Social Motivation Enhances Economic Activity and Incentives in the Google Answers Knowledge Sharing Market. Int. Journal of Knowledge and Learning 3, 1 (2007), 1-11, but I had the earlier conference paper printed out ... and in any case the conference paper ends rather abruptly with the conclusion that:
... when interaction is present, the social parameters of rating and comments contribute incentives to the formation of participation, beyond the role of economic incentives ...
I didn't quite follow this from the conference paper, but it gets explained in more detail in the subsequent journal paper. It seems that if one considers only those questions that generated a discussion (at least one comment), then tips and social ratings are correlated with the question being answered. More detailed analysis in the journal paper of the mean ratio of comments per answer indicated that comments from the community given before answers from experts are correlated with the likelihood of answers from experts, but only at the level of individual experts. This contrasts with a reduced number of questions answered by experts where comments are present, if the analysis is conducted not at the expert level but across the entire site. The authors explain the site-level effect as follows:
... if sufficient help was provided by a comment, there is less need or room for an answer. Ethical experts will not post a paid answer where an informative comment was submitted.
whereas they explain the converse relationship at the individual expert level in this fashion:
Experts seem to be drawn to questions that generate much interest, comments, activity and tend to answer those questions more often. This may fill a social need but may also serve an economic purpose of enhancing the expert's reputation by getting exposure to more eyeballs.
This contradiction at the different levels of analysis only makes sense to me when one takes into account that comments may or may not answer the question posed. If the answer is contained in the comments they will serve as a disincentive to expert answers, whereas comments that do not answer the question are indicative of interest in the answer and as such serve as an incentive.

The authors consider two possible explanations to this and the related finding of increased levels of tipping for individual experts when questions have many comments:
... comments may enhance the overall perceived quality of all knowledge provided, as answer or comments, so the asker becomes more inclined to tip ... (or) may be that the asker feels some social pressure by the presence of the comment contributors which leads him/her to provide a tip as a social norm.
It seems to me that the latter explanation is the more likely, and the authors go on to cite a number of papers on the subject of social facilitation, which suggests they also lean towards a similar explanation. However, it is difficult to imagine an objective experiment or analysis that would tease these two possible explanations apart, although one might experiment with auto-generated comments or something similar.

This makes me think that critical mass, on an individual piece of information or across an entire site, may be a function of the likelihood with which individuals perceive their activity as being observed and/or approved of by others. For example, if lots of others, particularly existing colleagues or respected individuals, are undertaking certain activities in regard to a post, or towards a site, then this increases the likelihood of an individual participating, and when this likelihood (combined with frequency of encounter) reaches a tipping point over the system as a whole, the critical mass needed to achieve a thriving online community will be achieved.
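To make the tipping-point idea concrete, here is a toy Ruby sketch; the numbers and the threshold dynamic are entirely hypothetical, just to illustrate how two communities with the same rules can settle at very different participation levels depending on where they start:

```ruby
# Toy model: a member's propensity to participate rises sharply once
# perceived participation clears a visibility threshold. All parameters
# are made up for illustration.
def step(rate, threshold: 0.3, baseline: 0.05)
  social_pull = rate > threshold ? 0.9 : 0.2  # stronger pull once activity is visible
  [baseline + social_pull * rate, 1.0].min
end

def settle(initial_rate, iterations = 50)
  rate = initial_rate
  iterations.times { rate = step(rate) }
  rate
end

puts settle(0.1).round(3)  # starts below the threshold: stays a ghost town
puts settle(0.4).round(3)  # starts above it: sustains a much higher level
```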

I originally found this paper because it cited Ling et al. (2005), which I consider a seminal study on motivation in online communities. I am particularly interested in how certain design patterns influence participation in online communities, and Ling et al was one of the first papers that I found that took an experimental approach to this subject; something I perceive to be missing from the existing research on online community design patterns.

The journal paper has a couple of non-self-citations at the time of this post.

There do appear to have been a few other analyses of Google Answers since this paper was published.

Interestingly, Google Answers shut down on December 1st, 2006. Other services such as Mahalo Answers and UClue have sprung up in its place. There were also some non-monetary services that ran in parallel with Google Answers and continue to thrive, such as Yahoo Answers and Answer Bag. Wikipedia and this blog post have a good discussion of some of the reasons for the shutdown.

Thursday, April 2, 2009

Collaborating with Web Applications

There are many web applications available that try to support online collaboration. I have been involved for several years in building and maintaining one system that was originally in PHP and was migrated to Rails, and is currently stuck at Rails 1.0, while the Rails world has moved on, most recently to Rails 2.3.

Over time I have become aware of many other systems such as Plone, Drupal and hosted services like basecamp and huddle. I often draw a parallel as follows:

PHP -> Zend -> Drupal
Ruby -> Rails -> Radiant
Python -> Zope -> Plone

Where at the base we have a programming language; next a framework that makes it easy to produce web applications in that particular language; and finally systems that run as a complete content management system (CMS) out of the box, support basic functionality like login, wikis etc., and also support the addition of extra functionality through some plugin framework.

Of course the picture is not as simple as this model: firstly, I don't think that Drupal actually runs on Zend (although there is a Zend module for Drupal), and there are many frameworks for each language and many CMSs for each language and framework. However, the named elements in the above schema are ones that have become close to the de facto standard in their bracket, and it is instructive to think about the pros and cons of setting up a CMS or collaboration system using each.

Writing something from scratch in PHP, Ruby, Python or another language is really a hell of a lot of work: you will be maintaining it yourself, and it will be difficult to get open source community traction given the other existing frameworks written in those languages, so really the options are pick a framework, or pick a CMS. Picking a CMS means less up-front work and potentially getting support from the community associated with it, but the particular CMS may not match your current needs. Fortunately these systems can be extended through plugins, but if the underlying system is not a good match you'll have trouble. Working directly in a framework to make your own application gives you a lot of flexibility, but there tends to be a lot of re-inventing the wheel, making things like password reset functionality, which you will then have to maintain. One interesting possibility here is to incorporate a framework like Google Friend Connect or Facebook Connect, or even to go for potentially closed source alternatives like basecamp or huddle. Interestingly huddle is now available through LinkedIn and Facebook.

There are also interesting possibilities for using twitter, and there are various lists of basecamp alternatives around the web.

Using a hosted service leaves you at its mercy, so it is tempting to set up an existing CMS; but these require maintenance too, although with plugin frameworks they do allow a degree of customization. To get more flexibility one can build on a framework like Rails, but as I have found out, the real challenge there is keeping things up to date, which is not as simple as it sounds. In our project we stuck with Rails 1.0 in order to keep pushing features out and avoid having to spend time adapting to a changing codebase, but we built up a lot of technical debt in the process, and maintaining our Rails 1.0 codebase is now a big handicap. Also the lack of a pluggable framework means that involving other coders is messy, although that said I haven't really experimented with the pluggable systems enough to correctly analyze the associated pitfalls.

Anyway, so this is a rather incoherent post, but I wanted to collect together my thoughts and associated links on one page ...