Citizendium Forums
Author Topic: Gene articles and bots  (Read 19064 times)
Andrew Su
« on: March 26, 2007, 18:18:37 UTC »

All,

I recently approached the folks in the Molecular and Cellular Biology project at Wikipedia about a proposal to create, in an automated way, stubs for ~10,000 mammalian genes in parallel.  I posted a summary of the proposal on my Wikipedia user page with links to some of the primary discussions:

http://en.wikipedia.org/wiki/User:AndrewGNF
http://en.wikipedia.org/wiki/User:ProteinBoxBot (updated link 4/3/2007)

In short, I proposed organizing structured data (synonyms and aliases, genome locations, gene function, etc.) from many public databases, and creating stub pages with infoboxes summarizing these data.  (These stubs ideally will serve as seeds for people to contribute the more non-structured data for which wikis are a great tool.)  In a parallel non-wiki project, we have done the data integration effort so now we are exploring how to create gene stubs that are most useful for the Wikipedia community.
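For concreteness, the stub-creation step could look roughly like this -- a minimal sketch only, where the `{{Infobox gene}}` template name and the field names are illustrative assumptions, not the actual templates the bot would use:

```python
# Hypothetical sketch: render one collated gene record into a wikitext
# stub with an infobox.  Template and field names are made up for
# illustration; a real bot would target the wiki's actual templates.

def render_gene_stub(gene):
    """Build a minimal wikitext stub: an infobox plus one seed sentence."""
    lines = ["{{Infobox gene"]
    lines += [f"| {field} = {value}" for field, value in gene.items()]
    lines.append("}}")
    infobox = "\n".join(lines)
    return f"{infobox}\n\n'''{gene['symbol']}''' is a mammalian gene.\n"

record = {
    "symbol": "ITK",
    "name": "IL2-inducible T-cell kinase",
    "chromosome": "5",
    "function": "protein kinase activity",
}
stub = render_gene_stub(record)
```

Running something like this over ~10,000 such records would yield the batch of seed pages; everything outside the infobox is left for human contributors.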

The work-in-progress gene stub example is here: http://en.wikipedia.org/w/index.php?title=IL2-inducible_T-cell_kinase

The opening of the Citizendium project presents the possibility of doing this effort here, in addition to or instead of at Wikipedia.  As I see it, this possibility rests on two initial questions -- would this be desirable to the CZ community, and is there a bot policy to facilitate it?

Comments/suggestions are welcome...

Cheers,
-andrew
« Last Edit: April 04, 2007, 01:54:55 UTC by Andrew Su »

Chris Day
« Reply #1 on: March 26, 2007, 18:49:38 UTC »


Hey Andrew, I have been watching your discussions with Tim and co. and I think your idea is excellent. At present there are no automated bots running on Citizendium, that I know of, but this would be required for the proposal.  We have also considered using the public databases as a start to establish the tree of life related articles (mooted in a separate thread). I'll go back and re-read your proposals again to get myself up to speed.  Notice that the copyright licenses are slightly different here.

Zachary Pruckowski
« Reply #2 on: March 26, 2007, 19:14:25 UTC »

Quote
At present there are no automated bots running on Citizendium, that I know of, but this would be required for the proposal.

Correct.  There are currently no bots* on CZ.  Software to do what you're describing exists, and can be ported from Wikipedia versions.  If you decide you want to do this, we'll have to set it up and run it from CZ's servers, simply because (no offense) it could otherwise be a security risk.  I think that we'd prefer access at the wiki level versus access at the DB level, which should be fine for your purposes.

* = Jason (our technical lead) hates the word "bot", so we'd probably call it something else.  Bot has a very negative connotation in the IT field.

Andrew Su
« Reply #3 on: March 26, 2007, 23:28:59 UTC »

Quote
Notice that the copyright licenses are slightly different here.

Hmmm, didn't notice that originally, but this prompted me to do a little reading.  Is it correct to say that there is no consensus yet on the exact license, including use by commercial institutions?  Although our intent on this project is to be as open and "academic" as possible, GNF itself is not non-profit.  If we were prohibited from incorporating CZ content into our gene portal, then this is pretty much a non-starter...  Anyway, if there is a specific license agreement that I just haven't found, please point me in the right direction...

Quote
we'll have to set it up and run it from CZ's servers, simply because (no offense) it could otherwise be a security risk. 

Not sure how it'd be a security risk, since the b*t would essentially be screen scraping and inheriting the permissions of its user account.  But anyway, no objection in principle to running off of CZ's servers.  And of course, we're happy to refer to this hypothetical b*t by whatever name is preferred... ;)

Cheers,
-andrew

Zachary Pruckowski
« Reply #4 on: March 27, 2007, 00:20:44 UTC »

Quote
we'll have to set it up and run it from CZ's servers, simply because (no offense) it could otherwise be a security risk. 

Not sure how it'd be a security risk, since the b*t would essentially be screen scraping and inheriting the permissions of its user account.  But anyway, no objection in principle to running off of CZ's servers.  And of course, we're happy to refer to this hypothetical b*t by whatever name is preferred... ;)

I may have misunderstood you.  I thought you meant that the bot would also be creating articles.  If it just wants to read, then it can live wherever it wants.  If it wants write access however, then security considerations come into play (not that we don't trust you, just that we need to keep tabs on any sort of automated editing).

Andrew Su
« Reply #5 on: March 27, 2007, 00:36:54 UTC »

Sorry, I think I've muddied the water here.  Yes, the bot would be creating and editing articles.  By "screen scraping" I meant that it would be editing via CGI GET and POST within the context of a bot user account (as opposed to any sort of API or DB-level access).  And I completely understand the rationale for tracking and regulating bots in a similar way to WP -- absolutely no objection here.
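A rough sketch of that screen-scraping flow, for readers who haven't written one (this is not ProteinBoxBot code; `wpTextbox1`, `wpSummary`, and `wpEditToken` are stock MediaWiki edit-form field names, while the helper names and URL are my own illustration):

```python
import re
import urllib.parse
import urllib.request

def extract_edit_token(form_html):
    """Pull the hidden wpEditToken value out of a fetched edit-form page."""
    match = re.search(r'name="wpEditToken"\s+value="([^"]*)"', form_html)
    if match is None:
        raise ValueError("no edit token found -- not logged in?")
    return match.group(1)

def build_submit_request(base_url, title, new_text, summary, token):
    """Build the POST that stock MediaWiki's edit form expects.

    A real bot would first GET base_url?title=...&action=edit under a
    logged-in cookie jar, feed the HTML to extract_edit_token(), then
    send this request with urllib.request.urlopen().
    """
    url = base_url + "?" + urllib.parse.urlencode(
        {"title": title, "action": "submit"})
    data = urllib.parse.urlencode({
        "wpTextbox1": new_text,
        "wpSummary": summary,
        "wpEditToken": token,
    }).encode()
    return urllib.request.Request(url, data=data)

# The token parser can be exercised without a live wiki:
sample_form = '<input type="hidden" name="wpEditToken" value="abc123+\\" />'
token = extract_edit_token(sample_form)
request = build_submit_request(
    "http://en.citizendium.org/index.php", "ITK", "new page text",
    "bot update", token)
```

Because the bot works through the same forms as a human editor, it inherits exactly the permissions of its user account, which is the point being made above.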

Anyway, this is all jumping the gun.  Before the technical/regulatory issues, I think first we need to determine if this proposal is desirable within the scope of the CZ Biology project.  And for that, I'm excited to hear feedback from Chris and the other Biology authors and editors...

Chris Day
« Reply #6 on: March 31, 2007, 16:50:20 UTC »

Quote
And for that, I'm excited to hear feedback from Chris and the other Biology authors and editors...

Hi Andrew, so I was looking at your templates on Wikipedia and they look excellent. Connecting with the gene ontology keywords is a good idea. It is a great starting point for unifying the information on different genes. The best thing is that the updating is then not dependent on CZ or WP but on the respective host sites. This means everything is kept as up to date as possible.

Is your long term goal to stick with mice and humans? Or is there a way to tie in the mess of gene nomenclature for all species?
« Last Edit: March 31, 2007, 16:53:52 UTC by Chris Day »

Andrew Su
« Reply #7 on: April 01, 2007, 21:54:05 UTC »

Quote
Is your long term goal to stick with mice and humans? Or is there a way to tie in the mess of gene nomenclature for all species?

Our institute's focus (and my personal interest) is on mammalian biology, and the database that we've developed to collate all gene annotation from the public domain is focused on human and mouse.  So yes, for the foreseeable future, our emphasis will be on those two organisms.  Technically speaking, I think if someone else were more interested in adding content for another organism, it probably would be pretty straightforward for that person to adapt their data to use the bot that we develop.  Scientifically, however, I think this is a pretty sticky issue that will need further thought.  Between human and mouse, it's relatively easy to assign orthologs (the "same gene" in different species), but as you get to more and more distant organisms one of course has to deal with vast gene family expansion, functional "drift", etc.  Anyway, for this reason (and because we need to start somewhere), we're just going to commit to doing mouse and human right now...

Chris Day
« Reply #8 on: April 02, 2007, 01:16:41 UTC »

Quote
Anyway, for this reason (and because we need to start somewhere), we're just going to commit to doing mouse and human right now...

I agree that this is the best approach, but I was thinking along the lines of crop plants, later. You are at Novartis, right? Or is that my mistake?

Andrew Su
« Reply #9 on: April 02, 2007, 17:57:10 UTC »

Quote
Quote
Anyway, for this reason (and because we need to start somewhere), we're just going to commit to doing mouse and human right now...

I agree that this is the best approach but i was thinking along the lines of crop plants, later. You are at Novartis, right? Or is that my mistake?

GNF is a research institute funded by the Novartis Research Foundation and separate from Novartis' internal pharmaceutical research.  To my knowledge, Novartis is no longer in the agricultural business after having merged its agribusiness with AstraZeneca's several years back...   But I'm not sure if integrating with the plant genomics community would be easier or harder than with other animal model organisms.  The plant-specific genes are easy -- they just become new gene entries.  The basic cellular machinery which is shared would be sticky for the same reason as before -- how does one describe the true orthologs?

Chris Day
« Reply #10 on: April 03, 2007, 14:56:58 UTC »

Quote
how does one describe the true orthologs?

Right, and it's worse in plants due to the frequent polyploidy events.

Larry Sanger
Founding Editor-in-Chief
« Reply #11 on: April 04, 2007, 02:25:13 UTC »

I see no good reason not to do this, except that the license might be problematic, from the sounds of it.  For a good while I've been leaning toward CC-by-sa-nc, but I came from a position where we'd use the GFDL for everything.  I'm now leaning back toward the latter position.  The advantages and disadvantages are hard to weigh all at once, but I'm confident we'll make a wise decision.  Bear in mind that, even if we did decide to go with CC-by-sa-nc, you could export just the articles your b*t :) creates, because you'd be sharing the copyright with CZ.

I think that a project plan needs to be worked out and examined in some detail, however.  Is it necessary to make the articles editable at all?  Is the idea that a bot would create short, standardized articles based on shared data, which human beings would then add to?  Isn't this a bit problematic in that the bot would be able to run only once, since a second edition would automatically overwrite whatever human beings added?  What's the plan to deal with this problem, anyway?

Furthermore, there is the problem whether there are enough people available, even in the long run, to transform your bot-generated "stubs" into "encyclopedia articles."  Are there so many geneticists in the world that we can expect them to have interesting things to say about 10K genes on CZ?  If not, perhaps we shouldn't make any of the pages created by the bots editable at all (i.e., protect them all).  If someone wanted to write about a particular important gene, he would have to make a separate article.  And in that case, the bot-generated gene reference work could live in a separate namespace (think [[Gene:IL2-inducible_T-cell_kinase]]).

There's also one disadvantage of the plan, which is that "random page" would become almost useless--after your bot ran, most of the articles in the database would be gene articles.  But this is also just a temporary inconvenience.  If it adds 10K articles, that will dilute the database for only a year or two.  A possibility is that we could exclude the bot articles from "random page" somehow.

Maybe you can write up a project plan, answering these and other obvious questions, that we can examine?
« Last Edit: April 04, 2007, 02:34:05 UTC by Larry Sanger »

My CZ user page: http://en.citizendium.org/wiki/User:Larry_Sanger
Andrew Su
« Reply #12 on: April 04, 2007, 05:12:00 UTC »

Quote
Bear in mind that, even if we did decide to go with CC-by-sa-nc, you could export just the articles your b*t :) creates, because you'd be sharing the copyright with CZ.

Yeah, but I think we'd only be able to export the part of the article we contributed, and not take advantage of the community's efforts to enhance the article.  Anyway, I'll be very interested to see how this pans out.  If it goes CC-by-sa-nc, then probably I'll try to use the same bot on both CZ and WP, but the emphasis on debugging and customization would be on WP... 

Quote
I think that a project plan needs to be worked out and examined in some detail, however.  Is it necessary to make the articles editable at all?  Is the idea that a bot would create short, standardized articles based on shared data, which human beings would then add to?  Isn't this a bit problematic in that the bot would be able to run only once, since a second edition would automatically overwrite whatever human beings added?  What's the plan to deal with this problem, anyway?

I plan to limit the bot edits to a specific infobox.  The example page (http://en.wikipedia.org/w/index.php?title=IL2-inducible_T-cell_kinase) has evolved quite a bit since I first posted it, and I think it's pretty close to a final v1.0 target.  All the edits that the bot makes will be confined to that infobox, so it won't write over most contributions to the unstructured free-text section.  I plan on putting a comment in the infobox warning that manual changes will be overwritten by bot updates, and also posting instructions on how to add a comment that tells the bot to skip the update.
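The overwrite-protection scheme described here might look something like the following sketch (the `PBB_*` marker names are hypothetical, not the bot's actual markers):

```python
# Toy version of the update rule: the bot only rewrites text between
# delimiter comments, and leaves the page entirely alone if a human
# added an opt-out comment.  Marker names are illustrative assumptions.
import re

SKIP_MARKER = "<!-- PBB_NO_UPDATE -->"
BLOCK_RE = re.compile(r"<!-- PBB_START -->.*?<!-- PBB_END -->", re.DOTALL)

def update_infobox(page_text, new_infobox):
    """Replace the bot-managed infobox block; free text outside it survives."""
    if SKIP_MARKER in page_text:
        return page_text  # a human asked the bot to skip this page
    replacement = f"<!-- PBB_START -->\n{new_infobox}\n<!-- PBB_END -->"
    # A lambda avoids backslash-escape surprises in the replacement text.
    return BLOCK_RE.sub(lambda _m: replacement, page_text, count=1)

page = (
    "<!-- PBB_START -->\n{{Infobox gene | symbol = ITK }}\n<!-- PBB_END -->\n"
    "Human-written prose about ITK.\n"
)
updated = update_infobox(
    page, "{{Infobox gene | symbol = ITK | chromosome = 5 }}")
```

This is what lets the bot run repeatedly without clobbering human additions: the second and later runs touch only the delimited block.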

Quote
Furthermore, there is the problem whether there are enough people available, even in the long run, to transform your bot-generated "stubs" into "encyclopedia articles."  Are there so many geneticists in the world that we can expect them to have interesting things to say about 10K genes on CZ?  If not, perhaps we shouldn't make any of the pages created by the bots editable at all (i.e., protect them all).  If someone wanted to write about a particular important gene, he would have to make a separate article.  And in that case, the bot-generated gene reference work could live in a separate namespace (think [[Gene:IL2-inducible_T-cell_kinase]]).

I would strongly advocate not protecting the articles or putting them in a separate namespace.  I believe there's a huge community of geneticists and molecular biologists among whom this would catch on.  First, researchers write peer-reviewed articles all the time on particular protein families, but only a select few get invited to write these review articles or have the patience to do it all.  Second, I plan on putting reciprocal links between the infobox and the web application that led us to collate all the data in the first place (http://symatlas.gnf.org/SymAtlas).  We get ~40K hits and ~3K users per week, so I hope to lead this crowd to the CZ/WP efforts.  Anyway, like most things on CZ/WP, I think this effort would only get bigger with time.  My hope is that this will nucleate some sort of critical mass...

Quote
Maybe you can write up a project plan, answering these and other obvious questions, that we can examine?

I just put together an initial set of specs for the bot at http://en.wikipedia.org/wiki/User:ProteinBoxBot.  This info will eventually be used for the WP bot approval process.  The discussion above noted that there is no official CZ bot policy.  I guess CZ is small enough (for now) that perhaps we wouldn't need one to proceed, as long as it's clear what the game plan is...  Anyway, I'm happy to provide more details or answer more questions.

Jason "Electrawn" Potkanski
« Reply #13 on: April 04, 2007, 05:45:28 UTC »

To counterbalance this, the first bot we allow should be one that imports the articles from the 1911 Britannica.  The article texts are in the public domain and always will be.  It should be easy to find the public-domain version that was uploaded to Wikipedia or to other websites.  Legal precedent says that a copy of a work in the public domain is itself in the public domain -- like, say, a picture of a page in the 1911 Britannica.  That also makes http://www.1911encyclopedia.org/ fair game, contrary to whatever limits their disclaimers and terms attempt to impose.

-Jason Potkanski
Andrew Su
« Reply #14 on: April 04, 2007, 18:02:48 UTC »

Quote
To counterbalance this, the first bot we allow is to import the articles from 1911 Britannica. The article texts are in the public domain and will always be.

Jason, I want to be sure I understand the relevance of the example above.  Do you mean to point out that, since my proposed bot is gathering things from the public domain, there will be no restrictions on how GNF (as a commercial entity) can use that data?  If so, then it's a point well taken, but my primary concern is the unstructured "free-text" content that comes in *after* we seed these protein stubs.  For content that people first contribute to CZ, under CC-by-sa-nc we would not be able to put that content on our site.  And if that's the case, then there is no incentive (and really a disincentive) for me to steer our SymAtlas community to CZ.  Better to link our site to WP, where we will be able to incorporate their content into our portal (and CZ of course could do the same).  But it would not work in reverse.  For most people this distinction may not be relevant, but for GNF it is...
