Recently, we decided to run a little crowdsourcing experiment. We wanted to take data from the IMLS Museum Universe Data File and add links and images from Wikipedia articles for each museum. Unfortunately, we didn’t have a ‘crowd’ to call on, so to speak, and it’s a difficult task to program, primarily because of differences in spelling, duplicates, and so on. Small, discrete tasks like this are perfect for crowdsourcing. We utilised a service called Crowd Flower and were quite happy with the quality of the results.
So how did we do it?
Finding the crowd
Some institutions may be lucky enough to have a great team of supporters ready to jump in and do the grunt work of a crowdsourced project, but just as many don’t. The task is also a bit boring; we aren’t digitising maps or restaurant menus. So we turned to a commercial service. There are plenty of services out there for crowdsourcing, and we chose Crowd Flower because I had looked at them a while ago for some other work.
Setting up the data
The process was pretty easy. We exported the fields from the source file that we wanted to use in our questions (museum name and address). We forgot to include our unique ID, which was a bit of a hiccup, but luckily the combination of name and address was good enough to match records in post-processing. We decided to export all California museums from the file — that’s 1,700 records in total.
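The export step can be sketched in a few lines of Python. Note the column names here are assumptions for illustration; check the headers in the actual IMLS file before running anything like this.

```python
import csv

# Assumed column name -- the real IMLS Museum Universe Data File
# may label its state field differently.
STATE_COL = "State"

def filter_state(rows, state="CA"):
    """Keep only the records for one state (case-insensitive, whitespace-tolerant)."""
    target = state.strip().upper()
    return [r for r in rows if r.get(STATE_COL, "").strip().upper() == target]

def export_state(in_path, out_path, state="CA"):
    """Read the full data file and write a state-only CSV for the crowd job."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = filter_state(list(reader), state)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Including a stable unique ID column in the export up front would have saved us the name-plus-address matching later.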
We imported the data and then set up a questionnaire for the crowd to answer:
- Does this museum have a Wikipedia page?
- What is the URL of the Wikipedia page?
- What is the URL of the image?
Testing the questions
When you first run the job, you set up several validation answers that the crowd fills out, to make sure workers are providing correct answers. We set up 7 museums and gave the ‘correct’ answer for each one. We then ran 100 museums through the crowd to get responses. Responses came thick and fast, with some ‘colourful’ answers to our test questions. It turns out there are multiple valid URLs for a single Wikipedia page.
We had only put in one of the possible responses, so we had to go back and amend our test answers so as not to fail those people. We also added further instructions to clarify what we were after. For example, we told people to try slight variations of the museum name to find a match: “Mountain View Cemetery” instead of “Mountainview Cemetery”, say.
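The multiple-URLs problem can also be tackled in code rather than in the test answers. A minimal sketch of normalising the common variants (mobile host, percent-encoding, spaces vs underscores) to one canonical form, so two workers’ answers compare equal:

```python
from urllib.parse import urlparse, unquote

def canonical_wiki_url(url):
    """Reduce common variants of a Wikipedia article URL to one canonical form."""
    parts = urlparse(url.strip())
    # Mobile pages live on *.m.wikipedia.org but are the same article.
    host = parts.netloc.lower().replace(".m.wikipedia.org", ".wikipedia.org")
    # Decode %-escapes and standardise on underscores in the title.
    title = unquote(parts.path)
    if title.startswith("/wiki/"):
        title = title[len("/wiki/"):]
    title = title.replace(" ", "_")
    return f"https://{host}/wiki/{title}"
```

This is only a sketch — it ignores redirects, which the Wikipedia API itself can resolve — but it would have caught most of the ‘duplicate’ answers we saw.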
Now we should be able to go back to the workers and update them, but I can’t work out how. TBH the Crowd Flower rep hasn’t answered my questions on this, which is a bit disappointing.
Running it for real
So we sorted out our test questions and turned on the tap. We had set the price at 10 cents per record, but I decided to drop that to 5 cents and see what happened. We decided on having only 2 people look at each record, although you can set this higher for better accuracy; after all, we were trying to get the best value for money. There is, of course, an ethical and quality trade-off here, and decisions around pricing are hard.
We submitted 1,700 museums. Of those, 435 (26%) have a Wikipedia page, and 397 (23% of the total) have an image.
The total cost was US$57, or 13 cents per Wikipedia page link and 14.35 cents per image.
They processed 231 records per hour.
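The per-unit figures above fall straight out of the totals, so they’re easy to sanity-check:

```python
total_cost_usd = 57.0   # whole job
museums = 1700          # records submitted
pages_found = 435       # records with a Wikipedia page
images_found = 397      # records with an image

hit_rate = pages_found / museums                       # ~0.26
cents_per_link = 100 * total_cost_usd / pages_found    # ~13.1
cents_per_image = 100 * total_cost_usd / images_found  # ~14.4

print(f"{hit_rate:.0%} of museums had a page")
print(f"{cents_per_link:.1f}c per page link, {cents_per_image:.1f}c per image")
```

At 5 cents per record and 2 judgments each, the raw labour cost is 10 cents per record; the effective cost per *useful* answer is higher because most museums turned out to have no page at all.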
And the quality? So far it seems pretty good. You can download an aggregate file, which chooses the ‘best’ result, or download all the results and do your own analysis. Would you trust this with your precious collections database? I would seriously consider using these types of services as one tool in your kit.
We’re now using the Wikipedia API to pull in the image metadata, download the image, and add the record to the museum data file. We will also look at adding the actual Wikipedia metadata to the records to help fill in some missing fields. We’re hoping to open-source the museum data file at some point, along with the work we’ve been doing aggregating other museums from around the world.
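For the image-metadata step, the MediaWiki API’s `pageimages` property returns a page’s lead image. A minimal sketch using only the standard library (the User-Agent string and error handling are our own choices, not anything the API mandates):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint

def fetch_page_image(title):
    """Query the MediaWiki API for a page's lead-image metadata."""
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": title,
        "prop": "pageimages",
        "piprop": "original",
        "format": "json",
        "formatversion": 2,   # pages come back as a list, not a keyed object
    })
    req = urllib.request.Request(
        f"{API}?{params}",
        # Wikimedia asks API clients to identify themselves.
        headers={"User-Agent": "museum-data-example/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_image_url(api_response):
    """Pull the original-image URL out of a formatversion=2 response, or None."""
    page = api_response["query"]["pages"][0]
    original = page.get("original")
    return original["source"] if original else None
```

From there it’s one more request to download the image itself and attach its URL and dimensions to the matching record in the data file.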
PS: You can download the results of the test here – Wikipedia Test File
Header Image: Crowd of soldiers watching a boxing match at the New Zealand Divisional Sports, Authie. Royal New Zealand Returned and Services’ Association :New Zealand official negatives, World War 1914-1918. Ref: 1/2-013325-G. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/22910690