Corpus collection guidelines
1. Request students to fill in a learner profile
The VESPA learner profile was created to give researchers information about contributors, so that meaningful conclusions can be drawn when the corpus is analysed. Using the profile, it will be possible both to draw general conclusions about advanced learner writing and to examine subsets of the corpus, e.g. learners whose mother tongue is Spanish, learners who speak some English at home, or learners for whom German is the second language and English the third. It will also be possible to examine more sociolinguistic aspects, such as male/female comparisons. If the corpus is used as a basis for developing specifically adapted teaching tools, the potential advantages of this facility are clear.
The VESPA learner profile is available in two forms. International partners can either:
Each partner will have to assign a code to each student and ask them to use this code when they fill in the learner profile (and to be very careful to type it correctly!). A student code consists of 3 letters for the institution + 4 digits for the student. Thus, at the Université catholique de Louvain, we give students codes starting with:
A student who contributes several texts to the VESPA corpus should be given only one code (and not one code per course!). This is the only way we'll be able to identify several texts written by the same student while ensuring anonymity.
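The code format described above (3 letters for the institution + 4 digits for the student) can be checked automatically before texts are added to a batch. The following is a minimal sketch, not part of the VESPA tools: it assumes the letters are upper-case ASCII, and the function names and the sample code ABC0001 are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: 3 upper-case ASCII letters (institution) + 4 digits (student).
STUDENT_CODE = re.compile(r"[A-Z]{3}[0-9]{4}")


def is_valid_student_code(code):
    """Return True if the code has the 3-letter + 4-digit shape."""
    return STUDENT_CODE.fullmatch(code) is not None


def group_texts_by_student(records):
    """Group text file names by student code.

    records is an iterable of (student_code, file_name) pairs.
    A mistyped code raises ValueError so it can be corrected by hand.
    """
    grouped = defaultdict(list)
    for code, file_name in records:
        if not is_valid_student_code(code):
            raise ValueError(f"Malformed student code: {code!r}")
        grouped[code].append(file_name)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical codes and file names, for illustration only.
    batch = [
        ("ABC0001", "essay_course_a.doc"),
        ("ABC0002", "report.doc"),
        ("ABC0001", "essay_course_b.doc"),
    ]
    for code, files in group_texts_by_student(batch).items():
        print(code, files)
```

Because each student keeps a single code across courses, grouping by code is enough to find all texts by the same writer without storing any names.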
2. Collect the right type of material
The corpus will consist entirely of L2 academic writing in a wide range of:
Texts should be at least 500 words long (e.g. lab reports) but may be much longer (e.g. MA dissertations). They should be submitted in electronic format, which reduces the time spent typing up student texts and minimises the risk of introducing errors into the text.
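The 500-word minimum can be screened with a simple count once a text is available as plain text. This is only a rough sketch: it assumes that a "word" is any whitespace-delimited token, which may differ slightly from the count a word processor reports, and the function names are hypothetical.

```python
MIN_WORDS = 500  # minimum length stated in the guidelines


def word_count(text):
    """Count whitespace-delimited tokens (an approximation of word count)."""
    return len(text.split())


def meets_minimum_length(text, minimum=MIN_WORDS):
    """Return True if the text reaches the minimum word count."""
    return word_count(text) >= minimum


if __name__ == "__main__":
    sample = "This lab report is far too short."
    print(word_count(sample), meets_minimum_length(sample))
```

Texts that fall just below the threshold are probably best checked by hand rather than rejected automatically.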
Work should be entirely the students' own, i.e. no help should be sought from third parties, but reference tools such as dictionaries and grammar books are acceptable (use of reference tools should be indicated on the learner profile questionnaire). Texts produced by more than one student (e.g. collaborative work) and revised versions of texts (e.g. following teachers' comments) should not be included in the corpus.
Argumentative, descriptive and narrative subjects are not of interest. For this reason, the following types of titles should be avoided:
3. Text format
Student texts are usually submitted to the VESPA corpus as Microsoft Word documents. However, this format is impractical for efficient corpus processing. The documents need to be converted to plain text format, and they must be pre-processed before conversion so that relevant information is not lost.
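To illustrate why pre-processing matters, the sketch below extracts bare paragraph text from a Word file using only the Python standard library. It assumes the file is in the .docx (Office Open XML) format; older .doc files would need a different approach. The function name docx_paragraphs is hypothetical, and this is not the macro-based VESPA procedure.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file.

    source may be a file path or a binary file-like object.
    Only the characters are kept: styles (e.g. headings) and
    formatting (e.g. italics) are discarded.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

A heading and an ordinary paragraph come out of this function as indistinguishable strings, and italics or other highlighting disappear. This is the kind of information that has to be marked up before the conversion to plain text.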
A. Heuboeck (Reading University, UK) developed a number of computer tools, viz. Word macros and Perl scripts, that enable semi-automatic and automatic processing of the collected texts and so facilitate the encoding and mark-up process. The VESPA macros and Perl scripts are largely based on those developed for the British Academic Written English (BAWE) corpus (cf. Ebeling & Heuboeck 2007; Heuboeck et al. 2008).
The VESPA corpus is encoded according to the standard proposed by the Text Encoding Initiative (TEI).
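For readers unfamiliar with TEI, the sketch below builds a bare-bones TEI-style document in Python. It is an illustration only: it assumes TEI P5 element names and the P5 namespace, the header contents are placeholders, and the function name build_tei_document is hypothetical. The actual VESPA header and tag set are defined by the VESPA macros and scripts, not by this example.

```python
import xml.etree.ElementTree as ET

# Assumption: TEI P5 namespace; the TEI version used by VESPA is not stated here.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def _tag(name):
    """Qualify an element name with the TEI namespace."""
    return f"{{{TEI_NS}}}{name}"


def build_tei_document(title, paragraphs):
    """Return a minimal TEI-style XML string: a header plus body paragraphs."""
    ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

    root = ET.Element(_tag("TEI"))

    # Header with the three parts of a minimal file description.
    header = ET.SubElement(root, _tag("teiHeader"))
    file_desc = ET.SubElement(header, _tag("fileDesc"))
    title_stmt = ET.SubElement(file_desc, _tag("titleStmt"))
    ET.SubElement(title_stmt, _tag("title")).text = title
    publication = ET.SubElement(file_desc, _tag("publicationStmt"))
    ET.SubElement(publication, _tag("p")).text = "Placeholder publication statement"
    source_desc = ET.SubElement(file_desc, _tag("sourceDesc"))
    ET.SubElement(source_desc, _tag("p")).text = "Placeholder source description"

    # The text itself, one <p> element per paragraph.
    text = ET.SubElement(root, _tag("text"))
    body = ET.SubElement(text, _tag("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, _tag("p")).text = paragraph

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tei_document("Sample student text", ["First paragraph.", "Second paragraph."]))
```

The split between a header (information about the text) and the text body is what allows information such as the student code to travel with each file.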
There are 3 main steps involved in the preparation of student texts for the VESPA corpus:
An interface for interactive manual annotation (Step 1) was developed as a series of Word macros, written in Visual Basic, that provide a graphical user interface. This interface guides the tagger through the annotation process step by step.
As Ebeling & Heuboeck (2007) put it, it facilitates the human tagger’s task in various respects:
Step 2 relies on a Word macro that partners simply run on a batch of VESPA files to convert them to XML format.
Once a batch of VESPA texts has gone through Steps 1 and 2, partners should run the files through the Perl scripts.
27/01/2012