by Mat Terrett
My interest in the use of corpora for EAP began when I first heard about Coxhead’s (2000) Academic Word List and was further piqued when I read Hyland’s (2009) corpus-based argument for greater specificity in EAP teaching. This provided a focus for my own research through which I came to believe that an appropriate corpus could provide a very useful tool for teaching and learning EAP vocabulary. In this blog post I will share some of my experience in corpus linguistics and provide a ‘walkthrough’ example of how corpus tools can be used by both EAP teachers and students.
What is a corpus?
A corpus, which literally means ‘body’, is a collection of texts – a body of texts. There are big corpora (with Google being the biggest) and smaller corpora that comprise collections of specific kinds of text (e.g. a collection of the works of one author). A reference corpus is an existing corpus against which a researcher (or EAP teacher or student) can check how particular words or sequences of words are used.
Why use a corpus?
Before any demonstration as to how a reference corpus can be used, it would be useful to consider why an EAP teacher might want to use one. Essentially, it allows us to test our own intuition about how lexical items are used and provides a tool for students to be able to check their own vocabulary use. For example, as one of my former EAP colleagues in China explained:
When [students] have written something that just is not an English expression, I say to them, ‘have you tried putting this through Google…to see what comes up?’ And either it just doesn’t come up or it comes up all Chinese language websites and they realise they’re not using it appropriately…very quick – just using the web as your very quick reference corpus…You have to be careful with that obviously because there are a lot of people using wrong, horrible grammar and vocabulary.
(see also Robb, 2003)
Wouldn’t it be useful if there were specific corpora against which we could reference EAP lexis? Well, the quote above was from 2010 in a university where only a few EAP teachers (out of a total of 35) reported being aware of corpora. Even then, had we known about them, more specific reference corpora were available online that would have eliminated my colleague’s concern regarding ‘wrong, horrible grammar and vocabulary.’ The remainder of this blog post will look at one such corpus – the British Academic Written English (BAWE) corpus, which is accessible via the following link: https://the.sketchengine.co.uk/open/
BAWE is particularly useful for EAP instructors because it represents target written language, especially for those based in the UK. The corpus available on Sketch Engine consists of 6,968,089 words distributed across 2,761 texts. There are roughly equal numbers of positively assessed student assignments ranging from first year undergraduate to Masters level across four broad disciplinary groupings (Nesi & Gardner, 2012:8). Thus, the corpus gives the EAP lecturer and students the opportunity to test their intuition regarding the use of lexical resources against an authentic target corpus – successful student papers submitted to UK universities.
This means that there is no need for EAP teachers to argue about whether a certain word or lexical bundle (strings or sequences of words that frequently occur together) is appropriate in disciplinary writing. We can simply check it using BAWE as the reference corpus. For the purpose of the demonstrative ‘walkthrough’ below, I have chosen to offer an answer to a question raised in training sessions on the pre-sessional course here at the University of Bristol in 2016 – the place of first person pronouns in academic writing.
Conducting a Simple query
First, navigate to Sketch Engine and select BAWE from the list of options. Then simply enter the word or lexical chunk that you want to investigate. Note that on the screenshot below (Figure 1) I have decoded the rather user-unfriendly filter options for Text types, which can be used to make the reference corpus more specific (e.g. by looking only at the papers written in the physical sciences).
Figure 1
After running this search, teachers who tell students that academic writing does not use we are in for a bit of a shock. The simple query search includes instances of us as well as we and combined they occur 15,718 times (or 1,885.50 per million words) in the corpus. These words are in fact used at a much greater frequency than some words EAP teachers often actively encourage students to use (try searching furthermore and you’ll find it occurs 1,319 times or 158.22 per million words, and in conclusion occurs 428 times or 51.34 per million words). This indicates that the instruction not to use first person pronouns is overly simplistic.
KWIC and Concordancing
Having discovered that we is frequently used in successful academic assignments, teachers and students could usefully explore how we and us are used in context. To do this, look at the key word in context (KWIC) which appears in red embedded in the listed concordance lines. These concordance lines can be studied and patterns of usage observed (Figure 2).
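For readers who want to see what a KWIC display is doing under the bonnet, the idea can be sketched in a few lines of Python. This is only an illustration on an invented three-sentence sample, not how Sketch Engine itself works: the function name kwic, the sample text and the 30-character context width are all my own assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one concordance line per whole-word match of `keyword`,
    with up to `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text (not taken from BAWE).
sample = ("In this essay we argue that tariffs reduce welfare. "
          "From the results we can see a clear trend. "
          "This leads us to a different conclusion.")

for line in kwic(sample, "we"):
    print(line)
```

Because the search is restricted to whole words, the ‘we’ inside ‘welfare’ is not counted as a hit. Unlike the Simple query described above, this sketch does not also retrieve us.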
For some direction on interpreting the functions of we, I recommend Tang and John (1999) who categorise ‘the writer identity in student academic writing through the first person pronoun’. They identify 6 different functions, including positioning the writer as: representative, guide through the essay, architect of the essay, recounter of the research process, opinion-holder and originator (with representative and guide being the most frequent).
Visualising data and Collocations tool
To further analyse the use of the word we, use the options down the left-hand side of the Sketch Engine screen. Listed under Frequency, the Text types option will give you graphical data (Figure 3) of the breakdown of usage across different types of discipline, text genre and author so you can, for example, discover that:
- we is used more often by 1st and 2nd year undergraduates than 3rd years
- it is used most often in Philosophy and Mathematics, but very rarely in Planning
- it is used most often in the ‘Methodology recount + Narrative recount’ genres
- it is used in greater relative frequency by L1 Welsh and Mongolian speakers
For an EAP teacher, this provides plenty of data about different specific contexts in which we is used in academic writing that can be usefully explored with students.
Another useful tool is Collocations and if the default settings are used, it is very clear that we collocates in the BAWE corpus with can and see, which suggests the lexical bundle we can see is frequently used (this could be tested by entering the whole lexical chunk into the Simple query and running another search as in Figure 4 below).
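The basic idea behind a collocation search can be illustrated with a rough Python sketch. It simply counts the words that appear within a few tokens of the search word, using raw counts rather than any statistical association measure a corpus tool might apply. The function name collocates, the window of three words either side and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def collocates(text, node, span=3):
    """Count words occurring within `span` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Collect the neighbours to the left and right of this hit.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text (not taken from BAWE).
sample = ("From the table we can see a rise. "
          "We can see that costs fall. "
          "Here we argue otherwise.")

for word, count in collocates(sample, "we").most_common(3):
    print(word, count)
```

On a sample this small the counts mean very little, but on a large corpus the same kind of tally is what brings frequent neighbours such as can and see to the top of the list.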
Using the Filters
Any of the filters can be used to narrow the range of the corpus (see Figure 1) so it is possible to discover, for example, that we occurs 623 times (or 74.73 per million) in Social Sciences – Economics and 98 times (11.76 per million) in Physical Sciences – Chemistry. When comparing statistics across differently filtered versions of the corpus, it is important to use the per million figure rather than the raw score so as to take account of the fact that employing different filters could create mini-corpora of significantly different sizes (convert per million into per thousand or a percentage if it makes it easier to conceptualise).
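The arithmetic behind the per million figure, and the conversions to per thousand and to a percentage, can be sketched as follows. The two sub-corpora and all of their numbers are invented purely to show why raw counts can mislead; they are not BAWE figures, and the function name per_million is my own.

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to a rate per million words."""
    return hits * 1_000_000 / corpus_size

# Hypothetical sub-corpora of very different sizes (invented numbers).
subcorpora = {
    "Discipline A": {"hits": 600, "size": 400_000},
    "Discipline B": {"hits": 300, "size": 100_000},
}

for name, data in subcorpora.items():
    rate = per_million(data["hits"], data["size"])
    print(f"{name}: {data['hits']} hits, {rate:.2f} per million, "
          f"{rate / 1_000:.2f} per thousand, {rate / 10_000:.3f}%")
```

In this made-up case Discipline A has twice as many raw hits as Discipline B, yet the word is actually used at half the rate there, which is exactly the kind of distortion the normalised figure guards against.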
To search for a specific word form (e.g. to exclude us), click on query type – word and enter your search item in the word form field (Figure 5). This returns 13,222 instances of we (1,586.08 per million).
It may also be useful to know that you can control the word form of a search item to some extent just by being aware of what you type into the simple query. For example, without using any filters a search of maintains will give you only instances of maintains, whereas searching maintain will return instances of multiple word forms (maintains, maintained, maintaining). You can also use * to indicate missing letters which allows you to enter the basic stem of a given word and broaden your search for different word forms (try it by running a search for analyse and analys*).
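The effect of the * wildcard can be imitated offline with Python’s standard fnmatch module, as in the rough sketch below. It illustrates only the wildcard behaviour, not the way the simple query expands maintain into its other word forms, and the function name wildcard_hits and the sample sentence are assumptions of mine.

```python
import re
from fnmatch import fnmatchcase

def wildcard_hits(text, pattern):
    """Return every word in `text` that matches a pattern in which
    * stands for any run of letters (including none)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if fnmatchcase(token, pattern.lower())]

# Invented sample sentence (not taken from BAWE).
sample = ("We analyse the data; the analysis shows that analysts "
          "analysed it while analysing trends.")

print(wildcard_hits(sample, "analyse"))
print(wildcard_hits(sample, "analys*"))
```

The first search finds only the exact form analyse, whereas the second also picks up analysis, analysts, analysed and analysing.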
Whilst BAWE provides a very useful reference corpus for exploring the use of lexis in successful academic student writing, it is important to remember that it is still non-expert writing and may contain grammar, spelling and other language errors. It cannot be assumed that the assignments are models of excellent writing, only that they are of sufficiently good quality to have been deemed successful. Furthermore, BAWE is clearly not exhaustive so if a particular search item does not return many hits, this does not necessarily mean that it is not academic. There could be other explanations such as a lack of papers on a particular topic area or insufficient instances of a particular word to reveal a comprehensive list of collocates. Nevertheless, if a search item turns up few or no instances and no other reason is identifiable, it certainly does suggest that it might be sensible to avoid its usage.
Taking it further
There ends my demonstration of how the BAWE reference corpus can be used by EAP teachers and students. Given the high numbers of Chinese students who currently make up the student body on EAP pre-sessionals, EAP teachers might want to run searches on common clichés like every coin has two sides, what’s more and with the development of the society/technology/economy.
For teachers who are really interested in exploring corpora further, whole texts can be compared to each other using tools available on Compleat Lexical Tutor. This website also allows you to conduct Key Word Analyses against a selection of reference corpora. Alternatively, you can request a copy of the original BAWE text files (or search for a suitable corpus that is readily available online and download it) to provide a reference corpus against which to compare your own corpus (e.g. a collection of your own students’ work or a collection of reading input teaching materials) using software like WordSmith Tools or AntConc. This, I believe, would be particularly useful for course developers and those interested in constructing a vocabulary-based EFL curriculum.
I’ll finish with a link to my own research that compared IELTS style reading input against a BAWE corpus narrowed by disciplinary field, but be warned – it is a heavy read! If you do brave a look, chapter five is where I present my results from the corpus-driven text analysis. I should also add that my EdD supervisor wrote a whole book dedicated to Chinese Students’ Writing in English: Implications from a corpus-driven study – useful material for pre-sessional providers.
Sinclair (2004) Developing Linguistic Corpora: a Guide to Good Practice
Tom Cobb’s Compleat Lexical Tutor
Mark Davies’ Word and Phrase
More tools & websites related to corpus linguistics
Coxhead, A. (2000) ‘A New Academic Word List’, TESOL Quarterly, 34(2), pp.213-238.
Hyland, K. (2009) ‘Writing in the disciplines: Research evidence for specificity’, Taiwan International ESP Journal, 1(1), pp.5-22.
Nesi, H. & Gardner, S. (2012) Genres across the Disciplines: Student Writing in Higher Education, Cambridge University Press.
Robb, T. (2003) ‘Google as a Quick ‘n Dirty Corpus Tool’, The Electronic Journal for English as a Second Language, 7(2), http://www.tesl-ej.org/wordpress/issues/volume7/ej26/ej26int/
Tang, R. and John, S. (1999) ‘The ‘I’ in identity: Exploring writer identity in student academic writing through the first person pronoun’, English for Specific Purposes 18, pp.S23-S39.