CELFS Teaching and Learning Network

Applying Corpus Tools to EAP Instruction

by Mat Terrett

My interest in the use of corpora for EAP began when I first heard about Coxhead’s (2000) Academic Word List and was further piqued when I read Hyland’s (2009) corpus-based argument for greater specificity in EAP teaching. This provided a focus for my own research through which I came to believe that an appropriate corpus could provide a very useful tool for teaching and learning EAP vocabulary. In this blog post I will share some of my experience in corpus linguistics and provide a ‘walkthrough’ example of how corpus tools can be used by both EAP teachers and students.

What is a corpus?

A corpus (literally a ‘body’) is a collection of texts – a body of texts. There are big corpora (the web as searched through Google being arguably the biggest) and smaller corpora that comprise collections of specific kinds of text (e.g. a collection of the works of one author). A reference corpus is an existing corpus against which a researcher (or EAP teacher or student) can check how particular words or sequences of words are used.

Why use a corpus?

Before demonstrating how a reference corpus can be used, it is useful to consider why an EAP teacher might want to use one. Essentially, it allows us to test our own intuitions about how lexical items are used and gives students a tool for checking their own vocabulary use. For example, as one of my former EAP colleagues in China explained:

When [students] have written something that just is not an English expression, I say to them, ‘have you tried putting this through Google…to see what comes up?’ And either it just doesn’t come up or it comes up all Chinese language websites and they realise they’re not using it appropriately…very quick – just using the web as your very quick reference corpus…You have to be careful with that obviously because there are a lot of people using wrong, horrible grammar and vocabulary.

(see also Robb, 2003)

Wouldn’t it be useful if there were specific corpora against which we could reference EAP lexis? Well, the quote above was from 2010, in a university where only a few EAP teachers (out of a total of 35) reported being aware of corpora. Even then, had we known about them, more specific reference corpora were available online that would have eliminated my colleague’s concern regarding ‘wrong, horrible grammar and vocabulary’. The remainder of this blog post will look at one such corpus – the British Academic Written English (BAWE) corpus, which is accessible via the following link: https://the.sketchengine.co.uk/open/

BAWE is particularly useful for EAP instructors because it represents target written language, especially for those based in the UK. The corpus available on Sketch Engine consists of 6,968,089 words distributed across 2,761 texts. There are roughly equal numbers of positively assessed student assignments ranging from first year undergraduate to Masters level across four broad disciplinary groupings (Nesi & Gardner, 2012:8). Thus, the corpus gives the EAP lecturer and students the opportunity to test their intuition regarding the use of lexical resources against an authentic target corpus – successful student papers submitted to UK universities.

This means that there is no need for EAP teachers to argue about whether a certain word or lexical bundle (a string of words that frequently occur together) is appropriate in disciplinary writing. We can simply check it using BAWE as the reference corpus. For the purpose of the demonstrative ‘walkthrough’ below, I have chosen to offer an answer to a question raised in training sessions on the pre-sessional course here at the University of Bristol in 2016 – the place of first person pronouns in academic writing.


For clarity I have used italics for lexical items explored through the corpus tools and bold to signify words with operational functions on the website.

Conducting a Simple query

First navigate to Sketch Engine and select BAWE from the list of options. Then simply enter the word or lexical chunk that you want to investigate. Note that on the screenshot below (Figure 1) I have decoded the rather user-unfriendly filter options for Text types, which can be used to make the reference corpus more specific (e.g. by looking only at the papers written in the physical sciences).


Figure 1

After running this search, teachers who tell students that academic writing does not use we are in for a bit of a shock. The simple query search includes instances of us as well as we, and combined they occur 15,718 times (or 1,885.50 per million) in the corpus. These words are in fact used far more frequently than some words EAP teachers often actively encourage students to use: try searching furthermore and you’ll find it occurs 1,319 times (158.22 per million), roughly a twelfth as often, while in conclusion occurs 428 times (51.34 per million). This indicates that the instruction not to use first person pronouns is overly simplistic.

KWIC and Concordancing

Having discovered that we is frequently used in successful academic assignments, teachers and students could usefully explore how we and us are used in context. To do this, look at the key word in context (KWIC) which appears in red embedded in the listed concordance lines. These concordance lines can be studied and patterns of usage observed (Figure 2).


Figure 2
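For readers curious about what a concordancer is doing, the KWIC display described above can be imitated in a few lines of Python. This is only a minimal sketch, not how Sketch Engine itself works: the function name kwic, the width parameter and the three sample sentences are all invented for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return concordance lines with the keyword centred between its contexts."""
    lines = []
    # \b restricts the match to whole words, so "we" does not match "were".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context and left-align the right context
        # so that the keywords line up in a single column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines


# Invented sample sentences, not taken from BAWE.
sample = (
    "In this essay we will argue that the model is incomplete. "
    "As we can see from the data, the trend is clear. "
    "The results show that we cannot reject the hypothesis."
)
concordance = kwic(sample, "we")
for line in concordance:
    print(line)
```

Lining the keyword up in one column is what makes recurring patterns to its left and right easy to spot when scanning down the page.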

For some direction on interpreting the functions of we, I recommend Tang and John (1999), who categorise ‘the writer identity in student academic writing through the first person pronoun’. They identify six roles in which the pronoun positions the writer: representative, guide through the essay, architect of the essay, recounter of the research process, opinion-holder and originator (with representative and guide being the most frequent).

Visualising data and Collocations tool

To further analyse the use of the word we, use the options down the left-hand side of the Sketch Engine screen. Listed under Frequency, the Text types option will give you graphical data (Figure 3) showing the breakdown of usage across different disciplines, text genres and types of author so you can, for example, discover that:

  • we is used more often by 1st and 2nd year undergraduates than 3rd years
  • it is used most often in Philosophy and Mathematics, but very rarely in Planning
  • it is used most often in the ‘Methodology recount + Narrative recount’ genres
  • it is used in greater relative frequency by L1 Welsh and Mongolian speakers

For an EAP teacher, this provides plenty of data about different specific contexts in which we is used in academic writing that can be usefully explored with students.


Figure 3

Another useful tool is Collocations. With the default settings, it shows very clearly that we collocates in the BAWE corpus with can and see, which suggests that the lexical bundle we can see is frequently used. This can be tested by entering the whole lexical chunk into the Simple query and running another search (Figure 4).


Figure 4
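The basic idea behind a collocation search, counting which words turn up near the search word, can be sketched in Python as follows. This is a simplification under stated assumptions: it ranks by raw co-occurrence counts within a window of two words either side, whereas I would expect the website to rank collocates with statistical association scores that this sketch does not attempt to reproduce. The function name, the window size and the sample sentences are invented.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count words occurring within `window` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Neighbours to the left and right, excluding the node itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented sample sentences, not taken from BAWE.
sample = (
    "From the graph we can see a clear rise. "
    "We can see that demand fell. "
    "Here we can observe the same effect."
)
counts = collocates(sample, "we")
top = counts.most_common(2)
print(top)
```

In this toy sample, can and see come out on top, which mirrors the we can see pattern reported for BAWE, although three invented sentences obviously prove nothing about real usage.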

Using the Filters

Any of the filters can be used to narrow the range of the corpus (see Figure 1), so it is possible to discover, for example, that we occurs 623 times (or 74.73 per million) in Social Sciences – Economics and 98 times (11.76 per million) in Physical Sciences – Chemistry. When comparing results from differently filtered searches, it is important to use the per million figure rather than the raw count, to take account of the fact that different filters can create mini-corpora of significantly different sizes (convert per million into per thousand or a percentage if that makes it easier to conceptualise).
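The per million figure is simply the raw count divided by the size of the corpus and multiplied by one million. As a rough check, the Python sketch below works backwards from the figures quoted in this post to the corpus size each one implies. The function names are my own, and the interpretation that follows is an inference from the arithmetic rather than something I have confirmed in the Sketch Engine documentation.

```python
def per_million(hits, corpus_size):
    """Normalise a raw count to a frequency per million."""
    return hits / corpus_size * 1_000_000


def implied_size(hits, rate_per_million):
    """Work backwards from a quoted rate to the denominator it implies."""
    return hits / rate_per_million * 1_000_000


# (raw hits, per million figure) pairs as quoted in this post.
quoted = {
    "we/us (whole corpus)": (15718, 1885.50),
    "furthermore": (1319, 158.22),
    "in conclusion": (428, 51.34),
    "we (Economics filter)": (623, 74.73),
    "we (Chemistry filter)": (98, 11.76),
}

sizes = {label: implied_size(hits, rate) for label, (hits, rate) in quoted.items()}
for label, size in sizes.items():
    print(f"{label}: implied denominator of about {size:,.0f}")
```

All five pairs imply a denominator of roughly 8.3 million. That is larger than the word count given earlier (presumably because punctuation and other tokens are counted too), and it is the same for the filtered Economics and Chemistry searches as for the whole corpus. If that reading is right, the per million figures for filtered searches are relative to the whole corpus, so a fair comparison between disciplines would also need the size of each filtered sub-corpus, to which per_million could then be applied.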

To search for a specific word form (e.g. to exclude us), click on query type – word and enter your search item in the word form field (Figure 5). This returns 13,222 instances of we (1,586.08 per million).


Figure 5

It may also be useful to know that you can control the word form of a search item to some extent just by being aware of what you type into the simple query. For example, without using any filters, a search for maintains will give you only instances of maintains, whereas searching for maintain will return instances of multiple word forms (maintains, maintained, maintaining). You can also use * to indicate missing letters, which allows you to enter the basic stem of a given word and broaden your search to different word forms (try it by running searches for analyse and analys*).
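To get a feel for what a wildcard such as analys* will and will not catch, the same kind of pattern can be tried in Python. This sketch uses fnmatchcase from Python's standard fnmatch module to imitate the wildcard; the website's own query syntax may differ in its details, and the word list is invented for illustration.

```python
from fnmatch import fnmatchcase

# An invented list of word forms to test the pattern against.
word_forms = [
    "analyse", "analysed", "analyses", "analysing",
    "analysis", "analyst", "analyze", "catalyst",
]

# "*" stands for any run of letters (including none) after the stem.
matches = [word for word in word_forms if fnmatchcase(word, "analys*")]
print(matches)
```

Note that the wildcard is broader than a search for different forms of the verb: it also picks up the nouns analysis and analyst, while missing the alternative spelling analyze.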

A Caveat

Whilst BAWE provides a very useful reference corpus for exploring the use of lexis in successful academic student writing, it is important to remember that it is still non-expert writing and may contain grammar, spelling and other language errors. It cannot be assumed that the assignments are models of excellent writing, only that they are of sufficiently good quality to have been deemed successful. Furthermore, BAWE is clearly not exhaustive, so if a particular search item does not return many hits, this does not necessarily mean that it is not academic. There could be other explanations, such as a lack of papers on a particular topic area or insufficient instances of a particular word to reveal a comprehensive list of collocates. Nevertheless, if a search item turns up few or no instances and no other reason is identifiable, this does suggest that it might be sensible to avoid using it.

Taking it further

There ends my demonstration of how the BAWE reference corpus can be used by EAP teachers and students. Given the high numbers of Chinese students who currently make up the student body on EAP pre-sessionals, EAP teachers might want to run searches on common clichés like every coin has two sides, what’s more and with the development of the society/technology/economy.

For teachers who are really interested in exploring corpora further, whole texts can be compared to each other using tools available on Compleat Lexical Tutor. This website also allows you to conduct Key Word Analyses against a selection of reference corpora. Alternatively, you can request a copy of the original BAWE text files (or find and download another suitable corpus that is readily available online) to provide a reference corpus against which to compare your own corpus (e.g. a collection of your own students’ work or of reading input from teaching materials) using software like WordSmith Tools or AntConc. This, I believe, would be particularly useful for course developers and those interested in constructing a vocabulary-based EFL curriculum.
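A Key Word Analysis asks whether a word is unusually frequent in your own corpus compared with a reference corpus. One commonly used keyness statistic is log-likelihood, sketched below in Python. I am assuming this statistic purely for illustration; it is not necessarily what any of the tools named above calculate by default, and all of the counts in the example are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of a word in a study corpus vs a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    # A zero count contributes nothing (and log(0) is undefined).
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


# Hypothetical counts: a word occurring 50 times in a 100,000-word
# collection of student work and 10 times in a reference corpus of equal size.
keyness = log_likelihood(50, 100_000, 10, 100_000)
print(round(keyness, 2))
```

A score of zero means the word is equally frequent in both corpora; scores above 3.84 are conventionally treated as significant at the 5% level. The statistic does not say in which direction the difference lies, so the per million frequencies still need to be compared to see whether the word is over-used or under-used.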

I’ll finish with a link to my own research that compared IELTS-style reading input against a BAWE corpus narrowed by disciplinary field, but be warned – it is a heavy read! If you do brave a look, chapter five is where I present my results from the corpus-driven text analysis. I should also add that my EdD supervisor wrote a whole book dedicated to Chinese Students’ Writing in English: Implications from a corpus-driven study – useful material for pre-sessional providers.

Additional Resources

Sinclair (2004) Developing Linguistic Corpora: a Guide to Good Practice

Tom Cobb’s Compleat Lexical Tutor

Mark Davies’ Word and Phrase

More tools & websites related to corpus linguistics


Coxhead, A. (2000) ‘A New Academic Word List’, TESOL Quarterly, 34(2), pp.213-238.

Hyland, K. (2009) ‘Writing in the disciplines: Research evidence for specificity’, Taiwan International ESP Journal, 1(1), pp.5-22.

Nesi, H. & Gardner, S. (2012) Genres across the Disciplines: Student Writing in Higher Education, Cambridge University Press.

Robb, T. (2003) ‘Google as a Quick ‘n Dirty Corpus Tool’, The Electronic Journal for English as a Second Language, 7(2), http://www.tesl-ej.org/wordpress/issues/volume7/ej26/ej26int/

Tang, R. & John, S. (1999) ‘The ‘I’ in identity: Exploring writer identity in student academic writing through the first person pronoun’, English for Specific Purposes, 18, pp.S23-S39.


4 Responses to “Applying Corpus Tools to EAP Instruction”

  1. eflnotes

    Hi, nice post and a great reminder of the open corpora available on the SkE interface.
    Your readers may be interested in the G+ CL community http://bit.ly/CLcomm where various news/tools/issues etc. related to CL in language teaching/learning are posted.


  2. Simon Smith

    Very interesting post with great tips for using BAWE and helpful annotations of the interface. Your ideas seem to mainly target teachers, but students quite like it too (in small doses, though; and they have to be clear that they are not allowed to copy chunks!).

    I’m not really clear what the issue is (in PSE circles) regarding ‘we/us’. Academic writers (whether professionals, especially in hard sciences, or students) are sometimes advised to avoid the use of ‘I/me’. In some circumstances, this may lead writers to refer to themselves, a single author, as ‘we/us’, as if they were royalty. I think it is this practice that EAP/PSE should be focusing on stamping out, and I think that that is probably why the idea of coaching students not to use the first person came about.

    In BAWE, ‘we/us’ seems to be used mostly to refer to either people in general, or the collective group consisting of the writer and the potential readers of the piece of work. Logically, both these groups include the author and legitimately can be labelled ‘we/us’.

    ‘I/me’, in BAWE as anywhere else, is used to refer to the author alone. Its frequency is barely lower than that of ‘we/us’, interestingly. It is found 597 times per million in Arts and Humanities, and about half that in Physical Sciences (where in fact the word ‘I’ is sometimes used for variables and other things, not as a pronoun).

    So we can conclude that an inclusive ‘we/us’ is fine in academic writing, and a self-referential ‘I/me’ is also perfectly acceptable.

    My caveat would be this, however: Most of the contributors to the BAWE corpus are native English speakers, and the majority of those that are not have received their secondary education in the UK. Getting the nuance and tone of first person usage correct in writing is not easy for international students, and would need lots and lots of examples, the presentation of which could be tedious. Arguably, then, a simple blanket ban could be the simplest approach…

    (Have you ever managed to get EAP students to use ‘Besides,’ correctly? I simply ban its use!)


    • MAT

      Interesting that you mention ‘besides’ because I had a couple of students on the pre-sessional who used the word incorrectly and I asked them to look at the concordance lines generated through BAWE. To my surprise, they were able to articulate the subtleties of how the word is used and then decided that they had actually meant something like ‘furthermore’. They both decided to avoid using ‘besides’ so the result was the same as my banning it but with the advantage that they took ownership of that decision. I might try that again next year…


  3. MAT

    Update: I just went through the example using ‘we’ in class. Whilst the students were still reeling in shock at having discovered that it is used with some frequency in academic writing, I followed up with a search for ‘As we all know’ (a very common expression in foundation year/IELTS prep Chinese student writing). The resultant ‘no hits’ clearly demonstrated that it’s not a problem with the word(s) but with the usage.



