Corpus del Español Actual (CEA) / The Corpus of Contemporary Spanish

Powered by CQPweb

To access CEA use username 'guest' and password 'guest'

How to cite this corpus:
Carlos Subirats and Marc Ortega. 2012. Corpus del Español Actual <http://spanishfn.org/tools/cea/english>

Description of the Corpus:

The Corpus del Español Actual (the Corpus of Contemporary Spanish) contains 540 million words, which have been lemmatized and tagged with detailed part-of-speech information. The CEA is made up of the following texts:

The CEA was tagged using an online Spanish dictionary containing 635,000 wordforms, which was automatically generated from a dictionary of 86,000 single-word lemmas (e.g., unir, inmoralidad, allí) and 26,000 multiword lemmas (e.g., muerte cerebral, carga de profundidad, de armas tomar) (Subirats 1989, 1992, 1994a, 1994b; Mogorrón 1994; Garrido 1999; Bobes 2000). Tag disambiguation was carried out by intersecting finite-state automata that encode lexical and syntactic information (Subirats 1998; Subirats and Ortega 2000, 2001; Ortega in progress).
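As a rough illustration of the idea behind disambiguation by intersecting constraints, the following toy sketch may help. It is not the CEA's tagger: the three-word lexicon, the tag names (DET, NOUN, VERB, ADJ), and the two constraints are invented for the example, and each constraint is written as a Python regular expression over tag sequences standing in for a finite-state automaton. A tag sequence survives only if every constraint accepts it, which corresponds to intersecting the constraint languages.

```python
import re
from itertools import product

# Hypothetical toy lexicon (not the CEA dictionary or tagset):
# each wordform maps to the set of tags it could carry.
LEXICON = {
    "el": {"DET"},
    "vino": {"NOUN", "VERB"},   # "wine" or "(he/she) came"
    "tinto": {"ADJ", "NOUN"},
}

# Each pattern describes tag sequences that one constraint rejects.
# The constraint's language is everything that does NOT contain the pattern.
FORBIDDEN = [
    re.compile(r"\bDET VERB\b"),   # toy rule: no verb right after a determiner
    re.compile(r"\bNOUN NOUN\b"),  # toy rule: no two nouns in a row
]

def disambiguate(words):
    """Return the tag sequences accepted by every constraint."""
    # All combinations of candidate tags, one tag per word.
    candidates = product(*(sorted(LEXICON[w]) for w in words))
    survivors = []
    for tags in candidates:
        joined = " ".join(tags)
        # Intersection: the sequence must be accepted by all constraints.
        if not any(pattern.search(joined) for pattern in FORBIDDEN):
            survivors.append(tags)
    return survivors

print(disambiguate(["el", "vino", "tinto"]))
```

A real system would compile the lexicon and the constraints into automata and intersect those directly instead of enumerating every candidate sequence, which grows exponentially with sentence length.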

Searching the CEA:

The query interface for the CEA is CQPweb, which uses some of the components of the IMS Open Corpus Workbench (CWB), a set of open-source tools for managing and searching large corpora that includes the Corpus Query Processor (CQP). To learn more about how to use CQPweb, you can consult the IMS's brief description of the regular-expression syntax used by CQP and their list of sample queries. If you wish to define your query in terms of grammatical and inflectional categories, you can use the part-of-speech tags listed on the CEA's Corpus Tags page.

CQPweb can be used to search for words, lemmas, or constructions. (The menu on the left-hand side of the query page lists all of the functions of CQPweb.) Searches on single wordforms are the most straightforward; you can simply type the word you are looking for in the query window. To query lemmas or constructions, you must use regular expressions. For example, to see the instances of all wordforms associated with the lemma amar, you would use the regular expression [lemma="amar"]. To find instances of a complex construction consisting of any wordform of the lemma sorprender, followed by the preposition a, followed by zero to five other words, followed by an infinitive verb, you would use the regular expression [lemma="sorprender"] [word="a"] []{0,5} [pos="V.*INF"]. CQPweb can also be used to calculate frequencies within query results and to create collocation lists for detailed contextual analysis. Query results can be downloaded, with or without part-of-speech tagging.
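When many similar queries are needed, it can be convenient to assemble them programmatically and paste the result into the query window. The sketch below is one possible way to do that; the helper names (token, repeat, sequence) are made up for this example and are not part of CQPweb or the CWB. It only builds query strings and does not contact the server or validate the tags used.

```python
def token(**attrs):
    """Build one token expression, e.g. [lemma="amar"]; no attributes gives []."""
    # Several attribute tests on the same token are joined with "&".
    # Assumption: values contain no double quotes (no escaping is done here).
    inner = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{inner}]"

def repeat(expr, low, high):
    """Attach a repetition range to an expression, e.g. []{0,5}."""
    return f"{expr}{{{low},{high}}}"

def sequence(*parts):
    """Join token expressions into one query, separated by spaces."""
    return " ".join(parts)

# All wordforms of the lemma "amar".
lemma_query = token(lemma="amar")

# sorprender + a + up to five arbitrary tokens + infinitive verb.
construction_query = sequence(
    token(lemma="sorprender"),
    token(word="a"),
    repeat(token(), 0, 5),
    token(pos="V.*INF"),
)

print(lemma_query)
print(construction_query)
```

The two printed strings correspond to the lemma query and the sorprender construction query described above.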

Acknowledgments:

The compilation of the dictionary of single-word and multiword lemmas and of the dictionary of inflected forms, as well as the development of the initial tagging applications, were carried out by the Computational Linguistics Laboratory at the Universidad Autónoma de Barcelona (Autonomous University of Barcelona) in Spain with financial support from the Spanish Ministry of Education, Culture, and Sports (grant numbers CAICYT PB85-371, CICYT PB87-780, and PB92-0635) and the Ministry of Public Works and Transportation (grant TIC90-403). The development of the finite-state transducers used in the dictionary generation, the creation of the integrated system for lemmatization and part-of-speech tagging of single- and multiword units, and the development of the system for lexical and syntactic analysis based on finite-state automata and transducers were carried out with the financial support of the Ministry of Education (grants TIC96-0804 and TIC1999-0753).


References: