Corpus del Español Actual (CEA) / The Corpus of Contemporary Spanish
Powered by CQPweb
To access the CEA, use the username 'guest' and the password 'guest'.
How to cite this corpus:
Carlos Subirats and Marc Ortega. 2012. Corpus del Español Actual <http://spanishfn.org/tools/cea/english>
Description of the Corpus:
The Corpus del Español Actual (the Corpus of Contemporary Spanish) contains 540 million words, which have been lemmatized and tagged with detailed part-of-speech information. The CEA is made up of the following texts:
- The Spanish part of the eleven-language parallel corpus Europarl: European Parliament Proceedings Parallel Corpus, v. 6 (1996-2010);
- The Spanish portion of the trilingual Wikicorpus, v. 1.0, which was extracted from a snapshot of Wikipedia (2006); and
- The Spanish part of the seven-language parallel corpus MultiUN: Multilingual UN Parallel Text 2000-2009, a corpus made up of the resolutions of the United Nations.
The CEA was tagged using an online Spanish dictionary containing 635,000 wordforms, which was automatically generated from a dictionary of 86,000 single-word lemmas (e.g., unir, inmoralidad, allí) and 26,000 multiword lemmas (e.g., muerte cerebral, carga de profundidad, de armas tomar) (Subirats 1989, 1992, 1994a, 1994b; Mogorrón 1994; Garrido 1999; Bobes 2000). Tag disambiguation was carried out with intersecting finite-state automata using lexical and syntactic information (Subirats 1998; Subirats and Ortega 2000, 2001; Ortega in progress).
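As a rough illustration of the general idea of disambiguating tags by intersecting automata, the following toy sketch intersects a sentence lattice (each word with its candidate tags) with a small constraint automaton that rules out certain tag sequences. It is not the CEA's actual system: the lexicon, the tag names (DET, NOUN, VERB), the constraints, and the function name disambiguate are all hypothetical and chosen only for this example.

```python
# Toy sketch only: NOT the CEA tagger. Lexicon, tags, and constraints
# are hypothetical and exist purely to illustrate the technique.

# Hypothetical lexicon: wordform -> set of candidate tags.
LEXICON = {
    "el": {"DET"},
    "vino": {"NOUN", "VERB"},   # "wine" or "he/she came"
    "llegó": {"VERB"},
}

# Constraint automaton over tags. Its states are "the last tag seen"
# (plus START); a transition is rejected if the tag pair is forbidden.
FORBIDDEN = {("DET", "VERB"), ("DET", "DET")}


def disambiguate(words):
    """Return every tag sequence accepted by both automata.

    The sentence lattice advances one word at a time; the constraint
    automaton advances one tag at a time. Tracking both together is
    the product (intersection) construction.
    """
    paths = [("START", [])]  # (constraint state, tags chosen so far)
    for word in words:
        next_paths = []
        for state, tags in paths:
            for tag in sorted(LEXICON[word]):
                if (state, tag) not in FORBIDDEN:
                    next_paths.append((tag, tags + [tag]))
        paths = next_paths
    return [tags for _, tags in paths]


print(disambiguate(["el", "vino", "llegó"]))
```

In this toy setting, the reading of vino as a verb is discarded after el because the constraint automaton has no transition from DET to VERB, while the isolated word vino remains ambiguous.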
Searching the CEA:
The query interface for the CEA is CQPweb, which uses some of the components of the IMS Open Corpus Workbench (CWB), a set of open-source tools for managing and searching large corpora that includes the Corpus Query Processor (CQP). To learn more about how to use CQPweb, you can consult the IMS's brief description of the regular-expression syntax used by CQP and its list of sample queries. If you wish to define your query in terms of grammatical and inflectional categories, you can use the part-of-speech tags listed on the CEA's Corpus Tags page.
CQPweb can be used to search for words, lemmas, or constructions. (The menu on the left-hand side of the query page lists all of the functions of CQPweb.) Searches on single wordforms are the most straightforward: you can simply type the word you are looking for in the query window. To query on lemmas or constructions, you must use regular expressions. For example, to see the instances of all wordforms associated with the lemma amar, you would use the regular expression (RE) [lemma="amar"]. To find instances of a complex construction consisting of any wordform of the lemma sorprender, followed by the preposition a, followed by zero to five other words, followed by an infinitive verb, you would use the regular expression [lemma="sorprender"] [word="a"] []{0,5} [pos="V.*INF"]. CQPweb can also be used to calculate frequencies within query results and to create collocation lists for detailed contextual analysis. Query results can be downloaded, with or without part-of-speech tagging.
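For users who prepare many queries of this kind, the two examples above can be assembled programmatically before being pasted into the query window. The sketch below is only a string-building convenience: the helper names token and gap are hypothetical, it does not contact CQPweb, and it writes the quantifier directly after the empty token expression, as in []{0,5}.

```python
# Minimal sketch: builds CQP query strings only; it does not talk to
# CQPweb. The helper names `token` and `gap` are hypothetical.

def token(**attrs):
    """Return a CQP token expression such as [lemma="amar"].

    Several attributes are joined with '&'; no attributes gives [],
    which matches any token.
    """
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def gap(min_tokens, max_tokens):
    """Return an expression matching min..max arbitrary tokens."""
    return f"[]{{{min_tokens},{max_tokens}}}"


# All wordforms of the lemma "amar".
lemma_query = token(lemma="amar")

# sorprender + "a" + zero to five words + an infinitive verb.
construction_query = " ".join([
    token(lemma="sorprender"),
    token(word="a"),
    gap(0, 5),
    token(pos="V.*INF"),
])

print(lemma_query)
print(construction_query)
```

The pattern passed as pos is copied from the example above; other tag patterns should be taken from the CEA's Corpus Tags page.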
Acknowledgments:
The compilation of the dictionary of single-word and multiword lemmas and of the dictionary of inflected forms, as well as the development of the initial tagging applications, were carried out by the Computational Linguistics Laboratory at the Universidad Autónoma de Barcelona (Autonomous University of Barcelona) in Spain with financial support from the Spanish Ministry of Education, Culture, and Sports (grant numbers CAICYT PB85-371, CICYT PB87-780, and PB92-0635) and the Ministry of Public Works and Transportation (grant TIC90-403). The development of the finite-state transducers used in the dictionary generation, the creation of the integrated system for lemmatization and part-of-speech tagging of single- and multiword units, and the development of the system for lexical and syntactic analysis based on finite-state automata and transducers were carried out with the financial support of the Ministry of Education (grants TIC96-0804 and TIC1999-0753).
References:
- Bobes, Eulàlia de. 2000. Gramática electrónica de las locuciones verbales. Laboratorio de Lingüística Informática, Universidad Autónoma de Barcelona.
- Garrido, Paloma. 1999. Estudio sintáctico del adverbio fijo en predicados comparativos. Estudios de Lingüística del Español 7.
- Ríos, Antonio. 1999. La transcripción fonética automática del Diccionario Electrónico de Formas Simples Flexivas del Español: Un estudio fonológico en el léxico. Estudios de Lingüística del Español 4.
- Mogorrón, Pedro. 1994. Estudio contrastivo de las frases 'ser/estar + Prep X' en español y 'être + Prep X' en francés. Universidad de Valencia doctoral dissertation.
- Ortega, Marc. 2000. Transductores en el análisis léxico y sintáctico de un texto. Departamento de Informática y Laboratorio de Lingüística Informática, Universidad Autónoma de Barcelona.
- Subirats, Carlos. 1987. El Diccionario Electrónico del Español. Procesamiento del Lenguaje Natural. Número monográfico sobre las III Jornadas SEPLN, Boletín No. 5, pp. 63-72.
- Subirats, Carlos. 1989. Verbal morphology in the Electronic Dictionary of Spanish. Lingvisticae Investigationes 13.1: 179-201.
- Subirats, Carlos. 1992. Verbal, nominal and adjectival inflection in the Electronic Dictionary of Simple Forms of Spanish. Lingvisticae Investigationes 16.2: 345-371.
- Subirats, Carlos. 1994a. Sistema de Diccionarios Electrónicos del Español. Actas del Congreso de la Lengua Española, Sevilla, 1992. Madrid: Instituto Cervantes. 316-330.
- Subirats, Carlos. 1994b. La flexión nominal en el Diccionario Electrónico de Formas Compuestas del Español. Lingua Franca 1: 63-69.
- Subirats, Carlos. 1998. Automatic extraction of textual information in Spanish. Language Design: Journal of Theoretical and Experimental Linguistics 1: 1-13.
- Subirats, Carlos; Ortega, Marc. 2000. Tratamiento automático de la información textual en español mediante bases de información lingüística y transductores. Estudios de Lingüística del Español 10.
- Subirats, Carlos; Ortega, Marc. 2001. Extracción automática de información de grandes corpus. In Josse De Kock (ed.), Lingüística con corpus: Catorce aplicaciones sobre el español. Salamanca: Ediciones Universidad de Salamanca. 155-175.