Korpus C4

Welcome to Corpus C4, the joint corpus of the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS), the Austrian Academy Corpus (AAC), the Korpus Südtirol and the Swiss Text Corpus (CHTK).

These pages provide general information about Corpus C4 as well as links to the corpus search interface.

Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)

The DWDS project aims to develop a dictionary system based on large, balanced electronic corpora. It is planned in three consecutive phases: building the text corpora, processing the data with computational-linguistic techniques, and lexicographical work. The publicly available data is linguistically pre-processed and can be accessed online (www.dwds.de). Since 2007, DWDS has been an Academy Project of the BBAW.

Austrian Academy Corpus (AAC)

The Austrian Academy Corpus (AAC) is being compiled in Austria as part of a national corpus programme that pays particular attention to "preservation of cultural heritage" and "text technology". On completion, the AAC will contain about 1 billion running words.

Korpus Südtirol

The 'Korpus Südtirol' initiative aims to collect and archive South Tyrolean texts and to process them with corpus-linguistic methods, in order to make them available to the public and to document the use of written German in South Tyrol. It was started in September 2005 by researchers from the Free University of Bolzano, the European Academy Bozen/Bolzano and the University of Innsbruck.

The resulting corpus now allows the authentic language data to be evaluated according to different criteria, so that it can be used for studies in historical linguistics, sociolinguistics, contact linguistics and language varieties. It will also serve as a basis for language teaching and consulting. The texts collected in South Tyrol, as well as further corpora in other languages, can be investigated through a user-friendly corpus search interface, which offers the opportunity to strengthen people's language awareness within the multilingual environment of South Tyrol.

Swiss Text Corpus (CHTK)

The Swiss sub-project Swiss Text Corpus (CHTK) contains German-language texts written by Swiss authors in the 20th century. This digital collection is structured in the same way as the partner projects in Germany, Austria and Italy (using the same formal, temporal and content criteria). It is a balanced representation of standard German vocabulary in Switzerland and can serve as a base resource for specifically Swiss lexicographical needs.

The Swiss Text Corpus was built by a research group at the Deutsches Seminar of the University of Basel and was funded mainly by the Swiss National Science Foundation. Since 2014 it has been hosted by the Schweizerisches Idiotikon with financial support from the Swiss Academy of Humanities and Social Sciences.

Sub-corpus DWDS

	documents 1900-1924	words 1900-1924	documents 1925-1949	words 1925-1949	documents 1950-1974	words 1950-1974	documents 1975-1999	words 1975-1999	total documents
functional texts	51	1'003'921	14	1'016'655	22	1'002'770	13	1'002'308	100
science	26	1'018'092	11	1'013'052	15	1'013'032	14	1'030'364	66
fiction	58	1'507'331	49	1'500'076	14	1'501'462	22	1'500'048	143
journalistic texts	NA	1'508'012	NA	1'503'202	NA	1'500'002	NA	1'500'000	NA
total		5'037'356		5'032'985		5'017'266		5'032'720

Sub-corpus AAC

	documents 1900-1924	words 1900-1924	documents 1925-1949	words 1925-1949	documents 1950-1974	words 1950-1974	documents 1975-1999	words 1975-1999	total documents
functional texts	2	106'275	1	27'137	0	0	0	0	3
factual texts	23	1'811'433	16	794'136	1	27'037	0	0	40
fiction	4	153'754	0	0	0	0	0	0	4
journalistic texts	2	363'309	4	355'019	0	0	0	0	6
	5	320'344	2	92'399	1	65'275	0	0	8
total	36	2'755'115	23	1'268'691	2	92'312	0	0	61

Sub-corpus Korpus Südtirol

	documents 1900-1924	words 1900-1924	documents 1925-1949	words 1925-1949	documents 1950-1974	words 1950-1974	documents 1975-1999	words 1975-1999	total documents
functional texts	946	259'842	0	0	8	152'209	2	10'280	956
factual texts	0	0	0	0	3	70'955	3	126'489	6
fiction	0	0	1	52'975	0	0	21	222'546	22
journalistic texts	0	0	0	0	58	75'040	383	447'463	441
total	946	259'842	1	52'975	69	298'204	409	806'778	1'425

Sub-corpus CHTK

	documents 1900-1924	words 1900-1924	documents 1925-1949	words 1925-1949	documents 1950-1974	words 1950-1974	documents 1975-1999	words 1975-1999	total documents
functional texts	1'042	1'170'099	1'465	1'267'731	969	1'193'200	1'417	1'087'395	4'893
factual texts	167	1'450'562	433	2'052'909	804	1'954'529	276	1'891'373	1'680
fiction	188	1'116'820	50	1'248'911	159	1'122'447	59	1'149'111	456
journalistic texts	833	513'728	1'107	1'020'160	993	982'098	1'929	1'135'426	4'862
total	2'230	4'251'209	3'055	5'589'711	2'925	5'252'274	3'681	5'263'305	11'891
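
The counts in these tables use an apostrophe as the thousands separator (for example 1'170'099), which most tools will not read as a number directly. The following is a minimal sketch, not part of the Corpus C4 tooling, of how the word counts of the CHTK table could be parsed and checked against the stated totals row. The names parse_count, period_totals, CHTK_WORDS and STATED_TOTALS are chosen here for illustration only; the figures are copied from the table above.

```python
def parse_count(cell: str) -> int:
    """Convert a count such as "1'170'099" into an integer."""
    return int(cell.replace("'", ""))


# Word counts per period (1900-1924, 1925-1949, 1950-1974, 1975-1999),
# copied from the CHTK table. The variable names are illustrative only.
CHTK_WORDS = {
    "functional texts": ["1'170'099", "1'267'731", "1'193'200", "1'087'395"],
    "factual texts": ["1'450'562", "2'052'909", "1'954'529", "1'891'373"],
    "fiction": ["1'116'820", "1'248'911", "1'122'447", "1'149'111"],
    "journalistic texts": ["513'728", "1'020'160", "982'098", "1'135'426"],
}

# The "total" row of the same table.
STATED_TOTALS = ["4'251'209", "5'589'711", "5'252'274", "5'263'305"]


def period_totals(rows: dict[str, list[str]]) -> list[int]:
    """Sum the word counts of all text types for each period."""
    parsed = [[parse_count(cell) for cell in row] for row in rows.values()]
    return [sum(column) for column in zip(*parsed)]


if __name__ == "__main__":
    computed = period_totals(CHTK_WORDS)
    stated = [parse_count(cell) for cell in STATED_TOTALS]
    print("per-period totals match:", computed == stated)
    print("words in sub-corpus:", sum(computed))
```

Summing the four period totals gives the overall size of the sub-corpus in running words.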

Project structure

The members of the Corpus C4 initiative are the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) in Berlin, the Austrian Academy Corpus (AAC) in Vienna, the Korpus Südtirol in Bolzano/Bozen and the Swiss Text Corpus (CHTK) in Basel/Zurich.

The project aims to provide access to a balanced corpus of 20th-century Standard German, with particular attention to regional variation. At present, the corpus contains 20 million running words from DWDS, 4.1 million words from AAC, 1.7 million words from Korpus Südtirol and 20 million words from CHTK.

Corpus query system

A special technical feature of Corpus C4 is its distributed query system. Each sub-project hosts its data on its own servers; only at query time are the results collected from the different servers and combined on a single page. Hence, Corpus C4 as a whole is a virtual corpus. Technically, this is achieved mainly by using the functions of the linguistic search engine DDC, developed by the Berlin sub-project DWDS.
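
The general idea of such a distributed query can be illustrated with a small sketch. It does not use DDC or its actual protocol, which are not described here. Instead, four in-memory lists stand in for the separately hosted sub-corpora, and the function names (query_sub_corpus, query_all), the Hit record and the example sentences are all assumptions made for illustration. The sketch only shows the pattern: send the same query to every sub-corpus in parallel, then merge the partial results into one list at query time.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    sub_corpus: str  # sub-project that returned the hit
    year: int        # publication year of the source text
    text: str        # the matching sentence


# Hypothetical stand-ins for the four independently hosted sub-corpora.
# In a real deployment each entry would live on a different server.
SUB_CORPORA = {
    "DWDS": [(1912, "Das Haus steht am See."), (1987, "Ein Haus am Hang.")],
    "AAC": [(1905, "Im Haus war es still.")],
    "Korpus Südtirol": [(1978, "Der Hof liegt im Tal.")],
    "CHTK": [(1933, "Das Haus am Rhein."), (1991, "Kein Haus ohne Garten.")],
}


def query_sub_corpus(name: str, term: str) -> list[Hit]:
    """Stand-in for a network request to one sub-project's server."""
    return [Hit(name, year, text) for year, text in SUB_CORPORA[name] if term in text]


def query_all(term: str) -> list[Hit]:
    """Send the query to every sub-corpus in parallel and merge the results."""
    names = list(SUB_CORPORA)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        per_corpus = pool.map(lambda name: query_sub_corpus(name, term), names)
        merged = [hit for hits in per_corpus for hit in hits]
    # The merged list exists only at query time: a "virtual" corpus view.
    return sorted(merged, key=lambda hit: (hit.year, hit.sub_corpus))


if __name__ == "__main__":
    for hit in query_all("Haus"):
        print(f"{hit.year}  {hit.sub_corpus:<16} {hit.text}")
```

Because no sub-corpus is copied to a central store, each partner keeps full control over its own data, and the combined result list is rebuilt for every query.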
